@inproceedings{mukherjee-awadallah-2020-ust,
  title     = {Uncertainty-aware Self-training for Few-shot Text Classification},
  author    = {Subhabrata Mukherjee and Ahmed Hassan Awadallah},
  booktitle = {NeurIPS},
  year      = {2020},
  pages     = {21199--21212}
}
Notice:
On arXiv this paper is titled "Uncertainty-aware Self-training for Text Classification with Few Labels"; presumably the title was changed after publication.
Recent success of pre-trained language models crucially hinges on fine-tuning them on large amounts of labeled data for the downstream task, that are typically expensive to acquire or difficult to access for many applications. We study self-training as one of the earliest semi-supervised learning approaches to reduce the annotation bottleneck by making use of large-scale unlabeled data for the target task. Standard self-training mechanism randomly samples instances from the unlabeled pool to generate pseudo-labels and augment labeled data. We propose an approach to improve self-training by incorporating uncertainty estimates of the underlying neural network leveraging recent advances in Bayesian deep learning. Specifically, we propose (i) acquisition functions to select instances from the unlabeled pool leveraging Monte Carlo (MC) Dropout, and (ii) learning mechanism leveraging model confidence for self-training. As an application, we focus on text classification with five benchmark datasets. We show our methods leveraging only 20-30 labeled samples per class for each task for training and for validation perform within 3% of fully supervised pre-trained language models fine-tuned on thousands of labels with an aggregate accuracy of 91% and improvement of up to 12% over baselines.
Combining Uncertainty and Self-training
Self-training process:
$$
\begin{split}
& \min_W \mathbb{E}_{x_l, y_l \in D_l}\big[-\log p(y_l \mid x_l; W)\big] \\
& + \lambda\, \mathbb{E}_{x_u \in S_u,\, S_u \subset D_u}\, \mathbb{E}_{y \sim p(y \mid x_u; W^*)}\big[-\log p(y \mid x_u; W)\big]
\end{split}\tag{1}
$$
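To make Eq. (1) concrete, here is a minimal sketch of one step of standard self-training in PyTorch: supervised cross-entropy on the labeled batch plus cross-entropy against hard pseudo-labels produced by the current model on an unlabeled batch. The names `model`, `x_lab`, `y_lab`, `x_unlab`, and `lam` are illustrative assumptions, not taken from the paper's code.

```python
import torch
import torch.nn.functional as F

def self_training_loss(model, x_lab, y_lab, x_unlab, lam=1.0):
    # Supervised cross-entropy on the labeled set D_l.
    sup_loss = F.cross_entropy(model(x_lab), y_lab)

    # Teacher pass: generate hard pseudo-labels on the sampled unlabeled
    # batch S_u with the current (frozen) weights W*.
    with torch.no_grad():
        pseudo = model(x_unlab).argmax(dim=-1)

    # Student pass: fit the pseudo-labels with the trainable weights W.
    unsup_loss = F.cross_entropy(model(x_unlab), pseudo)

    return sup_loss + lam * unsup_loss
```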
Uncertainty-aware self-training process:
$$
\begin{split}
& \min_{W, \theta} \mathbb{E}_{x_l, y_l \in D_l}\big[-\log p(y_l \mid x_l; W)\big] \\
& + \lambda\, \mathbb{E}_{x_u \in S_u,\, S_u \subset D_u}\, \mathbb{E}_{\widetilde{W} \sim q_\theta(W^*)}\, \mathbb{E}_{y \sim p(y \mid f^{\widetilde{W}}(x_u))}\big[-\log p(y \mid f^{W}(x_u))\big]
\end{split}\tag{2}
$$
where $q_\theta(W^*)$ is the dropout distribution, a scheme for estimating model uncertainty also known as Monte-Carlo (MC) Dropout.
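A minimal sketch of MC Dropout in PyTorch: dropout layers are kept active at prediction time, and T stochastic forward passes (each corresponding to one $\widetilde{W} \sim q_\theta(W^*)$) are aggregated into a mean prediction and a predictive variance. The function name `mc_dropout_predict` and the argument `n_samples` are hypothetical, for illustration only.

```python
import torch

def mc_dropout_predict(model, x, n_samples=10):
    model.eval()
    # Re-enable dropout layers only, so BatchNorm etc. stay in eval mode.
    for m in model.modules():
        if isinstance(m, torch.nn.Dropout):
            m.train()

    with torch.no_grad():
        # T stochastic forward passes, each with a different dropout mask,
        # i.e. a different W~ drawn from the dropout distribution q_theta(W*).
        probs = torch.stack(
            [torch.softmax(model(x), dim=-1) for _ in range(n_samples)]
        )  # shape: (T, batch, num_classes)

    mean_prob = probs.mean(dim=0)  # averaged prediction, used for pseudo-labels
    var_prob = probs.var(dim=0)    # per-class predictive variance
    return mean_prob, var_prob
```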
To account for the teacher's uncertainty about the pseudo-labels in terms of their predictive variance:
$$
\begin{split}
& \min_{W, \theta} \mathbb{E}_{x_l, y_l \in D_l}\big[-\log p(y_l \mid x_l; W)\big] \\
& + \lambda\, \mathbb{E}_{x_u \in S_u,\, S_u \subset D_u}\, \mathbb{E}_{\widetilde{W} \sim q_\theta(W^*)}\, \mathbb{E}_{y \sim p(y \mid f^{\widetilde{W}}(x_u))}\big[\log p(y \mid f^{W}(x_u)) \cdot \log \mathrm{Var}(y)\big]
\end{split}
$$
where:
$$D(X) = \mathrm{Var}(X) = E(X^2) - [E(X)]^2$$
Note that in the code implementation, $\sigma^2$ denotes the noise inherent in the data itself (aleatoric uncertainty); this is not part of the confidence measure and is in fact not modeled at all.
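Building on the MC Dropout sketch above, the following shows one way the per-sample student loss could be weighted by the teacher's predictive variance, in the spirit of the equation above. The exact weighting used here (log of the clamped inverse variance of the pseudo-labeled class) is an illustrative assumption, not the paper's official implementation.

```python
import torch
import torch.nn.functional as F

def uncertainty_weighted_loss(student_logits, mean_prob, var_prob):
    # Hard pseudo-labels from the MC-Dropout mean prediction.
    pseudo = mean_prob.argmax(dim=-1)  # (batch,)

    # Per-example student cross-entropy w.r.t. the pseudo-labels.
    per_example = F.cross_entropy(student_logits, pseudo, reduction="none")

    # Predictive variance of the pseudo-labeled class over the T dropout
    # samples, Var(y) = E[y^2] - E[y]^2; low variance => confident teacher.
    var_y = var_prob.gather(1, pseudo.unsqueeze(1)).squeeze(1)

    # Down-weight examples the teacher is uncertain about: log(1/Var) grows
    # as the variance shrinks; clamp to keep the weights finite and positive.
    weight = torch.log(1.0 / var_y.clamp(min=1e-8)).clamp(min=0.0)

    return (weight * per_example).mean()
```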
Pseudocode:
Looking back at the paper's pseudocode with the above in mind, everything becomes quite clear. A few more points worth making:
Work from a major lab tends to be high quality. As I read it, the core of this paper is folding uncertainty (mainly model uncertainty) into self-training; the mathematical notation is rich and well worth learning from.