The Q function is a central object in the EM algorithm; its full name is the "expected complete-data log-likelihood". Its role is this: the E-step computes the expectation of the complete-data log-likelihood, so that the M-step can maximize it to obtain the next maximum-likelihood estimate of the model parameters.
In a previous article I walked through the derivation of the three-coin model from Li Hang's *Statistical Learning Methods* (《统计学习方法》), where the EM algorithm was applied by expanding a Q function directly; for reasons of space, that article did not include the proof. As a supplement to that article and to the derivation on page 179, Chapter 9 of *Statistical Learning Methods*, this article derives in detail where the Q function comes from.
We already know the likelihood function of the parameter $\theta$:

$$
\begin{aligned}
L(\theta) &= \log P(Y \mid \theta) = \log \frac{P(Y, \theta)}{P(\theta)} = \log \frac{\sum_Z P(Y, \theta, Z)}{P(\theta)} \\
&= \log \sum_Z \frac{P(Y, \theta, Z)}{P(\theta)} = \log \sum_Z P(Y, Z \mid \theta) \\
&= \log \sum_Z \frac{P(Y, Z, \theta)}{P(Z, \theta)} \cdot \frac{P(Z, \theta)}{P(\theta)}
\end{aligned}
$$
That is,

$$L(\theta)=\log \sum_Z P(Y \mid Z, \theta) \cdot P(Z \mid \theta)$$
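The identity just derived — the observed-data likelihood written as a sum over the latent variable $Z$ — can be checked on a tiny made-up discrete model (all the numbers below are arbitrary):

```python
import math

# Made-up discrete model: P(Z | theta) and P(Y | Z, theta) as small tables.
p_z = {0: 0.6, 1: 0.4}               # P(Z | theta)
p_y_given_z = {0: {0: 0.8, 1: 0.2},  # P(Y | Z=0, theta)
               1: {0: 0.1, 1: 0.9}}  # P(Y | Z=1, theta)

# Marginalize out the latent variable:
# P(Y | theta) = sum_Z P(Y | Z, theta) * P(Z | theta)
y_obs = 1
marginal = sum(p_y_given_z[z][y_obs] * p_z[z] for z in p_z)

log_L = math.log(marginal)  # L(theta) = log P(Y | theta)
print(marginal, log_L)
```

Here the marginal is $0.2 \cdot 0.6 + 0.9 \cdot 0.4 = 0.48$, and $L(\theta)$ is simply its logarithm.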
Suppose the parameter at the $i$-th iteration is $\theta^{(i)}$. After the update we want

$$L(\theta)>L\left(\theta^{(i)}\right)$$
so we take the difference:

$$L(\theta)-L\left(\theta^{(i)}\right)=\log \sum_Z P(Y \mid Z, \theta) \cdot P(Z \mid \theta)-\log P\left(Y \mid \theta^{(i)}\right)$$
The first term can be turned into a fraction by multiplying and dividing by $P\left(Z \mid Y, \theta^{(i)}\right)$:

$$L(\theta)-L\left(\theta^{(i)}\right)=\log \left(\sum_Z P\left(Z \mid Y, \theta^{(i)}\right) \cdot \frac{P(Y \mid Z, \theta) \cdot P(Z \mid \theta)}{P\left(Z \mid Y, \theta^{(i)}\right)}\right)-\log P\left(Y \mid \theta^{(i)}\right)$$
Using the fact that $\sum_Z P\left(Z \mid Y, \theta^{(i)}\right)=1$, we can multiply the second term by this sum without changing anything:

$$L(\theta)-L\left(\theta^{(i)}\right)=\log \left(\sum_Z P\left(Z \mid Y, \theta^{(i)}\right) \cdot \frac{P(Y \mid Z, \theta) \cdot P(Z \mid \theta)}{P\left(Z \mid Y, \theta^{(i)}\right)}\right)-\log P\left(Y \mid \theta^{(i)}\right) \cdot \sum_Z P\left(Z \mid Y, \theta^{(i)}\right)$$
By Jensen's inequality

$$\log \sum_{j}\lambda_j \cdot y_j \geqslant \sum_j \lambda_j \cdot \log y_j, \qquad \lambda_j \geqslant 0,\ \sum_j \lambda_j = 1,$$
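Jensen's inequality for the concave $\log$ can be spot-checked numerically; the weights and values below are arbitrary:

```python
import math
import random

random.seed(0)

# Arbitrary convex weights lambda_j (non-negative, summing to 1) and
# arbitrary positive values y_j.
lam = [random.random() for _ in range(5)]
total = sum(lam)
lam = [l / total for l in lam]
y = [random.uniform(0.1, 10.0) for _ in range(5)]

lhs = math.log(sum(l * v for l, v in zip(lam, y)))  # log of the average
rhs = sum(l * math.log(v) for l, v in zip(lam, y))  # average of the logs

# log is concave, so the log of the average dominates the average of the logs
assert lhs >= rhs
print(lhs - rhs)  # a non-negative gap
```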
applied with $\lambda_Z=P\left(Z \mid Y, \theta^{(i)}\right)$, we obtain

$$\geqslant \sum_Z P\left(Z \mid Y, \theta^{(i)}\right) \cdot \log \frac{P(Y \mid Z, \theta) \cdot P(Z \mid \theta)}{P\left(Z \mid Y, \theta^{(i)}\right)}-\log P\left(Y \mid \theta^{(i)}\right) \cdot \sum_Z P\left(Z \mid Y, \theta^{(i)}\right)$$
$$=\sum_Z P\left(Z \mid Y, \theta^{(i)}\right) \cdot\left[\log \frac{P(Y \mid Z, \theta) \cdot P(Z \mid \theta)}{P\left(Z \mid Y, \theta^{(i)}\right)}-\log P\left(Y \mid \theta^{(i)}\right)\right]$$
$$=\sum_Z P\left(Z \mid Y, \theta^{(i)}\right) \cdot \log \frac{P(Y \mid Z, \theta) \cdot P(Z \mid \theta)}{P\left(Z \mid Y, \theta^{(i)}\right) \cdot P\left(Y \mid \theta^{(i)}\right)}$$
So at this point

$$L(\theta)-L\left(\theta^{(i)}\right) \geqslant \sum_Z P\left(Z \mid Y, \theta^{(i)}\right) \cdot \log \frac{P(Y \mid Z, \theta) \cdot P(Z \mid \theta)}{P\left(Z \mid Y, \theta^{(i)}\right) \cdot P\left(Y \mid \theta^{(i)}\right)}$$
i.e.

$$L(\theta) \geqslant L\left(\theta^{(i)}\right)+\sum_Z P\left(Z \mid Y, \theta^{(i)}\right) \cdot \log \frac{P(Y \mid Z, \theta) \cdot P(Z \mid \theta)}{P\left(Z \mid Y, \theta^{(i)}\right) \cdot P\left(Y \mid \theta^{(i)}\right)}$$
Define

$$B\left(\theta, \theta^{(i)}\right)=L\left(\theta^{(i)}\right)+\sum_Z P\left(Z \mid Y, \theta^{(i)}\right) \cdot \log \frac{P(Y \mid Z, \theta) \cdot P(Z \mid \theta)}{P\left(Z \mid Y, \theta^{(i)}\right) \cdot P\left(Y \mid \theta^{(i)}\right)}$$
Then $B\left(\theta, \theta^{(i)}\right)$ is a lower bound on $L(\theta)$, and the bound is tight at $\theta=\theta^{(i)}$: substituting $\theta=\theta^{(i)}$ makes the numerator $P\left(Y \mid Z, \theta^{(i)}\right) P\left(Z \mid \theta^{(i)}\right)$ equal to the denominator $P\left(Z \mid Y, \theta^{(i)}\right) P\left(Y \mid \theta^{(i)}\right)$ (both equal the joint $P\left(Y, Z \mid \theta^{(i)}\right)$), so $B\left(\theta^{(i)}, \theta^{(i)}\right)=L\left(\theta^{(i)}\right)$. Therefore any $\theta$ that increases $B\left(\theta, \theta^{(i)}\right)$ also increases $L(\theta)$, and the $\theta$ maximizing $B\left(\theta, \theta^{(i)}\right)$ gives the largest guaranteed improvement in $L(\theta)$.
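This lower-bound property is easy to verify on a small example. The sketch below uses a hypothetical two-component Bernoulli mixture (component head probabilities fixed at 0.7 and 0.3, mixing weight $\theta$, a single observed head) and checks that $B(\theta, \theta^{(i)}) \leqslant L(\theta)$ everywhere, with equality at $\theta = \theta^{(i)}$:

```python
import math

# Hypothetical two-component model: a hidden coin Z picks component 1
# (head prob 0.7) or component 0 (head prob 0.3); theta = P(Z=1) is the
# mixing weight, and the observation Y is a single head.
P_HEADS = {1: 0.7, 0: 0.3}

def L(theta):
    """Observed-data log-likelihood log P(Y=heads | theta)."""
    return math.log(theta * P_HEADS[1] + (1 - theta) * P_HEADS[0])

def B(theta, theta_i):
    """The lower bound B(theta, theta^(i)) from the derivation above."""
    marg_i = theta_i * P_HEADS[1] + (1 - theta_i) * P_HEADS[0]
    post = {1: theta_i * P_HEADS[1] / marg_i,          # P(Z | Y, theta^(i))
            0: (1 - theta_i) * P_HEADS[0] / marg_i}
    prior = {1: theta, 0: 1 - theta}                   # P(Z | theta)
    total = L(theta_i)
    for z in (0, 1):
        total += post[z] * math.log(P_HEADS[z] * prior[z] / (post[z] * marg_i))
    return total

theta_i = 0.4
for theta in (0.1, 0.3, 0.4, 0.6, 0.9):
    assert B(theta, theta_i) <= L(theta) + 1e-12      # B never exceeds L
assert abs(B(theta_i, theta_i) - L(theta_i)) < 1e-12  # equality at theta^(i)
print("lower bound verified")
```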
So our goal becomes

$$\theta^{(i+1)}=\underset{\theta}{\operatorname{argmax}}\, B\left(\theta, \theta^{(i)}\right)$$
that is,

$$\theta^{(i+1)}=\underset{\theta}{\operatorname{argmax}} \left[L\left(\theta^{(i)}\right)+\sum_Z P\left(Z \mid Y, \theta^{(i)}\right) \cdot \log \frac{P(Y \mid Z, \theta) \cdot P(Z \mid \theta)}{P\left(Z \mid Y, \theta^{(i)}\right) \cdot P\left(Y \mid \theta^{(i)}\right)}\right]$$
Since they do not depend on $\theta$, the three quantities $L\left(\theta^{(i)}\right)$, $P\left(Z \mid Y, \theta^{(i)}\right)$, and $P\left(Z \mid Y, \theta^{(i)}\right) \cdot P\left(Y \mid \theta^{(i)}\right)$ can all be treated as constants.
Moreover, since $P\left(Z \mid Y, \theta^{(i)}\right) \cdot P\left(Y \mid \theta^{(i)}\right)>0$, dropping this factor from the denominator does not change where the maximum is attained. Note, however, that the leading $P\left(Z \mid Y, \theta^{(i)}\right)$ cannot be dropped: it is the coefficient of each term inside the $\sum$, and different terms carry different coefficients.
So

$$\theta^{(i+1)}=\underset{\theta}{\operatorname{argmax}}\left[\sum_Z P\left(Z \mid Y, \theta^{(i)}\right) \cdot \log \big(P(Y \mid Z, \theta) \cdot P(Z \mid \theta)\big)\right]$$
Let

$$Q\left(\theta, \theta^{(i)}\right)=\sum_Z P\left(Z \mid Y, \theta^{(i)}\right) \cdot \log \big(P(Y \mid Z, \theta) \cdot P(Z \mid \theta)\big)$$

Since $P(Y \mid Z, \theta) \cdot P(Z \mid \theta)=P(Y, Z \mid \theta)$, this is exactly the expectation of the complete-data log-likelihood $\log P(Y, Z \mid \theta)$ under the posterior $P\left(Z \mid Y, \theta^{(i)}\right)$.
so that

$$\theta^{(i+1)}=\underset{\theta}{\operatorname{argmax}}\, Q\left(\theta, \theta^{(i)}\right)$$
where $Q\left(\theta, \theta^{(i)}\right)$ is the so-called $Q$ function.
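Putting it all together: each EM iteration computes the posterior $P\left(Z \mid Y, \theta^{(i)}\right)$ (E-step) and then maximizes $Q\left(\theta, \theta^{(i)}\right)$ over $\theta$ (M-step). The sketch below runs this loop on a hypothetical two-coin mixture with fixed component biases and an unknown mixing weight $\theta$; for this particular model the M-step has the closed form $\theta^{(i+1)} = $ average posterior responsibility, since only the $\log P(Z \mid \theta)$ part of $Q$ depends on $\theta$:

```python
import math

# Minimal EM sketch (hypothetical setup): each flip y in {1 (heads), 0 (tails)}
# comes from a hidden coin Z with P(Z=1) = theta; the two coins' head
# probabilities are fixed at 0.7 and 0.3, and only theta is estimated.
BIAS = {1: 0.7, 0: 0.3}
data = [1, 1, 0, 1, 0, 1, 1, 0]  # made-up observations

def lik(y, z):
    """P(y | z, theta): probability of the observed flip under coin z."""
    return BIAS[z] if y == 1 else 1 - BIAS[z]

theta = 0.5  # initial guess theta^(0)
for _ in range(300):
    # E-step: posterior responsibilities P(Z=1 | y, theta^(i)) per flip
    post = [theta * lik(y, 1) / (theta * lik(y, 1) + (1 - theta) * lik(y, 0))
            for y in data]
    # M-step: argmax_theta Q(theta, theta^(i)); only the log P(Z | theta)
    # term depends on theta, and its maximizer is the average responsibility.
    theta = sum(post) / len(post)

print(round(theta, 4))  # converges to 0.8125 for this data
```

As a sanity check: under this model the marginal head probability is $0.3+0.4\theta$, and with 5 heads in 8 flips the maximum-likelihood estimate solves $0.3+0.4\theta=5/8$, giving $\theta=0.8125$, which the loop recovers.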
[1] Derivation of the three-coin model parameters via the EM algorithm (EM算法求解三硬币模型参数推导)