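While studying the KL divergence, I noticed that Python offers several different implementations of it, which left me confused at first. After digging through the documentation I finally sorted it out. This post records what I learned, from both the code side and the formula side, about stats.entropy, special.rel_entr, special.kl_div, F.kl_div, and nn.KLDivLoss.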
The KL divergence (Kullback-Leibler divergence) measures how much one probability distribution differs from another, and it is a classic loss function. Let $P$ be the true distribution and $Q$ the approximating distribution. For discrete random variables it is defined as:

$$D_{KL}(P \| Q) = \sum_i P(i) \ln\left(\frac{P(i)}{Q(i)}\right)$$

For continuous random variables it is defined as:
$$D_{KL}(P \| Q) = \int_{-\infty}^{\infty} p(x) \ln\left(\frac{p(x)}{q(x)}\right) dx$$

where $p$ and $q$ are the densities of $P$ and $Q$. The KL divergence assumes that each input distribution sums to 1, so make sure this holds before computing it. Note also that the KL divergence is not symmetric: in general $D_{KL}(P \| Q)$ is not equal to $D_{KL}(Q \| P)$.
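As a quick illustration of the discrete formula and its asymmetry, here is a minimal NumPy sketch. The helper name kl_divergence and the two toy distributions are assumptions made up for this example, and the sketch assumes q is strictly positive wherever p is positive.

```python
import numpy as np


def kl_divergence(p, q):
    """Discrete D_KL(P || Q) in nats (hypothetical helper for illustration)."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    # Normalize so that each distribution sums to 1, as the definition requires.
    p = p / p.sum()
    q = q / q.sum()
    # Terms with p_i == 0 contribute 0 by convention, so skip them.
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))


p = [0.5, 0.5]
q = [0.9, 0.1]
kl_pq = kl_divergence(p, q)
kl_qp = kl_divergence(q, p)
print("D_KL(P || Q):", kl_pq)  # about 0.5108
print("D_KL(Q || P):", kl_qp)  # about 0.3681
```

Swapping the arguments changes the result, which is the asymmetry described above.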
stats.entropy (see the official documentation) computes either the Shannon entropy or the relative entropy, i.e. the KL divergence. It takes four parameters: the distributions pk and qk, the logarithm base (default e), and the axis along which to compute (default 0):

    scipy.stats.entropy(pk, qk=None, base=None, axis=0)

When qk is given, the formula is:

$$\mathrm{entropy}(x, y) = \frac{\sum x \log(x / y)}{\log(\mathrm{base})}$$

In Python:
    import numpy as np
    import scipy.stats

    p = [0.1, 0.2, 0.3, 0.6]
    q = [0.2, 0.2, 0.2, 0.2]

    out0 = scipy.stats.entropy(p)
    out1 = scipy.stats.entropy(p, q)
    out2 = scipy.stats.entropy(q, p)
    out3 = scipy.stats.entropy(p, q, base=2)
    print("Shannon entropy of p:", out0)
    print("Relative entropy of p and q:", out1)
    print("Asymmetry check:", out2)
    print("base parameter:", out3)
    print("base parameter check:", out1 / np.log(2))

    # Normalize so that each distribution sums to 1
    pp = np.array(p) / np.sum(p)
    qq = np.array(q) / np.sum(q)

    # Shannon entropy by hand: -sum(p * log(p))
    zz = []
    for i in range(4):
        temp = -pp[i] * np.log(pp[i])
        zz.append(temp)
    print("Manual Shannon entropy:", np.sum(zz))

    # Relative entropy by hand: sum(p * log(p / q))
    xx = []
    for i in range(4):
        temp = pp[i] * np.log(pp[i] / qq[i])
        xx.append(temp)
    print("Manual relative entropy:", np.sum(xx))

Note that the manual computation must first normalize p and q so that their elements sum to 1 (here p sums to 1.2 and q to 0.8), whereas stats.entropy does this step automatically. The results show that stats.entropy agrees with the manual computation and that the KL divergence is not symmetric:

    Shannon entropy of p: 1.1988493129136213
    Relative entropy of p and q: 0.18744504820626923
    Asymmetry check: 0.20273255405408233
    base parameter: 0.27042604148637733
    base parameter check: 0.27042604148637733
    Manual Shannon entropy: 1.1988493129136213
    Manual relative entropy: 0.18744504820626923
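The axis parameter mentioned above is easiest to see on a batch of distributions. The following is a small sketch, assuming a reasonably recent SciPy in which stats.entropy accepts axis; the second row of each array is made-up data for illustration.

```python
import numpy as np
import scipy.stats

# Each row is one (unnormalized) distribution; the second row is invented.
P = np.array([[0.1, 0.2, 0.3, 0.6],
              [1.0, 1.0, 1.0, 1.0]])
Q = np.array([[0.2, 0.2, 0.2, 0.2],
              [4.0, 3.0, 2.0, 1.0]])

# axis=1 normalizes and reduces along each row, giving one KL value per row.
kl_rows = scipy.stats.entropy(P, Q, axis=1)

# Reference: the same computation written out by hand.
Pn = P / P.sum(axis=1, keepdims=True)
Qn = Q / Q.sum(axis=1, keepdims=True)
kl_manual = np.sum(Pn * np.log(Pn / Qn), axis=1)

print("row-wise KL (scipy): ", kl_rows)
print("row-wise KL (manual):", kl_manual)
```

With the default axis=0 the same call would reduce over columns instead, which is usually not what you want when each row is a distribution.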
special.rel_entr (see the official documentation) can also be used to compute the KL divergence. It is an element-wise function defined as:

$$\mathrm{rel\_entr}(x, y) = \begin{cases} x \log(x / y) & x > 0,\ y > 0 \\ 0 & x = 0,\ y \ge 0 \\ \infty & \text{otherwise} \end{cases}$$

In Python:
    import numpy as np
    import scipy.special
    import scipy.stats  # needed for scipy.stats.entropy below

    p = [0.1, 0.2, 0.3, 0.6]
    q = [0.2, 0.2, 0.2, 0.2]

    out1 = scipy.stats.entropy(p, q)
    print("stats.entropy:", out1)

    # Normalize so that each distribution sums to 1
    pp = np.array(p) / np.sum(p)
    qq = np.array(q) / np.sum(q)

    # rel_entr returns one term per element; sum them to get the KL divergence
    out2 = scipy.special.rel_entr(pp, qq)
    print("special.rel_entr:", np.sum(out2))
    print("element-wise output:", out2)

Note that special.rel_entr takes the normalized distributions, not the raw ones, and returns element-wise values. Once summed, its result agrees with stats.entropy:

    stats.entropy: 0.18744504820626923
    special.rel_entr: 0.18744504820626923
    element-wise output: [-9.15510241e-02 -6.75775180e-02 -5.55111512e-17 3.46573590e-01]
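The three branches of the rel_entr definition can be checked one at a time. This is a minimal sketch with scalar inputs picked purely for illustration.

```python
import numpy as np
import scipy.special

# Branch 1: x > 0 and y > 0 gives x * log(x / y).
regular = scipy.special.rel_entr(0.5, 0.25)
# Branch 2: x == 0 and y >= 0 gives 0, so empty bins in p add nothing.
zero_p = scipy.special.rel_entr(0.0, 0.25)
# Branch 3: everything else, e.g. x > 0 with y == 0, gives infinity.
zero_q = scipy.special.rel_entr(0.5, 0.0)

print("x > 0, y > 0:", regular)
print("x == 0:      ", zero_p)
print("y == 0:      ", zero_q)
```

The third element in the run above is about -5.55e-17 rather than exactly 0. This is most likely floating-point rounding in the normalization step, since 0.3 / 1.2 should equal 0.25 mathematically.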
special.kl_div (see the official documentation) sounds, judging by its name, like the most authentic KL divergence, but it is actually defined as:

$$\mathrm{kl\_div}(x, y) = \begin{cases} x \log(x / y) - x + y & x > 0,\ y > 0 \\ y & x = 0,\ y \ge 0 \\ \infty & \text{otherwise} \end{cases}$$

In Python:
    import numpy as np
    import scipy.special
    import scipy.stats

    p = [0.1, 0.2, 0.3, 0.6]
    q = [0.2, 0.2, 0.2, 0.2]

    # Normalize so that each distribution sums to 1
    pp = np.array(p) / np.sum(p)
    qq = np.array(q) / np.sum(q)

    out1 = scipy.stats.entropy(p, q)
    print("stats.entropy:", out1)
    out2 = scipy.special.rel_entr(pp, qq)
    print("special.rel_entr:", np.sum(out2))
    print("special.rel_entr elements:", out2)
    out3 = scipy.special.kl_div(pp, qq)
    print("special.kl_div:", np.sum(out3))
    print("special.kl_div elements:", out3)

    # kl_div by hand: x * log(x / y) - x + y for each element
    xx = []
    for i in range(4):
        temp = (pp[i] * np.log(pp[i] / qq[i])) - pp[i] + qq[i]
        xx.append(temp)
    print("Manual special.kl_div:", np.sum(xx))
    print("Manual special.kl_div elements:", xx)

Like special.rel_entr, special.kl_div expects normalized distributions and returns element-wise values. After summation, special.kl_div, special.rel_entr, and stats.entropy agree. The individual elements differ from those of special.rel_entr because each term carries an extra $-x + y$. Since both inputs are normalized, these extras sum to $-1 + 1 = 0$, so the totals coincide up to floating-point rounding:

    stats.entropy: 0.18744504820626923
    special.rel_entr: 0.18744504820626923
    special.rel_entr elements: [-9.15510241e-02 -6.75775180e-02 -5.55111512e-17 3.46573590e-01]
    special.kl_div: 0.1874450482062694
    special.kl_div elements: [0.07511564 0.01575582 0. 0.09657359]
    Manual special.kl_div: 0.1874450482062694
    Manual special.kl_div elements: [0.07511564261099085, 0.015755815315305954, 0.0, 0.09657359027997259]
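To see why normalization matters here, the sketch below feeds the raw, unnormalized p and q to both functions. Under that assumption the extra $-x + y$ terms no longer cancel, and the two sums should differ by sum(q) - sum(p).

```python
import numpy as np
import scipy.special

# Deliberately NOT normalized: p sums to 1.2 and q sums to 0.8.
p = np.array([0.1, 0.2, 0.3, 0.6])
q = np.array([0.2, 0.2, 0.2, 0.2])

rel_sum = np.sum(scipy.special.rel_entr(p, q))
kl_sum = np.sum(scipy.special.kl_div(p, q))

# kl_div adds (-x + y) to every term, so the totals differ by sum(q) - sum(p).
gap = kl_sum - rel_sum
print("sum(rel_entr):", rel_sum)
print("sum(kl_div):  ", kl_sum)
print("difference:   ", gap, "expected:", q.sum() - p.sum())
```

Neither of these two sums is the KL divergence of the normalized distributions, which is one more reason to normalize before calling either function.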
F.kl_div (see the official documentation) and nn.KLDivLoss (see the official documentation) are the KL divergence loss functions implemented in torch. Let $y_{\text{pred}}$ be the model's predicted output and $y_{\text{true}}$ the true distribution. The pointwise loss is:

$$L(y_{\text{pred}},\ y_{\text{true}}) = y_{\text{true}} \cdot \log \frac{y_{\text{true}}}{y_{\text{pred}}} = y_{\text{true}} \cdot (\log y_{\text{true}} - \log y_{\text{pred}})$$

The actual implementation differs slightly from this formula, depending on the parameter log_target.
When log_target=False, the formula is:

$$L(y_{\text{pred}},\ y_{\text{true}}) = y_{\text{true}} \cdot (\log y_{\text{true}} - y_{\text{pred}})$$

When log_target=True, the formula is:
$$L(y_{\text{pred}},\ y_{\text{true}}) = e^{y_{\text{true}}} \cdot (y_{\text{true}} - y_{\text{pred}})$$

Note that the first argument of F.kl_div and nn.KLDivLoss is a distribution of log-probabilities ($y_{\text{pred}}$), while the second ($y_{\text{true}}$) is by default a distribution of probabilities. Because $y_{\text{pred}}$ is already in log space, with log_target=False no logarithm is applied to $y_{\text{pred}}$, only to $y_{\text{true}}$. With log_target=True the target is expected to be in log space as well, so neither term is logged and the weight becomes $e^{y_{\text{true}}}$.

In Python:
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    x = torch.tensor([0.1, 0.2, 0.3, 0.6])
    y = torch.tensor([0.2, 0.2, 0.2, 0.2])
    logp_x = F.log_softmax(x, dim=-1)  # same as torch.log(F.softmax(x, dim=-1))
    p_y = F.softmax(y, dim=-1)  # [0.25, 0.25, 0.25, 0.25]

    # Functional form, log_target=False (the default)
    kl_sum = F.kl_div(logp_x, p_y, reduction='sum')
    kl_mean = F.kl_div(logp_x, p_y, reduction='batchmean')
    print(kl_sum, kl_mean)

    # log_target=True: p_y is passed as-is only to check the formula;
    # a real target would have to be log-probabilities
    kl_sum_log_target = F.kl_div(logp_x, p_y, reduction='sum', log_target=True)
    kl_mean_log_target = F.kl_div(logp_x, p_y, reduction='batchmean', log_target=True)
    print(kl_sum_log_target, kl_mean_log_target)

    # Module form
    kl_loss_sum = nn.KLDivLoss(reduction="sum")
    output1 = kl_loss_sum(logp_x, p_y)
    kl_loss_mean = nn.KLDivLoss(reduction="batchmean")
    output2 = kl_loss_mean(logp_x, p_y)  # logp_x: pred, p_y: target/true
    print(output1, output2)

    # Manual computation for log_target=False
    xx = []
    for i in range(4):
        temp = p_y[i] * (p_y[i].log() - logp_x[i])
        xx.append(temp)
    print("log_target=False, manual KL_loss:", sum(xx), sum(xx) / 4)

    # Manual computation for log_target=True
    yy = []
    for i in range(4):
        temp = p_y[i].exp() * (p_y[i] - logp_x[i])
        yy.append(temp)
    print("log_target=True, manual KL_loss:", sum(yy), sum(yy) / 4)

The manual computation matches the output of F.kl_div and nn.KLDivLoss. Two caveats apply. First, the log_target=True calls pass the probabilities p_y rather than log-probabilities, so 8.4976 confirms the formula but is not a valid KL divergence. Second, reduction='batchmean' divides the sum by the size of the first dimension, which is 4 here because the tensors are one-dimensional:

    tensor(0.0182) tensor(0.0045)
    tensor(8.4976) tensor(2.1244)
    tensor(0.0182) tensor(0.0045)
    log_target=False, manual KL_loss: tensor(0.0182) tensor(0.0045)
    log_target=True, manual KL_loss: tensor(8.4976) tensor(2.1244)
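For comparison, here is a sketch of how log_target=True is presumably meant to be used: the same target is passed once as probabilities and once as log-probabilities, and both calls should give the same KL divergence. The cross-check against scipy.stats.entropy is an addition for illustration, and the variable names are assumptions. Note the argument order: F.kl_div takes the prediction first, while stats.entropy takes the true distribution first.

```python
import scipy.stats
import torch
import torch.nn.functional as F

x = torch.tensor([0.1, 0.2, 0.3, 0.6])
y = torch.tensor([0.2, 0.2, 0.2, 0.2])
logq = F.log_softmax(x, dim=-1)  # prediction, in log space
p = F.softmax(y, dim=-1)         # target, as probabilities

# Target given as probabilities (the default, log_target=False).
kl_prob_target = F.kl_div(logq, p, reduction="sum")
# Same target given as log-probabilities, flagged with log_target=True.
kl_log_target = F.kl_div(logq, p.log(), reduction="sum", log_target=True)

# SciPy takes the "true" distribution first: entropy(p, q) = D_KL(p || q).
kl_scipy = scipy.stats.entropy(p.numpy(), logq.exp().numpy())

print(kl_prob_target.item(), kl_log_target.item(), float(kl_scipy))
```

All three values should be close to the 0.0182 reported above, up to float32 rounding.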