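Jiang Dabai | A 10,000-Character Long Read: The Transformer Explained Simply, Worth Bookmarking! (Test Code Verified to Run)

Published: January 23, 2024

This article comes from the WeChat official account "江大白" (Jiang Dabai). It is shared for academic purposes only and will be removed on request in case of infringement.

The article below originally appeared on Zhihu: 算法美食屋, author: 梁云1991.

Fair warning, this one is dense: it may be the most approachable, most hands-on, and most systematic introductory tutorial on the Transformer model you can find.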

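We will build a Transformer model from scratch in PyTorch (in Chinese the name is rendered as 变形金刚, after the Transformers robots) and train it on a fun example: adding two numbers.

The input and output look like this:

Input: "12345+54321"; output: "66666". We treat this as a machine translation task. The input is a character sequence and so is the output (seq-to-seq). This matches the input/output structure of machine translation, so a Transformer can be used.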
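To make the "translation" framing concrete, here is a minimal sketch of how one example could be turned into token ids. The `encode` helper is hypothetical and written only for illustration; the vocabulary string mirrors the one defined in the data-preparation code of this article.

```python
# Hypothetical helper for illustration; not part of the original tutorial code.
words_x = '<PAD>,1,2,3,4,5,6,7,8,9,0,<SOS>,<EOS>,+'
vocab_x = {w: i for i, w in enumerate(words_x.split(','))}

def encode(text, vocab, max_len=50):
    """Wrap text in <SOS>/<EOS>, pad with <PAD> to max_len, and map tokens to ids."""
    tokens = ['<SOS>'] + list(text) + ['<EOS>']
    tokens = (tokens + ['<PAD>'] * max_len)[:max_len]
    return [vocab[t] for t in tokens]

ids = encode("12345+54321", vocab_x)
print(ids[:14])  # ids of <SOS>, the characters, <EOS>, then padding
```

Each character becomes one token, so the model sees the sum as a sequence of symbols rather than as a number.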

References: the paper "Attention Is All You Need": https://arxiv.org/pdf/1706.03762.pdf

Harvard NLP blog: https://github.com/harvardnlp/annotated-transformer/

1. Preparing the data

import random
import numpy as np
import torch
from torch.utils.data import Dataset, DataLoader

# Define the vocabularies
words_x = '<PAD>,1,2,3,4,5,6,7,8,9,0,<SOS>,<EOS>,+'
vocab_x = {word: i for i, word in enumerate(words_x.split(','))}  # token -> id, e.g. '<PAD>': 0
vocab_xr = [k for k, v in vocab_x.items()]  # reverse lookup: id -> token, e.g. vocab_xr[0] == '<PAD>'

words_y = '<PAD>,1,2,3,4,5,6,7,8,9,0,<SOS>,<EOS>'
vocab_y = {word: i for i, word in enumerate(words_y.split(','))}
vocab_yr = [k for k, v in vocab_y.items()]  # reverse lookup: id -> token

# Two-number addition dataset
def get_data():
    # Define the set of tokens (digits)
    words = ['0', '1', '2', '3', '4', '5', '6', '7', '8', '9']

    # Probability of each digit being selected
    p = np.array([7, 5, 5, 7, 6, 5, 7, 6, 5, 7])
    p = p / p.sum()

    # Randomly sample n1 digits as s1
    n1 = random.randint(10, 20)  # an integer in [10, 20], both ends inclusive
    # replace=True allows repeated digits; p gives the probability of each digit;
    # the result is a 1-D ndarray such as array(['1', '4', '1'])
    s1 = np.random.choice(words, size=n1, replace=True, p=p)
    s1 = s1.tolist()

    # Randomly sample n2 digits as s2
    n2 = random.randint(10, 20)
    s2 = np.random.choice(words, size=n2, replace=True, p=p)
    s2 = s2.tolist()

    # x is the character-level concatenation of s1, '+', and s2
    x = s1 + ['+'] + s2

    # y is the numeric sum of s1 and s2
    y = int(''.join(s1)) + int(''.join(s2))
    y = list(str(y))

    # Add start and end symbols
    x = ['<SOS>'] + x + ['<EOS>']
    y = ['<SOS>'] + y + ['<EOS>']

    # Pad to a fixed length: append more <PAD> than needed, then truncate (a neat trick)
    x = x + ['<PAD>'] * 50
    y = y + ['<PAD>'] * 51
    x = x[:50]
    y = y[:51]

    # Encode as token ids
    token_x = [vocab_x[i] for i in x]  # vocab_x maps token -> id, e.g. '<PAD>': 0
    token_y = [vocab_y[i] for i in y]

    # Convert to tensors
    tensor_x = torch.LongTensor(token_x)  # LongTensor stores 64-bit integers
    tensor_y = torch.LongTensor(token_y)  # e.g. tensor([1, 2, 3, 4, 5])
    return tensor_x, tensor_y

def show_data(tensor_x, tensor_y) -> tuple:
    "Decode two id tensors back into strings; returns (words_x, words_y)."
    words_x = "".join([vocab_xr[i] for i in tensor_x.tolist()])  # vocab_xr maps id -> token, e.g. '<PAD>'
    words_y = "".join([vocab_yr[i] for i in tensor_y.tolist()])
    return words_x, words_y

x, y = get_data()
print(x, y, "\n")
print(show_data(x, y))

# Define the dataset
class TwoSumDataset(Dataset):
    def __init__(self, size=100000):
        super().__init__()
        self.size = size

    def __len__(self):
        return self.size

    def __getitem__(self, i):
        # Samples are generated on the fly, so the index i is ignored
        return get_data()

ds_train = TwoSumDataset(size=100000)  # training set size: 100,000
ds_val = TwoSumDataset(size=10000)     # validation set size: 10,000

# Data loaders
dl_train = DataLoader(dataset=ds_train,
                      batch_size=200,
                      drop_last=True,
                      shuffle=True)

dl_val = DataLoader(dataset=ds_val,
                    batch_size=200,
                    drop_last=True,
                    shuffle=False)

for src, tgt in dl_train:
    print(src.shape)
    print(tgt.shape)
    break
# torch.Size([200, 50])
# torch.Size([200, 51])

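2. Defining the model

Below we build the Transformer model bottom-up, the way you would build a castle out of toy bricks.

First we build 6 basic components: multi-head attention, feed-forward network, layer normalization, residual connection, word embedding, and positional encoding. This is like using the most basic bricks to make modules such as walls, roofs, fences, pillars, gates, and windows.
Then we use these 6 basic components to build 3 intermediate products: the encoder, the decoder, and the generator. This is like using the basic modules to build the castle's main building, tower, and garden.
Finally we assemble the 3 intermediate products into the complete Transformer model, like putting the main building, tower, and garden together into a complete and beautiful castle.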

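1. Multi-head attention: MultiHeadAttention (fuses information across different tokens; used in three places: ① Encoder self-attention, ② Decoder masked self-attention, ③ Encoder-Decoder cross-attention)

2. Feed-forward network: PositionwiseFeedForward (applies, position by position, a higher-dimensional mapping to the information fused by multi-head attention; FFN for short)

3. Layer normalization: LayerNorm (stabilizes the inputs; in the implementation below each position's feature vector is normalized using only that sample's own values, which suits the variable-length sequences of NLP better than BatchNorm)

4. Residual connection: ResConnection (improves gradient flow and makes the network easier to train; LayerNorm can be applied before the Add or after the residual Add)

5. Word embedding: WordEmbedding (encodes token information; the weights are learned, and the output is multiplied by sqrt(d_model) to keep it at a magnitude comparable to the positional encoding)

6. Positional encoding: PositionEncoding (encodes position information; absolute positions are encoded directly with sin and cos functions)

7. Encoder: TransformerEncoder (encodes the input sequence into a sequence of memory vectors of the same length; a stack of N TransformerEncoderLayer)

8. Decoder: TransformerDecoder (decodes the encoder's memory vectors into another vector sequence of variable length; a stack of N TransformerDecoderLayer)

9. Generator: Generator (maps each vector of the decoder output to a token in the output vocabulary; usually a single Linear layer)

10. The full model: Transformer (used for Seq2Seq transduction such as machine translation; an Encoder-Decoder architecture made of Encoder, Decoder, and Generator)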

import torch
from torch import nn
import torch.nn.functional as F
import copy
import math
import numpy as np
import pandas as pd

def clones(module, N):
    "Produce N identical layers."
    return nn.ModuleList([copy.deepcopy(module) for _ in range(N)])

2.1 Multi-head attention: MultiHeadAttention

This is best understood step by step: first ScaledDotProductAttention, then MultiHeadAttention, and finally MaskedMultiHeadAttention.

class ScaledDotProductAttention(nn.Module):
    "Compute 'Scaled Dot Product Attention'"
    def __init__(self):
        super(ScaledDotProductAttention, self).__init__()

    def forward(self, query, key, value, mask=None, dropout=None):
        d_k = query.size(-1)
        # @ is the matrix-multiplication operator (implemented by __matmul__)
        scores = query @ key.transpose(-2, -1) / math.sqrt(d_k)
        if mask is not None:
            scores = scores.masked_fill(mask == 0, -1e20)
        p_attn = F.softmax(scores, dim=-1)
        if dropout is not None:
            p_attn = dropout(p_attn)
        return p_attn @ value, p_attn

class MultiHeadAttention(nn.Module):
    def __init__(self, h, d_model, dropout=0.1):
        "Take in model size and number of heads."
        super(MultiHeadAttention, self).__init__()
        assert d_model % h == 0
        # We assume d_v always equals d_k
        self.d_k = d_model // h
        self.h = h
        self.linears = clones(nn.Linear(d_model, d_model), 4)
        self.attn = None  # stores the attention matrix of the last forward pass
        self.dropout = nn.Dropout(p=dropout)
        self.attention = ScaledDotProductAttention()

    def forward(self, query, key, value, mask=None):
        if mask is not None:
            # Same mask applied to all h heads.
            mask = mask.unsqueeze(1)
        nbatches = query.size(0)

        # 1) Do all the linear projections in batch from d_model => h x d_k
        query, key, value = [
            l(x).view(nbatches, -1, self.h, self.d_k).transpose(1, 2)
            for l, x in zip(self.linears, (query, key, value))
        ]

        # 2) Apply attention on all the projected vectors in batch.
        x, self.attn = self.attention(query, key, value, mask=mask,
                                      dropout=self.dropout)

        # 3) "Concat" using a view and apply a final linear.
        x = x.transpose(1, 2).contiguous() \
             .view(nbatches, -1, self.h * self.d_k)
        return self.linears[-1](x)


# To keep the information flow in training consistent with decoding, hide the later
# elements of the tgt sequence by setting the attention paid to them to 0
def tril_mask(data):
    "Mask out future positions."
    size = data.size(-1)  # size is the sequence length
    full = torch.full((1, size, size), 1, dtype=torch.int, device=data.device)
    mask = torch.tril(full).bool()
    return mask

# Set the attention paid to <PAD> positions to 0
def pad_mask(data, pad=0):
    "Mask out pad positions."
    mask = (data != pad).unsqueeze(-2)
    return mask

# Compute src_mask and tgt_mask for one batch of data
class MaskedBatch:
    "Object for holding a batch of data with mask during training."
    def __init__(self, src, tgt=None, pad=0):
        self.src = src
        self.src_mask = pad_mask(src, pad)
        if tgt is not None:
            # In training, each tgt token is fed in to predict the next one, so the last token is never an input
            self.tgt = tgt[:, :-1]
            # The first token is always <SOS> and need not be predicted; prediction starts from the second token
            self.tgt_y = tgt[:, 1:]
            self.tgt_mask = \
                self.make_tgt_mask(self.tgt, pad)
            self.ntokens = (self.tgt_y != pad).sum()

    @staticmethod
    def make_tgt_mask(tgt, pad):
        "Create a mask to hide padding and future words."
        tgt_pad_mask = pad_mask(tgt, pad)
        tgt_tril_mask = tril_mask(tgt)
        tgt_mask = tgt_pad_mask & (tgt_tril_mask)
        return tgt_mask

import plotly.express as px  # pip install plotly
# Test tril_mask
mask = tril_mask(torch.zeros(1, 10))  # sequence length is 10
# sns.heatmap(mask[0], cmap=sns.cm.rocket);
px.imshow(mask[0], color_continuous_scale="blues", height=600, width=600)

Output (figure: heatmap of the lower-triangular mask).

Next:

# Test ScaledDotProductAttention
query = torch.tensor([[[0.0, 1.414], [1.414, 0.0], [1.0, 1.0], [-1.0, 1.0], [1.0, -1.0]]])
key = query.clone()
value = query.clone()

attention = ScaledDotProductAttention()

# Without a mask
out, p_att = attention(query, key, value)
fig = px.imshow(p_att[0], color_continuous_scale="blues",
                title="without mask", height=600, width=600)
fig.show()

Output (figure: heatmap of the attention weights without a mask).

# With a mask
out, p_att = attention(query, key, value, mask=tril_mask(torch.zeros(3, 5)))
fig = px.imshow(p_att[0], color_continuous_scale="blues",
                height=600, width=600,
                title="with mask")
fig.show()

Output (figure: heatmap of the attention weights with the lower-triangular mask).

# Test MultiHeadAttention
cross_attn = MultiHeadAttention(h=2, d_model=4)
cross_attn.eval()
q1 = torch.tensor([[[0.1, 0.1, 0.1, 0.1], [0.1, 0.3, 0.1, 0.3]]])
k1 = q1.clone()
v1 = q1.clone()
tgt_mask = tril_mask(torch.zeros(2, 2))

out1 = cross_attn.forward(q1, k1, v1, mask=tgt_mask)
print("out1:\n", out1)

# Change the value of the 2nd element of the sequence; because of the mask, the 1st output is unaffected
q2 = torch.tensor([[[0.1, 0.1, 0.1, 0.1], [0.4, 0.5, 0.5, 0.8]]])
k2 = q2.clone()
v2 = q2.clone()
tgt_mask = tril_mask(torch.zeros(2, 2))
out2 = cross_attn.forward(q2, k2, v2, mask=tgt_mask)
print("out2:\n", out2)

out1:
 tensor([[[ 0.4672, -0.0756,  0.0934,  0.0190],
         [ 0.4808, -0.0639,  0.0991,  0.0419]]], grad_fn=<AddBackward0>)
out2:
 tensor([[[ 0.4672, -0.0756,  0.0934,  0.0190],
         [ 0.4878, -0.0829,  0.1065,  0.0124]]], grad_fn=<AddBackward0>)

# Test MaskedBatch
mbatch = MaskedBatch(src=src, tgt=tgt, pad=0)
print(mbatch.src.shape)
print(mbatch.tgt.shape)
print(mbatch.tgt_y.shape)

print(mbatch.src_mask.shape)
print(mbatch.tgt_mask.shape)
px.imshow(mbatch.tgt_mask[0], color_continuous_scale="blues", width=600, height=600)

Output (the printed shapes, followed by a heatmap figure of mbatch.tgt_mask[0]):

torch.Size([200, 50])
torch.Size([200, 50])
torch.Size([200, 50])
torch.Size([200, 1, 50])
torch.Size([200, 50, 50])

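A few key questions about the Transformer's multi-head attention are worth sorting out here:

(1) How does the Transformer solve the long-range dependency problem?

The Transformer fuses information from different positions of the sequence through scaled dot-product attention, and this is what addresses long-range dependencies. Take text as an example. In a recurrent network such as an LSTM, two words far apart in the input cannot interact directly; information can only be passed along one time step at a time through the hidden output or the cell state. For two words that are very far apart, the words in between mix so much other information into the hidden output and the cell state that the long-range dependency is hard to capture. In scaled dot-product attention, by contrast, every word computes a dot-product attention score with every other word. The interaction between words is built in and does not depend on distance, so long-range dependencies can be handled.

(2) Can the Transformer be parallelized along the time (sequence) dimension during training and testing?

During training, both the Encoder and the Decoder are parallel along the time (sequence) dimension. During testing, the Encoder is parallel along the sequence dimension, while the Decoder is serial.

The Encoder is the easier case. In both training and prediction it does the same thing: encode the complete, known input into memory, which can be done in parallel along the sequence dimension.

The Decoder is more subtle. At prediction time it cannot be parallel, because the Decoder is autoregressive: its outputs at the first k-1 positions become the input at position k. Until the earlier positions are computed, the later ones have no input, so parallelism is impossible. What about training? The complete target sequence is known during training, but training has to be consistent with prediction: an earlier decoding output must not be influenced by later targets. The Transformer handles this by introducing a mask in the Decoder. When attention fuses features across the sequence, each word's attention weight on every word after it is 0, which guarantees that earlier decoding outputs are not influenced by later targets. As a result, the Decoder can be parallelized along the sequence dimension during training.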
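The claim that a causal mask makes parallel training consistent with serial decoding can be checked numerically. The sketch below is a standalone illustration rather than the article's classes: the `attention` function, the tensor sizes, and the use of `-inf` for masked scores are choices made for this example. It computes masked attention for all positions at once and compares it with attention computed position by position over the visible prefix only.

```python
import math
import torch

torch.manual_seed(0)
seq_len, d_k = 6, 8
q = torch.randn(seq_len, d_k)
k = torch.randn(seq_len, d_k)
v = torch.randn(seq_len, d_k)

def attention(q, k, v, mask=None):
    """Plain scaled dot-product attention on 2-D tensors (illustrative helper)."""
    scores = q @ k.T / math.sqrt(q.size(-1))
    if mask is not None:
        scores = scores.masked_fill(~mask, float("-inf"))
    return torch.softmax(scores, dim=-1) @ v

# Parallel (training style): every position at once, under a lower-triangular mask.
causal = torch.tril(torch.ones(seq_len, seq_len)).bool()
parallel_out = attention(q, k, v, causal)

# Serial (inference style): position t only ever sees keys/values 0..t.
serial_out = torch.stack([
    attention(q[t:t + 1], k[:t + 1], v[:t + 1])[0] for t in range(seq_len)
])

print(torch.allclose(parallel_out, serial_out, atol=1e-5))
```

Both routes produce the same vectors, which is why the decoder can be trained on the whole target sequence in one pass.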

(3) Why does Scaled Dot-Product Attention divide by sqrt(d_k)?
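The divisor in question is sqrt(d_k), as in the `math.sqrt(d_k)` term of the attention code above. The usual explanation, given in the original paper, is that if the components of the query and key are independent with zero mean and unit variance, their dot product has variance d_k, so for large d_k the scores grow large and push the softmax into regions with very small gradients. The sketch below checks the variance argument numerically; the standard-normal inputs and the sample size are assumptions of this example.

```python
import math
import torch

torch.manual_seed(0)
n = 20000          # number of random (query, key) pairs per setting
variances = {}
for d_k in (4, 64, 512):
    q = torch.randn(n, d_k)
    k = torch.randn(n, d_k)
    raw = (q * k).sum(dim=-1)        # unscaled dot products
    scaled = raw / math.sqrt(d_k)    # scaled as in the attention code
    variances[d_k] = (raw.var().item(), scaled.var().item())
    print(d_k, variances[d_k])
```

The unscaled variance grows roughly in proportion to d_k, while the scaled variance stays close to 1 regardless of d_k.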

(4) How does the number of parameters in MultiHeadAttention relate to the number of heads?
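In the implementation above, MultiHeadAttention always holds four `nn.Linear(d_model, d_model)` layers, and the heads are created only by reshaping the projected vectors into h pieces of size d_model // h, so the parameter count should not depend on h. The sketch below checks this with PyTorch's built-in `nn.MultiheadAttention`, used here as a stand-in for the article's class so that the example is self-contained.

```python
from torch import nn

d_model = 32
counts = {}
for h in (1, 2, 4, 8):
    mha = nn.MultiheadAttention(embed_dim=d_model, num_heads=h)
    counts[h] = sum(p.numel() for p in mha.parameters())
print(counts)
```

With d_model = 32 every setting gives 4 * (32 * 32 + 32) = 4224 parameters, which agrees with the four Linear rows of 1,056 parameters each in the model summaries later in the article.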

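2.2 Feed-forward network: PositionwiseFeedForward

Applies, position by position, a higher-dimensional mapping to the information fused by multi-head attention. FFN for short.

The FFN has only two linear layers. The first raises the model vector dimension from d_model (512) to d_ff (2048) and the second brings it back down to d_model (512). Between the two linear layers there is a ReLU followed by a Dropout with rate 0.1.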

class PositionwiseFeedForward(nn.Module):
    "Implements FFN equation."
    def __init__(self, d_model, d_ff, dropout=0.1):
        super(PositionwiseFeedForward, self).__init__()
        self.linear1 = nn.Linear(d_model, d_ff)  # a Linear layer acts on the last dimension by default
        self.linear2 = nn.Linear(d_ff, d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        return self.linear2(self.dropout(F.relu(self.linear1(x))))

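2.3 Layer normalization: LayerNorm

In computer vision, normalization usually means BatchNorm, but in NLP it usually means LayerNorm.

This is because NLP inputs are often sequences of unequal length. With BatchNorm, the later positions of the longer sequences are covered by too few samples to compute reliable normalization statistics, which makes the inputs unstable.

In addition, the PADDING positions of a sequence would receive different non-zero values under BatchNorm, which is very unfriendly to the model. LayerNorm, by contrast, always normalizes a sample using its own features and has neither problem.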
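The difference can be seen in a small experiment. The sketch below uses PyTorch's built-in `nn.LayerNorm` and `nn.BatchNorm1d` (not the custom class defined in this section) with made-up tensor sizes: it perturbs the second sample of a batch and checks whether the normalized output of the first sample changes.

```python
import torch
from torch import nn

torch.manual_seed(0)
d_model = 4
x = torch.randn(2, 3, d_model)     # (batch, sequence, feature)
x_changed = x.clone()
x_changed[1] += 10.0               # perturb only the second sample

# LayerNorm: statistics come from each position's own feature vector.
layer_norm = nn.LayerNorm(d_model)
ln_same = torch.allclose(layer_norm(x)[0], layer_norm(x_changed)[0])

# BatchNorm1d expects (batch, feature, sequence) and, in training mode,
# computes statistics across the whole batch.
batch_norm = nn.BatchNorm1d(d_model)
bn_a = batch_norm(x.transpose(1, 2))[0]
bn_b = batch_norm(x_changed.transpose(1, 2))[0]
bn_same = torch.allclose(bn_a, bn_b)

print("LayerNorm output of sample 0 unchanged:", ln_same)
print("BatchNorm output of sample 0 unchanged:", bn_same)
```

With LayerNorm the first sample is unaffected by what else is in the batch, whereas BatchNorm couples the samples through the shared batch statistics.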

class LayerNorm(nn.Module):
    "Construct a layernorm module (similar to torch.nn.LayerNorm)."
    def __init__(self, features, eps=1e-6):
        super(LayerNorm, self).__init__()
        self.weight = nn.Parameter(torch.ones(features))
        self.bias = nn.Parameter(torch.zeros(features))
        self.eps = eps

    def forward(self, x):
        mean = x.mean(-1, keepdim=True)
        std = x.std(-1, keepdim=True)
        return self.weight * (x - mean) / (std + self.eps) + self.bias

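2.4 Residual connection: ResConnection

Used to improve gradient flow and make the network easier to train. ResConnection combines LayerNorm with the residual Add operation. LayerNorm can be placed at the very start (norm_first=True) or at the end (norm_first=False).

The original paper, "Attention Is All You Need", applies LayerNorm after the residual connection, but some later studies found that applying LayerNorm at the start works somewhat better. Residual connections are essential for training deep networks. Many articles study how residual connections (ResNet) work and explain why they are effective; the main views are as follows.

1. Residual connections improve gradient flow. Intuitively, the gradient from the loss can travel quickly through the skip connections to layers at every depth, which strengthens gradient flow and makes the network easier to train. Mathematically, the residual block f(x)=x+h(x) has derivative f'(x)=1+h'(x), which stays near 1.0 and avoids the vanishing-gradient problem.

2. Residual connections mitigate network degradation. A layer h(x) can be represented by a transformation matrix H. Because many neurons share the same response pattern, many rows of the matrix H equivalent to h(x) may be linearly dependent. The determinant of H is then 0, H is not invertible, and h(x) causes degradation and information loss. With a residual connection, f(x)=x+h(x) corresponds to the transformation matrix F=H+I; the identity matrix I removes the linear dependence among the rows of H and reduces the chance of degradation.

3. Residual connections implement model ensembling. If some blocks are removed from a trained ResNet, prediction accuracy does not collapse, whereas removing some blocks from a trained VGG makes accuracy fall off a cliff. This suggests that the blocks of a ResNet act like base models, which the residual connections combine into an ensemble with better generalization.

4. Residual connections increase expressive power. The family of functions represented by a deep network built from residual blocks is a superset of the family represented by a shallower one, so expressive power can keep growing as residual blocks are added. Without residual connections, the functions representable by a one-layer network f(x) = h1(x) are not necessarily covered by a two-layer network f(x) = h2(h1(x)). With a residual connection, f(x) = h1(x)+h2(h1(x)) is guaranteed to cover the one-layer family: just set all weights of h2 to 0.

Reference: 残差网络的前世今生与原理 (Zhihu)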
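The first two points can be illustrated with a few lines of PyTorch. This is a toy sketch under stated assumptions: the sublayer `h` is an arbitrary function with a small derivative, and `H` is a hand-picked 2x2 matrix with a repeated row. Adding I makes this particular H invertible, although that is not guaranteed for every matrix.

```python
import torch

torch.manual_seed(0)

def h(t):
    """A toy sublayer whose derivative is at most 0.1 in magnitude."""
    return 0.1 * torch.tanh(t)

# Point 1: d/dx [x + h(x)] = 1 + h'(x), checked with autograd.
x = torch.randn(5, requires_grad=True)
(x + h(x)).sum().backward()
grad_residual = x.grad.clone()

x.grad = None
h(x).sum().backward()
grad_plain = x.grad.clone()

print(grad_plain)      # small gradients through the sublayer alone
print(grad_residual)   # the same gradients shifted up by 1

# Point 2: a matrix with a repeated row is singular, while H + I here is not.
H = torch.tensor([[1.0, 2.0], [1.0, 2.0]])
det_H = torch.linalg.det(H).item()
det_F = torch.linalg.det(H + torch.eye(2)).item()
print(det_H, det_F)
```

The skip path contributes a constant 1 to the gradient, so even when the sublayer's own gradient is tiny the signal reaching earlier layers is not.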

class ResConnection(nn.Module):
    """
    A residual connection with a layer norm.
    Note the norm is at last according to the paper, but it may be better at first.
    """
    def __init__(self, size, dropout, norm_first=True):
        super(ResConnection, self).__init__()
        self.norm = LayerNorm(size)
        self.dropout = nn.Dropout(dropout)
        self.norm_first = norm_first

    def forward(self, x, sublayer):
        "Apply residual connection to any sublayer with the same size."
        if self.norm_first:
            return x + self.dropout(sublayer(self.norm(x)))
        else:
            return self.norm(x + self.dropout(sublayer(x)))

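2.5 Word embedding: WordEmbedding (learned weights)

Encodes token information. The weights are learned, and the output is multiplied by sqrt(d_model) to keep it at a magnitude comparable to the positional encoding.

The larger d_model is, the smaller the weight values produced by the nn.init.xavier_uniform_ initialization strategy.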
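The effect of the sqrt(d_model) factor can be checked numerically. The sketch below is standalone and does not use the article's WordEmbedding class; it applies `nn.init.xavier_uniform_` to a bare `nn.Embedding` with a vocabulary of 14 tokens (the size of vocab_x in this article) and compares the standard deviation of the weights before and after scaling. For a weight of shape (vocab, d_model), Xavier uniform initialization has a theoretical standard deviation of sqrt(2 / (vocab + d_model)).

```python
import math
import torch
from torch import nn

torch.manual_seed(0)
vocab = 14
stds = {}
for d_model in (32, 512):
    emb = nn.Embedding(vocab, d_model)
    nn.init.xavier_uniform_(emb.weight)
    raw_std = emb.weight.std().item()
    scaled_std = raw_std * math.sqrt(d_model)   # what WordEmbedding.forward would output
    stds[d_model] = (raw_std, scaled_std)
    print(d_model, stds[d_model])
```

The raw weights shrink as d_model grows, while the scaled values stay on the order of 1, the same order as the sin/cos positional encodings in [-1, 1].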

# Word embedding
class WordEmbedding(nn.Module):
    def __init__(self, d_model, vocab):
        super(WordEmbedding, self).__init__()
        self.embedding = nn.Embedding(vocab, d_model)
        self.d_model = d_model

    def forward(self, x):
        return self.embedding(x) * math.sqrt(self.d_model)  # note here, multiply sqrt(d_model)

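2.6 Positional encoding: PositionEncoding (encoded directly)

PositionEncoding encodes position information, using sin and cos functions to encode absolute positions directly. Both the words and their order matter a great deal for meaning. For example, "你欠我1000块钱" ("you owe me 1000 yuan") and "我欠你1000块钱" ("I owe you 1000 yuan") consist of exactly the same characters, but because the order differs, the meanings are opposite.

Before the Transformer, sentence sequences were usually processed with RNN models. An RNN models order inherently: words are processed by the RNN cell one by one, in their natural order in the sentence, and encoded one at a time. The Transformer, however, processes the words of a sentence in parallel and therefore lacks a representation of word position.

To represent word positions effectively, the Transformer introduces the positional encoding, PositionalEncoding, and adds it to the model input. The Transformer thus represents its input as the sum of the word-embedding vector (learned weights) and the positional-encoding vector (encoded directly).

How is the positional encoding constructed? That is, how is a position sequence such as pos = 0,1,2,3,4,5,... mapped to vectors, one per position? The Transformer uses a positional encoding based on sine and cosine functions.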
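For reference, the sinusoidal encoding from the original paper, which the PositionEncoding code in this section implements, can be written as follows. Here pos is the position, i indexes pairs of dimensions (i = 0, 1, ..., d_model/2 - 1), and the notation PE is taken from the paper rather than from the article's text.

```latex
PE_{(pos,\,2i)} = \sin\!\left(\frac{pos}{10000^{\,2i/d_{\text{model}}}}\right),
\qquad
PE_{(pos,\,2i+1)} = \cos\!\left(\frac{pos}{10000^{\,2i/d_{\text{model}}}}\right)
```

Even dimensions use the sine and odd dimensions the cosine, with wavelengths growing geometrically from 2π up to nearly 10000·2π across the dimension pairs.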

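This encoding has the following advantages:

1. The encoded values lie in [-1,1], a range that is friendly to neural networks.

2. It encodes absolute position: for 0<=pos<=2*pi*10000, every pos has a different positional-encoding vector.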
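Both properties can be checked on a small table. The function `sinusoidal_pe` below is a standalone re-implementation written for this check (it is not the article's PositionEncoding module), and the sizes max_len=1000 and d_model=32 are arbitrary choices.

```python
import math
import torch

def sinusoidal_pe(max_len, d_model):
    """Build the (max_len, d_model) sin/cos table (illustrative helper)."""
    position = torch.arange(max_len, dtype=torch.float32).unsqueeze(1)
    div_term = torch.exp(torch.arange(0, d_model, 2, dtype=torch.float32)
                         * -(math.log(10000.0) / d_model))
    table = torch.zeros(max_len, d_model)
    table[:, 0::2] = torch.sin(position * div_term)
    table[:, 1::2] = torch.cos(position * div_term)
    return table

table = sinusoidal_pe(max_len=1000, d_model=32)
in_range = bool(table.min() >= -1) and bool(table.max() <= 1)
n_unique = torch.unique(table, dim=0).shape[0]   # number of distinct position vectors
print(in_range, n_unique)
```

Every value stays within [-1, 1], and each of the 1000 positions gets its own distinct vector.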

For more discussion of positional encodings, see the following blog post:

《让研究人员绞尽脑汁的Transformer位置编码》 (科学空间 | Scientific Spaces)

# Positional encoding
class PositionEncoding(nn.Module):
    "Implement the PE function."
    def __init__(self, d_model, dropout, max_len=5000):
        super(PositionEncoding, self).__init__()
        self.dropout = nn.Dropout(p=dropout)

        # Compute the positional encodings once in log space.
        pe = torch.zeros(max_len, d_model)
        position = torch.arange(0, max_len).unsqueeze(1)
        div_term = torch.exp(torch.arange(0, d_model, 2) *
                             -(math.log(10000.0) / d_model))
        pe[:, 0::2] = torch.sin(position * div_term)
        pe[:, 1::2] = torch.cos(position * div_term)
        pe = pe.unsqueeze(0)
        self.register_buffer('pe', pe)  # a buffer: saved with the model but not trained

    def forward(self, x):
        x = x + self.pe[:, :x.size(1)]
        return self.dropout(x)

pe = PositionEncoding(120, 0)
z = pe.forward(torch.zeros(1, 100, 120))
df = pd.DataFrame(z[0, :, [0, 20, 60, 110]].data.numpy(), columns=["dim" + c for c in ['0', '20', '60', '110']])
df.insert(0, "x", np.arange(100))
px.line(df, x="x", y=["dim" + c for c in ['0', '20', '60', '110']]).show()

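Output (figure: line plot of the encoding values at dimensions 0, 20, 60, and 110 for positions 0 to 99).

px.imshow(np.squeeze(z.data.numpy()), color_continuous_scale="blues", width=1000, height=800)

Output (figure: heatmap of the full 100 x 120 positional-encoding matrix).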

2.7 Encoder: TransformerEncoder

Encodes the input sequence into a sequence of memory vectors of the same length. It is a stack of N TransformerEncoderLayer.

class TransformerEncoderLayer(nn.Module):
    "TransformerEncoderLayer is made up of self-attn and feed forward (defined below)"
    def __init__(self, size, self_attn, feed_forward, dropout):
        super(TransformerEncoderLayer, self).__init__()
        self.self_attn = self_attn
        self.feed_forward = feed_forward
        self.res_layers = clones(ResConnection(size, dropout), 2)
        self.size = size

    def forward(self, x, mask):
        "Follow Figure 1 (left) for connections."
        x = self.res_layers[0](x, lambda x: self.self_attn(x, x, x, mask))
        return self.res_layers[1](x, self.feed_forward)

class TransformerEncoder(nn.Module):
    "TransformerEncoder is a stack of N TransformerEncoderLayer"
    def __init__(self, layer, N):
        super(TransformerEncoder, self).__init__()
        self.layers = clones(layer, N)
        self.norm = LayerNorm(layer.size)

    def forward(self, x, mask):
        "Pass the input (and mask) through each layer in turn."
        for layer in self.layers:
            x = layer(x, mask)
        return self.norm(x)

    @classmethod
    def from_config(cls, N=6, d_model=512, d_ff=2048, h=8, dropout=0.1):
        # Pass dropout through so the attention layers honor the configured rate
        attn = MultiHeadAttention(h, d_model, dropout)
        ff = PositionwiseFeedForward(d_model, d_ff, dropout)
        layer = TransformerEncoderLayer(d_model, attn, ff, dropout)
        return cls(layer, N)

from torchkeras import summary

src_embed = nn.Sequential(WordEmbedding(d_model=32, vocab=len(vocab_x)),
                          PositionEncoding(d_model=32, dropout=0.1))
encoder = TransformerEncoder.from_config(N=3, d_model=32, d_ff=128, h=8, dropout=0.1)
src_mask = pad_mask(src)
memory = encoder(*[src_embed(src), src_mask])
summary(encoder, input_data_args=[src_embed(src), src_mask]);

Output:

--------------------------------------------------------------------------
Layer (type)                              Output Shape             Param #
==========================================================================
LayerNorm-1                               [-1, 50, 32]                  64
Linear-2                                  [-1, 50, 32]               1,056
Linear-3                                  [-1, 50, 32]               1,056
Linear-4                                  [-1, 50, 32]               1,056
Dropout-5                              [-1, 8, 50, 50]                   0
ScaledDotProductAttention-6             [-1, 8, 50, 4]                   0
Linear-7                                  [-1, 50, 32]               1,056
Dropout-8                                 [-1, 50, 32]                   0
LayerNorm-9                               [-1, 50, 32]                  64
Linear-10                                [-1, 50, 128]               4,224
Dropout-11                               [-1, 50, 128]                   0
Linear-12                                 [-1, 50, 32]               4,128
Dropout-13                                [-1, 50, 32]                   0
LayerNorm-14                              [-1, 50, 32]                  64
Linear-15                                 [-1, 50, 32]               1,056
Linear-16                                 [-1, 50, 32]               1,056
Linear-17                                 [-1, 50, 32]               1,056
Dropout-18                             [-1, 8, 50, 50]                   0
ScaledDotProductAttention-19            [-1, 8, 50, 4]                   0
Linear-20                                 [-1, 50, 32]               1,056
Dropout-21                                [-1, 50, 32]                   0
LayerNorm-22                              [-1, 50, 32]                  64
Linear-23                                [-1, 50, 128]               4,224
Dropout-24                               [-1, 50, 128]                   0
Linear-25                                 [-1, 50, 32]               4,128
Dropout-26                                [-1, 50, 32]                   0
LayerNorm-27                              [-1, 50, 32]                  64
Linear-28                                 [-1, 50, 32]               1,056
Linear-29                                 [-1, 50, 32]               1,056
Linear-30                                 [-1, 50, 32]               1,056
Dropout-31                             [-1, 8, 50, 50]                   0
ScaledDotProductAttention-32            [-1, 8, 50, 4]                   0
Linear-33                                 [-1, 50, 32]               1,056
Dropout-34                                [-1, 50, 32]                   0
LayerNorm-35                              [-1, 50, 32]                  64
Linear-36                                [-1, 50, 128]               4,224
Dropout-37                               [-1, 50, 128]                   0
Linear-38                                 [-1, 50, 32]               4,128
Dropout-39                                [-1, 50, 32]                   0
LayerNorm-40                              [-1, 50, 32]                  64
==========================================================================
Total params: 38,176
Trainable params: 38,176
Non-trainable params: 0
--------------------------------------------------------------------------
Input size (MB): 0.000000
Forward/backward pass size (MB): 1.129150
Params size (MB): 0.145630
Estimated Total Size (MB): 1.274780
--------------------------------------------------------------------------

2.8 Decoder: TransformerDecoder

Decodes the memory vectors produced by the encoder into another vector sequence of variable length. It is a stack of N TransformerDecoderLayer.

class TransformerDecoderLayer(nn.Module):
    "TransformerDecoderLayer is made of self-attn, cross-attn, and feed forward (defined below)"
    def __init__(self, size, self_attn, cross_attn, feed_forward, dropout):
        super(TransformerDecoderLayer, self).__init__()
        self.size = size
        self.self_attn = self_attn
        self.cross_attn = cross_attn
        self.feed_forward = feed_forward
        self.res_layers = clones(ResConnection(size, dropout), 3)

    def forward(self, x, memory, src_mask, tgt_mask):
        "Follow Figure 1 (right) for connections."
        m = memory
        x = self.res_layers[0](x, lambda x: self.self_attn(x, x, x, tgt_mask))
        x = self.res_layers[1](x, lambda x: self.cross_attn(x, m, m, src_mask))
        return self.res_layers[2](x, self.feed_forward)

class TransformerDecoder(nn.Module):
    "Generic N layer decoder with masking."
    def __init__(self, layer, N):
        super(TransformerDecoder, self).__init__()
        self.layers = clones(layer, N)
        self.norm = LayerNorm(layer.size)

    def forward(self, x, memory, src_mask, tgt_mask):
        for layer in self.layers:
            x = layer(x, memory, src_mask, tgt_mask)
        return self.norm(x)

    @classmethod
    def from_config(cls, N=6, d_model=512, d_ff=2048, h=8, dropout=0.1):
        # Pass dropout through so the attention layers honor the configured rate
        self_attn = MultiHeadAttention(h, d_model, dropout)
        cross_attn = MultiHeadAttention(h, d_model, dropout)
        ff = PositionwiseFeedForward(d_model, d_ff, dropout)
        layer = TransformerDecoderLayer(d_model, self_attn, cross_attn, ff, dropout)
        return cls(layer, N)

from torchkeras import summary

mbatch = MaskedBatch(src=src, tgt=tgt, pad=0)

src_embed = nn.Sequential(WordEmbedding(d_model=32, vocab=len(vocab_x)),
                          PositionEncoding(d_model=32, dropout=0.1))
encoder = TransformerEncoder.from_config(N=3, d_model=32, d_ff=128, h=8, dropout=0.1)
memory = encoder(src_embed(src), mbatch.src_mask)

tgt_embed = nn.Sequential(WordEmbedding(d_model=32, vocab=len(vocab_y)),
                          PositionEncoding(d_model=32, dropout=0.1))
decoder = TransformerDecoder.from_config(N=3, d_model=32, d_ff=128, h=8, dropout=0.1)

result = decoder.forward(tgt_embed(mbatch.tgt), memory, mbatch.src_mask, mbatch.tgt_mask)
summary(decoder, input_data_args=[tgt_embed(mbatch.tgt), memory,
                                  mbatch.src_mask, mbatch.tgt_mask]);
decoder.eval()
# Change the token at position 1; with tgt_mask in place this should not affect position 0.
# Caveat: the decoder ends with a LayerNorm, so the sum over the feature dimension is close
# to 0 for any input. Comparing result[0][0] itself across the two runs (with tgt_embed also
# in eval mode to switch off its dropout) would be a stricter check.
mbatch.tgt[0][1] = 8
result = decoder.forward(tgt_embed(mbatch.tgt), memory, mbatch.src_mask, mbatch.tgt_mask)
print(torch.sum(result[0][0]))

mbatch.tgt[0][1] = 7
result = decoder.forward(tgt_embed(mbatch.tgt), memory, mbatch.src_mask, mbatch.tgt_mask)
print(torch.sum(result[0][0]))

Output:

--------------------------------------------------------------------------
Layer (type)                              Output Shape             Param #
==========================================================================
LayerNorm-1                               [-1, 50, 32]                  64
Linear-2                                  [-1, 50, 32]               1,056
Linear-3                                  [-1, 50, 32]               1,056
Linear-4                                  [-1, 50, 32]               1,056
Dropout-5                              [-1, 8, 50, 50]                   0
ScaledDotProductAttention-6             [-1, 8, 50, 4]                   0
Linear-7                                  [-1, 50, 32]               1,056
Dropout-8                                 [-1, 50, 32]                   0
LayerNorm-9                               [-1, 50, 32]                  64
Linear-10                                 [-1, 50, 32]               1,056
Linear-11                                 [-1, 50, 32]               1,056
Linear-12                                 [-1, 50, 32]               1,056
Dropout-13                             [-1, 8, 50, 50]                   0
ScaledDotProductAttention-14            [-1, 8, 50, 4]                   0
Linear-15                                 [-1, 50, 32]               1,056
Dropout-16                                [-1, 50, 32]                   0
LayerNorm-17                              [-1, 50, 32]                  64
Linear-18                                [-1, 50, 128]               4,224
Dropout-19                               [-1, 50, 128]                   0
Linear-20                                 [-1, 50, 32]               4,128
Dropout-21                                [-1, 50, 32]                   0
LayerNorm-22                              [-1, 50, 32]                  64
Linear-23                                 [-1, 50, 32]               1,056
Linear-24                                 [-1, 50, 32]               1,056
Linear-25                                 [-1, 50, 32]               1,056
Dropout-26                             [-1, 8, 50, 50]                   0
ScaledDotProductAttention-27            [-1, 8, 50, 4]                   0
Linear-28                                 [-1, 50, 32]               1,056
Dropout-29                                [-1, 50, 32]                   0
LayerNorm-30                              [-1, 50, 32]                  64
Linear-31                                 [-1, 50, 32]               1,056
Linear-32                                 [-1, 50, 32]               1,056
Linear-33                                 [-1, 50, 32]               1,056
Dropout-34                             [-1, 8, 50, 50]                   0
ScaledDotProductAttention-35            [-1, 8, 50, 4]                   0
Linear-36                                 [-1, 50, 32]               1,056
Dropout-37                                [-1, 50, 32]                   0
LayerNorm-38                              [-1, 50, 32]                  64
Linear-39                                [-1, 50, 128]               4,224
Dropout-40                               [-1, 50, 128]                   0
Linear-41                                 [-1, 50, 32]               4,128
Dropout-42                                [-1, 50, 32]                   0
LayerNorm-43                              [-1, 50, 32]                  64
Linear-44                                 [-1, 50, 32]               1,056
Linear-45                                 [-1, 50, 32]               1,056
Linear-46                                 [-1, 50, 32]               1,056
Dropout-47                             [-1, 8, 50, 50]                   0
ScaledDotProductAttention-48            [-1, 8, 50, 4]                   0
Linear-49                                 [-1, 50, 32]               1,056
Dropout-50                                [-1, 50, 32]                   0
LayerNorm-51                              [-1, 50, 32]                  64
Linear-52                                 [-1, 50, 32]               1,056
Linear-53                                 [-1, 50, 32]               1,056
Linear-54                                 [-1, 50, 32]               1,056
Dropout-55                             [-1, 8, 50, 50]                   0
ScaledDotProductAttention-56            [-1, 8, 50, 4]                   0
Linear-57                                 [-1, 50, 32]               1,056
Dropout-58                                [-1, 50, 32]                   0
LayerNorm-59                              [-1, 50, 32]                  64
Linear-60                                [-1, 50, 128]               4,224
Dropout-61                               [-1, 50, 128]                   0
Linear-62                                 [-1, 50, 32]               4,128
Dropout-63                                [-1, 50, 32]                   0
LayerNorm-64                              [-1, 50, 32]                  64
==========================================================================
Total params: 51,040
Trainable params: 51,040
Non-trainable params: 0
--------------------------------------------------------------------------
Input size (MB): 0.000000
Forward/backward pass size (MB): 1.843262
Params size (MB): 0.194702
Estimated Total Size (MB): 2.037964
--------------------------------------------------------------------------
tensor(0., grad_fn=<SumBackward0>)
tensor(-4.7684e-07, grad_fn=<SumBackward0>)

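2.9 Generator: Generator

Maps each vector of the decoder output, one by one, to the probabilities of picking each token in the output vocabulary.

It usually consists of a Linear layer followed by F.log_softmax, which is fairly simple. The reason for using F.log_softmax rather than F.softmax is that for very small probabilities such as 1e-100, F.log_softmax can represent their magnitude more accurately under limited floating-point precision.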
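The precision argument is easy to see with a pair of logits that are far apart. The sketch below is a standalone illustration (the logit values are made up): taking the log of a float32 softmax loses the small probability entirely, while `F.log_softmax` keeps it.

```python
import torch
import torch.nn.functional as F

logits = torch.tensor([0.0, -200.0])           # float32 by default
naive = torch.log(F.softmax(logits, dim=-1))   # the tiny probability underflows to 0 first
stable = F.log_softmax(logits, dim=-1)         # computed directly in log space
print(naive)
print(stable)
```

The second probability is about exp(-200), far below what float32 can represent, so the naive route yields -inf for its log, whereas log_softmax returns a finite value close to -200.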

class Generator(nn.Module):
    "Define standard linear + softmax generation step."
    def __init__(self, d_model, vocab):
        super(Generator, self).__init__()
        self.proj = nn.Linear(d_model, vocab)

    def forward(self, x):
        return F.log_softmax(self.proj(x), dim=-1)

generator = Generator(d_model=32, vocab=len(vocab_y))
log_probs = generator(result)
probs = torch.exp(log_probs)
print("output_probs.shape:", probs.shape)
print("sum(probs)=1:")
print(torch.sum(probs, dim=-1)[0])

summary(generator, input_data=result);

Output:

output_probs.shape: torch.Size([200, 50, 13])
sum(probs)=1:
tensor([1.0000, 1.0000, 1.0000, 1.0000, 1.0000, 1.0000, 1.0000, 1.0000, 1.0000,
        1.0000, 1.0000, 1.0000, 1.0000, 1.0000, 1.0000, 1.0000, 1.0000, 1.0000,
        1.0000, 1.0000, 1.0000, 1.0000, 1.0000, 1.0000, 1.0000, 1.0000, 1.0000,
        1.0000, 1.0000, 1.0000, 1.0000, 1.0000, 1.0000, 1.0000, 1.0000, 1.0000,
        1.0000, 1.0000, 1.0000, 1.0000, 1.0000, 1.0000, 1.0000, 1.0000, 1.0000,
        1.0000, 1.0000, 1.0000, 1.0000, 1.0000], grad_fn=<SelectBackward0>)
--------------------------------------------------------------------------
Layer (type)                              Output Shape             Param #
==========================================================================
Linear-1                                  [-1, 50, 13]                 429
==========================================================================
Total params: 429
Trainable params: 429
Non-trainable params: 0
--------------------------------------------------------------------------
Input size (MB): 0.000069
Forward/backward pass size (MB): 0.004959
Params size (MB): 0.001637
Estimated Total Size (MB): 0.006664
--------------------------------------------------------------------------

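2.10 Transformer

The Transformer is used for Seq2Seq transduction such as machine translation. It adopts an encoder-decoder architecture and is composed of an Encoder, a Decoder, and a Generator.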

from torch import nn
class Transformer(nn.Module):
    """
    A standard Encoder-Decoder architecture. Base for this and many other models.
    """
    def __init__(self, encoder, decoder, src_embed, tgt_embed, generator):
        super(Transformer, self).__init__()
        self.encoder = encoder
        self.decoder = decoder
        self.src_embed = src_embed
        self.tgt_embed = tgt_embed
        self.generator = generator
        self.reset_parameters()

    def forward(self, src, tgt, src_mask, tgt_mask):
        "Take in and process masked src and target sequences."
        return self.generator(self.decode(self.encode(src, src_mask),
                src_mask, tgt, tgt_mask))

    def encode(self, src, src_mask):
        return self.encoder(self.src_embed(src), src_mask)

    def decode(self, memory, src_mask, tgt, tgt_mask):
        return self.decoder(self.tgt_embed(tgt), memory, src_mask, tgt_mask)

    @classmethod
    def from_config(cls,src_vocab,tgt_vocab,N=6,d_model=512, d_ff=2048, h=8, dropout=0.1):
        encoder = TransformerEncoder.from_config(N=N,d_model=d_model,
                  d_ff=d_ff, h=h, dropout=dropout)
        decoder = TransformerDecoder.from_config(N=N,d_model=d_model,
                  d_ff=d_ff, h=h, dropout=dropout)
        src_embed = nn.Sequential(WordEmbedding(d_model, src_vocab), PositionEncoding(d_model, dropout))
        tgt_embed = nn.Sequential(WordEmbedding(d_model, tgt_vocab), PositionEncoding(d_model, dropout))

        generator = Generator(d_model, tgt_vocab)
        return cls(encoder, decoder, src_embed, tgt_embed, generator)

    def reset_parameters(self):
        for p in self.parameters():
            if p.dim() > 1:
                nn.init.xavier_uniform_(p)

from torchkeras import summary
net = Transformer.from_config(src_vocab = len(vocab_x),tgt_vocab = len(vocab_y),
                   N=2, d_model=32, d_ff=128, h=8, dropout=0.1)

mbatch = MaskedBatch(src=src,tgt=tgt,pad=0)

summary(net,input_data_args = [mbatch.src,mbatch.tgt,mbatch.src_mask,mbatch.tgt_mask]);

Output:

--------------------------------------------------------------------------
Layer (type)                            Output Shape               Param #
==========================================================================
Embedding-1                             [-1, 50, 32]                   448
Dropout-2                               [-1, 50, 32]                     0
LayerNorm-3                             [-1, 50, 32]                    64
Linear-4                                [-1, 50, 32]                 1,056
Linear-5                                [-1, 50, 32]                 1,056
Linear-6                                [-1, 50, 32]                 1,056
Dropout-7                            [-1, 8, 50, 50]                     0
ScaledDotProductAttention-8           [-1, 8, 50, 4]                     0
Linear-9                                [-1, 50, 32]                 1,056
Dropout-10                              [-1, 50, 32]                     0
LayerNorm-11                            [-1, 50, 32]                    64
Linear-12                              [-1, 50, 128]                 4,224
Dropout-13                             [-1, 50, 128]                     0
Linear-14                               [-1, 50, 32]                 4,128
Dropout-15                              [-1, 50, 32]                     0
LayerNorm-16                            [-1, 50, 32]                    64
Linear-17                               [-1, 50, 32]                 1,056
Linear-18                               [-1, 50, 32]                 1,056
Linear-19                               [-1, 50, 32]                 1,056
Dropout-20                           [-1, 8, 50, 50]                     0
ScaledDotProductAttention-21          [-1, 8, 50, 4]                     0
Linear-22                               [-1, 50, 32]                 1,056
Dropout-23                              [-1, 50, 32]                     0
LayerNorm-24                            [-1, 50, 32]                    64
Linear-25                              [-1, 50, 128]                 4,224
Dropout-26                             [-1, 50, 128]                     0
Linear-27                               [-1, 50, 32]                 4,128
Dropout-28                              [-1, 50, 32]                     0
LayerNorm-29                            [-1, 50, 32]                    64
Embedding-30                            [-1, 50, 32]                   416
Dropout-31                              [-1, 50, 32]                     0
LayerNorm-32                            [-1, 50, 32]                    64
Linear-33                               [-1, 50, 32]                 1,056
Linear-34                               [-1, 50, 32]                 1,056
Linear-35                               [-1, 50, 32]                 1,056
Dropout-36                           [-1, 8, 50, 50]                     0
ScaledDotProductAttention-37          [-1, 8, 50, 4]                     0
Linear-38                               [-1, 50, 32]                 1,056
Dropout-39                              [-1, 50, 32]                     0
LayerNorm-40                            [-1, 50, 32]                    64
Linear-41                               [-1, 50, 32]                 1,056
Linear-42                               [-1, 50, 32]                 1,056
Linear-43                               [-1, 50, 32]                 1,056
Dropout-44                           [-1, 8, 50, 50]                     0
ScaledDotProductAttention-45          [-1, 8, 50, 4]                     0
Linear-46                               [-1, 50, 32]                 1,056
Dropout-47                              [-1, 50, 32]                     0
LayerNorm-48                            [-1, 50, 32]                    64
Linear-49                              [-1, 50, 128]                 4,224
Dropout-50                             [-1, 50, 128]                     0
Linear-51                               [-1, 50, 32]                 4,128
Dropout-52                              [-1, 50, 32]                     0
LayerNorm-53                            [-1, 50, 32]                    64
Linear-54                               [-1, 50, 32]                 1,056
Linear-55                               [-1, 50, 32]                 1,056
Linear-56                               [-1, 50, 32]                 1,056
Dropout-57                           [-1, 8, 50, 50]                     0
ScaledDotProductAttention-58          [-1, 8, 50, 4]                     0
Linear-59                               [-1, 50, 32]                 1,056
Dropout-60                              [-1, 50, 32]                     0
LayerNorm-61                            [-1, 50, 32]                    64
Linear-62                               [-1, 50, 32]                 1,056
Linear-63                               [-1, 50, 32]                 1,056
Linear-64                               [-1, 50, 32]                 1,056
Dropout-65                           [-1, 8, 50, 50]                     0
ScaledDotProductAttention-66          [-1, 8, 50, 4]                     0
Linear-67                               [-1, 50, 32]                 1,056
Dropout-68                              [-1, 50, 32]                     0
LayerNorm-69                            [-1, 50, 32]                    64
Linear-70                              [-1, 50, 128]                 4,224
Dropout-71                             [-1, 50, 128]                     0
Linear-72                               [-1, 50, 32]                 4,128
Dropout-73                              [-1, 50, 32]                     0
LayerNorm-74                            [-1, 50, 32]                    64
Linear-75                               [-1, 50, 13]                   429
==========================================================================
Total params: 60,813
Trainable params: 60,813
Non-trainable params: 0
--------------------------------------------------------------------------
Input size (MB): 0.000000
Forward/backward pass size (MB): 2.043533
Params size (MB): 0.231983
Estimated Total Size (MB): 2.275517
--------------------------------------------------------------------------

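3. Training the Model

Training the Transformer mainly relies on two techniques:

1. Learning rate scheduling (Learning Rate Scheduler): improves training stability. The learning rate first grows linearly during warm-up and then decays slowly in proportion to 1/sqrt(step).

2. Label smoothing (Label Smoothing): lets the model focus on the samples it still misclassifies instead of widening the gap between positive and negative predictions on samples it already classifies correctly. With a smoothing value of 0.1, the positive label changes from 1 to 0.9 and the remaining 0.1 is shared among the negative labels (about 0.1/vocab_size each).

After introducing the Optimizer and Loss built on these two techniques, we go on to implement the full training code.

3. Full training code.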

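3.1 Learning Rate Scheduler

Learning rate scheduling improves training stability. The learning rate first grows linearly during warm-up and then decays slowly in proportion to 1/sqrt(step).

Why does warm-up help? One explanation is that it keeps the parameters changing smoothly at the start of training and prevents the model from overfitting the first few batches and getting stuck in a poor local optimum.

At the very beginning the loss is large, and so are the gradients. If the learning rate were also large, their product would be larger still: the parameters would swing violently with the differences between batches, the model could not learn effectively, and it would tend to overfit the first few batches in a way that is hard to undo later.

Once the model has trained for a while, the loss and the gradients are smaller, so the learning rate can be raised without the product becoming too large, and the model still learns smoothly and effectively.

Why lower the learning rate again later? By then the loss is already small and the parameters are near an optimum. If the learning rate is too large, the parameters oscillate around the optimum instead of converging to it.

Reference: Zhihu discussion "Why is the warmup strategy effective in neural networks, and is there a theoretical explanation?" (神经网络中 warmup 策略为什么有效；有什么理论解释么？)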
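The schedule can also be written as a standalone function, which makes the two phases easy to inspect. This is a minimal sketch that mirrors the `rate` method of the `NoamOpt` class in this section, that is, lr = factor * model_size^(-0.5) * min(step * warmup^(-1.5), step^(-0.5)). The function name `noam_rate` is illustrative and not part of the original code.

```python
def noam_rate(step, model_size=512, warmup=4000, factor=1.0):
    """Learning rate at a given step (steps are counted from 1)."""
    return factor * model_size ** -0.5 * min(step * warmup ** -1.5, step ** -0.5)

# The linear branch and the 1/sqrt(step) branch meet at step == warmup,
# which is where the learning rate peaks.
peak = noam_rate(4000, model_size=64, warmup=4000)
half = noam_rate(2000, model_size=64, warmup=4000)    # halfway through warm-up
late = noam_rate(16000, model_size=64, warmup=4000)   # 4x the steps after the peak

print(peak, half, late)
```

With model_size=64 and warmup=4000 the peak is about 0.001976, which agrees with the maximum of the lr column in the training log of section 3.3. Halfway through warm-up the rate is half the peak, and it takes four times as many steps as the warm-up length to fall back to half the peak.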

# Note 1: the learning rate schedule is folded into the Optimizer through inheritance
# Note 2: the "Noam" in NoamOpt is the name of one of the paper's authors
# Note 3: the learning rate changes per step, not per epoch

class NoamOpt(torch.optim.AdamW):
    def __init__(self, params, model_size=512, factor=1.0, warmup=4000,
                 lr=0, betas=(0.9, 0.98), eps=1e-9,
                 weight_decay=0, amsgrad=False):
        super(NoamOpt,self).__init__(params, lr=lr, betas=betas, eps=eps,
                 weight_decay=weight_decay, amsgrad=amsgrad)
        self._step = 0
        self.warmup = warmup
        self.factor = factor
        self.model_size = model_size

    def step(self,closure=None):
        "Update parameters and rate"
        self._step += 1
        rate = self.rate()
        for p in self.param_groups:
            p['lr'] = rate
        super(NoamOpt,self).step(closure=closure)

    def rate(self, step = None):
        "Learning rate at `step`: linear warm-up, then 1/sqrt(step) decay"
        if step is None:
            step = self._step
        return self.factor * \
            (self.model_size ** (-0.5) *
            min(step * self.warmup ** (-1.5),step ** (-0.5)))

optimizer = NoamOpt(net.parameters(),
        model_size=net.src_embed[0].d_model, factor=1.0,
        warmup=400)

import pandas as pd  # needed for the DataFrame below
import plotly.express as px

# three configurations to compare (model_size:warmup)
opts = [NoamOpt(net.parameters(),model_size=512, factor=1, warmup=4000),
        NoamOpt(net.parameters(),model_size=512, factor=1, warmup=8000),
        NoamOpt(net.parameters(),model_size=256, factor=1, warmup=4000)]

steps = np.arange(1, 20000)
rates = [[opt.rate(i) for opt in opts] for i in steps]
dfrates = pd.DataFrame(rates,columns = ["512:4000", "512:8000", "256:4000"])
dfrates["steps"] = steps

fig = px.line(dfrates,x="steps",y=["512:4000", "512:8000", "256:4000"])
fig.layout.yaxis.title = "lr"
fig

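Output: a line chart of the learning rate (lr) against steps for the three configurations (figure not reproduced).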

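3.2 Label Smoothing

Label smoothing lets the model focus on the samples it still misclassifies instead of widening the gap between positive and negative predictions on samples it already classifies correctly. With a smoothing value of 0.1, the positive label changes from 1 to 0.9 and the remaining 0.1 is shared among the negative labels (about 0.1/vocab_size each).

Multi-class classification usually uses a softmax activation. Making the predicted probability of the positive label exactly 1 is very hard, because it requires a logit of positive infinity. Making the predicted probability of a negative label exactly 0 is just as hard, because it requires a logit of negative infinity. In practice we do not need the model to be that certain: it is enough that the positive label scores higher than the negative ones.

Label smoothing therefore spares the model from endlessly enlarging the gap on correctly classified samples and lets it concentrate on the misclassified ones.

Because the output layer already applies F.log_softmax, the loss cannot be nn.CrossEntropyLoss; nn.NLLLoss is needed instead. (Note: nn.LogSoftmax + nn.NLLLoss = nn.CrossEntropyLoss.)

With label smoothing, however, the minimum of a loss computed in the nn.NLLLoss style (cross-entropy against the smoothed labels) can no longer reach 0. Subtracting the entropy of the label distribution itself turns the loss into nn.KLDivLoss. Under label smoothing the two losses have the same gradients and therefore the same optimization behavior, but the minimum of nn.KLDivLoss is 0, which matches our intuition about a loss better.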
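The relationship between the two losses can be written out explicitly. Here p denotes the smoothed label distribution and q the predicted distribution; these symbols are introduced for this note and do not appear in the original.

```latex
\begin{aligned}
\mathrm{KL}(p \,\|\, q)
  &= \sum_i p_i \log \frac{p_i}{q_i}
   = \underbrace{-\sum_i p_i \log q_i}_{\text{cross-entropy } H(p,\,q)}
     \;-\; \underbrace{\Bigl(-\sum_i p_i \log p_i\Bigr)}_{\text{entropy } H(p)} \\
\nabla_q\, \mathrm{KL}(p \,\|\, q)
  &= \nabla_q\, H(p, q)
  \qquad \text{because } H(p) \text{ does not depend on } q .
\end{aligned}
```

The KL divergence is 0 exactly when q = p, while the cross-entropy bottoms out at H(p), which is strictly positive whenever the labels are smoothed.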

class LabelSmoothingLoss(nn.Module):
    "Implement label smoothing."
    def __init__(self, size, padding_idx, smoothing=0.0): # size is the vocabulary size
        super(LabelSmoothingLoss, self).__init__()
        self.criterion = nn.KLDivLoss(reduction="sum")
        self.padding_idx = padding_idx
        self.confidence = 1.0 - smoothing
        self.smoothing = smoothing
        self.size = size
        self.true_dist = None

    def forward(self, x, target):
        assert x.size(1) == self.size
        true_dist = x.data.clone()
        # spread the smoothing mass over size-2 classes: all except the true class and <PAD>
        true_dist.fill_(self.smoothing / (self.size - 2))
        true_dist.scatter_(1, target.data.unsqueeze(1), self.confidence)
        true_dist[:, self.padding_idx] = 0
        # rows whose target is <PAD> are zeroed so they contribute no loss
        mask = torch.nonzero((target.data == self.padding_idx).int())
        if mask.dim() > 0:
            true_dist.index_fill_(0, mask.squeeze(), 0.0)
        self.true_dist = true_dist
        return self.criterion(x, true_dist)

# Example of label smoothing.
smooth_loss = LabelSmoothingLoss(5, 0, 0.4)
predict = torch.FloatTensor([[1e-10, 0.2, 0.7, 0.1, 1e-10],
                             [1e-10, 0.2, 0.7, 0.1, 1e-10],
                             [1e-10, 0.2, 0.7, 0.1, 1e-10]])
loss = smooth_loss(predict.log(), torch.LongTensor([2, 1, 0]))

print("smoothed target:\n",smooth_loss.true_dist,"\n")
print("loss:",loss)
px.imshow(smooth_loss.true_dist,color_continuous_scale="blues",height=600,width=1000)
#smoothed target:
# tensor([[0.0000, 0.1333, 0.6000, 0.1333, 0.1333],
#         [0.0000, 0.6000, 0.1333, 0.1333, 0.1333],
#         [0.0000, 0.0000, 0.0000, 0.0000, 0.0000]])
#
#loss: tensor(5.9712)
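The numbers in this example can be re-derived by hand. The following is a standard-library sketch that follows the same rules as `LabelSmoothingLoss` (confidence on the true class, smoothing/(size-2) elsewhere, zero mass on the padding column and on padded rows). The helper names `smoothed_targets` and `kl_div_sum` are illustrative, not part of the original code.

```python
import math

def smoothed_targets(targets, size, padding_idx, smoothing):
    """Build one smoothed label row per target index."""
    rows = []
    for t in targets:
        if t == padding_idx:               # padded position: contributes no loss
            rows.append([0.0] * size)
            continue
        row = [smoothing / (size - 2)] * size
        row[t] = 1.0 - smoothing           # confidence on the true class
        row[padding_idx] = 0.0             # never assign mass to <PAD>
        rows.append(row)
    return rows

def kl_div_sum(log_q, p):
    """Sum of p * (log p - log q); entries with p == 0 contribute 0."""
    return sum(pi * (math.log(pi) - lqi)
               for p_row, lq_row in zip(p, log_q)
               for pi, lqi in zip(p_row, lq_row) if pi > 0)

predict = [[1e-10, 0.2, 0.7, 0.1, 1e-10]] * 3
log_q = [[math.log(v) for v in row] for row in predict]
true_dist = smoothed_targets([2, 1, 0], size=5, padding_idx=0, smoothing=0.4)
loss = kl_div_sum(log_q, true_dist)

print(true_dist)
print(loss)
```

Most of the loss comes from the last column: the smoothed target puts 0.1333 there while the prediction assigns only 1e-10, so that single entry contributes about 2.8 per non-padded row.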

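3.3 Full Training Code

With the optimizer and the loss in place, we can train the model. We first do a trial computation of the loss and the metric on a single batch, then plug everything into the torchkeras training template. (Note: the snippet below is pseudocode and is not meant to be run.)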

# trial run of the whole pipeline on one batch
for src,tgt in dl_train:
    break
mbatch = MaskedBatch(src=src,tgt=tgt,pad = 0)

net = Transformer.from_config(src_vocab = len(vocab_x),tgt_vocab = len(vocab_y),
                   N=3, d_model=64, d_ff=128, h=8, dropout=0.1)

# loss
loss_fn = LabelSmoothingLoss(size=len(vocab_y),
            padding_idx=0, smoothing=0.2)
preds = net.forward(mbatch.src, mbatch.tgt, mbatch.src_mask, mbatch.tgt_mask)
preds = preds.reshape(-1, preds.size(-1))
labels = mbatch.tgt_y.reshape(-1)
loss = loss_fn(preds, labels)/mbatch.ntokens
print('loss=',loss.item())

# metric: drop the padded positions (label 0) before comparing
preds = preds.argmax(dim=-1).view(-1)[labels!=0]
labels = labels[labels!=0]

acc = (preds==labels).sum()/(labels==labels).sum()
print('acc=',acc.item())

Output:

loss= 2.1108953952789307
acc= 0.08041179925203323

from torchmetrics import Accuracy
# use the metric from torchmetrics
accuracy = Accuracy(task='multiclass',num_classes=len(vocab_y))
accuracy.update(preds,labels)
print('acc=',accuracy.compute().item())

Output:

acc= 0.08041179925203323

Next, we use torchkeras, our "dream alchemy furnace", to implement an elegant training loop.

from torchkeras import KerasModel

class StepRunner:
    def __init__(self, net, loss_fn,
                 accelerator=None, stage = "train", metrics_dict = None,
                 optimizer = None, lr_scheduler = None
                 ):
        self.net,self.loss_fn,self.metrics_dict,self.stage = net,loss_fn,metrics_dict,stage
        self.optimizer,self.lr_scheduler = optimizer,lr_scheduler
        self.accelerator = accelerator
        if self.stage=='train':
            self.net.train()
        else:
            self.net.eval()

    def __call__(self, batch):
        src,tgt = batch
        mbatch = MaskedBatch(src=src,tgt=tgt,pad = 0)

        #loss (use the runner's own net and loss_fn rather than globals)
        with self.accelerator.autocast():
            preds = self.net(mbatch.src, mbatch.tgt, mbatch.src_mask, mbatch.tgt_mask)
            preds = preds.reshape(-1, preds.size(-1))
            labels = mbatch.tgt_y.reshape(-1)
            loss = self.loss_fn(preds, labels)/mbatch.ntokens

            #filter padding
            preds = preds.argmax(dim=-1).view(-1)[labels!=0]
            labels = labels[labels!=0]

        #backward()
        if self.stage=="train" and self.optimizer is not None:
            self.accelerator.backward(loss)
            if self.accelerator.sync_gradients:
                self.accelerator.clip_grad_norm_(self.net.parameters(), 1.0)
            self.optimizer.step()
            if self.lr_scheduler is not None:
                self.lr_scheduler.step()
            self.optimizer.zero_grad()

        all_loss = self.accelerator.gather(loss).sum()
        all_preds = self.accelerator.gather(preds)
        all_labels = self.accelerator.gather(labels)

        #losses (or plain metrics that can be averaged)
        step_losses = {self.stage+"_loss":all_loss.item()}

        step_metrics = {self.stage+"_"+name:metric_fn(all_preds, all_labels).item()
                        for name,metric_fn in self.metrics_dict.items()}

        if self.stage=="train":
            if self.optimizer is not None:
                step_metrics['lr'] = self.optimizer.state_dict()['param_groups'][0]['lr']
            else:
                step_metrics['lr'] = 0.0
        return step_losses,step_metrics

KerasModel.StepRunner = StepRunner

from torchmetrics import Accuracy

net = Transformer.from_config(src_vocab = len(vocab_x),tgt_vocab = len(vocab_y),
                   N=5, d_model=64, d_ff=128, h=8, dropout=0.1)
loss_fn = LabelSmoothingLoss(size=len(vocab_y),
            padding_idx=0, smoothing=0.1)

metrics_dict = {'acc':Accuracy(task='multiclass',num_classes=len(vocab_y))}
optimizer = NoamOpt(net.parameters(),model_size=64)

model = KerasModel(net,
                   loss_fn=loss_fn,
                   metrics_dict=metrics_dict,
                   optimizer = optimizer)

model.fit(
    train_data=dl_train,
    val_data=dl_val,
    epochs=100,
    ckpt_path='checkpoint',
    patience=10,
    monitor='val_acc',
    mode='max',
    callbacks=None,
    plot=True
)

When I trained it myself, a message suggested the GPU had not been picked up and training ran on the CPU:

Detected kernel version 4.15.0, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.

For some reason my training results were not as good as the original author's; my best val_acc was 0.9997. The text output of the run:

    epoch   train_loss  train_acc   lr  val_loss    val_acc
0   1   1.926549    0.118282    0.000247    1.755822    0.157529
1   2   1.761568    0.154317    0.000494    1.721014    0.174109
2   3   1.665665    0.202374    0.000741    1.493793    0.281002
3   4   1.512023    0.271837    0.000988    1.412258    0.303103
4   5   1.437772    0.296994    0.001235    1.381396    0.315089
5   6   1.399837    0.311300    0.001482    1.350971    0.328369
6   7   1.370932    0.322916    0.001729    1.320691    0.342514
7   8   1.343289    0.335374    0.001976    1.288572    0.360102
8   9   1.286426    0.366694    0.001863    1.018766    0.503617
9   10  1.034561    0.513786    0.001768    0.541111    0.778445
10  11  0.713718    0.693464    0.001685    0.228717    0.915925
11  12  0.478800    0.811486    0.001614    0.094066    0.967885
12  13  0.348117    0.867726    0.001550    0.074332    0.974813
13  14  0.280525    0.895304    0.001494    0.053904    0.983355
14  15  0.226962    0.916092    0.001443    0.041132    0.990952
15  16  0.195403    0.928465    0.001398    0.038528    0.990127
16  17  0.171988    0.937497    0.001356    0.035575    0.991251
17  18  0.153889    0.944426    0.001318    0.031089    0.991947
18  19  0.140068    0.950137    0.001282    0.025078    0.995194
19  20  0.123703    0.956331    0.001250    0.021635    0.996846
20  21  0.112552    0.960819    0.001220    0.021932    0.995882
21  22  0.105381    0.963738    0.001192    0.018248    0.997842
22  23  0.097686    0.966738    0.001166    0.016056    0.997908
23  24  0.093353    0.968499    0.001141    0.015239    0.998117
24  25  0.085447    0.971407    0.001118    0.013914    0.998273
25  26  0.080234    0.973560    0.001096    0.013089    0.998318
26  27  0.074848    0.975562    0.001076    0.011789    0.998910
27  28  0.070730    0.977029    0.001056    0.011067    0.998822
28  29  0.066550    0.978457    0.001038    0.010334    0.999155
29  30  0.062746    0.979813    0.001021    0.008421    0.999589
30  31  0.058696    0.981321    0.001004    0.008245    0.999578
31  32  0.055279    0.982486    0.000988    0.007618    0.999606
32  33  0.054744    0.982649    0.000973    0.006576    0.999747
33  34  0.050709    0.984109    0.000959    0.007351    0.999397
34  35  0.049017    0.984489    0.000945    0.006222    0.999713
35  36  0.047814    0.984852    0.000932    0.006155    0.999707
36  37  0.047023    0.985224    0.000919    0.006253    0.999370
37  38  0.043590    0.986276    0.000907    0.006603    0.999370
38  39  0.041667    0.986945    0.000895    0.005294    0.999623
39  40  0.041457    0.986870    0.000884    0.004411    0.999730
40  41  0.040276    0.987317    0.000873    0.005145    0.999600
41  42  0.038181    0.987988    0.000863    0.004716    0.999691
42  43  0.037286    0.988278    0.000852    0.003854    0.999707

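Result plot from the original author (figure not reproduced):

4. Using the Model

Next we run the translation inference with greedy decoding. Training can mask future tokens and therefore process every position of a sentence in parallel along the sequence dimension; inference is different.

During inference, later content can only be translated after the earlier content has been translated and appended to the output, so the process cannot be parallelized along the sequence dimension.

The k-th output of the Decoder and Generator is the probability distribution over the output vocabulary for position k+1 of the decoded sequence, given the memory (the encoded input) and the first k tokens fed to the Decoder (the sequence decoded so far).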
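As a small illustration of this one-position offset, the sketch below pairs each decoder input prefix with the token it is expected to predict. It assumes the usual shift-by-one split (decoder input is the target without its last token, labels are the target without its first token), which is presumably what `MaskedBatch` does earlier in the article; the sequence itself is made up.

```python
# Hypothetical target sequence for "33+33", shown as strings for readability.
y = ["<SOS>", "6", "6", "<EOS>", "<PAD>"]

decoder_input = y[:-1]  # tokens fed to the decoder
labels = y[1:]          # tokens the decoder has to predict

# Output position k only sees decoder_input[:k+1] (causal mask)
# and is trained to predict labels[k], i.e. the token at position k+1 of y.
pairs = [(decoder_input[:k + 1], labels[k]) for k in range(len(labels))]

for prefix, next_token in pairs:
    print(prefix, "->", next_token)
```

During training all of these pairs are evaluated in one forward pass thanks to the mask. During inference the prefixes do not exist yet, so they have to be built one token at a time.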

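Greedy decoding is a simplified way to obtain the decoded result; engineering practice generally uses beam search.

Reference: Zhihu article "Understand Beam Search in Ten Minutes, Part 1: Basics" (十分钟读懂Beam Search 1：基础)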
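To make the difference concrete, here is a toy comparison of greedy decoding and beam search. Everything in it is hypothetical: the probability table `TABLE` stands in for a trained model, and `next_log_probs`, `greedy_decode_toy`, and `beam_search_toy` are illustrative names unrelated to the `greedy_decode` function used in this article.

```python
import math

# Made-up next-token probabilities, keyed by the prefix decoded so far.
TABLE = {
    (): {"A": 0.6, "B": 0.4},
    ("A",): {"A": 0.55, "B": 0.45},
    ("B",): {"A": 0.9, "B": 0.1},
}

def next_log_probs(prefix):
    return {tok: math.log(p) for tok, p in TABLE[tuple(prefix)].items()}

def greedy_decode_toy(length):
    """Pick the single most probable token at every step."""
    seq, score = [], 0.0
    for _ in range(length):
        lp = next_log_probs(seq)
        tok = max(lp, key=lp.get)
        seq.append(tok)
        score += lp[tok]
    return seq, score

def beam_search_toy(length, beam_size):
    """Keep the beam_size best partial sequences at every step."""
    beams = [([], 0.0)]
    for _ in range(length):
        candidates = [(seq + [tok], score + lp)
                      for seq, score in beams
                      for tok, lp in next_log_probs(seq).items()]
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:beam_size]
    return beams[0]

greedy_seq, greedy_score = greedy_decode_toy(2)
beam_seq, beam_score = beam_search_toy(2, beam_size=2)
print(greedy_seq, math.exp(greedy_score))
print(beam_seq, math.exp(beam_score))
```

In this table greedy decoding commits to "A" first and ends with a sequence of probability 0.33, while a beam of size 2 keeps "B" alive and finds a sequence of probability 0.36. With beam_size=1 the beam search reduces to greedy decoding.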

def greedy_decode(net, src, src_mask, max_len, start_symbol):
    net.eval()
    memory = net.encode(src, src_mask)
    # start from a buffer filled with <SOS>; position i+1 is overwritten at step i
    ys = torch.full((len(src),max_len),start_symbol,dtype = src.dtype).to(src.device)
    for i in range(max_len-1):
        out = net.generator(net.decode(memory, src_mask,
              ys, tril_mask(ys)))
        ys[:,i+1]=out.argmax(dim=-1)[:,i]
    return ys

def get_raw_words(tensor,vocab_r) -> list:
    words = [vocab_r[i] for i in tensor.tolist()]
    return words

def get_words(tensor,vocab_r) -> str:
    s = "".join([vocab_r[i] for i in tensor.tolist()])
    words = s[:s.find('<EOS>')].replace('<SOS>','')
    return words

def prepare(x,accelerator=model.accelerator):
    return x.to(accelerator.device)

## decode one translation result
net = model.net
net.eval()
net = prepare(net)
src,tgt = get_data()
src,tgt = prepare(src),prepare(tgt)
mbatch = MaskedBatch(src=src.unsqueeze(dim=0),tgt=tgt.unsqueeze(dim=0))

y_pred = greedy_decode(net,mbatch.src,mbatch.src_mask,50,vocab_y["<SOS>"])
print("input:")
print(get_words(mbatch.src[0],vocab_xr),'\n') # input expression
print("ground truth:")
print(get_words(mbatch.tgt[0],vocab_yr),'\n') # ground-truth label
print("prediction:")
print(get_words(y_pred[0],vocab_yr)) # decoded prediction; predictions at the <PAD> positions of the original label can be ignored

# My own test result:
input:
99996956438+960894583730911

ground truth:
960994580687349

prediction:
960994580687349

5. Evaluating the Model

The acc monitored during training is actually character-level accuracy. Now we compute sample-level accuracy.
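The gap between the two metrics is easy to see on a toy case. The sketch below uses made-up predictions of equal length, and the helper names `char_accuracy` and `sample_accuracy` are illustrative, not functions from the article.

```python
def char_accuracy(preds, targets):
    """Fraction of aligned characters that match (assumes equal-length strings)."""
    pairs = [(p, t) for ps, ts in zip(preds, targets) for p, t in zip(ps, ts)]
    return sum(p == t for p, t in pairs) / len(pairs)

def sample_accuracy(preds, targets):
    """Fraction of samples whose whole string matches."""
    return sum(p == t for p, t in zip(preds, targets)) / len(targets)

preds = ["66666", "12345"]     # made-up model outputs
targets = ["66666", "12346"]   # made-up ground truth

print(char_accuracy(preds, targets))
print(sample_accuracy(preds, targets))
```

A single wrong digit costs one character out of ten here but half of the samples, so sample-level accuracy is the stricter measure and is always at most as high as character-level accuracy on the same predictions.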

from tqdm.auto import tqdm

net = prepare(net)
loop = tqdm(range(1,201))
correct = 0
for i in loop:
    src,tgt = get_data()
    src,tgt = prepare(src),prepare(tgt)
    mbatch = MaskedBatch(src=src.unsqueeze(dim=0),tgt=tgt.unsqueeze(dim=0))
    y_pred = greedy_decode(net,mbatch.src,mbatch.src_mask,50,vocab_y["<SOS>"])

    inputs = get_words(mbatch.src[0],vocab_xr) # input expression
    gt = get_words(mbatch.tgt[0],vocab_yr) # ground-truth label
    preds = get_words(y_pred[0],vocab_yr) # decoded prediction; predictions at the <PAD> positions of the original label can be ignored
    if preds==gt:
        correct+=1
    loop.set_postfix(acc = correct/i)

print("acc=",correct/len(loop))

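My result (perhaps because of differences in training, my acc = 0.97, while the original author's acc = 1.0):

Original author's result:

Perfect: the model has essentially learned to add two numbers. 😋😋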

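A Small Problem Encountered When Running the Code (Solved)

[Problem] Figures are not displayed in Jupyter, with the error: Javascript Error: Something went wrong with axis scaling

[Solution] Upgrade the plotly package, then restart the kernel.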

!pip install plotly --upgrade

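That is the end of the article; thanks for reading. If you have official accounts to recommend, leave a comment so we can learn and improve together.

Source: https://blog.csdn.net/csdn_xmj/article/details/135770996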