There isn't much that needs to be said about this paper; I'd bet more than 70% of my readers have already read it.
But as usual: one, two, three, here's the link.
1706.03762.pdf (arxiv.org)
If I just walked through "Attention is all you need" straight, it might get a bit dry, since so many people have already read it. But this paper is also one you simply can't skip if you're playing with LLMs, so I'm going to read it from my own angle, sneak in some personal takes, and give you an interpretation of my own. I promise you'll get a different, richer "Attention is all you need" than the one you already know.
My goal is to make absolutely sure everyone understands it, so I'll go into a lot of detail. The hope is to produce a Transformer paper walkthrough that even a paramecium could follow.
I'll only paste one part of the original text, the Background section:
Background

The goal of reducing sequential computation also forms the foundation of the Extended Neural GPU [16], ByteNet [18] and ConvS2S [9], all of which use convolutional neural networks as basic building block, computing hidden representations in parallel for all input and output positions. In these models, the number of operations required to relate signals from two arbitrary input or output positions grows in the distance between positions, linearly for ConvS2S and logarithmically for ByteNet. This makes it more difficult to learn dependencies between distant positions [12]. In the Transformer this is reduced to a constant number of operations, albeit at the cost of reduced effective resolution due to averaging attention-weighted positions, an effect we counteract with Multi-Head Attention as described in section 3.2.

Self-attention, sometimes called intra-attention, is an attention mechanism relating different positions of a single sequence in order to compute a representation of the sequence. Self-attention has been used successfully in a variety of tasks including reading comprehension, abstractive summarization, textual entailment and learning task-independent sentence representations [4, 27, 28, 22].

End-to-end memory networks are based on a recurrent attention mechanism instead of sequence-aligned recurrence and have been shown to perform well on simple-language question answering and language modeling tasks [34].
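To make that "constant number of operations" point concrete before we even get to Section 3.2: in self-attention, every position looks at every other position in a single matrix multiply, so connecting two tokens costs the same whether they are 2 or 200 positions apart, unlike ConvS2S (linear in distance) or ByteNet (logarithmic). Below is a minimal sketch of single-head scaled dot-product attention in plain NumPy. To be clear, this is my own illustrative code, not the authors' implementation: the function names, the toy shapes, and the random projection matrices are all assumptions for demonstration, and the real model adds multi-head projections, masking, and learned parameters.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """Single-head scaled dot-product attention (illustrative sketch).

    Q, K, V: (seq_len, d_k) arrays. Every position attends to every other
    position in one matrix multiply, which is why the cost of relating any
    two positions is constant, independent of their distance.
    """
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # (seq_len, seq_len) pairwise similarities
    weights = softmax(scores, axis=-1)   # attention distribution per query position
    return weights @ V                   # weighted average of the value vectors

# Toy usage: a "sentence" of 4 tokens with model dimension 8 (hypothetical numbers).
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))                                # token embeddings
Wq, Wk, Wv = (rng.normal(size=(8, 8)) for _ in range(3))   # stand-ins for learned projections
out = scaled_dot_product_attention(x @ Wq, x @ Wk, x @ Wv)
print(out.shape)                                           # (4, 8)
```

Notice that the output is an attention-weighted average of the value vectors, which is exactly the "reduced effective resolution due to averaging" the paragraph above mentions, and exactly what Multi-Head Attention in Section 3.2 is there to counteract.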