There isn't much that needs to be said about this paper; I'd bet more than 70% of my readers have already read it.
But as usual: one, two, three, here's the link.
1706.03762.pdf (arxiv.org)
If I just walked through "Attention is all you need" straight, it might get a bit dry, since so many people have already read it. But this paper is also one you simply can't skip if you're playing with LLMs, so I'm going to read it from my own angle, sneak in some personal takes, and give you an interpretation of my own. I promise you'll get a different, richer "Attention is all you need" than the one you already know.
My goal is to make absolutely sure everyone understands it, so I'll go into a lot of detail. The hope is to produce a Transformer paper walkthrough that even a paramecium could follow.
I'll only paste one part of the original text, the Background section:
Background

The goal of reducing sequential computation also forms the foundation of the Extended Neural GPU [16], ByteNet [18] and ConvS2S [9], all of which use convolutional neural networks as basic building block, computing hidden representations in parallel for all input and output positions. In these models, the number of operations required to relate signals from two arbitrary input or output positions grows in the distance between positions, linearly for ConvS2S and logarithmically for ByteNet. This makes it more difficult to learn dependencies between distant positions [12]. In the Transformer this is reduced to a constant number of operations, albeit at the cost of reduced effective resolution due to averaging attention-weighted positions, an effect we counteract with Multi-Head Attention as described in section 3.2.

Self-attention, sometimes called intra-attention, is an attention mechanism relating different positions of a single sequence in order to compute a representation of the sequence. Self-attention has been used successfully in a variety of tasks including reading comprehension, abstractive summarization, textual entailment and learning task-independent sentence representations [4, 27, 28, 22].

End-to-end memory networks are based on a recurrent attention mechanism instead of sequence-aligned recurrence and have been shown to perform well on simple-language question answering and language modeling tasks [34].
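To make that "constant number of operations" point concrete before we even get to Section 3.2: in self-attention, every position looks at every other position in a single matrix multiply, so connecting two tokens costs the same whether they are 2 or 200 positions apart, unlike ConvS2S (linear in distance) or ByteNet (logarithmic). Below is a minimal sketch of single-head scaled dot-product attention in plain NumPy. To be clear, this is my own illustrative code, not the authors' implementation: the function names, the toy shapes, and the random projection matrices are all assumptions for demonstration, and the real model adds multi-head projections, masking, and learned parameters.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """Single-head scaled dot-product attention (illustrative sketch).

    Q, K, V: (seq_len, d_k) arrays. Every position attends to every other
    position in one matrix multiply, which is why the cost of relating any
    two positions is constant, independent of their distance.
    """
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # (seq_len, seq_len) pairwise similarities
    weights = softmax(scores, axis=-1)   # attention distribution per query position
    return weights @ V                   # weighted average of the value vectors

# Toy usage: a "sentence" of 4 tokens with model dimension 8 (hypothetical numbers).
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))                                # token embeddings
Wq, Wk, Wv = (rng.normal(size=(8, 8)) for _ in range(3))   # stand-ins for learned projections
out = scaled_dot_product_attention(x @ Wq, x @ Wk, x @ Wv)
print(out.shape)                                           # (4, 8)
```

Notice that the output is an attention-weighted average of the value vectors, which is exactly the "reduced effective resolution due to averaging" the paragraph above mentions, and exactly what Multi-Head Attention in Section 3.2 is there to counteract.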