Transformer


From the paper "Attention Is All You Need".


Decoders are usually divided into autoregressive decoders (AT) and non-autoregressive decoders (NAT).

Autoregressive decoder (AT)


Differences between the encoder and the decoder


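The decoder's self-attention is masked: when computing b2, only the vectors a1 and a2 are considered, and likewise b3 is computed from a1, a2, and a3 only. Each output depends only on the inputs at or before its own position.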
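The masking rule can be sketched in a few lines of plain Python. This is a minimal illustration, not the paper's full layer: it assumes q = k = v = a (no learned projections, a single head), and the function names are made up for this example.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def dot(u, v):
    return sum(x * y for x, y in zip(u, v))

def masked_self_attention(a):
    """Toy masked self-attention; assumes q = k = v = a (no learned weights)."""
    d = len(a[0])
    outputs = []
    for i, q in enumerate(a):
        visible = a[: i + 1]  # only positions 0..i; later positions are masked out
        weights = softmax([dot(q, k) / math.sqrt(d) for k in visible])
        b_i = [sum(w * v[j] for w, v in zip(weights, visible)) for j in range(d)]
        outputs.append(b_i)
    return outputs

a = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
b = masked_self_attention(a)
```

Changing the last input vector leaves the earlier outputs untouched, which is exactly the property the mask is meant to guarantee.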

Why masked?

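Because of how the decoder works: its input at each step is its own output from the previous steps, so when an output is computed the later input vectors do not exist yet and cannot be taken into account.

How does the output stop?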

Add a special END token; generation stops once the decoder outputs END.
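A rough sketch of the resulting decoding loop is shown here. `toy_next_token` is a hypothetical stand-in for a trained decoder, and the `max_len` safeguard is an assumption added for this example so the loop cannot run forever.

```python
BEGIN, END = "<BEGIN>", "<END>"

def toy_next_token(prefix):
    """Hypothetical stand-in for a trained decoder: next token given the prefix."""
    script = ["machine", "learning", END]
    return script[len(prefix) - 1]  # the prefix always starts with BEGIN

def autoregressive_decode(next_token, max_len=50):
    tokens = [BEGIN]
    while len(tokens) - 1 < max_len:  # assumed safeguard against endless output
        tok = next_token(tokens)      # each step sees all previously generated tokens
        if tok == END:
            break                     # END terminates generation
        tokens.append(tok)
    return tokens[1:]                 # drop BEGIN

result = autoregressive_decode(toy_next_token)
```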

Non-autoregressive decoder (NAT)

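AT:
The AT decoder generates output step by step, starting from BEGIN and continuing until it outputs END.

NAT:
The NAT decoder generates the whole sentence in a single pass instead of producing one word at a time.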
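For contrast with the step-by-step AT loop, here is a hedged sketch of one-pass NAT decoding. The notes do not say how the output length is chosen; this example assumes the caller fixes a length in advance and cuts the result at the first END. `toy_parallel_decoder` is a hypothetical placeholder for a trained model.

```python
BEGIN, END = "<BEGIN>", "<END>"

def toy_parallel_decoder(inputs):
    """Hypothetical NAT decoder: one output per input position, produced together."""
    script = ["machine", "learning", END, END]
    return [script[min(i, len(script) - 1)] for i in range(len(inputs))]

def non_autoregressive_decode(decoder, length):
    outputs = decoder([BEGIN] * length)  # a single call, no loop over time steps
    if END in outputs:
        outputs = outputs[: outputs.index(END)]  # keep only what precedes the first END
    return outputs

sentence = non_autoregressive_decode(toy_parallel_decoder, length=4)
```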


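The cross-attention block takes three inputs: two from the encoder and one from the decoder.

The decoder provides the query q, and the encoder provides the key k and value v. The attention scores a are computed from q and k, the scores a then weight v to produce the output vector, and that vector is passed to the feed-forward layer in the decoder.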
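The q, k, v flow can be sketched for a single decoder position as follows. This is a simplified illustration: the learned projection matrices are omitted, the division by the square root of the dimension follows the scaled dot-product form from the paper rather than anything stated in these notes, and the function name is made up for the example.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(q, keys, values):
    """q comes from the decoder; keys and values come from the encoder."""
    d = len(q)
    scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
    a = softmax(scores)  # attention scores over encoder positions
    out = [sum(w * v[j] for w, v in zip(a, values)) for j in range(len(values[0]))]
    return a, out        # out would go on to the decoder's feed-forward layer

q = [1.0, 0.0]                       # decoder query
keys = [[1.0, 0.0], [0.0, 1.0]]      # encoder keys
values = [[10.0, 0.0], [0.0, 10.0]]  # encoder values
a, out = cross_attention(q, keys, values)
```

Because q matches the first key more closely, the first encoder position receives the larger score and dominates the output vector.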


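Taking speech recognition as the example:

- The encoder input is a speech sequence.
- The decoder input is the correct label.
- The decoder's target output is the label followed by END.
- Training minimizes the cross-entropy between the decoder's output and the label, as in a multi-class classification problem.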
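The training setup can be sketched as below. The helper names and the probability tables are hypothetical, chosen only to show how the decoder input and target are shifted against each other and how the per-step cross-entropy is averaged.

```python
import math

BEGIN, END = "<BEGIN>", "<END>"

def make_training_pair(label):
    decoder_input = [BEGIN] + label  # the decoder is fed the correct label
    target = label + [END]           # and must output the label followed by END
    return decoder_input, target

def cross_entropy(predicted_dists, target):
    """Mean negative log-probability assigned to the correct token at each step."""
    losses = [-math.log(dist[tok]) for dist, tok in zip(predicted_dists, target)]
    return sum(losses) / len(losses)

label = ["hello", "world"]
decoder_input, target = make_training_pair(label)

# Hypothetical decoder outputs: one distribution over the vocabulary per step.
dists = [
    {"hello": 0.5, "world": 0.25, END: 0.25},
    {"hello": 0.25, "world": 0.5, END: 0.25},
    {"hello": 0.25, "world": 0.25, END: 0.5},
]
loss = cross_entropy(dists, target)
```

Each step is treated as its own classification over the vocabulary, which is why the loss looks the same as in an ordinary multi-class problem.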

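During training the decoder's input is the correct label, but at test time no label is available. Once the decoder receives a wrong token as input, errors can keep propagating through the following steps.
One remedy is to deliberately feed some incorrect tokens to the decoder during training (a technique known as scheduled sampling).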
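One way to expose the decoder to its own mistakes is to mix model predictions into its training inputs. The sketch below is a simplification: it assumes the predictions were computed beforehand, whereas scheduled sampling proper generates them on the fly. The function name and the `p_model` parameter are hypothetical.

```python
import random

def mix_decoder_inputs(gold_inputs, model_predictions, p_model, rng):
    """With probability p_model, feed the model's own previous prediction
    instead of the correct token. The first input (BEGIN) is always kept."""
    mixed = [gold_inputs[0]]
    for t in range(1, len(gold_inputs)):
        use_model = rng.random() < p_model
        # model_predictions[t - 1] is what the model predicted for position t
        mixed.append(model_predictions[t - 1] if use_model else gold_inputs[t])
    return mixed

gold_inputs = ["<BEGIN>", "machine", "learning"]
model_predictions = ["machine", "burning", "<END>"]  # second prediction is wrong
rng = random.Random(0)

always_gold = mix_decoder_inputs(gold_inputs, model_predictions, 0.0, rng)
always_model = mix_decoder_inputs(gold_inputs, model_predictions, 1.0, rng)
```

With `p_model` between 0 and 1 the decoder sees a blend of correct and incorrect tokens, so it has a chance to learn how to recover from an earlier error.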


Copy Mechanism

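In some tasks the model needs to copy parts of the input sequence rather than generate them from scratch.
- For example, person names and other special names in a dialogue
- In summarization, content copied from the article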
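The notes do not say how copying is implemented. As one possible illustration, the sketch below mixes a generation distribution with the attention weights over the source tokens, in the style of pointer-generator networks. The function name, the `p_gen` parameter, and all numbers are assumptions made for this example.

```python
def copy_augmented_distribution(vocab_dist, source_tokens, attention, p_gen):
    """Blend generating from the vocabulary with copying from the source."""
    final = {tok: p_gen * p for tok, p in vocab_dist.items()}
    for tok, a in zip(source_tokens, attention):
        # Attention mass on a source token becomes probability of copying it.
        final[tok] = final.get(tok, 0.0) + (1.0 - p_gen) * a
    return final

vocab_dist = {"hello": 0.6, "I": 0.3, "am": 0.1}  # the name is out of vocabulary
source_tokens = ["I", "am", "Ayumi"]
attention = [0.1, 0.1, 0.8]                       # decoder attends to the name

final = copy_augmented_distribution(vocab_dist, source_tokens, attention, p_gen=0.5)
best = max(final, key=final.get)
```

Here the name never appears in the vocabulary, yet it ends up as the most likely output because the attention over the source points at it.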


Guided Attention

For example, in speech synthesis the speech has to be synthesized in order from left to right.


BLEU score

Copyright： Copyright is owned by the author. For commercial reprints, please contact the author for authorization. For non-commercial reprints, please indicate the source.