Attention Mechanism

2023-04-22


Why attention is needed

How can a model take the surrounding context into account?

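One approach is to use a fully connected (FC) network that takes a window of neighboring vectors as input, so that the whole sequence can be considered.

Problem: what if the window has to cover the whole sequence?

Sequences differ in length, so the window would have to be as long as the longest sequence.
A very long window, however, gives the FC layer a large number of parameters, which increases computation and memory use and makes overfitting more likely.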
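To make the parameter growth concrete, here is a rough count, assuming the window's vectors are simply concatenated and fed to a single FC layer. The function name and the sizes (256-dimensional inputs, 1024 hidden units) are hypothetical and chosen only for illustration.

```python
def fc_weight_count(window: int, input_dim: int, hidden_dim: int) -> int:
    """Weights of one FC layer whose input is `window` concatenated vectors."""
    return window * input_dim * hidden_dim


# Hypothetical sizes, chosen only to show the trend.
for window in (5, 50, 500):
    count = fc_weight_count(window, input_dim=256, hidden_dim=1024)
    print(f"window={window:>3}  weights={count:,}")
```

The count grows linearly with the window length, so a window sized for the longest sequence pays that cost on every input, including short ones.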

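Self-attention handles this problem well.

Introducing self-attention

How it works:
Self-attention considers the entire input sequence: every output vector is produced after looking at all of the input vectors.

Multi-layer structure:
Self-attention can be applied more than once, forming a multi-layer structure.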

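How self-attention works

Overall, each output is computed from the values of the whole input sequence, both before and after the current position.

How $b^1$ is produced

First, find the vectors in the sequence that are relevant to $a^1$.

Compute an attention score $\alpha$ that expresses how relevant two vectors are to each other.

Two ways to compute $\alpha$ are dot-product and additive.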
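The two scoring functions can be sketched as follows. This is a minimal illustration, not a reference implementation: it assumes inputs are row vectors, that `W_q` and `W_k` play the roles of $w^q$ and $w^k$, and that the additive score uses an extra weight vector, here given the hypothetical name `w_out`. All sizes and random values are made up.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_k = 4, 3                      # hypothetical sizes
W_q = rng.normal(size=(d_in, d_k))    # stands in for w^q
W_k = rng.normal(size=(d_in, d_k))    # stands in for w^k
w_out = rng.normal(size=d_k)          # extra weights, used by the additive score only


def dot_product_score(a_i, a_j):
    """alpha = q . k, with q taken from a_i and k taken from a_j."""
    q = a_i @ W_q
    k = a_j @ W_k
    return float(q @ k)


def additive_score(a_i, a_j):
    """alpha = w_out . tanh(q + k)."""
    q = a_i @ W_q
    k = a_j @ W_k
    return float(w_out @ np.tanh(q + k))


a1, a2 = rng.normal(size=d_in), rng.normal(size=d_in)
print(dot_product_score(a1, a2), additive_score(a1, a2))
```

Dot-product scoring needs no parameters beyond the query and key projections, which is one reason it is the more common choice.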

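How the attention scores are computed:

We are measuring how relevant the other vectors are to $a^1$, so $a^1$ provides the query and the other vectors provide the keys.

$a^1$ also needs a score for its relevance to itself.

The scores are then passed through a softmax function to give $\alpha'$ (a different activation function can be used instead).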
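As a small illustration of this normalization step, the sketch below applies softmax to a handful of made-up raw scores. The input values are arbitrary, and subtracting the maximum is a standard numerical-stability trick that does not change the result.

```python
import math


def softmax(scores):
    """Turn raw attention scores into positive weights that sum to 1."""
    shift = max(scores)                       # subtract the max for numerical stability
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


alpha = [2.0, 1.0, 0.1]                       # made-up raw scores
alpha_prime = softmax(alpha)
print(alpha_prime)
```

Larger raw scores receive larger weights, and the weights can be read as a distribution over the sequence positions.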

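The resulting $\alpha'$ values express how relevant each vector is to $a^1$; they are then used to extract the important information from the sequence.

Introduce a value vector $v^i$ for each input, multiply each $v^i$ by its attention score $\alpha'$, and sum the results to obtain $b^1$.

The key idea: the more relevant a vector is to $a^1$, the more it influences $b^1$. A vector with a larger attention score contributes more of its information.
All the other $b$ vectors are computed in the same way.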
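Putting the steps for $b^1$ together, a minimal NumPy sketch might look like this. It assumes dot-product scores and softmax normalization, treats each row of `A` as one input vector, and uses `W_q`, `W_k`, `W_v` for the three projections; all sizes and random values are hypothetical.

```python
import numpy as np


def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()


rng = np.random.default_rng(0)
n, d_in, d_k = 4, 5, 3                       # hypothetical sizes
A = rng.normal(size=(n, d_in))               # rows are a^1 ... a^4
W_q, W_k, W_v = (rng.normal(size=(d_in, d_k)) for _ in range(3))

q1 = A[0] @ W_q                              # query from a^1
K = A @ W_k                                  # one key per input, including a^1 itself
V = A @ W_v                                  # one value per input

alpha = K @ q1                               # raw scores of a^1 against every input
alpha_prime = softmax(alpha)                 # normalized scores
b1 = alpha_prime @ V                         # weighted sum of the value vectors
print(b1)
```

Because the weights are positive and sum to 1, `b1` is a weighted average of the value vectors, dominated by the inputs with the highest scores.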

All of the $b$ vectors are computed in parallel.

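Self-attention as matrix multiplication

Obtain $q^i$, $k^i$, $v^i$ from the inputs using $w^q$, $w^k$, $w^v$.

Compute the attention scores $\alpha$:
take the dot product of each query $q^1, q^2, \ldots$ with every key $k$.

Apply an activation (the softmax function) to the attention scores.

Compute $b^1$ from $\alpha'$ and $v$.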
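Written out for the first output, the steps above amount to the following, assuming dot-product scores and softmax normalization. The double subscript $\alpha_{1,i}$ (the score between $a^1$ and $a^i$) is notation introduced here for clarity.

```latex
\begin{aligned}
q^i &= w^q a^i, \qquad k^i = w^k a^i, \qquad v^i = w^v a^i \\
\alpha_{1,i} &= q^1 \cdot k^i \\
\alpha'_{1,i} &= \frac{\exp(\alpha_{1,i})}{\sum_j \exp(\alpha_{1,j})} \\
b^1 &= \sum_i \alpha'_{1,i} \, v^i
\end{aligned}
```

The same four lines, with the index 1 replaced by any other position, give the remaining outputs.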

Summary of self-attention

The parameters to be learned are $w^q$, $w^k$, $w^v$.
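The whole layer can be sketched in a few matrix products. This is a minimal illustration under stated assumptions: rows of `A` are the input vectors, `W_q`, `W_k`, `W_v` stand for the three learned matrices, scores are plain dot products, and softmax is applied along each row. The Transformer paper additionally divides the scores by the square root of the key dimension, which is left out here because the notes do not mention it.

```python
import numpy as np


def softmax_rows(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)


def self_attention(A, W_q, W_k, W_v):
    Q = A @ W_q                       # row i is q^i
    K = A @ W_k                       # row i is k^i
    V = A @ W_v                       # row i is v^i
    scores = Q @ K.T                  # scores[i, j] = q^i . k^j
    weights = softmax_rows(scores)    # each row sums to 1
    return weights @ V                # row i is b^i


rng = np.random.default_rng(1)
A = rng.normal(size=(4, 5))           # 4 input vectors of size 5 (hypothetical)
W_q, W_k, W_v = (rng.normal(size=(5, 3)) for _ in range(3))
B = self_attention(A, W_q, W_k, W_v)
print(B.shape)
```

Nothing in `self_attention` other than the three weight matrices is trainable; the scores and weights are recomputed from the input every time.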

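Multi-head Self-attention

Using 2 heads as an example:

Each input vector produces two sets of $Q$, $K$, $V$.

This gives two groups: Q1, K1, V1 and Q2, K2, V2.
When computing $b$, Q1, K1 and V1 are used together, and Q2, K2 and V2 are used together, so each input vector yields two $b$ vectors.
The two $b$ vectors are then merged into one.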
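A two-head version can be sketched as below. The notes only say that the two $b$ vectors are merged; this sketch assumes the common choice of concatenating them and multiplying by an output matrix, here given the hypothetical name `W_o`. Function names and sizes are likewise illustrative.

```python
import numpy as np


def softmax_rows(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)


def attention_head(A, W_q, W_k, W_v):
    """One head: its own Q, K, V are only ever combined with each other."""
    Q, K, V = A @ W_q, A @ W_k, A @ W_v
    return softmax_rows(Q @ K.T) @ V


def two_head_attention(A, head_1, head_2, W_o):
    b_1 = attention_head(A, *head_1)           # uses Q1, K1, V1
    b_2 = attention_head(A, *head_2)           # uses Q2, K2, V2
    # Assumed merge step: concatenate the two b vectors, then project.
    return np.concatenate([b_1, b_2], axis=-1) @ W_o


rng = np.random.default_rng(0)
n, d_in, d_head = 4, 6, 3                      # hypothetical sizes
A = rng.normal(size=(n, d_in))
head_1 = tuple(rng.normal(size=(d_in, d_head)) for _ in range(3))
head_2 = tuple(rng.normal(size=(d_in, d_head)) for _ in range(3))
W_o = rng.normal(size=(2 * d_head, d_in))
B = two_head_attention(A, head_1, head_2, W_o)
print(B.shape)
```

Keeping the heads separate until the merge lets each head learn a different notion of relevance.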


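A limitation of self-attention

Self-attention contains no position information, so the order of the inputs is not taken into account.

Solution: Positional Encoding

Each position $i$ is given its own unique positional vector $e^i$, which is added to the input $a^i$.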
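One well-known hand-crafted choice for the positional vectors is the sinusoidal encoding from the original Transformer paper; the notes do not say which encoding is meant, so treat the sketch below as one option among several (learned position vectors are another). The function name is hypothetical and `d_model` is assumed to be even.

```python
import numpy as np


def sinusoidal_positions(n_positions, d_model):
    """Row i is the positional vector e^i (d_model must be even)."""
    positions = np.arange(n_positions)[:, None]          # shape (n, 1)
    dims = np.arange(0, d_model, 2)[None, :]             # even indices 0, 2, 4, ...
    angles = positions / np.power(10000.0, dims / d_model)
    E = np.zeros((n_positions, d_model))
    E[:, 0::2] = np.sin(angles)                          # even dimensions
    E[:, 1::2] = np.cos(angles)                          # odd dimensions
    return E


rng = np.random.default_rng(0)
A = rng.normal(size=(6, 8))                              # 6 made-up input vectors
E = sinusoidal_positions(n_positions=6, d_model=8)
X = A + E                                                # position-aware inputs
print(X.shape)
```

Since every row of `E` is different, two identical input vectors at different positions no longer look the same to the attention layer.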


[https://arxiv.org/abs/2003.09229]

Truncated Self-attention

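In speech recognition, self-attention can require a large amount of computation and memory, because the input sequences are very long and the attention matrix grows with the square of the sequence length.

Truncated self-attention therefore narrows the attention span: each vector attends only to a range of vectors before and after it, rather than to the whole sequence.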
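The restricted span can be illustrated with a band mask. This sketch only shows which positions are attended to: it still computes the full score matrix and then masks it, whereas an implementation that actually saves computation would skip the out-of-window entries. The function name and the `window` parameter (number of neighbors on each side) are assumptions made for this example.

```python
import numpy as np


def truncated_self_attention(Q, K, V, window):
    """Each position attends only to positions within `window` steps of itself."""
    n = Q.shape[0]
    scores = Q @ K.T
    idx = np.arange(n)
    outside = np.abs(idx[:, None] - idx[None, :]) > window
    scores = np.where(outside, -np.inf, scores)          # masked scores get zero weight
    e = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = e / e.sum(axis=-1, keepdims=True)
    return weights @ V, weights


rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(6, 4)) for _ in range(3))    # made-up projections
out, weights = truncated_self_attention(Q, K, V, window=1)
print(np.round(weights, 2))
```

With `window=1`, the printed weight matrix is non-zero only on the main diagonal and its two neighbors.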


[https://arxiv.org/abs/1910.12977]

Differences between self-attention and CNN

Self-attention considers the entire image.
For example:
each pixel produces a query, and all the other pixels produce keys.

A CNN only considers the information within a limited neighborhood around each pixel.

A CNN is a simplified version of self-attention.
CNN: self-attention that can only attend within a receptive field.
Self-attention: a CNN with a learnable receptive field.

On the Relationship between Self-Attention and Convolutional Layers


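Differences between self-attention and RNN

A unidirectional RNN passes information only from left to right, and a bidirectional RNN considers what comes before and after, whereas self-attention considers every position at once.
An RNN has difficulty remembering information from far back in the sequence.
An RNN cannot be computed in parallel across positions, whereas self-attention can.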
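The sequential dependency is easy to see in code. The sketch below is a plain unidirectional RNN with a tanh activation, chosen only for illustration; the function name, weight names and sizes are hypothetical. Each step needs the hidden state from the previous step, so the loop cannot be run for all positions at once, whereas self-attention scores for all pairs of positions come from a single matrix product.

```python
import numpy as np


def rnn_forward(X, W_x, W_h):
    """Unidirectional RNN: output t depends only on inputs 0..t."""
    h = np.zeros(W_h.shape[0])
    outputs = []
    for x_t in X:                              # step t must wait for step t-1
        h = np.tanh(x_t @ W_x + h @ W_h)
        outputs.append(h)
    return np.stack(outputs)


rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))                    # 5 made-up input vectors
W_x = 0.1 * rng.normal(size=(3, 4))
W_h = 0.1 * rng.normal(size=(4, 4))
H = rnn_forward(X, W_x, W_h)

X_changed = X.copy()
X_changed[-1] += 1.0                           # modify only the last input
H_changed = rnn_forward(X_changed, W_x, W_h)
print(np.allclose(H[:-1], H_changed[:-1]))     # earlier outputs never see later inputs
```

Changing the last input leaves every earlier output untouched, which is the left-to-right restriction described above.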

Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention

Copyright： Copyright is owned by the author. For commercial reprints, please contact the author for authorization. For non-commercial reprints, please indicate the source.