Multi-Scale Attention for Seq2Seq Learning


Title

《MUSE: PARALLEL MULTI-SCALE ATTENTION FOR SEQUENCE TO SEQUENCE LEARNING》

Abstract

Q:

The attention mechanism alone suffers from dispersed weights and is not well suited to long sequences:

1. The attention in deep layers tends to over-concentrate on a single token.

2. Local information is used insufficiently.

3. Long sequences are difficult to represent.

S:

parallel multi-scale representation

condition:

More importantly, we find that although conceptually simple, its success in practice requires intricate considerations, and the multi-scale attention must build on unified semantic space.

1. Introduction

The Transformer is mainly adept at sequence tasks such as machine translation, text classification, and language modeling.

It is solely based on an attention mechanism that captures global dependencies between input tokens, dispensing with recurrence and convolutions entirely. The key idea of the self-attention mechanism is updating token representations based on a weighted sum of all input representations.


However, the Transformer has drawbacks. As shown in Figure 1, its performance degrades as the sequence grows longer, because the attention weights become over-concentrated and dispersed, so only a small fraction of tokens is well represented by attention. This can cause insufficient representation of the information and make it hard for the model to understand the source sequence.

(Figure 1: Transformer performance vs. sequence length)

Prior work has tried restricting attention to only part of a long sequence, but because modeling long-range dependencies this way is too costly, such methods have not shown effectiveness for sequence-to-sequence learning. To build a module with an inductive bias for both local and global context modeling, the authors combine self-attention with convolution and propose a multi-scale attention mechanism called MUSE: attention captures long-range dependencies, convolution compensates for the lack of local information, and the design is extensible.

2. MUSE: Parallel Multi-Scale Attention

MUSE uses an encoder-decoder framework. The encoder takes word embeddings (x1, x2, …, xn) as input, and MUSE transforms X into a representation z; given z, the decoder generates the output text sequence y. The encoder is a stack of N MUSE modules. The decoder is similar to the encoder, except that it not only captures features from the text representation but also applies an additional context attention over the output of the encoder stack.

MUSE has three main parts: self-attention (global features), depth-wise separable convolution (local features), and a position-wise feed-forward network (token-level features).

In MUSE-simple, the representation at layer i is computed from layer i-1 by applying these three sub-modules in parallel and summing their outputs; see the sketch below.
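
To make the parallel fusion concrete, here is a minimal PyTorch sketch of one MUSE-simple layer. The module names, the pre-layer-norm placement, and hyper-parameters such as n_heads, kernel_size, and d_ff are my own illustrative assumptions, not the authors' exact implementation.

```python
import torch
import torch.nn as nn

class MuseSimpleLayer(nn.Module):
    """One MUSE-simple layer: self-attention (global), depth-wise separable
    convolution (local), and a point-wise feed-forward network (token-level)
    are applied to the previous layer's output in parallel and summed."""
    def __init__(self, d_model, n_heads=8, kernel_size=3, d_ff=2048):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        # depth-wise separable convolution = depth-wise conv + point-wise conv
        self.dw_conv = nn.Conv1d(d_model, d_model, kernel_size,
                                 padding=kernel_size // 2, groups=d_model)
        self.pw_conv = nn.Conv1d(d_model, d_model, 1)
        self.ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(),
                                 nn.Linear(d_ff, d_model))

    def forward(self, x):                      # x: (batch, seq_len, d_model)
        h = self.norm(x)
        attn_out, _ = self.attn(h, h, h)       # global dependencies
        conv_out = self.pw_conv(               # local features
            self.dw_conv(h.transpose(1, 2))).transpose(1, 2)
        ffn_out = self.ffn(h)                  # token-level features
        return x + attn_out + conv_out + ffn_out   # parallel sum + residual
```

An encoder could then stack N such layers, e.g. `nn.Sequential(*[MuseSimpleLayer(512) for _ in range(6)])`, matching the stack of N MUSE modules described above.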

2.1 Attention Mechanism

Given the input sequence X, it is projected into queries Q, keys K, and values V, and self-attention then produces the output.

Self-attention uses scaled dot-product attention: Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V.
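
A minimal sketch of this scaled dot-product attention; the optional mask argument is an extra I added for decoder-side use, not something stated here.

```python
import math
import torch

def scaled_dot_product_attention(q, k, v, mask=None):
    """softmax(Q K^T / sqrt(d_k)) V: each output position is a weighted
    sum of the value vectors, with weights given by query-key similarity."""
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / math.sqrt(d_k)   # (..., len_q, len_k)
    if mask is not None:
        scores = scores.masked_fill(mask == 0, float("-inf"))
    weights = torch.softmax(scores, dim=-1)             # attention distribution
    return weights @ v                                   # weighted sum of values
```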

2.2 Convolution

The depth-wise separable convolution consists of two transformations: a point-wise projecting transformation and a contextual transformation.

Each convolution sub-module contains cells with different kernel sizes, which capture features at different ranges.

Learned weights select among the different convolution cells.

Shared projection

The shared projection projects the input features into the same hidden space, so that convolution and self-attention operate in a unified semantic space.
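
A minimal sketch of this convolution sub-module under my own assumptions: the kernel sizes, the softmax gate over cells, and a single shared linear projection are illustrative choices, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class MultiScaleConv(nn.Module):
    """Several depth-wise convolution cells with different kernel sizes read
    from one shared point-wise projection, and a learned softmax gate
    weights (selects among) their outputs."""
    def __init__(self, d_model, kernel_sizes=(1, 3, 5)):
        super().__init__()
        self.shared_proj = nn.Linear(d_model, d_model)   # project into one shared hidden space
        self.cells = nn.ModuleList([
            nn.Conv1d(d_model, d_model, k, padding=k // 2, groups=d_model)
            for k in kernel_sizes
        ])
        self.gate = nn.Parameter(torch.zeros(len(kernel_sizes)))

    def forward(self, x):                          # x: (batch, seq_len, d_model)
        h = self.shared_proj(x).transpose(1, 2)    # (batch, d_model, seq_len)
        weights = torch.softmax(self.gate, dim=0)  # weights over convolution cells
        out = sum(w * cell(h) for w, cell in zip(weights, self.cells))
        return out.transpose(1, 2)                 # back to (batch, seq_len, d_model)
```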

2.3 Point-wise feed-forward network

To learn token-level representations, MUSE combines the position-wise feed-forward network with self-attention. Since the same linear transformations are applied at every position, the position-wise feed-forward network can be used to extract token-level features.
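
A minimal sketch of the position-wise feed-forward sub-module; the hidden size d_ff and the ReLU activation are assumptions based on the standard Transformer design.

```python
import torch.nn as nn

class PositionWiseFFN(nn.Module):
    """Two linear transformations applied identically at every position,
    so the sub-module models token-level (position-independent) features."""
    def __init__(self, d_model, d_ff=2048):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.ReLU(),
            nn.Linear(d_ff, d_model),
        )

    def forward(self, x):            # x: (batch, seq_len, d_model)
        return self.net(x)           # same weights at each position
```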


Author: Weiruohe