Multi-Scale Attention for Seq2Seq Learning


Title

《MUSE: PARALLEL MULTI-SCALE ATTENTION FOR SEQUENCE TO SEQUENCE LEARNING》

Abstract

Q:

The attention mechanism alone suffers from dispersed weights and is not well suited to long sequences:

1. The attention in deep layers tends to over-concentrate on a single token.

2. Local information is used insufficiently.

3. Long sequences are difficult to represent.

S:

parallel multi-scale representation

condition:

More importantly, we find that although conceptually simple, its success in practice requires intricate considerations, and the multi-scale attention must build on unified semantic space.

1. Introduction

The Transformer is mainly adept at sequence tasks such as machine translation, text classification, and language modeling.

It is solely based on an attention mechanism that captures global dependencies between input tokens, dispensing with recurrence and convolutions entirely. The key idea of the self-attention mechanism is updating token representations based on a weighted sum of all input representations.


However, the Transformer has drawbacks. As shown in Figure 1, its performance degrades as the sequence grows longer, because the attention weights become over-concentrated and dispersed, so only a small fraction of tokens is well represented by attention. This can cause insufficient representation of the information and make it hard for the model to understand the source sequence.

(Figure 1: Transformer performance vs. sequence length)

Prior work has tried restricting attention to only part of a long sequence, but because modeling long-range dependencies this way is too costly, such methods have not shown effectiveness for sequence-to-sequence learning. To build a module with an inductive bias for both local and global context modeling, the authors combine self-attention with convolution and propose a multi-scale attention mechanism called MUSE: attention captures long-range dependencies, convolution compensates for the lack of local information, and the design is extensible.

2. MUSE: Parallel Multi-Scale Attention

MUSE uses an encoder-decoder framework. The encoder takes word embeddings (x1, x2, …, xn) as input, and MUSE transforms X into a representation z; given z, the decoder generates the output text sequence y. The encoder is a stack of N MUSE modules. The decoder is similar to the encoder, except that it not only captures features from the text representation but also applies an additional context attention over the output of the encoder stack.

MUSE has three main parts: self-attention (global features), depth-wise separable convolution (local features), and a position-wise feed-forward network (token-level features).

In MUSE-simple, the representation at layer i is computed from layer i-1 by applying these three sub-modules in parallel and summing their outputs; see the sketch below.
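
To make the parallel fusion concrete, here is a minimal PyTorch sketch of one MUSE-simple layer. The module names, the pre-layer-norm placement, and hyper-parameters such as n_heads, kernel_size, and d_ff are my own illustrative assumptions, not the authors' exact implementation.

```python
import torch
import torch.nn as nn

class MuseSimpleLayer(nn.Module):
    """One MUSE-simple layer: self-attention (global), depth-wise separable
    convolution (local), and a point-wise feed-forward network (token-level)
    are applied to the previous layer's output in parallel and summed."""
    def __init__(self, d_model, n_heads=8, kernel_size=3, d_ff=2048):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        # depth-wise separable convolution = depth-wise conv + point-wise conv
        self.dw_conv = nn.Conv1d(d_model, d_model, kernel_size,
                                 padding=kernel_size // 2, groups=d_model)
        self.pw_conv = nn.Conv1d(d_model, d_model, 1)
        self.ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(),
                                 nn.Linear(d_ff, d_model))

    def forward(self, x):                      # x: (batch, seq_len, d_model)
        h = self.norm(x)
        attn_out, _ = self.attn(h, h, h)       # global dependencies
        conv_out = self.pw_conv(               # local features
            self.dw_conv(h.transpose(1, 2))).transpose(1, 2)
        ffn_out = self.ffn(h)                  # token-level features
        return x + attn_out + conv_out + ffn_out   # parallel sum + residual
```

An encoder could then stack N such layers, e.g. `nn.Sequential(*[MuseSimpleLayer(512) for _ in range(6)])`, matching the stack of N MUSE modules described above.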

2.1 Attention Mechanism

Given the input sequence X, it is projected into queries Q, keys K, and values V, and self-attention then produces the output.

Self-attention uses scaled dot-product attention: Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V.
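
A minimal sketch of this scaled dot-product attention; the optional mask argument is an extra I added for decoder-side use, not something stated here.

```python
import math
import torch

def scaled_dot_product_attention(q, k, v, mask=None):
    """softmax(Q K^T / sqrt(d_k)) V: each output position is a weighted
    sum of the value vectors, with weights given by query-key similarity."""
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / math.sqrt(d_k)   # (..., len_q, len_k)
    if mask is not None:
        scores = scores.masked_fill(mask == 0, float("-inf"))
    weights = torch.softmax(scores, dim=-1)             # attention distribution
    return weights @ v                                   # weighted sum of values
```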

2.2 Convolution

The depth-wise separable convolution consists of two transformations: a point-wise projecting transformation and a contextual transformation.

Each convolution sub-module contains cells with different kernel sizes, which capture features at different ranges.

Learned weights select among the different convolution cells.

Shared projection

The shared projection projects the input features into the same hidden space, so that convolution and self-attention operate in a unified semantic space.
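
A minimal sketch of this convolution sub-module under my own assumptions: the kernel sizes, the softmax gate over cells, and a single shared linear projection are illustrative choices, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class MultiScaleConv(nn.Module):
    """Several depth-wise convolution cells with different kernel sizes read
    from one shared point-wise projection, and a learned softmax gate
    weights (selects among) their outputs."""
    def __init__(self, d_model, kernel_sizes=(1, 3, 5)):
        super().__init__()
        self.shared_proj = nn.Linear(d_model, d_model)   # project into one shared hidden space
        self.cells = nn.ModuleList([
            nn.Conv1d(d_model, d_model, k, padding=k // 2, groups=d_model)
            for k in kernel_sizes
        ])
        self.gate = nn.Parameter(torch.zeros(len(kernel_sizes)))

    def forward(self, x):                          # x: (batch, seq_len, d_model)
        h = self.shared_proj(x).transpose(1, 2)    # (batch, d_model, seq_len)
        weights = torch.softmax(self.gate, dim=0)  # weights over convolution cells
        out = sum(w * cell(h) for w, cell in zip(weights, self.cells))
        return out.transpose(1, 2)                 # back to (batch, seq_len, d_model)
```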

2.3 Point-wise feed-forward network

To learn token-level representations, MUSE combines the position-wise feed-forward network with self-attention. Since the same linear transformations are applied at every position, the position-wise feed-forward network can be used to extract token-level features.
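
A minimal sketch of the position-wise feed-forward sub-module; the hidden size d_ff and the ReLU activation are assumptions based on the standard Transformer design.

```python
import torch.nn as nn

class PositionWiseFFN(nn.Module):
    """Two linear transformations applied identically at every position,
    so the sub-module models token-level (position-independent) features."""
    def __init__(self, d_model, d_ff=2048):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.ReLU(),
            nn.Linear(d_ff, d_model),
        )

    def forward(self, x):            # x: (batch, seq_len, d_model)
        return self.net(x)           # same weights at each position
```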


Author: Weiruohe