Title
《Convolutional Sequence to Sequence Learning》
Abstract
We introduce an architecture based entirely on convolutional neural networks. Compared to recurrent models, computations over all elements can be fully parallelized during training to better exploit the GPU hardware, and optimization is easier since the number of non-linearities is fixed and independent of the input length. Our use of gated linear units eases gradient propagation and we equip each decoder layer with a separate attention module.
Compared with RNNs, CNN computations parallelize better, so training can make better use of the GPU and optimization is easier, because the number of non-linearities is fixed and independent of the input length. Gated linear units are used to ease gradient propagation.
Compared with RNNs, CNNs create representations over fixed-size contexts, but the effective context size can easily be enlarged by stacking layers. CNNs do not depend on the computations of the previous time step and therefore allow parallelization over every element in a sequence.
[^RNNs maintain a hidden state of the entire past, which prevents parallel computation]:
Multi-layer CNNs create hierarchical representations over the input:
nearby elements interact at lower layers, distant ones at higher layers.
Compared with RNNs, this structure provides a shorter path to capture long-range dependencies.
We can obtain a feature representation capturing relationships within a window of n words by applying only O(n/k) convolutional operations for kernels of width k, compared with O(n) operations for an RNN.
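As an illustrative check (my own sketch, not from the paper): with stride-1 convolutions the receptive field of L stacked layers of kernel width k is L(k-1)+1, so covering a window of n words takes roughly (n-1)/(k-1) layers, i.e. on the order of n/k, versus n recurrent steps.

```python
import math

def layers_to_cover(n, k):
    """Stacked stride-1 convolutions with kernel width k grow the
    receptive field by (k - 1) per layer, so covering n positions
    needs about (n - 1) / (k - 1) layers -- on the order of n / k."""
    return math.ceil((n - 1) / (k - 1))

print(layers_to_cover(n=25, k=5))  # 6 conv layers vs. 25 recurrent steps
```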
Q:
Bradbury et al. (2016) introduce recurrent pooling between a succession of convolutional layers, but it does not improve results.
Gated CNNs were restricted to relatively small datasets and had to be used together with a count-based model.
*Partially convolutional* models perform well, but the decoder is still recurrent.
S:
An architecture for sequence-to-sequence learning that is entirely convolutional.
It includes gated linear units, residual connections, and an attention module in each decoder layer (which adds only negligible overhead).
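A minimal sketch of a gated linear unit over a 1-D convolution (my own PyTorch example; the channel sizes and padding are illustrative, not the paper's decoder, and the residual connection and attention are omitted): the convolution produces twice the channels, and one half gates the other through a sigmoid.

```python
import torch
import torch.nn as nn

class GLUConv(nn.Module):
    """1-D convolution followed by a gated linear unit:
    the conv outputs 2*channels, split into A and B,
    and the block returns A * sigmoid(B)."""
    def __init__(self, channels, kernel_size=3):
        super().__init__()
        self.conv = nn.Conv1d(channels, 2 * channels, kernel_size,
                              padding=kernel_size // 2)

    def forward(self, x):                # x: (batch, channels, length)
        a, b = self.conv(x).chunk(2, dim=1)
        return a * torch.sigmoid(b)

x = torch.randn(2, 64, 10)               # batch of 2, 64 channels, length 10
print(GLUConv(64)(x).shape)               # torch.Size([2, 64, 10])
```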
C:
Compared with RNN models, the CNN model is an order of magnitude faster on larger datasets and is better at discovering compositional structure, relying on gating and multi-step attention.

Understanding CNNs
Convolution formula:
[^The output of a system at a given moment is the joint result of multiple inputs acting together]:
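Since the figure that presumably showed the formula is missing, here is the standard discrete convolution for reference; the second line is its 2-D form applied to images, which the examples below illustrate:

$$(f * g)[n] = \sum_{m} f[m]\, g[n-m]$$

$$(I * K)[i, j] = \sum_{u}\sum_{v} K[u, v]\; I[i-u,\, j-v]$$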
[^To compute the second output element, simply slide the (blue) window one column to the right]:
[^Convolving with the kernel shown in the middle of the figure above reveals whether the input matrix goes from dark to light or from light to dark]:
[^Such a convolution kernel can be used as a horizontal or vertical edge detector]:
[^Giving the middle row larger weights increases robustness]:
All nine parameters of the kernel can also be learned by the network through backpropagation, and the kernel can additionally be flipped.
Each entry of the kernel acts on (i.e., is multiplied with) the corresponding source pixel, and the linear sum of these products is the final convolution output, the result we want, called the destination pixel.
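A minimal NumPy sketch of this computation (my own example; the kernel is a vertical edge detector of the kind described above), computing each destination pixel as the sum of element-wise products over the sliding window:

```python
import numpy as np

def conv2d_valid(image, kernel):
    """'Valid' 2-D cross-correlation: slide the kernel over the image,
    multiply element-wise, and sum to get each destination pixel."""
    n, f = image.shape[0], kernel.shape[0]
    out = np.zeros((n - f + 1, n - f + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + f, j:j + f] * kernel)
    return out

# 6x6 image that is bright on the left and dark on the right
image = np.array([[10, 10, 10, 0, 0, 0]] * 6, dtype=float)
# 3x3 vertical edge detector
kernel = np.array([[1, 0, -1],
                   [1, 0, -1],
                   [1, 0, -1]], dtype=float)
print(conv2d_valid(image, kernel))  # 4x4 output with a bright band at the edge
```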
Padding
When an input matrix of dimension n is convolved with a kernel of dimension f (f >= 1), the output matrix has dimension n-f+1, so the output shrinks and features at the edges are lost.
Padding is introduced: the input matrix is expanded on all four sides with zero elements.
Depending on whether padding is used, convolutions are classified as Valid or Same.
- Valid: output matrix dimension is n-f+1.
- Same: output matrix dimension is n+2p-f+1 (p is the padding width).
To make input = output we need p=(f-1)/2, which gives an integer (uniform padding) only when f is odd.
Strided convolution
With convolution stride s, the output matrix after each convolution has dimension ⌊(n+2p-f)/s⌋+1.
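A quick sanity check of these output-size formulas (my own helper, not from the notes), covering valid, same, and strided convolutions:

```python
def conv_output_size(n, f, p=0, s=1):
    """Output dimension of a convolution over an n x n input with an
    f x f kernel, padding p on each side, and stride s."""
    return (n + 2 * p - f) // s + 1

print(conv_output_size(6, 3))                   # valid: 6 - 3 + 1 = 4
print(conv_output_size(6, 3, p=(3 - 1) // 2))   # same: output stays 6
print(conv_output_size(7, 3, p=0, s=2))         # strided: floor((7 - 3) / 2) + 1 = 3
```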
Three-dimensional convolution
The number of channels in the output matrix grows with the number of kernels used.
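A small PyTorch shape check (my own example) showing that the number of output channels equals the number of kernels:

```python
import torch
import torch.nn as nn

# 3-channel RGB input, 8 different 3x3 kernels -> 8 output channels
conv = nn.Conv2d(in_channels=3, out_channels=8, kernel_size=3)
x = torch.randn(1, 3, 32, 32)    # (batch, channels, height, width)
print(conv(x).shape)              # torch.Size([1, 8, 30, 30]); 30 = 32 - 3 + 1
```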
Pooling layer
Max Pooling
If a feature is detected anywhere inside the pooling window, its large value is kept as the maximum; if not, the maximum stays small.
Average pooling
Features are extracted by taking the average of the values in each window. It generally works less well than max pooling.
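A minimal NumPy sketch of 2x2 max and average pooling with stride 2 (my own example):

```python
import numpy as np

def pool2d(x, size=2, mode="max"):
    """Non-overlapping size x size pooling with stride = size."""
    n = x.shape[0] // size
    out = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            window = x[i * size:(i + 1) * size, j * size:(j + 1) * size]
            out[i, j] = window.max() if mode == "max" else window.mean()
    return out

x = np.array([[1, 3, 2, 1],
              [4, 6, 5, 2],
              [7, 2, 1, 0],
              [1, 8, 3, 4]], dtype=float)
print(pool2d(x, mode="max"))   # [[6. 5.] [8. 4.]]
print(pool2d(x, mode="avg"))   # [[3.5 2.5] [4.5 2. ]]
```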
Pooling layers have no parameters, convolutional layers have relatively few, and fully connected layers have the most; as the layers go deeper, the activation sizes shrink.
Why use CNN?
Parameters are reduced through parameter sharing and sparse connections (which also helps prevent overfitting).
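A rough back-of-the-envelope comparison (my own example, with illustrative layer sizes) of a convolutional layer versus a fully connected layer on the same input, showing why parameter sharing matters:

```python
# 32x32x3 input mapped to a 28x28x6 output
conv_params = 6 * (5 * 5 * 3 + 1)                          # 6 kernels of 5x5x3 plus biases
fc_params = (32 * 32 * 3) * (28 * 28 * 6) + 28 * 28 * 6    # dense weights plus biases
print(conv_params)   # 456
print(fc_params)     # 14455392 -- roughly 14.5 million, orders of magnitude more
```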
Classic networks
LeNet-5
[^Early networks used only average pooling, and Sigmoid/tanh rather than ReLU]:
AlexNet
Residual networks (Residual block)
Networks with too many layers suffer from exploding or vanishing gradients. In a residual network, the activations from one layer can be fed forward quickly, via skip connections, to later layers deeper in the network.
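A minimal PyTorch sketch of a residual block (my own example, with illustrative layer sizes): the block's input is added back to its output through a skip connection before the final activation.

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Two 3x3 convolutions whose output is added to the block input
    (the skip connection) before the final ReLU."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1)
        self.relu = nn.ReLU()

    def forward(self, x):
        out = self.relu(self.conv1(x))
        out = self.conv2(out)
        return self.relu(out + x)    # skip connection: add the input back

x = torch.randn(1, 16, 8, 8)
print(ResidualBlock(16)(x).shape)    # torch.Size([1, 16, 8, 8])
```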
CNN
position embedding
They indicate the current position within the input or output sequence (in ConvS2S they are added to the word embeddings).
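A minimal sketch (my own example, with illustrative vocabulary and dimension sizes) of combining word embeddings with learned absolute position embeddings by element-wise addition:

```python
import torch
import torch.nn as nn

vocab_size, max_len, dim = 1000, 50, 32       # illustrative sizes
word_emb = nn.Embedding(vocab_size, dim)
pos_emb = nn.Embedding(max_len, dim)          # one learned vector per position

tokens = torch.randint(0, vocab_size, (2, 10))           # (batch, seq_len)
positions = torch.arange(tokens.size(1)).unsqueeze(0)    # positions 0..9
x = word_emb(tokens) + pos_emb(positions)                 # (2, 10, 32)
print(x.shape)
```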