title:《Understanding Back-Translation at Scale》
Q:
How to improve neural machine translation with monolingual data?
S:
augment the parallel training corpus with back-translations of target-language sentences.
C:
Back-translations generated by sampling, or by adding noise to beam-search outputs, work better than those from pure beam or greedy search.
Achieves a new state of the art of 35 BLEU on the WMT’14 English-German test set.
Abstract
We find that in all but resource poor settings back-translations obtained via sampling or noised beam outputs are most effective. Our analysis shows that sampling or noisy synthetic data gives a much stronger training signal than data generated by beam or greedy search.
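As a concrete (and hedged) illustration of the decoding strategies being compared, the sketch below uses a Hugging Face MarianMT model as the intermediate target-to-source system. The model name, hyperparameters, and example sentence are assumptions for this sketch (the paper itself used fairseq); the noise function follows the scheme the paper adopts from Lample et al. (2018): word deletion, filler replacement, and local word swaps.

```python
# Sketch: generating back-translations with beam search vs. sampling,
# plus the noising used for "noised beam" outputs.
# Model name and all hyperparameters are illustrative assumptions.
import random
from transformers import MarianMTModel, MarianTokenizer

model_name = "Helsinki-NLP/opus-mt-de-en"  # intermediate target->source (De->En) model
tokenizer = MarianTokenizer.from_pretrained(model_name)
model = MarianMTModel.from_pretrained(model_name)

tgt = "Maschinelle Übersetzung ist nützlich."  # monolingual target-side sentence
inputs = tokenizer(tgt, return_tensors="pt")

# Beam search: high-probability but low-diversity synthetic sources.
beam_ids = model.generate(**inputs, num_beams=5, do_sample=False)

# Unrestricted sampling: draws from the full model distribution,
# yielding more diverse (and noisier) synthetic sources.
sample_ids = model.generate(**inputs, do_sample=True, top_k=0)

def add_noise(tokens, p_drop=0.1, p_blank=0.1, k=3, filler="<BLANK>"):
    """Noise for 'noised beam' outputs, after Lample et al. (2018):
    delete words, replace words with a filler token, locally shuffle."""
    tokens = [t for t in tokens if random.random() >= p_drop]
    tokens = [filler if random.random() < p_blank else t for t in tokens]
    # Random permutation in which no token moves more than k positions.
    keys = [i + random.uniform(0, k + 1) for i in range(len(tokens))]
    order = sorted(range(len(tokens)), key=keys.__getitem__)
    return [tokens[i] for i in order]

beam_src = tokenizer.decode(beam_ids[0], skip_special_tokens=True)
noised_beam_src = " ".join(add_noise(beam_src.split()))
sampled_src = tokenizer.decode(sample_ids[0], skip_special_tokens=True)
print(beam_src, noised_beam_src, sampled_src, sep="\n")
```

The intuition from the paper: beam and greedy search concentrate on the head of the model distribution, so the synthetic sources are too clean and predictable; sampling and noising expose the final model to a richer, harder training signal.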
Introduction
MT typically relies on statistics over large parallel datasets (paired sentences in both the source and target languages), but bitext is limited while monolingual data is abundant, so researchers have turned to exploiting monolingual data. It can be leveraged through language model fusion, back-translation, or dual learning.
Our focus is back-translation (BT), in a semi-supervised setup where both bilingual data and monolingual data in the target language are available.
Back-translation first trains an intermediate target-to-source system on the parallel data, then uses it to translate monolingual target-language text into the source language. The result is a synthetic parallel corpus whose source side is machine-translation output and whose target side is genuine human-written text. This synthetic data is added to the real bitext to train the final source-to-target system.
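A minimal sketch of this data-assembly step, assuming a generic translate() callable (e.g., one of the decoding strategies above); the function and variable names are hypothetical, not the paper's code:

```python
# Sketch of the back-translation data pipeline: pair synthetic sources
# with real targets, then mix them with the genuine bitext.
def build_training_pairs(bitext, mono_tgt, translate):
    pairs = list(bitext)                          # genuine (src, tgt) pairs
    for tgt_sentence in mono_tgt:                 # human-written target text
        synthetic_src = translate(tgt_sentence)   # back-translate tgt -> src
        pairs.append((synthetic_src, tgt_sentence))
    return pairs                                  # train the final src->tgt system on this
```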
It has been shown to help phrase-based translation, NMT, and unsupervised MT.