Title:
《Very Deep Transformers for Neural Machine Translation》
Q:
how to decrease the variance of the output layer in order to train deep Transformers for NMT?
S:
use an initialization technique named “ADMIN” (Adaptive Model Initialization) to remedy the variance problem and stabilize training
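(A minimal sketch of the ADMIN-style residual connection, assuming the post-LN formulation x_{i+1} = LayerNorm(x_i * omega_i + f_i(x_i)) where omega_i is fixed from a variance-profiling forward pass; class and method names here are illustrative, not the repo's actual API.)

```python
import torch
import torch.nn as nn

class AdminResidual(nn.Module):
    """Residual connection x_{i+1} = LayerNorm(x_i * omega_i + f_i(x_i)).

    omega_i is a per-dimension scale set after a profiling forward pass,
    so each sub-layer's contribution to output variance stays bounded
    as the network grows deeper.
    """
    def __init__(self, d_model: int):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        # omega starts at 1 (plain post-LN residual); profiling resets it.
        self.omega = nn.Parameter(torch.ones(d_model))

    def forward(self, x: torch.Tensor, sublayer_out: torch.Tensor) -> torch.Tensor:
        # Scale the skip branch by omega before the usual add-and-normalize.
        return self.norm(x * self.omega + sublayer_out)

    @torch.no_grad()
    def set_omega_from_profile(self, accumulated_variance: torch.Tensor) -> None:
        # Profiling phase: set omega to the square root of the variance
        # accumulated over earlier sub-layers, estimated on real batches.
        self.omega.copy_(accumulated_variance.clamp_min(1e-6).sqrt())
```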
C:
the deep models (up to a 60-layer encoder and 12-layer decoder) outperform their 6-layer baseline, with up to 2.5 BLEU improvement.
The code and trained models will be publicly available at: https://github.com/namisan/exdeep-nmt.