deep transformer-NMT


Title:

Very Deep Transformers for Neural Machine Translation

Q:

How can we decrease the variance of the output layer in order to train deep Transformers for NMT?

S:

Use an initialization technique called ADMIN (ADaptive Model INitialization) to remedy the variance problem and stabilize training.
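A minimal sketch of the idea, assuming the two-phase ADMIN scheme from Liu et al.: rewrite each Post-LN residual block as LayerNorm(x · ω + f(x)), run one profiling forward pass with ω = 1 to record each branch's output variance, then set each ω_i from the accumulated variance of the preceding sublayers. The names `AdminResidual` and `admin_init` are illustrative, not taken from the authors' repo:

```python
import torch
import torch.nn as nn

class AdminResidual(nn.Module):
    """Post-LN residual block with an ADMIN rescaling vector omega:
    output = LayerNorm(x * omega + f(x)).  Illustrative sketch only."""
    def __init__(self, sublayer: nn.Module, d_model: int):
        super().__init__()
        self.sublayer = sublayer                        # attention or FFN branch
        self.norm = nn.LayerNorm(d_model)
        self.omega = nn.Parameter(torch.ones(d_model))  # phase 1: omega = 1
        self.branch_var = 0.0                           # filled during profiling

    def forward(self, x):
        fx = self.sublayer(x)
        self.branch_var = fx.detach().var().item()      # record Var[f(x)]
        return self.norm(x * self.omega + fx)

def admin_init(blocks):
    """Phase 2: set omega_i from the accumulated output variance of all
    earlier residual branches, so the skip path of a deep layer is not
    drowned out by the branch output."""
    cum_var = 1.0                                       # assumed variance of the block input
    for blk in blocks:
        blk.omega.data.fill_(cum_var ** 0.5)
        cum_var += blk.branch_var

# Usage: one profiling batch with omega = 1, then re-initialize and train.
d_model = 512
blocks = nn.ModuleList([AdminResidual(nn.Linear(d_model, d_model), d_model)
                        for _ in range(12)])
x = torch.randn(8, 32, d_model)                         # dummy (batch, seq, dim) batch
for blk in blocks:                                      # profiling forward pass
    x = blk(x)
admin_init(blocks)                                      # now train as usual
```

After `admin_init`, ω is fixed (or trained jointly) and training proceeds as standard Post-LN; rescaling the skip connection this way keeps each residual branch's contribution balanced as depth grows, which is what stabilizes very deep stacks.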

C:

With ADMIN, very deep Transformers (up to 60 encoder layers and 12 decoder layers) outperform their 6-layer baseline, with up to a 2.5 BLEU improvement.

The code and trained models will be publicly available at: https://github.com/namisan/exdeep-nmt.

