Title
**BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding**
1.Abstract
We introduce a new language representation model called BERT, which stands for Bidirectional Encoder Representations from Transformers. Unlike recent language representation models (Peters et al., 2018a; Radford et al., 2018), BERT is designed to pretrain deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers. As a result, the pre-trained BERT model can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks, such as question answering and language inference, without substantial task-specific architecture modifications.
BERT stands for Bidirectional Encoder Representations from Transformers; it pre-trains deep bidirectional representations on unlabeled text by conditioning jointly on left and right context in all layers. The pre-trained BERT model is then fine-tuned with only one additional output layer, without substantial task-specific architecture modifications, yielding state-of-the-art models for a wide range of tasks (e.g. question answering and language inference).
2.Introduction
Pre-trained language models perform well on many tasks, such as:
A: sentence-level tasks (natural language inference and paraphrasing)
B: token-level tasks (NER, QA)
Two existing strategies for applying pre-trained representations to downstream tasks: feature-based (ELMo) and fine-tuning (OpenAI GPT).
Fine-tuning approaches such as OpenAI GPT use a left-to-right language model, so in the Transformer's self-attention layers every token can only attend to previous tokens; this is harmful for applications like QA that need context from both directions. BERT addresses this with a masked language model (MLM): some of the input tokens are randomly masked, and the objective is to predict the original masked tokens based only on their surrounding context.
3.BERT
The BERT framework consists of two steps: pre-training and fine-tuning:
- pre-training: the model is trained on unlabeled data over different pre-training tasks
- fine-tuning: the model is initialized with the pre-trained parameters and all parameters are fine-tuned on labeled data from the downstream task
Model architecture: a multi-layer bidirectional Transformer encoder.
Notation: number of layers L, hidden size H, number of self-attention heads A (BERT-BASE: L=12, H=768, A=12; BERT-LARGE: L=24, H=1024, A=16).
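As a quick reference, the sketch below records the two model sizes in a hypothetical `BertConfig` dataclass; the class name and field names are illustrative, not the authors' code.

```python
# A minimal sketch recording the two model sizes; BertConfig is an
# illustrative container, not the authors' implementation.
from dataclasses import dataclass

@dataclass
class BertConfig:
    num_layers: int   # L: number of Transformer blocks
    hidden_size: int  # H: hidden size
    num_heads: int    # A: number of self-attention heads

BERT_BASE = BertConfig(num_layers=12, hidden_size=768, num_heads=12)    # ~110M parameters
BERT_LARGE = BertConfig(num_layers=24, hidden_size=1024, num_heads=16)  # ~340M parameters
```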
Input/Output Representation
The input representation of each token is the sum of three embeddings: token embedding + segment embedding + position embedding. The segment embedding marks whether a token belongs to sentence A or sentence B; every sequence starts with [CLS], and the two sentences of a pair are separated by [SEP].
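A minimal sketch of this packing and embedding sum, assuming a toy vocabulary and freshly initialized embedding tables (the variable names are illustrative; the example sentence pair is the one used in the paper's figure):

```python
# Sketch: pack sentence A and B into one sequence and sum the three embeddings.
import torch
import torch.nn as nn

vocab = {"[PAD]": 0, "[CLS]": 1, "[SEP]": 2, "[MASK]": 3,
         "my": 4, "dog": 5, "is": 6, "cute": 7, "he": 8, "likes": 9, "play": 10, "##ing": 11}

sent_a = ["my", "dog", "is", "cute"]
sent_b = ["he", "likes", "play", "##ing"]
tokens = ["[CLS]"] + sent_a + ["[SEP]"] + sent_b + ["[SEP]"]
segment_ids = [0] * (len(sent_a) + 2) + [1] * (len(sent_b) + 1)  # 0 = sentence A, 1 = sentence B

H = 768
token_emb = nn.Embedding(len(vocab), H)
segment_emb = nn.Embedding(2, H)     # two segments: A and B
position_emb = nn.Embedding(512, H)  # learned positions up to the max sequence length

ids = torch.tensor([vocab[t] for t in tokens])
positions = torch.arange(len(tokens))
segments = torch.tensor(segment_ids)

# Input representation = token + segment + position embeddings, shape (seq_len, H)
input_repr = token_emb(ids) + segment_emb(segments) + position_emb(positions)
print(input_repr.shape)  # torch.Size([11, 768])
```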
3.1Pre-training BERT
Task 1: Masked LM
Randomly select some of the input tokens and mask them, then predict those tokens: the final hidden vectors corresponding to the mask tokens are fed into an output softmax over the vocabulary. One drawback is a mismatch between pre-training and fine-tuning, since the [MASK] token never appears during fine-tuning. The mitigation (see the sketch after this list):
- Randomly select 15% of the token positions in the training data for prediction
- For each selected token: 80% of the time replace it with [MASK], 10% of the time replace it with a random token, 10% of the time leave it unchanged
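A minimal sketch of this 15% / 80-10-10 procedure over a list of token ids; the special-token ids, the function name, and the -100 "ignore" label are assumptions for illustration, not the paper's code.

```python
# Sketch of the MLM masking rule described above (illustrative, not the authors' code).
import random

MASK_ID = 103             # assumed id of [MASK]
SPECIAL_IDS = {101, 102}  # assumed ids of [CLS] and [SEP]; never selected for masking

def mask_tokens(token_ids, vocab_size, select_prob=0.15):
    inputs = list(token_ids)
    labels = [-100] * len(inputs)  # -100 = position not selected (PyTorch ignore_index convention)
    for i, tok in enumerate(token_ids):
        if tok in SPECIAL_IDS or random.random() >= select_prob:
            continue
        labels[i] = tok            # the model must predict the original token at this position
        r = random.random()
        if r < 0.8:
            inputs[i] = MASK_ID                       # 80%: replace with [MASK]
        elif r < 0.9:
            inputs[i] = random.randrange(vocab_size)  # 10%: replace with a random token
        # else: 10% keep the token unchanged
    return inputs, labels
```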
Task 2: Next Sentence Prediction (NSP)
Many downstream tasks such as QA and NLI are based on understanding the relationship between two sentences, which is not directly captured by language modeling. To handle this, BERT pre-trains a binarized next sentence prediction task: when choosing sentences A and B for each pre-training example, 50% of the time B is the actual sentence that follows A (labeled IsNext), and 50% of the time B is a random sentence from the corpus (labeled NotNext). A sketch of this pair construction follows.
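A minimal sketch of building such pairs from a corpus represented as a list of documents (each a list of sentences); the function name is illustrative, and for simplicity the random sentence may occasionally come from the same document, whereas the paper draws it from a different one.

```python
# Sketch of constructing IsNext / NotNext sentence pairs (illustrative).
import random

def make_nsp_pairs(documents):
    pairs = []
    for doc in documents:
        for i in range(len(doc) - 1):
            sent_a = doc[i]
            if random.random() < 0.5:
                sent_b = doc[i + 1]                 # 50%: the actual next sentence
                label = "IsNext"
            else:
                other_doc = random.choice(documents)
                sent_b = random.choice(other_doc)   # 50%: a random sentence from the corpus
                label = "NotNext"
            pairs.append((sent_a, sent_b, label))
    return pairs
```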
3.2Fine-tuning BERT
Fine-tuning is straightforward, because the Transformer's self-attention mechanism allows BERT to model many downstream tasks by simply swapping in the appropriate inputs and outputs.
Before BERT, applications over text pairs typically encoded each text independently and then applied bidirectional cross attention.
BERT unifies these two stages with self-attention: encoding a concatenated text pair with self-attention effectively includes bidirectional cross attention between the two sentences. For each task, the task-specific inputs and outputs are plugged in and all parameters are fine-tuned end-to-end, as sketched below.
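A minimal sketch of the "one additional output layer" idea for a sentence-pair classification task: a single linear head on the final hidden vector of [CLS]. The class name is illustrative, and the random tensor merely stands in for the output of a pre-trained encoder.

```python
# Sketch: fine-tuning head for classification on top of the [CLS] vector (illustrative).
import torch
import torch.nn as nn

class ClassificationHead(nn.Module):
    def __init__(self, hidden_size=768, num_labels=2):
        super().__init__()
        self.classifier = nn.Linear(hidden_size, num_labels)  # the single added output layer

    def forward(self, sequence_output):
        cls_vector = sequence_output[:, 0, :]  # final hidden state of the [CLS] token
        return self.classifier(cls_vector)     # logits for the downstream task

# Usage: pretend `encoder_output` came from the pre-trained encoder,
# shape (batch, seq_len, hidden_size); in practice all parameters are fine-tuned end-to-end.
encoder_output = torch.randn(8, 128, 768)
head = ClassificationHead()
logits = head(encoder_output)
print(logits.shape)  # torch.Size([8, 2])
```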