Title
LXMERT: Learning Cross-Modality Encoder Representations from Transformers
Abstract
Vision-and-language reasoning requires an understanding of visual concepts, language semantics, and, most importantly, the alignment and relationships between these two modalities. We thus propose the LXMERT (Learning Cross-Modality Encoder Representations from Transformers) framework to learn these vision-and-language connections.
Vision-and-language reasoning requires, beyond learning visual concepts and language semantics, the alignment and relationships between the two modalities, which matter most.
LXMERT is designed to learn these connections.
The Transformer-based model contains three encoders: an object relationship encoder (for the image), a language encoder (for the text), and a cross-modality encoder (connecting the two).
Pre-training data: image-and-sentence pairs.
Pre-training tasks: masked language modeling, masked object prediction (feature regression and label classification), cross-modality matching, and image question answering.
Good results on two datasets: VQA and GQA.
Introduction
Despite these influential single-modality works, large-scale pre-training and fine-tuning studies for the modality-pair of vision and language are still under-developed.
This framework applies BERT-style pre-training to the cross-modality setting, focusing on a single image and its descriptive sentence.
Among the five pre-training tasks listed in the Abstract, cross-modality pre-training differs from single-modality pre-training in that masked features are inferred either from the visible components of the same modality or from aligned components of the other modality, which ultimately builds both intra-modality and cross-modality relationships.
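As a rough illustration, here is a minimal masking sketch (the function name, mask probability, and mask id are assumptions, not the paper's exact pre-processing); the encoder must then recover each masked position from the visible inputs of the same modality or, in the cross-modal case, from the aligned inputs of the other modality.

```python
import torch

def random_mask(tokens, mask_prob=0.15, mask_id=103):
    # Replace a random subset of positions with a [MASK] id (values are assumptions).
    # The pre-trained encoder is asked to recover these positions from the visible
    # tokens of the same modality or from the aligned other modality.
    mask = torch.rand(tokens.shape, device=tokens.device) < mask_prob
    masked = tokens.clone()
    masked[mask] = mask_id
    return masked, mask
```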
Evaluation
The model is first evaluated on VQA and GQA, where it reaches state-of-the-art overall accuracy.
To show generalizability, the pre-trained model is fine-tuned and evaluated on NLVR², a challenging task with real-world images (the natural images in that dataset are not used during pre-training); finally, several analyses and ablation studies demonstrate the effectiveness of the model and of the pre-training tasks.
Model
The model takes as input an object-level image embedding and a word-level sentence embedding.
Input Embeddings
Word-level Sentence Embedding

For a sentence, the text is first split into words, and each word's absolute position (index) in the sentence is recorded; the word and its index are then projected into vectors, and the two are summed to form an index-aware word embedding.
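A minimal PyTorch sketch of such an index-aware word embedding (class name, dimensions, and the final layer normalization are assumptions): the word id and its position index are each embedded, and the two vectors are summed.

```python
import torch
import torch.nn as nn

class WordLevelEmbedding(nn.Module):
    """Hypothetical sketch: index-aware word embedding = word embedding + position embedding."""
    def __init__(self, vocab_size, max_len, hidden_dim):
        super().__init__()
        self.word_emb = nn.Embedding(vocab_size, hidden_dim)  # project word ids to vectors
        self.pos_emb = nn.Embedding(max_len, hidden_dim)      # absolute position in the sentence
        self.layer_norm = nn.LayerNorm(hidden_dim)

    def forward(self, token_ids):
        # token_ids: (batch, seq_len)
        positions = torch.arange(token_ids.size(1), device=token_ids.device)
        positions = positions.unsqueeze(0).expand_as(token_ids)
        h = self.word_emb(token_ids) + self.pos_emb(positions)  # index-aware embedding
        return self.layer_norm(h)
```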
Object-Level Image Embedding
Unlike prior work that uses convolutional feature maps or raw RoI features directly, each object is represented by both its bounding-box coordinates and a 2048-dimensional RoI feature. This injects position information, which would otherwise be unknown to the image embeddings and the attention layers.
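A sketch of one way to fuse the two signals, assuming the box coordinates form a 4-dimensional vector and the two projected terms are averaged (class name, dimensions, and the averaging are assumptions):

```python
import torch
import torch.nn as nn

class ObjectLevelEmbedding(nn.Module):
    """Hypothetical sketch: fuse a 2048-d RoI feature with its bounding-box position."""
    def __init__(self, feat_dim=2048, pos_dim=4, hidden_dim=768):
        super().__init__()
        self.feat_fc = nn.Linear(feat_dim, hidden_dim)  # project the visual feature
        self.pos_fc = nn.Linear(pos_dim, hidden_dim)    # project the box coordinates
        self.feat_norm = nn.LayerNorm(hidden_dim)
        self.pos_norm = nn.LayerNorm(hidden_dim)

    def forward(self, roi_feats, boxes):
        # roi_feats: (batch, num_objects, 2048); boxes: (batch, num_objects, 4)
        f = self.feat_norm(self.feat_fc(roi_feats))
        p = self.pos_norm(self.pos_fc(boxes))
        return (f + p) / 2  # assumed fusion: average of the two projections
```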

Encoders
Attention retrieves information from a set of context vectors that are related to a query vector; when the query vector itself comes from the context vectors, the attention is self-attention. The attention layer first computes matching scores between the query vector and each context vector, and the scores are then normalized with a softmax.
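A minimal single-head attention sketch illustrating the score / softmax / weighted-sum steps (the scaled dot-product score and the omission of learned projections are simplifying assumptions):

```python
import torch
import torch.nn.functional as F

def attention(query, context):
    """Minimal single-head attention sketch (no learned projections).
    query:   (batch, len_q, dim) -- the vectors doing the retrieval
    context: (batch, len_k, dim) -- the vectors being retrieved from
    If query comes from the context itself, this reduces to self-attention.
    """
    dim = query.size(-1)
    scores = torch.matmul(query, context.transpose(-2, -1)) / dim ** 0.5  # matching scores
    weights = F.softmax(scores, dim=-1)                                   # normalize over context
    return torch.matmul(weights, context)                                 # weighted sum of context vectors
```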
Single-Modality Encoder
Unlike BERT, the transformer encoders here are applied not only to language input but also to vision input. Each encoder layer contains a self-attention ("Self") sub-layer and a feed-forward ("FF") sub-layer, where the feed-forward sub-layer itself consists of two fully-connected sub-layers. A residual connection and layer normalization are added after each sub-layer.
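A sketch of one such encoder layer in PyTorch (hidden size, head count, and the GELU activation are assumptions), showing the self-attention sub-layer, the two fully-connected feed-forward sub-layers, and the residual connection plus layer normalization after each sub-layer:

```python
import torch.nn as nn

class SingleModalityEncoderLayer(nn.Module):
    """Hypothetical sketch of one single-modality encoder layer."""
    def __init__(self, hidden_dim=768, num_heads=12, ff_dim=3072):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(hidden_dim, num_heads, batch_first=True)
        self.attn_norm = nn.LayerNorm(hidden_dim)
        # Feed-forward sub-layer made of two fully-connected sub-layers.
        self.ff = nn.Sequential(
            nn.Linear(hidden_dim, ff_dim),
            nn.GELU(),
            nn.Linear(ff_dim, hidden_dim),
        )
        self.ff_norm = nn.LayerNorm(hidden_dim)

    def forward(self, x):
        attn_out, _ = self.self_attn(x, x, x)  # self-attention: query == context
        x = self.attn_norm(x + attn_out)       # residual connection + layer norm
        x = self.ff_norm(x + self.ff(x))       # residual connection + layer norm
        return x
```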