LXMERT


Title

LXMERT: Learning Cross-Modality Encoder Representations from Transformers

Abstract

Vision-and-language reasoning requires an understanding of visual concepts, language semantics, and, most importantly, the alignment and relationships between these two modalities. We thus propose the LXMERT (Learning Cross-Modality Encoder Representations from Transformers) framework to learn these vision-and-language connections.

Note: vision-and-language reasoning requires not only learning visual concepts and language semantics, but, most importantly, the alignment and relationships between the two modalities.

LXMERT is designed to learn these connections.

The Transformer model consists of three encoders: an object relationship encoder (for the image), a language encoder (for the sentence), and a cross-modality encoder (to connect the two).
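As a rough, hypothetical sketch of how the cross-modality encoder connects the two streams, the following PyTorch module (names such as `lang2vis` are my own) implements one bi-directional cross-attention layer; the paper's actual cross-modality layers additionally apply self-attention and feed-forward sub-layers after the cross-attention.

```python
import torch.nn as nn

class CrossModalityLayer(nn.Module):
    """Sketch of one cross-modality layer: each stream attends to the other."""
    def __init__(self, hidden_dim=768, num_heads=12):
        super().__init__()
        self.lang2vis = nn.MultiheadAttention(hidden_dim, num_heads, batch_first=True)
        self.vis2lang = nn.MultiheadAttention(hidden_dim, num_heads, batch_first=True)
        self.norm_l = nn.LayerNorm(hidden_dim)
        self.norm_v = nn.LayerNorm(hidden_dim)

    def forward(self, lang, vis):
        # lang: (B, L, H) language stream; vis: (B, N, H) object stream
        l = self.norm_l(lang + self.lang2vis(lang, vis, vis, need_weights=False)[0])
        v = self.norm_v(vis + self.vis2lang(vis, lang, lang, need_weights=False)[0])
        return l, v
```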

Pre-training data: image-sentence pairs.

Pre-training tasks: masked language modeling, masked object prediction (feature regression and label classification), cross-modality matching, and image question answering.

Good results on two datasets: VQA and GQA.

Introduction

Despite these influential single-modality works, large-scale pre-training and fine-tuning studies for the modality pair of vision and language are still under-developed.

This framework adapts the BERT model to the cross-modality setting, focusing on a single image and its descriptive sentence.

Among the five pre-training tasks mentioned in the Abstract (masked object prediction counts as two: feature regression and label classification), and unlike single-modality pre-training, cross-modality pre-training infers masked features either from visible elements in the same modality or from aligned components in the other modality, thereby building both intra-modality and cross-modality relationships.
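As a minimal sketch of the masked object prediction objective (in PyTorch, with hypothetical head names and a 1600-class detector label space, as in a Visual Genome detector), the two sub-losses, RoI-feature regression and detected-label classification, could look like this:

```python
import torch.nn as nn
import torch.nn.functional as F

class MaskedObjectHeads(nn.Module):
    """Sketch of the masked-object-prediction heads (hypothetical names)."""
    def __init__(self, hidden_dim=768, feat_dim=2048, num_classes=1600):
        super().__init__()
        self.feat_head = nn.Linear(hidden_dim, feat_dim)      # feature regression
        self.label_head = nn.Linear(hidden_dim, num_classes)  # label classification

    def forward(self, hidden, roi_feats, det_labels, mask):
        # hidden: (B, N, H) encoder outputs at the object positions
        # roi_feats: (B, N, 2048) detector RoI features (regression targets)
        # det_labels: (B, N) long, detector class labels (classification targets)
        # mask: (B, N) bool, True where the object feature was masked in the input
        l2 = F.mse_loss(self.feat_head(hidden)[mask], roi_feats[mask])
        ce = F.cross_entropy(self.label_head(hidden)[mask], det_labels[mask])
        return l2 + ce
```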

Evaluation

LXMERT is first evaluated on VQA and GQA, where it reaches state-of-the-art overall accuracy.

To demonstrate generalizability, the pre-trained model is fine-tuned and evaluated on NLVR^2, whose real-world images are not used in pre-training. Finally, several analyses and ablation studies are performed to show the effectiveness of both the model components and the pre-training tasks.

Model

The model inputs are an object-level image embedding and a word-level sentence embedding.

Input Embeddings

Word-level Sentence Embedding

A sentence is first split into words; each word and its absolute position (index) in the sentence are then projected to embedding vectors, and the index embedding is added to the word embedding to form an index-aware word embedding.
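A minimal PyTorch sketch of this index-aware word embedding, assuming BERT-like hyper-parameters (30522-word vocabulary, 768-dimensional hidden states):

```python
import torch
import torch.nn as nn

class WordLevelSentenceEmbedding(nn.Module):
    """Sketch: word embedding plus absolute-position (index) embedding."""
    def __init__(self, vocab_size=30522, max_len=512, hidden_dim=768):
        super().__init__()
        self.word_embed = nn.Embedding(vocab_size, hidden_dim)  # embeds each word
        self.idx_embed = nn.Embedding(max_len, hidden_dim)      # embeds its index
        self.norm = nn.LayerNorm(hidden_dim)

    def forward(self, token_ids):
        # token_ids: (B, L) word ids; positions are the absolute indices 0..L-1
        positions = torch.arange(token_ids.size(1), device=token_ids.device)
        return self.norm(self.word_embed(token_ids) + self.idx_embed(positions))
```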


Object-Level Image Embedding

Unlike prior work that uses a feature map or the RoI features alone, LXMERT combines the 2048-dimensional RoI features with their bounding box coordinates; this position info is needed because the image embedding and attention layers are otherwise unaware of where their inputs are located.
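A minimal sketch of this object-level embedding, assuming 4-dimensional normalized box coordinates and projecting both parts to a shared hidden size before averaging:

```python
import torch.nn as nn

class ObjectLevelImageEmbedding(nn.Module):
    """Sketch: project RoI features and box coordinates, then average."""
    def __init__(self, feat_dim=2048, pos_dim=4, hidden_dim=768):
        super().__init__()
        self.feat_proj = nn.Linear(feat_dim, hidden_dim)  # 2048-d RoI feature
        self.pos_proj = nn.Linear(pos_dim, hidden_dim)    # bounding-box coords
        self.feat_norm = nn.LayerNorm(hidden_dim)
        self.pos_norm = nn.LayerNorm(hidden_dim)

    def forward(self, roi_feats, boxes):
        # roi_feats: (B, N, 2048) detector features; boxes: (B, N, 4) normalized
        return (self.feat_norm(self.feat_proj(roi_feats))
                + self.pos_norm(self.pos_proj(boxes))) / 2
```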

Encoders

Attention retrieves information from a set of context vectors related to a query vector; when the query vector itself comes from the context vectors, the attention is self-attention. An attention layer first computes matching scores between the query vector and each context vector, and the scores are then normalized by a softmax.
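A minimal sketch of this retrieve-and-weight view of attention for a single query vector, using a dot-product matching score (one common choice):

```python
import torch.nn.functional as F

def attention(query, context):
    """Score each context vector against the query, softmax-normalize,
    and return the weighted sum of the context vectors."""
    # query: (H,); context: (N, H)
    scores = context @ query          # matching score for each context vector
    alpha = F.softmax(scores, dim=0)  # normalized attention weights
    return alpha @ context            # weighted sum of context vectors
```

Self-attention is simply the case where `query` is itself drawn from `context`.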

Single-Modality Encoder

Unlike BERT, the transformer encoders here take not only language but also vision as input. Each encoder layer contains a self-attention (Self) sub-layer and a feed-forward (FF) sub-layer, where the FF sub-layer consists of two fully connected layers; a residual connection and layer normalization are added after each sub-layer.
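A minimal sketch of one such encoder layer in PyTorch (post-norm, as in BERT; the sizes are assumed BERT-base values):

```python
import torch.nn as nn

class SingleModalityEncoderLayer(nn.Module):
    """Sketch: self-attention sub-layer + feed-forward sub-layer, each
    followed by a residual connection and layer normalization."""
    def __init__(self, hidden_dim=768, num_heads=12, ff_dim=3072):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(hidden_dim, num_heads,
                                               batch_first=True)
        self.ff = nn.Sequential(          # two fully connected layers
            nn.Linear(hidden_dim, ff_dim),
            nn.GELU(),
            nn.Linear(ff_dim, hidden_dim),
        )
        self.norm1 = nn.LayerNorm(hidden_dim)
        self.norm2 = nn.LayerNorm(hidden_dim)

    def forward(self, x):
        # x: (B, L, H); the same layer form serves both language and vision
        x = self.norm1(x + self.self_attn(x, x, x, need_weights=False)[0])
        return self.norm2(x + self.ff(x))
```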


