MNER-Multimodal Entity Span Detection

PaperLookThrough

Publish Date: 2020-07-06

Update Date: 2020-10-12


Named-entity recognition (NER) (also known as entity identification, entity chunking and entity extraction) is a subtask of information extraction that seeks to locate and classify named entities in text into pre-defined categories such as the names of persons, organizations, locations, expressions of times, quantities, monetary values, percentages, etc.

An example of NER:

Jim bought 300 shares of Acme Corp. in 2006.
[Jim]Person bought 300 shares of [Acme Corp.]Organization in [2006]Time.
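To make the output format concrete, here is a minimal sketch that reads the bracket notation from the example above and returns the entity spans with their types. The helper name `extract_entities` and the regular expression are illustrative assumptions, not part of any NER library.

```python
import re

# Hypothetical pattern for the "[text]Label" notation used in the example above.
ANNOTATION = re.compile(r"\[([^\]]+)\](\w+)")

def extract_entities(annotated):
    """Return (entity text, entity type) pairs in order of appearance."""
    return ANNOTATION.findall(annotated)

entities = extract_entities(
    "[Jim]Person bought 300 shares of [Acme Corp.]Organization in [2006]Time."
)
print(entities)
# [('Jim', 'Person'), ('Acme Corp.', 'Organization'), ('2006', 'Time')]
```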

level: ACL 2020

author: Jianfei Yu

keywords: MNER, Entity Span Detection

## Questions

MNER drawbacks

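Word representations are insensitive to the visual context.

Existing methods focus on modeling the interaction between modalities. Because the hidden representation of each word is still computed from the textual context alone, it stays insensitive to the visual context.

Most existing methods ignore the bias brought by the visual context.

The associated image usually covers only one or two entities in the sentence and says nothing about the others, so merging in the visual information can hurt recognition of the remaining entities.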

Strategy:

1. Main strategy: a multimodal interaction (MMI) module: standard Transformer layer + cross-modal attention mechanism

2. Auxiliary task: leverage purely text-based entity span detection

Consequence:

The model achieves new state-of-the-art performance on two benchmark datasets.

Notes

Overall Architecture of Our Unified Multimodal Transformer.

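Transformer model

It uses an encoder-decoder architecture, similar to earlier attention-based models.

The basic internal structure is shown in the figure. Before entering the encoder layer, each word is first embedded. The embeddings go through self-attention and are then fed into a feed-forward neural network. The feed-forward network is applied to each position independently, so that step can run in parallel across positions.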

BERT

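BERT stands for Bidirectional Encoder Representations from Transformers, i.e. the encoder of a bidirectional Transformer.
Key innovation in pre-training: Masked LM + Next Sentence Prediction

MLM (Masked LM) can be understood as a cloze test: the authors randomly mask 15% of the tokens in each sentence and predict them from their context, for example: my dog is hairy → my dog is [MASK]

Here hairy has been masked, and the model is trained without labeled data to predict the word at the masked position. This approach has a problem: the [MASK] token never appears during fine-tuning, which creates a mismatch between pre-training and fine-tuning. To mitigate this, the authors handle the selected tokens as follows:
- 80% of the time, replace the word with [MASK]: my dog is hairy → my dog is [MASK]
- 10% of the time, replace the word with a random word: my dog is hairy -> my dog is apple
- 10% of the time, keep the word unchanged: my dog is hairy -> my dog is hairy
Why use a random word with some probability? Because the Transformer has to keep a distributed representation of every input token; otherwise it would likely just memorize that [MASK] is "hairy". As for the negative effect of random words, the paper explains that random replacement touches only 15% * 10% = 1.5% of all tokens, so its impact is negligible. The Transformer's global visibility increases the information the model can draw on, while masking keeps the model from seeing the full input.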
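Notes:
- The parameter dupe_factor determines how many times the data is duplicated.
- The create_instance_from_document function builds sentence-pair samples. For each sentence it first generates [CLS]+A+[SEP]+B+[SEP], with full-length sequences (probability 0.9) and shorter ones (probability 0.1), then applies the mask and wraps the result in a sample object.
- The tokens returned by create_masked_lm_predictions are the tokens after the masked words have been replaced.
- masked_lm_labels holds the true labels at the masked positions.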
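The 80/10/10 masking rule described above can be sketched in a few lines. This is a simplified illustration, not the code from the BERT repository: the function name `mask_tokens`, the toy vocabulary, and the per-token independent selection are all assumptions made for brevity.

```python
import random

def mask_tokens(tokens, vocab, rng, mask_prob=0.15):
    """Return (masked tokens, labels).

    labels[i] is the original token at each selected position and None elsewhere.
    Simplification (assumption): every token is selected independently with
    probability mask_prob.
    """
    output = list(tokens)
    labels = [None] * len(tokens)
    for i, token in enumerate(tokens):
        if token in ("[CLS]", "[SEP]"):
            continue  # special tokens are never selected
        if rng.random() >= mask_prob:
            continue  # position not selected for prediction
        labels[i] = token
        roll = rng.random()
        if roll < 0.8:
            output[i] = "[MASK]"           # 80%: replace with [MASK]
        elif roll < 0.9:
            output[i] = rng.choice(vocab)  # 10%: replace with a random word
        # remaining 10%: keep the original word
    return output, labels

rng = random.Random(0)
sentence = ["[CLS]", "my", "dog", "is", "hairy", "[SEP]"]
masked, labels = mask_tokens(sentence, ["apple", "cat", "runs"], rng)
print(masked)
print(labels)
```

Calling the function several times on the same sentence gives different masks, which is the effect that duplicating the data with dupe_factor relies on.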
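Next Sentence Prediction
Select sentence pairs A and B, where for 50% of the data B is the actual next sentence after A and for the remaining 50% B is chosen at random from the corpus, and have the model learn the relationship between them. The reason for adding this pre-training task is that many NLP tasks, such as QA and NLI, require understanding the relationship between two sentences, so the pre-trained model adapts better to those tasks.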
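The pair construction described above can be sketched as follows. The function name `make_nsp_pairs` and the toy data are hypothetical, and the sketch does not check whether the randomly drawn sentence happens to be the true next sentence, which a real pipeline would want to avoid.

```python
import random

def make_nsp_pairs(sentences, corpus, rng):
    """Build ([CLS] A [SEP] B [SEP], is_next) examples from consecutive sentences."""
    examples = []
    for a, b in zip(sentences, sentences[1:]):
        is_next = rng.random() < 0.5
        if not is_next:
            b = rng.choice(corpus)  # swap in a random sentence from the corpus
        tokens = ["[CLS]"] + a + ["[SEP]"] + b + ["[SEP]"]
        examples.append((tokens, is_next))
    return examples

document = [["my", "dog", "is", "hairy"], ["he", "likes", "walks"], ["we", "go", "daily"]]
corpus = [["stocks", "fell", "today"], ["rain", "is", "expected"]]
pairs = make_nsp_pairs(document, corpus, random.Random(0))
for tokens, is_next in pairs:
    print(is_next, tokens)
```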
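My own understanding:
- BERT first uses masking to increase the amount of information available within its field of view, then duplicates the data and masks at random. Apart from masking different positions, this is not much different from how RNN-style methods are trained by predicting tokens one after another.
- The global view greatly lowers the difficulty of learning. A+B/C is then used as the sample, so each sample has a 50% chance of containing roughly half noise.
- Learning directly from masked A+B/C would not work, because the model cannot tell which part is noise. The next_sentence prediction task is therefore added and trained jointly with MLM: next-sentence prediction helps the model distinguish noise from non-noise, while MLM does most of the semantic learning.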

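Positional Encoding

The Transformer on its own has no way of accounting for word order. The positional encoding has the same dimension as the embedding and lets the model work out the distance between any two words. It is simply added to the embedding, and the sum is fed into the next layer.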
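As an illustration of adding a position signal to the embeddings, the sketch below uses the sinusoidal encoding from the original Transformer paper. That choice is an assumption here, since these notes do not say which encoding is meant, and the function names are hypothetical.

```python
import math

def positional_encoding(num_positions, d_model):
    """Sinusoidal table: sin on even dimensions, cos on odd dimensions."""
    table = []
    for pos in range(num_positions):
        row = []
        for i in range(d_model):
            angle = pos / (10000 ** ((2 * (i // 2)) / d_model))
            row.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
        table.append(row)
    return table

def add_position(embeddings):
    """Add the encoding element-wise; it has the same shape as the embeddings."""
    pe = positional_encoding(len(embeddings), len(embeddings[0]))
    return [[e + p for e, p in zip(e_row, p_row)]
            for e_row, p_row in zip(embeddings, pe)]

# Three positions with toy 4-dimensional embeddings (all zeros for readability).
encoded = add_position([[0.0] * 4 for _ in range(3)])
print(encoded[0])  # position 0: [0.0, 1.0, 0.0, 1.0]
```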

self-attention

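Define three vectors: Query, Key and Value. They are obtained by multiplying the embedding vector by three randomly initialized matrices, e.g. of dimension (64, 128); note that the second dimension equals the dimension of the embedding vector.
scores = Q*K. Divide the scores by the square root of the first dimension mentioned above (64 in this example) and apply softmax.

The softmax value expresses how relevant each word is to the word at the current position. Multiply each value vector by its softmax weight and sum the results; that sum is the self-attention output at the current position.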
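The computation above can be written out as a small pure-Python sketch of scaled dot-product self-attention. The helper names and the tiny 2-dimensional inputs are assumptions for illustration. Note that this sketch multiplies a row-vector input by weight matrices of shape (embedding dimension, key dimension), which is the transpose of the (64, 128) layout mentioned above.

```python
import math

def matmul(a, b):
    """Multiply matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax(row):
    peak = max(row)  # subtract the max for numerical stability
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(x, w_q, w_k, w_v):
    """Return (outputs, attention weights) for input embeddings x."""
    q, k, v = matmul(x, w_q), matmul(x, w_k), matmul(x, w_v)
    d_k = len(k[0])
    k_t = [list(col) for col in zip(*k)]
    scores = matmul(q, k_t)  # relevance of every word to every position
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    return matmul(weights, v), weights  # weighted sum of the value vectors

x = [[1.0, 0.0], [0.0, 1.0]]          # two words, embedding dimension 2
identity = [[1.0, 0.0], [0.0, 1.0]]   # toy weights instead of random ones
outputs, weights = self_attention(x, identity, identity, identity)
print(weights)
print(outputs)
```

With these toy weights each word attends most strongly to itself, and each row of `weights` sums to 1.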

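ResNet

ResNet is a residual network. In principle, the deeper a network is, the more information it captures and the richer its features become. Experiments show, however, that as the network gets deeper it becomes harder to optimize, and accuracy on both the test data and the training data goes down. This is attributed to the exploding and vanishing gradient problems that come with greater depth.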

Multimodal Interaction (MMI) Module.

Weiruohe

https://weiruohe.github.io/2020/07/06/improving-multimodal-named-entity-recognition-via-entity-spandetection-with-unified-multimodal-transformer/

Unless otherwise stated, all articles in this blog are licensed under CC BY 4.0. If you reproduce this article, please credit the source: Weiruohe.

NLP MNER - NER (named entity recognition)
