Awesome Multimodal Named Entity Recognition πŸŽΆπŸ“œ

A collection of resources on multimodal named entity recognition.

Sorry, this repo will no longer be updated, since I have changed my research topic.

Contents

1. Description

🐌 Markdown Format:

  • (Conference/Journal Year) Title, First Author et al. [Paper] [Code] [Project]
  • (Conference/Journal Year) [πŸ’¬Topic] Title, First Author et al. [Paper] [Code] [Project]
    • (Optional) 🌱 or πŸ“Œ
    • (Optional) πŸš€ or πŸ‘‘ or πŸ“š
  • 🌱: Novel idea
  • πŸ“Œ: The first...
  • πŸš€: State-of-the-Art
  • πŸ‘‘: Novel dataset/model
  • πŸ“š: Downstream Tasks

2. Topic Order

  • πŸ‘‘ Dataset
    • (AAAI 2017) Adaptive Co-attention Network for Named Entity Recognition in Tweets [paper]
    • (ACL 2018) Visual Attention Model for Name Tagging in Multimodal Social Media [paper]

3. Chronological Order

  • 2020

    • (ACL 2020) Improving Multimodal Named Entity Recognition via Entity Span Detection with Unified Multimodal Transformer [paper]
    • (COLING 2020) RIVA: A Pre-trained Tweet Multimodal Model Based on Text-image Relation for Multimodal NER [paper]
  • 2021

    • (AAAI 2021) Multi-modal Graph Fusion for Named Entity Recognition with Targeted Visual Guidance [paper]
    • (AAAI 2021) RpBERT: A Text-image Relation Propagation-based BERT Model for Multimodal NER [paper] [code]
    • (EMNLP 2021) Can images help recognize entities? A study of the role of images for Multimodal NER [paper] [code]
  • 2022

    • (COLING 2022) Flat Multi-modal Interaction Transformer for Named Entity Recognition [paper]

      • πŸ“Œ First to integrate FLAT into MNER
      • πŸš€ SOTA on Twitter15 with BERT_base_uncased, but the code is unavailable
    • (NAACL Findings 2022) Good Visual Guidance Makes A Better Extractor: Hierarchical Visual Prefix for Multimodal Entity and Relation Extraction [paper] [code]

      • πŸ“Œ The code uses a refined Twitter15 dataset
    • (WSDM 2022) MAF: A General Matching and Alignment Framework for Multimodal Named Entity Recognition [paper] [code]

    • (SIGIR 2022) Hybrid Transformer with Multi-level Fusion for Multimodal Knowledge Graph Completion [paper]

      • πŸ“Œ First fully Transformer-based structure
      • πŸš€ SOTA on Twitter17 using BERT_base_uncased, but only evaluated on Twitter17
    • (NAACL 2022) ITA: Image-Text Alignments for Multi-Modal Named Entity Recognition [paper] [code]

      • πŸ“Œ RoBERTa_large as the backbone provides strong improvements
      • 🌱 Uses OCR etc. without directly using the images
    • (MM 2022) Query Prior Matters: A MRC Framework for Multimodal Named Entity Recognition [paper]

      • 🌱 First MRC-based framework for MNER
    • (SIGIR 2022) Learning from Different text-image Pairs: A Relation-enhanced Graph Convolutional Network for Multimodal NER [paper]

      • πŸ“Œ Trustworthy performance, verified by reimplementation
    • (ICME 2022) CAT-MNER: Multimodal Named Entity Recognition with Knowledge-Refined Cross-Modal Attention [paper]

      • πŸš€ SOTA on Twitter15 and Twitter17 with RoBERTa_large
      • πŸ“Œ Requires 8 V100 GPUs
    • (DSAA 2022) PromptMNER: Prompt-Based Entity-Related Visual Clue Extraction and Integration for Multimodal Named Entity Recognition [paper]

      • πŸš€ SOTA on Twitter15 and Twitter17 with RoBERTa_large
      • πŸ“Œ Requires 8 V100 GPUs
      • 🌱 Prompt-based
    • (arXiv 2022) Multi-Granularity Cross-Modality Representation Learning for Named Entity Recognition on Social Media [paper] [code]

    • (arXiv 2021) Multi-Granularity Contrastive Knowledge Distillation for Multimodal Named Entity Recognition

      • Submitted to ACL 2021 but not accepted
    • (arXiv 2022) MNER-QG: An End-to-End MRC framework for Multimodal Named Entity Recognition with Query Grounding [paper]

  • 2023

    • (EMNLP 2023) Named Entity and Relation Extraction with Multi-Modal Retrieval [paper] [code under construction]
      • πŸš€ Retrieval augmentation for MNER, including image-based retrieval
    • (MM 2023) Learning Implicit Entity-object Relations by Bidirectional Generative Alignment for Multimodal NER [paper]
      • πŸš€ Generative alignment
      • 🌱 Needs no image during inference

4. Course

5. Thoughts on MNER

  • MNER is a hard task, since it requires multimodal understanding in the social media domain. However, existing methods simplify it to extracting helpful visual clues to assist NER, with only a simple showcase. In the Twitter datasets, the image-text pair often has no relationship or only a vague one, so the model needs extra information or supervision to understand it. I believe that is why MNER-QG, MoRe, R-GCN and PromptMNER work. However, existing works are still nowhere near logical understanding, since they all introduce out-of-sample knowledge. I am now trying to introduce knowledge graphs into MNER to provide in-sample context.

  • Tricky task: while developing my own work (SOTA on two datasets, but still under submission), I found the following (sketches for points 1, 3, 4, 5 and 6 follow this list):

    1 Only tuning the task head of BERT (freezing both BERT and ViT) achieves comparable results (a 0.2-0.5% drop), so I believe we can directly introduce prompt engineering for the text.

    2 The large language model matters more than fancy innovations.

    3 Replacing all valid images with an empty image at test time only drops performance by about 2%.

    4 A simple loss such as a contrastive loss brings a 2% improvement, so I think the model heavily focuses on the text.

    5 Running the same code in different environments, or even on different GPUs, yields results with large variance...

    6 There are two mainstream settings, depending on whether image+text or caption+text is used as input. The DAMO-series works mainly focus on the latter, which has proved to be more extensible and stronger (SOTA) than the alternatives.
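
Below are minimal sketches for points 1, 3, 4, 5 and 6 above. They are illustrative assumptions about a typical BERT+ViT setup, not the code of my submitted work.

For point 1, a PyTorch sketch of freezing both backbones and tuning only a task head; the Hugging Face checkpoints, the label count, and the toy fusion head are all assumptions:

```python
import torch
import torch.nn as nn
from transformers import BertModel, ViTModel

text_encoder = BertModel.from_pretrained("bert-base-uncased")
image_encoder = ViTModel.from_pretrained("google/vit-base-patch16-224")

# Freeze both backbones; only the task head below receives gradients.
for p in list(text_encoder.parameters()) + list(image_encoder.parameters()):
    p.requires_grad = False

num_labels = 9  # hypothetical BIO tag set (PER/LOC/ORG/MISC + O)
head = nn.Linear(
    text_encoder.config.hidden_size + image_encoder.config.hidden_size,
    num_labels,
)
optimizer = torch.optim.AdamW(head.parameters(), lr=1e-3)  # head-only updates

def tag_logits(input_ids, attention_mask, pixel_values):
    with torch.no_grad():  # frozen backbones need no gradient graph
        txt = text_encoder(input_ids=input_ids,
                           attention_mask=attention_mask).last_hidden_state
        img = image_encoder(pixel_values=pixel_values).last_hidden_state[:, 0]
    img = img.unsqueeze(1).expand(-1, txt.size(1), -1)  # broadcast CLS to tokens
    return head(torch.cat([txt, img], dim=-1))          # per-token tag scores
```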
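
The ablation in point 3 is then a one-liner on top of the sketch above, taking "empty image" to mean an all-zero pixel tensor (a gray or mean-pixel image would serve the same purpose):

```python
import torch

def blank_images(pixel_values: torch.Tensor) -> torch.Tensor:
    # Same shape, dtype, and device as the real batch, but no visual signal.
    return torch.zeros_like(pixel_values)

# Test-time usage: tag_logits(input_ids, attention_mask, blank_images(pixel_values))
```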
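
For point 4, a CLIP-style symmetric InfoNCE loss is one common instantiation of such a "simple contrastive loss"; this generic form is an assumption, not the exact loss from any cited paper:

```python
import torch
import torch.nn.functional as F

def contrastive_loss(text_feat, image_feat, temperature=0.07):
    """Symmetric InfoNCE over a batch of matched text-image pairs."""
    text_feat = F.normalize(text_feat, dim=-1)   # cosine similarity via dot product
    image_feat = F.normalize(image_feat, dim=-1)
    logits = text_feat @ image_feat.t() / temperature             # (B, B) similarities
    targets = torch.arange(logits.size(0), device=logits.device)  # diagonal = matches
    return (F.cross_entropy(logits, targets)
            + F.cross_entropy(logits.t(), targets)) / 2
```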
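
For point 5, pinning every seed and forcing deterministic kernels shrinks the run-to-run variance, although bitwise reproducibility across different GPU models is still not guaranteed:

```python
import os
import random

import numpy as np
import torch

def set_seed(seed: int = 42) -> None:
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    # Must be set before the first CUDA call for cuBLAS determinism.
    os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":4096:8"
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False
    torch.use_deterministic_algorithms(True)  # raise on nondeterministic ops
```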
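
Finally, the caption+text route in point 6 can be sketched with an off-the-shelf captioner; the checkpoint name and the [SEP] joining scheme are assumptions for illustration, not the DAMO pipeline:

```python
from transformers import pipeline

# Any image-to-text checkpoint works here; this one is an assumption.
captioner = pipeline("image-to-text", model="nlpconnect/vit-gpt2-image-captioning")

def build_ner_input(tweet: str, image_path: str) -> str:
    caption = captioner(image_path)[0]["generated_text"]
    # The downstream model is a plain text NER tagger that never sees pixels.
    return f"{tweet} [SEP] {caption}"
```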

Contact Me