HEAT: Heterogeneous Entity-level Attention for Entity Typing

This repository accompanies the paper "HEAT: Heterogeneous Entity-level Attention for Entity Typing".

This project provides an attention-based neural network model for the entity-level typing task. The model learns representations from heterogeneous data about each entity: its name, relevant paragraphs, and relevant image features.

While mention-level entity typing infers the types of a mention that are supported by its textual context, entity-level typing infers the types of an entity by considering all types supported by the available data.

Requirements and initialization

  • PyTorch version >= 1.7
  • Python version >= 3.7
  • transformers >= 3.3
  • a GPU with 11GB of memory when running with the BiLSTM encoder
  • a GPU with 32GB of memory when running with the BERT encoder
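
As a quick sanity check before training, the version requirements above can be verified with a short script. This is only a sketch and not part of the repository: the helper names version_tuple and check_environment are hypothetical, the PyPI distribution name of PyTorch is taken to be torch, and importlib.metadata requires Python 3.8 or newer. GPU memory is not checked.

```python
import re
import sys
from importlib import metadata

# Minimum versions listed in the requirements above (major, minor).
MINIMUMS = {"torch": (1, 7), "transformers": (3, 3)}


def version_tuple(version):
    """Extract (major, minor) from a version string such as '1.7.1+cu110'."""
    match = re.match(r"(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    return tuple(int(part) for part in match.groups())


def check_environment():
    """Return a list of human-readable problems; an empty list means all checks passed."""
    problems = []
    if sys.version_info < (3, 7):
        problems.append("Python >= 3.7 is required")
    for package, minimum in MINIMUMS.items():
        try:
            installed = metadata.version(package)
        except metadata.PackageNotFoundError:
            problems.append(f"{package} is not installed")
            continue
        if version_tuple(installed) < minimum:
            wanted = ".".join(str(part) for part in minimum)
            problems.append(f"{package} {installed} is older than {wanted}")
    return problems


if __name__ == "__main__":
    for problem in check_environment():
        print(problem)
```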

Run python init.py to download datasets. Run bash run.sh to train the model with each dataset.

Usage

  • -task specifies the task id
  • -dataset specifies the dataset name
  • -text_encoder specifies the text encoder in {'bert', 'bert_freeze', 'lstm'}
  • -remove_name, -remove_para, -remove_img disable the name, paragraph, and image modules, respectively
  • -without_token_attention, -without_cross_modal_attention disable the corresponding attention layers
  • -seed specifies the random seed id
  • -consistency additionally enables consistency training
  • -labeled_num limits the number of labeled samples
  • -cpu runs on the CPU instead of a GPU
  • Other hyper-parameters are set in config.py
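
The options above could be declared with argparse roughly as follows. This is an illustrative sketch, not the repository's actual parser: the function name build_parser, the argument types, and the defaults (None or False) are assumptions; only the flag names and the text encoder choices come from the list above.

```python
import argparse


def build_parser():
    """Build a parser mirroring the documented flags (types and defaults are assumed)."""
    parser = argparse.ArgumentParser(description="HEAT options (illustrative sketch)")
    parser.add_argument("-task", type=str, default=None, help="task id")
    parser.add_argument("-dataset", type=str, default=None, help="dataset name")
    parser.add_argument("-text_encoder", choices=["bert", "bert_freeze", "lstm"],
                        default=None, help="text encoder")
    # Ablation switches: each one disables a module or an attention layer.
    for flag in ("-remove_name", "-remove_para", "-remove_img",
                 "-without_token_attention", "-without_cross_modal_attention"):
        parser.add_argument(flag, action="store_true")
    parser.add_argument("-seed", type=int, default=None, help="random seed id")
    parser.add_argument("-consistency", action="store_true",
                        help="enable consistency training")
    parser.add_argument("-labeled_num", type=int, default=None,
                        help="limit on the number of labeled samples")
    parser.add_argument("-cpu", action="store_true", help="run on the CPU")
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args(
        ["-dataset", "TypeNet", "-text_encoder", "lstm", "-remove_img", "-seed", "3"]
    )
    print(args)
```

In this sketch, omitting an ablation switch leaves the corresponding module enabled, which matches the way the flags are described above.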

Datasets

Four public datasets have been processed for the task of entity-level typing with heterogeneous data:

  • TypeNet: aligns Freebase types to noun synsets from the WordNet hierarchy and eliminates types that are either very specific or very rare. The original dataset consists of more than two million mentions with the name and contexts. Our processed TypeNet consists of more than half a million entities with the name and relevant paragraphs. download
  • MedMentions: contains annotations for a wide diversity of entities in the biomedical domain. The original dataset consists of nearly 250 thousand mentions. Our processed MedMentions consists of more than 50 thousand entities with the name and relevant paragraphs. download
  • Flowers: (Oxford Flowers-102) provides text descriptions for each flower image. The objective is to classify the fine-grained flower name of the sample. We keep the original splits and use the cross-entropy loss since each sample has only one positive type. download
  • Birds: (Caltech-UCSD Birds) provides text descriptions for each bird image. The objective is to classify the fine-grained bird name of the sample. We also keep the original splits and use the cross-entropy loss since each sample has only one positive type. download

Files in each dataset:

  • data.pkl contains the name, raw paragraphs, and each typing task's labels for an entity
  • data_txt.pkl is generated by data_loader.py and contains the tokenized indexes of the name and paragraphs
  • data_img.pkl (optional) contains each entity's relevant image features
  • split.pkl (optional) contains the train/valid/test split, keyed by entity name
  • types.json contains the name of each type
  • hierarchies.json contains the taxonomy between types
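
To inspect a downloaded dataset, the files listed above can be read with the standard pickle and json modules. The sketch below is an assumption-laden helper, not code from the repository: the function name load_dataset_dir is hypothetical, and the internal structure of each file is not documented here, so the objects are returned as loaded without interpretation. Only unpickle files from a source you trust.

```python
import json
import pickle
from pathlib import Path

# File names as listed above; the optional ones may be absent for some datasets.
PICKLE_FILES = ("data.pkl", "data_txt.pkl", "data_img.pkl", "split.pkl")
JSON_FILES = ("types.json", "hierarchies.json")


def load_dataset_dir(dataset_dir):
    """Load every known file that exists in dataset_dir into a dict keyed by file name."""
    root = Path(dataset_dir)
    loaded = {}
    for name in PICKLE_FILES:
        path = root / name
        if path.exists():
            with path.open("rb") as handle:
                loaded[name] = pickle.load(handle)
    for name in JSON_FILES:
        path = root / name
        if path.exists():
            with path.open("r", encoding="utf-8") as handle:
                loaded[name] = json.load(handle)
    return loaded
```

Calling load_dataset_dir on a dataset folder and printing the type of each returned value is a simple way to discover the layout before writing any task-specific code.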