This project relates to the paper "HEAT: Heterogeneous Entity-level Attention for Entity Typing".
This project provides an attention-based neural network model for the entity-level typing task. It processes heterogeneous data about each entity, including the name, relevant paragraphs, and relevant image features, and learns representations from them.
While Mention-level Entity Typing infers the types of a mention that are supported by its textual context, Entity-level Typing infers the types of an entity by considering all of the types supported by its associated data.
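For intuition, below is a minimal sketch of the kind of attention-based fusion such a model performs over heterogeneous sources (name, paragraphs, images). The module, tensor shapes, and single attention layer here are illustrative assumptions, not the paper's exact architecture.

```python
# Minimal sketch (assumptions, not the exact HEAT architecture): fuse name,
# paragraph, and image representations into one entity embedding via attention,
# then score all candidate types.
import torch
import torch.nn as nn

class EntityFusionSketch(nn.Module):
    def __init__(self, dim: int, num_types: int):
        super().__init__()
        self.attn = nn.Linear(dim, 1)            # one score per source vector
        self.classifier = nn.Linear(dim, num_types)

    def forward(self, name_vec, para_vecs, img_vecs):
        # name_vec: (B, D); para_vecs: (B, P, D); img_vecs: (B, I, D)
        sources = torch.cat([name_vec.unsqueeze(1), para_vecs, img_vecs], dim=1)  # (B, S, D)
        weights = torch.softmax(self.attn(sources).squeeze(-1), dim=-1)           # (B, S)
        entity = (weights.unsqueeze(-1) * sources).sum(dim=1)                     # (B, D)
        return self.classifier(entity)                                            # (B, num_types) logits
```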
- PyTorch version >= 1.7
- Python version >= 3.7
- transformers >= 3.3
- a GPU with 11GB of memory when running with the BiLSTM encoder
- a GPU with 32GB of memory when running with the BERT encoder
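A quick way to verify the versions above (this snippet is only a convenience sketch, not part of the repository):

```python
# Convenience check for the requirements listed above (not part of the repo).
import sys
import torch
import transformers

assert sys.version_info >= (3, 7), "Python >= 3.7 required"
assert tuple(map(int, torch.__version__.split("+")[0].split(".")[:2])) >= (1, 7), "PyTorch >= 1.7 required"
assert tuple(map(int, transformers.__version__.split(".")[:2])) >= (3, 3), "transformers >= 3.3 required"
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("GPU memory (GB):", torch.cuda.get_device_properties(0).total_memory / 1e9)
```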
Run `python init.py` to download the datasets. Run `bash run.sh` to train the model with each dataset.
- `-task` specifies the task id
- `-dataset` specifies the dataset name
- `-text_encoder` specifies the text encoder in {'bert', 'bert_freeze', 'lstm'}
- `-remove_name`, `-remove_para`, `-remove_img` make the model not use the corresponding modules
- `-without_token_attention`, `-without_cross_modal_attention` make the model not use the corresponding attention layers
- `-seed` specifies the random seed id
- `-consistency` makes the model also perform consistency training
- `-labeled_num` limits the number of labeled samples
- `-cpu` runs on CPU
- Other hyper-parameters are set in `config.py` (see the sketch after this list)
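As an illustration of how these flags fit together, here is a hedged `argparse` sketch mirroring the list above; the actual definitions (defaults, types, entry script) live in the repository's own code and may differ.

```python
# Illustrative sketch of the command-line flags listed above; defaults and
# exact types are assumptions, so consult the repo's scripts and config.py.
import argparse

parser = argparse.ArgumentParser(description="HEAT training options (sketch)")
parser.add_argument("-task", type=int, help="task id")
parser.add_argument("-dataset", type=str, help="dataset name")
parser.add_argument("-text_encoder", choices=["bert", "bert_freeze", "lstm"], help="text encoder")
parser.add_argument("-remove_name", action="store_true", help="disable the name module")
parser.add_argument("-remove_para", action="store_true", help="disable the paragraph module")
parser.add_argument("-remove_img", action="store_true", help="disable the image module")
parser.add_argument("-without_token_attention", action="store_true", help="disable token-level attention")
parser.add_argument("-without_cross_modal_attention", action="store_true", help="disable cross-modal attention")
parser.add_argument("-seed", type=int, help="random seed id")
parser.add_argument("-consistency", action="store_true", help="also perform consistency training")
parser.add_argument("-labeled_num", type=int, help="limit on the number of labeled samples")
parser.add_argument("-cpu", action="store_true", help="run on CPU")
args = parser.parse_args()
print(args)
```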
Four public datasets have been processed for the Entity-level typing with heterogeneous data task:
- TypeNet: aligns Freebase types to noun synsets from the WordNet hierarchy and eliminates types that are either very specific or very rare. The original dataset consists of more than two million mentions with the name and contexts. Our processed TypeNet consists of more than half a million entities with the name and relevant paragraphs. download
- MedMentions: contains annotations for a wide diversity of entities in the biomedical domain. The original dataset consists of nearly 250 thousand mentions. Our processed MedMentions consists of more than 50 thousand entities with the name and relevant paragraphs. download
- Flowers: (Oxford Flowers-102) provides text descriptions for each flower image. The objective is to classify the fine-grained flower name of the sample. We keep the original splits and use the cross-entropy loss since each sample has only one positive type. download
- Birds: (Caltech-UCSD Birds) provides text descriptions for each bird image. The objective is to classify the fine-grained bird name of the sample. We also keep the original splits and use the cross-entropy loss since each sample has only one positive type. download
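Since Flowers and Birds have exactly one positive type per sample while the entity-typing datasets are multi-label, the loss functions differ. The sketch below illustrates that distinction; the multi-label variant shown (per-type binary cross-entropy) is a common choice and an assumption here, not necessarily this repository's exact loss.

```python
# Loss sketch: single-label classification (Flowers/Birds) vs. multi-label typing.
# The multi-label loss shown (per-type BCE) is an assumed common choice.
import torch
import torch.nn.functional as F

logits = torch.randn(4, 102)                          # (batch, num_types)

# Single positive type per sample (e.g., Oxford Flowers-102): cross-entropy.
class_ids = torch.randint(0, 102, (4,))
single_label_loss = F.cross_entropy(logits, class_ids)

# Multiple positive types per entity: binary 0/1 targets per type.
targets = torch.randint(0, 2, (4, 102)).float()
multi_label_loss = F.binary_cross_entropy_with_logits(logits, targets)
print(single_label_loss.item(), multi_label_loss.item())
```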
Files in each dataset:
- `data.pkl` contains the name, raw paragraphs, and each typing task's labels for an entity
- `data_txt.pkl` is generated by `data_loader.py` and contains the tokenized indexes of the name and paragraphs
- `data_img.pkl` (optional) contains each entity's relevant image features
- `split.pkl` (optional) contains the split information of the train/valid/test sets by entity name
- `types.json` contains the name of each type
- `hierarchies.json` contains the taxonomy between types
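To inspect a processed dataset, the files above can be read with standard `pickle`/`json`. This is a loading sketch only; the internal structure of each object (field names, nesting) is not documented here, so treat the comments as assumptions and verify against `data_loader.py`.

```python
# Sketch for loading the dataset files described above; the exact structure of
# each loaded object is an assumption and should be checked in data_loader.py.
import json
import pickle

with open("data.pkl", "rb") as f:
    data = pickle.load(f)              # name, raw paragraphs, labels per entity

with open("types.json") as f:
    types = json.load(f)               # names of all types

with open("hierarchies.json") as f:
    hierarchies = json.load(f)         # taxonomy (relations between types)

print("entities:", len(data) if hasattr(data, "__len__") else type(data))
print("types:", len(types) if hasattr(types, "__len__") else type(types))
```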