git clone https://github.com/izuna385/Dual-encoder-with-BERT.git
cd Dual-encoder-with-BERT
python3 train.py -num_epochs 1
To speed up training further, you can use multiple GPUs.
CUDA_VISIBLE_DEVICES=0,1 python3 train.py -num_epochs 1 -cuda_devices 0,1
Re-implementation of the bi-encoders of [Gillick et al., '19] and [Humeau et al., '20].
- You can run Bi-encoder based Entity Linking experiments with your own datasets.
- These experiments are specifically for in-domain entity linking. For the zero-shot setting, see this repository.
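As a rough illustration of what a bi-encoder does at linking time, the sketch below scores one mention vector against a few entity vectors by dot product and picks the highest-scoring entity. The vectors, the ids, and the function names `dot` and `link` are made up for illustration. In the actual model both sides come from BERT encoders, and the scoring function used in this repository may differ (for example, cosine similarity).

```python
from typing import Dict, List


def dot(u: List[float], v: List[float]) -> float:
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def link(mention_vec: List[float], entity_vecs: Dict[str, List[float]]) -> str:
    """Return the id of the entity whose vector scores highest."""
    return max(entity_vecs, key=lambda cui: dot(mention_vec, entity_vecs[cui]))


# Toy, hand-written vectors standing in for encoder outputs (hypothetical).
entity_vecs = {
    "D000001": [0.9, 0.1, 0.0],
    "D000002": [0.1, 0.8, 0.3],
}
mention_vec = [1.0, 0.2, 0.1]

predicted = link(mention_vec, entity_vecs)
print(predicted)
```

Because entity vectors do not depend on the mention, they can be computed once and reused for every mention, which is the main practical appeal of the bi-encoder setup.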
See requirements.txt.
If allennlp is not installed in your local environment, follow the AllenNLP documentation.
You need cui2idx.json, idx2cui.json, cui2cano.json, and cui2def.json to encode the entities of the specified KB (or entity set).
- cui2idx.json and idx2cui.json: cui is the unique id of an entity, such as D0002131 for United States of America in DBpedia. idx is an integer assigned to each cui.
- cui2cano.json and cui2def.json: The canonical name is the name of each entity. Canonical names and definitions (the first sentence of the definition is often used here) must be split into tokens.
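The four entity files described above could be produced along these lines. This is only a sketch: the entity ids, names, and definitions are toy values, and storing tokens as lists of whitespace-split strings is an assumption. Check the files under ./dataset/ for the exact layout the training script expects.

```python
import json
import os
import tempfile

# Toy entity set: (cui, canonical name, definition). All values are hypothetical.
entities = [
    ("D000001", "Harry Potter", "Harry Potter is a series of fantasy novels ."),
    ("D000002", "United States of America", "The United States of America is a country ."),
]

cui2idx = {cui: idx for idx, (cui, _, _) in enumerate(entities)}
idx2cui = {idx: cui for cui, idx in cui2idx.items()}
# Assumption: tokens are stored as lists of strings, split on whitespace.
cui2cano = {cui: name.split() for cui, name, _ in entities}
cui2def = {cui: definition.split() for cui, _, definition in entities}

out_dir = tempfile.mkdtemp()  # replace with your own dataset directory
files = {
    "cui2idx.json": cui2idx,
    "idx2cui.json": idx2cui,  # note: JSON turns the integer keys into strings
    "cui2cano.json": cui2cano,
    "cui2def.json": cui2def,
}
for name, obj in files.items():
    with open(os.path.join(out_dir, name), "w", encoding="utf-8") as f:
        json.dump(obj, f, ensure_ascii=False, indent=2)

# Read one file back to confirm the round trip.
with open(os.path.join(out_dir, "idx2cui.json"), encoding="utf-8") as f:
    reloaded_idx2cui = json.load(f)
```

After a round trip through JSON, the keys of idx2cui are strings ("0", "1", ...), so convert them back with `int` if integer keys are needed.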
You also need annotated train/dev/test mentions.
See ./mention_dump_dir/xxx/ for more details.
- id2line.json: This contains all annotated mentions, including train, dev, and test.
"0": "D000001\tPER\tHarry\tThe success of the books and films has allowed the <target> Harry Potter </target> franchise ..."
- "0": Unique mention id.
- "D000001": Gold entity for each mention.
- "PER": Type, like ORG, LOC, and MISC. You can use a dummy tag because this type is not used for training.
- "Harry Potter": Raw mention string.
- "The success ...": One sentence which contains one mention. The mention is wrapped with the special tokens <target> and </target>.
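An id2line.json entry in the format described above can be read roughly as follows. The sample entry is modeled on the example above; the `Mention` tuple and the helpers `parse_line` and `marked_span` are hypothetical names, not part of this repository.

```python
import json
import re
from typing import NamedTuple


class Mention(NamedTuple):
    mention_id: str
    gold_cui: str
    entity_type: str
    raw_mention: str
    sentence: str


def parse_line(mention_id: str, line: str) -> Mention:
    """Split one tab-separated id2line value into its four fields."""
    # maxsplit=3 keeps any further tabs inside the sentence field.
    gold_cui, entity_type, raw_mention, sentence = line.split("\t", 3)
    return Mention(mention_id, gold_cui, entity_type, raw_mention, sentence)


def marked_span(sentence: str) -> str:
    """Return the text wrapped by <target> ... </target>."""
    match = re.search(r"<target>\s*(.*?)\s*</target>", sentence)
    if match is None:
        raise ValueError("no <target> ... </target> span found")
    return match.group(1)


# Inline sample standing in for the contents of id2line.json.
id2line = json.loads(
    '{"0": "D000001\\tPER\\tHarry Potter\\tThe success of the books and films '
    'has allowed the <target> Harry Potter </target> franchise ..."}'
)

mentions = {mid: parse_line(mid, line) for mid, line in id2line.items()}
first = mentions["0"]
print(first.gold_cui, first.entity_type, marked_span(first.sentence))
```

Comparing `raw_mention` with `marked_span(sentence)` for every entry is a cheap sanity check before training on your own data.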
- To check the scripts with the dummy datasets, run
python3 train.py -num_epochs 1
- Linking evaluation reports accuracy over all mentions (unnormalized), not normalized accuracy.
- Prepare the entity files mentioned above and the linking dataset.
- The required dataset formats can be checked in the ./dataset/ directory.
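The difference between the two accuracy figures mentioned for evaluation can be sketched as below. This assumes the definition commonly used in entity linking work, where normalized accuracy only counts mentions whose gold entity appears among the retrieved candidates. The function names and toy data are hypothetical, and the repository's own evaluation code may be organized differently.

```python
from typing import List, Sequence


def unnormalized_accuracy(gold: Sequence[str], predicted: Sequence[str]) -> float:
    """Fraction of all mentions that were linked to the gold entity."""
    correct = sum(g == p for g, p in zip(gold, predicted))
    return correct / len(gold)


def normalized_accuracy(
    gold: Sequence[str],
    predicted: Sequence[str],
    candidates: Sequence[List[str]],
) -> float:
    """Accuracy restricted to mentions whose gold entity is among the candidates."""
    kept = [(g, p) for g, p, c in zip(gold, predicted, candidates) if g in c]
    if not kept:
        return 0.0
    return sum(g == p for g, p in kept) / len(kept)


# Toy data: four mentions; the last one has no gold entity among its candidates.
gold = ["D1", "D2", "D3", "D4"]
predicted = ["D1", "D9", "D3", "D7"]
candidates = [["D1", "D5"], ["D2", "D9"], ["D3"], ["D7", "D8"]]

overall = unnormalized_accuracy(gold, predicted)
normalized = normalized_accuracy(gold, predicted, candidates)
print(overall, normalized)
```

The unnormalized figure is never higher than the normalized one, because mentions that cannot be linked correctly stay in the denominator.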
- Make dataset creation easier.
- Pip packaging.
MIT