This exercise was given as part of the Entity Linking lecture at the 10th Russian Summer School in Information Retrieval (RuSSIR 2016).
Presentation slides: http://bit.ly/russir2016-el
- Complete the missing parts in el_cmn.py to implement a simple commonness baseline.
- I.e., link each mention to the entity with the highest commonness score.
- Sample solution: el_cmn_sol.py
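A minimal sketch of the commonness computation, assuming mention_entity.tsv is tab-separated (as its extension suggests) and contains a "_total" row per mention. The function names and any sample rows are hypothetical and are not taken from el_cmn.py or its sample solution.

```python
from collections import defaultdict

TOTAL_KEY = "_total"  # row holding the total link count of a mention


def parse_mention_entity(lines):
    """Parse 'mention<TAB>entity<TAB>frequency' rows into {mention: {entity: count}}."""
    counts = defaultdict(dict)
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        mention, entity, freq = line.split("\t")
        counts[mention][entity] = int(freq)
    return counts


def commonness(counts, mention):
    """Return {entity: count / total} for a mention; empty if the mention is unknown."""
    entities = counts.get(mention, {})
    total = entities.get(TOTAL_KEY, 0)
    if total == 0:
        return {}
    return {e: float(n) / total for e, n in entities.items() if e != TOTAL_KEY}


def link_by_commonness(counts, mention):
    """Return (entity, score) for the highest-commonness entity, or None."""
    scores = commonness(counts, mention)
    if not scores:
        return None
    best = max(scores, key=scores.get)
    return best, scores[best]
```

The returned score can serve as the annotation confidence, so that a score threshold can be applied at evaluation time.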
- Implement TAGME's voting approach for disambiguation by completing el_tagme.py.
- This builds on the previous exercise and already includes commonness computation.
- Note that the original TAGME approach includes additional pruning steps, which are omitted here, although they would make a big difference in performance.
- Sample solution: el_tagme_sol.py
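A rough sketch of the voting step, without pruning. It assumes candidate entities and their commonness scores are already available per mention, and that a relatedness function is passed in. All names are hypothetical; choosing the entity with the highest total vote is a simplification and may differ from what el_tagme.py expects.

```python
def vote(candidates, mention_a, entity_a, rel):
    """Total vote that all other mentions give to entity_a.

    candidates: {mention: {entity: commonness}}
    rel: function (entity, entity) -> relatedness in [0, 1]
    """
    total = 0.0
    for mention_b, cands_b in candidates.items():
        if mention_b == mention_a or not cands_b:
            continue
        # Average of relatedness weighted by commonness over the candidates of b.
        weighted = sum(rel(entity_a, e_b) * cmn for e_b, cmn in cands_b.items())
        total += weighted / len(cands_b)
    return total


def disambiguate(candidates, rel):
    """Pick, for each mention, the candidate entity with the highest total vote."""
    result = {}
    for mention, cands in candidates.items():
        if not cands:
            continue
        scores = {e: vote(candidates, mention, e, rel) for e in cands}
        result[mention] = max(scores, key=scores.get)
    return result
```

Passing the relatedness function as a parameter keeps the voting logic independent of how relatedness is computed from the inlink statistics.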
- Optionally, you can implement any other disambiguation approach (including novel ideas of your own).
- The input documents are in data/snippets.txt; the first column is the docID.
- The results (one annotation per line) must be written to a file in the following format:
docID score entityID mention page-id
- where score is the annotation confidence score and the last column is the string 'page-id'
- see data/output_cmn.txt for an example
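A small sketch of writing annotations in this format. The tab delimiter and the four-decimal score formatting are assumptions; check data/output_cmn.txt for the actual conventions. The function names are hypothetical.

```python
def format_annotation(doc_id, score, entity_id, mention):
    """Build one output line: docID, score, entityID, mention, literal 'page-id'."""
    fields = [str(doc_id), "%.4f" % score, entity_id, mention, "page-id"]
    return "\t".join(fields)  # assumption: columns are tab-separated


def write_annotations(annotations, out_file):
    """Write (doc_id, score, entity_id, mention) tuples, one annotation per line."""
    for doc_id, score, entity_id, mention in annotations:
        out_file.write(format_annotation(doc_id, score, entity_id, mention) + "\n")
```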
- Evaluation:
evaluator_annot.py <qrel_file> <result_file> [score_threshold]
- If score_threshold is provided, the evaluation script considers only annotations from the output file with scores above the given threshold and ignores lower-confidence annotations.
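To illustrate the effect of the threshold, here is a hedged sketch of score filtering followed by set-based precision, recall, and F1. It is not the logic of evaluator_annot.py: matching on (docID, entityID) pairs and the aggregation over all documents are assumptions, and the function names are hypothetical.

```python
def filter_by_threshold(annotations, threshold=None):
    """Keep (doc_id, score, entity_id) tuples whose score is strictly above the threshold."""
    if threshold is None:
        return list(annotations)
    return [a for a in annotations if a[1] > threshold]


def precision_recall_f1(predicted, gold):
    """Set-based scores over (doc_id, entity_id) pairs (assumed matching criterion)."""
    predicted, gold = set(predicted), set(gold)
    tp = len(predicted & gold)
    prec = float(tp) / len(predicted) if predicted else 0.0
    rec = float(tp) / len(gold) if gold else 0.0
    f1 = 2 * prec * rec / (prec + rec) if prec + rec > 0 else 0.0
    return prec, rec, f1
```

Raising the threshold removes low-confidence annotations, which tends to trade recall for precision.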
See the code files under the nordlys directory.
Python v2.7 is required.
- mention_entity.tsv: number of times a mention refers to a given entity
- Format:
mention entity frequency
- When entity="_total" it means the total number of times the mention was linked (to any entity)
- entity_inlinks.tsv: total number of inlinks an entity has
- Format:
entity frequency
- entity_pairs_inlinks.tsv: number of inlinks two entities have in common
- Format:
entity1 entity2 frequency
- snippets.txt: 20 input text snippets (to be annotated)
- Format:
id text
- qrels.txt: ground truth annotations corresponding to snippets.txt
- Format:
id 1 entityID mention tmpID
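The inlink counts in entity_inlinks.tsv and entity_pairs_inlinks.tsv are the kind of statistics used by the Milne-Witten relatedness measure, which TAGME relies on. The sketch below shows one way to compute it from such counts. The num_entities parameter (the total number of entities in the collection) is not provided by the files listed here and is an assumption, as are the function name and the in-memory dictionaries.

```python
import math


def relatedness(inlinks, common_inlinks, e1, e2, num_entities):
    """Milne-Witten relatedness from inlink counts.

    inlinks: {entity: inlink count}
    common_inlinks: {(entity1, entity2): number of shared inlinks}
    num_entities: total number of entities (assumed to be known externally)
    """
    common = common_inlinks.get((e1, e2), common_inlinks.get((e2, e1), 0))
    n1, n2 = inlinks.get(e1, 0), inlinks.get(e2, 0)
    if common == 0 or min(n1, n2) == 0:
        return 0.0  # no shared inlinks: treat as unrelated
    num = math.log(max(n1, n2)) - math.log(common)
    den = math.log(num_entities) - math.log(min(n1, n2))
    if den <= 0:
        return 0.0
    return max(0.0, 1.0 - num / den)  # clip negative values to zero
```

The pair lookup tries both orders because the file may list each entity pair only once.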
Method | Score threshold | Prec | Recall | F1 |
---|---|---|---|---|
Commonness | 0.5 | 0.4407 | 0.5629 | 0.4944 |
Commonness | 0.7 | 0.4533 | 0.4675 | 0.4603 |
Commonness | 0.9 | 0.6000 | 0.3532 | 0.4446 |
TAGME | 0.5 | 0.4634 | 0.4929 | 0.4777 |
TAGME | 0.7 | 0.4763 | 0.4233 | 0.4483 |
TAGME | 0.9 | 0.5857 | 0.3357 | 0.4268 |
This exercise was created based on the TAGME reproducibility code developed by Faegheh Hasibi.