RuSSIR'16 Entity linking exercise

This exercise was given as part of the Entity Linking lecture at the 10th Russian Summer School in Information Retrieval (RuSSIR 2016).

Tasks

Complete the missing parts in el_cmn.py to implement a simple commonness baseline.
- I.e., link each mention to the entity with the highest commonness score.
- Sample solution: el_cmn_sol.py
Implement TAGME's voting approach for disambiguation by completing el_tagme.py.
- This builds on the previous exercise and already includes commonness computation.
- Note that the original TAGME approach includes additional pruning steps, which are omitted here, even though they would make a big difference in performance.
- Sample solution: el_tagme_sol.py
Optionally, you can implement any other disambiguation approach (including novel ideas of your own).
The input documents are in data/snippets.txt; the first column is the docID.
Write the results to a file, one annotation per line, using the following format: docID score entityID mention page-id
- where score is the annotation's confidence score and the last column is the literal string 'page-id'
- see data/output_cmn.txt for an example
Evaluation: evaluator_annot.py <qrel_file> <result_file> [score_threshold]
- If score_threshold is provided, the evaluation script considers only those annotations in the result file whose score is above the threshold and ignores the lower-confidence ones.
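
As a rough illustration of the commonness baseline and the result format described above, the sketch below links a mention to its most common entity and formats one result line. It is not taken from el_cmn.py or el_cmn_sol.py. It assumes that columns are tab-separated, that commonness is the mention-entity frequency divided by the "_total" frequency from mention_entity.tsv, and that scores are written with four decimals. All function names, entity IDs, and counts are hypothetical.

```python
from collections import defaultdict


def load_mention_counts(lines):
    """Parse 'mention<TAB>entity<TAB>frequency' rows (tab separator is an assumption)."""
    counts = defaultdict(dict)
    for line in lines:
        mention, entity, freq = line.rstrip("\n").split("\t")
        counts[mention][entity] = int(freq)
    return counts


def commonness(counts, mention):
    """Return {entity: frequency / total}, using the '_total' row as the denominator."""
    row = counts.get(mention, {})
    total = row.get("_total", 0)
    if total == 0:
        return {}
    return {e: f / float(total) for e, f in row.items() if e != "_total"}


def link_by_commonness(counts, mention):
    """Return (entity, score) for the entity with the highest commonness, or None."""
    scores = commonness(counts, mention)
    if not scores:
        return None
    entity = max(scores, key=scores.get)
    return entity, scores[entity]


def format_annotation(doc_id, score, entity_id, mention):
    """Build one result line: docID score entityID mention page-id."""
    return "\t".join([doc_id, "%.4f" % score, entity_id, mention, "page-id"])


def keep_above(annotations, threshold):
    """Keep annotations (tuples with the score at index 1) scoring above the threshold."""
    return [a for a in annotations if a[1] > threshold]


# Toy data; these rows are made up and do not come from data/mention_entity.tsv.
ROWS = [
    "new york\t_total\t10",
    "new york\tNew_York_City\t7",
    "new york\tNew_York_(state)\t3",
]
COUNTS = load_mention_counts(ROWS)
ENTITY, SCORE = link_by_commonness(COUNTS, "new york")
LINE = format_annotation("doc1", SCORE, ENTITY, "new york")
print(LINE)
```

The strict comparison in keep_above mirrors the "scores above the given threshold" wording; whether evaluator_annot.py treats a score equal to the threshold the same way is not stated here.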

See the code files under the nordlys directory.

Python v2.7 is required.

mention_entity.tsv: number of times a mention refers to a given entity
- Format: mention entity frequency
- When entity is "_total", frequency is the total number of times the mention was linked (to any entity)
entity_inlinks.tsv: total number of inlinks an entity has
- Format: entity frequency
entity_pairs_inlinks.tsv: number of inlinks two entities have in common
- Format: entity1 entity2 frequency
snippets.txt: 20 input text snippets (to be annotated)
- Format: id text
qrels.txt: ground truth annotations corresponding to snippets.txt
- Format: id 1 entityID mention tmpID
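
The inlink statistics above are the ingredients for an entity relatedness measure. The sketch below assumes the Milne-Witten relatedness that TAGME uses and TAGME's voting formula without the pruning steps; it is not taken from el_tagme.py or el_tagme_sol.py. The total number of entities (num_entities) is not part of the data files listed here, so the value used below is a placeholder. All function names and toy counts are hypothetical.

```python
import math


def mw_relatedness(inlinks, pair_inlinks, e1, e2, num_entities):
    """Milne-Witten relatedness from inlink counts, clipped to be non-negative."""
    if e1 == e2:
        return 1.0
    # The pair file may list a pair in either order (assumption), so try both.
    common = pair_inlinks.get((e1, e2), pair_inlinks.get((e2, e1), 0))
    n1, n2 = inlinks.get(e1, 0), inlinks.get(e2, 0)
    if common == 0 or min(n1, n2) == 0:
        return 0.0
    num = math.log(max(n1, n2)) - math.log(common)
    den = math.log(num_entities) - math.log(min(n1, n2))
    if den <= 0:
        return 0.0
    return max(0.0, 1.0 - num / den)


def vote(candidates_b, entity_a, rel):
    """Vote of mention b for entity_a: average of relatedness * commonness
    over the candidate entities of b ({entity: commonness})."""
    if not candidates_b:
        return 0.0
    total = sum(rel(e_b, entity_a) * cmn for e_b, cmn in candidates_b.items())
    return total / float(len(candidates_b))


def tagme_scores(candidates, mention, rel):
    """For each candidate entity of `mention`, sum the votes of all other mentions."""
    return {
        e_a: sum(vote(cands_b, e_a, rel)
                 for b, cands_b in candidates.items() if b != mention)
        for e_a in candidates[mention]
    }


# Toy counts; made up for illustration, not read from the .tsv files.
INLINKS = {"A": 100, "B": 50, "C": 80}
PAIRS = {("A", "B"): 25}
NUM_ENTITIES = 1000  # placeholder for the collection size


def REL(x, y):
    return mw_relatedness(INLINKS, PAIRS, x, y, NUM_ENTITIES)


REL_AB = REL("A", "B")
# Mention m1 is ambiguous between A and C; mention m2 has the single candidate B.
CANDIDATES = {"m1": {"A": 0.6, "C": 0.4}, "m2": {"B": 1.0}}
SCORES = tagme_scores(CANDIDATES, "m1", REL)
print(SCORES)
```

In this toy case only A shares inlinks with B, so m2 votes for A and gives C a score of zero. How the vote totals are combined with commonness to pick the final entity is left to the exercise.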

Method	Score threshold	Prec	Recall	F1
Commonness	0.5	0.4407	0.5629	0.4944
Commonness	0.7	0.4533	0.4675	0.4603
Commonness	0.9	0.6000	0.3532	0.4446
TAGME	0.5	0.4634	0.4929	0.4777
TAGME	0.7	0.4763	0.4233	0.4483
TAGME	0.9	0.5857	0.3357	0.4268

This exercise is based on the TAGME reproducibility code developed by Faegheh Hasibi.