COST STSM Wikification 2019.3

Archive of the code created for the Short-Term Scientific Mission (STSM) titled "Named-entity recognition with wikification using Wikidata in the Hungarian part of the ELTEc corpus".

hun_POS_5000_e-magyar.csv: A stratified random sample of 5000 tokens from the Hungarian ELTEc corpus, POS-tagged with e-magyar
main.py: A pilot implementation of the wikifier that uses BERT embeddings of the surrounding context to disambiguate entities
query3.sparql, query_wikidata.sh: The SPARQL query and the script that runs it to gather names and aliases for every living-person entry in Wikidata. This yields more customizable results than the NECKAr dataset
result.csv, result3.csv, result3_mod.csv: The results of the SPARQL query
scratch.py: Gathers statistics from the results obtained from Wikidata
tagged.txt: The wikified corpus sample. (No entries were found because all the names belong to fictional characters.)
wikifier: Tool to parse eltec-level0 and annotate words with Wikipedia entries. Makes use of http://wikifier.org/
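
For orientation, the kind of query described for query3.sparql can be sketched as follows. This is a minimal illustration and not the contents of query3.sparql or query_wikidata.sh: the function names, the language filter and the LIMIT are assumptions made here. It relies only on standard Wikidata identifiers (P31 "instance of", Q5 "human", P570 "date of death") and on the public query endpoint, and treats a person as living when no date of death is recorded.

```python
from urllib.parse import urlencode

# Public Wikidata Query Service endpoint.
WIKIDATA_ENDPOINT = "https://query.wikidata.org/sparql"


def build_living_person_query(language="hu", limit=100):
    """Return a SPARQL query for names and aliases of living persons.

    Hypothetical helper: the language filter and LIMIT are illustrative
    assumptions, not taken from query3.sparql.
    """
    return f"""
SELECT ?person ?name ?alias WHERE {{
  ?person wdt:P31 wd:Q5 .                                # instance of: human
  FILTER NOT EXISTS {{ ?person wdt:P570 ?death . }}      # no date of death
  ?person rdfs:label ?name .
  FILTER(LANG(?name) = "{language}")
  OPTIONAL {{
    ?person skos:altLabel ?alias .                       # aliases, if any
    FILTER(LANG(?alias) = "{language}")
  }}
}}
LIMIT {limit}
""".strip()


def build_request_url(query):
    """Build a GET URL for the endpoint (hypothetical helper, no network call)."""
    return WIKIDATA_ENDPOINT + "?" + urlencode({"query": query, "format": "json"})


if __name__ == "__main__":
    query = build_living_person_query(language="hu", limit=10)
    print(query)
    print(build_request_url(query))
```

Without a LIMIT, a query over all humans in Wikidata is likely to hit the public endpoint's timeout, so a real run would probably need paging or a narrower filter.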

dlazesz/COST-STSM-Wikification-2019.3
