Question: Do you have missed entity statistics for the pre-processed datasets?
Hi There @NikosKolitsas,
Thanks a lot for open-sourcing your code. I have a question about the conversion from the previous dataset formats to the standardised one. The preprocessing code skips entities that are not in the known set of entities. Could you provide statistics on how often this happens for each of the following datasets?
ace2004.txt
aida_dev.txt
aida_test.txt
aida_train.txt
aquaint.txt
clueweb.txt
msnbc.txt
wikipedia.txt
Thanks a lot!
If you execute this command:
end2end_neural_el/code$ python -m preprocessing.prepro_aida
you will see all the entities that are not found. From a quick look, there are only a handful of them, but some are repeated many times, like the country "Bosnia".
For the rest of the datasets, you should execute:
end2end_neural_el/code$ python -m preprocessing.prepro_other_datasets
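If you want per-dataset numbers rather than reading the printed output, something along these lines could tally them. This is only a minimal sketch, not code from the repository: the function name `missed_entity_stats`, the `known` set, and the `gold` list are all hypothetical, and it assumes you have already collected the gold entities of one dataset and the known entity set yourself.

```python
from collections import Counter


def missed_entity_stats(gold_entities, known_entities):
    """Count gold entities that are absent from the known entity set.

    Hypothetical helper, not part of end2end_neural_el.
    Returns (total, missed, missed_counts), where missed_counts maps each
    missing entity to the number of times it occurs.
    """
    gold_entities = list(gold_entities)
    missed_counts = Counter(e for e in gold_entities if e not in known_entities)
    total = len(gold_entities)
    missed = sum(missed_counts.values())
    return total, missed, missed_counts


# Toy data for illustration only; these are not real dataset statistics.
known = {"Germany", "France"}
gold = ["Germany", "Bosnia", "France", "Bosnia", "Bosnia", "Atlantis"]

total, missed, counts = missed_entity_stats(gold, known)
print(f"{missed}/{total} gold entities skipped ({missed / total:.1%})")
for entity, n in counts.most_common():
    print(entity, n)
```

Counting occurrences as well as distinct entities matters here, because a single missing entity that is repeated many times can account for most of the skipped mentions.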