Question: Do you have missed entity statistics for the pre-processed datasets?

Closed this issue 6 years ago · 2 comments

Thanks a lot for open-sourcing your code. I have a question about the conversion process from the previous dataset formats to the standardised one. The preprocessing code skips an entity if it does not exist in the known set of entities. Could you provide statistics on how frequently this occurs for the following datasets?

ace2004.txt
aida_dev.txt
aida_test.txt
aida_train.txt
aquaint.txt
clueweb.txt
msnbc.txt
wikipedia.txt

Thanks a lot!

Answer 1 · 2019-05-19T17:25:29.000Z

If you execute this command:
end2end_neural_el/code$ python -m preprocessing.prepro_aida
you will see all the entities that are not found. From a quick look, there are only a handful of them, but some are repeated many times, such as the country "Bosnia".
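
To turn that printed output into rough counts, one option is to save it to a file and tally identical lines. The sketch below is only an illustration, not part of the repository: it assumes each missed entity is printed on its own line (the exact output format is not shown in this thread), and the names `tally_lines` and `print_top` are hypothetical.

```python
from collections import Counter


def tally_lines(lines):
    """Count how often each non-empty line occurs (whitespace-trimmed)."""
    return Counter(line.strip() for line in lines if line.strip())


def print_top(counts, k=20):
    """Print the k most frequent lines with their counts, most frequent first."""
    for line, n in counts.most_common(k):
        print(f"{n}\t{line}")


def tally_file(path):
    """Tally the lines of a saved log file (assumed: one message per line)."""
    with open(path, encoding="utf-8") as f:
        return tally_lines(f)
```

For example, after redirecting the command's output to a file of your choosing (say `missed.log`, a hypothetical name), `print_top(tally_file("missed.log"))` would list the most frequently repeated lines. If the script also prints other messages, they will be counted too, so the lines may need filtering first.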

Answer 2 · 2019-05-19T17:27:27.000Z

For the remaining datasets, execute:
end2end_neural_el/code$ python -m preprocessing.prepro_other_datasets