YoungXiyuan/DCA

Source of entity NER classes

RVACardoso opened this issue · 5 comments

Hi! First of all, thank you for sharing your very interesting work!

I have a simple question: how did you generate the file "./data/entity2type.pkl"?
This is essentially the same as asking: how did you determine the NER class (PER, ORG, LOC or UNK) best suited for each entity?

Thanks!

Thanks for your interest in our work, and sorry for my late reply.

I am very glad to answer your question. We selected those four NER classes (PER, ORG, LOC and UNK) because the most important dataset in our research is AIDA, created by the Max Planck Institute (paper: Robust Disambiguation of Named Entities in Text), and the AIDA dataset was annotated for the original CoNLL 2003 entity recognition task.

The CoNLL 2003 entity recognition task concentrates on four types of named entities: PER, ORG, LOC and MISC (which we treat as UNK). (paper: Introduction to the CoNLL-2003 Shared Task: Language-Independent Named Entity Recognition)

Therefore, this selection benefits evaluation performance on the AIDA test set, but may not bring advantages on the other cross-domain datasets.

You have actually pointed out a potential optimization that could further improve our DCA framework, so feel free to give it a try! (-:

Thank you for your answer. That makes sense, and leveraging the existing datasets is a good approach.

However, I have another question: how did you determine the NER class of each candidate entity?
For example, consider a mention of the gold entity Barack Obama: "Last Tuesday, Obama gave a speech in the White House." We know the mention type (from the CoNLL 2003 entity recognition task), but how do we know the entity type for all the possible candidates?

Thanks!

Thank you, that is another good question.

Let me explain in more detail how we obtain the mention type, the candidate entity type and the type embedding.

(1) Mention type: For the AIDA dataset, each mention type can be obtained directly from the CoNLL 2003 entity recognition annotations. We then use AIDA to train the NFETC typing system and predict types for mentions in the other five cross-domain datasets (a rough sketch follows this list).

(2) Candidate entity type: Each candidate entity corresponds to a Wikipedia item as well as a Freebase item, so we can look up the types of a candidate entity in Freebase using its surface name/string as the key. Some simple post-processing may be needed (see the second sketch below).

(3) Type embedding: The type embedding in the DCA framework is a 4×5 matrix that is randomly initialized (see the variable "self.type_emb" at line #67 of "mulrel_ranker.py") and then learned jointly with the other model parameters during training (see lines #131-143 of "mulrel_ranker.py"). The third sketch below illustrates the idea.
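
For (1), a minimal sketch of mapping CoNLL 2003 NER tags to our four coarse classes could look like the following; the particular mapping written here is only an assumption for illustration, not necessarily the exact one used in the released files.

```python
# Illustrative only: this CoNLL-tag-to-coarse-class mapping is an assumption,
# not the exact rule used to produce the released type files.
CONLL_TO_COARSE = {
    "PER": "PER",
    "ORG": "ORG",
    "LOC": "LOC",
    "MISC": "UNK",  # MISC is collapsed into UNK in our four-class setup
}

def mention_coarse_type(conll_tag: str) -> str:
    """Map a CoNLL 2003 NER tag to one of PER, ORG, LOC, UNK."""
    return CONLL_TO_COARSE.get(conll_tag, "UNK")
```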
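
For (2), the idea can be sketched roughly as below; the `freebase_types` lookup table and the keyword rules for collapsing Freebase types into the four coarse classes are hypothetical stand-ins for the actual post-processing.

```python
import pickle

def coarse_type(freebase_types):
    """Collapse a list of Freebase type strings into PER / ORG / LOC / UNK.

    The keyword rules here are illustrative; the real post-processing may differ.
    """
    joined = " ".join(freebase_types).lower()
    if "person" in joined:
        return "PER"
    if "organization" in joined or "company" in joined:
        return "ORG"
    if "location" in joined or "country" in joined or "city" in joined:
        return "LOC"
    return "UNK"

def build_entity2type(candidate_entities, freebase_types):
    """freebase_types: dict mapping an entity's surface name to its Freebase types."""
    entity2type = {name: coarse_type(freebase_types.get(name, []))
                   for name in candidate_entities}
    with open("./data/entity2type.pkl", "wb") as f:
        pickle.dump(entity2type, f)
    return entity2type
```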
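
For (3), the essence of the type embedding can be sketched as follows; this mirrors the idea behind "self.type_emb" but is not the actual code in "mulrel_ranker.py", and the initialization scale is an assumption.

```python
import torch
import torch.nn as nn

class TypeEmbeddingSketch(nn.Module):
    """Minimal sketch of a 4x5 type embedding trained jointly with the model."""

    def __init__(self, n_types: int = 4, type_dim: int = 5):
        super().__init__()
        # Randomly initialized; updated by the optimizer like any other parameter.
        self.type_emb = nn.Parameter(torch.randn(n_types, type_dim) * 0.1)

    def forward(self, type_ids: torch.Tensor) -> torch.Tensor:
        # Look up the rows for a batch of coarse type ids in {0, 1, 2, 3}.
        return self.type_emb[type_ids]
```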

If you have any other questions, feel free to drop me a message (:

Once again, thank you for your answer and very thorough explanation.

I have one more question: you mention in your paper that you "train a typing system proposed by (Xu and Barbosa, 2018) on AIDA-train dataset, yielding 95% accuracy on AIDA-A dataset"; however, when I measure the named entity classification performance on your provided mention and entity type files, I obtain 64% accuracy for AIDA-A (my measurement is sketched below). Further inspection shows that 88% of these mistakes are mention types incorrectly labelled as UNK.
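
For reference, my measurement is roughly the sketch below; "mention2type.pkl" and its assumed structure (mention id -> (gold entity, predicted type)) are my own guesses about the released files, so the problem may well be on my side.

```python
import pickle

# Rough sketch of my measurement. "entity2type.pkl" is from ./data, while
# "mention2type.pkl" and its assumed value format are my own assumptions.
with open("./data/entity2type.pkl", "rb") as f:
    entity2type = pickle.load(f)   # entity name -> PER / ORG / LOC / UNK

with open("./data/mention2type.pkl", "rb") as f:
    mention2type = pickle.load(f)  # assumed: mention id -> (gold entity, predicted type)

hits = sum(1 for gold_entity, pred_type in mention2type.values()
           if pred_type == entity2type.get(gold_entity, "UNK"))
print(f"AIDA-A typing accuracy: {hits / len(mention2type):.2%}")
```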

Why is the named entity classification performance so much lower than reported? And why are so many mentions incorrectly classified as UNK?

Thanks again!

Sorry for my late reply.

You said: "when I measure the named entity classification performance on your provided mention and entity type files, I obtain 64% accuracy for AIDA-A."

Could you give more details about how you measured this? I suspect there may be an issue somewhere in the measurement procedure.