YoungXiyuan/DCA

Maybe ent_inlink mistakes?

ZacharyChenpk opened this issue · 7 comments

Hi! When I used your data, I found that some entities which should be related, such as 'Cambodia_national_football_team' and 'Football_Federation_of_Cambodia', have no link to each other, yet both have a link to 'Shrewsbury,_Pennsylvania'. That's strange, and I found more strange links and missing links when I tried to build a graph from them.
I used entityid_dictid_inlinks_uniq.pkl. I assumed that each dict entry expresses a relationship between the entity whose id is the key and the entities whose ids are in the values. Have I made a mistake, or is the data wrong?

I remember that "entityid_dictid_inlinks_uniq.pkl" should be a dict in which each key is an entity id (corresponding to one Wikipedia page) and each value is a list of non-repeating inlinks in that wiki page. (See code line # 318 in "mulrel_ranker.py")
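To make that reading of the file concrete, here is a minimal sketch of how the dict could be checked for a direct link between two entities. It assumes the pickle is a plain dict mapping an integer entity id to a list of integer ids; `load_inlinks`, `has_link`, `shared_inlinks`, and the toy ids are hypothetical and not part of the DCA code.

```python
import pickle


def load_inlinks(path):
    """Load the pickled mapping. The path is supplied by the caller."""
    with open(path, "rb") as f:
        return pickle.load(f)


def has_link(inlinks, a, b):
    """True if either id appears in the other's inlink list."""
    return b in inlinks.get(a, ()) or a in inlinks.get(b, ())


def shared_inlinks(inlinks, a, b):
    """Ids that appear in the inlink lists of both a and b."""
    return set(inlinks.get(a, ())) & set(inlinks.get(b, ()))


# Toy data with made-up ids, standing in for the real file
# (assumed shape: {entity_id: [entity_id, ...]}).
toy = {1: [7, 9], 2: [7], 7: [], 9: [1]}

direct = has_link(toy, 1, 2)        # no direct link in either direction
common = shared_inlinks(toy, 1, 2)  # but both lists contain id 7
```

On the real data, `load_inlinks("entityid_dictid_inlinks_uniq.pkl")` would replace `toy`, assuming the file sits in the working directory.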

I have read the code, and I found that the function "compute_coherence" feeds these values into "self.entity_embeddings", so I assume the values can be treated as the ids of other entities related to the key entity. The problem is that I cannot find any reasonable links even between the candidates in one document, and many of the links seem unreasonable.

“The problem is that I cannot find any reasonable links even between the candidates in one document, and many of the links seem unreasonable”

Sorry... Could you please show me a concrete example? I have forgotten a lot about this project, and I am a little confused about the term "the candidates in one document"...

For example, "Cambodia national football team" and "Football Federation of Cambodia" have no link to each other, but both have a link to "2011 State of the Union Address", which is unreasonable.
By "the candidates in one document", I meant that when I put all the candidates (entities) of all mentions in a document into one graph, there should be some links between these entities, because they fall under the same topics or domains and so should be linked.
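The graph experiment described here could be sketched roughly as follows. This is only an illustration under assumptions: the inlink dict is taken to map entity ids to lists of entity ids, an edge is drawn when either candidate appears in the other's list, and `candidate_edges` plus all ids are made up.

```python
from itertools import combinations


def candidate_edges(inlinks, candidates):
    """Undirected edges between candidates linked in either direction."""
    edges = set()
    for a, b in combinations(sorted(set(candidates)), 2):
        if b in inlinks.get(a, ()) or a in inlinks.get(b, ()):
            edges.add((a, b))
    return edges


# Toy inlink dict and candidate set for one document (made-up ids).
toy_inlinks = {10: [11, 99], 11: [99], 12: [], 99: []}
doc_candidates = [10, 11, 12]

edges = candidate_edges(toy_inlinks, doc_candidates)

# Candidates that end up with no edge to any other candidate.
connected = {node for edge in edges for node in edge}
isolated = set(doc_candidates) - connected
```

A large `isolated` set for a document whose mentions clearly share a topic is the kind of result reported above.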

  1. I am a little curious about the source of the three entities you listed. Are they all candidates of different mentions in the same doc? Or are some of them candidates, while others are just inlinked entities of different candidates of different mentions in the same doc?

  2. This is a very interesting attempt! I have 6 points to make:
    a) We did not generate the candidates ourselves; we just downloaded them from previous work.

    b) Maybe there are some indirect and potential links between these candidates, not direct links.

    c) Maybe there are some docs whose mentions don't share a unified topic.

    d) In the coherence computation, the candidates and their inlinked entities are used to retrieve their corresponding entity embeddings. So, to some degree, whether two entities are related to each other is essentially determined by their distance in vector space, not by their physical links (:

    e) In our paper, we accumulate previously linked entities to create a kind of "temporary topic" that exists in vector space, and candidates whose vectors are close to that topic are then preferred. So maybe that "temporary topic" can't be expressed in human language.

    f) To get "entityid_dictid_inlinks_uniq.pkl", we used JWPL (Java Wikipedia Library) to process Wikipedia dumps. Maybe there are some defects in that tool.
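Points d) and e) can be illustrated with a small sketch. It is not the DCA implementation: it assumes the "temporary topic" is simply the mean of the embeddings of previously linked entities and that closeness is cosine similarity. The names, the 2-d vectors, and both helper functions are made up.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(u, v))
    nu = math.sqrt(sum(x * x for x in u))
    nv = math.sqrt(sum(y * y for y in v))
    return dot / (nu * nv) if nu and nv else 0.0


def mean_vector(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


# Toy embeddings: A and B are already linked, C and D are candidates.
embeddings = {
    "A": [1.0, 0.0],
    "B": [0.9, 0.1],
    "C": [0.0, 1.0],
    "D": [0.8, 0.2],
}
linked = ["A", "B"]
candidates = ["C", "D"]

# Assumed stand-in for the "temporary topic": mean of linked embeddings.
topic = mean_vector([embeddings[e] for e in linked])
scores = {c: cosine(topic, embeddings[c]) for c in candidates}
best = max(scores, key=scores.get)
```

In this toy setup D is preferred because its vector lies close to the topic, regardless of whether any hyperlink connects D to A or B.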

All in all, I have to say that visualizing "entityid_dictid_inlinks_uniq.pkl" is an extremely interesting experiment. I truly hope you can find some good ideas to improve the performance of the DCA system (:

Hi @YoungXiyuan, you mentioned that entityid_dictid_inlinks_uniq.pkl maps an entity id to a list of non-repeating inlinks in that wiki page. Does the list of ids refer to a list of entity ids or a list of word ids?

@theblackcat102 As far as I remember, the list of ids should refer to a list of entity ids.