lephong/mulrel-nel

Why use two types of word segmentation methods?

Closed this issue · 4 comments

Hi, thanks for your code.

I have a question about the word segmentation methods applied to the contexts of mentions.
I noticed that you apply two different methods. In the first method, you obtain the words by splitting m['context'].

mulrel-nel/nel/ed_ranker.py

Lines 196 to 199 in db14942

lctx = m['context'][0].strip().split()
lctx_ids = [self.prerank_model.word_voca.get_id(t) for t in lctx if utils.is_important_word(t)]
lctx_ids = [tid for tid in lctx_ids if tid != self.prerank_model.word_voca.unk_id]
lctx_ids = lctx_ids[max(0, len(lctx_ids) - self.args.ctx_window//2):]
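To see what this first method does in isolation, here is a minimal, self-contained sketch. The vocabulary, stop-word list, and window size below are made-up stand-ins for `self.prerank_model.word_voca`, `utils.is_important_word`, and `self.args.ctx_window`:

```python
# Hypothetical stand-ins for the repo's vocabulary and importance filter.
UNK_ID = 0
VOCAB = {"president": 3, "barack": 1, "obama": 2, "visited": 4}
STOPWORDS = {"the", "a", "of"}
CTX_WINDOW = 4  # plays the role of self.args.ctx_window


def is_important_word(t):
    """Toy version of utils.is_important_word: drop stop words."""
    return t.lower() not in STOPWORDS


def left_context_ids(context):
    """Mimic the first method: whitespace-split, filter, map to ids, window."""
    lctx = context.strip().split()
    ids = [VOCAB.get(t.lower(), UNK_ID) for t in lctx if is_important_word(t)]
    ids = [i for i in ids if i != UNK_ID]  # drop unknown words
    # keep only the rightmost half-window of ids (closest to the mention)
    return ids[max(0, len(ids) - CTX_WINDOW // 2):]


print(left_context_ids("the President Barack Obama visited"))  # → [2, 4]
```

Note that the window is applied after filtering, so stop words and out-of-vocabulary tokens do not consume window slots.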

In the second method, the words are already segmented in files such as aida_train.txt.

mulrel-nel/nel/ed_ranker.py

Lines 215 to 216 in db14942

sent = conll_doc['sentences'][conll_m['sent_id']]
start = conll_m['start']

mulrel-nel/nel/ed_ranker.py

Lines 219 to 220 in db14942

snd_lctx = [self.model.snd_word_voca.get_id(t)
for t in sent[max(0, start - self.args.snd_local_ctx_window//2):start]]
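The second method can be sketched the same way. Here the sentence is already tokenized (as in aida_train.txt), and the vocabulary and window size are again made-up stand-ins for `self.model.snd_word_voca` and `self.args.snd_local_ctx_window`:

```python
# Hypothetical stand-ins for the second vocabulary and window size.
SND_UNK = -1
SND_VOCAB = {"united": 10, "nations": 11, "met": 12, "in": 13, "geneva": 14}
SND_LOCAL_CTX_WINDOW = 4  # plays the role of self.args.snd_local_ctx_window


def snd_left_context_ids(sent, start):
    """Mimic the second method: take the tokens immediately left of the
    mention, using the segmentation given by the dataset as-is."""
    window = sent[max(0, start - SND_LOCAL_CTX_WINDOW // 2):start]
    return [SND_VOCAB.get(t.lower(), SND_UNK) for t in window]


sent = ["United", "Nations", "met", "in", "Geneva"]
print(snd_left_context_ids(sent, 4))  # mention "Geneva" at index 4 → [12, 13]
```

The key contrast with the first method: no re-splitting or filtering happens here, so the window is taken directly over the dataset's own tokens.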

So I am wondering: why not apply a single, uniform method?

Sorry I can't recall completely. What I remember is:

The first context is for the local model, which is reused from Ganea and Hofmann (2017). The context they used (i.e., our first context) is cross-sentence, while our second context is not. Because of that, we reused their data preprocessing (and their code) to create the first context, and hence their word segmentation as well. For the second context, we simply used the word segmentation given by the dataset.

I don't think there is any major difference between the two segmentation methods. They were simply the easiest ways for us to compute the two different contexts.

Thanks for your reply!

May I ask another question? I noticed that the candidate entities are ranked in decreasing order of p(e|m). However, different mentions in the dataset usually have different numbers of candidates. Some mentions may have only 1 or 2 candidate entities, while others may have as many as 100.

I understand that there is a very large number of entities (as many as 200,000), so it is desirable to select a small set of candidates in advance. But I am wondering: why not select a uniform number of candidates for each mention, for example, 100 per mention? Is this also inherited from Ganea and Hofmann (2017)?

For a mention, the number of candidates depends on the alias dictionary extracted from Wikipedia. For instance, for the alias "Obama" there can be 1000 candidates, but for "Barack Obama" the number of candidates should be smaller, maybe only 50.

I remember that in the dataset some mentions have very few candidates because of the alias dictionary. That is why you see some mentions with only 2-3 candidates while others have 100.
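The per-mention candidate lists described above can be sketched with a toy alias dictionary. All entity names and probabilities here are made up for illustration, not taken from the real Wikipedia-derived dictionary:

```python
# Toy alias dictionary: mention string -> {entity: p(e|m)}.
# All entries are hypothetical illustrations.
ALIAS_DICT = {
    "Obama": {"Barack_Obama": 0.90, "Michelle_Obama": 0.06, "Obama,_Fukui": 0.04},
    "Obama, Fukui": {"Obama,_Fukui": 1.0},
}


def candidates(mention, n_cands=30):
    """Rank a mention's candidates by p(e|m), descending, and keep at most
    n_cands of them. Mentions with short alias entries naturally end up
    with fewer candidates than the cap."""
    ents = ALIAS_DICT.get(mention, {})
    ranked = sorted(ents.items(), key=lambda kv: kv[1], reverse=True)
    return [e for e, _ in ranked[:n_cands]]


print(candidates("Obama"))         # ambiguous alias: several candidates
print(candidates("Obama, Fukui"))  # specific alias: a single candidate
```

So the cap is uniform, but the actual list length varies per mention: a more specific alias simply has fewer entries in the dictionary, which is the behavior described in the answer above.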

Thanks for your reply!