thunlp/KNET

Some doubts on the datasets

wujsAct opened this issue · 2 comments

KNET is excellent work and is very useful for many applications.
I recently read your AAAI 2018 paper and downloaded this code.
I found that the context sequences in valid_context.npy, test_context.npy
and train_context.npy are out of order, so it may be impossible for us to reuse this data.
Also, are the left and right context sequences each 15 words long?

*_context.npy files are organized in the following way.
For a sentence

...a5, a4, a3, a2, a1, ENTITY WORDS, b1, b2, b3, b4, b5, ...

it's stored as [a1, b1, a2, b2, ..., a15, b15].

And yes, the context on each side has a window of 15 words, which sometimes extends beyond the sentence boundary. If there are not enough words, for example at the beginning of a paragraph, unk is used as padding.
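
If it helps, here is a minimal sketch of how the interleaved layout described above could be split back into left and right context in reading order. The function name split_context, the axis argument, and the assumption that the 30 context positions sit on one axis of the loaded array are all hypothetical; please check the actual array shape after np.load before relying on it.

```python
import numpy as np

def split_context(interleaved, window=15, axis=-1):
    """Split [a1, b1, a2, b2, ..., a15, b15] into left and right context.

    Assumption: the 2 * window interleaved positions lie along `axis`.
    Returns (left, right), where left is in reading order (a15 ... a1)
    and right is in reading order (b1 ... b15).
    """
    arr = np.asarray(interleaved)
    if arr.shape[axis] != 2 * window:
        raise ValueError(
            f"expected {2 * window} positions on axis {axis}, got {arr.shape[axis]}"
        )
    # Even positions hold the left context, nearest word first: a1, a2, ...
    left_nearest_first = np.take(arr, np.arange(0, 2 * window, 2), axis=axis)
    # Odd positions hold the right context, nearest word first: b1, b2, ...
    right = np.take(arr, np.arange(1, 2 * window, 2), axis=axis)
    # Reverse the left side so it reads a15, ..., a2, a1.
    left = np.flip(left_nearest_first, axis=axis)
    return left, right

# Toy example with string labels standing in for the stored entries.
demo = [f"{side}{i}" for i in range(1, 16) for side in ("a", "b")]
left, right = split_context(demo)
print(" ".join(left), "ENTITY WORDS", " ".join(right))
```

For a real file you would pass the loaded array and the axis that holds the 30 positions, for example split_context(np.load("valid_context.npy"), axis=1), again assuming that is how the array is laid out.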

Does this solve your doubts?

Thanks for your explanation.