Some doubts on the datasets
wujsAct opened this issue · 2 comments
KNET is excellent work and is very useful for many applications.
I recently read your AAAI 2018 paper and downloaded this code.
I found that the context sequences in train_context.npy, valid_context.npy and test_context.npy
appear to be out of order, so it may be impossible for us to reuse this data.
Also, are the left and right context sequences each 15 words long?
The *_context.npy files are organized in the following way.
For a sentence
...a5, a4, a3, a2, a1, ENTITY WORDS, b1, b2, b3, b4, b5, ...
it's stored as [a1, b1, a2, b2, ..., a15, b15].
And yes, the context on each side has a window of 15 words, which sometimes extends beyond the sentence boundary. If there are not enough words, say at the beginning of a paragraph, unk is used as padding.
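To make the layout concrete, here is a minimal sketch of how one interleaved row could be built and then split back into reading-order left and right contexts. The function names build_context and split_context are hypothetical, not part of the KNET code. It assumes each row holds exactly 30 entries along its first axis, and it uses the string "unk" as a stand-in for whatever padding value the files actually contain.

```python
import numpy as np

WINDOW = 15  # context words per side, as described above


def build_context(tokens, start, end, pad="unk"):
    """Interleave the WINDOW words around the entity span tokens[start:end].

    Output order is [a1, b1, a2, b2, ..., a15, b15], where a1 is the word
    just left of the entity and b1 the word just right of it.
    `pad` is an assumed placeholder for the padding value.
    """
    out = []
    for i in range(1, WINDOW + 1):
        left_idx = start - i        # i-th word to the left of the entity
        right_idx = end - 1 + i     # i-th word to the right of the entity
        out.append(tokens[left_idx] if left_idx >= 0 else pad)
        out.append(tokens[right_idx] if right_idx < len(tokens) else pad)
    return out


def split_context(row):
    """Recover reading-order left and right contexts from one interleaved row."""
    row = np.asarray(row)
    assert row.shape[0] == 2 * WINDOW, "expected 15 left + 15 right entries"
    left = row[0::2][::-1]   # a1..a15 reversed, giving a15..a1
    right = row[1::2]        # b1..b15 is already in reading order
    return left, right


# Toy example: the entity is the single token "ENTITY" at index 2.
tokens = ["w0", "w1", "ENTITY", "w3"]
row = build_context(tokens, start=2, end=3)
left, right = split_context(row)
```

With a loaded array, split_context would be applied to each row, for example to the rows of np.load("valid_context.npy"), assuming the interleaved dimension is the first axis of each row.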
Does this solve your doubts?
Thanks for your explanation.