malllabiisc/WordGCN

What does data.txt mean?

loginaway opened this issue · 3 comments

Hi,
I ran into a problem while trying to generate my own data.txt.
Specifically, the data.txt shipped with the repository does not follow the format described in README.md, which is:
<num_words> <num_dep_rels> tok1 tok2 tok3 ... tokn dep_e1 dep_e2 .... dep_em
Instead, each line is organized like this (the first line of the original data.txt):
15 14 15 24351 24351 10 7 436 2083 26 8385 121958 4986 215 13 6932 2293 2 1|0|26 5|1|11 5|2|23 5|3|34 5|4|7 7|6|11 5|7|9 9|8|7 7|9|38 9|10|13 13|11|2 13|12|7 10|13|16 5|14|10 21854 21854 3 15 659 2324 0 2397 0 479 328 4 5905 7965 0
This line has 4 parts. The first part, '15 14 15', looks like the lengths of the three parts that follow. If so, what do those three parts represent?
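
For anyone checking the same guess, here is a minimal sketch that splits such a line using the three leading counts. The function name `split_old_line` is made up for illustration and is not part of the repository. The sketch only assumes that the dependency entries are three integers joined by `|`; it does not interpret what each of the three integers means.

```python
def split_old_line(line):
    """Split one line of the old three-count layout into its three parts.

    Hypothetical helper, not from the WordGCN code base.
    """
    fields = line.split()
    # The first three integers are assumed to be the lengths of the three parts.
    n_words, n_deps, n_extra = (int(x) for x in fields[:3])
    body = fields[3:]
    if len(body) != n_words + n_deps + n_extra:
        raise ValueError("counts do not match the number of fields on the line")

    tokens = [int(t) for t in body[:n_words]]
    # Each dependency entry looks like "a|b|c"; keep it as a tuple of ints.
    deps = [tuple(int(p) for p in d.split("|"))
            for d in body[n_words:n_words + n_deps]]
    extra = [int(t) for t in body[n_words + n_deps:]]
    return tokens, deps, extra


sample = (
    "15 14 15 24351 24351 10 7 436 2083 26 8385 121958 4986 215 13 6932 2293 2 "
    "1|0|26 5|1|11 5|2|23 5|3|34 5|4|7 7|6|11 5|7|9 9|8|7 7|9|38 9|10|13 "
    "13|11|2 13|12|7 10|13|16 5|14|10 "
    "21854 21854 3 15 659 2324 0 2397 0 479 328 4 5905 7965 0"
)

tokens, deps, extra = split_old_line(sample)
print(len(tokens), len(deps), len(extra))
```

On the sample line the three part lengths come out as 15, 14 and 15, which matches the leading '15 14 15'.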

I re-read batch_generator.cpp, and it seems the last part of each line (the sequence of numbers after the dependency relations) is read but not stored.
So, would it work if I set the first three numbers to (number of words in the sentence, number of dependency relations, 0) and left the last part empty?

This problem has confused me for a long time. I tried setting the last part to the same values as the sentence tokens, but it kept failing with a segmentation fault.

Could you please describe the data.txt format and update README.md accordingly?

Thanks!
@svjan5

Hi @loginaway,
Yes, you got it right. The earlier data had some extra information that was not being used. I have updated the data now. Please have a look.

@loginaway Hi, I am also trying to use this model to process my own dataset, but I am new to this and running into many difficulties. Could you share the relevant code? That would be really helpful!

Hi @eyuansu62!
It's been a while and I could not find the relevant code. However, I remember that the official code runs fine as long as you follow the instructions in README.md to build your own dataset. This issue (#9) was about an old version of the official dataset being in the wrong format, not about the code, and the dataset has since been updated.
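
In case it helps, here is a rough sketch of writing one line in the README layout `<num_words> <num_dep_rels> tok1 ... tokn dep_e1 ... dep_em`. The helper name `format_readme_line` is hypothetical. It assumes each dependency entry is three integers joined by `|`, as in the sample line from this issue, and that the first two integers are zero-based word positions; the bounds check is my own assumption, so please verify against README.md.

```python
def format_readme_line(token_ids, dep_edges):
    """Build one data.txt line in the two-count layout from README.md.

    Hypothetical helper, not from the WordGCN code base.
    token_ids: list of integer word ids for one sentence.
    dep_edges: list of (a, b, rel) integer triples.
    """
    n_words = len(token_ids)
    for a, b, _rel in dep_edges:
        # Assumption: a and b index words of this sentence, starting at 0.
        if not (0 <= a < n_words and 0 <= b < n_words):
            raise ValueError(f"edge ({a}, {b}) points outside the sentence")

    parts = [str(n_words), str(len(dep_edges))]
    parts += [str(t) for t in token_ids]
    parts += [f"{a}|{b}|{rel}" for a, b, rel in dep_edges]
    return " ".join(parts)


# Toy sentence with three word ids and two dependency entries.
line = format_readme_line([10, 7, 436], [(1, 0, 26), (1, 2, 11)])
print(line)  # 3 2 10 7 436 1|0|26 1|2|11
```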
Hope it's helpful!