malllabiisc/WordGCN

Can we upload our own dataset?

sandro272 opened this issue · 5 comments

Do you have scripts available, or any easy way to convert raw data into your processed dataset files, so that I can test your model on my own dataset?

Hi @sandro272,
By dataset you mean training dataset (wikipedia corpus) or evaluation data?

@svjan5 Uh... I mean that I want to use our own dataset, so could you provide a script or method that converts raw data into your processed data files (e.g., voc2id.txt)? Thank you!

Ok, got it. Actually, I cannot provide a script for that because it requires obtaining a dependency parse of the text, which requires Stanford CoreNLP. So, you first need to get a dependency parse of the text; after that, I think everything is quite straightforward. voc2id.txt contains the mapping of tokens to their unique ids, and data.txt contains the list of tokens and the dependency parse edges for each sentence in the corpus. Let me know if you face any difficulty in the process.
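To illustrate the idea only (this is not the actual preprocessing script; it uses the `stanza` Python package as a stand-in for Stanford CoreNLP, and the `head|dependent|relation_id` edge encoding, lowercasing, and file layouts below are assumptions you should verify against the released data files):

```python
# Illustrative only -- not the preprocessing script used for the paper.
# Uses the `stanza` package as a stand-in for Stanford CoreNLP; the
# head|dependent|relation_id edge encoding and the lowercasing of tokens
# are assumptions, so check them against the released data files.
import stanza

# stanza.download('en')  # one-time model download
nlp = stanza.Pipeline('en', processors='tokenize,pos,lemma,depparse')

sentences = ["The quick brown fox jumps over the lazy dog."]

voc2id = {}  # token  -> unique id
dep2id = {}  # deprel -> unique id

def get_id(mapping, key):
    if key not in mapping:
        mapping[key] = len(mapping)
    return mapping[key]

with open('data.txt', 'w') as out:
    for sent in sentences:
        for s in nlp(sent).sentences:
            tok_ids = [get_id(voc2id, w.text.lower()) for w in s.words]
            edges = []
            for i, w in enumerate(s.words):
                if w.head == 0:      # skip the root pseudo-edge
                    continue
                # stanza heads are 1-indexed; convert to 0-indexed
                edges.append(f"{w.head - 1}|{i}|{get_id(dep2id, w.deprel)}")
            out.write(" ".join([str(len(tok_ids)), str(len(edges)),
                                *map(str, tok_ids), *edges]) + "\n")

with open('voc2id.txt', 'w') as out:
    for tok, idx in voc2id.items():
        out.write(f"{tok}\t{idx}\n")
```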

@svjan5 OK, thank you!

Hi,
I got a problem while trying to generate my own data.txt.
Specifically, I found that the original data.txt is not in the format you mention in README.md, which is:
<num_words> <num_dep_rels> tok1 tok2 tok3 ... tokn dep_e1 dep_e2 .... dep_em
The lines are actually organized like this (here is the first line of the original data.txt file):
15 14 15 24351 24351 10 7 436 2083 26 8385 121958 4986 215 13 6932 2293 2 1|0|26 5|1|11 5|2|23 5|3|34 5|4|7 7|6|11 5|7|9 9|8|7 7|9|38 9|10|13 13|11|2 13|12|7 10|13|16 5|14|10 21854 21854 3 15 659 2324 0 2397 0 479 328 4 5905 7965 0
This line has four parts. The first part, '15 14 15', I guess gives the counts of the other three parts? But what do those three parts actually represent?
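For reference, this is how I am currently splitting that line, assuming the first three numbers are the lengths of the three blocks that follow (just my guess):

```python
# Splitting the sample line under my guess: three counts, then
# <n1> token ids, <n2> dependency edges, and <n3> trailing numbers.
line = ("15 14 15 24351 24351 10 7 436 2083 26 8385 121958 4986 215 13 "
        "6932 2293 2 "
        "1|0|26 5|1|11 5|2|23 5|3|34 5|4|7 7|6|11 5|7|9 9|8|7 7|9|38 "
        "9|10|13 13|11|2 13|12|7 10|13|16 5|14|10 "
        "21854 21854 3 15 659 2324 0 2397 0 479 328 4 5905 7965 0")

parts = line.split()
n_tok, n_dep, n_tail = map(int, parts[:3])

tokens = parts[3 : 3 + n_tok]
edges  = parts[3 + n_tok : 3 + n_tok + n_dep]
tail   = parts[3 + n_tok + n_dep :]

# All three checks hold for this line, which supports the guess.
assert len(tokens) == n_tok and len(edges) == n_dep and len(tail) == n_tail
```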

Also, please update the README.md : )
Thanks!
@svjan5

p.s. I re-read batch_generator.cpp, and it seems the last part of each line (i.e., the sequence of numbers after the dependency relations) is read but not stored.
Therefore, would it work if I set the first three numbers to (number of words in the sentence, number of dependency relations, 0) and left the last part empty?
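In other words, something like this (untested; based only on my reading of batch_generator.cpp):

```python
# My proposed workaround: third count = 0, trailing block omitted,
# since batch_generator.cpp seems to read but never store it.
def format_line(token_ids, dep_edges):
    """token_ids: list of ints; dep_edges: (head, dependent, rel_id) triples."""
    edge_strs = [f"{h}|{d}|{r}" for h, d, r in dep_edges]
    return " ".join([str(len(token_ids)), str(len(edge_strs)), "0",
                     *map(str, token_ids), *edge_strs])

print(format_line([12, 7, 3], [(1, 0, 5), (1, 2, 8)]))
# -> 3 2 0 12 7 3 1|0|5 1|2|8
```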
