A problem with dev, train, test CSV columns
Hi,
After following the instructions to set up the NewsQA dataset (https://github.com/Maluuba/newsqa), the columns in the dev, test, and train files are not ordered according to the requirements in the source code.
- The first issue is that in the CSV files the column "answer_token_ranges" should be "answer_char_ranges", as referenced in the source code (bidaf/newqa/prepro.py).
- The second issue is the column order in the header.
The CSV files look like this:
story_id,story_text,question,answer_token_ranges
294:297|None|None,"41,55,82,100,126,138,165,181,204,219,237",60:61,./cnn/stories/42d01e187213e86f5fe617fe32e716ff7fa3afc4.story
Shouldn't the header be the following?
answer_char_ranges,story_text,question,story_id.
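In the meantime, a workaround could be to read the split by column name rather than by position, so the header order no longer matters. A minimal sketch (not part of either repo; standard library only, assuming a file named train.csv as produced by the split step):

```python
# Minimal sketch: read the tokenized split by column name instead of position,
# so a reordered header doesn't break the loader.
import csv

with open("train.csv", newline="", encoding="utf-8") as f:
    reader = csv.DictReader(f)
    print("columns found:", reader.fieldnames)
    for row in reader:
        story_id = row["story_id"]
        question = row["question"]
        token_ranges = row["answer_token_ranges"]  # e.g. "60:61"
        break  # just sanity-check the first row
```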
Thanks
Hi RasberryJoy,
Thanks for reaching out. It seems the dataset/preprocessing code from Maluuba has changed since I last accessed the repo; see Maluuba/newsqa@0433cd9#diff-04c6e90faac2675aa89e2176d2eec7d8. Back when I wrote the code, the dataset used answer_char_ranges. Feel free to issue changes to the repo, or I'll take a look when I have time.
Thanks!
David
Hello, I have looked into this issue again and I believe the problem comes from Maluuba's NewsQA dataset, in the Tokenize and Split process (https://github.com/Maluuba/newsqa). The columns get mixed up after splitting the dataset into dev, train, and test, and this is what causes your newsqa.prepro script to fail. I will get back to you once it is resolved.
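For anyone who wants to reproduce the check, here is a quick sketch of my own (not from either repo) that just prints the header row of each split to confirm whether the column order differs between the files; it assumes the splits sit in the current directory under these names:

```python
# Print the header of each split file to compare column order.
import csv

for name in ("train.csv", "dev.csv", "test.csv"):
    with open(name, newline="", encoding="utf-8") as f:
        header = next(csv.reader(f))
    print(name, "->", header)
```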
I have found what is causing the error in the preprocessing step; anyone interested can check the proposed solution here: Maluuba/newsqa#16
Maluuba/newsqa has been updated to handle the fields more robustly. Please update Maluuba/newsqa#16 if there are any more issues with the order of the fields.
As for char vs. token ranges: sorry for the confusion. Char ranges were removed from the tokenized version because they aren't reliable after tokenization. There's an explanation here: Maluuba/newsqa#18
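To illustrate (a sketch only, assuming story_text in the tokenized CSV is space-separated tokens): an answer_token_ranges value such as "60:61" can be resolved directly against the tokenized story, whereas character offsets into the original article no longer line up once the text has been tokenized.

```python
# Resolve comma-separated "start:end" token ranges against tokenized text.
def answers_from_token_ranges(story_text, token_ranges):
    tokens = story_text.split(" ")
    answers = []
    for rng in token_ranges.split(","):
        start, end = (int(i) for i in rng.split(":"))
        answers.append(" ".join(tokens[start:end]))
    return answers

print(answers_from_token_ranges("the quick brown fox jumps", "1:3"))  # ['quick brown']
```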
I'd like to simplify the dataset by providing JSON and JSONL versions. Stay tuned!
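Roughly, a JSONL export could look like this (a sketch only; the field names mirror the current tokenized CSV and may change):

```python
# Convert the tokenized CSV split to JSONL: one JSON object per line.
import csv
import json

with open("train.csv", newline="", encoding="utf-8") as src, \
        open("train.jsonl", "w", encoding="utf-8") as dst:
    for row in csv.DictReader(src):
        dst.write(json.dumps({
            "story_id": row["story_id"],
            "story_text": row["story_text"],
            "question": row["question"],
            "answer_token_ranges": row["answer_token_ranges"],
        }) + "\n")
```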
Thanks for the feedback!