davidgolub/QuestionGeneration

A problem with dev/train/test CSV columns

Closed this issue · 4 comments

Hi,

After following the instructions to set up the NewsQA dataset (https://github.com/Maluuba/newsqa), the columns in the dev, test, and train CSV files are not ordered as the source code requires.

  1. The first issue is that in the CSV files the column "answer_token_ranges" should be "answer_char_ranges", as it is referred to in the source code (bidaf/newqa/prepro.py).
  2. The second issue is the column order in the header.
    The CSV files look like this:
    story_id,story_text,question,answer_token_ranges
    294:297|None|None,"41,55,82,100,126,138,165,181,204,219,237",60:61,./cnn/stories/42d01e187213e86f5fe617fe32e716ff7fa3afc4.story

Shouldn't the header be the following?
answer_char_ranges,story_text,question,story_id
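For anyone hitting this before a fix lands, one quick way to detect the mix-up is to check whether the value under "story_id" actually looks like a story path, since the shuffled row above puts the path under a different header. This is a minimal sketch, not code from either repo; the file name train.csv and the ./cnn/stories/ prefix are assumptions taken from the example row quoted above.

```python
import csv

# Sketch: flag shuffled columns by checking that the "story_id" field
# really holds a story path (assumption based on the example row above).
with open("train.csv") as f:
    row = next(csv.DictReader(f))
    if not row["story_id"].startswith("./cnn/stories/"):
        print("Columns look shuffled: story_id holds %r" % row["story_id"])
```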

Thanks

Hi RasberryJoy,
Thanks for reaching out. It seems the dataset/preprocessing code from Maluuba has changed since I last accessed the repo; see Maluuba/newsqa@0433cd9#diff-04c6e90faac2675aa89e2176d2eec7d8. Back when I wrote the code, the dataset used answer_char_ranges. Feel free to submit changes to the repo, or I'll take a look when I have time.

Thanks!
David

Hello, I have looked into this issue again and I believe the problem comes from Maluuba's NewsQA dataset, in the tokenize-and-split process (https://github.com/Maluuba/newsqa). The columns get mixed up after splitting the dataset into dev, train, and test, and this is what causes your newsqa.prepro script to fail. I will get back to you once it is resolved.

I have found what was causing the error in the preprocessing step; anyone who is interested can check the proposed solution here: Maluuba/newsqa#16

Maluuba/newsqa has been updated to be more robust with the field processing. Please update Maluuba/newsqa#16 if there are any more issues with the order of the fields.

As for char vs. token ranges: sorry for the confusion. Char ranges were removed from the tokenized version because they aren't reliable after tokenization. There's an explanation here: Maluuba/newsqa#18
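For reference, a token range like 60:61 can be mapped back to the answer text by slicing the tokenized story. This is a minimal sketch, assuming story_text in the tokenized split is whitespace-delimited; the helper name is hypothetical and not from either repo.

```python
# Hypothetical helper: recover the answer text for a range like "60:61",
# assuming story_text from the tokenized split is whitespace-delimited.
def answer_from_token_range(story_text: str, token_range: str) -> str:
    start, end = (int(i) for i in token_range.split(":"))
    return " ".join(story_text.split()[start:end])
```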

I'd like to simplify the dataset by providing JSON and JSONL versions. Stay tuned!

Thanks for the feedback!