Only the First 100 Paragraphs

Question

Closed this issue 3 years ago · 3 comments

Hello, it looks like the "paragraphs" field of each example includes only the first 100 paragraphs. Could I get the dataset with the full paragraphs? Thank you!

Answer 1 · 2022-06-28T18:47:59.000Z

Sorry, I don't quite get what "paragraph" means. Do you mean the first 100 sentences of a certain Wikipedia page?

Answer 2 · 2022-07-02T05:49:27.000Z

Hi @wenhuchen, thanks for your reply and help. By "the first 100 paragraphs" I mean the first 100 paragraphs of each Wikipedia page. For example, in the train.hard.json.gzip file, each training example has these fields: "idx", "question", "context", "targets", "paragraphs". The "paragraphs" field contains at most the first 100 paragraphs of a Wikipedia page, but sometimes the answer falls outside those 100 paragraphs. So I was wondering if I could get the full list of "paragraphs".
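To get a rough idea of how often this happens, something like the following could count the examples whose answer never shows up in the stored paragraphs. This is only a sketch: it assumes the file is one gzip-compressed JSON list, that "targets" and "paragraphs" are both lists of strings, and that a plain substring match is good enough. None of that is confirmed in this thread, and the helper names are made up for illustration.

```python
import gzip
import json


def load_examples(path):
    """Load examples from a gzip-compressed JSON file.

    Assumption: the file holds a single JSON list of example dicts.
    """
    with gzip.open(path, "rt", encoding="utf-8") as f:
        return json.load(f)


def answer_in_paragraphs(example):
    """Return True if any non-empty target string occurs in any stored paragraph.

    Assumption: "targets" and "paragraphs" are lists of strings.
    """
    paragraphs = example["paragraphs"]
    return any(t and any(t in p for p in paragraphs) for t in example["targets"])


def count_uncovered(examples):
    """Count examples whose targets never appear in the stored paragraphs."""
    return sum(1 for ex in examples if not answer_in_paragraphs(ex))
```

For instance, `count_uncovered(load_examples("train.hard.json.gzip"))` would give the number of training examples affected, provided the assumptions above hold for that file.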

I just realized there is a Process.ipynb. Does this mean I can just remove the [:100] from the split_paragraphs function to get the full paragraphs list? Thanks again.

Answer 3 · 2022-07-02T14:01:28.000Z

Oh, yes, you should be able to get the full document that way, if the full page is provided. In most cases the first 100 paragraphs will cover the needed information.
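The change discussed above could look roughly like this. The actual split_paragraphs in Process.ipynb is not shown in this thread, so the body below is a hypothetical stand-in that splits on blank lines; the `max_paragraphs` parameter is also an assumption, added here so the cap can be switched off without deleting the slice.

```python
def split_paragraphs(text, max_paragraphs=100):
    """Hypothetical stand-in for the notebook's split_paragraphs.

    Splits on blank lines (an assumption about the page format) and keeps
    at most max_paragraphs entries. Pass max_paragraphs=None to keep all,
    which has the same effect as removing the [:100] slice.
    """
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    if max_paragraphs is None:
        return paragraphs
    return paragraphs[:max_paragraphs]
```

Calling `split_paragraphs(page_text, max_paragraphs=None)` would then return the full list, while the default keeps the current behavior of capping at 100.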