allenai/s2orc

Data Integrity

Hannibal046 opened this issue · 2 comments

Hi, after downloading the full data from the link you emailed, I found that the set of paper_id values in metadata_0.jsonl.gz is not equal to the set in pdf_parses_0.jsonl.gz. Am I getting something wrong? Is it possible that a paper whose paper_id is in metadata_0.jsonl.gz appears in a different pdf_parses_x.jsonl.gz file? Thanks so much!

@Hannibal046 The paper_ids match between each metadata file and the pdf_parses file with the same number. The metadata file will have many more paper_ids, since it includes papers for which we do not have any full text. Every metadata entry with has_pdf_parse: True will have a corresponding entry in the pdf_parses file.

For example, in metadata_0.jsonl.gz, there are 1366661 entries. Only 310736 of these have a PDF, all of which have corresponding entries in pdf_parses_0.jsonl.gz.
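If you want to check this yourself, here is a minimal sketch. It assumes each line of both files is a JSON object with a paper_id field, and that has_pdf_parse is stored as a boolean in the metadata entries. The helper names read_paper_ids and compare_shard are made up for illustration and are not part of any S2ORC tooling.

```python
import gzip
import json


def read_paper_ids(path, only_with_pdf_parse=False):
    """Collect paper_id values from a gzipped JSON Lines file.

    If only_with_pdf_parse is set, keep only entries whose
    has_pdf_parse field is truthy (assumed to be a JSON boolean).
    """
    ids = set()
    with gzip.open(path, "rt", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            record = json.loads(line)
            if only_with_pdf_parse and not record.get("has_pdf_parse"):
                continue
            ids.add(record["paper_id"])
    return ids


def compare_shard(metadata_path, pdf_parses_path):
    """Return (missing_parses, unexpected_parses) for one shard.

    missing_parses: flagged has_pdf_parse in metadata, absent from pdf_parses.
    unexpected_parses: present in pdf_parses, not flagged in metadata.
    """
    expected = read_paper_ids(metadata_path, only_with_pdf_parse=True)
    actual = read_paper_ids(pdf_parses_path)
    return expected - actual, actual - expected
```

Calling compare_shard("metadata_0.jsonl.gz", "pdf_parses_0.jsonl.gz") should return two empty sets if the shard is consistent. Comparing the unfiltered metadata IDs against the pdf_parses IDs will always show a difference, because of the papers without full text.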

Please let me know if anything is still unclear.

Thanks so much! That solves all my problems.