Data Integrity
Hannibal046 opened this issue · 2 comments
Hi, after downloading the full data from the link you emailed, I found that the set of paper_id values in metadata_0.jsonl.gz does not equal that of pdf_parses_0.jsonl.gz. Am I getting something wrong? Is it possible that a paper whose paper_id is in metadata_0.jsonl.gz appears in one of the pdf_parses_x.jsonl.gz files? Thanks so much!
@Hannibal046 The paper_ids are the same between each similarly numbered metadata and pdf_parse set. The metadata file will have many more paper_ids, since it includes papers where we do not have any full text. All entries with has_pdf_parse: True in the metadata entry will have a corresponding entry in the pdf_parse file.
For example, in metadata_0.jsonl.gz, there are 1366661 entries. Only 310736 of these have a PDF, all of which have corresponding entries in pdf_parses_0.jsonl.gz.
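The relationship described above can be checked locally. The following is only a minimal sketch, not an official tool: it assumes each line of both files is a JSON object with a top-level paper_id key, and that has_pdf_parse is a top-level boolean in the metadata entries. The helper names read_jsonl_gz and compare_shard are made up for this illustration.

```python
import gzip
import json


def read_jsonl_gz(path):
    """Yield one parsed JSON object per non-empty line of a gzipped JSONL file."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            if line.strip():
                yield json.loads(line)


def compare_shard(metadata_path, pdf_parses_path):
    """Compare paper_ids between a metadata shard and its same-numbered pdf_parses shard.

    Assumes (hypothetically) that "paper_id" and "has_pdf_parse" are
    top-level keys of each metadata entry.
    """
    all_ids = set()
    flagged_ids = set()
    for entry in read_jsonl_gz(metadata_path):
        all_ids.add(entry["paper_id"])
        # Only entries flagged has_pdf_parse are expected in the pdf_parses shard.
        if entry.get("has_pdf_parse"):
            flagged_ids.add(entry["paper_id"])

    parsed_ids = {entry["paper_id"] for entry in read_jsonl_gz(pdf_parses_path)}

    return {
        "metadata_total": len(all_ids),
        "metadata_with_pdf_parse": len(flagged_ids),
        "pdf_parses_total": len(parsed_ids),
        # Flagged in metadata but absent from the pdf_parses shard.
        "flagged_but_missing": flagged_ids - parsed_ids,
        # Present in the pdf_parses shard but not flagged in metadata.
        "parsed_but_unflagged": parsed_ids - flagged_ids,
    }
```

Calling compare_shard("metadata_0.jsonl.gz", "pdf_parses_0.jsonl.gz") should, if the description above holds, return an empty flagged_but_missing set, with metadata_total noticeably larger than pdf_parses_total.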
Please let me know if anything is still unclear.
Thanks so much! That solves all my problems.