parsing full dataset?

Question

hp0404 opened this issue 3 years ago · 2 comments

Before downloading the latest full release, I'd like to clarify one thing. I noticed you have the s2orc-doc2json library. Once I have the full dataset, do I need to parse the zipped files manually, or do you upload processed JSONL files that don't require any parsing?

Thanks

Answer 1 · 2021-07-24T01:05:40.000Z

No, you do not need to do any additional parsing. The S2ORC dataset already consists of structured paper data and metadata.

The s2orc-doc2json library is provided so that you can convert other documents into the same format as S2ORC if you'd like.
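
As a rough illustration of what "no additional parsing" could look like in practice, here is a minimal Python sketch that reads a gzip-compressed JSONL file one record at a time. It assumes the release ships as gzipped JSON Lines, which this thread does not confirm. The file name `sample.jsonl.gz` and the fields `paper_id` and `title` are hypothetical placeholders, so check them against the files you actually download.

```python
import gzip
import json
import tempfile
from pathlib import Path
from typing import Iterator


def iter_jsonl_gz(path: Path) -> Iterator[dict]:
    """Yield one parsed record per non-empty line of a gzip-compressed JSONL file."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                yield json.loads(line)


def write_sample(path: Path) -> None:
    """Write a tiny stand-in file; the field names are hypothetical, not the real schema."""
    records = [
        {"paper_id": "1", "title": "First example paper"},
        {"paper_id": "2", "title": "Second example paper"},
    ]
    with gzip.open(path, "wt", encoding="utf-8") as handle:
        for record in records:
            handle.write(json.dumps(record) + "\n")


# Demo on a temporary stand-in file; replace the path with a downloaded shard.
with tempfile.TemporaryDirectory() as tmp_dir:
    sample_path = Path(tmp_dir) / "sample.jsonl.gz"  # hypothetical file name
    write_sample(sample_path)
    titles = [record["title"] for record in iter_jsonl_gz(sample_path)]

print(titles)
```

Reading line by line keeps memory use flat, which matters for a full release that is too large to load at once.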

Answer 2 · 2023-06-29T08:35:36.000Z

Can you provide some samples of using this dataset?