allenai/s2orc

parsing full dataset?

hp0404 opened this issue · 2 comments

before downloading the latest full release, I'd like to clarify one thing. I noticed you have the s2orc-doc2json library, so do I need to parse the zipped files myself once I have the full dataset, or do you upload processed JSONL files that don't require any parsing?

thanks

no, you do not need to do any additional parsing. the s2orc dataset consists of structured paper data and metadata.

the s2orc-doc2json library is made available so that you can process other documents into the same format as s2orc if you'd like.
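Since the release files are already processed JSONL, reading them should only take a JSON parser. Here is a minimal sketch, assuming each file holds one JSON object per line and may be gzip-compressed. The helper name `iter_jsonl` is hypothetical and not part of s2orc or s2orc-doc2json.

```python
import gzip
import json
from typing import Iterator


def iter_jsonl(path: str) -> Iterator[dict]:
    """Yield one parsed record per non-empty line of a JSONL file.

    Hypothetical helper: assumes one JSON object per line and treats
    any path ending in ".gz" as gzip-compressed.
    """
    opener = gzip.open if str(path).endswith(".gz") else open
    # "rt" gives decoded text lines for both gzip.open and the builtin open.
    with opener(path, "rt", encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if line:  # skip blank lines
                yield json.loads(line)
```

To use it, loop over `iter_jsonl("<path to one downloaded file>")` and inspect `record.keys()` on the first record to see which fields your release actually provides. Streaming line by line avoids loading a whole shard into memory.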

Can you provide some examples of how to use this dataset?