jcpeterson/openwebtext
Open clone of OpenAI's unreleased WebText dataset scraper. This version uses pushshift.io files instead of the API for speed.
Python · GPL-3.0
Issues
- 0
Quick question
#45 opened by Anindyadeep - 0
A question about dealing with dataset
#44 opened by etoilestar - 0
Idea for further filtering
#43 opened by davidgilbertson - 0
getting a datasets.utils.info_utils.NonMatchingSplitsSizesError when downloading the dataset from huggingface
#42 opened by lovodkin93 - 0
pre-filtered URLs can no longer be accessed
#37 opened by sunhmy - 2
How to cite this version of openwebtext?
#36 opened by Guitaricet - 1
Estimated disk space usage of scraped data?
#26 opened by dnola - 1
How to resume download after an error?
#25 opened by drfinkus - 1
Error with get_state in download.py
#22 opened by JohnGiorgi - 0
- 2
Filtering extracted results
#23 opened by Jack000 - 6
Why is Newspaper3k used for html scraping?
#5 opened by tilmanrpk - 1
- 1
Getting the karma score from pushshift
#15 opened by ronnyli - 1
Undeclared requirement
#11 opened by aakova - 7
Can't download the 2G links
#1 opened by chiphuyen - 5