jcpeterson/openwebtext

Open clone of OpenAI's unreleased WebText dataset scraper. This version uses pushshift.io files instead of the API for speed.

Python · GPL-3.0

Issues

Quick question
#45 opened a year ago by Anindyadeep
0
A question about dealing with dataset
#44 opened 2 years ago by etoilestar
0
Idea for further filtering
#43 opened 2 years ago by davidgilbertson
0
getting a datasets.utils.info_utils.NonMatchingSplitsSizesError when downloading the dataset from huggingface
#42 opened 2 years ago by lovodkin93
0
pre-filtered URLs can no longer be accessed
#37 opened 3 years ago by sunhmy
0
How to cite this version of openwebtext?
#36 opened 3 years ago by Guitaricet
2
Estimated disk space usage of scraped data?
#26 opened 4 years ago by dnola
1
How to resume download after an error?
#25 opened 4 years ago by drfinkus
1
Error with get_state in download.py
#22 opened 5 years ago by JohnGiorgi
1
pycurl error: transfer closed with X bytes remaining to read
#24 opened 4 years ago by drfinkus
0
Filtering extracted results
#23 opened 4 years ago by Jack000
2
Why is Newspaper3k used for html scraping?
#5 opened 5 years ago by tilmanrpk
6
extract_text.py is very slow and does not fully utilize multiprocessing
#12 opened 5 years ago by villmow
1
missing argument `--html_archive` in extract_text.py instructions
#14 opened 5 years ago by hughperkins
1
BPE
#8 opened 5 years ago by 8enmann
3
Getting the karma score from pushshift
#15 opened 5 years ago by ronnyli
1
Undeclared requirement
#11 opened 6 years ago by aakova
1
Can't download the 2G links
#1 opened 6 years ago by chiphuyen
7
(Also) parsing structured data while you're at it
#2 opened 6 years ago by westurner
5