
Open clone of OpenAI's unreleased WebText dataset scraper

Primary language: Python. License: GNU General Public License v3.0 (GPL-3.0)

OpenWebTextAsync

This implementation builds on the work of jcpeterson et al. and focuses on one thing in particular: downloading the HTML contents as fast as possible. It does this by scraping asynchronously, in parallel across multiple workers, with Firebase as the only coordination mechanism.
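
The asynchronous part of this idea can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the project's actual code: `scrape_all`, `fake_fetch`, `max_concurrency` and `timeout_s` are hypothetical names, and the Firebase coordination between workers is left out entirely. A real worker would replace `fake_fetch` with a coroutine backed by an HTTP client.

```python
import asyncio
from typing import Awaitable, Callable, Dict, Iterable, Optional

# Hypothetical type: any coroutine function that takes a URL and returns HTML.
Fetcher = Callable[[str], Awaitable[str]]


async def scrape_all(
    urls: Iterable[str],
    fetch: Fetcher,
    max_concurrency: int = 100,
    timeout_s: float = 10.0,
) -> Dict[str, Optional[str]]:
    """Fetch all URLs concurrently; URLs that fail or time out map to None."""
    # The semaphore caps how many downloads are in flight at once.
    semaphore = asyncio.Semaphore(max_concurrency)

    async def fetch_one(url: str):
        async with semaphore:
            try:
                html = await asyncio.wait_for(fetch(url), timeout=timeout_s)
            except Exception:
                # One bad page should not stop the rest of the batch.
                html = None
            return url, html

    pairs = await asyncio.gather(*(fetch_one(u) for u in urls))
    return dict(pairs)


async def fake_fetch(url: str) -> str:
    """Stand-in for a real HTTP request (assumption, for demonstration only)."""
    await asyncio.sleep(0.01)
    if "bad" in url:
        raise ValueError("simulated download failure")
    return f"<html>{url}</html>"


if __name__ == "__main__":
    demo_urls = ["http://a.example", "http://bad.example"]
    print(asyncio.run(scrape_all(demo_urls, fake_fetch, max_concurrency=2)))
```

Because each download spends most of its time waiting on the network, a single process can keep many requests open at once; the concurrency cap is there to avoid exhausting sockets or memory on the worker.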

Dependencies

If you use pipenv (pip install --user pipenv), cd to the project root and run:

pipenv install 
pipenv shell

Otherwise, run the following in a new virtual environment:

pip3 install -r requirements.txt

Requirements

  • Download the pre-filtered URLs here and concatenate them into a single file named "urls.txt"
  • Google Drive storage (preferably at least 40GB)
  • Many servers for scraping (preferably with fast internet download speed)
  • Firebase project with a Realtime Database set up
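
The concatenation step in the first requirement could be done with a short script like the one below. It is only a sketch: it assumes the downloaded files are plain text with one URL per line and that they all sit in one directory. The function name `concatenate_url_files` and its `pattern` parameter are hypothetical, not part of this project.

```python
from pathlib import Path


def concatenate_url_files(src_dir, out_path="urls.txt", pattern="*.txt"):
    """Merge every matching file in src_dir into out_path, one URL per line.

    Returns the number of URLs written. Blank lines are skipped.
    """
    out = Path(out_path).resolve()
    # Collect sources first and exclude the output file itself, in case
    # urls.txt lives in the same directory as the downloaded files.
    sources = sorted(p for p in Path(src_dir).glob(pattern) if p.resolve() != out)

    count = 0
    with out.open("w", encoding="utf-8") as dst:
        for src in sources:
            with src.open(encoding="utf-8", errors="replace") as f:
                for line in f:
                    url = line.strip()
                    if url:
                        dst.write(url + "\n")
                        count += 1
    return count
```

For example, `concatenate_url_files("downloaded_urls")` would write "urls.txt" in the current directory, where `downloaded_urls` is an assumed name for the folder holding the downloaded files.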

Original OpenAI project links