OpenWebTextAsync

This implementation builds on the work of jcpeterson et al. and focuses on one thing: downloading the HTML contents as fast as possible. It does so by scraping asynchronously and in parallel across multiple workers, using only Firebase for coordination.
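As a rough illustration of the asynchronous part, here is a minimal sketch of bounded concurrent downloading with asyncio. It is not this project's actual code: `download_all`, `fake_fetch`, and `max_concurrency` are hypothetical names, and `fake_fetch` stands in for a real HTTP client call so the example runs without network access. Coordination across workers through Firebase is not shown.

```python
import asyncio
from typing import Awaitable, Callable, Dict, Iterable, Optional


async def download_all(
    urls: Iterable[str],
    fetch: Callable[[str], Awaitable[str]],
    max_concurrency: int = 100,
) -> Dict[str, Optional[str]]:
    """Fetch every URL concurrently, capping the number of in-flight requests."""
    semaphore = asyncio.Semaphore(max_concurrency)
    results: Dict[str, Optional[str]] = {}

    async def worker(url: str) -> None:
        async with semaphore:
            try:
                results[url] = await fetch(url)
            except Exception:
                # A dead link should not stop the rest of the batch.
                results[url] = None

    await asyncio.gather(*(worker(url) for url in urls))
    return results


async def fake_fetch(url: str) -> str:
    """Hypothetical stand-in for an HTTP GET; fails for URLs containing 'bad'."""
    await asyncio.sleep(0.01)
    if "bad" in url:
        raise ValueError("simulated failure")
    return f"<html>{url}</html>"


if __name__ == "__main__":
    pages = asyncio.run(download_all(["a", "bad", "c"], fake_fetch, max_concurrency=2))
    print(pages)
```

The semaphore keeps the number of simultaneous requests bounded, so a single worker does not exhaust sockets or memory, while failures are recorded as `None` instead of aborting the batch.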

Dependencies

If you use pipenv (pip install --user pipenv), cd to the project root and run:

pipenv install 
pipenv shell

Otherwise, run the following in a new virtual environment:

pip3 install -r requirements.txt

Requirements

Download the pre-filtered URLs here and concatenate them all into a single file named "urls.txt"
Google Drive storage (preferably at least 40 GB)
Many servers for scraping (preferably with fast download speeds)
A Firebase project with a Realtime Database set up
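
The concatenation step in the first requirement can be sketched as follows. This assumes the downloaded URL lists have been extracted as plain .txt files, one URL per line, under a single directory; the directory layout and the helper name `concatenate_url_files` are illustrative assumptions, not part of this project.

```python
from pathlib import Path


def concatenate_url_files(src_dir: str, out_path: str = "urls.txt") -> int:
    """Merge every .txt file under src_dir into out_path; return the URL count."""
    out = Path(out_path).resolve()
    count = 0
    with out.open("w", encoding="utf-8") as sink:
        # Sorted order keeps the output reproducible across runs.
        for path in sorted(Path(src_dir).rglob("*.txt")):
            if path.resolve() == out:
                continue  # never read the file we are writing
            with path.open(encoding="utf-8", errors="ignore") as source:
                for line in source:
                    url = line.strip()
                    if url:  # skip blank lines
                        sink.write(url + "\n")
                        count += 1
    return count
```

For example, `concatenate_url_files("url_dumps")` would write "urls.txt" in the current directory, where "url_dumps" is a placeholder for wherever the lists were extracted.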

Original OpenAI project links

Blog Post (Better Language Models and Their Implications)
Paper (Language Models are Unsupervised Multitask Learners)
Code (https://github.com/openai/gpt-2)

chiayewken/openwebtext_async