This implementation builds on the work of jcpeterson et al. and focuses on one thing in particular: downloading the HTML contents as fast as possible. It achieves this by scraping asynchronously, in parallel across multiple workers, using Firebase only for coordination.
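The worker pattern described above can be sketched as follows. This is a minimal illustration, not the repository's actual code: it uses a thread pool instead of the real async machinery, and a thread-safe in-memory `ClaimSet` stands in for the Firebase realtime database that coordinates which worker owns which URL. The `fetch` parameter is a hypothetical hook so the downloader can be swapped out.

```python
import threading
from concurrent.futures import ThreadPoolExecutor
from urllib.request import urlopen


class ClaimSet:
    """Thread-safe claim registry. A local stand-in for the Firebase
    realtime database used for cross-worker coordination."""

    def __init__(self):
        self._lock = threading.Lock()
        self._claimed = set()

    def claim(self, url):
        # Atomically claim a URL; returns False if another worker owns it.
        with self._lock:
            if url in self._claimed:
                return False
            self._claimed.add(url)
            return True


def default_fetch(url, timeout=10):
    # Plain HTML download; the real project tunes this path for speed.
    with urlopen(url, timeout=timeout) as resp:
        return resp.read().decode("utf-8", errors="replace")


def scrape_all(urls, fetch=default_fetch, workers=8):
    claims = ClaimSet()
    results = {}

    def worker(url):
        if not claims.claim(url):  # another worker already took this URL
            return
        try:
            results[url] = fetch(url)
        except Exception:
            results[url] = None  # record failures so the URL is not retried forever

    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() blocks until all submitted URLs are processed
        list(pool.map(worker, urls))
    return results
```

In the real setup the claim step is a transaction against Firebase, so workers on different servers never download the same page twice.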
If you use pipenv (`pip install --user pipenv`), `cd` to the project root and run:

```
pipenv install
pipenv shell
```
Otherwise, just run the following in a new virtual environment:

```
pip3 install -r requirements.txt
```
- Download the pre-filtered URLs here and concatenate them all into `urls.txt`
- Google Drive storage (preferably at least 40 GB)
- Many servers for scraping (preferably with fast internet download speed)
- A Firebase project with the Realtime Database set up
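Concatenating the downloaded URL lists can be done in one shell command. The directory name `url_dumps/` is an assumption; substitute wherever you extracted the files.

```shell
# Merge every downloaded URL list into a single urls.txt
# (url_dumps/ is a placeholder for your extraction directory)
cat url_dumps/*.txt > urls.txt
wc -l urls.txt   # sanity-check the total URL count
```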