Netcrawler is a very fast HTTP client, capable of pushing thousands of requests/second.
I have used it to download the index pages of the ~170 million .com domains with a c6g.4xlarge in ~10 hours, at an average speed of about 4400 requests/s.
The repository contains 3 solutions to this problem:
- requests.get in a loop
- multiprocessing with requests.get in a loop
- pycurl with multiprocessing
The first two exist only as baselines for measurement; the pycurl backend is the one intended for real use, although #2 can also perform quite well under ideal conditions (both baselines are sketched below).
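For context, the two baselines amount to roughly the following. This is a simplified sketch, not the repository's code; the function names and the chunking scheme are invented for illustration:

```python
import multiprocessing
import requests

def crawl_sequential(urls, timeout=1):
    """Baseline #1: one blocking requests.get per URL; throughput is bounded by per-request latency."""
    results = {}
    for url in urls:
        try:
            r = requests.get(url, timeout=timeout)
            results[url] = (r.status_code, r.content)
        except requests.RequestException as exc:
            results[url] = (None, str(exc))
    return results

def crawl_multiprocess(urls, workers=8, timeout=1):
    """Baseline #2: the same blocking loop, split across a pool of worker processes."""
    chunks = [urls[i::workers] for i in range(workers)]
    with multiprocessing.Pool(workers) as pool:
        merged = {}
        for part in pool.starmap(crawl_sequential, [(chunk, timeout) for chunk in chunks]):
            merged.update(part)
        return merged

if __name__ == "__main__":
    print(crawl_sequential(["http://example.com"]))
```

Each worker in baseline #2 still blocks on one request at a time, which is why it only reaches good throughput when every target responds quickly.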
To install the dependencies:
$ apt-get install libcurl4-gnutls-dev libgnutls28-dev python3-dev
$ pip install requests pycurl
To run the crawler:
$ python3 -m run --backend pycurl --urlfile ./lists/sample_100.txt --batchsize 100 --logfile logs/logfile.txt --datafile ./logs/datalog --timeout 1 --connect-timeout 1 --pycurl-workers-print-log True --pycurl-maxhandles 50 --nsserver 8.8.8.8
To analyse the data:
$ python3 -m analyse --file-glob './logs/datalog_*' --max-workers 4 --function ip
The code internally uses pycurl's CurlMulti interface, together with a ProcessPoolExecutor (from concurrent.futures, backed by multiprocessing) to scale to more than one core.
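As a rough illustration of how a CurlMulti loop combines with a process pool, here is a minimal sketch. It is not the repository's actual code: fetch_batch, the URL splitting, and the example URLs are made up, and it omits the handle capping (--pycurl-maxhandles), logging, and DNS options that the real crawler configures.

```python
import io
import pycurl
from concurrent.futures import ProcessPoolExecutor

def fetch_batch(urls, timeout=1):
    """Fetch a batch of URLs concurrently inside one process using CurlMulti."""
    multi = pycurl.CurlMulti()
    handles = []
    for url in urls:
        buf = io.BytesIO()
        c = pycurl.Curl()
        c.setopt(pycurl.URL, url)
        c.setopt(pycurl.WRITEDATA, buf)
        c.setopt(pycurl.TIMEOUT, timeout)
        c.setopt(pycurl.CONNECTTIMEOUT, timeout)
        c.setopt(pycurl.FOLLOWLOCATION, True)
        c.url, c.buf = url, buf          # stash for retrieval after the transfer
        multi.add_handle(c)
        handles.append(c)

    # Standard CurlMulti event loop: drive all transfers until none remain active.
    num_active = len(handles)
    while num_active:
        while True:
            ret, num_active = multi.perform()
            if ret != pycurl.E_CALL_MULTI_PERFORM:
                break
        multi.select(1.0)  # wait for socket activity on any handle

    results = {}
    for c in handles:
        # A fuller implementation would call multi.info_read() to separate
        # successes from failures; failed transfers show RESPONSE_CODE 0 here.
        results[c.url] = (c.getinfo(pycurl.RESPONSE_CODE), c.buf.getvalue())
        multi.remove_handle(c)
        c.close()
    return results

if __name__ == "__main__":
    urls = ["http://example.com", "http://example.org"]
    with ProcessPoolExecutor(max_workers=2) as pool:
        batches = [urls[i::2] for i in range(2)]  # split URLs across worker processes
        for batch_result in pool.map(fetch_batch, batches):
            print(len(batch_result), "responses")
```

The multi interface lets one process keep many transfers in flight at once, and the process pool multiplies that across cores.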
The files saved by the crawler and consumed by the analyser are gzip-compressed pickle files.
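If you want to inspect those files outside the analyser, they can be read back with the standard gzip and pickle modules. A minimal sketch, assuming one pickled object per file (the load_datalog helper is hypothetical, not part of the repository):

```python
import glob
import gzip
import pickle

def load_datalog(path):
    """Load one gzip-compressed pickle file produced by the crawler."""
    with gzip.open(path, "rb") as f:
        return pickle.load(f)

# Only unpickle files you generated yourself: pickle can execute arbitrary
# code on load, which is why moving away from it is listed as a planned
# improvement below.
for path in glob.glob("./logs/datalog_*"):
    records = load_datalog(path)
    print(path, type(records))
```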
Planned improvements:
- Migrate from pickle files to a less vulnerable serialisation format
- Refactor the analyser so that it can be invoked from Python and return the matching objects directly
  - This would enable more fine-grained analysis