unitedstates/congress

Scraper is eating our CPU.

Opened this issue · 1 comment

Dear Congress Folk:
As the year progress and the congress xml files on govinfo.gov get bigger and bigger, I am finding that downloading some files is eating 80-90% of our CPU, causing server lag times. Some example problem files are:

https://www.govinfo.gov/sitemap/BILLS_2018_sitemap.xml
https://www.govinfo.gov/sitemap/bulkdata/BILLSTATUS/115hr/sitemap.xml

I would humbly suggest adding a streaming handler to the download function in utils so that large files are downloaded in chunks rather than read into memory all at once. An example approach is suggested here:
https://stackoverflow.com/questions/16694907/how-to-download-large-file-in-python-with-requests-py
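For reference, a minimal sketch of what a chunked download could look like using `requests` with `stream=True`, along the lines of the Stack Overflow answer above. The function name and signature here are illustrative; the actual `utils` download method in this repo may differ.

```python
import requests

def download(url, dest_path, chunk_size=8192):
    """Download url to dest_path in chunks, avoiding loading
    the whole response body into memory at once."""
    # stream=True defers the body download until iter_content is called
    with requests.get(url, stream=True) as resp:
        resp.raise_for_status()
        with open(dest_path, "wb") as f:
            for chunk in resp.iter_content(chunk_size=chunk_size):
                if chunk:  # skip keep-alive chunks
                    f.write(chunk)
    return dest_path
```

With this pattern, memory use stays roughly bounded by `chunk_size` instead of the full file size, which is what matters for the multi-hundred-megabyte sitemap files above.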

Thanks
Sherrod

@mooncalfskb Thanks for identifying this. Would you be up for submitting a pull request that refits our download method to use a streaming handler?