does not download all quarters
rogelioamancisidor opened this issue · 12 comments
I have noticed that `edgar.download_index` does not download the `.tsv` files for all quarters. Sometimes I get all 4 quarters for, say, 2015, while other times I get only 2. See my log output below (it should have downloaded 26 files, but it downloaded only 20):
2021-06-29 14:57:46,763 - DEBUG - downloads will be saved to ../data/edgar
2021-06-29 14:57:46,763 - DEBUG - downloading files since 2015
2021-06-29 14:57:46,763 - INFO - 26 index files to retrieve
2021-06-29 14:57:46,763 - DEBUG - worker count: 8
2021-06-29 14:57:47,441 - INFO - > downloaded https://www.sec.gov/Archives/edgar/full-index/2019/QTR3/master.zip to ../data/edgar/2019-QTR3.tsv
2021-06-29 14:57:47,629 - INFO - > downloaded https://www.sec.gov/Archives/edgar/full-index/2019/QTR2/master.zip to ../data/edgar/2019-QTR2.tsv
2021-06-29 14:57:47,739 - INFO - > downloaded https://www.sec.gov/Archives/edgar/full-index/2021/QTR1/master.zip to ../data/edgar/2021-QTR1.tsv
2021-06-29 14:57:47,785 - INFO - > downloaded https://www.sec.gov/Archives/edgar/full-index/2021/QTR2/master.zip to ../data/edgar/2021-QTR2.tsv
2021-06-29 14:57:47,815 - INFO - > downloaded https://www.sec.gov/Archives/edgar/full-index/2018/QTR2/master.zip to ../data/edgar/2018-QTR2.tsv
2021-06-29 14:57:47,819 - INFO - > downloaded https://www.sec.gov/Archives/edgar/full-index/2020/QTR4/master.zip to ../data/edgar/2020-QTR4.tsv
2021-06-29 14:57:47,860 - INFO - > downloaded https://www.sec.gov/Archives/edgar/full-index/2019/QTR4/master.zip to ../data/edgar/2019-QTR4.tsv
2021-06-29 14:57:47,937 - INFO - > downloaded https://www.sec.gov/Archives/edgar/full-index/2019/QTR1/master.zip to ../data/edgar/2019-QTR1.tsv
2021-06-29 14:57:48,203 - INFO - > downloaded https://www.sec.gov/Archives/edgar/full-index/2017/QTR4/master.zip to ../data/edgar/2017-QTR4.tsv
2021-06-29 14:57:48,251 - INFO - > downloaded https://www.sec.gov/Archives/edgar/full-index/2018/QTR1/master.zip to ../data/edgar/2018-QTR1.tsv
2021-06-29 14:57:48,389 - INFO - > downloaded https://www.sec.gov/Archives/edgar/full-index/2017/QTR3/master.zip to ../data/edgar/2017-QTR3.tsv
2021-06-29 14:57:48,516 - INFO - > downloaded https://www.sec.gov/Archives/edgar/full-index/2016/QTR3/master.zip to ../data/edgar/2016-QTR3.tsv
2021-06-29 14:57:48,578 - INFO - > downloaded https://www.sec.gov/Archives/edgar/full-index/2017/QTR2/master.zip to ../data/edgar/2017-QTR2.tsv
2021-06-29 14:57:48,643 - INFO - > downloaded https://www.sec.gov/Archives/edgar/full-index/2016/QTR4/master.zip to ../data/edgar/2016-QTR4.tsv
2021-06-29 14:57:48,736 - INFO - > downloaded https://www.sec.gov/Archives/edgar/full-index/2016/QTR2/master.zip to ../data/edgar/2016-QTR2.tsv
2021-06-29 14:57:48,763 - INFO - > downloaded https://www.sec.gov/Archives/edgar/full-index/2017/QTR1/master.zip to ../data/edgar/2017-QTR1.tsv
2021-06-29 14:57:48,914 - INFO - > downloaded https://www.sec.gov/Archives/edgar/full-index/2015/QTR4/master.zip to ../data/edgar/2015-QTR4.tsv
2021-06-29 14:57:48,939 - INFO - > downloaded https://www.sec.gov/Archives/edgar/full-index/2015/QTR3/master.zip to ../data/edgar/2015-QTR3.tsv
2021-06-29 14:57:49,108 - INFO - > downloaded https://www.sec.gov/Archives/edgar/full-index/2015/QTR2/master.zip to ../data/edgar/2015-QTR2.tsv
2021-06-29 14:57:49,217 - INFO - > downloaded https://www.sec.gov/Archives/edgar/full-index/2015/QTR1/master.zip to ../data/edgar/2015-QTR1.tsv
2021-06-29 14:57:49,230 - INFO - complete
2021-06-29 14:57:49,231 - INFO - Files downloaded in ../data/edgar
Does this consistently happen?
Yes. I have worked around it by wrapping `edgar.download_index()` in a `while` loop that counts the downloaded files and repeats until all 26 are present. It takes 3-4 tries to get all of the files. I wonder if the parallelization has something to do with it. Maybe it would be a good idea to add some functionality that checks whether all files were downloaded?
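A minimal sketch of that workaround, for anyone who needs it in the meantime. It assumes the `YYYY-QTRn.tsv` file names seen in the log above; the helper names and the `download` callable are hypothetical (pass in whatever calls `edgar.download_index` with your own arguments), and none of this is part of the library.

```python
import os


def expected_index_files(since_year, last_year, last_quarter):
    """Names of the index files expected from since_year up to last_year/last_quarter."""
    names = []
    for year in range(since_year, last_year + 1):
        # Only the final year may be incomplete.
        quarters = last_quarter if year == last_year else 4
        for quarter in range(1, quarters + 1):
            names.append(f"{year}-QTR{quarter}.tsv")
    return names


def missing_index_files(dest_dir, since_year, last_year, last_quarter):
    """Expected files that are not present in dest_dir yet."""
    present = set(os.listdir(dest_dir))
    expected = expected_index_files(since_year, last_year, last_quarter)
    return [name for name in expected if name not in present]


def download_until_complete(download, dest_dir, since_year, last_year,
                            last_quarter, max_attempts=5):
    """Call download() until no expected file is missing; return the attempt count."""
    missing = []
    for attempt in range(1, max_attempts + 1):
        download()  # hypothetical callable that runs the real download
        missing = missing_index_files(dest_dir, since_year, last_year, last_quarter)
        if not missing:
            return attempt
    raise RuntimeError(f"still missing after {max_attempts} attempts: {missing}")
```

For the run in the log, `expected_index_files(2015, 2021, 2)` yields the 26 names, so the check reports exactly which quarters were dropped instead of only a count.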
Thank you for the discussion. I ran into the same issue. So far I have only been able to download 101 files starting from 1993. After looping for an hour, the files for 2021 quarters 1-2, for 2018-2019 quarter 2, etc. are still missing.
I have found two things that are related to this issue:
- Sometimes `_url_get(url)` runs into an unhandled exception: `urllib.error.HTTPError: HTTP Error 403: Forbidden`.
- With Python multiprocessing, if a child process runs into an unhandled exception, the parent does not notice it and the remaining work just keeps running.
So not all quarters' data are downloaded successfully, and at the same time no exceptions are caught.
Perhaps we need a retry mechanism to fix this issue?
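Something along these lines might work as a retry mechanism. This is only a sketch: `fetch` stands in for whatever function actually downloads a URL (for example the library's `_url_get`), and `fetch_with_retry`, `fetch_all`, and their parameters are hypothetical names. It retries on `urllib.error.HTTPError` with exponential backoff and hands the failures back to the caller, so that nothing is dropped silently.

```python
import time
import urllib.error


def fetch_with_retry(fetch, url, attempts=4, base_delay=1.0, sleep=time.sleep):
    """Call fetch(url), retrying on HTTPError with exponential backoff."""
    for attempt in range(attempts):
        try:
            return fetch(url)
        except urllib.error.HTTPError:
            if attempt == attempts - 1:
                raise  # out of retries: let the caller see the error
            sleep(base_delay * 2 ** attempt)  # wait 1s, 2s, 4s, ...


def fetch_all(fetch, urls, **retry_kwargs):
    """Fetch every URL; return (results, failures) keyed by URL."""
    results, failures = {}, {}
    for url in urls:
        try:
            results[url] = fetch_with_retry(fetch, url, **retry_kwargs)
        except urllib.error.HTTPError as err:
            failures[url] = err.code  # record the HTTP status, e.g. 403
    return results, failures
```

The caller can then log or re-queue everything in `failures` rather than reporting "complete" with files missing. In a multiprocessing setup the same idea applies if each worker returns its failures to the parent instead of raising.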
I too am running into a similar issue, except that when I run `run.py -y 2012` the output only shows 2021 quarterly files, and not consistently. Sometimes nothing is downloaded, sometimes just Q1 2021 or Q2 and Q3 2021, but not all years since 2012.
2021-07-15 11:16:06,709 - DEBUG - downloads will be saved to ..Local\Temp\tmp_lkv8xaq
2021-07-15 11:16:06,709 - DEBUG - downloading files since 2021
2021-07-15 11:16:06,710 - INFO - 3 index files to retrieve
2021-07-15 11:16:06,710 - DEBUG - worker count: 8
2021-07-15 11:16:08,066 - INFO - > downloaded https://www.sec.gov/Archives/edgar/full-index/2021/QTR3/master.zip to C:..Temp\tmp_xxxx/2021-QTR3.tsv
2021-07-15 11:16:08,462 - INFO - > downloaded https://www.sec.gov/Archives/edgar/full-index/2021/QTR2/master.zip to C:..Temp\tmp_xxxx/2021-QTR2.tsv
2021-07-15 11:16:08,512 - INFO - complete
2021-07-15 11:16:08,513 - INFO - Files downloaded in C:..Temp\tmp_lkv8xaq
Process finished with exit code 0
Same issue here. It might be caused by the newly imposed fair access policy, which allows only 10 requests per second. Most of the EDGAR scraping packages are now affected by this.
Thanks for the report @svendaj. Is anyone volunteering to submit a PR to add
- rate limiting at 10qps
- user-agent prompt
so the project remains compliant with the fair access policy?
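To make the two items concrete, here is a rough sketch of what they could look like. `RateLimiter` and `build_request` are hypothetical names, not existing project code, and the contact string in the usage note is a placeholder. The limiter spaces calls evenly and is shared safely between threads in one process; separate worker processes would not share it, so they would each need a fraction of the budget.

```python
import threading
import time
import urllib.request


class RateLimiter:
    """Allow at most max_per_second calls by spacing them evenly."""

    def __init__(self, max_per_second, clock=time.monotonic, sleep=time.sleep):
        self._interval = 1.0 / max_per_second
        self._clock = clock
        self._sleep = sleep
        self._lock = threading.Lock()
        self._next_slot = clock()

    def wait(self):
        """Block until the caller's reserved time slot arrives."""
        with self._lock:
            now = self._clock()
            slot = max(self._next_slot, now)
            self._next_slot = slot + self._interval  # reserve the following slot
        delay = slot - now
        if delay > 0:
            self._sleep(delay)


def build_request(url, user_agent):
    """Build a request that declares the caller's User-Agent."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})
```

Usage would be roughly: create one `RateLimiter(10)`, call `limiter.wait()` before each `urllib.request.urlopen(build_request(url, "Your Name your.email@example.com"))`, and ask the user for the user-agent string once at start-up.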
Same issue here; I can only get all the index files after looping many times.
Fix in #22. Can someone confirm it's working for them as well? I will merge after that. Thanks!
@edouardswiac did the fix take the 10 qps rate limit into account, or was the parallel processing simply turned off?
@rogelioamancisidor @edouardswiac has turned off multiprocessing and checks that the request rate stays under 10 qps.
I have created a fork at https://github.com/svendaj/python-edgar that brings parallel processing back while checking the maximum request rate. If you like it, you can merge it back into the original.