BadRequest error
Closed this issue · 4 comments
Super nice repo, thanks a lot for the work!
Executing the code in 01_download.py works fine until the offset goes above 5000. Then I get this:
{"message": "offset: 5100 is greater than the maximum of 5000", "code": "BadRequest"}
This looks to me like the figshare API only allows me to download 5000 articles. I could understand this error if it appeared at a random number, especially since your stats suggest there are roughly 5000 articles in total. But 5000 is too round a number, and your plots are somewhat outdated, so I suspect something is going wrong here and not all papers are fetched. Also, this finished in ~5 min.
Any thoughts on this @cthoyt ?
Many thanks :)
Oof @jannisborn, I remember dealing with this issue locally the last time I updated the results... it wasn't pretty. I'm a bit busy right now, but I'll try to dig up my solution and share it with you (and improve the code if possible, but since I didn't push it, I think it was too hacky for comfort).
Thanks for the rapid reply @cthoyt. Already helps a lot to see that you had the same issue.
The figshare docs say:
Please note that there's a limit on the maximum offset or page number you can require.
The offset is currently limited at 1000 and if exceeded a 422 Unprocessable Entity error will be returned.
For pages, it depends on the page_size but for a page_size of 10, the maximum page would be 1000 / 10 = 100
This seems outdated, since in my case the error is only raised once the offset exceeds 5000. So I manually checked https://chemrxiv.org/search, which currently lists 6548 papers. The maximum limit parameter (number of papers per request) is 1000, so with offsets 0 through 5000 the code in this repo can obtain 6000 papers, leaving out only 548 papers, i.e. about 8% of the DB.
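For reference, my workaround loop looks roughly like this (a minimal sketch; the endpoint and parameters are what I gathered from the figshare docs, and the ChemRxiv-specific filter that 01_download.py applies is omitted here):

```python
import requests

BASE_URL = "https://api.figshare.com/v2/articles"  # public figshare listing endpoint
LIMIT = 1000       # maximum number of records per request
MAX_OFFSET = 5000  # offsets above this trigger the BadRequest shown above

articles = []
for offset in range(0, MAX_OFFSET + 1, LIMIT):
    # NOTE: a real query would also restrict results to ChemRxiv
    # (e.g. via an institution filter); that part is left out because
    # the exact query used in 01_download.py is not shown in this thread.
    response = requests.get(BASE_URL, params={"limit": LIMIT, "offset": offset})
    response.raise_for_status()
    batch = response.json()
    if not batch:
        break
    articles.extend(batch)

print(f"Fetched {len(articles)} article records")  # at most 6000 with these caps
```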
That's good enough for my current use case, thanks for your help. I'll leave this open in case you want to work on a solution, but please feel free to close.
I just pushed a proper solution. It required reading over the API documentation a bit more and realizing that they really don't want you using the offset/limit combination for big result sets; page_size and page are preferred instead. 45f22ad implements this.
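For anyone landing on this later, the page-based pattern looks roughly like this (a simplified sketch, not the exact code in 45f22ad, and again without the ChemRxiv-specific filtering):

```python
import requests

BASE_URL = "https://api.figshare.com/v2/articles"  # public figshare listing endpoint
PAGE_SIZE = 1000  # records per page

articles = []
page = 1
while True:
    response = requests.get(BASE_URL, params={"page": page, "page_size": PAGE_SIZE})
    response.raise_for_status()
    batch = response.json()
    if not batch:
        break  # an empty page means we've paged past the last result
    articles.extend(batch)
    page += 1

print(f"Fetched {len(articles)} article records")
```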
Fantastic, that is great news. Many thanks for the rapid solution, it's much appreciated :) 👍