Large query fails to return more than 2000 articles
barentsen opened this issue · 3 comments
I don't seem to get all the articles I expect when executing a large query. Example:
In [1]: import ads
In [2]: qry = ads.SearchQuery(q='pub:"Monthly Notices of the Royal Astronomical Society" pubdate:"2015"', rows=9999999)
In [3]: articles = list(qry)
In [4]: len(articles)
Out[4]: 2000
The number `2000` is suspicious, and arguably wrong, because the API response says there are more:
In [5]: print(qry.response.numFound)
3140
I figured that `SearchQuery.__next__()` likely thought that the `max_pages` limit had been reached, but when I repeat the query with a high `max_pages` limit, I see an `IndexError`.
In [1]: import ads
In [2]: qry = ads.SearchQuery(q='pub:"Monthly Notices of the Royal Astronomical Society" pubdate:"2015"', rows=9999999, max_pages=99999999)
In [3]: articles = list(qry)
---------------------------------------------------------------------------
IndexError Traceback (most recent call last)
/home/gb/dev/ads/ads/search.py in __next__(self)
388 try:
--> 389 cur = self._articles[self.__iter_counter]
390 # If no more articles, check to see if we should query for the
IndexError: list index out of range
During handling of the above exception, another exception occurred:
IndexError Traceback (most recent call last)
<ipython-input-3-24c62670914a> in <module>()
----> 1 articles = list(qry)
/home/gb/dev/ads/ads/search.py in __next__(self)
405 self.execute()
406 print(self.__iter_counter)
--> 407 cur = self._articles[self.__iter_counter]
408
409 self.__iter_counter += 1
IndexError: list index out of range
Can anyone reproduce?
(I am trying to count the fraction of articles that appear on arXiv as a function of journal and year; I hope that's OK.)
Hi, thanks for the report.
I suspect what's going on here is that the ADS Solr service re-writes `rows` if it is greater than a certain value, as a safety measure: https://github.com/adsabs/solr-service/blob/master/solr/views.py#L59
When that happens, the ads client isn't aware of the re-write and expects its own `rows` to be definitive. I'm thinking the best way to handle this is to emit a warning if the `responseHeader` doesn't contain the same data as `self.query`.
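Sketching it out, the check might look something like this (illustrative only, not the final implementation; the function name and arguments are made up here, but Solr does echo the parameters it actually applied back in `responseHeader.params`, so a mismatch there is the signal we'd key off):

import warnings

def warn_if_rows_rewritten(requested_rows, response_json):
    # Solr echoes the parameters it actually used in responseHeader.params;
    # if the service re-wrote `rows` as a safety measure, the echoed value
    # will differ from the one we sent.
    params = response_json['responseHeader'].get('params', {})
    effective_rows = int(params.get('rows', requested_rows))
    if effective_rows != requested_rows:
        warnings.warn(
            'ADS capped rows from {} to {}; pagination will use the '
            'capped value'.format(requested_rows, effective_rows))
    return effective_rows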
In general, setting `rows` to a reasonable value and iterating over the pages of results is the preferred way to query the database.
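For example (nothing ads-specific beyond `SearchQuery` being an iterator; each page of results is fetched on demand as you consume it):

import ads

qry = ads.SearchQuery(q='star', rows=200, max_pages=5)
for article in qry:  # pages of 200 results are fetched transparently
    print(article.bibcode)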
@vsudilov I confirm that setting `rows=2000` and `max_pages=10` allows me to circumvent the bug, thanks.
In [1]: import ads
In [2]: qry = ads.SearchQuery(q='pub:"Monthly Notices of the Royal Astronomical Society" pubdate:"2015"', rows=2000, max_pages=10)
In [3]: articles = list(qry)
In [4]: len(articles)
Out[4]: 3140
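In case it helps anyone else, this is roughly how I'm tallying the arXiv fraction (a sketch only, under the assumption that the ADS `identifier` field lists arXiv IDs as strings with an "arXiv:" prefix):

import ads

qry = ads.SearchQuery(
    q='pub:"Monthly Notices of the Royal Astronomical Society" pubdate:"2015"',
    fl=['bibcode', 'identifier'],  # request only the fields we need
    rows=2000, max_pages=10)

total, on_arxiv = 0, 0
for article in qry:
    total += 1
    # assumption: arXiv IDs appear in `identifier` as e.g. "arXiv:1501.00001"
    if any(i.startswith('arXiv:') for i in (article.identifier or [])):
        on_arxiv += 1

print('{}/{} articles appear on arXiv'.format(on_arxiv, total))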
Thx++!