andycasey/ads

Large query fails to return more than 2000 articles

barentsen opened this issue · 3 comments

I don't seem to get all the desired articles when executing a large query. Example:

In [1]: import ads
In [2]: qry = ads.SearchQuery(q='pub:"Monthly Notices of the Royal Astronomical Society" pubdate:"2015"', rows=9999999)
In [3]: articles = list(qry)
In [4]: len(articles)
Out[4]: 2000

The number 2000 is suspicious, and arguably wrong, because the API response says there are more:

In [5]: print(qry.response.numFound)
3140

I figured that SearchQuery.__next__() likely thought the max_pages limit had been reached, but when I repeated the query with a much higher max_pages limit, I got an IndexError instead.

In [1]: import ads

In [2]: qry = ads.SearchQuery(q='pub:"Monthly Notices of the Royal Astronomical Society" pubdate:"2015"', rows=9999999, max_pages=99999999)

In [3]: articles = list(qry)
---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
/home/gb/dev/ads/ads/search.py in __next__(self)
    388         try:
--> 389             cur = self._articles[self.__iter_counter]
    390             # If no more articles, check to see if we should query for the

IndexError: list index out of range

During handling of the above exception, another exception occurred:

IndexError                                Traceback (most recent call last)
<ipython-input-3-24c62670914a> in <module>()
----> 1 articles = list(qry)

/home/gb/dev/ads/ads/search.py in __next__(self)
    405             self.execute()
    406             print(self.__iter_counter)
--> 407             cur = self._articles[self.__iter_counter]
    408 
    409         self.__iter_counter += 1

IndexError: list index out of range

Can anyone reproduce?

(I am trying to count the fraction of articles that appear on arXiv as a function of journal and year, I hope that's ok.)

Hi, thanks for the report.

I suspect what's going on here is related to the fact that the ADS Solr service, as a safety measure, re-writes rows if it is greater than a certain value: https://github.com/adsabs/solr-service/blob/master/solr/views.py#L59

When that happens, the ads client isn't aware of the re-write and assumes its own rows value is definitive. I think the best way to handle this is to emit a warning whenever the responseHeader doesn't contain the same data as self.query.
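Roughly this kind of check, as a sketch only (the function name is illustrative, and it assumes the effective parameters are echoed back under responseHeader['params'], which is standard Solr behaviour when echoParams covers the request parameters):

import warnings

def check_rows(requested_rows, response_json):
    # Compare the rows we asked for against the rows the service
    # actually honoured.
    params = response_json.get('responseHeader', {}).get('params', {})
    effective_rows = int(params.get('rows', requested_rows))
    if effective_rows != requested_rows:
        warnings.warn(
            "ADS re-wrote rows from {0} to {1}; paginating with the "
            "effective value instead".format(requested_rows, effective_rows)
        )
    return effective_rows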

In general, setting rows to a reasonable value and iterating over the pages of results is the preferred way to query the database.
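For example (the page size and max_pages values here are just illustrative):

import ads

qry = ads.SearchQuery(
    q='pub:"Monthly Notices of the Royal Astronomical Society" pubdate:"2015"',
    rows=500,       # a page size the service will accept as-is
    max_pages=20,   # enough requests to cover the full result set
)

# The iterator transparently fetches the next page as each one is
# consumed, so no single request needs a huge rows value.
articles = list(qry)
print(len(articles), qry.response.numFound)  # these should agree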

@vsudilov I confirm that setting rows=2000 and max_pages=10 allows me to circumvent the bug, thanks.

In [1]: import ads
In [2]: qry = ads.SearchQuery(q='pub:"Monthly Notices of the Royal Astronomical Society" pubdate:"2015"', rows=2000, max_pages=10)
In [3]: articles = list(qry)
In [4]: len(articles)
Out[4]: 3140

Thx++!