stac-utils/pystac-client

Simple Catalog search.items() goes on infinitely

iliion opened this issue · 7 comments

pystac_client version: 0.7.5

I am performing the following simple request to get some items from a catalog and this ends up in an infinite loop (?).

from pystac_client import Client
import datetime

def main():
    catalog = Client.open(url='https://earth-search.aws.element84.com/v1/')
    my_search = catalog.search(collections='cop-dem-glo-30', limit = 5)
    print(my_search.url_with_parameters())
    # prints out -> `https://earth-search.aws.element84.com/v1/search?limit=5&collections=cop-dem-glo-30`
    for item in my_search.items():
        print(item)

if __name__ == '__main__':
    main()

In the above example I would just expect the API to return 5 items per page.
What I get instead are multiple requests to the following URL: https://earth-search.aws.element84.com/v1/search?limit=5&collections=cop-dem-glo-30.
In addition, if there are fewer results than the limit, the API keeps returning the same items repeatedly (and not necessarily in the same order).

Tom is correct, if you only want to return five items, use max_items. A couple of other things:

In the above example I would just expect the API to return 5 items per page.

It should, but to check this you need to:

for page in my_search.pages_as_dicts():
    # each page is a feature-collection-like dict; count its items, not its keys
    print(len(page["features"]))

In this line:

print(my_search.url_with_parameters())

During paging, the search object is not updated with the paging parameters, so url_with_parameters will not change while paging. See

def get_pages(
    self,
    url: str,
    method: Optional[str] = None,
    parameters: Optional[Dict[str, Any]] = None,
) -> Iterator[Dict[str, Any]]:
    """Iterator that yields dictionaries for each page at a STAC paging
    endpoint, e.g., /collections, /search
    Return:
        Dict[str, Any] : JSON content from a single page
    """
    page = self.read_json(url, method=method, parameters=parameters)
    if not (page.get("features") or page.get("collections")):
        return None
    yield page
    next_link = next(
        (link for link in page.get("links", []) if link["rel"] == "next"), None
    )
    while next_link:
        link = Link.from_dict(next_link)
        page = self.read_json(link, parameters=parameters)
        if not (page.get("features") or page.get("collections")):
            return None
        yield page
        # get the next link and make the next request
        next_link = next(
            (link for link in page.get("links", []) if link["rel"] == "next"), None
        )
for the relevant code.
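
To make the stopping condition concrete, here is a minimal, self-contained sketch of the same loop run against an in-memory stand-in for a server. The names FAKE_PAGES, fake_read_json and iter_pages are hypothetical and not part of pystac-client; only the control flow is meant to mirror get_pages above.

```python
from typing import Any, Dict, Iterator, Optional

# Hypothetical in-memory "server": maps a page token to a response body.
# Only the first two pages carry a rel="next" link.
FAKE_PAGES: Dict[Optional[str], Dict[str, Any]] = {
    None: {
        "features": [{"id": "item-1"}, {"id": "item-2"}],
        "links": [{"rel": "next", "href": "/search", "body": {"token": "a"}}],
    },
    "a": {
        "features": [{"id": "item-3"}, {"id": "item-4"}],
        "links": [{"rel": "next", "href": "/search", "body": {"token": "b"}}],
    },
    "b": {
        "features": [{"id": "item-5"}],
        "links": [],  # last page: no next link, so paging stops here
    },
}


def fake_read_json(token: Optional[str]) -> Dict[str, Any]:
    """Stand-in for an HTTP request; returns the page for a given token."""
    return FAKE_PAGES[token]


def iter_pages() -> Iterator[Dict[str, Any]]:
    """Yield pages until a page is empty or has no rel="next" link."""
    token: Optional[str] = None
    while True:
        page = fake_read_json(token)
        if not page.get("features"):
            return
        yield page
        next_link = next(
            (link for link in page.get("links", []) if link["rel"] == "next"),
            None,
        )
        if next_link is None:
            return
        token = next_link["body"]["token"]


page_sizes = [len(page["features"]) for page in iter_pages()]
print(page_sizes)
```

The loop ends only because the last page has no next link (or because a page comes back empty). A server that keeps attaching a next link to non-empty pages would keep this loop running.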

OK, I understand that the search request returns all pages and that limit is the size of each page. I get the number of items in each page from print(len(page['features'])).

My problem is that the requests go on infinitely when I run the above example against my own catalog. I understand that this is probably a bug on my side, but I can't work out the reason. Maybe you have a clue why the requests from the client won't stop. Am I missing something in the API specification?

FYI: The api response follows the specs here (https://api.stacspec.org/v1.0.0/item-search/#tag/Item-Search)

I think I know what is wrong: pystac_client does not support paging implemented with a page=x parameter.

For the following request http://localhost:20008/search?limit=2&collections=test-collection
The rel=next link will have this href -> http://localhost:20008/search?limit=2&collections=test-collection&page=1

Unfortunately the above url is parsed and the output is the following

{
   "rel":"next",
   "type":"application/json",
   "method":"POST",
   "href":"http://localhost:20008/search",
   "body":{
      "limit":2,
      "collections":[
         "test-collection"
      ],
      "token":1
   }
}

Unfortunately the above url is parsed and the output is the following

I don't quite know what you mean by this. The read_text method doesn't make any assumptions about pagination -- it simply uses what the server returns:

def read_text(self, source: pystac.link.HREF, *args: Any, **kwargs: Any) -> str:
    """Read text from the given URI.
    Overwrites the default method for reading text from a URL or file to allow
    :class:`urllib.request.Request` instances as input. This method also raises
    any :exc:`urllib.error.HTTPError` exceptions rather than catching
    them to allow us to handle different response status codes as needed.
    """
    if isinstance(source, Link):
        link = source.to_dict()
        href = link["href"]
        # get headers and body from Link and add to request from simple STAC
        # resolver
        merge = bool(link.get("merge", False))
        # If the link object includes a "method" property, use that. If not
        # fall back to 'GET'.
        method = link.get("method", "GET")
        # If the link object includes a "headers" property, use that and
        # respect the "merge" property.
        headers = link.get("headers", None)
        # If "POST" use the body object that and respect the "merge" property.
        link_body = link.get("body", {})
        if method == "POST":
            parameters = (
                {**(kwargs.get("parameters", {})), **link_body}
                if merge
                else link_body
            )
        else:
            # parameters are already in the link href
            parameters = {}
        return self.request(
            href, method=method, headers=headers, parameters=parameters
        )
    else:  # str or something that can be str'ed
        href = str(source)
        if _is_url(href):
            return self.request(href, *args, **kwargs)
        else:
            with open(href) as f:
                href_contents = f.read()
            return href_contents
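
As a rough illustration of the branching above, the following sketch reduces it to a pure function that takes a link dictionary and returns what would be requested. resolve_link_request is a hypothetical helper written for this thread, not a pystac-client function, and it ignores headers for brevity.

```python
from typing import Any, Dict, Optional, Tuple


def resolve_link_request(
    link: Dict[str, Any],
    search_parameters: Optional[Dict[str, Any]] = None,
) -> Tuple[str, str, Dict[str, Any]]:
    """Return (method, href, parameters) for a link dictionary.

    Hypothetical helper that mirrors the branching in read_text above.
    """
    method = link.get("method", "GET")
    body = link.get("body", {})
    if method == "POST":
        if link.get("merge", False):
            # merge=True: start from the original search, let the link body win
            parameters = {**(search_parameters or {}), **body}
        else:
            # merge absent or False: the link body is used as-is
            parameters = body
    else:
        # GET: everything needed is already in the href's query string
        parameters = {}
    return method, link["href"], parameters


original_search = {"limit": 1, "collections": ["test-collection"]}

# A GET next link that pages with a page=x query parameter
get_request = resolve_link_request(
    {
        "rel": "next",
        "method": "GET",
        "href": "http://localhost:20008/search?limit=1&collections=test-collection&page=1",
    },
    original_search,
)

# A POST next link that carries a token in its body and asks for a merge
post_request = resolve_link_request(
    {
        "rel": "next",
        "method": "POST",
        "href": "http://localhost:20008/search",
        "body": {"token": 1},
        "merge": True,
    },
    original_search,
)

print(get_request)
print(post_request)
```

In both cases the request is built only from what the link itself contains, so a page=x parameter in a GET href is passed through untouched.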

To continue debugging, can you provide the following:

My guess was read_json()

I will try to be more clear.

http://localhost:20008/search?limit=2&collections=test-collection

will output a response where the next link is like this:

{
  "rel":"next",
  "type":"application/json",
  "method":"GET",
  "href":"http://localhost:20008/search?limit=1&collections=test-collection&page=1"
}

If I run the following and print the response, I get something different:

catalog = Client.open(url='http://localhost:20008')
my_search = catalog.search(collections='test-collection', limit = 1)

for page in my_search.pages_as_dicts():
    print(my_search.url_with_parameters())
    # -> http://localhost:20008/search?limit=1&collections=test-collection
    print(page['links'])

The page['links'] will output a response where the next link is this:

{
   "rel":"next",
   "type":"application/json",
   "method":"POST",
   "href":"http://localhost:20008/search",
   "body":{
      "limit":2,
      "collections":[
         "test-collection"
      ],
      "token":1
   }
}

The point is that the loop will not stop


DEBUG

. . .
REQUEST 0

DEBUG:pystac_client.stac_api_io:POST http://localhost:20008/search Headers: {'User-Agent': 'python-requests/2.31.0', 'Accept-Encoding': 'gzip, deflate, br', 'Accept': '*/*', 'Connection': 'keep-alive', 'Content-Length': '60', 'Content-Type': 'application/json'} Payload: {"limit": 1, "collections": ["test-collection"], "token": 1}
send: b'POST /search HTTP/1.1\r\nHost: localhost:20008\r\nUser-Agent: python-requests/2.31.0\r\nAccept-Encoding: gzip, deflate, br\r\nAccept: */*\r\nConnection: keep-alive\r\nContent-Length: 60\r\nContent-Type: application/json\r\n\r\n'
send: b'{"limit": 1, "collections": ["test-collection"], "token": 1}'
reply: 'HTTP/1.1 200 OK\r\n'
header: date: Wed, 22 Nov 2023 16:17:30 GMT
header: server: uvicorn
header: content-length: 1509
header: content-type: application/geo+json
header: content-encoding: br
header: vary: Accept-Encoding
DEBUG:urllib3.connectionpool:http://localhost:20008 "POST /search HTTP/1.1" 200 1509

REQUEST 1

DEBUG:pystac_client.stac_api_io:POST http://localhost:20008/search Headers: {'User-Agent': 'python-requests/2.31.0', 'Accept-Encoding': 'gzip, deflate, br', 'Accept': '*/*', 'Connection': 'keep-alive', 'Content-Length': '48', 'Content-Type': 'application/json'} Payload: {"limit": 1, "collections": ["test-collection"]}
send: b'POST /search HTTP/1.1\r\nHost: localhost:20008\r\nUser-Agent: python-requests/2.31.0\r\nAccept-Encoding: gzip, deflate, br\r\nAccept: */*\r\nConnection: keep-alive\r\nContent-Length: 48\r\nContent-Type: application/json\r\n\r\n'
send: b'{"limit": 1, "collections": ["test-collection"]}'
reply: 'HTTP/1.1 200 OK\r\n'
header: date: Wed, 22 Nov 2023 16:17:33 GMT
header: server: uvicorn
header: content-length: 1509
header: content-type: application/geo+json
header: content-encoding: br
header: vary: Accept-Encoding
DEBUG:urllib3.connectionpool:http://localhost:20008 "POST /search HTTP/1.1" 200 1509
<Item id=test-item-1>
DEBUG:pystac_client.stac_api_io:POST http://localhost:20008/search Headers: {'User-Agent': 'python-requests/2.31.0', 'Accept-Encoding': 'gzip, deflate, br', 'Accept': '*/*', 'Connection': 'keep-alive', 'Content-Length': '60', 'Content-Type': 'application/json'} Payload: {"limit": 1, "collections": ["test-collection"], "token": 1}
send: b'POST /search HTTP/1.1\r\nHost: localhost:20008\r\nUser-Agent: python-requests/2.31.0\r\nAccept-Encoding: gzip, deflate, br\r\nAccept: */*\r\nConnection: keep-alive\r\nContent-Length: 60\r\nContent-Type: application/json\r\n\r\n'
send: b'{"limit": 1, "collections": ["test-collection"], "token": 1}'
reply: 'HTTP/1.1 200 OK\r\n'
header: date: Wed, 22 Nov 2023 16:17:33 GMT
header: server: uvicorn
header: content-length: 1509
header: content-type: application/geo+json
header: content-encoding: br
header: vary: Accept-Encoding
DEBUG:urllib3.connectionpool:http://localhost:20008 "POST /search HTTP/1.1" 200 1509
<Item id=test-item-1>
..  ..  .. (infinite loop).. .. .. 
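
One way to reproduce this kind of loop without a real server is sketched below. It assumes, as the repeated payloads in the log suggest, that the server attaches the same next link to every response whether or not a token was sent. faulty_read_json and guarded_pages are hypothetical names, and the repeated-link check is an illustrative safeguard, not something pystac-client does.

```python
import json
from typing import Any, Dict, Iterator, List, Optional


def faulty_read_json(body: Dict[str, Any]) -> Dict[str, Any]:
    """Hypothetical faulty server: ignores the token and always returns
    the same item together with the same rel="next" link."""
    return {
        "features": [{"id": "test-item-1"}],
        "links": [
            {
                "rel": "next",
                "method": "POST",
                "href": "http://localhost:20008/search",
                "body": {
                    "limit": 1,
                    "collections": ["test-collection"],
                    "token": 1,
                },
            }
        ],
    }


def guarded_pages(first_body: Dict[str, Any]) -> Iterator[Dict[str, Any]]:
    """Follow next links, but stop if the same link is offered twice."""
    seen = set()
    body = first_body
    while True:
        page = faulty_read_json(body)
        if not page.get("features"):
            return
        yield page
        next_link = next(
            (link for link in page.get("links", []) if link["rel"] == "next"),
            None,
        )
        if next_link is None:
            return
        signature = json.dumps(next_link, sort_keys=True)
        if signature in seen:
            raise RuntimeError("server returned a next link that was already followed")
        seen.add(signature)
        body = next_link["body"]


fetched: List[str] = []
error: Optional[str] = None
try:
    for page in guarded_pages({"limit": 1, "collections": ["test-collection"]}):
        fetched.append(page["features"][0]["id"])
except RuntimeError as exc:
    error = str(exc)

print(fetched, error)
```

Without the check on seen, this loop would never end, because every response is non-empty and always has a next link.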

This is a problem with your server. pages_as_dicts does not modify the links attribute in any way:

def pages_as_dicts(self) -> Iterator[Dict[str, Any]]:
    """Iterator that yields :class:`dict` instances for each page
    of results from the search.
    Yields:
        Dict : a group of items matching the search
        criteria as a feature-collection-like dictionary.
    """
    if isinstance(self._stac_io, StacApiIO):
        num_items = 0
        for page in self._stac_io.get_pages(
            self.url, self.method, self.get_parameters()
        ):
            call_modifier(self.modifier, page)
            features = page.get("features", [])
            if features:
                num_items += len(features)
                if self._max_items and num_items > self._max_items:
                    # Slice the features down to make sure we hit max_items
                    page["features"] = features[0 : -(num_items - self._max_items)]
                yield page
                if self._max_items and num_items >= self._max_items:
                    return
            else:
                return
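
The max_items handling in this method can be tried in isolation. The sketch below copies only the counting and slicing logic into a hypothetical truncate_pages function that works on plain lists instead of page dictionaries; it is an illustration of the arithmetic, not pystac-client code.

```python
import itertools
from typing import Iterable, Iterator, List, Optional


def truncate_pages(
    pages: Iterable[List[str]], max_items: Optional[int]
) -> Iterator[List[str]]:
    """Yield pages of features, cutting the last one so that at most
    max_items features are yielded in total."""
    num_items = 0
    for features in pages:
        if not features:
            return
        num_items += len(features)
        if max_items and num_items > max_items:
            # Drop the overshoot from the end of this page
            features = features[0 : -(num_items - max_items)]
        yield features
        if max_items and num_items >= max_items:
            return


# Three pages of three items each, capped at five items in total
pages = [["a", "b", "c"], ["d", "e", "f"], ["g", "h", "i"]]
result = list(truncate_pages(pages, max_items=5))
print(result)

# An endless stream of identical one-item pages still terminates with max_items
bounded = list(truncate_pages(itertools.repeat(["test-item-1"]), max_items=3))
print(bounded)
```

The second call shows why max_items bounds the number of requests even when a server never stops offering pages: iteration returns as soon as the running count reaches the cap.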

Closing as not-an-issue-with-pystac-client, please re-open if you find otherwise.