Simple Catalog search.items() goes on infinitely

pystac_client version: 0.7.5

I am performing the following simple request to get some items from a catalog and this ends up in an infinite loop (?).

from pystac_client import Client

def main():
    catalog = Client.open(url='https://earth-search.aws.element84.com/v1/')
    my_search = catalog.search(collections='cop-dem-glo-30', limit=5)
    print(my_search.url_with_parameters())
    # prints out -> https://earth-search.aws.element84.com/v1/search?limit=5&collections=cop-dem-glo-30
    for item in my_search.items():
        print(item)

if __name__ == '__main__':
    main()

In the above example I would expect the API to return 5 items per page.
What I get instead are repeated requests to https://earth-search.aws.element84.com/v1/search?limit=5&collections=cop-dem-glo-30.
In addition, if there are fewer results than the imposed limit, the API keeps returning the same items repeatedly (and not necessarily in the same order).

I think you want `max_items=5`. `limit` comes from the STAC API spec and controls the number of items per page.
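
To make the difference concrete, here is a minimal sketch that uses a toy in-memory list instead of a real STAC API. `ITEMS`, `fake_pages` and `iter_items` are hypothetical names for this example, not pystac-client functions; they only mimic the idea that `limit` sets the page size while `max_items` caps the total.

```python
from itertools import islice
from typing import Iterator, List, Optional

ITEMS = [f"item-{i}" for i in range(12)]  # stand-in for a collection with 12 items


def fake_pages(limit: int) -> Iterator[List[str]]:
    """Yield the toy collection in pages of at most `limit` items."""
    for start in range(0, len(ITEMS), limit):
        yield ITEMS[start:start + limit]


def iter_items(limit: int, max_items: Optional[int] = None) -> Iterator[str]:
    """Walk every page; stop early only when `max_items` is given."""
    items = (item for page in fake_pages(limit) for item in page)
    return islice(items, max_items)  # islice(..., None) means "no cap"


all_items = list(iter_items(limit=5))            # every item, fetched 5 per page
capped = list(iter_items(limit=5, max_items=5))  # stops after 5 items in total
print(len(all_items), len(capped))
```

With only `limit` set, iteration continues until the pages run out, which for a large collection can look like it never ends.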


Tom is correct: if you only want five items returned, use max_items. A couple of other things:

In the above example I would expect the API to return 5 items per page.

It should, but to check this you need to count the features on each page:

for page in my_search.pages_as_dicts():
    print(len(page["features"]))

In this line:

print(my_search.url_with_parameters())

During paging, the search object is not updated with the paging parameters, so url_with_parameters will not change while paging. See

pystac-client/pystac_client/stac_api_io.py

Lines 282 to 312 in 4ea6dac

    
    def get_pages(
        self,
        url: str,
        method: Optional[str] = None,
        parameters: Optional[Dict[str, Any]] = None,
    ) -> Iterator[Dict[str, Any]]:
        """Iterator that yields dictionaries for each page at a STAC paging
        endpoint, e.g., /collections, /search

        Return:
            Dict[str, Any] : JSON content from a single page
        """
        page = self.read_json(url, method=method, parameters=parameters)
        if not (page.get("features") or page.get("collections")):
            return None
        yield page

        next_link = next(
            (link for link in page.get("links", []) if link["rel"] == "next"), None
        )
        while next_link:
            link = Link.from_dict(next_link)
            page = self.read_json(link, parameters=parameters)
            if not (page.get("features") or page.get("collections")):
                return None
            yield page

            # get the next link and make the next request
            next_link = next(
                (link for link in page.get("links", []) if link["rel"] == "next"), None
            )

for the relevant code.
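
As a rough illustration of that loop, the sketch below follows rel="next" links against a toy dictionary of pages. `PAGES` and `fetch` are made up for this example and stand in for the HTTP calls; this `get_pages` is a simplified stand-in, not the library method. The point is that the search URL stays fixed and only the link taken from each response changes.

```python
from typing import Any, Dict, Iterator, Optional

# Toy "server": maps a URL to the page it would return (hypothetical data).
PAGES: Dict[str, Dict[str, Any]] = {
    "/search?limit=2": {
        "features": ["a", "b"],
        "links": [{"rel": "next", "href": "/search?limit=2&page=2"}],
    },
    "/search?limit=2&page=2": {
        "features": ["c"],
        "links": [],  # no "next" link, so paging stops here
    },
}


def fetch(url: str) -> Dict[str, Any]:
    """Stand-in for an HTTP request."""
    return PAGES[url]


def get_pages(url: str) -> Iterator[Dict[str, Any]]:
    """Yield pages, following each page's rel="next" link until none is left."""
    next_url: Optional[str] = url
    while next_url:
        page = fetch(next_url)
        if not page.get("features"):
            return
        yield page
        next_link = next(
            (link for link in page.get("links", []) if link["rel"] == "next"), None
        )
        next_url = next_link["href"] if next_link else None


search_url = "/search?limit=2"
features = [f for page in get_pages(search_url) for f in page["features"]]
print(features)
```

Termination depends entirely on the server eventually returning a page without a "next" link (or an empty page).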

OK, I understand that the search request will return all pages, that the limit is the size of each page, and that I can get the number of items on each page from print(len(page['features'])).

My problem is that the requests go on infinitely when I run the above example against my own catalog. I understand that this is a bug on my side, but I can't figure out the reason. Maybe you have a clue why the requests from the client won't stop. Am I missing something in the API specification?

FYI: The API response follows the spec here (https://api.stacspec.org/v1.0.0/item-search/#tag/Item-Search)

I think I know what is wrong. pystac_client does not support paging implemented with a page=x parameter.

For the following request: http://localhost:20008/search?limit=2&collections=test-collection
the rel=next link will have this href: http://localhost:20008/search?limit=2&collections=test-collection&page=1

Unfortunately, the above URL is parsed and the output is the following:

{
   "rel":"next",
   "type":"application/json",
   "method":"POST",
   "href":"http://localhost:20008/search",
   "body":{
      "limit":2,
      "collections":[
         "test-collection"
      ],
      "token":1
   }
}

Unfortunately the above url is parsed and the output is the following

I don't quite know what you mean by this. The read_text method doesn't make any assumptions about pagination -- it simply uses what the server returns:

pystac-client/pystac_client/stac_api_io.py

Lines 128 to 172 in 4ea6dac

    
    def read_text(self, source: pystac.link.HREF, *args: Any, **kwargs: Any) -> str:
        """Read text from the given URI.

        Overwrites the default method for reading text from a URL or file to allow
        :class:`urllib.request.Request` instances as input. This method also raises
        any :exc:`urllib.error.HTTPError` exceptions rather than catching
        them to allow us to handle different response status codes as needed.
        """
        if isinstance(source, Link):
            link = source.to_dict()
            href = link["href"]
            # get headers and body from Link and add to request from simple STAC
            # resolver
            merge = bool(link.get("merge", False))

            # If the link object includes a "method" property, use that. If not
            # fall back to 'GET'.
            method = link.get("method", "GET")
            # If the link object includes a "headers" property, use that and
            # respect the "merge" property.
            headers = link.get("headers", None)

            # If "POST" use the body object that and respect the "merge" property.
            link_body = link.get("body", {})
            if method == "POST":
                parameters = (
                    {**(kwargs.get("parameters", {})), **link_body}
                    if merge
                    else link_body
                )
            else:
                # parameters are already in the link href
                parameters = {}

            return self.request(
                href, method=method, headers=headers, parameters=parameters
            )
        else:  # str or something that can be str'ed
            href = str(source)
            if _is_url(href):
                return self.request(href, *args, **kwargs)
            else:
                with open(href) as f:
                    href_contents = f.read()
                return href_contents
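
The branch that matters for paging can be condensed as follows. This is a simplified paraphrase of the method quoted above, not the library code itself; `request_args_from_link` is a hypothetical name, and the two link dictionaries reuse the shapes posted earlier in this thread.

```python
from typing import Any, Dict, Optional, Tuple


def request_args_from_link(
    link: Dict[str, Any], parameters: Optional[Dict[str, Any]] = None
) -> Tuple[str, str, Dict[str, Any]]:
    """Return (method, href, parameters) for a STAC link dictionary."""
    method = link.get("method", "GET")
    body = link.get("body", {})
    if method == "POST":
        # POST: the link body is the payload, optionally merged over the
        # original search parameters when the link sets "merge": true.
        merged = {**(parameters or {}), **body} if link.get("merge") else body
        return method, link["href"], merged
    # GET: everything, including any page or token value, is already in the href.
    return method, link["href"], {}


get_link = {
    "rel": "next",
    "method": "GET",
    "href": "http://localhost:20008/search?limit=2&collections=test-collection&page=1",
}
post_link = {
    "rel": "next",
    "method": "POST",
    "href": "http://localhost:20008/search",
    "body": {"limit": 2, "collections": ["test-collection"], "token": 1},
}
print(request_args_from_link(get_link))
print(request_args_from_link(post_link))
```

In both cases the paging value comes straight from the link the server returned; nothing in this logic invents a `page` or `token` parameter.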

To continue debugging, can you provide the following:

The first page returned by the server (the initial response)
The HTTP request sent by pystac-client to get the second page (e.g. by following the instructions here: https://stackoverflow.com/questions/10588644/how-can-i-see-the-entire-http-request-thats-being-sent-by-my-python-application)
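
One way to capture that request with only the standard library is sketched below. It assumes the requests are sent through the requests/urllib3 stack, and `enable_http_debug` is a name made up for this example.

```python
import http.client
import logging


def enable_http_debug() -> None:
    """Turn on verbose logging for the HTTP stack used by requests/urllib3."""
    # Prints the raw "send:" / "reply:" / "header:" lines to stdout.
    http.client.HTTPConnection.debuglevel = 1
    # Attach a default handler to the root logger (no-op if one already exists).
    logging.basicConfig()
    # Let DEBUG records from these libraries through to that handler.
    for name in ("urllib3", "pystac_client"):
        logging.getLogger(name).setLevel(logging.DEBUG)


enable_http_debug()
```

Call this once before running the search, then copy the output for the first two requests.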

My guess was read_json()

I will try to be more clear.

http://localhost:20008/search?limit=2&collections=test-collection

will output a response where the next link is like this:

{
  "rel":"next",
  "type":"application/json",
  "method":"GET",
  "href":"http://localhost:20008/search?limit=1&collections=test-collection&page=1"
}

If I run the following and print the response, I get something different:

catalog = Client.open(url='http://localhost:20008')
my_search = catalog.search(collections='test-collection', limit=1)

for page in my_search.pages_as_dicts():
    print(my_search.url_with_parameters())
    # -> http://localhost:20008/search?limit=1&collections=test-collection
    print(page['links'])

The page['links'] will output a response where the next link is this:

{
   "rel":"next",
   "type":"application/json",
   "method":"POST",
   "href":"http://localhost:20008/search",
   "body":{
      "limit":2,
      "collections":[
         "test-collection"
      ],
      "token":1
   }
}

The point is that the loop never stops.

DEBUG

. . .
REQUEST 0

DEBUG:pystac_client.stac_api_io:POST http://localhost:20008/search Headers: {'User-Agent': 'python-requests/2.31.0', 'Accept-Encoding': 'gzip, deflate, br', 'Accept': '*/*', 'Connection': 'keep-alive', 'Content-Length': '60', 'Content-Type': 'application/json'} Payload: {"limit": 1, "collections": ["test-collection"], "token": 1}
send: b'POST /search HTTP/1.1\r\nHost: localhost:20008\r\nUser-Agent: python-requests/2.31.0\r\nAccept-Encoding: gzip, deflate, br\r\nAccept: */*\r\nConnection: keep-alive\r\nContent-Length: 60\r\nContent-Type: application/json\r\n\r\n'
send: b'{"limit": 1, "collections": ["test-collection"], "token": 1}'
reply: 'HTTP/1.1 200 OK\r\n'
header: date: Wed, 22 Nov 2023 16:17:30 GMT
header: server: uvicorn
header: content-length: 1509
header: content-type: application/geo+json
header: content-encoding: br
header: vary: Accept-Encoding
DEBUG:urllib3.connectionpool:http://localhost:20008 "POST /search HTTP/1.1" 200 1509

REQUEST 1

DEBUG:pystac_client.stac_api_io:POST http://localhost:20008/search Headers: {'User-Agent': 'python-requests/2.31.0', 'Accept-Encoding': 'gzip, deflate, br', 'Accept': '*/*', 'Connection': 'keep-alive', 'Content-Length': '48', 'Content-Type': 'application/json'} Payload: {"limit": 1, "collections": ["test-collection"]}
send: b'POST /search HTTP/1.1\r\nHost: localhost:20008\r\nUser-Agent: python-requests/2.31.0\r\nAccept-Encoding: gzip, deflate, br\r\nAccept: */*\r\nConnection: keep-alive\r\nContent-Length: 48\r\nContent-Type: application/json\r\n\r\n'
send: b'{"limit": 1, "collections": ["test-collection"]}'
reply: 'HTTP/1.1 200 OK\r\n'
header: date: Wed, 22 Nov 2023 16:17:33 GMT
header: server: uvicorn
header: content-length: 1509
header: content-type: application/geo+json
header: content-encoding: br
header: vary: Accept-Encoding
DEBUG:urllib3.connectionpool:http://localhost:20008 "POST /search HTTP/1.1" 200 1509
<Item id=test-item-1>
DEBUG:pystac_client.stac_api_io:POST http://localhost:20008/search Headers: {'User-Agent': 'python-requests/2.31.0', 'Accept-Encoding': 'gzip, deflate, br', 'Accept': '*/*', 'Connection': 'keep-alive', 'Content-Length': '60', 'Content-Type': 'application/json'} Payload: {"limit": 1, "collections": ["test-collection"], "token": 1}
send: b'POST /search HTTP/1.1\r\nHost: localhost:20008\r\nUser-Agent: python-requests/2.31.0\r\nAccept-Encoding: gzip, deflate, br\r\nAccept: */*\r\nConnection: keep-alive\r\nContent-Length: 60\r\nContent-Type: application/json\r\n\r\n'
send: b'{"limit": 1, "collections": ["test-collection"], "token": 1}'
reply: 'HTTP/1.1 200 OK\r\n'
header: date: Wed, 22 Nov 2023 16:17:33 GMT
header: server: uvicorn
header: content-length: 1509
header: content-type: application/geo+json
header: content-encoding: br
header: vary: Accept-Encoding
DEBUG:urllib3.connectionpool:http://localhost:20008 "POST /search HTTP/1.1" 200 1509
<Item id=test-item-1>
..  ..  .. (infinite loop).. .. ..
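
While debugging, a client-side guard can at least turn the endless loop into an error. The sketch below is an illustration under assumptions: `guarded_pages` is a hypothetical wrapper, and it treats a repeated rel="next" link as the sign of a loop.

```python
import json
from typing import Any, Dict, Iterable, Iterator, Set


def guarded_pages(pages: Iterable[Dict[str, Any]]) -> Iterator[Dict[str, Any]]:
    """Yield pages, raising if the same rel="next" link shows up twice."""
    seen: Set[str] = set()
    for page in pages:
        yield page
        next_link = next(
            (link for link in page.get("links", []) if link.get("rel") == "next"),
            None,
        )
        if next_link is None:
            continue
        # Serialize the link so href and body both take part in the comparison.
        fingerprint = json.dumps(next_link, sort_keys=True)
        if fingerprint in seen:
            raise RuntimeError(f"server repeated the next link: {fingerprint}")
        seen.add(fingerprint)
```

It would wrap the existing iterator, for example `for page in guarded_pages(my_search.pages_as_dicts()): ...`.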

This is a problem with your server. pages_as_dicts does not modify the links attribute in any way:

pystac-client/pystac_client/item_search.py

Lines 725 to 749 in 4ea6dac

    
    def pages_as_dicts(self) -> Iterator[Dict[str, Any]]:
        """Iterator that yields :class:`dict` instances for each page
        of results from the search.

        Yields:
            Dict : a group of items matching the search
            criteria as a feature-collection-like dictionary.
        """
        if isinstance(self._stac_io, StacApiIO):
            num_items = 0
            for page in self._stac_io.get_pages(
                self.url, self.method, self.get_parameters()
            ):
                call_modifier(self.modifier, page)
                features = page.get("features", [])
                if features:
                    num_items += len(features)
                    if self._max_items and num_items > self._max_items:
                        # Slice the features down to make sure we hit max_items
                        page["features"] = features[0 : -(num_items - self._max_items)]
                    yield page
                    if self._max_items and num_items >= self._max_items:
                        return
                else:
                    return

Closing as not-an-issue-with-pystac-client, please re-open if you find otherwise.
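
For anyone landing here with the same symptom, the server-side behaviour that lets the client loop end can be sketched as follows. This is a toy, not the reporter's server: `search`, `ITEMS`, and the use of an integer offset as `token` are assumptions chosen for illustration.

```python
from typing import Any, Dict, List

ITEMS: List[str] = ["test-item-1", "test-item-2", "test-item-3"]  # toy data


def search(limit: int, token: int = 0) -> Dict[str, Any]:
    """Return one page; `token` is the offset of the first item on the page."""
    features = ITEMS[token:token + limit]
    links: List[Dict[str, Any]] = []
    next_token = token + limit
    if next_token < len(ITEMS):
        # Only advertise "next" while items remain, and advance the token.
        links.append({
            "rel": "next",
            "method": "POST",
            "href": "http://localhost:20008/search",
            "body": {"limit": limit, "token": next_token},
        })
    return {"type": "FeatureCollection", "features": features, "links": links}


# Walk the pages the way a client would: resend whatever body the link carries.
page = search(limit=1)
ids = list(page["features"])
while page["links"]:
    page = search(**page["links"][0]["body"])
    ids += page["features"]
print(ids)
```

The two properties to check on a real server are that each "next" link carries a token different from the one just used, and that the last page has no "next" link.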
