Granule search performance is significantly slower relative to direct HTTP requests
CMR Granule Search Performance Comparison
The performance of the maap-py library for CMR granule searches of moderate size appears to be rather poor.
Here we compare maap-py against direct HTTP requests for CMR granule searches of varying sizes. We compare the absolute times for a range of request sizes, as well as how those times grow with request size.
!pip install profilehooks
import urllib.parse
from typing import Any, Callable, Mapping, Sequence, TypeVar
from typing_extensions import TypedDict

import requests
from maap.maap import Collection, Granule, MAAP
from profilehooks import timecall

T = TypeVar('T')

class UMMSearchResponse(TypedDict):
    meta: Mapping[str, Any]
    items: Sequence[Mapping[str, Any]]

def for_each(f: Callable[[T], None], xs: Sequence[T]) -> None:
    for x in xs:
        f(x)
Define Searching Functions
So that we can capture timings using a simple function-timing decorator, we define functions for comparison:
@timecall
def maap_find_granules(cmr_host: str, **kwargs) -> Sequence[Granule]:
    return maap.searchGranule(cmr_host=cmr_host, **kwargs)

def maap_find_granules_by_doi(
    cmr_host: str, *, doi: str
) -> Callable[..., Sequence[Granule]]:
    # Look up the collection once, then reuse its concept ID for every granule search.
    collection = maap.searchCollection(cmr_host=cmr_host, doi=doi, limit=1)[0]
    collection_concept_id = collection['concept-id']

    def find_granules(**kwargs) -> Sequence[Granule]:
        return maap_find_granules(
            cmr_host,
            collection_concept_id=collection_concept_id,
            **kwargs
        )

    return find_granules
@timecall
def http_find_granules(cmr_host: str, params: Mapping[str, Any], **kwargs) -> UMMSearchResponse:
    # The trailing slash on the base URL matters: without it, urljoin replaces the
    # final "search" path segment instead of appending to it.
    url = urllib.parse.urljoin(f'https://{cmr_host}/search/', 'granules.umm_json')
    return requests.get(url, params=params, **kwargs).json()

def http_find_granules_by_doi(cmr_host: str, *, doi: str) -> Callable[..., UMMSearchResponse]:
    # Look up the collection once, then reuse its concept ID for every granule search.
    url = f'https://{cmr_host}/search/collections.umm_json'
    r = requests.get(url, params={'doi': doi, 'page_size': 1})
    collection_concept_id = r.json()['items'][0]['meta']['concept-id']

    def find_granules(params: Mapping[str, Any], **kwargs) -> UMMSearchResponse:
        return http_find_granules(
            cmr_host,
            params={**params, 'collection_concept_id': collection_concept_id},
            **kwargs
        )

    return find_granules
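For anyone who wants to reproduce the timings without installing profilehooks, a small stand-in decorator along the following lines should behave similarly. This is only a sketch: simple_timecall and busy_sum are hypothetical names, and the printed format only approximates what timecall prints.

```python
import functools
import time
from typing import Any, Callable, TypeVar

F = TypeVar("F", bound=Callable[..., Any])


def simple_timecall(fn: F) -> F:
    """Print the wall-clock duration of each call to fn (hypothetical helper)."""

    @functools.wraps(fn)
    def wrapper(*args: Any, **kwargs: Any) -> Any:
        start = time.perf_counter()
        try:
            return fn(*args, **kwargs)
        finally:
            # Report the elapsed time even if fn raises.
            elapsed = time.perf_counter() - start
            print(f"{fn.__name__}: {elapsed:.3f} seconds")

    return wrapper  # type: ignore[return-value]


@simple_timecall
def busy_sum(n: int) -> int:
    """Toy workload used only to demonstrate the decorator."""
    return sum(range(n))


result = busy_sum(1_000_000)
```

Swapping @timecall for @simple_timecall on the search functions above should give comparable wall-clock numbers, since both wrap a single call.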
Find GEDI L4A Granules
We'll use the GEDI L4A collection for our granule searches.
nasa_cmr_host = 'cmr.earthdata.nasa.gov'
maap_cmr_host = 'cmr.maap-project.org'
gedi_l4a_doi = '10.3334/ORNLDAAC/1986'
maap = MAAP('api.ops.maap-project.org')
# Search size limits
sizes = [62, 125, 250, 500, 1000, 2000]
MAAP Using MAAP OPS CMR Host
maap_find_gedi_l4a_granules = maap_find_granules_by_doi(maap_cmr_host, doi=gedi_l4a_doi)
for_each(lambda limit: maap_find_gedi_l4a_granules(limit=limit), sizes)
maap_find_granules (/tmp/ipykernel_25672/436560453.py:1):
0.471 seconds
maap_find_granules (/tmp/ipykernel_25672/436560453.py:1):
0.779 seconds
maap_find_granules (/tmp/ipykernel_25672/436560453.py:1):
1.220 seconds
maap_find_granules (/tmp/ipykernel_25672/436560453.py:1):
2.787 seconds
maap_find_granules (/tmp/ipykernel_25672/436560453.py:1):
5.514 seconds
maap_find_granules (/tmp/ipykernel_25672/436560453.py:1):
12.097 seconds
MAAP Using NASA OPS CMR Host
maap_find_gedi_l4a_granules = maap_find_granules_by_doi(nasa_cmr_host, doi=gedi_l4a_doi)
for_each(lambda limit: maap_find_gedi_l4a_granules(limit=limit), sizes)
maap_find_granules (/tmp/ipykernel_25672/436560453.py:1):
0.832 seconds
maap_find_granules (/tmp/ipykernel_25672/436560453.py:1):
1.374 seconds
maap_find_granules (/tmp/ipykernel_25672/436560453.py:1):
2.926 seconds
maap_find_granules (/tmp/ipykernel_25672/436560453.py:1):
5.043 seconds
maap_find_granules (/tmp/ipykernel_25672/436560453.py:1):
10.924 seconds
maap_find_granules (/tmp/ipykernel_25672/436560453.py:1):
22.387 seconds
Direct HTTP Using MAAP OPS CMR Host
http_find_gedi_l4a_granules = http_find_granules_by_doi(maap_cmr_host, doi=gedi_l4a_doi)
for_each(lambda size: http_find_gedi_l4a_granules(params={'page_size': size}), sizes)
http_find_granules (/tmp/ipykernel_25672/436560453.py:27):
0.214 seconds
http_find_granules (/tmp/ipykernel_25672/436560453.py:27):
0.383 seconds
http_find_granules (/tmp/ipykernel_25672/436560453.py:27):
0.752 seconds
http_find_granules (/tmp/ipykernel_25672/436560453.py:27):
1.083 seconds
http_find_granules (/tmp/ipykernel_25672/436560453.py:27):
2.894 seconds
http_find_granules (/tmp/ipykernel_25672/436560453.py:27):
4.270 seconds
Direct HTTP Using NASA OPS CMR Host
http_find_gedi_l4a_granules = http_find_granules_by_doi(nasa_cmr_host, doi=gedi_l4a_doi)
for_each(lambda size: http_find_gedi_l4a_granules(params={'page_size': size}), sizes)
http_find_granules (/tmp/ipykernel_25672/436560453.py:27):
0.405 seconds
http_find_granules (/tmp/ipykernel_25672/436560453.py:27):
0.914 seconds
http_find_granules (/tmp/ipykernel_25672/436560453.py:27):
1.533 seconds
http_find_granules (/tmp/ipykernel_25672/436560453.py:27):
1.632 seconds
http_find_granules (/tmp/ipykernel_25672/436560453.py:27):
2.718 seconds
http_find_granules (/tmp/ipykernel_25672/436560453.py:27):
4.577 seconds
Summary
- Using maap-py with the MAAP CMR host is nearly twice as fast as using it with the NASA CMR host, which is perhaps surprising, given that the NASA CMR system is much larger.
- The time taken by maap-py grows roughly linearly (perhaps slightly worse) with the size of the request.
- maap-py is far slower than direct HTTP requests: at a limit of 2000, it is roughly 3 times slower with the MAAP CMR host (12.1 s vs. 4.3 s) and roughly 5 times slower with the NASA CMR host (22.4 s vs. 4.6 s).
- With direct HTTP requests, the difference in performance between the MAAP CMR host and the NASA CMR host appears negligible (unlike the difference between the two hosts when using maap-py).
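The growth rates and relative slowdowns behind these points can be checked directly from the timings printed above. The snippet below is a rough sketch that simply re-enters those numbers by hand; growth_factors and slowdown are hypothetical helper names, and the figures come from single runs, so they should be read as indicative only.

```python
from typing import Dict, List

sizes = [62, 125, 250, 500, 1000, 2000]

# Timings in seconds, copied from the outputs above (one run each).
timings: Dict[str, List[float]] = {
    "maap-py / MAAP CMR": [0.471, 0.779, 1.220, 2.787, 5.514, 12.097],
    "maap-py / NASA CMR": [0.832, 1.374, 2.926, 5.043, 10.924, 22.387],
    "HTTP / MAAP CMR": [0.214, 0.383, 0.752, 1.083, 2.894, 4.270],
    "HTTP / NASA CMR": [0.405, 0.914, 1.533, 1.632, 2.718, 4.577],
}


def growth_factors(times: List[float]) -> List[float]:
    """Ratio of each timing to the previous one (sizes roughly double each step)."""
    return [later / earlier for earlier, later in zip(times, times[1:])]


def slowdown(slow: List[float], fast: List[float]) -> List[float]:
    """Element-wise ratio of two timing series measured at the same sizes."""
    return [s / f for s, f in zip(slow, fast)]


for label, times in timings.items():
    factors = ", ".join(f"{g:.2f}" for g in growth_factors(times))
    print(f"{label:20s} growth per doubling: {factors}")

maap_host_slowdown = slowdown(timings["maap-py / MAAP CMR"], timings["HTTP / MAAP CMR"])
nasa_host_slowdown = slowdown(timings["maap-py / NASA CMR"], timings["HTTP / NASA CMR"])

for size, m, n in zip(sizes, maap_host_slowdown, nasa_host_slowdown):
    print(f"limit={size:5d}  maap-py/HTTP: MAAP host {m:.2f}x, NASA host {n:.2f}x")
```

A growth factor near 2 per doubling of the request size corresponds to roughly linear scaling.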
NOTE: One obvious difference between the maap-py requests and the direct HTTP requests is that maap-py uses the ECHO-10 XML format and performs XML parsing, whereas the direct HTTP requests use JSON (UMM). JSON parsing is likely much faster than XML parsing, which might account for a significant portion of the performance difference.
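One way to get a rough feel for how much parsing alone could contribute is a micro-benchmark like the sketch below. It uses synthetic records with made-up fields, not real ECHO-10 or UMM metadata, and the standard-library parsers rather than whatever maap-py uses internally, so the result is only a ballpark.

```python
import json
import timeit
import xml.etree.ElementTree as ET

N = 2000  # number of synthetic records, matching the largest search above

# Synthetic stand-ins for granule metadata; these are NOT real ECHO-10 or UMM records.
records = [{"id": f"G{i}", "title": f"granule-{i}", "size": str(i)} for i in range(N)]

json_payload = json.dumps({"items": records})
xml_payload = "<results>" + "".join(
    f"<result><id>{r['id']}</id><title>{r['title']}</title><size>{r['size']}</size></result>"
    for r in records
) + "</results>"


def parse_json() -> int:
    """Parse the JSON payload and return the number of records found."""
    return len(json.loads(json_payload)["items"])


def parse_xml() -> int:
    """Parse the XML payload and return the number of records found."""
    return len(ET.fromstring(xml_payload).findall("result"))


# Time 20 parses of each payload; absolute numbers depend on the machine.
json_seconds = timeit.timeit(parse_json, number=20)
xml_seconds = timeit.timeit(parse_xml, number=20)
print(f"JSON: {json_seconds:.4f} s, XML: {xml_seconds:.4f} s for 20 parses of {N} records")
```

If parsing a few thousand records takes only milliseconds either way, parsing alone is unlikely to explain differences measured in seconds.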
I discovered why using maap-py to find granules is (by default) slower than making direct HTTP requests, at least when using maap-py within the ADE: the default page size is only 20. No matter how large limit is in the examples above, maap-py uses a page size of 20 (as configured in the maap.cfg file in the ADE).
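To put that in perspective, the number of page requests needed for each of the limits used above can be estimated as follows. This is simple arithmetic, assuming maap-py fetches one page per request until the limit is reached; requests_needed is a hypothetical helper name.

```python
import math

sizes = [62, 125, 250, 500, 1000, 2000]
DEFAULT_PAGE_SIZE = 20  # page size reported for maap.cfg in the ADE


def requests_needed(limit: int, page_size: int) -> int:
    """Number of page requests required to retrieve `limit` results."""
    return math.ceil(limit / page_size)


for limit in sizes:
    paged = requests_needed(limit, DEFAULT_PAGE_SIZE)
    single = requests_needed(limit, limit)
    print(f"limit={limit:5d}: {paged:3d} requests at page size {DEFAULT_PAGE_SIZE}, "
          f"{single} request at page size {limit}")
```

Under that assumption, a limit of 2000 costs 100 round trips with maap-py, compared to the single request made by the direct HTTP search with page_size=2000.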
However, overriding this page size requires modifying the setting in maap.cfg, which would affect all users within the ADE. Alternatively, the default maap.cfg file could be copied to the current directory and updated there, but this is problematic due to #29.
Further, since the page size is configured in maap.cfg, it cannot be specified on a per-request basis. All requests use the same page size, regardless of the limit specified for each.
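For illustration, a per-request page size could look something like the sketch below. This is not maap-py's API: search_with_page_size and fetch_page are hypothetical names. Here fetch_page stands in for whatever performs one paged CMR request, for example a requests.get call passing page_size and page_num as query parameters, and the upper bound of 2000 reflects CMR's documented maximum page size.

```python
from typing import Any, Callable, List, Mapping, Tuple

CMR_MAX_PAGE_SIZE = 2000  # CMR's documented upper bound for page_size

Page = List[Mapping[str, Any]]


def search_with_page_size(
    fetch_page: Callable[[int, int], Page],
    limit: int,
    page_size: int = 20,
) -> Page:
    """Collect up to `limit` items, requesting `page_size` items per page.

    fetch_page(page_num, page_size) is a hypothetical callable that returns
    one page of results (page numbers start at 1).
    """
    # Never ask for more per page than the limit or than CMR allows.
    page_size = max(1, min(page_size, limit, CMR_MAX_PAGE_SIZE))
    items: Page = []
    page_num = 1
    while len(items) < limit:
        page = fetch_page(page_num, page_size)
        items.extend(page)
        if len(page) < page_size:
            break  # a short page means there are no more results
        page_num += 1
    return items[:limit]


# Fake fetcher over 5000 pretend granules, recording each call it receives.
calls: List[Tuple[int, int]] = []


def fake_fetch(page_num: int, page_size: int) -> Page:
    calls.append((page_num, page_size))
    start = (page_num - 1) * page_size
    return [{"id": i} for i in range(start, min(start + page_size, 5000))]


granules = search_with_page_size(fake_fetch, limit=2000, page_size=2000)
print(f"{len(granules)} granules in {len(calls)} request(s)")
```

With the default page_size of 20, the same search would call fetch_page 100 times, which is the behavior described above.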
Closing this issue, as I've created other more specific issues to address the problem.