MAAP-Project/maap-py

Granule search performance is significantly slower relative to direct HTTP requests


CMR Granule Search Performance Comparison

The maap-py library appears to perform poorly for CMR granule searches of moderate size (tens to a few thousand granules).

Here we compare maap-py against direct HTTP requests for CMR granule searches of varying sizes. We compare the absolute times for a number of request sizes, as well as how those times grow with the size of the request.

!pip install profilehooks
import urllib.parse
from typing import Any, Callable, Mapping, Sequence, TypeVar
from typing_extensions import TypedDict

import requests
from maap.maap import Collection, Granule, MAAP
from profilehooks import timecall

T = TypeVar('T')

class UMMSearchResponse(TypedDict):
    meta: Mapping[str, Any]
    items: Sequence[Mapping[str, Any]]


def for_each(f: Callable[[T], None], xs: Sequence[T]) -> None:
    for x in xs:
        f(x)

Define Searching Functions

So that we can capture timings using a simple function-timing decorator, we define functions for comparison:

@timecall
def maap_find_granules(cmr_host: str, **kwargs) -> Sequence[Granule]:
    return maap.searchGranule(cmr_host=cmr_host, **kwargs)

def maap_find_granules_by_doi(
    cmr_host: str, *, doi: str
) -> Callable[..., Sequence[Granule]]:
    collection = maap.searchCollection(cmr_host=cmr_host, doi=doi, limit=1)[0]
    collection_concept_id = collection['concept-id']

    def find_granules(**kwargs) -> Sequence[Granule]:
        return maap_find_granules(
            cmr_host,
            collection_concept_id=collection_concept_id,
            **kwargs
        )
    
    return find_granules


@timecall
def http_find_granules(cmr_host: str, params: Mapping[str, Any], **kwargs) -> UMMSearchResponse:
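    # Build the URL directly: urljoin() would replace the final 'search' path
    # segment of a base URL that lacks a trailing slash.
    url = f'https://{cmr_host}/search/granules.umm_json'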
    return requests.get(url, params=params, **kwargs).json()


def http_find_granules_by_doi(cmr_host: str, *, doi: str) -> Callable[..., UMMSearchResponse]:
    url = f'https://{cmr_host}/search/collections.umm_json'
    r = requests.get(url, params={'doi': doi, 'page_size': 1})
    collection_concept_id = r.json()['items'][0]['meta']['concept-id']

    def find_granules(params: Mapping[str, Any], **kwargs) -> UMMSearchResponse:
        return http_find_granules(
            cmr_host,
            params={**params, 'collection_concept_id': collection_concept_id},
            **kwargs
        )
    
    return find_granules
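
For readers who do not have profilehooks installed, the idea behind a function-timing decorator can be sketched with the standard library alone. This is not how `timecall` is implemented; the `timed` decorator and the `slow_sum` function below are hypothetical names used only for illustration.

```python
import functools
import time
from typing import Any, Callable, List


def timed(timings: List[float]) -> Callable[[Callable[..., Any]], Callable[..., Any]]:
    """Hypothetical decorator factory: record each call's wall-clock time in `timings`."""

    def decorator(f: Callable[..., Any]) -> Callable[..., Any]:
        @functools.wraps(f)
        def wrapper(*args: Any, **kwargs: Any) -> Any:
            start = time.perf_counter()
            try:
                return f(*args, **kwargs)
            finally:
                # Record the duration even if the wrapped call raises.
                timings.append(time.perf_counter() - start)

        return wrapper

    return decorator


durations: List[float] = []


@timed(durations)
def slow_sum(n: int) -> int:
    # Stand-in workload; a real measurement would wrap the search call.
    return sum(range(n))


result = slow_sum(100_000)
print(f"slow_sum returned {result} in {durations[-1]:.6f} seconds")
```

Collecting the durations in a list, rather than only printing them, makes it easier to compare runs afterwards.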

Find GEDI L4A Granules

We'll use the GEDI L4A collection for our granule searches.

nasa_cmr_host = 'cmr.earthdata.nasa.gov'
maap_cmr_host = 'cmr.maap-project.org'
gedi_l4a_doi = '10.3334/ORNLDAAC/1986'

maap = MAAP('api.ops.maap-project.org')

# Search size limits
sizes = [62, 125, 250, 500, 1000, 2000]

MAAP Using MAAP OPS CMR Host

maap_find_gedi_l4a_granules = maap_find_granules_by_doi(maap_cmr_host, doi=gedi_l4a_doi)

for_each(lambda limit: maap_find_gedi_l4a_granules(limit=limit), sizes)
  maap_find_granules (/tmp/ipykernel_25672/436560453.py:1):
    0.471 seconds

  maap_find_granules (/tmp/ipykernel_25672/436560453.py:1):
    0.779 seconds

  maap_find_granules (/tmp/ipykernel_25672/436560453.py:1):
    1.220 seconds

  maap_find_granules (/tmp/ipykernel_25672/436560453.py:1):
    2.787 seconds

  maap_find_granules (/tmp/ipykernel_25672/436560453.py:1):
    5.514 seconds

  maap_find_granules (/tmp/ipykernel_25672/436560453.py:1):
    12.097 seconds

MAAP Using NASA OPS CMR Host

maap_find_gedi_l4a_granules = maap_find_granules_by_doi(nasa_cmr_host, doi=gedi_l4a_doi)

for_each(lambda limit: maap_find_gedi_l4a_granules(limit=limit), sizes)
  maap_find_granules (/tmp/ipykernel_25672/436560453.py:1):
    0.832 seconds

  maap_find_granules (/tmp/ipykernel_25672/436560453.py:1):
    1.374 seconds

  maap_find_granules (/tmp/ipykernel_25672/436560453.py:1):
    2.926 seconds

  maap_find_granules (/tmp/ipykernel_25672/436560453.py:1):
    5.043 seconds

  maap_find_granules (/tmp/ipykernel_25672/436560453.py:1):
    10.924 seconds

  maap_find_granules (/tmp/ipykernel_25672/436560453.py:1):
    22.387 seconds

Direct HTTP Using MAAP OPS CMR Host

http_find_gedi_l4a_granules = http_find_granules_by_doi(maap_cmr_host, doi=gedi_l4a_doi)

for_each(lambda size: http_find_gedi_l4a_granules(params={'page_size': size}), sizes)
  http_find_granules (/tmp/ipykernel_25672/436560453.py:27):
    0.214 seconds

  http_find_granules (/tmp/ipykernel_25672/436560453.py:27):
    0.383 seconds

  http_find_granules (/tmp/ipykernel_25672/436560453.py:27):
    0.752 seconds

  http_find_granules (/tmp/ipykernel_25672/436560453.py:27):
    1.083 seconds

  http_find_granules (/tmp/ipykernel_25672/436560453.py:27):
    2.894 seconds

  http_find_granules (/tmp/ipykernel_25672/436560453.py:27):
    4.270 seconds

Direct HTTP Using NASA OPS CMR Host

http_find_gedi_l4a_granules = http_find_granules_by_doi(nasa_cmr_host, doi=gedi_l4a_doi)

for_each(lambda size: http_find_gedi_l4a_granules(params={'page_size': size}), sizes)
  http_find_granules (/tmp/ipykernel_25672/436560453.py:27):
    0.405 seconds

  http_find_granules (/tmp/ipykernel_25672/436560453.py:27):
    0.914 seconds

  http_find_granules (/tmp/ipykernel_25672/436560453.py:27):
    1.533 seconds

  http_find_granules (/tmp/ipykernel_25672/436560453.py:27):
    1.632 seconds

  http_find_granules (/tmp/ipykernel_25672/436560453.py:27):
    2.718 seconds

  http_find_granules (/tmp/ipykernel_25672/436560453.py:27):
    4.577 seconds

Summary

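  • Using maap-py with the MAAP CMR host is nearly twice as fast as using it with the NASA CMR host (12.1 s vs. 22.4 s for 2000 granules), which is perhaps surprising, given that the NASA CMR system is much larger
  • The time taken by maap-py grows roughly linearly (perhaps slightly worse) with the size of the request
  • maap-py is far slower than direct HTTP requests: for 2000 granules, it takes roughly 2.8 times as long on the MAAP CMR host (12.1 s vs. 4.3 s) and roughly 4.9 times as long on the NASA CMR host (22.4 s vs. 4.6 s)
  • With direct HTTP requests, the difference between the MAAP CMR host and the NASA CMR host appears negligible for the larger requests (4.3 s vs. 4.6 s for 2000 granules), although the NASA host is noticeably slower for the smaller ones; this is unlike maap-py, where the gap between the 2 hosts persists at every size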
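
The growth-rate claims in the summary can be checked with a little arithmetic on the timings reported above. The sketch below copies those timings by hand (a single run each, so the numbers are noisy) and prints the ratio between consecutive request sizes; since each size is roughly double the previous one, a ratio near 2 suggests roughly linear growth. The helper name `growth_ratios` is an assumption made for this example.

```python
from typing import Dict, List

sizes = [62, 125, 250, 500, 1000, 2000]

# Wall-clock times in seconds, copied from the timing output above.
timings: Dict[str, List[float]] = {
    "maap-py, MAAP CMR": [0.471, 0.779, 1.220, 2.787, 5.514, 12.097],
    "maap-py, NASA CMR": [0.832, 1.374, 2.926, 5.043, 10.924, 22.387],
    "HTTP, MAAP CMR": [0.214, 0.383, 0.752, 1.083, 2.894, 4.270],
    "HTTP, NASA CMR": [0.405, 0.914, 1.533, 1.632, 2.718, 4.577],
}


def growth_ratios(times: List[float]) -> List[float]:
    """Ratio of each time to the time for the previous (roughly half-sized) request."""
    return [later / earlier for earlier, later in zip(times, times[1:])]


for label, times in timings.items():
    ratios = ", ".join(f"{r:.2f}" for r in growth_ratios(times))
    print(f"{label:18s} step ratios: {ratios}")

# Slowdown of maap-py relative to direct HTTP for the largest request (2000 granules).
slowdown_maap = timings["maap-py, MAAP CMR"][-1] / timings["HTTP, MAAP CMR"][-1]
slowdown_nasa = timings["maap-py, NASA CMR"][-1] / timings["HTTP, NASA CMR"][-1]
print(f"maap-py vs. HTTP at {sizes[-1]} granules: "
      f"{slowdown_maap:.2f}x (MAAP CMR), {slowdown_nasa:.2f}x (NASA CMR)")
```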

NOTE: One obvious difference between the two approaches is that maap-py requests the ECHO-10 XML format and performs XML parsing, whereas the direct HTTP requests use JSON (UMM). JSON parsing is likely much faster than XML parsing, which might account for a significant portion of the performance difference.
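
One way to test the parsing hypothesis in isolation, without any network traffic, is a small micro-benchmark. The sketch below uses synthetic records with made-up fields, not real ECHO-10 or UMM documents (which are far richer), and it uses `xml.etree.ElementTree`, which may not be the parser maap-py uses. Treat any result as indicative only.

```python
import json
import timeit
import xml.etree.ElementTree as ET
from typing import Any, Dict, List

N = 2000  # number of synthetic granule records

# Synthetic records only; field names are invented for this comparison.
records = [{"id": f"G{i}", "title": f"granule-{i}", "size_mb": i * 0.5} for i in range(N)]

json_text = json.dumps({"items": records})
xml_text = "<results>" + "".join(
    f"<result><id>{r['id']}</id><title>{r['title']}</title>"
    f"<size_mb>{r['size_mb']}</size_mb></result>"
    for r in records
) + "</results>"


def parse_json(text: str) -> List[Dict[str, Any]]:
    return json.loads(text)["items"]


def parse_xml(text: str) -> List[Dict[str, Any]]:
    root = ET.fromstring(text)
    # Flatten each <result> element into a dict of tag -> text.
    return [{child.tag: child.text for child in result} for result in root]


repeats = 20
json_seconds = timeit.timeit(lambda: parse_json(json_text), number=repeats)
xml_seconds = timeit.timeit(lambda: parse_xml(xml_text), number=repeats)
print(f"JSON: {json_seconds / repeats:.4f} s per parse of {N} records")
print(f"XML:  {xml_seconds / repeats:.4f} s per parse of {N} records")
```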

I discovered why using maap-py for finding granules is (by default) slower than making direct HTTP requests, at least when using maap-py within the ADE: the default page size is only 20. No matter how large limit is in the examples above, maap-py uses a default page size of 20 (as configured in the maap.cfg file in the ADE).
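
To see why a page size of 20 matters, consider how many HTTP round trips each search needs. The sketch below assumes pages are fetched one after another until `limit` granules have been collected; the function name `requests_needed` is introduced here for illustration only.

```python
import math

sizes = [62, 125, 250, 500, 1000, 2000]
DEFAULT_PAGE_SIZE = 20  # page size reported for maap.cfg in the ADE


def requests_needed(limit: int, page_size: int) -> int:
    """Number of pages required to collect `limit` results at `page_size` per page."""
    return math.ceil(limit / page_size)


for limit in sizes:
    paged = requests_needed(limit, DEFAULT_PAGE_SIZE)
    single = requests_needed(limit, limit)
    print(f"limit={limit:4d}: {paged:3d} requests at page size {DEFAULT_PAGE_SIZE}, "
          f"{single} request at page size {limit}")
```

Under that assumption, the largest search above needs 100 round trips with a page size of 20, whereas the direct HTTP version retrieved the same number of granules in a single request.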

However, in order to override this page size, the setting in maap.cfg must be modified, which would affect all users within the ADE. Alternatively, the default maap.cfg file could be copied to the current directory and updated there, but this is problematic due to #29.

Further, since the page size is configured in maap.cfg, the page size cannot be specified on a per-request basis. All requests, regardless of the limit specified for a request, use the same page size.
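
A per-request page size could look something like the following sketch. It is not maap-py's API: `search_with_page_size` and `fake_fetch_page` are hypothetical, and the page fetcher is injected as a plain callable so the paging logic can be exercised without a network connection. In a real client, the fetcher would issue the HTTP request with the corresponding paging parameters.

```python
from typing import Any, Callable, Dict, List, Tuple

# Hypothetical signature: given (page_num, page_size), return that page's items.
FetchPage = Callable[[int, int], List[Dict[str, Any]]]


def search_with_page_size(
    fetch_page: FetchPage, limit: int, page_size: int = 20
) -> List[Dict[str, Any]]:
    """Collect up to `limit` items, requesting `page_size` items per call."""
    if limit <= 0 or page_size <= 0:
        return []

    items: List[Dict[str, Any]] = []
    page_num = 1
    while len(items) < limit:
        # Keep the page size constant so that page offsets stay consistent.
        page = fetch_page(page_num, page_size)
        items.extend(page)
        if len(page) < page_size:
            break  # a short page means there are no more results
        page_num += 1
    return items[:limit]


calls: List[Tuple[int, int]] = []


def fake_fetch_page(page_num: int, page_size: int) -> List[Dict[str, Any]]:
    """Stand-in for an HTTP call: serves pages from 5000 synthetic granules."""
    calls.append((page_num, page_size))
    start = (page_num - 1) * page_size
    return [{"concept-id": f"G{i}"} for i in range(start, min(start + page_size, 5000))]


granules = search_with_page_size(fake_fetch_page, limit=2000, page_size=1000)
print(f"{len(granules)} granules in {len(calls)} requests")
```

Letting the caller choose the page size alongside `limit` would avoid having to edit a shared configuration file just to speed up one search.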

Closing this issue, as I've created other more specific issues to address the problem.