nsidc/earthaccess

Incorrect granule sizes

betolink opened this issue · 5 comments

earthaccess is reporting incorrect granule sizes because we are not correctly parsing the UMM path that should contain the size. Usually a granule maps to a single file, and if the granule metadata reports the size in MB, everything works as expected. However, if the granule contains multiple files and/or the units are not MB, the reported size will be incorrect.

This issue was reported by David Giles

Example:

{
   "DataGranule":{
      "ArchiveAndDistributionInformation":[
         {
            "Name":"CAL_LID_L2_VFM-Standard-V4-51.2022-01-15T23-28-52ZD.hdf"
         },
         {
            "Checksum":{
               "Algorithm":"MD5",
               "Value":"45bf3cf50f837a0db6350c3c6bcd3356"
            },
            "Name":"CAL_LID_L2_VFM-Standard-V4-51.2022-01-15T23-28-52ZD.hdf.met",
            "Size":"8.2265625",
            "SizeUnit":"KB"
         },
         {
            "Checksum":{
               "Algorithm":"MD5",
               "Value":"9d2bbf8e8fa88c2b105da6b7a9940093"
            },
            "Name":"CAL_LID_L2_VFM-Standard-V4-51.2022-01-15T23-28-52ZD.hdf",
            "Size":"47.05515384674072",
            "SizeUnit":"MB"
         }
      ]
   }
}

This granule is a great example: it contains multiple files with different units. The correct size should be ~47 MB + 8 KB.

The method that needs to be updated is:

def size(self) -> float:
We should also handle cases where the size information is not present in the metadata.
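
For illustration, here is a rough sketch of what a summing implementation could look like. The self["umm"] access path and the power-of-1024 unit multipliers are assumptions, not settled choices:

# Rough sketch only: assumes the UMM record is reachable via self["umm"] and
# that unit strings convert with power-of-1024 multipliers (result in MB).
UNIT_TO_MB = {"B": 1 / 1024**2, "KB": 1 / 1024, "MB": 1, "GB": 1024, "TB": 1024**2}

def size(self) -> float:
    """Total granule size in MB, summed over all files that report a size."""
    files = (
        self["umm"]
        .get("DataGranule", {})
        .get("ArchiveAndDistributionInformation", [])
    )
    total_mb = 0.0
    for f in files:
        if "Size" not in f:
            continue  # no size information for this file: skip rather than fail
        # Assumption: treat an unknown or missing unit as MB for now.
        total_mb += float(f["Size"]) * UNIT_TO_MB.get(f.get("SizeUnit"), 1)
    return total_mb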

Unfortunately, this cannot be computed unambiguously because the supplied size values are not necessarily consistently computed. For example, some providers might compute an MB value as bytes / 1000 / 1000, whereas others might compute it as bytes / 1024 / 1024, so we have no way of knowing whether to multiply by 1000^2 or 1024^2 to get the number of bytes.

Unfortunately, this was a poor design decision in the UMM; it should simply have been designed such that the size reported in the metadata is always in bytes (which also avoids rounding errors, even when we know for sure what to multiply by). This is why the UMM was later modified to include a SizeInBytes metadata value.

UMM-G v1.6 added SizeInBytes. See the description in the schema, which describes exactly this problem.

With that said, I don't currently have a suggestion for a sensible solution to this.

Further, even if the above were not the case (i.e., even without any ambiguity), I'm not sure that computing the "size" as the sum of individual sizes makes sense. I suppose it might make sense if you want to know the total volume that would be downloaded if all files in the granule were downloaded, but I'm not sure that's a common use case.

Even if that is a common use case, I would think we should also include a mechanism for users to obtain individual file sizes as well (again, ignoring the ambiguity mentioned above).

One path to explore for the size ambiguity might be to provide some sort of size_hint method. When a granule does not specify a SizeInBytes value (which is unambiguous), size_hint could assume power-of-2 multipliers (e.g., KB=1024, MB=1024^2, etc.), which would possibly overestimate sizes; that is perhaps better than underestimating by using powers of 1000. This would be similar to Python's own __length_hint__ vs. __len__ (see https://docs.python.org/3/reference/datamodel.html#object.__length_hint__).
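
Concretely, I'm imagining something like this sketch (the multipliers and the preference for SizeInBytes are the assumptions being proposed, not existing earthaccess behavior):

# Sketch: prefer the unambiguous SizeInBytes field where present; otherwise
# assume power-of-2 multipliers so the hint errs toward overestimating.
BYTE_MULTIPLIERS = {"B": 1, "KB": 1024, "MB": 1024**2, "GB": 1024**3, "TB": 1024**4}

def size_hint(self) -> int:
    """Best-effort total size of the granule's files, in bytes."""
    files = (
        self["umm"]
        .get("DataGranule", {})
        .get("ArchiveAndDistributionInformation", [])
    )
    total = 0
    for f in files:
        if "SizeInBytes" in f:  # UMM-G >= 1.6: unambiguous
            total += int(f["SizeInBytes"])
        elif "Size" in f:
            # If no unit is reported, pass the value through as if it were bytes.
            total += int(float(f["Size"]) * BYTE_MULTIPLIERS.get(f.get("SizeUnit"), 1))
    return total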

In addition, we might want to provide some sort of size_in_units method that does no computation (i.e., makes no assumptions), simply returning perhaps a tuple of (float, str), where the first value is the float of the size attribute, and the second value is the sizeunit string, so the user can then choose how they wish to deal with it. For example (taking from your sample metadata above): (47.05515384674072, "MB")
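
Roughly like this (again just a sketch; whether it should return one pair per granule or one per file is part of what we'd need to decide):

# Sketch: return the raw (Size, SizeUnit) pairs, one per file, with no
# conversion or assumptions, leaving interpretation to the caller.
def size_in_units(self) -> list[tuple[float, str]]:
    files = (
        self["umm"]
        .get("DataGranule", {})
        .get("ArchiveAndDistributionInformation", [])
    )
    return [(float(f["Size"]), f.get("SizeUnit", "")) for f in files if "Size" in f]

For the sample metadata above, that would yield [(8.2265625, "KB"), (47.05515384674072, "MB")].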

What about having size_hint() as a fallback? In the example above we could take the reported units and values into account to sum the files in the granule. If there is only a value like 10203445 with no reported unit, then yes, we can just pass it along as is.


I think what we must first do is clearly define the use cases and requirements around the use of any type of size "computation" we want to support. Without gaining some clarity around what we want/need, there's little sense in discussing how to implement anything.

What specifically do we want to support/provide through a size method/function and any potentially related methods/functions, such as perhaps size_in_units?

How does Earthdata Search currently handle granule size estimation? I know that they provide an estimated size upon ordering (see screenshot). Maybe we could leverage their work? https://github.com/nasa/earthdata-search
[Screenshot: Earthdata Search showing the estimated project size at ordering]

Great suggestion @asteiker! This is what they say:

This is the estimated overall size of your project. If no size
information exists in a granule's metadata, it will not be
included in this number. The size is estimated based upon the
first 20 granules added to your project from each collection.

And they seem to convert units into a common unit: https://github.com/nasa/earthdata-search/blob/619d533e53906550ed6428162c25b4878d858768/static/src/js/util/project.js#L8 (there is more code).

So I think we should follow similar logic, for consistency with what users see in the NASA portal. Maybe we can be even more accurate with the size when we have that data available. This also relates to a conversation we had, @chuckwondo, about lazy loading of the results; I don't remember very well whether we covered using a "resultset" class where we could paginate the results from CMR, etc. For now, I think we should implement the following:

If a granule has complete metadata on size and units, we should sum them up and report the total to the user via granule.size(). If a granule has incomplete metadata, we should perhaps only pass the data through as is (tuples like you mentioned), or we could implement a size_hint(). What do you all think? cc @jhkennedy
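
To make that concrete, the decision logic could look roughly like the sketch below. It builds on the size_hint() and UNIT_TO_MB sketches earlier in this thread, none of which is existing earthaccess code; the names and behavior are all up for discussion.

# Sketch of the proposed behavior; UNIT_TO_MB and size_hint() refer to the
# earlier sketches in this thread, not to existing earthaccess code.
def size(self) -> float:
    """Total size in MB when size metadata is complete, else a best-effort hint."""
    files = (
        self["umm"]
        .get("DataGranule", {})
        .get("ArchiveAndDistributionInformation", [])
    )
    sized = [f for f in files if "Size" in f]
    if sized and all(f.get("SizeUnit") in UNIT_TO_MB for f in sized):
        # Complete metadata: convert each file to a common unit (MB) and sum,
        # similar to what Earthdata Search does.
        return sum(float(f["Size"]) * UNIT_TO_MB[f["SizeUnit"]] for f in sized)
    # Incomplete metadata: fall back to the hint (bytes), converted to MB with
    # the same power-of-2 assumption, or expose the raw tuples instead.
    return self.size_hint() / 1024**2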