asdf-format/asdf

Investigate returning `ndarray` when `lazy_load=False`

Opened this issue · 8 comments

The main purpose of NDArrayType to to allow "lazy loading" of array data. When this feature is enabled, the converter "realizes" the underlying array but still returns a NDArrayType:

if not ctx._blocks._lazy_load:
instance._make_array()
return instance

Investigate if returning the ndarray (instead of the NDArrayType) causes any downstream issues.

+1, I have a use case where this is important for validation on read:

arr = np.arange(10)
af = asdf.AsdfFile({"array": np.arange(10)})
af.write_to("array.asdf")

with asdf.open("array.asdf") as af:
    assert isinstance(af['array'], type(arr)), f"{type(af['array'])} is not {type(arr)}"

# AssertionError: <class 'asdf.tags.core.ndarray.NDArrayType'> is not <class 'numpy.ndarray'>

Although I would not implement validation like this, this is typically the assumption of serializers (e.g., Pydantic).

Would a type union work?

Yes it would, but it exposes NDArrayType specifics to the model unnecessarily. When I write ASDF model I really only want to think about !ndarray-1.0.0 is the same thing as np.ndarray.

If there are other lazy types, it won't scale well if I have to union every lazy-able tags.

Thanks for bringing up the point about using a separate lazy type having scalability issues.

I am currently hopeful that the lazy_tree feature will allow us to remove NDArrayType (by only converting the ndarray-x.x tagged dict to a np.ndarray when it's accessed from the tree). However this might be a worse option for pydantic since any runtime "validation" will likely require accessing the tree (which will trigger the array to be loaded, losing any benefit of the "lazy tree").

Thanks for the suggestion, I agree lazy tree would work but the lazy benefits are lost with pydantic.

Much appreciate the informative response at #1789 (comment). My takeaway is Pydantic validates on instantiation, instantiation occurs on read, ndarray tagged objects gets converted to NDArrayType, and NDArrayType is not a subclass of ndarray.

If we can change the last part—NDArrayType is a subclass of ndarray— then it wouldn't matter if I had lazy loading on/off.


If NDArrayType is a subclass of ndarray, then these two scenarios shows where it is appropriate for users to think about laziness.

If written with ndarray, the class read in must be a subclass of ndarray

af = asdf.AsdfFile({"array": np.arange(10)})
af.write_to("array.asdf")

with asdf.open("array.asdf") as af:
    assert isinstance(af['array'], np.ndarray)

Users never think about laziness, could use NDArrayType as if it's ndarray.

May copy an ndarray from one tree to another, but lazy objects must be materialized before writing

copy_af = asdf.AsdfFile()

with asdf.open("array.asdf") as af:
    array = af.tree['array']   # BAD is NDArrayType causes OSError: ASDF file has already been closed
    assert isinstance(af['array'], np.ndarray)

    array = af.tree['array'].copy()  # OK is ndarray
    assert isinstance(af['array'], np.ndarray)

    copy_af.tree['array'] = array

copy_af.write_to('copy_array.asdf')

Current behavior raises OSError on the write_to() method. This is problematic since the user needs to remember which lines had lazy objects. A warning should be made during copy_af.tree['array'] = array. Otherwise, users are force to use write_to() inside the original file's context.

Thanks for the suggestion.

I don't think making NDArrayType a subclass is required for returning a ndarray when lazy_load=False.

Would you give this branch a test to see if it works with your pydantic code (without any type union for NDArrayType):
https://github.com/braingram/asdf/tree/non_lazy_ndarray

It is far from "complete" as a solution and constitutes a breaking change (so wouldn't be possible until asdf 4, coming this fall at the soonest) but I would be curious to hear if it makes the pydantic integration easier.

Sorry! I did not realize I didn't update this. This change does pass this test

This test in v3 would've failed at asdf.open due to Pydantic seeing NDArrayType != ndarray.

After spending quite a bit of time in this, I'm convinced some sort of union types is needed rather than forcing users to lazy load to guarantee an ndarray is returned.

Would it be appropriate for this asdf project to maintain a list of types (e.g., SomeNameType = NDArrayType | np.ndarray)?