Code review for `FilteredDatapackage`
cmutel opened this issue · 2 comments
I would be very happy to have some help making sure my code is doing what I think it is. In 2.5, we use data packages to store processed arrays and metadata defining generic interfaces to external data sources which return processed arrays. These data packages can (in theory) contain data for multiple matrices, or multiple data resources for the same matrix.
To dispatch data to the correct matrix builders, we use an object called `FilteredDatapackage`. These objects are created by one and only one method, `filter_by_attribute`. In order for the code flow to work correctly and not use too much memory, we need `FilteredDatapackage` to avoid copies wherever possible.
This is where I need help. I think that `filter_by_attribute` creates a "shallow" copy, e.g. while `.resources` is a new object (a list), the objects in that list are the same as in the parent `Datapackage`. But I am not 100% sure, and whether Numpy creates a view or a copy is not always clear to me. I also don't know how to write tests to ensure my assumptions are correct (of course, one could iterate and check the `id()` of objects, but is there something else? Maybe also check memory usage?).
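One way to test both assumptions (shared objects and no extra memory) is sketched below. `ToyPackage` is a hypothetical stand-in written for this example, not the real `Datapackage` or `FilteredDatapackage`, and its `filter_by_attribute` only imitates the shallow-copy behaviour described above. The sketch uses `is` for identity checks and the standard library's `tracemalloc` to measure what the filtering step allocates.

```python
import tracemalloc


class ToyPackage:
    """Hypothetical stand-in for a data package; not the real class."""

    def __init__(self, resources, data):
        self.resources = resources  # list of metadata dicts
        self.data = data            # list of payload objects, same order

    def filter_by_attribute(self, key, value):
        # Build new lists, but reuse the very same dict and payload objects.
        keep = [i for i, r in enumerate(self.resources) if r.get(key) == value]
        return ToyPackage(
            [self.resources[i] for i in keep],
            [self.data[i] for i in keep],
        )


# Two 10 MB payloads stand in for large processed arrays.
parent = ToyPackage(
    resources=[
        {"name": "a", "matrix": "technosphere"},
        {"name": "b", "matrix": "biosphere"},
    ],
    data=[bytearray(10_000_000), bytearray(10_000_000)],
)

# Measure only the memory allocated by the filtering step.
tracemalloc.start()
before, _ = tracemalloc.get_traced_memory()
child = parent.filter_by_attribute("matrix", "technosphere")
after, _ = tracemalloc.get_traced_memory()
tracemalloc.stop()

# The containers are new objects...
containers_are_new = child.resources is not parent.resources
# ...but their elements are the same objects as in the parent.
elements_are_shared = (
    child.resources[0] is parent.resources[0]
    and child.data[0] is parent.data[0]
)
# Bytes allocated while filtering; a copy would cost about 10 MB.
allocated = after - before
```

In a test suite, the three results would become assertions, with a threshold on `allocated` that is well below the size of one payload. To my knowledge `tracemalloc` also tracks NumPy array allocations in recent NumPy versions, but that is worth confirming against the version in use.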
In the code review, you may notice that `get_resource` uses a cache, and that this cache would not be shared across instances of `FilteredDatapackage`. This is OK, as each data resource (the actual underlying numpy array, which can in theory be very large) would only ever be loaded once, by the matrix constructor for that particular matrix.
Can't help much here but I can point to this discussion, on how to tell whether a Numpy array is a view of another one, or whether it owns its own data: https://stackoverflow.com/questions/11524664/how-can-i-tell-if-numpy-creates-a-view-or-a-copy
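As a rough illustration of the kind of checks that can distinguish a view from a copy, here is a minimal sketch using only standard NumPy calls (`np.shares_memory`, the `.base` attribute, and `.flags.owndata`). The array names are made up for the example.

```python
import numpy as np

original = np.arange(10)

view = original[2:8]         # basic slicing returns a view
fancy = original[[2, 3, 4]]  # integer-array indexing returns a copy
explicit = original.copy()   # explicit copy

# Most direct check: do the two arrays overlap in memory?
view_shares = np.shares_memory(original, view)
fancy_shares = np.shares_memory(original, fancy)

# A view records the array it was derived from in .base.
view_base_is_original = view.base is original

# owndata is False for a view and True for an explicit copy.
view_owns_data = view.flags.owndata
copy_owns_data = explicit.flags.owndata

# Writing through a view changes the parent array.
view[0] = 99
parent_changed = original[2] == 99
```

Note that `owndata` being `False` only says the array borrows its memory from somewhere; `np.shares_memory` is the check that ties it to a specific parent array.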
Thanks @romainsacchi. Still need to fix #9, but the basics are now there.