GUDHI/gudhi-devel

[Representations Module] New `__call__` strategy does not retrieve attributes

astamm opened this issue · 6 comments

astamm commented

If I understand correctly, the new strategy of the `__call__` method in the representations module is to clone the object and call fit_transform() on the clone. This effectively leaves the attributes of the main object unmodified.

However, it also means that we do not have access to these attributes. For example, for Betti curves we get the y-coordinates of the curve but there is no way to retrieve the grid on which they were computed, unless we run the fit() method on the main object and then access grid_, which is precisely what we wanted to avoid in the first place.

astamm commented

Small reprex:

```python
from gudhi.representations import BettiCurve
import numpy as np

dgm = np.array([[1, 4], [2, 5]])

bc = BettiCurve()
bc(dgm)     # returns the y-coordinates of the Betti curve
bc.grid_    # AttributeError: __call__ fit a clone, so grid_ was never set on bc
```

fit and transform are the main interface, with __call__ only a shortcut for the cases where it suffices, so I don't find it too problematic if some use cases cannot be satisfied without fit.
I guess the opposite strategy would be to store a flag recording whether fit has been explicitly called, and if it hasn't, have __call__ overwrite grid_ (so we no longer work on a clone). That seems doable, but potentially confusing.
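A rough sketch of what that flag-based behaviour could look like, using a minimal toy class with invented names (this is not the GUDHI implementation):

```python
import numpy as np

class TinyBettiCurve:
    """Toy stand-in for a representations class, only to illustrate the flag idea."""

    def __init__(self, resolution=10):
        self.resolution = resolution

    def fit(self, X, y=None):
        births, deaths = np.concatenate(X).T
        self.grid_ = np.linspace(births.min(), deaths.max(), self.resolution)
        self._explicit_fit = True
        return self

    def transform(self, X):
        # Betti number at t = number of intervals [b, d) containing t
        return np.array([[np.sum((d[:, 0] <= t) & (t < d[:, 1])) for t in self.grid_]
                         for d in X])

    def __call__(self, diag):
        if getattr(self, "_explicit_fit", False):
            # fit() was called explicitly: reuse that state, do not overwrite it
            return self.transform([diag])[0]
        # fit() was never called explicitly: fit in place so grid_ ends up on
        # this object, then clear the flag again
        self.fit([diag])
        self._explicit_fit = False
        return self.transform([diag])[0]
```

With this behaviour, a reprex-style call sequence would find grid_ populated after __call__, while an explicit fit on a training set would still take precedence.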
Do you have suggestions on a good behavior?
(adding @MathieuCarriere to the conversation)

astamm commented

Well, in my opinion, when I call __call__() on a single diagram from a class in the {representations} module, I want it to return all the information I need about the selected representation. That means both the y-coordinates (the values representing the diagram in that representation) and the grid on which those values were computed. Currently, we are missing this latter information. I guess it does not matter for feeding machine learning pipelines, because the current behaviour produces the y-coordinates which are then used as is in feature space. But we might want to do other statistical analyses on the functional representations, such as selecting the sub-domains where differences between groups lie, in which case it is of primary importance to also have the domain on which the representation is defined.

More generally, I guess in my head it depends on what a class is expected to do. In the case of the {representations} module, the classes are supposed to provide representations of diagrams. A functional representation is made of two things: x- and y-coordinates. So I think __call__() should return both, and so should transform() and fit_transform(). In the latter two cases, this is currently achieved by updating the attribute grid_ or some other attribute (im_range_fixed_, etc.) which can then be accessed to actually generate the grid. But different representations update different attributes, and forming the grid has to be done in a different way for each of them.

I would be in favour of each class systematically returning x- and y-coordinates (for vector methods) as part of the output object, maybe as an np.array of shape 2 x numPoints.

astamm commented

Hence __call__() would output an np.array of shape 2 x numPoints, while transform() and fit_transform() would output an np.array of shape numDiagrams x 2 x numPoints, for example.
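For illustration, here is how those shapes could be assembled by hand with the current interface, calling fit explicitly so that grid_ is available and assuming grid_ has the same length as each transformed vector; the stacked arrays are the proposal, not something GUDHI returns today:

```python
import numpy as np
from gudhi.representations import BettiCurve

dgms = [np.array([[1, 4], [2, 5]]), np.array([[0, 3], [1, 6]])]

bc = BettiCurve()
ys = bc.fit_transform(dgms)                      # shape (numDiagrams, numPoints)
xs = np.broadcast_to(bc.grid_, ys.shape)         # the common grid, repeated per diagram
stacked = np.stack([xs, ys], axis=1)             # proposed shape (numDiagrams, 2, numPoints)
single = np.stack([bc.grid_, bc.transform([dgms[0]])[0]])  # proposed (2, numPoints) output
```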

The point of the classes in vector_methods.py is to produce vectors, i.e. objects that can be added meaningfully, because many stat/ML algorithms need to work in a vector space. For example, the vector space could be a space of functions from some X to ℝ (X can be ℝ⨯ℕ for landscapes, ℝ² for persistence images, etc.), and to be able to represent it, we would discretize X and keep only functions from a grid/sample of X to ℝ. Computing [f(x₀)+g(x₀), f(x₁)+g(x₁), ...] is meaningful: it is the discretization of f+g on the same grid. However, computing xₙ+xₙ (adding the grid points themselves) does not make much sense. Since the main point of having a function called transform is to plug it into scikit-learn, returning the x-coordinates as well would force most users to add a layer that removes them.
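A minimal numerical illustration of that point, assuming BettiCurve's resolution and sample_range parameters so that both curves live on the same fixed grid:

```python
import numpy as np
from gudhi.representations import BettiCurve

# Two diagrams vectorised on a common 10-point grid over [0, 6].
bc = BettiCurve(resolution=10, sample_range=[0, 6])
ys = bc.fit_transform([np.array([[1, 4], [2, 5]]),
                       np.array([[0, 3], [1, 6]])])

mean_curve = ys.mean(axis=0)   # meaningful: the pointwise mean of the two Betti curves
# Adding or averaging the grid points themselves, by contrast, carries no information.
```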
For transform, the grid is logically part of the input, not the output; it would be strange to copy the input into every output. The grid may be part of the output of fit, or it may be a parameter of the class.
Having the x-coordinates (for plotting?) is a side functionality, provided because the information is there, so it would be sad not to make it available. Some functionality, like computing the exact Betti curve, was added because it shared a lot of code; but since it isn't the main objective, it is not surprising if the interface is slightly awkward, as long as it is clear enough.
If we wanted BettiCurve.transform to output a list of (n,2)-shaped arrays (n depends on the diagram, so we cannot output one 3-d array without padding), it would have to be a non-default option, say BettiCurve(output_type="coordinates"). Or it could be a function with a name other than transform (and probably other than __call__, to avoid gratuitously breaking things).
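To make the "extra layer" mentioned above concrete, here is what stripping hypothetical (grid, values) outputs before a scikit-learn estimator could look like; the (2, numPoints) arrays below are written out by hand, since no GUDHI class currently returns this format:

```python
import numpy as np
from sklearn.preprocessing import FunctionTransformer

# Hypothetical per-diagram outputs of shape (2, numPoints): row 0 is the grid,
# row 1 the values. These are dummy numbers, not produced by GUDHI.
xy_per_diagram = [np.stack([np.linspace(0, 6, 5), np.array([0., 1., 2., 1., 0.])]),
                  np.stack([np.linspace(0, 6, 5), np.array([1., 2., 1., 1., 0.])])]

# The extra layer most users would need in front of any estimator:
keep_values_only = FunctionTransformer(lambda XY: np.asarray([xy[1] for xy in XY]))
features = keep_values_only.fit_transform(xy_per_diagram)   # shape (numDiagrams, numPoints)
```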

astamm commented

That is indeed what I thought. Everything is done in the spirit of machine learning: the aim is to transform a complex object (here, a diagram) into a vector of features to be fed to some ML pipeline.

I agree with you that storing x- and y-coordinates in the same numpy array would not make much sense, since adding two functions (hence their y-coordinates) is meaningful but adding grid points is not. However, adding two vectors that represent functions evaluated on different grids can be misleading. Hence, if the desired representation of a persistence diagram is to be a functional one, it also makes sense to keep track of the grid in some way. This is currently the case for fit() and fit_transform(), which update some attributes from which the grid can be inferred, but not for __call__().

I guess your proposal makes sense: have these methods produce a vector of features by default, so that they are readily pluggable into scikit-learn pipelines, but add the option to output the full functional representation as a pandas DataFrame or another suitable data structure.
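For instance, such a full representation for a single diagram might look like the following (purely illustrative; the grid and values are written by hand, not produced by GUDHI today):

```python
import numpy as np
import pandas as pd

# Betti curve of the reprex diagram [[1, 4], [2, 5]], evaluated by hand
# on a 5-point grid over [0, 6].
grid = np.linspace(0.0, 6.0, 5)        # x-coordinates: [0, 1.5, 3, 4.5, 6]
values = np.array([0, 1, 2, 1, 0])     # y-coordinates: Betti numbers at those points
representation = pd.DataFrame({"x": grid, "y": values})
```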