mmp2/megaman

Geometry

Closed this issue · 9 comments

The handling of the Geometry object within the embeddings needs some work & some consistency. I've done a bit of tweaking in 3fcc3d8, but there are still problems. In particular:

  • we need a way to instantiate the estimator without passing the data.
  • we should be able to pass a dictionary of geometry properties rather than a geometry object
  • each embedding algorithm seems to treat missing geometries differently. Some call fit_geometry, some use a manual default, and some fail (I've fixed some of this in 3fcc3d8)

@jakevdp how should the init() geometry function handle the **kwargs for the adjacency, affinity & laplacian methods? e.g

def init(self,
adjacency_method='auto', adjacency_ags = None,
affinity_method='auto', adjacency_ags = None,
laplacian_method='auto', adjacency_ags = None):

or

def init(self,
adjacency_method='auto', affinity_method='auto', laplacian_method='auto',
**kwargs):

Edit: we shouldn't pass a nbhd radius or affinity radius directly either since they will depend on the adjacency and affinity (for example if it's number of neighbors instead of radius).

In a similar vein with respect to the following situation:

  • Initialize geometry object, pass data matrix
  • call get_adjacency_matrix(distance_method, **kwargs)
  • do something else
  • call get_adjacency_matrix(distance_method, **kwargs) again

Since kwargs can be of an arbitrary length I don't think it makes sense to store these and then check and see if they are the same so that if someone calls get_adjacency_matrix() twice then it calculates twice -- even if they have the same arguments. We should therefore encourage users to use the attribute geom.adjacency_matrix to get the existing matrix and only call get_adjacency_matrix() only when they want to re-calculate it.

Same with affinity & Laplacian. This simplifies things I think.

My thought would be:

def init(self, adjacency_method='auto', adjacency_kwds=None, 
         affinity_method='auto', adjacency_kwds=None, 
         laplacian_method='auto', adjacency_kwds=None):

Regarding the computation of the matrix... how about we change get_adjacency_matrix to compute_adjacency_matrix, so that it's more explicit? Then we could use an adjacency_matrix attribute (or perhaps a property) to access the already-computed value.

I agree. One more quick questions. If I outline it like this:

def __init__(self, adjacency_method = 'auto', adjacency_kwds=None,  
                    affinity_method = 'auto', affinity_kwds=None,  
                    laplacian_method = 'auto', laplacian_kwds=None):   
        [...]    
    def compute_adjacency_matrix(self, copy = False, **kwargs):    
        [...]     
        self.adjacency_matrix = compute_adjacency_matrix(self.X, self.adjacency_method,
                                                         self.adjacency_kwds, **kwargs)    

will the self.compute_adjacency_matrix work as intended passing both the adjacency_kwds passed at initialization AND the **kwargs passed to self.compute_adjacency_matrix() itself?

No, I think you have a danger of getting a repeated argument there. I'd do something like this:

kwds = self.adjacency_kwds.copy()
kwds.update(kwargs)

then pass **kwds to the function.

(in Python 3.5 you could do kwds = {**self.adjacency_kwds, **kwargs} but we need backward compatibility for the time being 😄)

I figured as much. Thanks again Jake! Things are going well and look much cleaner. I should be able to push the pull request tonight.

Awesome! I'm working on the other parts of the refactor, and I think I'll have it done tonight as well. Tomorrow we can work on merging our two sets of changes.

@jakevdp just an FYI the following functions inside test_geometry:

test_laplacian_unknown_normalization
test_laplacian_create_A_sparse
test_equal_original

are testing affinity/laplacian methods.

re-factor done.