YingfanWang/PaCMAP

Enhancement request, implement fractional distance

Opened this issue · 2 comments

Hi,

I bumped into this research work https://bib.dbvis.de/uploadedFiles/155.pdf, in which the authors propose a fractional distance to deal better with the notion of near/far neighbors in high dimensional data sets. According to their results, the fractional distance can provide better outcomes compared to Euclidean (or Manhattan) distance.

So, I would like to submit this enhancement request: include fractional distance in PaCMAP package.

Kind regards,

Ivan

I think that using minkowski distance can do the job. Perhaps, a way to code this distance:
#--->
@numba.njit(fastmath= True)
def fractional_dist(p_vec, q_vec, fraction= 5.0):
result= 0.0

    for isamp in range(0, p_vec.shape[0]):
        if (p_vec[isamp] > q_vec[isamp]):
            result += (p_vec[isamp] - q_vec[isamp]) ** fraction
            
        else:
            result += (q_vec[isamp] - p_vec[isamp]) ** fraction
    
    dist= result ** (1 / fraction)
    
    return dist

#--->

I found a best solution to this enhancement request, which I think it can also help PaCMAP. The solution is this package https://github.com/lmcinnes/pynndescent

According to ann-benchmarks, it is faster than annoy package. In addition to this, it supports 27 distance metrics and even, it lets you to code your own custom distance function.

To use PyNNDescent package is simple as shown below:

from pynndescent import NNDescent

tree= NNDescent(X, metric= 'minkowski', metric_kwds= {'p': 5}, n_neighbors= n_neighbors, random_state= 1969, n_jobs= cpu_count)
tree.prepare()

nbrs= np.zeros((X.shape[0], n_neighbors), dtype= np.int32)
nbrs_= tree.query(X, k= n_neighbors + 1)
nbrs= nbrs_[0][:,1:].copy()

scaled_dist= np.ones((X.shape[0], n_neighbors), dtype= np.float32)
pair_neighbors= pacmap.pacmap.sample_neighbors_pair(X, scaled_dist, nbrs, np.int32(n_neighbors))

embedding= pacmap.PaCMAP(n_dims= 2, n_neighbors= n_neighbors, MN_ratio= 0.05, FP_ratio= 10.0, lr= 1.0,
pair_neighbors= pair_neighbors, num_iters= 200, apply_pca= False)

I examined the pacmap.py file, and I think that code lines 226-238 can be easily replaced by:

tree = NNDescent(X, metric= distance, metric_kwds= {'p': 5}, n_neighbors= n_neighbors, random_state= 1969, n_jobs= cpu_count)
tree.prepare()

nbrs= np.zeros((X.shape[0], n_neighbors_extra), dtype= np.int32)
knn_distances= np.zeros((X.shape[0], n_neighbors_extra), dtype= np.float32)

nbrs_= tree.query(train_attr_cube, k= n_neighbors_extra + 1)

nbrs= nbrs_[0][:,1:].copy()
knn_distances= nbrs_p[1][:,1:].copy()

Ivan