Enhancement request, implement fractional distance
Opened this issue · 2 comments
Hi,
I bumped into this research work https://bib.dbvis.de/uploadedFiles/155.pdf, in which the authors propose a fractional distance to deal better with the notion of near/far neighbors in high dimensional data sets. According to their results, the fractional distance can provide better outcomes compared to Euclidean (or Manhattan) distance.
So, I would like to submit this enhancement request: include fractional distance in PaCMAP package.
Kind regards,
Ivan
I think that using minkowski distance can do the job. Perhaps, a way to code this distance:
#--->
@numba.njit(fastmath= True)
def fractional_dist(p_vec, q_vec, fraction= 5.0):
result= 0.0
for isamp in range(0, p_vec.shape[0]):
if (p_vec[isamp] > q_vec[isamp]):
result += (p_vec[isamp] - q_vec[isamp]) ** fraction
else:
result += (q_vec[isamp] - p_vec[isamp]) ** fraction
dist= result ** (1 / fraction)
return dist
#--->
I found a best solution to this enhancement request, which I think it can also help PaCMAP. The solution is this package https://github.com/lmcinnes/pynndescent
According to ann-benchmarks, it is faster than annoy package. In addition to this, it supports 27 distance metrics and even, it lets you to code your own custom distance function.
To use PyNNDescent package is simple as shown below:
from pynndescent import NNDescent
tree= NNDescent(X, metric= 'minkowski', metric_kwds= {'p': 5}, n_neighbors= n_neighbors, random_state= 1969, n_jobs= cpu_count)
tree.prepare()
nbrs= np.zeros((X.shape[0], n_neighbors), dtype= np.int32)
nbrs_= tree.query(X, k= n_neighbors + 1)
nbrs= nbrs_[0][:,1:].copy()
scaled_dist= np.ones((X.shape[0], n_neighbors), dtype= np.float32)
pair_neighbors= pacmap.pacmap.sample_neighbors_pair(X, scaled_dist, nbrs, np.int32(n_neighbors))
embedding= pacmap.PaCMAP(n_dims= 2, n_neighbors= n_neighbors, MN_ratio= 0.05, FP_ratio= 10.0, lr= 1.0,
pair_neighbors= pair_neighbors, num_iters= 200, apply_pca= False)
I examined the pacmap.py file, and I think that code lines 226-238 can be easily replaced by:
tree = NNDescent(X, metric= distance, metric_kwds= {'p': 5}, n_neighbors= n_neighbors, random_state= 1969, n_jobs= cpu_count)
tree.prepare()
nbrs= np.zeros((X.shape[0], n_neighbors_extra), dtype= np.int32)
knn_distances= np.zeros((X.shape[0], n_neighbors_extra), dtype= np.float32)
nbrs_= tree.query(train_attr_cube, k= n_neighbors_extra + 1)
nbrs= nbrs_[0][:,1:].copy()
knn_distances= nbrs_p[1][:,1:].copy()
Ivan