FastKnn

Purpose

Provides a library to build a fast kNN index and get results as a pandas dataframe. FastKnn mainly uses nmslib as its (fast) kNN backend.

Install

pip install git+https://github.com/Fanchouille/fastknn.git

Use

FastKnn builds a kNN index with the specified index_method (default: hnsw) and index_space (default: cosinesimil).

See the nmslib documentation for the available spaces and methods.

This code has been tested with the hnsw method, using the cosinesimil / l2 spaces for dense data and the cosinesimil_sparse / cosinesimil_sparse_fast spaces for sparse data.

Example with dense data:

from fastknn import FastKnn

# Create index...
fastknn = FastKnn(data, id_dict)

# Save index
fastknn.save("test_fastknn")

# ...or load if exists
fastknn = FastKnn(fastknn_folder="test_fastknn")

# Choose sample vectors
query = data[:3, :]

# Query index & get results as df
results_df = fastknn.query_as_df(query, k=10, same_ids=True, remove_identity=True)

Here, data is an m x n numpy array and id_dict is a Python dictionary mapping each integer index (0 to m-1) to a real id.
- fastknn.datautils provides methods to build data and id_dict easily from pandas dataframes

To use FastKnn in supervised mode, provide a target parameter: a Python dictionary containing the labels (classes or a quantitative target) associated with data (default: None, i.e. unsupervised mode).

Other important parameters: data_type (default: dense) and dist_type (default: float). See main.py for examples.
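
As an illustration of the inputs described above, the sketch below builds data, id_dict and a target dictionary from a pandas dataframe using plain pandas and numpy. It does not use fastknn.datautils, whose function names are not listed here. The helper frame_to_inputs and the column names (item_id, f0, f1, label) are hypothetical, and keying target by the same integer index as id_dict is an assumption to check against main.py.

```python
import numpy as np
import pandas as pd


def frame_to_inputs(df, id_col, feature_cols, target_col=None):
    """Hypothetical helper: split a dataframe into (data, id_dict, target)."""
    # m x n dense feature matrix
    data = df[feature_cols].to_numpy(dtype=np.float32)
    # Integer row position (0 to m-1) -> real id
    id_dict = dict(enumerate(df[id_col].tolist()))
    # Assumption: labels are keyed by the same integer row position
    target = None
    if target_col is not None:
        target = dict(enumerate(df[target_col].tolist()))
    return data, id_dict, target


df = pd.DataFrame(
    {
        "item_id": ["a", "b", "c"],
        "f0": [1.0, 0.0, 1.0],
        "f1": [0.0, 1.0, 1.0],
        "label": ["x", "y", "x"],
    }
)
data, id_dict, target = frame_to_inputs(df, "item_id", ["f0", "f1"], "label")
```

The resulting data and id_dict have the shapes expected by the dense example above; target would only be passed in supervised mode.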

Once instantiated, the save method writes the following files:
- mappings from integer index to real ids as a json file
- index parameters as a json file
- index as a bin file
- target dictionary as a json file

Load a saved FastKnn back by specifying fastknn_folder.
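
Since the mappings are stored as json, one detail matters if you read those files yourself: json object keys are always strings, so integer indices come back as "0", "1", and so on. The sketch below shows this round trip with the standard json module only. It illustrates a general property of json and is not a description of how FastKnn reloads its own files.

```python
import json

id_dict = {0: "a", 1: "b", 2: "c"}

# Integer keys are written as strings in the json text
serialized = json.dumps(id_dict)
loaded = json.loads(serialized)

# Convert the keys back to int to recover the original mapping
restored = {int(key): value for key, value in loaded.items()}
```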

Query a FastKnn object with the provided query_as_df method, which takes the following parameters:
- query - p x n numpy array - matrix to be matched against data
- k - integer - the number of nearest neighbours (default: 10)
- query_index - list of integers - index of the rows provided in query (default: None, which uses the row index)
- nn_column - string - name of the resulting column containing the nearest neighbours (default: nearest_neighbours)
- distance_column - string - name of the resulting column containing the distances to the nearest neighbours (default: distances)
- same_ids - bool - when querying the same data that was indexed, returns both index and real ids (default: False)
- remove_identity - bool - when querying the same data that was indexed, returns the k nearest neighbours excluding the exact self-match (default: False)
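
To make the effect of k and remove_identity concrete, here is a minimal exact (brute-force) sketch in numpy using cosine distance, i.e. 1 minus cosine similarity. It is a stand-in for the approximate index, not FastKnn's implementation: the function brute_force_knn is hypothetical, and it assumes that row i of query is row i of data when remove_identity is set.

```python
import numpy as np


def brute_force_knn(data, query, k=10, remove_identity=False):
    """Exact cosine-distance kNN over all rows of data (illustration only)."""
    # Normalise rows so that a dot product equals cosine similarity
    data_n = data / np.linalg.norm(data, axis=1, keepdims=True)
    query_n = query / np.linalg.norm(query, axis=1, keepdims=True)
    distances = 1.0 - query_n @ data_n.T  # p x m distance matrix

    if remove_identity:
        # Assumption: query row i is data row i, so mask out the self-match
        rows = np.arange(len(query))
        distances[rows, rows] = np.inf

    # Indices of the k smallest distances for each query row
    order = np.argsort(distances, axis=1, kind="stable")[:, :k]
    return order, np.take_along_axis(distances, order, axis=1)


data = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])
query = data[:2, :]

with_self, _ = brute_force_knn(data, query, k=2)
without_self, dists = brute_force_knn(data, query, k=2, remove_identity=True)
```

Without remove_identity, each query row's first neighbour is itself at distance close to 0; with it, the self-match is dropped and k other neighbours are returned.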

Get predictions from a FastKnn object with the provided prediction_as_df method, which takes the following parameters:
- query - p x n numpy array - matrix to be matched against data
- k - integer - the number of nearest neighbours (default: 10)
- query_index - list of integers - index of the rows provided in query (default: None, which uses the row index)
- same_ids - bool - when querying the same data that was indexed, returns both index and real ids (default: False)
- remove_identity - bool - when querying the same data that was indexed, returns the k nearest neighbours excluding the exact self-match (default: False)
- prediction_type - string - classification (majority voting over the k nearest neighbours) or regression (mean over the k nearest neighbours) (default: classification)
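
The two prediction types can be sketched for a single query as follows. This is a minimal illustration of majority voting and averaging over neighbour labels, assuming target maps integer indices to labels. The function predict_from_neighbours is hypothetical and is not part of FastKnn.

```python
from collections import Counter
from statistics import mean


def predict_from_neighbours(neighbour_ids, target, prediction_type="classification"):
    """Combine the labels of the k nearest neighbours into one prediction."""
    labels = [target[i] for i in neighbour_ids]
    if prediction_type == "classification":
        # Majority vote; on a tie, the label seen first wins
        return Counter(labels).most_common(1)[0][0]
    if prediction_type == "regression":
        # Unweighted mean of the neighbours' target values
        return mean(labels)
    raise ValueError(f"unknown prediction_type: {prediction_type}")


classes = {0: "cat", 1: "dog", 2: "cat"}
values = {0: 1.0, 1: 2.0, 2: 6.0}

predicted_class = predict_from_neighbours([0, 1, 2], classes)
predicted_value = predict_from_neighbours([0, 1, 2], values, "regression")
```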

Development

Clone the project.

Install the local Anaconda environment as below:

./install.sh

Activate the local Anaconda environment as below:

conda activate ${PWD}/.conda