FastKnn

Purpose

Provide a library to create a fast kNN index and get results as a pandas DataFrame. FastKnn mainly uses nmslib as its (fast) kNN backend.

Install

pip install git+https://github.com/Fanchouille/fastknn.git

Use

FastKnn builds a kNN index with the specified index_method (default: hnsw) and index_space (default: cosinesimil).

  • See here for different spaces
  • See here for different methods

This code has been tested with the hnsw method, using the cosinesimil / l2 spaces for dense data and the cosinesimil_sparse / cosinesimil_sparse_fast spaces for sparse data.
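The two dense spaces differ in what "near" means. The sketch below computes both distances with plain numpy. It assumes nmslib's convention that cosinesimil reports 1 minus the cosine similarity and that l2 is the Euclidean distance; the function names are illustrative and are not part of FastKnn or nmslib.

```python
import numpy as np

def cosine_distance(x, y):
    """1 - cosine similarity (assumed to match the cosinesimil space)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    return 1.0 - float(x @ y) / float(np.linalg.norm(x) * np.linalg.norm(y))

def l2_distance(x, y):
    """Euclidean distance (assumed to match the l2 space)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    return float(np.linalg.norm(x - y))

a = np.array([1.0, 0.0])
b = np.array([10.0, 0.0])  # same direction as a, different length
c = np.array([0.0, 1.0])   # orthogonal to a

print(cosine_distance(a, b), l2_distance(a, b))
print(cosine_distance(a, c), l2_distance(a, c))
```

Cosine distance ignores vector length, so a and b count as identical, while l2 separates them. If magnitudes carry meaning in your data, l2 may be the better fit.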

Example with dense data:

from fastknn import FastKnn

# Create index...
fastknn = FastKnn(data, id_dict)

# Save index
fastknn.save("test_fastknn")

# ...or load if exists
fastknn = FastKnn(fastknn_folder="test_fastknn")

# Choose sample vectors
query = data[:3, :]

# Query index & get results as df
results_df = fastknn.query_as_df(query, k=10, same_ids=True, remove_identity=True)

  • Here data is an m x n numpy array and id_dict is a Python dictionary mapping each integer index (0 to m-1) to a real id

    • fastknn.datautils provides methods to get data and id_dict easily from pandas dataframes
  • To use FastKnn in supervised mode, provide a target parameter: a Python dictionary containing the labels (classes or numeric targets) associated with data (default: None, i.e. unsupervised mode)

  • Other important parameters: data_type (default: dense) and dist_type (default: float) - see main.py for examples

  • Once instantiated, the save method writes the following files:

    • mappings from integer index to real ids as a json file
    • index parameters as a json file
    • index as a bin file
    • target dictionary as a json file
  • Get a saved FastKnn back by specifying fastknn_folder

  • Query a FastKnn object using the provided query_as_df method with the following parameters:

    • query - p x n numpy array - matrix to be matched to data
    • k - integer - the number of nearest neighbours (default: 10)
    • query_index - list of integers - index of the data provided in query (default: None, the row index is used)
    • nn_column - string - name of the resulting column containing the nearest neighbours (default: nearest_neighbours)
    • distance_column - string - name of the resulting column containing the distances to the nearest neighbours (default: distances)
    • same_ids - bool - when querying the same data that was indexed, gets index + real ids (default: False)
    • remove_identity - bool - when querying the same data that was indexed, gets the k nearest neighbours without the perfect identity match (default: False)
  • Get predictions from a FastKnn object using the provided prediction_as_df method with the following parameters:

    • query - p x n numpy array - matrix to be matched to data
    • k - integer - the number of nearest neighbours (default: 10)
    • query_index - list of integers - index of the data provided in query (default: None, the row index is used)
    • same_ids - bool - when querying the same data that was indexed, gets index + real ids (default: False)
    • remove_identity - bool - when querying the same data that was indexed, gets the k nearest neighbours without the perfect identity match (default: False)
    • prediction_type - string - classification (majority voting on the k nearest neighbours) or regression (mean on the k nearest neighbours) (default: classification)
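
The meaning of remove_identity and prediction_type can be illustrated without an index. The following is a minimal brute-force sketch in numpy and pandas, not FastKnn's implementation. It builds data and id_dict from a DataFrame by hand, finds exact cosine neighbours, drops the identity match, and then applies majority voting or the mean. All names (brute_force_knn, predict, the column names and values) are hypothetical, and target is assumed to map each integer index to its label.

```python
from collections import Counter

import numpy as np
import pandas as pd

# Toy frame with made-up values: a real id, two features, a class and a quantity.
df = pd.DataFrame({
    "item_id": ["a", "b", "c", "d", "e", "f"],
    "f1": [1.0, 0.9, 0.8, 0.0, 0.1, 0.2],
    "f2": [0.0, 0.1, 0.2, 1.0, 0.9, 0.8],
    "label": ["x", "x", "x", "y", "y", "y"],
    "value": [1.0, 2.0, 3.0, 10.0, 20.0, 30.0],
})

data = df[["f1", "f2"]].to_numpy()           # m x n matrix
id_dict = dict(enumerate(df["item_id"]))     # integer index -> real id
class_target = dict(enumerate(df["label"]))  # integer index -> class (assumed layout)
value_target = dict(enumerate(df["value"]))  # integer index -> quantity (assumed layout)

def brute_force_knn(data, query, k, remove_identity=False):
    """Exact cosine-distance neighbours; returns (indices, distances), each p x k."""
    dn = data / np.linalg.norm(data, axis=1, keepdims=True)
    qn = query / np.linalg.norm(query, axis=1, keepdims=True)
    dist = 1.0 - qn @ dn.T                   # p x m distance matrix
    if remove_identity:
        # Assumes query row i is data row i, so the identity match is the diagonal.
        rows = np.arange(len(query))
        dist[rows, rows] = np.inf
    idx = np.argsort(dist, axis=1)[:, :k]
    return idx, np.take_along_axis(dist, idx, axis=1)

def predict(neighbour_idx, target, prediction_type="classification"):
    """Majority vote (classification) or mean (regression) over each row of neighbours."""
    out = []
    for row in neighbour_idx:
        values = [target[int(i)] for i in row]
        if prediction_type == "classification":
            out.append(Counter(values).most_common(1)[0][0])
        else:
            out.append(float(np.mean(values)))
    return out

idx, dist = brute_force_knn(data, data, k=2, remove_identity=True)
neighbours = [[id_dict[int(i)] for i in row] for row in idx]
labels = predict(idx, class_target, "classification")
values = predict(idx, value_target, "regression")

print(neighbours)
print(labels)
print(values)
```

With remove_identity left at False, each row's first neighbour would be the row itself at distance close to 0, which is why the flag matters when the query is the indexed data.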

Development

Clone the project.

Install the local Anaconda environment:

./install.sh

Activate the local Anaconda environment:

conda activate ${PWD}/.conda