Final Project SDSC3001 Big Data: The Art and Science of Scaling
- Lecturer: Dr. YANG Yu
- City University of Hong Kong
This project was created by:
- ABDINEGARA, Beatrice
- GUSTAF, Benedict Ronaldo
- HA, Quang Minh
- HALIM, Seivabel Jessica
- YENNOTO, Keane Dylan
This repository contains a series of Jupyter Notebooks and supporting Python scripts for experimenting with Hierarchical Navigable Small World (HNSW) graphs for Approximate Nearest Neighbors (ANN) search. The focus is on building, analyzing, tuning, and testing HNSW-based models to understand their performance in terms of recall and query time comparing to other algorithms for ANN.
To set up the environment for running these experiments, follow these steps:
- Install the required packages: pip install -r requirements.txt
This notebook is the starting point of the experiments. It includes the initial setup, building of the HNSW graph with other algorithms including Brute Force, Annoy, and FLANN. Building time, query time, and recall are measured as performance metric.
In this notebook, we analyze the recall and query time performance of the different algorithms. We also explore the impact of different values of k
(number of nearest neighbors).
This notebook is dedicated to tuning the parameters of the HNSW model. It uses the config.yml
file for parameter configurations.
After tuning the parameters, this notebook tests the performance of the tuned HNSW model, focusing on recall and query time.
-
Configuration (
config.yml
): This YAML file contains the configuration parameters used for tuning the HNSW model in3-tuning_parameter_hnsw.ipynb
. -
Experiment Results (
exp_result.py
): File to store the experiment results. -
Helpers (
helpers.py
): This script includes helper functions used across different notebooks for various tasks like data loading, preprocessing, and functions for building and running the algorithms.
Reference: https://github.com/erikbern/ann-benchmarks/tree/main