This is the source code for the algorithm described in the paper Towards Efficient Index Construction and Approximate Nearest Neighbor Search in High-Dimensional Spaces (submitted to PVLDB 2023). We call it the LG project.
The LG project is written in C++ and can be compiled with g++ on Linux and MSVC on Windows. It uses OpenMP for parallelism.
On Windows, you can build the project in Visual Studio 2019 by importing all the files in the directory ./cppCode/LSH-APG/src/.
cd ./cppCode/LSH-APG
make
The executable file, called lgo, is then in the dbLSH directory.
lgo datasetName
(the first parameter specifies the procedure to be executed and can be changed)
- datasetName : dataset name
For example, you can run the following commands on the command line after building all the tools:
cd ./cppCode/LSH-APG
./lgo audio
In our project, the format of the input file (such as audio.data_new, which stores float data) is the same as that in LSHBOX. It is a binary file organized in the following format:
{Bytes of the data type (int)} {The number of vectors (int)} {The dimension of the vectors (int)} {All of the vectors in binary form, stored one after another (float)}
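The layout above can be read back with a few lines of C++. The following is a minimal sketch, not the loader used by the LG project: the `Dataset` struct and the `loadDataNew` function are hypothetical names, and the code assumes 4-byte header integers, float payload, and the machine's native byte order.

```cpp
#include <cstddef>
#include <cstdint>
#include <fstream>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical container for a loaded dataset (not a type from the LG code base).
struct Dataset {
    int32_t n = 0;            // number of vectors
    int32_t dim = 0;          // dimension of each vector
    std::vector<float> data;  // n * dim values, one vector after another
};

// Reads a file laid out as:
//   [bytes per element][number of vectors][dimension][n * dim floats]
// Assumption: header fields are 4-byte ints in native byte order.
Dataset loadDataNew(const std::string& path) {
    std::ifstream in(path, std::ios::binary);
    if (!in) throw std::runtime_error("cannot open " + path);

    int32_t header[3] = {0, 0, 0};
    in.read(reinterpret_cast<char*>(header), sizeof(header));
    if (!in) throw std::runtime_error("truncated header in " + path);
    if (header[0] != static_cast<int32_t>(sizeof(float)))
        throw std::runtime_error("unexpected element size; this sketch handles float only");
    if (header[1] < 0 || header[2] < 0)
        throw std::runtime_error("negative size or dimension in " + path);

    Dataset ds;
    ds.n = header[1];
    ds.dim = header[2];
    ds.data.resize(static_cast<std::size_t>(ds.n) * static_cast<std::size_t>(ds.dim));
    in.read(reinterpret_cast<char*>(ds.data.data()),
            static_cast<std::streamsize>(ds.data.size() * sizeof(float)));
    if (!in) throw std::runtime_error("truncated vector data in " + path);
    return ds;
}
```

With this sketch, the j-th coordinate of the i-th vector is `ds.data[i * ds.dim + j]`.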
For your own application, you should also transform your dataset into this binary format, then rename it to [datasetName].data_new and put it in the directory ./dataset.
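As an illustration of this conversion step, the sketch below writes in-memory vectors to a file in the binary format described above. It is not part of the LG project: `writeDataNew` is a hypothetical helper, and the code assumes 4-byte header integers, float values, and native byte order.

```cpp
#include <cstddef>
#include <cstdint>
#include <fstream>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical helper: writes `vectors` (all of equal dimension) to `path` as
//   [bytes per element][number of vectors][dimension][vectors, one after another]
// Assumption: header fields are 4-byte ints in native byte order.
void writeDataNew(const std::string& path,
                  const std::vector<std::vector<float>>& vectors) {
    if (vectors.empty()) throw std::invalid_argument("no vectors to write");
    const std::size_t dim = vectors.front().size();
    for (const auto& v : vectors) {
        if (v.size() != dim) throw std::invalid_argument("inconsistent vector dimension");
    }

    std::ofstream out(path, std::ios::binary);
    if (!out) throw std::runtime_error("cannot open " + path);

    const int32_t header[3] = {
        static_cast<int32_t>(sizeof(float)),    // bytes of the data type
        static_cast<int32_t>(vectors.size()),   // number of vectors
        static_cast<int32_t>(dim)               // dimension of the vectors
    };
    out.write(reinterpret_cast<const char*>(header), sizeof(header));

    for (const auto& v : vectors) {
        out.write(reinterpret_cast<const char*>(v.data()),
                  static_cast<std::streamsize>(dim * sizeof(float)));
    }
    if (!out) throw std::runtime_error("write failed for " + path);
}
```

For example, calling `writeDataNew("./dataset/myset.data_new", vectors)` (where `myset` is a placeholder dataset name) would produce a file that can then be passed to the program as `./lgo myset`.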
A sample dataset, audio.data_new, has been put in the directory ./dataset.
You can also download it as audio.data from here (if so, rename it to audio.data_new). If the link is invalid, you can also get it from data.
For the datasets we use, you can get the raw data from the following links: MNIST, Deep1M, GIST, TinyImages80M, SIFT. Next, you should transform the raw dataset into the binary format described above, then rename it to [datasetName].data_new and put it in the directory ./dataset.