This package provides two external LSH schemes, QALSH and QALSH+, for high-dimensional c-Approximate Nearest Neighbor (c-ANN) search under the l_p norm (0 < p <= 2). They are based on the following two papers:
- Qiang Huang, Jianlin Feng, Yikai Zhang, Qiong Fang, Wilfred Ng. Query-Aware Locality-Sensitive Hashing for Approximate Nearest Neighbor Search. Proceedings of the VLDB Endowment (PVLDB), 9(1): 1-12, 2015.
- Qiang Huang, Jianlin Feng, Qiong Fang, Wilfred Ng, Wei Wang. Query-Aware Locality-Sensitive Hashing Scheme for l_p Norm. The VLDB Journal, 26(5): 683–708, 2017.
The package requires g++ with C++11 support. To download and compile the code, type:
$ git clone https://github.com/HuangQiang/QALSH.git
$ cd QALSH
$ make
We use four real-life datasets, Sift, Gist, Trevi, and P53, for comparison. The statistics of the datasets are summarized in the following table:
Datasets | #Data Objects | #Queries | Dimensionality | Page Size | Domain Size | Data Size |
---|---|---|---|---|---|---|
Sift | 1,000,000 | 100 | 128 | 4 KB | [0, 218] | 337.8 MB |
Gist | 1,000,000 | 100 | 960 | 16 KB | [0, 14,772] | 4.0 GB |
Trevi | 100,900 | 100 | 4,096 | 64 KB | [0, 255] | 1.5 GB |
P53 | 31,159 | 100 | 5,408 | 64 KB | [0, 10,000] | 833.7 MB |
Usage: qalsh [OPTIONS]
This package supports six algorithm options (`-alg` 0 through 5) to evaluate the performance of QALSH, QALSH^+, and Linear_Scan for c-ANN search. The parameters are introduced as follows:
- `-alg`  (integer) algorithm to run (0 - 5)
- `-n`    (integer) cardinality of the dataset
- `-d`    (integer) dimensionality of the dataset and query set
- `-qn`   (integer) number of queries
- `-B`    (integer) page size in bytes
- `-leaf` (integer) leaf size of the kd-tree
- `-L`    (integer) number of projections for Drusilla Select
- `-M`    (integer) number of candidates for Drusilla Select
- `-p`    (float)   l_p norm, where 0 < p <= 2
- `-z`    (float)   symmetric factor of the p-stable distribution (-1 <= z <= 1)
- `-c`    (float)   approximation ratio for c-ANN search (c > 1)
- `-ds`   (string)  path of the dataset
- `-qs`   (string)  path of the query set
- `-ts`   (string)  path of the ground-truth set
- `-df`   (string)  data folder to store the newly formatted data
- `-of`   (string)  output folder to store the results
We provide the scripts to repeat the experiments reported in VLDBJ 2017. A quick example is shown as follows, which runs QALSH^+ and QALSH on Mnist with Euclidean distance (p = 2.0 and z = 0.0):
# QALSH^+
./qalsh -alg 1 -n 60000 -d 50 -B 4096 -leaf 4000 -L 30 -M 10 -p 2.0 -z 0.0 -c 2.0 -ds data/Mnist/Mnist.ds -df data/Mnist/ -of results2.0/Mnist/L2.0/
./qalsh -alg 2 -qn 100 -d 50 -qs data/Mnist/Mnist.q -ts data/Mnist/Mnist.gt2.0 -df data/Mnist/ -of results/Mnist/L2.0/
# QALSH
./qalsh -alg 3 -n 60000 -d 50 -B 4096 -p 2.0 -z 0.0 -c 2.0 -ds data/Mnist/Mnist.ds -df data/Mnist/ -of results2.0/Mnist/L2.0/
./qalsh -alg 4 -qn 100 -d 50 -qs data/Mnist/Mnist.q -ts data/Mnist/Mnist.gt2.0 -df data/Mnist/ -of results/Mnist/L2.0/
If you would like to run other algorithms, please check the scripts in the package for more information. When running the package, please ensure that the paths to the dataset, query set, and ground-truth set are correct. Since the package automatically creates folders for the output path, please keep the path as short as possible.
Finally, we introduce some tips for setting up the parameters, i.e., B, leaf, L, M, p, z, and c.
B is the page size in bytes, which is determined by the dimensionality d of the dataset. We use the following rules to set up B:
- (1) d < 256: we set B = 4096.
- (2) 256 ⩽ d < 512: we set B = 8192.
- (3) 512 ⩽ d < 1024: we set B = 16384.
- (4) 1024 ⩽ d < 4096: we set B = 32768.
- (5) 4096 ⩽ d < 8192: we set B = 65536.
For the case d ⩾ 8192, we can set a correspondingly larger B value following the rules above.
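As a quick illustration (a minimal sketch, not part of the QALSH source; the helper name `choose_page_size` and the value used for d ⩾ 8192 are assumptions), the rules above can be expressed as a simple lookup on d:

```cpp
#include <cstdio>

// Hypothetical helper that mirrors the page-size rules listed above.
int choose_page_size(int d) {
    if (d < 256)  return 4096;
    if (d < 512)  return 8192;
    if (d < 1024) return 16384;
    if (d < 4096) return 32768;
    if (d < 8192) return 65536;
    return 131072; // d >= 8192: a correspondingly larger page size (assumed)
}

int main() {
    // Examples matching the dataset table: Sift (d = 128) and Trevi (d = 4096).
    printf("Sift:  B = %d\n", choose_page_size(128));  // 4096  (4 KB)
    printf("Trevi: B = %d\n", choose_page_size(4096)); // 65536 (64 KB)
    return 0;
}
```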
leaf is the maximum leaf size of the kd-tree, so it should be smaller than the cardinality of the dataset, i.e., leaf < n. Let K be the number of blocks after kd-tree partitioning. Since we use the kd-tree to divide the whole dataset into blocks, K = 2^i, where i = ceil(log_2(n / leaf)).
L and M are two parameters introduced by Drusilla Select. Once leaf is determined, the actual leaf size n0 (i.e., the number of objects in each block) can be estimated as floor(n / 2^i) or ceil(n / 2^i). There are two conditions when we set up L and M:
- (1) L * M < n0. Since we run Drusilla Select for each block to select the representative objects, it is natural to restrict the number of selected objects (L * M) to be less than n0.
- (2) K * L * M ≈ n0. If the sample size (i.e., K * L * M) is large, we can accurately estimate which blocks are closer to the query, but the estimation takes a lot of extra time. If the sample size is small, the time to determine the close blocks is reduced, but those blocks may not actually be closer to the query than others. This condition is based on our observation: according to our experiments, creating a sample set with cardinality similar to n0 achieves a good trade-off (see the sketch below).
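As a sanity check (a minimal sketch, not taken from the QALSH source), the following snippet verifies both conditions for the Mnist example used above (n = 60000, leaf = 4000, L = 30, M = 10):

```cpp
#include <cmath>
#include <cstdio>

int main() {
    int n = 60000, leaf = 4000, L = 30, M = 10;

    int i  = (int) std::ceil(std::log2((double) n / leaf)); // i  = 4
    int K  = 1 << i;                                        // K  = 2^i = 16 blocks
    int n0 = n / K;                                         // n0 = 3750 objects per block

    printf("K = %d, n0 = %d\n", K, n0);
    printf("(1) L * M = %d (< n0 = %d)\n", L * M, n0);                // 300 < 3750
    printf("(2) K * L * M = %d (close to n0 = %d)\n", K * L * M, n0); // 4800 vs. 3750
    return 0;
}
```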
p and z determine the distance metric and the corresponding p-stable distribution. There are three common settings:
- (1) Euclidean distance (l_2 distance): we set p = 2.0 and z = 0.0, and apply the standard Gaussian distribution.
- (2) Manhattan distance (l_1 distance): we set p = 1.0 and z = 0.0, and apply the standard Cauchy distribution.
- (3) l_0.5 distance: we set p = 0.5 and z = 1.0, and apply the standard Levy distribution.
In addition, for other l_p distances, users can set any 0 < p ⩽ 2 and -1 ⩽ z ⩽ 1.
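For reference (a minimal sketch, not taken from the QALSH source), the three common settings correspond to samplers that are readily available in the C++ `<random>` library; a standard Levy variate can be drawn as 1 / Z^2 with Z ~ N(0, 1):

```cpp
#include <cstdio>
#include <random>

int main() {
    std::mt19937 gen(42);

    // p = 2.0, z = 0.0: standard Gaussian distribution (Euclidean / l2 distance).
    std::normal_distribution<double> gaussian(0.0, 1.0);

    // p = 1.0, z = 0.0: standard Cauchy distribution (Manhattan / l1 distance).
    std::cauchy_distribution<double> cauchy(0.0, 1.0);

    // p = 0.5, z = 1.0: standard Levy distribution (l0.5 distance),
    // sampled here as 1 / Z^2 with Z drawn from the standard Gaussian.
    double z = gaussian(gen);
    double levy = 1.0 / (z * z);

    printf("gaussian = %f, cauchy = %f, levy = %f\n", gaussian(gen), cauchy(gen), levy);
    return 0;
}
```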
c is the approximation ratio for c-k-ANN search. We often set c = 2.0. For datasets that are easy to search, c = 3.0 also gives satisfactory results.
If you use this package for publications, please cite the papers as follows.
@article{huang2017query,
  title={Query-aware locality-sensitive hashing scheme for $l_p$ norm},
  author={Huang, Qiang and Feng, Jianlin and Fang, Qiong and Ng, Wilfred and Wang, Wei},
  journal={The VLDB Journal},
  volume={26},
  number={5},
  pages={683--708},
  year={2017},
  publisher={Springer}
}
@article{huang2015query,
  title={Query-aware locality-sensitive hashing for approximate nearest neighbor search},
  author={Huang, Qiang and Feng, Jianlin and Zhang, Yikai and Fang, Qiong and Ng, Wilfred},
  journal={Proceedings of the VLDB Endowment},
  volume={9},
  number={1},
  pages={1--12},
  year={2015},
  publisher={VLDB Endowment}
}