/ruler

Primary LanguageC++

Ruler
=====

Ruler is a toolkit for measuring distances between vectors, an operation
required by many algorithms.

For dense vectors, tools are provided for measuring sum of absolute
differences (L1) and Hamming distances, primarily because modern processors
have special instructions that can be used to measure these distances
quickly, and they are usually good surrogates for Euclidean distance.

For sparse vectors, a single tool is provided. It reads in a database of
sparse vectors stored as a list of non-zero coordinates and a "query" vector,
which is compared to each element in the database. The L1 distance between
the query vector and each element in the database is computed and written to
an output file.

Build
=====

Run "make all" to compile the programs.

Dense vectors
=============

Run "test-dense.sh" to generate a set of test vectors, then time each of the
sum of absolute distance and Hamming distance algorithms.  I.e.:

$ bash -x test-dense.sh
+ perl gentest.pl 64000
+ ./ham_naive test_1000x64_rand.in 1000 1000
Running 1000 loops
105665000
0.347653 s
+ ./ham_psadbw test_1000x64_rand.in 1000 1000
Running 1000 loops
105665000
0.044802 s
+ ./ham_popcnt_gcc test_1000x64_rand.in 1000 1000
Running 1000 loops
105665000
0.010064 s
+ ./ssd_naive test_1000x64_rand.in 1000 1000
Running 1000 loops
1013867000
0.077668 s
+ ./sad_psadbw test_1000x64_rand.in 1000 1000
Running 1000 loops
203773000
0.010926 s
+ ./sad_naive test_1000x64_rand.in 1000 1000
Running 1000 loops
203773000
0.092527 s
$ 

In this example, all of the programs are run 1,000 times on 1,000 64
dimensional vector pairs.  These vectors are generated at random using the
Perl script "gentest.pl".

Hamming distance
----------------

The first program, "ham_naive", computes the sum of Hamming distances between
the vectors.  The answer, 105665000, is available after 0.34 seconds (time
required to read in the files is discounted).  "ham_naive" computes Hamming
distance by using bit-wise XOR, then counting the 1s by AND'ing with 1 and
shifting right.

"ham_psadbw" computes Hamming distances by accumulating all the active bits
after an XOR using an algorithm that is O(log2 N), where N is the number of
total bits.  This accumulation step leaves sums in 8 bytes, which are then
summed together in a single instruction, PSADBW, which is available on
processors supporting SSE4.1.  The answer from "ham_psadbw" is available
after only 0.0448 seconds, almost an order of magnitude faster than the naive
method for computing Hamming distance.

"ham_popcnt_gcc" computes Hamming distances using XOR and POPCNT, a special
instruction available in SSE4.2 that simply counts the number of active bits
in a 64-bit word.  The answer is available here in only 0.01 second, a 4x
improvement over "ham_psadbw".

Sum of absolute differences
---------------------------

"sad_psadbw" and "sad_naive" compare L1 distance computation time with and
without the special PSADBW instruction that requires SSE4.1.  "sad_psadbw"
achieves nearly an order of magnitude speed improvement over "sad_naive", and
is almost as fast as computing Hamming distance.

Sum of squared distances
------------------------

"ssd_naive" gives the timing required for computing an L2-like distance
(without the square root), without any special speedup algorithm. "ssd_naive"
is nearly an order of magnitude slower than both "sad_psadbw" and "ham_psadbw".


Sparse vectors
==============

Run "ruler --help" to get some tips on how to call it.

Run "test-sparse.sh" to run the test case.  I.e.,

$ bash -x test-sparse.sh
+ ./ruler -n dbnames -w query.sign -D dbsign -t 1 -o dbout
dbsign file length 817450
query.sign file length 80054
Loaded database and query in 0.001026 seconds
Loaded 11 names in 3.1e-05 seconds
8894 active bins in input signature
10 signatures in db
Computed 10 distances in 0.00241 seconds
$ 

The output is then written to the file "dbout":

$ cat dbout
0023/0023030852/71DJY5APPQL.key 0
0023/0023050357/51SD5SQXR4L.key 48606
0023/0023321318/41ZYGDDP04L.key 48722
0023/0023364505/51J646CTK8L.key 48478
0023/0023548568/51QS214NVQL.key 49506
0023/0023557222/41QNMWZBBBL.key 48360
0023/0023611413/21VPC0VZAZL.key 50368
0023/0023745452/41YR6DA6QFL.key 50090
0023/0023767863/414GYDRKN7L.key 50080
0023/0023887419/515HXHXZWKL.key 49282
$ 

In this case, the first vector in the database is identical to the query
vector, so the distance is zero.

The format of the database file is:

[bytes  0- 7] 64-bit integer designating the number of sparse vectors in
              the database

[bytes  8-15] 64-bit integer designating the number of non-zero coordinates
              in the first vector

[bytes 16-23] 64-bit integer designating the number of the first non-zero
              coordinate in the first vector

[byte  24   ] value of the first non-zero coordinate in the first vector

... (etc)

The format of the query vector is similar.

The program allows you to quickly measure one-to-many distances between sparse
vectors, implicitly skipping over all of the coordinates with value zero.