Kibbitz prefix server

Auto-complete index-builder for search.

The prefix server responds to autocomplete queries with a requested number of
suggestions, based on the provided index file (see Index below).


Scoring Algorithm

This scoring algorithm gives preference to completions seen more frequently
in the index, and also to completions with more terms:

    score = log(terms in completion + 1) * log(document frequency + 1)
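
A minimal sketch of that formula in Java (the method name and the
whitespace-based term count are assumptions, not taken from the actual source):

    // Illustrative sketch of the scoring formula described above.
    static double score(String completion, int documentFrequency) {
        int termCount = completion.trim().split("\\s+").length;  // terms in the completion
        return Math.log(termCount + 1) * Math.log(documentFrequency + 1);
    }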


Responsiveness

Time taken is proportional to the number of matches found
(e.g. "home": 327ms vs. "pizz": 13ms vs. "uha": 2ms).  The sample-queries file
contains about 1MM queries from the AOL dataset, each truncated at a random point
anywhere from its full length down to 3 characters (1-2 character queries were
omitted for performance reasons).
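
The truncation could be done roughly like this (a hypothetical Java sketch; the
actual generator is the shell script listed under Usage):

    import java.util.Random;

    // Truncate a full query to a random as-you-type prefix of at least 3 characters.
    static String randomPrefix(String query, Random rng) {
        if (query.length() <= 3) return query;           // nothing to truncate
        int len = 3 + rng.nextInt(query.length() - 2);   // uniform over [3, query.length()]
        return query.substring(0, len);
    }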


Throughput

With about 65% of queries eventually hitting a cache, this single-threaded program
averaged 520 qps on a dual-core MacBook Pro, processing 1MM+ queries in
about 32 minutes.
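
One simple way to build such a cache in Java is a bounded, access-ordered
LinkedHashMap; a sketch (the capacity and the prefix-to-suggestion-list
key/value types are assumptions):

    import java.util.LinkedHashMap;
    import java.util.List;
    import java.util.Map;

    // Bounded LRU cache mapping a query prefix to its suggestion list.
    class SuggestionCache extends LinkedHashMap<String, List<String>> {
        private static final int MAX_ENTRIES = 100_000;   // assumed capacity

        SuggestionCache() {
            super(16, 0.75f, true);                        // access-order enables LRU eviction
        }

        @Override
        protected boolean removeEldestEntry(Map.Entry<String, List<String>> eldest) {
            return size() > MAX_ENTRIES;
        }
    }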

The machine was CPU-bound during this test, which probably points to the
computational burden of all the string comparisons required by the binary search
(profiling would confirm this).  Given sufficient memory, an in-memory
ternary search tree would reduce prefix-search to log(N) in-memory
char-comparisons, versus the current binary-search requirement of log(N) on-disk
string-comparisons.
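
A ternary search tree lookup might look roughly like this (an illustrative
sketch only; the current implementation uses the on-disk binary search
described under Index below):

    // Minimal ternary-search-tree node (illustrative sketch).
    class TstNode {
        char splitChar;
        boolean isWord;                  // true if a term ends at this node
        TstNode lo, eq, hi;
    }

    // Follow the prefix down the tree; returns the node for its last character, or null.
    static TstNode findPrefix(TstNode node, String prefix, int i) {
        if (node == null || prefix.isEmpty()) return null;
        char c = prefix.charAt(i);
        if (c < node.splitChar) return findPrefix(node.lo, prefix, i);
        if (c > node.splitChar) return findPrefix(node.hi, prefix, i);
        if (i == prefix.length() - 1) return node;        // matched the whole prefix
        return findPrefix(node.eq, prefix, i + 1);
    }

All completions with that prefix can then be collected by walking the returned
node's eq subtree (plus the node itself if isWord is set).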

Additionally, with a little extra work, this could be made thread-safe so
that multiple requests could be served at one time.  Such an implementation
would benefit from additional CPUs and multiple disk controllers.
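
For example, once the index access and cache are thread-safe, queries could be
dispatched to a fixed thread pool (a sketch under those assumptions; suggest()
here is just a stand-in for the existing lookup):

    import java.util.Collections;
    import java.util.List;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.Future;

    class ConcurrentSuggestServer {
        private final ExecutorService pool =
                Executors.newFixedThreadPool(Runtime.getRuntime().availableProcessors());

        // Stand-in for the existing prefix lookup; assumed to be thread-safe.
        private List<String> suggest(String prefix, int count) {
            return Collections.emptyList();                // placeholder
        }

        // Each query runs on a worker thread instead of being served serially.
        Future<List<String>> submit(String prefix, int count) {
            return pool.submit(() -> suggest(prefix, count));
        }
    }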


Index
  
The index is stored on disk.  (Due to its size, it cannot be stored
in 256MB of memory.)  With sufficient memory available, a ternary search tree
or perhaps a radix trie (such as a Patricia trie) would further optimize lookups.

The on-disk index is accessed through a binary search to find a term with a
matching prefix; adjacent (also-matching) terms are then collected and
returned as well.
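
In outline, that lookup can be done with a RandomAccessFile: binary-search over
byte offsets, re-aligning to the next line boundary at each probe, then scan
forward while lines still match.  A rough sketch, assuming an ASCII index
sorted one term per line (the real index format may differ):

    import java.io.IOException;
    import java.io.RandomAccessFile;
    import java.util.ArrayList;
    import java.util.List;

    // Collect all index lines whose term starts with the given prefix.
    static List<String> prefixSearch(RandomAccessFile idx, String prefix) throws IOException {
        long lo = 0, hi = idx.length();
        while (lo < hi) {                          // find the region where matches begin
            long mid = (lo + hi) >>> 1;
            idx.seek(mid);
            if (mid > 0) idx.readLine();           // skip the partial line we landed in
            String line = idx.readLine();
            if (line != null && line.compareTo(prefix) < 0) {
                lo = mid + 1;
            } else {
                hi = mid;
            }
        }
        idx.seek(lo);
        if (lo > 0) idx.readLine();                // re-align to the next full line
        List<String> matches = new ArrayList<>();
        for (String line = idx.readLine();
             line != null && line.startsWith(prefix);
             line = idx.readLine()) {
            matches.add(line);                     // adjacent lines share the prefix
        }
        return matches;
    }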


Other Optimization ideas

1. Shard the index:  An on-disk binary search requires worst-case log(N) seeks
and string-comparisons, plus M file-reads and string-comparisons to return all
adjacent matches.  Sharding the index (e.g. based upon the first character
typed, so:  a.idx, b.idx, ... 0.idx, ... 9.idx) would significantly reduce the
average number of seek+string-comparisons (by about 25% in this test).
Since the process is CPU-bound, prefix-search should speed up by up to 25%
for smaller values of M.  (Shard selection is sketched after this list.)

2. SSD:  If the application could be tuned (sharded index, multi-processor
server, etc.) to the point that it became IO-bound, some further gains might
be found by moving the index to SSDs to take advantage of their improved
seek-time and data-transfer characteristics.
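
Shard selection for idea 1 could be as simple as mapping the first character of
the query to a file (a sketch; the file names and directory layout are
illustrative, not the actual implementation):

    import java.io.File;

    // Pick the shard file for a query based on its first character (e.g. "pizz" -> p.idx).
    static File shardFor(String query, File indexDir) {
        char first = Character.toLowerCase(query.charAt(0));   // assumes a non-empty query
        String name = (Character.isLetterOrDigit(first) ? first : '_') + ".idx";
        return new File(indexDir, name);                       // '_' is a catch-all shard
    }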


Usage

Create index (about 4 min; first extract the AOL query files into the
top-level directory):

    [shell]$ ./aol-query-indexer.sh

Create samples:

    [shell]$ ./aol-as-you-type-query-generator.sh

Run samples (here, 5 suggestions per query):

    [shell]$ ./aol-query-suggest.sh ../data/index.txt ../data/randoms.txt 5