Kibbitz prefix server The prefix server will respond to autocomplete queries based on the provided index file (see Index below) with a requested number of suggestions. Scoring Algorithm This particular scoring algorithm gives preference to completions more frequently seen in the index, and also completions with more terms: Log(terms in completion + 1) * Log(document frequency + 1) Responsiveness Time taken is proportional to the number of matches found (e.g. "home":327ms vs. "pizz":13ms vs. "uha":2ms) The sample queries file contains about 1MM queries from the AOL dataset, each truncated randomly anywhere from the full length down to 3 characters (for performance reasons, 1-2 character queries were omitted). Throughput With about 65% of queries eventually hitting a cache, this single-threaded program returned an average of 520qps on a dual-core MacBook Pro processing 1MM+ queries in about 32 minutes. The machine was CPU-bound during this test, which probably points to the computing burden of all the string-comparison required by the binary search. (Profiling would confirm). Given sufficient memory, using an in-memory Ternary-search Tree would cause prefix-search to require only log(N) in-memory char-comparisons vs the current binary-search requirement of log(N) on-disk string-comparisons. Additionally, with a little extra work, this could be made thread-safe so that multiple requests could be served at one time. This implementation would benefit from additional cpus and multiple disk-controllers. Index The index is stored on disk. (Due to it's size, it cannot be stored in 256MB of memory.) With sufficient memory available, a Ternary-search-tree or maybe a Radix trie (like a Patricia-trie) would further optimize lookups. The on-disk index is accessed through a binary-search to find terms with a matching prefix, and then adjacent (also-matching) terms are collected and returned as well. Other Optimization ideas 1. Shard the index: An on-disk binary-search requires worst-case log(N) seeks and string-comparisons plus M file-read and string-comparisons to return all adjacent matches. Sharding the index (e.g. Based upon the first character typed, so: a.idx, b.idx, ... 0.idx, ... 9idx) would significantly reduce the average number of seek+string-comparisons (in this test by about 25%). Since the process is CPU-bound, prefix-search should speed up by up to 25% for smaller values of M. 2. SSD: If the application could be tuned with a sharded index, multi-processor server, etc... to the point that this became IO-bound, some further optimizations might be found by moving the index to SSDs to take advantage of their improved seek-time and data-transfer characteristics. Usage Create index: Extract AOL query files into top-level directory (about 4 min) [shell]$ ./aol-query-indexer.sh Create samples: [shell]$ ./aol-as-you-type-query-generator.sh Run samples: [shell]$ ./aol-query-suggest.sh ../data/index.txt ../data/randoms.txt 5 (for 5 suggestions per query)