/gist-erlang

Primary LanguageErlangApache License 2.0Apache-2.0

GiST implementation in pure Erlang

Description

The library implements abstract search tree according to "Generalized Search Trees for Database Systems" paper by Joseph M. Hellerstein, Jeffrey F. Naughton, and Avi Pfeffer (Technical Report #1274, University of Wisconsin at Madison, July 1995).

Quadratic pick-split algorithm is implemented according to "R-Trees - A Dynamic Index Structure for Spatial Searching" by Antonin Guttman (ACM SIGMOD Record, Volume 14, Issue 2, June 1984).

Currently, only set key class is implemented. Other key classes (like, R-trees) may be introduced by implementing gist_key behaviour.

Trigram search is implemented as a straightforwad demonstration of the set key class.

Usage

1> {ok, Data} = file:read_file("test/data/words-10000.txt"),
   Words = binary:split(Data, <<"\n">>, [global, trim_all]),
   length(Words).
10000
2> Tree0 = gist_tree:new(gist_key_set, gist_node_heap, undefined).
{tree,2,20,gist_key_set,gist_node_heap,
      #{level => 0,root => undefined}}
3> Tree1 = lists:foldl(
        fun(Word, Tree) ->
            Key = gist_trigram:to_key(Word),
            gist_tree:insert(Tree, Key, Word)
        end,
        Tree0,
        Words
   ), ok.
ok
4> SearchKey = gist_key_set:to_key(['and', 'nda']).
#{'and' => 1,nda => 1}
%% or:
%% SearchKey = gist_trigram:to_key(<<"anda">>).
5> gist_tree:search(Tree1, SearchKey).
[<<"demandable">>,<<"Mandan">>,<<"garlandage">>,
 <<"husbandage">>,<<"bandarlog">>]

Performance

On 2,3 GHz 8-Core Intel Core i9 (MacBook Pro 2019) there are the following latencies (gist_sample:timings/0):

  • Trigram tree for the 100000-word wordlist (gist_key_set key scheme)
  • gist_node_heap node storage
  • Average key insert latency: 1.044ms
  • Tree depth (min fanout = 2, max fanout = 4): 10
  • Positive search latency: 0.332ms
  • Random search latency: 0.002ms

License

Apache License