This Clojure FST implementation is a wrapper above the Lucene FST package which is part of Lucene core.
Finite state transducers are finite state machines with two tapes: an input and an output tape. The automaton map an input string to an output. The output can be another string or an integer.
The FST produced by this application are implemented as a bytes array which makes FST really effective indexes in terms of speed and memory consumption. In fact, a 10 millions terms index will takes roughtly 256 MB of memory (depending of the string composition of the input strings, and if the outputs are integers or strings).
clj-fst
is an lightning-fast and memory effective way to figure out if something belong to a really huge set of things, or to get the output of an input. This is really simple but has profound implications.
You can easily install clj-fst
using Leiningen. The only thing you have to do is to add Add [clj-fst "0.1.0"]
as a dependency to your project.clj
.
Then make sure that you downloaded this dependency by running the lein deps
command.
The complete clj-fst
documentation is available here.
Here is how you can create, populate and save a FST:
;; The first thing to do is to create the Builder
(def builder (create-builder! :type :int))
;; This small `sorted-map` defines the things
;; to add to the FST
(def values (into (sorted-map) {"cat" 1
"dog" 2
"mice" 3}))
;; Populate the FST using that `sorted-map`
(doseq [[input output] values]
(add! builder {input output}))
;; Creating a new FST
(def fst (create-fst! builder))
;; Save a FST on the file system
(save! "resources/fst.srz" fst)
Additionally you can load a previously saved FST:
;; Load a FST from the file system
(load! "resources/fst.srz)
You can query a FST to get the output related to an input:
;; Query the FST
(get-output "cat" fst)
You can iterate over a FST using FST enumerations:
;; Create the FST enumeration
(def enum (create-enum! fst))
;; Get the first item in the FST
(next! enum)
;; Get the current FST item pointed by the enumerator
(current! enum)
;; Search for different input terms
(get-ceil-term! "cat" enum)
(get-floor-term! "cat" enum)
(get-exact-term! "cat" enum)