Contents
PyDAWG
is a python module implements DAWG graph structure,
which allow to store set of strings and check existence of a string in
linear time (in terms of a string length).
DAWG is constructed by incremental algorithm described in Incremental algorithm for sorted data, Jan Daciuk, Stoyan Mihov, Bruce Watson, and Richard Watson, Computational Linguistics, 26(1), March 2000. Prof. Jan Daciuk offers also some useful documentation, presentations and even sample code on his site.
The algorithm asserts that input words are sorted in
lexicographic order; default Python sort()
orders strings correctly.
Also minimal perfect hashing (MPH) is supported, i.e. there is a function that maps words to unique number; this function is bidirectional, its possible to find number for given word or get word from number.
There are two versions of module:
- C extension, compatible only with Python3;
- pure python module, compatible with Python 2 and 3.
Python module implements subset of C extension API.
Library is licensed under very liberal three-clauses BSD license. Some portions has been released into public domain.
Full text of license is available in LICENSE file.
Wojciech Muła, wojciech_mula@poczta.onet.pl
Compile time settings (can be change in setup.py):
DAWG_UNICODE
--- if defined, DAWG accepts and returns unicode strings, else bytes are supportedDAWG_PERFECT_HASHING
--- when defined, minimal perfect hashing is enabled (methods word2index and index2word are available)
Just run:
$ python setup.py install
If compilation succed, module is ready to use.
Module pydawg
provides class DAWG
and following members:
EMPTY
,ACTIVE
,CLOSED
--- symbolic constants forstate
member ofDAWG
objectperfect_hashing
-- see Minimal perfect hashingunicode
-- see Unicode and bytes
Type of strings accepted and returned by DAWG
methods can be
either unicode or bytes, depending on compile time
settings (preprocessor definition DAWG_UNICODE
). Value of
module member unicode
informs about chosen type.
DAWG
class is picklable, and also provide independent
way of marshaling with methods binload()
and bindump()
.
state
[read-only integer]Following values are possible:
pydawg.EMPTY
--- no words in a set;pydawg.ACTIVE
--- there is at least one word in a set, and adding new words is possible (seeadd_word
&add_word_unchecked
);pydawg.CLOSED
--- there is at least one word in a set, but adding new words is not allowed (seeclose
/freeze
).
add_word(word) => bool
- Add word, returns True if word didn't exists in a set.
Procedure checks if
word
is greater then previously added word (in lexicography order). add_word_unchecked(word) => bool
- Does the same thing as
add_word
but do not checkword
order. Method should be used if one is sure, that input data satisfy algorithm requirements, i.e. words order is valid. exists(word) => bool
orword in ...
- Check if word is in set.
match(word) => bool
- Check if word or any of its prefix is in a set.
longest_prefix(word) => int
- Returns length of the longest prefix of word that exists in a set.
len()
protocol- Returns number of distinct words.
words() => list
- Returns list of all words.
find_all([word, [wildchar, [how]]]) => iterator
Returns iterator that match words depending on
word
argument.find_all()
- does the same job as
iter()
find_all(prefix)
- Yields words that share a prefix
find_all(pattern, wildchar, [how])
Yields words that match a
pattern
with givenwildchar
(wildchar matches any char). Parameterhow
controls which words are matched:MATCH_EXACT_LENGTH
- words with the same length as a pattern
MATCH_AT_LEAST_PREFIX
- words of length not less then pattern
MATCH_AT_MOST_PREFIX
- words of length no greater then pattern
clear()
- Erase all words from set.
close()
orfreeze()
Don't allow to add any new words,
state
value becomepydawg.CLOSED
. Also free memory occupied by a hash table used to perform incremental algorithm (see alsoget_hash_stats()
).Can be reverted only by
clear()
.
Class supports iter
protocol, i.e. iter(DAWGobject)
returns
iterator, a lazy version of words()
method.
Minimal perfect hashing (MPH) allows to find unique number
representing any word from DAWG, and also find word with given number.
Numbers are in always in range 1 ... len(DAWG)
.
Finally, this feature makes possible to perform fast lookups as in a regular dictionary.
Algorithm used for MPH is described in Applications of Finite Automata Representing Large Vocabularies, Claudio Lucchesi and Tomasz Kowaltowski, Software Practice and Experience, 23(1), pp. 15--30, Jan. 1993.
MPH feature is enabled during compilation time if preprocessor
definition DAWG_PERFECT_HASHING
exists. Module member
perfect_hashing
reflects this setting.
Warning
Words numbering is done for the whole DAWG. If new words
are added with add_word
or add_word_unchecked
,
then current numbering is lost and when method word2index
or index2word
is called, then DAWG is renumbered.
Because of that frequent mixing these two groups of method will degrade performance.
word2index(word) => index
- Returns index of word, or None if word is not present in a DAWG.
index2word(index) => word
- Returns words associated with index, or None if index isn't valid.
D = pydawg.DAWG() # fill DAWG with keys for key in sorted(dict): D.add_word_unchecked(key) # prepare values array V = [None] * len(D) for key, value in dict.items(): index = D.word2index(key) assert index is not None V[index - 1] = value # lookups are possible now for word in user_input: index = D.word2index(word) if index is not None: print(word, "=>", V[index - 1])
dump() => (set of nodes, set of edges)
Returns sets describing DAWG, elements are tuples.
Node tuple:
- unique id of node (number)
- end of word marker
Edge tuple:
- source node id
- edge label --- letter
- destination node id
Distribution contains program
dump2dot.py
that shows how to convert output of this function to graphviz DOT language.bindump() => bytes
- Returns binary DAWG data.
binload(bytes)
Restore DAWG from binary data. Example:
import pydawg A = pydawg.DAWG() with open('dump', 'wb') as f: f.write(A.bindump()) B = pydawg.DAWG() with open('dump', 'rb') as f: B.binload(f.read())
get_stats() => dict
Returns dictionary containing some statistics about underlaying data structure:
words_count
--- number of distinct words (same aslen(dawg)
)longest_word
--- length of the longest wordnodes_count
--- number of nodesedges_count
--- number of edgessizeof_node
--- size of single node (in bytes)sizeof_edge
--- size of single node (in bytes)graph_size
--- size of whole graph (in bytes); it's aboutnodes_count * sizeof_node + edges_count * sizeof_edge
get_hash_stats() => dict
Returns some statistics about hash table used by DAWG.
table_size
--- number of table's elementselement_size
--- size of single table itemitems_count
--- number of items saved in a tableitem_size
--- size of single item
Approx memory occupied by hash table is
table_size * element_size + items_count * item_size
.