/PyPruningRadixTrie

PyPruningRadixTrie - Python version of super fast Radix trie for prefix search & auto-complete

Primary LanguagePythonOtherNOASSERTION

PyPruningRadixTrie

GitHub CI PyPI version OSS Lifecycle

Python Port of Pruning Radix Trie by Wolf Garbe.

Changes compared to original version

  • Removed parameter to disable pruning behavior.
    • See test/non_pruning_radix_trie.py for a non-pruning version that you can use to see the speed improvement.
  • Added outline for more generic InputProvider and providers that read CSV or JSON as examples

What and Why

A Trie is a tree data structure that is commonly used for searching terms that start with a given prefix.
It starts with an empty string at the base of the trie, the root node.
Adding a new entry to the trie creates a new branch. This branch shares already present characters with existing nodes and creates new nodes when it's prefix diverges from the present entries.

# trie containing flower & flowchart (1 char = 1 node)

'' - f - l - o - w - e - r
                 |
                 c - h - a - r - t

A RadixTrie is the space optimized version of a Trie.
It combines the nodes with only one sub-node into one, containing more than one character.

# radix trie containing flower & flowchart

'' - flow - er
      |
     chart

The prefix Pruning references the algorithm to query the RadixTrie.
In order for the pruning to work, the nodes in RadixTrie are stored in a sorted manner.
This structure allows cutting off unpromising branches during querying the trie which makes the algorithm way faster compared to a non-pruning trie.

Usage

Get from PyPI:

pip install pypruningradixtrie

Create the PRT:

# empty trie
trie = PruningRadixTrie()

# fill with data from CSV file on creation
trie = PruningRadixTrie('./test_data.csv', CSVInputProvider(',', lambda x: float(x[1])))

Add entries:
CSV:

# fill with data from CSV file, score is at position 1, term at position 0
fill_trie_from_file(trie, './test_data.csv', CSVInputProvider(',', lambda x: float(x[1]), 0))

JSON:

# define a functon to calculate the score out of a JSON entry
def score_fun(json_entry: Dict[str, Any]) -> float:
  return json_entry["pages"] * json_entry["year"] / 10.0

# "title" = key for term to insert into PRT
fill_trie_from_file(trie, './test_data.json', JSONInputProvider("title", score_fun))

Single Entry:

# insert single entry
insert_term(trie, term="flower", score=20)

Use the PRT:

# get the top 10 entries that start with 'flower'
trie.get_top_k_for_prefix('flower', 10)