This project provides a very basic implementation of a modified Trie tree that indexes tokens instead of characters. The trie tree uses dictionaries to keep track of the lists of child nodes and stores the value of the phrases in its terminal nodes. In a production setting I would explore implementing a linked list based approach where lists of child nodes are singly linked list.
The document we are processing with Tokenized Trie goes through some basic preprocessing. This preprocessing includes:
- removing punctuation and stop words
- tokenization
There are a number of improvements that could be made to this implementation including:
- loading all text and seed data from files where the files are passed into the driving module as paths
- serializing tree to a format that would allow us to flush the tree to disk and load it again so that we don't have to rebuild it each time we run it
- implement unit tests for all functions
- refactor tree data structure to use singly linked lists instead of dictionaries
- implement fuzzy search matching
- Python 3.6+
- virtualenv (
pip install virtualenv
) - virtualenvwrapper (
pip install virtualenvwrapper
)
mkvirtualenv trie-search
(if python 3.X+ is your default python version) OR- Type which python3, to get the path of your python3 (i.e. /usr/local/bin/python)
mkvirtualenv -p [Path To Python3] trie-search
- cd to project directory
pip install spacy
python -m spacy download en_core_web_sm
Go to the root of this project and type:
python run.py