
TF-IDF report over a watched directory



Python 2.7

This program watches a directory and returns the N top-ranked files for a given query string.


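Term Frequency - Inverse Document Frequency (TF-IDF) is a weighting scheme that scores how relevant a word is to one file, measured against the corpus of all the other files in the directory. A word scores high for a file when it occurs often in that file but in few of the other files.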
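The scoring and ranking idea can be sketched as follows. This is a minimal illustration, not the code in tfidf.py: the names tokenize, tf_idf_scores and top_n are hypothetical, the tokenizer is a naive whitespace split, and the weighting assumes tf = count / file length and idf = log(N / df), which is only one of several common variants.

```python
from __future__ import division, print_function

import math
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace (deliberately naive)."""
    return text.lower().split()


def tf_idf_scores(docs, query):
    """Score every document in docs (a name -> text dict) against query.

    Hypothetical helper: sums tf * idf over the query terms.
    """
    counts = {name: Counter(tokenize(text)) for name, text in docs.items()}
    n_docs = len(docs)
    scores = {}
    for name, words in counts.items():
        total = sum(words.values())
        score = 0.0
        for term in tokenize(query):
            if total == 0 or words[term] == 0:
                continue  # term absent from this file contributes nothing
            tf = words[term] / total
            # Number of files that contain the term at least once.
            df = sum(1 for c in counts.values() if term in c)
            score += tf * math.log(n_docs / df)
        scores[name] = score
    return scores


def top_n(docs, query, n):
    """Return the names of the n best-scoring files, ties broken by name."""
    scores = tf_idf_scores(docs, query)
    return sorted(scores, key=lambda name: (-scores[name], name))[:n]


docs = {
    "a.txt": "the cat sat on the mat",
    "b.txt": "the dog chased the cat",
    "c.txt": "birds sing at dawn",
}
print(top_n(docs, "dog", 1))
```

With this variant a term that appears in every file gets idf = log(1) = 0, so it never affects the ranking.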

The time complexity in the worst case is:

  • equation
  • equation assuming there are the same number of terms as files and words in files

The space complexity is equation, since an array and a dict of files are stored.


To watch a directory, TFIDF uses the watchdog module.
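For readers new to watchdog, a rough sketch of directory watching with its Observer and FileSystemEventHandler classes is shown here. It is not taken from tfidf.py: ChangeRecorder and watch are hypothetical names, and it assumes the watchdog package is installed and that only files directly inside the directory matter.

```python
from __future__ import print_function

import time

from watchdog.events import FileSystemEventHandler
from watchdog.observers import Observer


class ChangeRecorder(FileSystemEventHandler):
    """Hypothetical handler: remember which files produced any event."""

    def __init__(self):
        super(ChangeRecorder, self).__init__()
        self.changed = set()

    def on_any_event(self, event):
        # Skip directory events; only files would need re-scoring.
        if not event.is_directory:
            self.changed.add(event.src_path)


def watch(path, seconds):
    """Watch path (non-recursively) for a fixed time, return touched files."""
    handler = ChangeRecorder()
    observer = Observer()
    observer.schedule(handler, path, recursive=False)
    observer.start()
    try:
        time.sleep(seconds)
    finally:
        # Always stop the observer thread, even if interrupted.
        observer.stop()
        observer.join()
    return handler.changed
```

Calling watch(".", 5) would collect the paths of files that changed in the current directory during five seconds. A long-running tool would more likely keep the observer alive and update its index from inside the handler.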


$ python setup.py install

This adds the tfidf script to PATH. On OS X/UNIX it is installed in /usr/local/bin.


$ python tfidf.py -d dir -n N -p P -t "terms"

Run tests

$ python -m unittest discover -s test -t tfidf