Document_Tokenizer

Author: D.E. O'Kane

License: GPLv3

A set of tools and classes to tokenize and extract information from documents of all kinds.
The main goal of this project is to build a set of tools and classes much in the same vein as
the well-known Natural Language Toolkit (nltk.org), e.g. word stemmers, natural language
processing routines, and possibly even statistical routines.
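
To give a flavor of what a tokenization routine here looks like, below is a minimal sketch
in plain Python. The function name and the regular expression are illustrative assumptions,
not this package's actual API:

    import re

    def tokenize(text):
        # Very simple regex-based word tokenizer (illustrative only;
        # the package's real tokenizers may use different rules).
        # Lowercases the text and keeps alphabetic runs, allowing an
        # internal apostrophe as in "don't".
        return re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())

    print(tokenize("The parties' obligations survive termination."))
    # -> ['the', 'parties', 'obligations', 'survive', 'termination']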

News:

Release of v2.0! At last there is a (somewhat) refined user interface! The basic tools work,
namely the LSA and RE analysis tools, and each function now carries a bit of help
documentation to guide you. This release represents a major step forward in functionality,
user interface, and program flexibility.

-->Functionality: Users can now change the program's working directory, and subsequent CSV
   files are written to that directory.
-->User Interface: Users no longer have to be as programmatically inclined. All the tools
   are now collected in an interactive interface along with helpful documentation.
-->Program Flexibility: As before, users can use the program to perform merger & acquisition
   specific processes (provision searches) or more general analysis, e.g. PCA (see the
   sketch after this list).
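
Since both LSA and PCA come up above, here is a rough picture of what the general analysis
boils down to: a truncated SVD of a term-document count matrix. The matrix, the term labels,
and the variable names below are made-up illustrations, not data or code from this package:

    import numpy as np

    # Toy term-document count matrix: rows are terms, columns are
    # documents. The counts are invented purely for illustration.
    counts = np.array([
        [2.0, 0.0, 1.0],   # "merger"
        [1.0, 3.0, 0.0],   # "acquisition"
        [0.0, 1.0, 4.0],   # "provision"
    ])

    # LSA keeps only the top-k singular values/vectors of the matrix.
    U, s, Vt = np.linalg.svd(counts, full_matrices=False)
    k = 2
    # Each document's coordinates in the reduced k-dimensional space.
    doc_coords = (np.diag(s[:k]) @ Vt[:k]).T

    print(doc_coords)

(PCA differs only in that the matrix is mean-centered before the decomposition.)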

My current wish list includes:

> A word stemmer (Porter or otherwise; I need a robust stemming method).
> Descriptive statistics routines for a data set, e.g. word counts and the percentage of a
document that is single-occurrence words (i.e. hapax legomena); see the sketch below.
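
The descriptive-statistics item is easy to prototype with the standard library. The helper
below is hypothetical, not part of the package; it reports word counts and the share of
tokens that occur exactly once:

    from collections import Counter

    def hapax_stats(tokens):
        # Hypothetical helper (not in the package): word counts plus
        # the fraction of the document made up of hapax legomena.
        counts = Counter(tokens)
        hapax = [word for word, n in counts.items() if n == 1]
        return {
            "total_tokens": len(tokens),
            "vocabulary_size": len(counts),
            "hapax_legomena": len(hapax),
            "hapax_share": len(hapax) / len(tokens) if tokens else 0.0,
        }

    print(hapax_stats("a rose is a rose is a flower".split()))
    # -> 8 tokens, 4 distinct words, 1 hapax ("flower"), share 0.125

For the stemming item, NLTK already ships a Porter implementation (nltk.stem.PorterStemmer),
which could serve as a baseline until a bespoke stemmer lands.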