TextExtractor

Python modules to extract text from different formats, remove headers and footers, and separate sentences

Primary language: Python

Teamproject


This project contains Python modules to

  • extract text from different formats (*.doc, *.docx, *.odt, *.pdf, *.rtf)
  • remove headers and footers
  • separate sentences
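
This README does not describe how the project detects headers and footers. As a rough illustration of one common approach, the sketch below drops lines that repeat at the top or bottom of most pages. The function name `strip_repeated_edges`, its parameters, and the one-string-per-page input are assumptions made for this example, not the project's actual interface.

```python
from collections import Counter


def strip_repeated_edges(pages, edge_lines=1, min_share=0.6):
    """Drop top/bottom lines that repeat on most pages (illustrative sketch).

    pages      -- list of page texts, one string per page (assumed input shape)
    edge_lines -- how many lines at the top and bottom to inspect (>= 1)
    min_share  -- share of pages an edge line must appear on to be dropped
    """
    # With a single page there is nothing to compare against.
    if len(pages) < 2:
        return list(pages)

    split_pages = [page.splitlines() for page in pages]

    # Count on how many pages each non-empty edge line occurs.
    counts = Counter()
    for lines in split_pages:
        edge = {line.strip() for line in lines[:edge_lines] + lines[-edge_lines:]}
        edge.discard("")
        counts.update(edge)

    threshold = min_share * len(split_pages)
    repeated = {line for line, n in counts.items() if n >= threshold}

    cleaned = []
    for lines in split_pages:
        # Skip repeated lines at the top of the page.
        top = 0
        while top < min(edge_lines, len(lines)) and lines[top].strip() in repeated:
            top += 1
        # Skip repeated lines at the bottom of the page.
        bottom = len(lines)
        while (bottom > max(top, len(lines) - edge_lines)
               and lines[bottom - 1].strip() in repeated):
            bottom -= 1
        cleaned.append("\n".join(lines[top:bottom]))
    return cleaned


if __name__ == "__main__":
    sample = [
        "Report 2015\nFirst page text.\nConfidential",
        "Report 2015\nSecond page text.\nConfidential",
        "Report 2015\nThird page text.\nConfidential",
    ]
    print(strip_repeated_edges(sample))
```

A heuristic like this misses headers or footers that change from page to page, such as page numbers, so it is only a starting point.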

It contains setup files for the Ubuntu Server distribution and Python 3.4.3.

To install these files, change into the install folder and run ./inst.sh.

The separator module uses the Natural Language Toolkit (NLTK) and is distributed under the terms of the Apache License, Version 2.0.

We refer to the following book:

Bird, Steven, Edward Loper and Ewan Klein (2009), Natural Language Processing with Python. O’Reilly Media Inc.

The docx module of the converter uses docx2txt and is distributed under the terms of the GPLv3.