Thanks for using Tinasoft Pytextminer Pytextminer is a part of a larger software : Tinasoft Desktop you can find it at http://github.com/moma/tinasoft.desktop/ A text-mining python 2.6 module producing bottom-up semantic network production. It uses : - NLTK, the natural language processing toolkit (http://www.nltk.org/), - SQLite3, the embedded database library (http://sqlite.org), - Twisted web server (http://twistedmatrix) with jsonpickle as a serializer (http://jsonpickle.github.com), - Numpy for n-dimensionnal arrays processing (http://numpy.scipy.org/), - pyTenjin for graph gexf files export (http://www.kuwata-lab.com/tenjin/) Classical task are : - multiple kinds of source file support - extraction of key-phrases (ngrams) using various simple Natural Language Processing methods (stopwords, part-of-speech tagging, stemming, etc) - creation of document/corpus/ngram graphs databases - key-phrases cooccurrences calculation on a corpus basis - production of graphs of multiple entities and multiple relations (hybrid storage into GEXF files, http://gexf.net, and into sqlite database) - an httpserver exposing the API, sending json results This software is part of TINA, an European Union FP7 coordination action - FP7-ICT-2009-C : - http://tinasoft.eu/ The software implements scientific results by David Chavalarias (CREA lab; CNRS/Ecole Polytechnique UMR 7656, http://chavalarias.com) and Jean-Philippe Cointet (INRA SENS, http://jph.cointet.free.fr). SOURCE CODE REPOSITORY https://forge.iscpif.fr/projects/tinasoft-pytextminer http://github.com/moma/TinasoftPytextminer AUTHORS - Researchers and engineers at CREA lab (UMR 7656, CNRS, Ecole Polytechnique, France) julian bilcke <julian.bilcke (at) iscpif (dot) fr> david chavalarias <david.chavalarias (at) polytechnique (dot) edu> jean philippe cointet <jphcoi (at) yahoo (dot) fr> elias showk <elishowk (at) nonutc (dot) fr> MAINTAINER elias showk <elishowk (at) nonutc (dot) fr> DOCUMENTATION, SUPPORT AND FEEDBACK http://tinasoft.eu/ (project homepage) https://forge.iscpif.fr/projects/tinasoft-pytextminer (software development) PYTEXTMINER AS A USER Download standalone packages from http://tinasoft.eu DEVELOPER DOCUMENTATION http://tina.csregistry.org/tinauserdoc PYTEXTMINER AS A DEVELOPER * we provide a http server exposing the main API from the TinaApp class * alternatively, the apitests.py script provides examples to properly use the TinaApp class methods - get the source code : https://forge.iscpif.fr/projects/tinasoft-pytextminer/repository OR git clone https://sources.iscpif.fr/tinasoft.pytextminer.git PYTHON : you'll need Python 2.6 interpreter : http://python.org/ INSTALL THE PYTHON PACKAGE $ sudo python setup.py install or $ sudo python setup.py develop Dependencies should be checked : numpy, nltk, twisted, jsonpickle, tenjin, pyyaml OTHERWISE MANUALLY INSTALL PYTHON DEPENDENCIES - they're listed in setup.py DOWNLOAD NLTK DATA You'll need to install manually required nltk corpus data $ export NLTK_DATA="your/path/to/TinasoftPytextminer/shared/nltk_data" $ python > import nltk.download() Downloader> d punkt Downloader> d brown Downloader> d conll2000 on MS WINDOWS: $ set NLTK_DATA="TinasoftPytextminer\shared\nltk_data" $ PATH C:\Python26;%PATH% $ python apitests.py ... (see usage) - finally open your web browser at http://localhost:8888 (no internet connection needed) GNU/LINUX (and probable UNIX-like systems) - use the standalone freezed httpserver software $ export NLTK_DATA=shared/nltk_data $ python apitests.py ... (see usage) DEVELOPER DOCUMENTATION http://tina.csregistry.org/tinadevdoc CONFIGURATION config_*.yaml are a YAML configuration files. The main application (TinaApp class) searches it during init, its path is a required parameter GUIDELINES - declare each column name of your csv file into the corresponding field name of the configuration file - not declared columns will be ignored by the software - here are possible required and optional entries : #### REQUIRED titleField: document title contentField: document content authorField: document acronyme corpusNumberField: corpus number docNumberField: document number ##### optional index1Field: document index 1 index2Field: document index 2 dateField: document publication date keywordsField: document keywords - check out the format of your csv file (encoding, delimiter, quoting character) and write them into fields "locale", "delimiter" and "quotechar" - "minSize", and "maxSize" means the length of n-grams extracted - all other fields are the script configuration, or the default values for testing purpose WARNING : in YAML all tabulations are spaces, all string values must be quoted (eg : 'prop_title'). Further information at http://en.wikipedia.org/wiki/YAML SOURCE FILES DIRECTORY - "source_files" is dedicated to the storage of your source files - these files are used during indexation and extraction steps of the workflow - given an existing file name in this directory, the software will be able to read it TESTED OPERATING SYSTEMS Tinasoft Pytextminer was tested on the following platforms: GNU/Linux (amd4, i386) with Python 2.6 Windows XP (32bit) with Python 2.6 Mac OS X >= 10.6 COPYRIGHT AND LICENSE Copyright (C) 2009-2011 CREA Lab, CNRS/Ecole Polytechnique UMR 7656 (Fr) This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version. This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details. You should have received a copy of the GNU General Public License along with this program. If not, see <http://www.gnu.org/licenses/gpl.html>.
elishowk/TinasoftPytextminer
A python text-mining module producing semantic network graphs
PythonNOASSERTION