/extract-proper-nouns

Extract proper nouns from an English text with NLTK POS tagging

Primary LanguagePython

This script allows to extract proper nouns from an English text with NTLK.

Install dependencies
--------------------
* Install NTLK according your OS (pkg install ntlk on FreeBSD for example)
* Install numpy (pkg install py27-numpy)
* Download the needed NLTK resources with ntlk.download():
** averaged_perceptron_tagger
** maxent_treebank_pos_tagger
** punkt
** treebank

Source text
-----------
You need a copy of the text you want to extract from as plain text.

Source English word list
------------------------
The expected format is a list in lowercase, each line a substantive word.
Filename should be wordsEn.txt or modified in eliminate-common-nouns script.

Such file was available at [SIL](http://web.archive.org/web/20141122213941/http://www-01.sil.org/linguistics/wordlists/english/).

Usage
-----
./extract-proper-nouns source.txt > nouns.txt

To sort them and eliminate duplicates:
./extract-proper-nouns source.txt | sort | uniq > nouns.txt

To discard known English words:
./eliminate-common-nouns nouns.txt

Acknowledgment
--------------

Thank you to Rama for NLTK suggestion and some brief guidance.

The original code idea is from Alvations, and could be seen at http://stackoverflow.com/a/17672491/1930997.