nn_search2

Text analysis and part-of-speech searching utility


nn_search2 is a part-of-speech tagging and text search utility based on NLTK, TextBlob and matplotlib. It uses a state-of-the-art POS-tagger based on the averaged perceptron, which provides fast and accurate results. One of the main nn_search2 features is full-text part-of-speech search: you can search a text by a word's part-of-speech tag (NN -- noun, VB -- verb, JJ -- adjective, etc.). A part-of-speech search query can include word ranges, so you can find chains of nouns, verbs, adjectives and fixed expressions using a fairly simple query syntax: "the"DT "old"JJ{2} "oak"NN{1}. In addition, nn_search2 can do basic text analysis, such as determining the lexical diversity, subjectivity and polarity of a given text, and can generate POS-tag distribution plots (see the screenshots below).
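To give a rough idea of what "lexical diversity" means, here is a minimal sketch that computes it as the ratio of unique words to total words (the type-token ratio). This is a common definition and only an assumption about what nn_search2 reports; the function name and the tokenizing regex are made up for illustration, and the code is written for Python 3.

```python
import re


def lexical_diversity(text):
    """Return unique words / total words (type-token ratio), 0.0 for empty text."""
    # Assumption: a "word" is a lowercase alphabetic run, optionally with an
    # inner apostrophe (e.g. "don't"). A real tokenizer may split differently.
    words = re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())
    if not words:
        return 0.0
    return len(set(words)) / len(words)


sample = "The old oak stood by the old road"
# 8 words in total, 6 of them unique ("the" and "old" repeat)
print(lexical_diversity(sample))  # 0.75
```

A value close to 1.0 means that almost every word is used only once, while a low value points to a lot of repetition.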

How does it look?

Here is the main window with some loaded text.

0

Imagine you want to find a sequence of ANY determiner (like "a", "an" or "the") followed by ANY noun, with at most 2 other tokens (words) between them. Enter the search query DT NN{2}, press "Process" and then "Search".

1

In addition to the default text view 1, which shows your search results, there are 2 alternative text views. View 2 shows the search results per sentence.

2

View 3 shows only the matched results of your search query.

3

Also, nn_search2 provides some text and search result statistics (see docs/html/index.html), which you can access via the right panel buttons.

4

There is also a separate POS-tagger for batch processing of one or more text files.

5

The standalone POS-tagger is also available from the console.

6

Why would I need it?

OK, why would somebody need to search by part of speech? Well, imagine you have a story, e.g. "Alice in Wonderland", and for some crazy reason you decide to find out which unique names Lewis Carroll used and how many of them there are. A naive approach would be to search for words with a capitalized first character, but then you'd have to filter out a lot of false positives by hand, which can be tedious and even impossible in some cases. You could run a part-of-speech tagger over your text and then use some regular expressions to probably get what you want, but the accuracy would still not be great: it would depend on your POS-tagger implementation, and you would be matching text patterns, not part-of-speech categories. To accomplish the same task with higher accuracy, nn_search2 requires only knowledge of the available part-of-speech tags and some simple syntax.
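To see why the naive approach produces false positives, consider this small sketch (Python 3). The function name and the sample text are invented for illustration; it collects every capitalized word, which also catches ordinary words that merely start a sentence.

```python
import re


def capitalized_words(text):
    """Return the sorted set of words that start with a capital letter."""
    # One uppercase letter followed by lowercase letters: "Alice", but also "The".
    return sorted(set(re.findall(r"\b[A-Z][a-z]+\b", text)))


text = "Alice was tired. The Rabbit ran. Then Dinah slept."
print(capitalized_words(text))
# ['Alice', 'Dinah', 'Rabbit', 'The', 'Then']
```

"The" and "Then" are not names, but a pattern that looks only at capitalization cannot tell them apart from "Alice" or "Dinah".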

Let's do it!

Load "Alice in Wonderland" and hit the "Process" button. Use the NNP POS-tag (proper noun) in your search query to find named entities and personal names. See what has been found.

7
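Conceptually, an NNP search boils down to picking the tokens that the tagger labeled as proper nouns. The sketch below (Python 3) works on a hand-tagged list of (word, tag) pairs, so it needs no tagger. Joining adjacent NNP tokens into one name such as "White Rabbit" is an assumption made for illustration, not a description of how nn_search2 builds its results, and the function name is hypothetical.

```python
from collections import Counter


def proper_names(tagged):
    """Group adjacent NNP tokens into names and count each distinct name."""
    counts = Counter()
    current = []  # words of the name being collected
    for word, tag in tagged:
        if tag == "NNP":
            current.append(word)
        elif current:
            counts[" ".join(current)] += 1
            current = []
    if current:  # flush a name that ends the token list
        counts[" ".join(current)] += 1
    return counts


# Hand-tagged sample; a real tagger may label some tokens differently.
tagged = [
    ("Alice", "NNP"), ("saw", "VBD"), ("the", "DT"),
    ("White", "NNP"), ("Rabbit", "NNP"), ("and", "CC"),
    ("Alice", "NNP"), ("ran", "VBD"),
]
print(proper_names(tagged))
# Counter({'Alice': 2, 'White Rabbit': 1})
```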

After getting the results you wanted: Alice, Australia, White Rabbit, Dinah, Eaglet, Edwin, Morcar, Edgar, Caterpillar and even WAISTCOAT-POCKET :) you decide to find out what all these guys were doing in the story. The following search query would help you with that: NNP VB{3}.

8

This was a small example of a possible use case. As you've seen, the results need some manual correction. This is simple, because you are free to edit and save any text in nn_search2.

Well, what if you also want to know what all these guys did and to whom? Use the following search query: NNP VB{3} NN{1}.

9

I think you get the picture. As you've noticed, not all of the results are correct. Unfortunately, NLTK's tagger makes mistakes even though it has fairly good accuracy (no perfect taggers exist). Also, be patient: the bigger your text and the shorter your search query, the more time it takes to display the results.

How to make a search query?

Examples:

DT NN

DT NN{1} VB

"the" "Aesir"NNP

TO "produce" "thunder"NN{1}

"The"{0} NNP{0} "are" NN{3}

In order to use nn_search2 you need to know how to write a search query. The syntax is simple: by default, nn_search2 assumes that you are searching for part-of-speech tags (POS-tags), so make sure you know at least the most basic ones. nn_search2 applies your query within one sentence only. So, NNP VB will be searched within the limits of a single sentence, not a paragraph or the whole text.

If you want to search for occurrences of nouns, you enter NN, that's it! Say you want to find only nouns that appear within 5 words of the beginning of the sentence: type NN{5}. The range syntax is NN{number}, where number is the maximum number of words allowed before the tag. You can also create chain queries like NN RB{2} VB{1}, which will attempt to find a noun, then an adverb with at most 2 words before it, then a verb with at most one word before it. Again, the number counts the words between the last successful match (NN) and the query tag (RB), so if NN is not matched, RB and VB will never be matched either. If you try a DT NN query without any numbers, nn_search2 will attempt to find the longest match within a sentence, so it's good practice to use word ranges.
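The range rule can be illustrated with a small matcher over one already-tagged sentence. This is a simplified sketch in Python 3, not nn_search2's actual search code: the (word, tag, limit) triples are a hypothetical query representation, tags are compared for exact equality, words are compared case-insensitively, and a term without a range takes the nearest candidate instead of the longest match mentioned above.

```python
def _fits(token, word, tag):
    """Check one (word, tag) token against the optional word and tag of a term."""
    tok_word, tok_tag = token
    if word is not None and tok_word.lower() != word.lower():
        return False
    return tag is None or tok_tag == tag


def match_query(tagged, terms, start=0):
    """Return the indices of the matched tokens, or None if the chain fails.

    tagged: list of (word, tag) pairs for ONE sentence.
    terms:  list of (word, tag, limit) triples; None means "not constrained".
    """
    if not terms:
        return []
    word, tag, limit = terms[0]
    # With a limit, at most `limit` tokens may be skipped before the match.
    stop = len(tagged) if limit is None else min(len(tagged), start + limit + 1)
    for i in range(start, stop):
        if _fits(tagged[i], word, tag):
            rest = match_query(tagged, terms[1:], i + 1)
            if rest is not None:  # otherwise try the next candidate position
                return [i] + rest
    return None


sentence = [("The", "DT"), ("very", "RB"), ("old", "JJ"), ("oak", "NN"), ("fell", "VBD")]
print(match_query(sentence, [(None, "DT", None), (None, "NN", 2)]))  # [0, 3]
print(match_query(sentence, [(None, "DT", None), (None, "NN", 1)]))  # None
```

In the second call the noun is 2 words away from the determiner, so a range of 1 is too small and the whole chain fails, just as RB and VB are never matched when NN is not found.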

That's nice, but can I just search for some words? Yes: type a word and surround it with double quotes, e.g. "viking". To specify a range, write "viking"{3} or "merry" "viking"{1}. Pretty simple. You can also combine POS-tags, words and ranges: "Valhalla"NNP VBZ{0} DT{0} "place"NN{2}. If your search query is incorrect and cannot be processed, nn_search2 will display a warning message. That's it, now you know the query syntax.
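The query syntax described above is regular enough to be split into terms with one regular expression. The following Python 3 sketch is only an illustration of that syntax, not the parser nn_search2 uses; QueryTerm, TERM_RE and parse_query are hypothetical names, and the sketch assumes that quoted words contain no spaces and that tags consist of uppercase letters (plus "$", as in PRP$).

```python
import re
from collections import namedtuple

# word: quoted word or None; tag: POS-tag or None; limit: range number or None
QueryTerm = namedtuple("QueryTerm", "word tag limit")

# Optional "word", optional TAG, optional {number}, in that order.
TERM_RE = re.compile(r'^(?:"([^"\s]+)")?([A-Z$]+)?(?:\{(\d+)\})?$')


def parse_query(query):
    """Split a query string into QueryTerm tuples; raise ValueError if malformed."""
    terms = []
    for chunk in query.split():
        m = TERM_RE.match(chunk)
        # A term needs at least a word or a tag; a bare {3} is rejected.
        if m is None or (m.group(1) is None and m.group(2) is None):
            raise ValueError("cannot parse query term: %r" % chunk)
        word, tag, limit = m.groups()
        terms.append(QueryTerm(word, tag, int(limit) if limit is not None else None))
    return terms


for term in parse_query('"Valhalla"NNP VBZ{0} DT{0} "place"NN{2}'):
    print(term)
# QueryTerm(word='Valhalla', tag='NNP', limit=None)
# QueryTerm(word=None, tag='VBZ', limit=0)
# QueryTerm(word=None, tag='DT', limit=0)
# QueryTerm(word='place', tag='NN', limit=2)
```

Raising ValueError on a malformed term plays the role of the warning message here: a caller can catch it and tell the user which part of the query could not be processed.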

How to install?

Installing and running from source on Linux

  1. install Python 2.7
  2. git clone https://github.com/tastyminerals/nn_search2.git
  3. cd nn_search2
  4. run linux_install.sh
  5. start the app nn_search2

Windows binary