/In-depth-NLP

POS, NER, Sentiment & Semantics, VADER, LDA, Topic Modeling using spaCy and NLTK

Primary LanguageJupyter Notebook

In-depth-NLP

Parts of Speech Tagging (POS)

The challenge of correctly identifying parts of speech is summed up nicely in the spaCy docs:

Processing raw text intelligently is difficult: most words are rare, and it's common for words that look completely different to mean almost the same thing. The same words in a different order can mean something completely different. Even splitting text into useful word-like units can be difficult in many languages. While it's possible to solve some problems starting from only the raw characters, it's usually better to use linguistic knowledge to add useful information. That's exactly what spaCy is designed to do: you put in raw text, and get back a Doc object, that comes with a variety of annotations.

In this notebook (POS & NER) we'll take a closer look at coarse POS tags (noun, verb, adjective) and fine-grained tags (plural noun, past-tense verb, superlative adjective).

Named Entity Recognition (NER)

spaCy has an 'ner' pipeline component that identifies token spans fitting a predetermined set of named entities. These are available as the ents property of a Doc object.

NLTK's VADER module

VADER is an NLTK module that provides sentiment scores based on words used ("completely" boosts a score, while "slightly" reduces it), on capitalization & punctuation ("GREAT!!!" is stronger than "great."), and negations (words like "isn't" and "doesn't" affect the outcome).
To view the source code visit https://www.nltk.org/_modules/nltk/sentiment/vader.html