
Sentiment Analysis using NLTK in Python


This is a sentiment analysis project built with the NLTK library in Python.

Contents

  1. NLP
  2. Pipeline
    1. Data Cleaning
    2. Perform Analysis
  3. Libraries
    1. Matplotlib
    2. Scikit-learn
    3. NLTK
    4. NLTK Data
    5. NLTK Stopwords Corpus
    6. NLTK WordNet

NLP

Natural Language Processing, or NLP for short, is broadly defined as the automatic manipulation of natural language, such as speech and text, by software.

Pipeline

A pipeline is a way to design a program so that the output of one step becomes the input of the next step.

  1. Text Document
  2. Data Cleaning
  3. Perform Analysis
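
The pipeline idea can be sketched in a few lines of Python. This is a minimal illustration, not the project's code: `run_pipeline`, `clean`, and `analyse` are hypothetical names, and the two steps are placeholders for the real stages.

```python
def run_pipeline(data, steps):
    """Feed data through each step in order; each output is the next input."""
    for step in steps:
        data = step(data)
    return data


# Hypothetical placeholder steps standing in for the real stages.
def clean(text):
    return text.lower().split()


def analyse(tokens):
    return {"token_count": len(tokens)}


result = run_pipeline("A Text Document", [clean, analyse])
print(result)  # {'token_count': 3}
```

Keeping each stage as a separate function makes it easy to add, remove, or reorder steps without touching the others.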

Data Cleaning

Convert the raw text into a list of clean words (this is a very important step).

  1. Data Cleaning (pre-processing)

    1. Convert to Lower Case
    2. Remove Punctuation and Special Characters
    3. Tokenization
    4. Remove empty lines
    5. Stopwords Removal
    6. Lemmatization
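
Steps 1 to 5 can be sketched with the standard library alone. This is an illustration under assumptions, not the project's implementation: `clean_text` is a hypothetical name, the tiny `STOPWORDS` set is a stand-in for the NLTK stopwords corpus, and only ASCII punctuation is removed. Lemmatization is left out because it needs the WordNet data.

```python
import string

# Hypothetical stand-in for NLTK's English stopword list, kept tiny on purpose.
STOPWORDS = {"the", "is", "a", "and", "of", "to", "this"}


def clean_text(raw_text, stopwords=STOPWORDS):
    """Return a flat list of clean tokens from raw text."""
    # 1. Convert to lower case.
    text = raw_text.lower()
    # 2. Remove punctuation and special characters (ASCII punctuation only).
    text = text.translate(str.maketrans("", "", string.punctuation))
    # 3. Tokenization: split each line on whitespace.
    lines = [line.split() for line in text.splitlines()]
    # 4. Remove empty lines (lines that produced no tokens).
    lines = [tokens for tokens in lines if tokens]
    # 5. Stopwords removal.
    return [token for tokens in lines for token in tokens if token not in stopwords]


print(clean_text("This is a GREAT movie!\n\nThe plot, however, is thin."))
# ['great', 'movie', 'plot', 'however', 'thin']
```

Note that stripping punctuation this way also removes apostrophes, so a word like "don't" becomes "dont".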

Some definitions:

  • Tokenization - Split a sentence into individual words.
  • Stopwords Removal - Remove words that appear in the sentence but make no difference to the analysis (for example "the", "is", "and").
  • Stemming - Reduce a word to its base form. Ex.: Reading -> read.
  • Lemmatization - Group together the different inflected forms of a word so that they can be analysed as a single item.
    • Lemmatization runs twice, with different parameters each time. The second pass cleans up words that the first pass left unchanged.
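
A minimal sketch of stemming and lemmatization with NLTK's `PorterStemmer` and `WordNetLemmatizer`. The README does not say which parameters the two lemmatization passes use; the noun pass followed by a verb pass shown here is an assumption. `stem_tokens` and `lemmatize_twice` are illustrative names, and `lemmatize_twice` needs the WordNet data described under NLTK WordNet.

```python
from nltk.stem import PorterStemmer, WordNetLemmatizer

stemmer = PorterStemmer()


def stem_tokens(tokens):
    """Stemming: cut each token down to a base form with the Porter algorithm."""
    return [stemmer.stem(token) for token in tokens]


def lemmatize_twice(tokens):
    """Lemmatize as nouns, then as verbs (assumed passes; needs the wordnet data)."""
    lemmatizer = WordNetLemmatizer()
    as_nouns = [lemmatizer.lemmatize(token, pos="n") for token in tokens]
    return [lemmatizer.lemmatize(token, pos="v") for token in as_nouns]


print(stem_tokens(["Reading", "movies"]))  # ['read', 'movi']
```

A stem does not have to be a real word, as `movi` shows. With the WordNet data installed, `lemmatize_twice(["cars", "running"])` gives `["car", "run"]`: the noun pass handles `cars`, and the verb pass catches `running`, which the noun pass leaves unchanged.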

Vectorization

Convert the cleaned words into numbers that the analysis step can work with.
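
One common way to turn words into numbers is a bag-of-words count: every distinct word gets a column, and each document becomes a row of counts. The sketch below uses only the standard library to show the idea; `build_vocabulary` and `vectorize` are hypothetical names. scikit-learn, listed under Libraries, ships `CountVectorizer` for the same job, though the README does not state which vectorizer the project uses.

```python
from collections import Counter


def build_vocabulary(documents):
    """Map every distinct token to a column index, in sorted order."""
    vocabulary = sorted({token for tokens in documents for token in tokens})
    return {token: index for index, token in enumerate(vocabulary)}


def vectorize(tokens, vocabulary):
    """Return one count per vocabulary entry for a single tokenized document."""
    counts = Counter(tokens)
    return [counts[token] for token in vocabulary]


documents = [["great", "movie", "great", "plot"], ["thin", "plot"]]
vocabulary = build_vocabulary(documents)
vectors = [vectorize(tokens, vocabulary) for tokens in documents]
print(vocabulary)  # {'great': 0, 'movie': 1, 'plot': 2, 'thin': 3}
print(vectors)     # [[2, 1, 1, 0], [0, 0, 1, 1]]
```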

Perform Analysis

Plot the analysis. The result should look like the picture below.

[Picture: Result]

Libraries

Matplotlib

Open a terminal and run the command:

% pip install matplotlib

Source: https://matplotlib.org/stable/users/installing.html

scikit-learn

Open a terminal and run the command:

% pip install scikit-learn

Source: https://scikit-learn.org/stable/index.html

NLTK

Natural Language Toolkit (NLTK) is a leading platform for building Python programs to work with human language data.

Open a terminal and run the command:

% pip install nltk

Source: https://www.nltk.org/install.html

NLTK Data

To install the data, first install NLTK, then follow the instructions below.

Run the Python interpreter and type the commands:

>>> import nltk
>>> nltk.download()

A new window should open, showing the NLTK Downloader. Click on the File menu and select Change Download Directory. For central installation, set this to:

  • C:\nltk_data (Windows)
  • /usr/local/share/nltk_data (Mac)
  • /usr/share/nltk_data (Unix)

Next, go to the All Packages tab, select the punkt package, and press the Download button. The window should look like the picture below.

[Picture: Data]

Source: https://www.nltk.org/data.html
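
On a machine without a display the downloader window cannot open. `nltk.download` also accepts a package identifier, so the same packages can be fetched from code. A minimal sketch, assuming the three packages this README names; `REQUIRED_PACKAGES` and `download_required` are illustrative names, not part of NLTK.

```python
import nltk

# Package identifiers named in this README: tokenizer models, stopwords, WordNet.
REQUIRED_PACKAGES = ("punkt", "stopwords", "wordnet")


def download_required(packages=REQUIRED_PACKAGES, download_dir=None):
    """Download each package without opening the downloader window.

    Returns a dict mapping package name to the True/False result reported by
    nltk.download. download_dir=None lets NLTK choose its default data directory.
    """
    return {
        name: nltk.download(name, download_dir=download_dir, quiet=True)
        for name in packages
    }
```

Calling `download_required()` from the interpreter fetches all three packages. Some newer NLTK releases may ask for `punkt_tab` instead of `punkt`; in that case the `LookupError` raised by the tokenizer names the resource to download.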

NLTK Stopwords Corpus

The steps to download the stopwords data are similar to those for NLTK Data. Follow the instructions below.

Go to the All Packages tab, select the stopwords package, and press the Download button. The window should look like the picture below.

[Picture: Stopwords]

NLTK WordNet

The steps to download the wordnet data are similar to those for NLTK Data. Follow the instructions below.

Go to the All Packages tab, select the wordnet package, and press the Download button. The window should look like the picture below.

[Picture: WordNet]