Bag-of-words

Description

This is a mini-project I built when I first started learning NLP (Natural Language Processing). It implements a statistical language model based on word counts, representing a text as the bag (multiset) of the words that compose it. It's a very simple representation, and it was an introductory step that helped me get more familiar with NLP.
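The core idea can be illustrated in a few lines (a minimal sketch using only the standard library; the sample sentence is made up):

```python
from collections import Counter

# Bag-of-words in its simplest form: word order is discarded,
# and only per-word counts remain.
text = "the cat sat on the mat"
bow = Counter(text.split())
# bow["the"] == 2, every other word appears once
```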

Contents

Collected data

This project contains .txt and .csv files that correspond to the textual data we'll be processing. We have three main data sources:

  • amazon_cells_labelled.txt and its respective .csv file.
  • imdb_labelled.txt and its respective .csv file.
  • yelp_labelled.txt and its respective .csv file.

These files contain labelled data: each line carries a label of either 0 or 1, with 1 marking a positive statement and 0 a negative one.
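Loading one of these files can be sketched as follows (a hedged example: the tab-separated "sentence, label" layout and the column names are assumptions, and a small sample file is written here so the snippet is self-contained):

```python
import pandas as pd

# Write a tiny stand-in for one of the labelled files
# (assumed format: one "sentence<TAB>label" pair per line).
sample = "Great phone, works fine.\t1\nBattery died in a day.\t0\n"
with open("sample_labelled.txt", "w") as f:
    f.write(sample)

# Read it into a dataframe with hypothetical column names.
df = pd.read_csv("sample_labelled.txt", sep="\t",
                 header=None, names=["text", "label"])
```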

Description of the content of files

  • The amazon_cells_labelled.txt file contains customer feedback on a cellphone sold on Amazon.
  • The imdb_labelled.txt file, as its name suggests, contains reviews of film/cinema productions posted on the IMDb site.
  • The yelp_labelled.txt file contains customer reviews and recommendations of places (restaurants, bars, etc.) posted on the Yelp website.

Processing code

The main.py Python file holds all the code I wrote to parse, process, and structure the available textual data into the BoW model. In this file, we start by importing the necessary libraries and then move on to processing the data as follows:

  1. Implementing a clean_text() function to clean our textual data. It takes in a dataframe from which we remove symbols, links, punctuation, stopwords, etc. The text is tokenized into a list for easier manipulation of its contents, and the tokens are then stemmed to reduce each word to its base form.
  2. Performing an 80/20 train-test split on the processed data.
  3. Importing the GaussianNB (Naive Bayes) model and fitting it on the training data.
  4. Applying the model to make predictions and calculating performance metrics.
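The steps above can be sketched end to end with scikit-learn (a hedged sketch, not the actual main.py: the corpus and stopword list are made up, and stemming is omitted here to keep the example dependency-light, whereas the real pipeline would also apply a stemmer such as NLTK's PorterStemmer):

```python
import re
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score

# Tiny illustrative corpus (invented for this sketch); 1 = positive, 0 = negative.
texts = ["Great product, works perfectly", "Terrible battery, broke fast",
         "I love this phone", "Awful sound quality",
         "Excellent screen and speed", "Very disappointing purchase",
         "Best purchase ever", "Do not buy this"]
labels = [1, 0, 1, 0, 1, 0, 1, 0]

STOPWORDS = {"i", "this", "and", "not", "do", "very"}  # toy stopword list

def clean_text(text):
    """Lowercase, strip symbols/punctuation, and drop stopwords."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return " ".join(t for t in tokens if t not in STOPWORDS)

cleaned = [clean_text(t) for t in texts]

# Step 1-2: bag-of-words counts, then an 80/20 train-test split.
X = CountVectorizer().fit_transform(cleaned).toarray()  # GaussianNB needs dense input
X_train, X_test, y_train, y_test = train_test_split(
    X, labels, test_size=0.2, random_state=0)

# Steps 3-4: fit the Naive Bayes model, predict, and compute a metric.
model = GaussianNB().fit(X_train, y_train)
accuracy = accuracy_score(y_test, model.predict(X_test))
```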