100 Days of ML - Day 15

NLP, Tokenization, and Stemming

On Day 15 of the 100 Days of ML challenge, we explore the Natural Language Processing (NLP) concepts of tokenization and stemming using the NLTK library in Python. The project preprocesses a sample text by performing word and sentence tokenization, stopword removal, and stemming.

Project Structure

The project is structured as follows (a minimal code sketch of these steps appears after the list):

  • Import the required libraries
  • Download necessary NLTK data
  • Define a sample text for preprocessing
  • Perform word and sentence tokenization
  • Remove stopwords from the tokenized words
  • Reduce words to their root form using stemming
  • Visualize the results with a bar chart of word frequencies
  • Implement a unit test to ensure the code works correctly
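The pipeline above can be sketched roughly as follows. This is an illustrative sketch rather than the notebook's exact code: the sample text, the choice of the Porter stemmer, and the variable names are assumptions.

```python
import nltk
from nltk.tokenize import word_tokenize, sent_tokenize
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from collections import Counter
import matplotlib.pyplot as plt

# Download the NLTK data used below (newer NLTK releases may also need "punkt_tab")
nltk.download("punkt")
nltk.download("stopwords")

# Hypothetical sample text; the notebook defines its own
sample_text = ("Natural Language Processing lets computers work with text. "
               "Tokenization and stemming are common first preprocessing steps.")

# Word and sentence tokenization
sentences = sent_tokenize(sample_text)
words = word_tokenize(sample_text)

# Remove English stopwords (and punctuation) from the tokenized words
stop_words = set(stopwords.words("english"))
filtered_words = [w for w in words if w.lower() not in stop_words and w.isalpha()]

# Reduce words to their root form with the Porter stemmer
stemmer = PorterStemmer()
stemmed_words = [stemmer.stem(w) for w in filtered_words]

# Bar chart of word frequencies for the stemmed words
freqs = Counter(stemmed_words)
plt.bar(list(freqs.keys()), list(freqs.values()))
plt.xticks(rotation=45, ha="right")
plt.xlabel("Stemmed word")
plt.ylabel("Frequency")
plt.title("Word frequencies")
plt.tight_layout()
plt.show()
```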

Requirements

To run this project, you will need:

  • Python 3.6 or higher
  • NLTK library
  • Matplotlib library
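The two libraries are typically installed from PyPI, for example:

```
pip install nltk matplotlib
```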

Results

After running the project, you should see:

  • Word- and sentence-tokenized versions of the sample text
  • Filtered words with stopwords removed
  • Stemmed words reduced to their root form
  • A bar chart of the word frequencies
  • A message confirming that the unit test has passed and the code works correctly
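The unit test mentioned above could, for example, look like the hypothetical sketch below; the test class, test names, and assertions are illustrative and are not taken from the notebook.

```python
import unittest

from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize


class TestPreprocessing(unittest.TestCase):
    """Illustrative checks; the notebook's own test may assert different things."""

    def test_word_tokenize_splits_words_and_punctuation(self):
        # Requires the "punkt" tokenizer data downloaded earlier
        self.assertEqual(word_tokenize("Hello, world!"), ["Hello", ",", "world", "!"])

    def test_porter_stemmer_reduces_to_root_form(self):
        self.assertEqual(PorterStemmer().stem("running"), "run")


if __name__ == "__main__":
    # Pass argv and exit=False so this also runs cleanly inside a notebook kernel
    unittest.main(argv=["ignored"], exit=False)
    print("Unit test completed - the code is working correctly")
```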