100 Days of ML - Day 15

NLP, Tokenization, and Stemming

On Day 15 of the 100 Days of ML challenge, we explore the Natural Language Processing (NLP) concepts of tokenization and stemming using the NLTK library in Python. The project preprocesses a sample text by performing word and sentence tokenization, stopword removal, and stemming.

Project Structure

The project is structured as follows (a minimal code sketch of these steps appears after the list):

  • Import the required libraries
  • Download necessary NLTK data
  • Define a sample text for preprocessing
  • Perform word and sentence tokenization
  • Remove stopwords from the tokenized words
  • Reduce words to their root form using stemming
  • Visualize the results with a bar chart of word frequencies
  • Implement a unit test to ensure the code works correctly
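The pipeline above can be sketched roughly as follows. This is an illustrative sketch rather than the notebook's exact code: the sample text, the choice of the Porter stemmer, and the variable names are assumptions.

```python
import nltk
from nltk.tokenize import word_tokenize, sent_tokenize
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from collections import Counter
import matplotlib.pyplot as plt

# Download the NLTK data used below (newer NLTK releases may also need "punkt_tab")
nltk.download("punkt")
nltk.download("stopwords")

# Hypothetical sample text; the notebook defines its own
sample_text = ("Natural Language Processing lets computers work with text. "
               "Tokenization and stemming are common first preprocessing steps.")

# Word and sentence tokenization
sentences = sent_tokenize(sample_text)
words = word_tokenize(sample_text)

# Remove English stopwords (and punctuation) from the tokenized words
stop_words = set(stopwords.words("english"))
filtered_words = [w for w in words if w.lower() not in stop_words and w.isalpha()]

# Reduce words to their root form with the Porter stemmer
stemmer = PorterStemmer()
stemmed_words = [stemmer.stem(w) for w in filtered_words]

# Bar chart of word frequencies for the stemmed words
freqs = Counter(stemmed_words)
plt.bar(list(freqs.keys()), list(freqs.values()))
plt.xticks(rotation=45, ha="right")
plt.xlabel("Stemmed word")
plt.ylabel("Frequency")
plt.title("Word frequencies")
plt.tight_layout()
plt.show()
```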

Requirements

To run this project, you will need:

  • Python 3.6 or higher
  • NLTK library
  • Matplotlib library
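The two libraries are typically installed from PyPI, for example:

```
pip install nltk matplotlib
```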

Results

After running the project, you should see:

  • Word- and sentence-tokenized versions of the sample text
  • Filtered words with stopwords removed
  • Stemmed words reduced to their root form
  • A bar chart of the word frequencies
  • A message confirming that the unit test has passed and the code works correctly
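The unit test mentioned above could, for example, look like the hypothetical sketch below; the test class, test names, and assertions are illustrative and are not taken from the notebook.

```python
import unittest

from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize


class TestPreprocessing(unittest.TestCase):
    """Illustrative checks; the notebook's own test may assert different things."""

    def test_word_tokenize_splits_words_and_punctuation(self):
        # Requires the "punkt" tokenizer data downloaded earlier
        self.assertEqual(word_tokenize("Hello, world!"), ["Hello", ",", "world", "!"])

    def test_porter_stemmer_reduces_to_root_form(self):
        self.assertEqual(PorterStemmer().stem("running"), "run")


if __name__ == "__main__":
    # Pass argv and exit=False so this also runs cleanly inside a notebook kernel
    unittest.main(argv=["ignored"], exit=False)
    print("Unit test completed - the code is working correctly")
```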