On Day 15 of the 100 Days of ML challenge, we explore two Natural Language Processing (NLP) concepts, tokenization and stemming, using the NLTK library in Python. The project preprocesses a sample text by performing word and sentence tokenization, stopword removal, and stemming.
The project is structured as follows:
- Import the required libraries
- Download necessary NLTK data
- Define a sample text for preprocessing
- Perform word and sentence tokenization (see the first sketch after this list)
- Remove stopwords from the tokenized words
- Reduce words to their root form using stemming
- Visualize the results with a bar chart of word frequencies (see the second sketch after this list)
- Implement a unit test to verify that the code works correctly (see the third sketch after this list)
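A minimal sketch of the preprocessing steps, assuming NLTK's `punkt` tokenizer, English stopword list, and Porter stemmer; the sample text here is a hypothetical stand-in for the one defined in the project:

```python
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import sent_tokenize, word_tokenize

# One-time downloads of the tokenizer models and the stopword list
# (newer NLTK releases may also require the "punkt_tab" resource)
nltk.download("punkt")
nltk.download("stopwords")

# Hypothetical sample text; the project defines its own
sample_text = (
    "Natural Language Processing makes it possible for computers to read text. "
    "Tokenization and stemming are common first steps in preprocessing."
)

# Sentence and word tokenization
sentences = sent_tokenize(sample_text)
words = word_tokenize(sample_text)

# Remove English stopwords and punctuation-only tokens
stop_words = set(stopwords.words("english"))
filtered_words = [w for w in words if w.lower() not in stop_words and w.isalpha()]

# Reduce the remaining words to their root form with the Porter stemmer
stemmer = PorterStemmer()
stemmed_words = [stemmer.stem(w) for w in filtered_words]

print("Sentences:", sentences)
print("Filtered words:", filtered_words)
print("Stemmed words:", stemmed_words)
```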
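A sketch of the word-frequency bar chart, assuming Matplotlib; the small `stemmed_words` list below is a stand-in for the output of the preprocessing step above:

```python
from collections import Counter

import matplotlib.pyplot as plt

# Stand-in for the stemmed words produced by the preprocessing sketch
stemmed_words = ["natur", "languag", "process", "token", "stem", "token", "stem", "text"]

# Count how often each stemmed word occurs
frequencies = Counter(stemmed_words)
labels, counts = zip(*frequencies.most_common())

# Plot the frequencies as a bar chart
plt.figure(figsize=(8, 4))
plt.bar(labels, counts)
plt.xlabel("Stemmed word")
plt.ylabel("Frequency")
plt.title("Word frequencies after preprocessing")
plt.tight_layout()
plt.show()
```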
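A sketch of the unit test, built around a hypothetical `preprocess` helper that wraps the tokenization, stopword-removal, and stemming steps; the project's actual test may check different behaviour:

```python
import unittest

from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize


def preprocess(text):
    """Tokenize, drop English stopwords and punctuation, and stem the remaining words."""
    stop_words = set(stopwords.words("english"))
    stemmer = PorterStemmer()
    words = [w for w in word_tokenize(text) if w.lower() not in stop_words and w.isalpha()]
    return [stemmer.stem(w) for w in words]


class TestPreprocessing(unittest.TestCase):
    def test_stopwords_removed(self):
        # "The" and "is" are English stopwords and should not survive preprocessing
        result = preprocess("The cat is running")
        self.assertNotIn("the", result)
        self.assertNotIn("is", result)

    def test_stemming(self):
        # The Porter stemmer reduces "running" to "run"
        self.assertIn("run", preprocess("The cat is running"))


if __name__ == "__main__":
    unittest.main()
```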
To run this project, you will need:
- Python 3.6 or higher
- The NLTK library
- The Matplotlib library
After running the project, you should see:
- The word- and sentence-tokenized versions of the sample text
- The filtered words with stopwords removed
- The stemmed words reduced to their root form
- A bar chart displaying the word frequencies
- A message indicating that the unit test has passed and the code works as expected