NLP Topic Modeling Project for Clustering BBC News Articles

General Information About The Project:

The project is written in Python in a Jupyter Notebook format
The motivation behind the project is to produce a machine learning model that can categorize news articles by their specific topic
The work breaks the data down into 5 categories (i.e., K = 5). We interpreted the meaning of each category from its keywords
The dataset behind the project is available from kaggle.com Dataset Link
The dataset is used to cluster articles by their specific topic and to compare LDA vs. NMF
The project starts by conducting EDA in the preprocessing.ipynb file
The second part of the project conducts the unsupervised ML analysis in maincode.ipynb

LDA and NMF were applied to BBC news articles from March 2023 to June 2023
A K of 5 was selected for both Topic Modeling algorithms
NMF seemed to have an issue with one of the topics, which contained several repeated words; this could indicate that the selected K value was too large
The topics found by both algorithms covered European soccer, international soccer, internal politics of the U.K., and the war in Ukraine, all of which were highly relevant topics at the time of this work
A future improvement to the work is to conduct a coherence analysis over various K values and then select the K that produces the highest coherence score
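
The coherence-based choice of K could look roughly like the sketch below. It uses a simplified UMass-style score (averaged over word pairs) written in plain Python; the project does not say which coherence measure it would use, so this measure, the function names, and the input format (a hypothetical dict mapping each K to the top-word lists of its topics) are assumptions. Libraries such as gensim offer ready-made coherence implementations that could be used instead.

```python
import math
from itertools import combinations


def umass_coherence(top_words, tokenized_docs):
    """Average UMass-style coherence over all word pairs of one topic.

    top_words must be ordered from highest to lowest topic weight.
    """
    doc_sets = [set(doc) for doc in tokenized_docs]

    def doc_freq(*words):
        # Number of documents containing every given word.
        return sum(all(w in d for w in words) for d in doc_sets)

    scores = []
    for higher, lower in combinations(top_words, 2):
        base = doc_freq(higher)
        if base == 0:
            continue  # Word never occurs in the corpus; skip the pair.
        scores.append(math.log((doc_freq(higher, lower) + 1) / base))
    return sum(scores) / len(scores) if scores else float("-inf")


def mean_coherence(topics, tokenized_docs):
    """Mean coherence across all topics of one fitted model."""
    return sum(umass_coherence(t, tokenized_docs) for t in topics) / len(topics)


def pick_best_k(topics_by_k, tokenized_docs):
    """Return the K with the highest mean coherence, plus all scores."""
    scores = {
        k: mean_coherence(topics, tokenized_docs)
        for k, topics in topics_by_k.items()
    }
    best_k = max(scores, key=scores.get)
    return best_k, scores
```

In use, `topics_by_k` would be filled by fitting LDA or NMF once per candidate K and collecting the top words of each topic; the K with the highest score is then kept.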