- The project is written in Python in a Jupyter Notebook format
- The motivation behind the project is to produce a machine learning model that can categorize news articles by their specific topic
- The work breaks the data down into 5 categories (i.e., K = 5). We interpreted each category's meaning from its keywords
- The dataset behind the project is available from kaggle.com Dataset Link
- The purpose of the project is to cluster articles by their specific topic and to compare LDA with NMF (non-negative matrix factorization)
- The project starts by conducting EDA in the preprocessing.ipynb file
- The second part of the project conducts the unsupervised ML analysis in maincode.ipynb
- There are 5 features; however, this analysis focuses on only two:
  - title, the real title of a news article
  - description, the text of the news article
- LDA and NMF were applied to BBC news articles from March 2023 to June 2023
- A K of 5 was selected for both Topic Modeling algorithms
- NMF seemed to struggle with one of the topics, producing several repeated words, which could indicate that the selected K value was too large
- The topics from both algorithms covered European Soccer, International Soccer, Internal Politics of the U.K., and the Ukrainian War, all of which were highly relevant at the time of this work
- A future improvement to the work is to conduct a coherence analysis over various K values, then select the K that produces the highest coherence score