NLP Topic Modeling Project for Clustering BBC News Articles

General Information About The Project:

  • The project is written in Python, in Jupyter Notebook format
  • The motivation behind the project is to produce a machine learning model that can categorize news articles by their specific topic
  • The work breaks the data down into 5 categories (i.e., K = 5). We interpreted the meaning of each category from its keywords
  • The dataset behind the project is available on kaggle.com (Dataset Link)
  • The dataset is used to cluster articles by topic and to compare Latent Dirichlet Allocation (LDA) vs. Non-negative Matrix Factorization (NMF)
  • The project starts by conducting exploratory data analysis (EDA) in the preprocessing.ipynb file
  • The second part of the project conducts the unsupervised ML analysis in maincode.ipynb

EDA Conclusions:

  • There are 5 features; however, this analysis focuses on only two:
    • title, the real title of a news article
    • description, the text of the news article
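
Selecting the two features could look like the sketch below. Only the column names title and description come from the notes above. The sample rows, the extra_column placeholder, the dropping of rows with missing values, and the combined text column are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical rows standing in for the Kaggle file.
raw = pd.DataFrame(
    {
        "title": ["Premier League title race tightens", None, "Parliament debates budget"],
        "description": [
            "Two clubs are level on points.",
            "Row with a missing title.",
            "MPs vote on spending plans.",
        ],
        "extra_column": [1, 2, 3],  # placeholder for the features not used here
    }
)

# Keep only the two features of interest and drop incomplete rows (assumption).
articles = raw[["title", "description"]].dropna().reset_index(drop=True)

# Assumption: title and description are joined into one text field for modeling.
articles["text"] = articles["title"].str.strip() + ". " + articles["description"].str.strip()

print(articles["text"].tolist())
```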

Conclusions:

  • LDA and NMF were applied to BBC news articles from March 2023 to June 2023
  • A K of 5 was selected for both Topic Modeling algorithms
  • NMF seemed to struggle with one of the topics, which contained various repeated words; this could indicate that the selected K value is too large
  • Topics from both algorithms covered European Soccer, International Soccer, Internal Politics of the U.K., and the Ukrainian War, all of which were very relevant topics at the time of this work
  • A future improvement is to conduct a coherence analysis over various K values and then select the K that produces the highest coherence score
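
One possible shape for that coherence analysis is sketched below. It is not part of the project: the UMass-style coherence measure, the use of NMF on TF-IDF features, the helper names, and the toy corpus are all assumptions, and a library such as gensim could be used instead.

```python
import numpy as np
from sklearn.decomposition import NMF
from sklearn.feature_extraction.text import TfidfVectorizer


def umass_coherence(top_word_ids, doc_word_binary):
    """UMass-style coherence of one topic from document co-occurrence counts."""
    score = 0.0
    for m in range(1, len(top_word_ids)):
        for l in range(m):
            w_m, w_l = top_word_ids[m], top_word_ids[l]
            docs_with_l = doc_word_binary[:, w_l].sum()
            docs_with_both = (doc_word_binary[:, w_m] * doc_word_binary[:, w_l]).sum()
            # +1 smoothing avoids log(0) when the two words never co-occur.
            score += np.log((docs_with_both + 1.0) / docs_with_l)
    return float(score)


def mean_coherence_per_k(texts, k_values, n_top_words=5):
    """Fit one NMF model per K and return the mean topic coherence for each K."""
    vectorizer = TfidfVectorizer(stop_words="english")
    tfidf = vectorizer.fit_transform(texts)
    binary = (tfidf.toarray() > 0).astype(int)  # 1 if the word occurs in the document
    results = {}
    for k in k_values:
        model = NMF(n_components=k, init="nndsvda", random_state=0, max_iter=500)
        model.fit(tfidf)
        scores = [
            umass_coherence(topic.argsort()[::-1][:n_top_words], binary)
            for topic in model.components_
        ]
        results[k] = float(np.mean(scores))
    return results


# Hypothetical toy corpus standing in for the BBC articles.
sample_texts = [
    "The striker scored twice as the club won the league match",
    "The manager praised the defence after the cup final victory",
    "The national team qualified for the world cup tournament",
    "The prime minister faced questions in parliament over the budget",
    "Opposition leaders criticised the government tax policy",
    "Ukrainian forces reported heavy shelling near the front line",
    "Russia launched missile strikes on cities across Ukraine",
    "The central bank raised interest rates to curb inflation",
]

coherence_by_k = mean_coherence_per_k(sample_texts, k_values=[2, 3, 4])
best_k = max(coherence_by_k, key=coherence_by_k.get)
print(coherence_by_k, "best K:", best_k)
```

On the real dataset the candidate K values would cover a wider range, and the selected K would be the one with the highest mean coherence.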