/NMF_Topic_modeling_sklearn

Python code to implement nmf with sklearn

Primary LanguageJupyter NotebookMIT LicenseMIT

This code is a Python script for topic modeling using Non-Negative Matrix Factorization (NMF). Let's break down the analysis step by step:

  1. Importing Libraries: The code starts by importing necessary libraries including pandas, numpy, NMF from sklearn.decomposition, and CountVectorizer from sklearn.feature_extraction.text. These libraries are essential for data manipulation, numerical computations, and performing NMF.

  2. Reading Data: It reads training and test data from CSV files located at "../dataset/topic/train.csv" and "../dataset/topic/test.csv" respectively using pd.read_csv.

  3. Data Exploration: It explores the training data by checking its shape, displaying the first few rows, and identifying duplicated rows based on the "ABSTRACT" column.

  4. Data Preparation: It prepares the training data for machine learning by dropping the "ID" column and creating variables (cv and nmf_model) for Count Vectorization and NMF respectively. CountVectorizer is configured with parameters max_df, min_df, and stop_words to preprocess the text data.

  5. X-Y Transformation: It creates feature matrices X_train and y_test using the "ABSTRACT" column from the training and test datasets respectively. The fit_transform method of CountVectorizer is used on training data while transform is used on test data.

  6. NMF Model Fitting: The NMF model is fitted to the training data using the fit method.

  7. Identifying Important Words for Topics: It prints the most important words for each topic by accessing the components of the fitted NMF model. These words are determined by their weights in the NMF components.

  8. Predicting Topics for Test Data: It predicts the topics for the test data by transforming the test data using the fitted NMF model and then identifying the index of the maximum value along each row.

  9. Updating Test Data with Predicted Topics: It updates the test dataframe df_test by adding a new column "Topics" containing the predicted topics.

Overall, this script performs topic modeling on text data using NMF and then predicts topics for unseen data. It's a concise and structured approach for topic modeling in Python.