This code is a Python script for topic modeling using Non-Negative Matrix Factorization (NMF). Let's break down the analysis step by step:
- Importing Libraries: The code starts by importing the necessary libraries: `pandas`, `numpy`, `NMF` from `sklearn.decomposition`, and `CountVectorizer` from `sklearn.feature_extraction.text`. These libraries handle data manipulation, numerical computation, text vectorization, and NMF.
- Reading Data: It reads the training and test data from the CSV files "../dataset/topic/train.csv" and "../dataset/topic/test.csv" respectively, using `pd.read_csv`.
- Data Exploration: It explores the training data by checking its shape, displaying the first few rows, and identifying duplicated rows based on the "ABSTRACT" column.
- Data Preparation: It prepares the training data for machine learning by dropping the "ID" column and creating the variables `cv` and `nmf_model` for count vectorization and NMF respectively. `CountVectorizer` is configured with the parameters `max_df`, `min_df`, and `stop_words` to preprocess the text data.
- X-Y Transformation: It creates the feature matrices `X_train` and `y_test` from the "ABSTRACT" column of the training and test datasets respectively. The `fit_transform` method of `CountVectorizer` is used on the training data, while `transform` is used on the test data, so both matrices share the vocabulary learned from the training set. Despite its name, `y_test` holds the vectorized test abstracts, not labels.
- NMF Model Fitting: The NMF model is fitted to the training data using the `fit` method.
- Identifying Important Words for Topics: It prints the most important words for each topic by accessing the components of the fitted NMF model. Each component is a vector of word weights, and the words with the largest weights characterize that topic.
- Predicting Topics for Test Data: It predicts the topics for the test data by transforming the test matrix with the fitted NMF model and then taking the index of the maximum value along each row.
- Updating Test Data with Predicted Topics: It updates the test dataframe `df_test` by adding a new column "Topics" containing the predicted topics.
Overall, this script performs topic modeling on text data using NMF and then predicts topics for unseen data. It's a concise and structured approach for topic modeling in Python.
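The exploration and preparation steps that precede the modeling can be sketched in the same spirit. The small dataframe here is a hypothetical replacement for the training CSV, and `keep=False` (which flags every copy of a repeated abstract rather than only the later ones) is an assumption about how the duplicates are inspected.

```python
import pandas as pd

# Hypothetical training data with one repeated abstract.
df_train = pd.DataFrame({
    "ID": [1, 2, 3],
    "ABSTRACT": ["first abstract", "second abstract", "first abstract"],
})

print(df_train.shape)   # (rows, columns)
print(df_train.head())  # first few rows

# Rows whose "ABSTRACT" value occurs more than once.
duplicates = df_train[df_train.duplicated(subset="ABSTRACT", keep=False)]
print(duplicates)

# The "ID" column carries no textual signal, so it is dropped before modeling.
df_train = df_train.drop(columns=["ID"])
```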