- Created a web application with Flask that returns the predicted number of hours one listens to K-pop on a daily basis, with MAE ~ 1.2 hours (a minimal sketch of such an endpoint follows this list).
- Engineered variables/features from the text of each column.
- Explored the data to analyze the relationships among the features (both continuous and categorical features).
- Built five different regression models - linear, lasso, ridge, random forest, and XGBoost.
- Optimized the random forest and XGBoost models using GridSearchCV to find the best hyperparameters.
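As a rough illustration of the deployment step, here is a minimal Flask endpoint that loads a pickled model and returns a prediction. The file name `model.pkl`, the `/predict` route, and the JSON payload shape are assumptions for this sketch, not the exact setup used in the project.

```python
import pickle

import numpy as np
from flask import Flask, jsonify, request

app = Flask(__name__)

# Load the trained regression model (file name is an assumption)
with open("model.pkl", "rb") as f:
    model = pickle.load(f)

@app.route("/predict", methods=["POST"])
def predict():
    # Expect a JSON body like {"features": [<numeric feature values>]}
    features = np.array(request.get_json()["features"]).reshape(1, -1)
    prediction = model.predict(features)[0]
    return jsonify({"predicted_daily_hours": float(prediction)})

if __name__ == "__main__":
    app.run(debug=True)
```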
- As someone who was born and raised in South Korea, I grew up listening to K-pop. Over the years, K-pop became a global phenomenon and it still blows my mind how popular it became.
- So, I thought it would be cool to analyze K-pop using machine learning to explore interesting insights. Thank you Chanin (aka. Data Professor) for the idea!
- I had to do a bit of googling to find a dataset. After some searching, I came across a website with an Excel file: a survey conducted for a study on social media and K-pop, which I found very interesting. I liked the questions they asked, and I also liked that the survey was conducted recently.
- The dataset contains responses from 240 K-pop fans from all over the world, answering 22 survey questions. Dataset link: Rraman, Saanjanaa (2020): KPOP DATA.xlsx. figshare. Dataset. https://doi.org/10.6084/m9.figshare.12093648.v2
-
Data cleaning is an important step: you want the cleanest possible data for EDA and model building. If you put garbage into the model, you get garbage out.
-
Datasets often have leading and trailing whitespace in text fields, so I removed it with a small function. Then I dropped the first column, “Timestamp”, as it’s not useful.
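A minimal sketch of that cleanup, assuming the survey has been loaded into a pandas DataFrame (the Excel file name is taken from the dataset citation above):

```python
import pandas as pd

# Load the survey export
df = pd.read_excel("KPOP DATA.xlsx")

# Strip leading/trailing whitespace from column names and string values
df.columns = df.columns.str.strip()
df = df.apply(lambda col: col.str.strip() if col.dtype == "object" else col)

# Drop the "Timestamp" column since it is not useful for modeling
df = df.drop(columns=["Timestamp"])
```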
-
Since the column names are the full survey questions and are too long, I gave them short code names to simplify them.
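The renaming itself is just a dictionary passed to `rename`. The question text and mapping below are illustrative, not the exact ones from the survey; only `daily_MV_hr`, `life_chg`, and `money_src` are code names referenced later in this post.

```python
# Map the long survey questions to short code names (mapping shown here is illustrative)
rename_map = {
    "How many hours do you spend watching K-pop music videos daily?": "daily_MV_hr",
    "How has K-pop changed your life?": "life_chg",
    "What is your source of money for buying K-pop merchandise?": "money_src",
}
df = df.rename(columns=rename_map)
```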
-
There are three columns with null values. First, let’s check the columns with only one null value.
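Counting the missing values per column makes this easy to see:

```python
# Count missing values per column and show only the columns that have any
null_counts = df.isnull().sum()
print(null_counts[null_counts > 0])
```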
-
I found out that the null values in life_chg and money_src were “n/a”, so I simply replaced them with a string “none”.
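Assuming the code names above, the replacement is a simple `fillna`:

```python
# The missing answers in these two columns effectively mean "n/a",
# so replace them with the string "none"
df["life_chg"] = df["life_chg"].fillna("none")
df["money_src"] = df["money_src"].fillna("none")
```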
-
For the “daily_MV_hr” column, I decided to replace the null values with the average. There are multiple ways of handling null values (deleting the rows, assigning a unique category, predicting the missing values with a regression model, etc.), but I thought replacing them with the average value was the best choice here.
-
I noticed that some of the categories were ranges, so I took the midpoint of each range for simplicity; for a range of 1 to 4 hours, for instance, the mean of 1 and 4 is 2.5 hours. I also removed the word “hours” so the values become numeric. I wrote a small function to take care of this (sketched below).
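A sketch of such a helper, assuming the answers look like "1-4 hours" or a single number (the exact answer formats in the survey may differ):

```python
import re

import numpy as np
import pandas as pd

def hours_to_number(value):
    """Convert an answer like '1-4 hours' to a number, using the
    midpoint when the answer is a range."""
    if pd.isnull(value):
        return np.nan
    numbers = [float(n) for n in re.findall(r"\d+\.?\d*", str(value))]
    if not numbers:
        return np.nan
    return sum(numbers) / len(numbers)

df["daily_MV_hr"] = df["daily_MV_hr"].apply(hours_to_number)

# Replace the remaining missing values with the column mean
df["daily_MV_hr"] = df["daily_MV_hr"].fillna(df["daily_MV_hr"].mean())
```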
-
I realized that this dataset is kind of messy. So I repeated similar steps to clean each column.
-
I saved the cleaned data frame to a CSV file for the next part of the tutorial.
-
Checked histograms, boxplots, and the correlation matrix of the continuous variables.
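A sketch of those plots with pandas and seaborn, assuming the cleaned DataFrame is still called `df`:

```python
import matplotlib.pyplot as plt
import seaborn as sns

num_cols = df.select_dtypes(include="number").columns

# Histograms of the continuous variables
df[num_cols].hist(figsize=(10, 8))
plt.tight_layout()
plt.show()

# Boxplots of the continuous variables
df[num_cols].boxplot(figsize=(10, 6))
plt.xticks(rotation=45)
plt.show()

# Correlation matrix as an annotated heatmap
sns.heatmap(df[num_cols].corr(), annot=True, cmap="coolwarm")
plt.show()
```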
-
We can see these relationships:
- The number of years they have listened to K-pop is positively correlated with the number of hours they listen to K-pop, the money they spend on merchandise, and age.
- The number of hours fans spend watching K-pop music videos on YouTube is positively correlated with the number of hours they listen to K-pop.
- The more time they spend listening to K-pop, the more money they spend on K-pop merchandise.
- The more K-pop YouTube videos they watch and the more K-pop they listen to, the more groups they like.
- The younger they are, the more time they spend listening to K-pop and watching K-pop videos.
- Age shows little to no correlation with how much money they spend on K-pop merchandise per year.
-
Checked bar plots for categorical variables.
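A quick way to do that, assuming the categorical answers are stored as strings:

```python
import matplotlib.pyplot as plt

# Bar plot of answer counts for each categorical column
cat_cols = df.select_dtypes(include="object").columns
for col in cat_cols:
    df[col].value_counts().plot(kind="bar", title=col, figsize=(8, 4))
    plt.tight_layout()
    plt.show()
```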
-
Found relationships among continuous variables and categorical variables using pivot tables.
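For example, a pivot table of average listening hours by a categorical feature looks like this ("gender" and "daily_music_hr" are illustrative column names, not necessarily the exact code names used):

```python
import pandas as pd

# Average daily listening hours grouped by a categorical feature
# (column names here are illustrative)
print(pd.pivot_table(df, index="gender", values="daily_music_hr", aggfunc="mean"))
```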
- Built five different regression models - Linear, Lasso, Ridge, Random Forest, and XGBoost. Optimized the Random Forest and XGBoost models using GridSearchCV to find the best hyperparameters (see the sketch below).
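A condensed sketch of that workflow with scikit-learn and xgboost; the target column name ("daily_music_hr"), the parameter grids, and the train/test split are assumptions for illustration, not the exact settings used.

```python
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Lasso, LinearRegression, Ridge
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import GridSearchCV, train_test_split
from xgboost import XGBRegressor

# Dummify the categorical features; "daily_music_hr" stands in for the target column
X = pd.get_dummies(df.drop(columns=["daily_music_hr"]))
y = df["daily_music_hr"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Baseline linear models, scored with MAE on the test set
for model in (LinearRegression(), Lasso(), Ridge()):
    model.fit(X_train, y_train)
    print(type(model).__name__, mean_absolute_error(y_test, model.predict(X_test)))

# Tune the random forest with GridSearchCV (parameter grid is illustrative)
rf_grid = GridSearchCV(
    RandomForestRegressor(random_state=42),
    param_grid={"n_estimators": [100, 300], "max_depth": [None, 5, 10]},
    scoring="neg_mean_absolute_error",
    cv=5,
)
rf_grid.fit(X_train, y_train)

# Same idea for XGBoost
xgb_grid = GridSearchCV(
    XGBRegressor(objective="reg:squarederror"),
    param_grid={"n_estimators": [100, 300], "learning_rate": [0.05, 0.1], "max_depth": [3, 5]},
    scoring="neg_mean_absolute_error",
    cv=5,
)
xgb_grid.fit(X_train, y_train)

print("Random forest CV MAE:", -rf_grid.best_score_)
print("XGBoost CV MAE:", -xgb_grid.best_score_)
```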
- XGBoost is the best model (MAE ~ 1.2 hours)