Money Makers: Understanding the Factors that Influence Stream Revenue

Background

Technology has reshaped the music industry. Music once made money through record sales and radio airplay, but online streaming platforms have changed that for good. Listeners no longer need to buy an entire record to enjoy a few of its songs; they can pick individual tracks from any album. But if that’s the case, how do songs generate money, and which songs are the cash cows?

Our project explores music genres and the revenue implications for the most popular ones. We use Spotify streaming data to analyze trends and see where popularity concentrates. We use Python libraries such as pandas, numpy, sklearn, seaborn, and matplotlib to analyze and visualize the data. We set out to understand the numerical and categorical features of the data and to isolate the relevant information, which is necessary to train and test machine learning models effectively.

File:

Spotify.csv

Objective

  • Data Acquisition: Use two years of Spotify Top 200 chart data
  • Data Exploration: Use libraries like pandas and numpy to manipulate the data
  • Data Modeling: Split the data into training and test sets, scale it, then fit models and generate predictions
  • Visualization/Performance Analysis: Use matplotlib and hvplot to visualize and analyze the results

Hypothesis

We can take the top 200 songs from the last two years on Spotify to train a machine learning model that will produce a forecast of the most streamed genres. This information will be used to predict the amount of revenue certain songs will generate and how much certain artists will earn. This premise places importance on data analytics as one foundation for music production.

How to Run the Project

1. Install Packages

  • pandas for data manipulation
  • numpy for numerical operations
  • matplotlib, seaborn, and hvplot for visualization
  • scikit-learn (sklearn) for preprocessing, train/test splitting, Random Forest, and grid search
  • imbalanced-learn (imblearn) for the ClusterCentroids and RandomOverSampler resamplers used below
  • xgboost for gradient-boosted tree models with fast training
  • PyTorch for neural network models

2. Data Exploration

During filtering, we found categorical features with values that fell outside the scope of our analysis. We removed entries that did not specify a genre and those designated as global, and dropped the columns we did not use for modeling.

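import pandas as pd
df = pd.read_csv('Spotify.csv')  # assumed path; file name taken from the "File" section
# Drop identifier, image, and metadata columns not used for modeling, then rows with missing values
spotify = df.drop(columns = ['Unnamed: 0', 'uri', 'artist_names', 'artist_img', 'artist_individual', 
                             'album_cover', 'artist_id', 'track_name', 'source', 'pivot', 'release_date', 'collab'])
spotify = spotify.dropna()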

During our data exploration phase, we conducted analyses on both numerical and categorical features. The heatmap for numerical features indicated minimal to no correlation between features and streams. For categorical features, we examined country, region, artist genre, and language. Subsequently, we narrowed our focus to the top 10 countries and top 10 genres due to the large size of our dataset.

spotify_filter = spotify_filter.loc[(spotify_filter['country'].isin(top10_country)) &
                                    (spotify_filter['artist_genre'].isin(top10_genre))]
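The filter above uses top10_country and top10_genre without showing how they were built. A minimal sketch of one way to derive them, assuming the lists are ranked by total streams (the notebook may rank by row count instead). The toy data, the helper top_n_by_streams, and the use of n = 2 are illustrative only:

```python
import pandas as pd

# Toy stand-in for the cleaned chart data (values are made up).
spotify = pd.DataFrame({
    "country": ["US", "US", "BR", "BR", "MX", "US", "DE"],
    "artist_genre": ["pop", "rap", "pop", "funk", "pop", "pop", "rap"],
    "streams": [900, 500, 700, 300, 400, 800, 100],
})

def top_n_by_streams(df, column, n):
    """Return the n values of `column` with the largest total streams."""
    totals = df.groupby(column)["streams"].sum()
    return totals.nlargest(n).index.tolist()

# n = 2 keeps the toy example readable; the project uses the top 10.
top_country = top_n_by_streams(spotify, "country", 2)
top_genre = top_n_by_streams(spotify, "artist_genre", 2)

# Keep only rows that fall in both top lists.
spotify_filter = spotify.loc[
    spotify["country"].isin(top_country)
    & spotify["artist_genre"].isin(top_genre)
]
print(top_country, top_genre, len(spotify_filter))
```

Filtering from the unfiltered frame (spotify) rather than from spotify_filter itself avoids referencing a variable before it is defined.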

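3. Preparing Train & Test Data

First, we converted the categorical features to numeric dummy (one-hot) variables for the final dataset. To undersample the majority classes we used ClusterCentroids, which clusters the samples of each over-represented class and replaces them with the cluster centroids, producing a smaller, balanced training set. To oversample the minority classes we used RandomOverSampler, which duplicates randomly chosen minority-class samples.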

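from imblearn.over_sampling import RandomOverSampler
from imblearn.under_sampling import ClusterCentroids

# Undersample: replace majority-class samples with cluster centroids
cc = ClusterCentroids(random_state = 1)
X_under_resampled, y_under_resampled = cc.fit_resample(X_train_scaled, y_train)
# Oversample: randomly duplicate minority-class samples
ros = RandomOverSampler(random_state = 42)
X_over_resampled, y_over_resampled = ros.fit_resample(X_train_scaled, y_train)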
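The resampling calls above expect X_train_scaled and y_train to exist already. A minimal sketch of the encoding, splitting, and scaling steps that would produce them, assuming artist_genre is the target and using a made-up toy frame (the column choices, split ratio, and random_state are illustrative, not taken from the notebook):

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Toy data with two numeric features and one categorical feature.
data = pd.DataFrame({
    "danceability": [0.8, 0.6, 0.7, 0.5, 0.9, 0.4, 0.65, 0.75],
    "loudness": [-5.0, -7.2, -6.1, -9.0, -4.3, -10.5, -6.8, -5.5],
    "language": ["en", "es", "en", "pt", "es", "en", "pt", "en"],
    "artist_genre": ["pop", "rap", "pop", "rap", "pop", "rap", "pop", "rap"],
})

# One-hot encode the categorical feature; keep the target separate.
X = pd.get_dummies(data.drop(columns="artist_genre"), columns=["language"], dtype=int)
y = data["artist_genre"]

# Stratify so both splits keep the same class proportions.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=1
)

# Fit the scaler on the training split only, then apply it to both splits.
scaler = StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)
print(X_train_scaled.shape, X_test_scaled.shape)
```

Fitting the scaler and the resamplers on the training split only keeps information from the test set out of training.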

4. Model Training

In this section we trained three models on both the undersampled and oversampled data: Random Forest, XGBoost, and a neural network built with PyTorch. We trained them on the resampled datasets from section 3 to counter the class imbalance.

4.1 Random Forest

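from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

  • Undersample
# 5-fold grid search over param_grid (defined elsewhere in the notebook)
rf_under = RandomForestClassifier(random_state = 2, max_features = 'sqrt')
clf_under = GridSearchCV(estimator = rf_under, param_grid = param_grid, cv = 5)
clf_under.fit(X_under_resampled, y_under_resampled)
  • Oversample
# Same search, fitted on the oversampled training data
rf_over = RandomForestClassifier(random_state = 2, max_features = 'sqrt')
clf_over = GridSearchCV(estimator = rf_over, param_grid = param_grid, cv = 5)
clf_over.fit(X_over_resampled, y_over_resampled)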
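The grid search above depends on a param_grid that is not shown. A minimal sketch of what such a grid could look like, run on synthetic data so it is self-contained. The hyperparameters and their values are hypothetical, not the ones used in the project:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the resampled training data.
X_demo, y_demo = make_classification(
    n_samples=200, n_features=8, n_informative=4, n_classes=3, random_state=2
)

# Hypothetical grid: every combination is evaluated with 5-fold cross-validation.
param_grid = {
    "n_estimators": [25, 50],
    "max_depth": [3, None],
}

rf = RandomForestClassifier(random_state=2, max_features="sqrt")
clf = GridSearchCV(estimator=rf, param_grid=param_grid, cv=5)
clf.fit(X_demo, y_demo)

# best_params_ holds the winning combination, best_score_ its mean CV accuracy.
print(clf.best_params_)
print(round(clf.best_score_, 3))
```

After fitting, clf.predict uses the best estimator refitted on the full training data, which is the default behavior of GridSearchCV.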

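4.2 XGBoost

  • Undersample
# x_train_xgb / y_train_xgb and x_valid / y_valid are presumably split from the undersampled data before this call
xgb_clf_under.fit(x_train_xgb, y_train_xgb, eval_set = [(x_valid, y_valid)], verbose = True)
  • Oversample
# The same variable names are presumably reassigned from the oversampled data before this call
xgb_clf_over.fit(x_train_xgb, y_train_xgb, eval_set = [(x_valid, y_valid)], verbose = True)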

4.3 PyTorch

import torch
from torch.utils.data import DataLoader, TensorDataset

  • Undersample
# Wrap the undersampled arrays as tensors and batch them for training
X_tensor_under = torch.tensor(X_under_resampled, dtype = torch.float32)
y_tensor_under = torch.tensor(y_under_resampled, dtype = torch.long)
dataset_under = TensorDataset(X_tensor_under, y_tensor_under)
train_loader_under = DataLoader(dataset_under, batch_size = 64, shuffle = True)
  • Oversample
# Same steps for the oversampled arrays
X_tensor_over = torch.tensor(X_over_resampled, dtype = torch.float32)
y_tensor_over = torch.tensor(y_over_resampled, dtype = torch.long)
dataset_over = TensorDataset(X_tensor_over, y_tensor_over)
train_loader_over = DataLoader(dataset_over, batch_size = 64, shuffle = True)

Overall, after experimenting with these models, we concluded that Random Forest and XGBoost performed best. The PyTorch network proved less suitable for our imbalanced classes.

5. Model Evaluation

With the models trained, we ran each one (Random Forest, XGBoost, and PyTorch) on the test set and evaluated the results for both the undersampled and oversampled training data. Random Forest accuracy was modest: 0.54 undersampled and 0.55 oversampled. XGBoost performed similarly, at 0.53 and 0.56. PyTorch surprised us: its accuracy was 0.11 undersampled and 0.13 oversampled, roughly what random guessing would yield if the ten classes were balanced. One likely reason Random Forest and XGBoost performed better is that tree ensembles handle a mix of categorical and numerical features well.

pred_y_xgb_over = xgb_clf_over.predict(X_test_scaled)
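Once predictions are available, overall accuracy can hide large differences between classes. A minimal sketch of how accuracy, per-class recall, and a confusion matrix could be computed with scikit-learn; the labels below are made up for illustration and are not project results:

```python
from sklearn.metrics import accuracy_score, confusion_matrix, recall_score

# Made-up true and predicted genre labels.
y_true = ["pop", "pop", "pop", "rap", "rap", "rock", "rock", "rock"]
y_pred = ["pop", "pop", "rap", "rap", "rap", "pop", "rock", "rock"]
labels = ["pop", "rap", "rock"]

# Share of all predictions that are correct.
accuracy = accuracy_score(y_true, y_pred)

# Recall per class: of the true members of each class, how many were found.
per_class_recall = recall_score(y_true, y_pred, labels=labels, average=None)

# Rows are true classes, columns are predicted classes, both in `labels` order.
cm = confusion_matrix(y_true, y_pred, labels=labels)

print(accuracy)
print(dict(zip(labels, per_class_recall.round(2))))
print(cm)
```

Reading the confusion matrix row by row shows which classes are mistaken for which, something a single accuracy figure cannot show.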

6. Feature Importance

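From the evaluation, we extracted the most important features of a song. The feature importance for the undersampled model:

[Figure: feature importance, undersampled model]

The feature importance for the oversampled model:

[Figure: feature importance, oversampled model]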

In both cases, speechiness, acousticness, danceability, and loudness are the top four most important features.
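A minimal sketch of how a feature ranking like this can be produced, assuming the importances come from the feature_importances_ attribute of a fitted tree ensemble (the source does not say which model was used). The data is synthetic and the feature names are neutral placeholders, so the printed ranking carries no meaning about songs:

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data; in the notebook the index would be the training column names.
feature_names = [f"feature_{i}" for i in range(6)]
X_demo, y_demo = make_classification(
    n_samples=300, n_features=6, n_informative=3, n_redundant=0,
    n_classes=3, n_clusters_per_class=1, random_state=0,
)

model = RandomForestClassifier(n_estimators=100, random_state=2).fit(X_demo, y_demo)

# Impurity-based importances sum to 1; sort them to get a ranking.
importances = (
    pd.Series(model.feature_importances_, index=feature_names)
    .sort_values(ascending=False)
)
print(importances)
```

Impurity-based importances can favor high-cardinality features, so a permutation-based check on held-out data is a common way to confirm a ranking.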

7. Deeper Analysis of Per-Class Accuracy

An earlier section looked at correlation across the whole dataset. Here we ran separate correlation analyses on the first seven classes and the last three classes to find out why prediction accuracy differs so much between the two groups.

The correlation matrix for the first seven classes:

[Figure: correlation matrix, first seven classes]

The correlation matrix for the last three classes:

[Figure: correlation matrix, last three classes]

The features are visibly more strongly correlated within the last three classes than within the first seven.
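A minimal sketch of this kind of subset comparison, assuming the classes are encoded as integers 0 to 9 in a column named genre_class (a hypothetical name) and using random data in place of the real features. The helper mean_abs_offdiag is also illustrative; it condenses each matrix into one number for a quick comparison:

```python
import numpy as np
import pandas as pd

# Random stand-in data: two related features, one independent feature.
rng = np.random.default_rng(0)
n = 200
base = rng.normal(size=n)
demo = pd.DataFrame({
    "danceability": base + rng.normal(scale=0.5, size=n),
    "loudness": base + rng.normal(scale=0.5, size=n),
    "speechiness": rng.normal(size=n),
    "genre_class": rng.integers(0, 10, size=n),
})

# Split rows by class group, then correlate the features within each group.
first_seven = demo[demo["genre_class"] < 7]
last_three = demo[demo["genre_class"] >= 7]
corr_first = first_seven.drop(columns="genre_class").corr()
corr_last = last_three.drop(columns="genre_class").corr()

def mean_abs_offdiag(corr):
    """Average absolute correlation, ignoring the diagonal of ones."""
    values = corr.to_numpy()
    mask = ~np.eye(len(values), dtype=bool)
    return float(np.abs(values[mask]).mean())

print(mean_abs_offdiag(corr_first), mean_abs_offdiag(corr_last))
```

Either matrix can be passed to seaborn's heatmap function to produce a plot like the ones in this section.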

8. Revenue Forecast

At this point we have compiled enough data to forecast streams and revenue.

We use the test dataset to compare historical average streams with predicted average streams per genre.

[Figure: historical vs. predicted average streams per genre]

We take $0.004 as a mid-point estimate of revenue per stream and multiply it by the historical and predicted streams to estimate revenue. That comparison is as follows:

[Figure: historical vs. predicted revenue per genre]
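The revenue step is a single multiplication per genre. A minimal sketch using the $0.004 rate from the text; the genres and stream counts are made up for illustration, and the column names are hypothetical:

```python
import pandas as pd

REVENUE_PER_STREAM = 0.004  # USD, mid-point estimate from the text

# Made-up average stream counts per genre.
streams = pd.DataFrame({
    "artist_genre": ["pop", "rap"],
    "historical_avg_streams": [1_000_000, 250_000],
    "predicted_avg_streams": [1_200_000, 200_000],
}).set_index("artist_genre")

# Revenue = streams x rate, applied to both columns at once.
revenue = (streams * REVENUE_PER_STREAM).rename(
    columns=lambda c: c.replace("streams", "revenue_usd")
)
print(revenue)
```

With these toy numbers, 1,000,000 historical streams for pop correspond to about $4,000 in revenue.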

Conclusion

  • Random Forest and XGBoost performed similarly on both the undersampled and oversampled data (accuracy between 0.53 and 0.56).
  • The same features (speechiness, acousticness, danceability, and loudness) ranked as most important regardless of how the classes were resampled.

Credits

Thanks to the author of the data source: https://www.kaggle.com/datasets/yelexa/spotify200.