/data_mining_project_22

Predicting the popularity of movies from the IMDb movies dataset using multiple regression algorithms such as XGBoost, Gradient Boosting, regularized regressors, and a Stacking Regressor. Includes extensive data cleaning, feature engineering, and transformation techniques such as winsorization and log-transformation.


Data Mining Project 2022

Business objective

Our objective is to identify US-produced movies whose copyrights are likely to yield a high ROI, as measured by popularity amongst movie-goers, so that we can invest in them.

General approach

In this project, we tested multiple supervised predictive models and examined the top three in detail: XGBRegressor, GradientBoostingRegressor, and RandomForestRegressor. We measure performance using adjusted R2 (to account for the number of features) and RMSE. Based on our analysis, our XGBoost model explains 69% of the variation in the log-transformed target variable, as measured by adjusted R2.
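For readers unfamiliar with the two metrics, here is a minimal sketch of how adjusted R2 and RMSE can be computed from predictions. It is an illustration only, not the code from the notebooks: the function names `rmse` and `adjusted_r2` and the `n_features` parameter are assumptions introduced here, and the formulas are the standard textbook definitions.

```python
import numpy as np


def rmse(y_true, y_pred):
    """Root mean squared error between targets and predictions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


def adjusted_r2(y_true, y_pred, n_features):
    """R2 penalized for the number of predictors (hypothetical helper).

    Uses 1 - (1 - R2) * (n - 1) / (n - p - 1), where n is the number
    of samples and p is the number of features.
    """
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    n = y_true.size
    if n - n_features - 1 <= 0:
        raise ValueError("need more samples than features + 1")
    ss_res = np.sum((y_true - y_pred) ** 2)          # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares
    r2 = 1.0 - ss_res / ss_tot
    return float(1.0 - (1.0 - r2) * (n - 1) / (n - n_features - 1))
```

Because the target is log-transformed, both metrics computed this way are on the log scale; an RMSE in the original popularity units would require transforming the predictions back first.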

The data directory has the small datasets used. The ipynb and html versions of the code are in 'notebooks'.