DSGA-1001-Project2

Data cleaning and preprocessing

Following the logic given in the project document, I imputed the NaN values in the movie ratings with the average of two quantities: the mean rating that user gave to all other movies, and the mean rating that movie received from all other users. This resolved most of the concerns with missing data. However, I encountered additional NaN values in the Gender column and replaced them with a new category, '-1'. By doing this we ensure we are not introducing any bias into the existing data categories.
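A minimal sketch of this imputation, assuming the ratings sit in a users x movies pandas DataFrame named ratings (the name and layout are assumptions, not the actual project code):

```python
import pandas as pd

def impute_ratings(ratings: pd.DataFrame) -> pd.DataFrame:
    """Replace each missing rating with the mean of (a) that user's
    average over the movies they did rate and (b) that movie's average
    over the users who did rate it. Hypothetical helper, not the
    project's actual code."""
    user_means = ratings.mean(axis=1)   # per-user average (NaNs skipped)
    movie_means = ratings.mean(axis=0)  # per-movie average (NaNs skipped)
    filled = ratings.copy()
    for movie in filled.columns:
        missing = filled[movie].isna()
        filled.loc[missing, movie] = (user_means[missing] + movie_means[movie]) / 2
    return filled

# The Gender column was handled differently: NaNs became a new category.
# df['Gender'] = df['Gender'].fillna(-1)
```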

  1. For each of the 400 movies, use a simple linear regression model to predict the ratings. Use the ratings of the other 399 movies in the dataset to predict the ratings of each movie (that means you’ll have to build 399 models for each of the 400 movies). For each of the 400 movies, find the movie that predicts its ratings the best. Then report the average COD of those 400 simple linear regression models. Please include a histogram of these 400 COD values and a table with the 10 movies that are most easily predicted from the ratings of a single other movie and the 10 movies that are hardest to predict from the ratings of a single other movie (and their associated COD values, as well as which movie ratings are the best predictor, so this table should have 3 columns). Explanation: In this analysis, I performed 399 simple linear regressions for each of the 400 films in the dataset, treating one movie's ratings as the dependent variable and the ratings of each of the other 399 movies as the independent variable. The primary objective was to identify, for each target movie, the single predictor movie with the highest Coefficient of Determination (COD). This approach assumes a linear relationship between the target movie's ratings and the predictor's ratings, an absence of major confounders that could produce spurious correlations, and the appropriateness of COD as the metric of model performance. The average COD across the 400 best-predictor models was 0.34, i.e., in the 30-40% range. Notably, the presence of both positive and negative outliers, as illustrated in Figures 1 and 2, raises concerns about the assumptions of the simple linear model, particularly regarding the potential influence of confounders. A sketch of the search procedure follows.
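A sketch of this exhaustive search, assuming R is the imputed ratings matrix as an n_users x n_movies NumPy array (the variable name is an assumption); COD here is the in-sample R^2:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def best_single_predictor(R: np.ndarray):
    """For each movie (column of R), fit a simple linear regression on
    every other movie's ratings and keep the predictor with the
    highest COD (in-sample R^2)."""
    n_movies = R.shape[1]
    best_cod = np.full(n_movies, -np.inf)
    best_pred = np.zeros(n_movies, dtype=int)
    for target in range(n_movies):
        y = R[:, target]
        for pred in range(n_movies):
            if pred == target:
                continue
            model = LinearRegression().fit(R[:, [pred]], y)
            cod = model.score(R[:, [pred]], y)
            if cod > best_cod[target]:
                best_cod[target] = cod
                best_pred[target] = pred
    return best_cod, best_pred  # averaging best_cod gave ~0.34
```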

  2. For the 10 movies that are best and least well predicted from the ratings of a single other movie (so 20 in total), build multiple regression models that include gender identity (column 475), sibship status (column 476) and social viewing preferences (column 477) as additional predictors (in addition to the best predicting movie from question 1). Comment on how R^2 has changed relative to the answers in question 1. Please include a figure with a scatterplot where the old COD (for the simple linear regression models from the previous question) is on the x-axis and the new R^2 (for the new multiple regression models) is on the y-axis. Explanation: We added gender identity, sibship status, and social viewing preferences to each model, turning the simple regressions into multiple linear regressions. From question 1 we took 20 movies: the 10 with the highest CODs and the 10 with the lowest. We again assumed that there are no additional confounding variables. Even after adding these factors, the results showed only a modest increase in COD, with a maximum change of around 5%. This suggests that the additional features did not substantially enhance the predictive power of the models, and that the relationships in the data likely extend beyond simple linear dynamics, so more sophisticated regression models are needed. A sketch of the setup follows.
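A sketch of one such multiple regression, assuming data is the full imputed array and that the question's 1-indexed columns 475-477 map to 0-indexed columns 474-476 (an assumed layout):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def multiple_r2(data: np.ndarray, target_idx: int, best_idx: int) -> float:
    """R^2 of a multiple regression using the best single predictor from
    question 1 plus gender identity, sibship status, and social viewing
    preference (0-indexed columns 474-476, an assumption)."""
    X = data[:, [best_idx, 474, 475, 476]]
    y = data[:, target_idx]
    return LinearRegression().fit(X, y).score(X, y)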

  3. Pick 30 movies in the middle of the COD range, as identified by question 1 (that were not used in question 2). Now build a regularized regression model with the ratings from 10 other movies (picked randomly, or deliberately by you) as an input. Please use ridge regression, and make sure to do suitable hyperparameter tuning. Also make sure to report the RMSE for each of these 30 movies in a table, after doing an 80/20 train/test split. Comment on the hyperparameters you use and betas you find by doing so. Explanation: We sorted the movies dataframe by COD and picked the 30 movies at positions [250:280] as the middle of the range, plus 10 predictor movies from positions [200:210]. We applied ridge regression to each movie in the subset “middle_30” using the ratings of the subset “other_10_movies” as features, after an 80/20 train/test split. For each movie in middle_30, we performed hyperparameter tuning with GridSearchCV to find the regularization strength (alpha) that best avoids overfitting. The selected features and target values were split into training and testing sets, and a ridge regression model was trained using the optimal alpha, searching np.linspace(0.0001, 120, 400) for the alpha yielding the lowest RMSE. The Root Mean Squared Error (RMSE) and Coefficient of Determination (COD) were calculated for each movie and collected. The average RMSE was 0.42 and the average alpha was ~70, which is fairly high; interestingly, the average COD came out close to 0.32. A sketch of this per-movie pipeline follows.
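A sketch of this per-movie pipeline (cv=5 inside GridSearchCV is an assumption; the alpha grid is the one stated above):

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import GridSearchCV, train_test_split

def ridge_for_movie(X: np.ndarray, y: np.ndarray, seed: int = 0):
    """80/20 split, grid-search alpha, then test-set RMSE and COD for
    one target movie; X holds the ratings of the 10 predictor movies."""
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2,
                                              random_state=seed)
    grid = GridSearchCV(Ridge(), {"alpha": np.linspace(0.0001, 120, 400)},
                        scoring="neg_mean_squared_error", cv=5)
    grid.fit(X_tr, y_tr)
    best = grid.best_estimator_
    rmse = float(np.sqrt(mean_squared_error(y_te, best.predict(X_te))))
    return grid.best_params_["alpha"], best.coef_, rmse, best.score(X_te, y_te)
```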

We obtain the following results. The RMSEs are generally low, indicating that the models perform well. The best alphas vary widely, suggesting that the individual models are affected by differences in data characteristics, tuning outcomes, and complexity: for example, the best alpha for Gigli (2002) is 120.0, while the best alpha for The Passenger (1975) is 113.082713. The best COD likewise varies: 0.405382 for Gigli (2002) versus 0.473786 for The Passenger (1975). All of these movies have relatively low RMSEs, indicating that the models predict their ratings well, but the wide range of alphas and CODs suggests the models settle at different levels of regularization and complexity.

  4. Repeat question 3) with LASSO regression. Again, make sure to comment on the hyperparameters you use and betas you find by doing so. Explanation: We used the same 30 target movies and the same 10 predictor movies as in Q3, and the procedure was identical; the only differences were the LASSO implementation and the alpha grid, alpha_range = np.concatenate((np.linspace(0.0001, 2, 200), np.linspace(2, 120, 200))), which is denser near zero. The average RMSE was 0.42 and the average alpha was 0.0094. A sketch of the LASSO setup follows, and the results table is discussed below it.
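The LASSO variant only swaps the estimator and the alpha grid relative to the ridge sketch above (max_iter is raised as a convergence precaution, an assumption rather than the project's actual setting):

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV

# Grid denser near zero, where the selected LASSO alphas ended up.
alpha_range = np.concatenate((np.linspace(0.0001, 2, 200),
                              np.linspace(2, 120, 200)))
grid = GridSearchCV(Lasso(max_iter=10000), {"alpha": alpha_range},
                    scoring="neg_mean_squared_error", cv=5)
```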

We obtain the following results. The Passenger (1975) has a low RMSE of 0.272593 and a high COD of 0.417560, suggesting its ratings are relatively easy to predict from the selected predictor movies. Gigli (2002) has a very low RMSE of 0.263056 and a moderate COD of 0.332654, again suggesting the model predicts it well. Shrek (2001) has a relatively high RMSE of 0.824971 and a low COD of 0.029164, suggesting it is harder to predict and that the predictors carry little signal for it. Toy Story (1995) has a relatively high RMSE of 0.652650 and a negative COD of -0.142054; a negative COD means the model performs worse than simply predicting the mean rating. Overall, the RMSEs in the results table are relatively low, ranging from 0.26 to 0.69 for most movies, suggesting the LASSO models predict the ratings with reasonable accuracy. The selected alpha values range from 0.0001 to 0.020199. Higher alpha values in LASSO regression result in more regularization, shrinking the coefficients of less important features toward zero, preventing overfitting, and improving generalization performance.

  5. Compute the average movie enjoyment for each user (using only real, non-imputed data). Use these averages as the predictor variable X in a logistic regression model. Sort the movies in order of increasing rating (also using only real, non-imputed data). Now pick the 4 movies in the middle of the score range as your target movies. For each of them, do a median split (now using the imputed data) of ratings to code ratings above the median with the Y label 1 (= enjoyed) and ratings below the median with the label 0 (= not enjoyed). For each of these movies, build a logistic regression model (using X to predict Y), show figures with the outcomes and report the betas as well as the AUC values. Comment on the quality of your models. Make sure to use cross-validation methods to avoid overfitting. Explanation: For this question, we computed user means from the non-imputed data, then sorted the movies by their mean rating (also computed on non-imputed data) and picked the four in the middle of the range. For each of these four movies we did a median split on the imputed ratings, coding a rating 1 if it was above the movie's median and 0 otherwise. Then, I used scikit-learn to fit logistic regressions on a dataset called 'median_split,' where each column represents a different target variable. For each target variable I split the data into training and testing sets, created a logistic regression model, and trained it on the training data. I then calculated the cross-validated area under the ROC curve (AUC) on the training set and predicted probabilities on the test set, storing the coefficients, intercepts, AUC values, and cross-validated AUC values for each model in separate lists. Finally, I organized all of this into a pandas DataFrame named 'result_df,' making it easy to compare the results. The emphasis was on how users rate movies on average. A sketch of the per-movie model follows.
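A sketch of the per-movie logistic model, assuming x_user_means holds each user's average enjoyment and y_median_split the 0/1 labels for one target movie (names and cv=5 are assumptions):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import cross_val_score, train_test_split

def logit_auc(x_user_means: np.ndarray, y_median_split: np.ndarray, seed: int = 0):
    """Fit Y ~ X for one movie and report beta, intercept, the 5-fold
    cross-validated AUC on the training set, and the held-out AUC."""
    X = x_user_means.reshape(-1, 1)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y_median_split,
                                              test_size=0.2, random_state=seed)
    clf = LogisticRegression().fit(X_tr, y_tr)
    cv_auc = cross_val_score(clf, X_tr, y_tr, cv=5, scoring="roc_auc").mean()
    test_auc = roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1])
    return clf.coef_[0, 0], clf.intercept_[0], cv_auc, test_auc
```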

The obtained results:

The logistic regression models fit well, with intercepts ranging from -5 to -7 and slope coefficients ranging from 16 to 23. This implies steep logistic curves: beyond a critical average-enjoyment value, the likelihood of rating a movie above the median increases substantially. From the table we can see that the cross-validated AUC and the test AUC are very close, suggesting the models are not overfitting. Nonetheless, although this method predicted quite well, we should look for additional factors that might influence users. ROC curves are included below.

Extra Credit:

As logistic regression made the most sense for the data we have, we wanted to do something similar to question 2, except that the target was the response to "Movies are best enjoyed alone" (1: Yes; 0: No; -1: Did not respond). We used logistic regression for this. ROC Curve:

The results show that Look Who's Talking (1989) is the movie best predicted as enjoyed alone, with AUC = 0.58 and a cross-validation score of 0.5396, only slightly better than chance (0.5).