This project leverages machine learning techniques to predict IMDb movie ratings using various features such as the number of voted users, movie duration, and critic reviews. By conducting Exploratory Data Analysis (EDA), correlation analysis, regression analysis, classification, and clustering, the project aims to uncover patterns that can effectively predict and categorize movie ratings. The ultimate goal is to provide a data-driven approach to evaluate movie quality, offering a more objective and reliable guide for movie enthusiasts.
You can view the complete project here.
The main objectives of this project were:
- Exploratory Data Analysis (EDA): To uncover patterns and insights from the dataset.
- Regression Analysis: To predict IMDb scores based on various features.
- Classification Analysis: To categorize movies into predefined IMDb score categories.
- Clustering Analysis: To group movies based on similar characteristics.
The dataset contains various features such as:
- Number of critic reviews
- Duration of the movie
- Director Facebook likes
- Actor Facebook likes
- Gross earnings
- Number of voted users
- Cast Facebook likes
- Number of user reviews
- Budget
- Title year
- IMDb score
- Aspect ratio
- Content rating
To understand the relationships between different features and the IMDb score, I calculated the correlation matrix. Key findings from the correlation analysis include:
- Number of Voted Users: Moderate positive correlation (0.48), the highest of the four listed here
- Duration: Moderate positive correlation (0.37)
- Number of Critic Reviews: Moderate positive correlation (0.35)
- Movie Facebook Likes: Moderate positive correlation (0.35)
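As a rough illustration of how such coefficients can be obtained, the sketch below ranks features by their Pearson correlation with the score using pandas. The data is synthetic and the values are made up; the column names `num_voted_users` and `imdb_score` appear in the project's code, while `duration` is an assumed name.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in data: values are made up, only the column names mimic the dataset
rng = np.random.default_rng(42)
n = 500
votes = rng.normal(size=n)
df = pd.DataFrame({
    "num_voted_users": votes,
    "duration": rng.normal(size=n),  # assumed column name, unrelated to the score here
    # Score is partly driven by votes so that a positive correlation appears
    "imdb_score": 0.5 * votes + rng.normal(size=n),
})

# Pearson correlation of every column with imdb_score, strongest first
corr_with_score = (
    df.corr()["imdb_score"]
    .drop("imdb_score")
    .sort_values(ascending=False)
)
print(corr_with_score)
```

Sorting the single `imdb_score` column of the correlation matrix is usually easier to read than the full matrix when only one target matters.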
Models used:
- Linear Regression
- Lasso Regression
- Random Forest Regression
Evaluation metrics:
- Mean Squared Error (MSE)
- R-squared
Results:
- Linear Regression:
  - Mean Squared Error (MSE): 0.67
  - R-squared: 0.346
- Random Forest:
  - Mean Squared Error (MSE): 0.47
  - R-squared: 0.549
- Lasso Regression:
  - Mean Squared Error (MSE): 0.75
  - R-squared: 0.265
The Random Forest Regression model performed best, with the lowest MSE (0.47) and an R-squared of 0.549, meaning it explained roughly 55% of the variance in IMDb scores. Linear Regression and Lasso Regression explained only about 35% and 27% of the variance respectively, so their predictive power was limited.
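A comparison of this kind can be sketched as follows. This is a minimal example on synthetic data, not the project's actual pipeline: the train/test split, `alpha=0.1`, and `n_estimators=100` are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Lasso, LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic features and a partly nonlinear target (stand-in for the movie data)
rng = np.random.default_rng(0)
X = rng.normal(size=(600, 4))
y = 6.5 + 0.6 * X[:, 0] + 0.5 * np.sin(2 * X[:, 1]) + rng.normal(scale=0.5, size=600)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Hyperparameters below are illustrative assumptions
models = {
    "Linear Regression": LinearRegression(),
    "Lasso Regression": Lasso(alpha=0.1),
    "Random Forest": RandomForestRegressor(n_estimators=100, random_state=0),
}

# Fit each model and score it on the held-out test set
results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    results[name] = {
        "mse": mean_squared_error(y_test, pred),
        "r2": r2_score(y_test, pred),
    }
    print(f"{name}: MSE={results[name]['mse']:.3f}, R-squared={results[name]['r2']:.3f}")
```

Scoring all models on the same held-out split keeps the MSE and R-squared values directly comparable.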
Movies were categorized based on IMDb scores into:
- Low (0-4)
- Medium (5-7)
- High (8-10)
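Because IMDb scores are decimals, the bands need explicit edges for values such as 4.5 or 7.4. The sketch below uses `pandas.cut` with assumed edges (up to 4 is Low, above 4 up to 7 is Medium, above 7 is High); the exact boundaries used in the project are not stated, so treat these as one possible choice.

```python
import pandas as pd

# A few made-up scores, including values that fall between the integer bands
scores = pd.Series([3.2, 4.0, 4.5, 6.8, 7.0, 7.4, 8.3, 9.1], name="imdb_score")

# Assumed bin edges: [0, 4] -> Low, (4, 7] -> Medium, (7, 10] -> High
categories = pd.cut(
    scores,
    bins=[0, 4, 7, 10],
    labels=["Low", "Medium", "High"],
    include_lowest=True,  # so that a score of exactly 0 would still land in Low
)
print(pd.DataFrame({"imdb_score": scores, "category": categories}))
```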
Models used:
- Logistic Regression
- Support Vector Machine (SVM)
- Random Forest Classifier
Results:
- Logistic Regression Accuracy: 82%
- SVM Accuracy: 85%
- Random Forest Classifier Accuracy: 91%
The Random Forest Classifier achieved the highest accuracy, demonstrating effective categorization of movies based on IMDb scores.
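The three classifiers can be compared in the same way as the regressors. The following is a hedged sketch on a synthetic three-class problem generated with `make_classification`; the scaling step, the split, and all hyperparameters are assumptions rather than the project's settings.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic three-class data standing in for the Low / Medium / High categories
X, y = make_classification(
    n_samples=600, n_features=6, n_informative=4, n_classes=3, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Logistic Regression and SVM are scale-sensitive, so they get a scaler in front
classifiers = {
    "Logistic Regression": make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
    "SVM": make_pipeline(StandardScaler(), SVC()),
    "Random Forest Classifier": RandomForestClassifier(n_estimators=100, random_state=0),
}

# Fit each classifier and record its accuracy on the held-out test set
accuracy = {}
for name, clf in classifiers.items():
    clf.fit(X_train, y_train)
    accuracy[name] = accuracy_score(y_test, clf.predict(X_test))
    print(f"{name} accuracy: {accuracy[name]:.1%}")
```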
- K-means Clustering was used to group movies based on features like IMDb scores, critic reviews, and release year.
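The number of clusters was chosen with the elbow method:

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

# df_normalized: the normalized feature table, assumed to be prepared earlier (not shown)
# WCSS: within-cluster sum of squares
wcss = []
for i in range(1, 11):
    kmeans = KMeans(n_clusters=i,
                    init='k-means++',
                    n_init='auto',
                    random_state=0)
    kmeans.fit(df_normalized)
    wcss.append(kmeans.inertia_)  # inertia_ is the WCSS of the fitted model

# Plotting the results
plt.plot(range(1, 11), wcss)
plt.title('Elbow Method')
plt.xlabel('Number of clusters')
plt.ylabel('WCSS')
plt.show()
```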
The optimal value for K appears to be 5. Beyond this point, adding more clusters yields only a marginal reduction in WCSS.
Here is the code snippet for applying K-means clustering:
```python
from sklearn.cluster import KMeans

# Selecting features for clustering
features = movie_data[['imdb_score', 'num_critic_for_reviews', 'title_year', 'num_voted_users']]

# Applying K-means clustering
kmeans = KMeans(n_clusters=5, random_state=42)
clusters = kmeans.fit_predict(features)

# Adding cluster labels to the data
movie_data['cluster'] = clusters
```
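K-means relies on Euclidean distance, so a column with a large range (vote counts) can dominate one with a small range (IMDb score). Since the elbow plot was computed on normalized data, one way to keep the final clustering consistent is to scale the selected features first. The sketch below assumes `StandardScaler`, which may differ from the normalization actually used in the project, and runs on a synthetic stand-in for `movie_data`.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for movie_data (values are made up for illustration)
rng = np.random.default_rng(0)
movie_data = pd.DataFrame({
    "imdb_score": rng.uniform(1, 10, 200),
    "num_critic_for_reviews": rng.integers(1, 800, 200),
    "title_year": rng.integers(1980, 2017, 200),
    "num_voted_users": rng.integers(100, 1_500_000, 200),
})
cols = ["imdb_score", "num_critic_for_reviews", "title_year", "num_voted_users"]

# Standardize each column to zero mean and unit variance before clustering
scaled = StandardScaler().fit_transform(movie_data[cols])

# n_init=10 is an assumption; it also works on older scikit-learn versions
kmeans = KMeans(n_clusters=5, n_init=10, random_state=42)
movie_data["cluster"] = kmeans.fit_predict(scaled)

print(movie_data["cluster"].value_counts().sort_index())
```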
The resulting clusters had the following profiles:
- Cluster 0:
  - High IMDb score
  - High number of critic reviews
  - Recent title years
  - High number of voted users
- Cluster 1:
  - Moderate IMDb score
  - Fewer critic reviews
  - Title year around 2009
  - Lower number of voted users
- Cluster 2:
  - High IMDb score
  - Fewer critic reviews
  - High Facebook likes
  - Older title years (around 2000)
- Cluster 3:
  - Low IMDb score
  - Low number of critic reviews
  - Mid-2000s title years
  - Low number of voted users
- Cluster 4:
  - Moderate IMDb score
  - Low number of critic reviews
  - Title years around 1996
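Profiles like the ones listed above are typically read off per-cluster averages. A minimal sketch, assuming the data already carries a `cluster` column; the rows below are hypothetical and hand-made, not taken from the dataset.

```python
import pandas as pd

# Hypothetical rows standing in for movie_data after clustering
movie_data = pd.DataFrame({
    "imdb_score": [8.1, 7.9, 5.2, 4.8, 6.5, 6.7],
    "num_critic_for_reviews": [400, 420, 60, 40, 150, 170],
    "title_year": [2012, 2014, 2005, 2006, 2009, 2009],
    "cluster": [0, 0, 3, 3, 1, 1],
})

# Mean of every feature within each cluster, plus the cluster size
profile = movie_data.groupby("cluster").mean().round(2)
profile["n_movies"] = movie_data.groupby("cluster").size()
print(profile)
```

Comparing each row of `profile` with the overall column means is what supports labels such as "high" or "low" for a cluster.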
The project identified key factors associated with IMDb scores and showed that machine learning models can predict and categorize movie ratings with reasonable accuracy. Random Forest Regression was the most reliable model for predicting IMDb scores (R-squared of 0.549), while the Random Forest classifier sorted movies into score categories with 91% accuracy. Additionally, clustering analysis revealed five distinct movie profiles, which could inform targeted marketing strategies. These findings underscore the potential of machine learning to make movie rating predictions more accurate and reliable.
Key findings:
- Exploratory Data Analysis: Identified important features related to IMDb scores.
- Correlation Analysis: The number of voted users, movie duration, and the number of critic reviews show the strongest correlations with the IMDb score (0.35 to 0.48).
- Regression Analysis: Random Forest Regression provided the best predictions.
- Classification Analysis: The Random Forest classifier categorized movies into IMDb score categories most accurately.
- Clustering Analysis: Movies can be grouped into distinct clusters with unique profiles, providing insights into different success factors.
Thank you for exploring this project. Feel free to check out the detailed code and analysis in the notebook.
Tools and technologies used:
- Python: Pandas, NumPy, Scikit-learn, Seaborn, Matplotlib
- Machine Learning: Decision Tree, Random Forest, Linear Regression
- Data Visualization: Matplotlib, Seaborn