Movie Recommendation by Genre

This is an unsupervised clustering model that groups movies by genre.

The data for this project was collected from Data World's IMDB dataset.

The data looks like this.

[Figure: dataset]

We first filter the data to keep only movies released in 2018 or later.
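As a minimal sketch of this filtering step: the rows, the column names (`title`, `year`, `genres`) and the helper `filter_recent` below are all hypothetical stand-ins, since the actual dataset schema is not shown here.

```python
# Hypothetical rows standing in for the IMDB data; real column names may differ.
movies = [
    {"title": "Movie A", "year": 2016, "genres": "Action|Adventure"},
    {"title": "Movie B", "year": 2018, "genres": "Comedy|Romance"},
    {"title": "Movie C", "year": 2019, "genres": "Horror|Thriller"},
]


def filter_recent(rows, first_year=2018):
    """Keep rows whose release year is first_year or later."""
    return [row for row in rows if row["year"] >= first_year]


recent = filter_recent(movies)
print([row["title"] for row in recent])
```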

Then we convert the data into a corpus and prepare it for analysis by converting all text to lower case and removing extra whitespace, punctuation, stopwords, and so on. We then stem the documents and build a Document Term Matrix.
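The cleaning and matrix-building steps could look roughly like the sketch below. It is an illustration in plain Python, not the project's actual code: the stopword list is a tiny made-up one, the sample genre strings are invented, and stemming is left out because it needs an external stemmer.

```python
import string

# Tiny illustrative stopword list (an assumption); a real pipeline would use a fuller one.
STOPWORDS = {"a", "an", "and", "of", "the"}

# Maps every punctuation character to a space.
PUNCT_TO_SPACE = str.maketrans(string.punctuation, " " * len(string.punctuation))


def tokenize(text):
    """Lower-case, replace punctuation with spaces, split on whitespace, drop stopwords."""
    cleaned = text.lower().translate(PUNCT_TO_SPACE)
    return [tok for tok in cleaned.split() if tok not in STOPWORDS]


def document_term_matrix(docs):
    """Return (vocabulary, matrix) where matrix[i][j] counts term j in document i."""
    tokenized = [tokenize(doc) for doc in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    column = {term: j for j, term in enumerate(vocab)}
    matrix = [[0] * len(vocab) for _ in tokenized]
    for i, toks in enumerate(tokenized):
        for tok in toks:
            matrix[i][column[tok]] += 1
    return vocab, matrix


# Invented genre strings, one "document" per movie.
genres = ["Action|Adventure", "  Comedy, Romance ", "Action and Comedy"]
vocab, dtm = document_term_matrix(genres)
print(vocab)
for row in dtm:
    print(row)
```

Each row of the matrix is one movie and each column one term, so two movies with the same genres get identical rows.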

Now we select 20 random movies from the dataset and run hierarchical clustering using Ward's method. We plot a dendrogram to see how the movies are clustered.

[Figure: dendrogram]

We can see that similar movies are on the same branch.
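To make the merging behind the dendrogram concrete, here is a naive sketch of Ward-style agglomerative clustering in plain Python. It uses one common formulation, where merging clusters A and B costs |A||B| / (|A| + |B|) times the squared distance between their centroids; a library implementation may scale the merge heights differently. The four input vectors are invented genre rows, not project data.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def ward_merges(points):
    """Merge clusters bottom-up; return a list of (members_a, members_b, cost)."""
    # Each cluster is a pair: (list of member indices, centroid).
    clusters = [([i], list(p)) for i, p in enumerate(points)]
    merges = []
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                (ma, ca), (mb, cb) = clusters[i], clusters[j]
                na, nb = len(ma), len(mb)
                # Increase in within-cluster sum of squares if i and j merge.
                cost = na * nb / (na + nb) * sq_dist(ca, cb)
                if best is None or cost < best[0]:
                    best = (cost, i, j)
        cost, i, j = best
        (ma, ca), (mb, cb) = clusters[i], clusters[j]
        na, nb = len(ma), len(mb)
        centroid = [(na * x + nb * y) / (na + nb) for x, y in zip(ca, cb)]
        merges.append((sorted(ma), sorted(mb), cost))
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append((ma + mb, centroid))
    return merges


# Invented rows over the columns [action, adventure, comedy, romance].
rows = [[1, 1, 0, 0], [1, 1, 0, 0], [0, 0, 1, 0], [0, 0, 1, 1]]
merges = ward_merges(rows)
for left, right, cost in merges:
    print(left, right, round(cost, 3))
```

The cheapest merges come first, which is why movies with the same or similar genres end up on the same branch. Drawing 20 random rows beforehand could be done with `random.sample` from the standard library.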

Our data contains 20 distinct genres overall, so we first create 20 clusters with k-means clustering and plot them.
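For readers unfamiliar with k-means, the following is a bare-bones sketch of Lloyd's algorithm in plain Python. It is not the project's code: the function name `kmeans`, its parameters, and the four toy points are assumptions for illustration, and the toy run uses k = 2 rather than the 20 clusters described here.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, iterations=100, seed=0):
    """Lloyd's algorithm; returns (labels, centroids)."""
    rng = random.Random(seed)
    # Start from k distinct rows picked at random.
    centroids = [list(p) for p in rng.sample(points, k)]
    labels = None
    for _ in range(iterations):
        # Assignment step: each point goes to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: sq_dist(p, centroids[c])) for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster went empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids


# Toy data: two well-separated groups of two points each.
points = [[0, 0], [0, 1], [10, 10], [10, 11]]
labels, centroids = kmeans(points, k=2)
print(labels)
```

On the real Document Term Matrix, each row would be one movie and `k` would be set to 20 for this first attempt.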

[Figure: 20 clusters]

The plot shows that some clusters have exceptionally high values. Also, clustering individual data as a set is not ideal. We therefore look for the optimum number of clusters by plotting an elbow curve.

[Figure: elbow curve]

The elbow curve shows that beyond 7 clusters the value changes very little, so we conclude that 7 is the optimum number of clusters for our data.
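An elbow curve is typically built by running k-means for a range of k and recording the within-cluster sum of squares (WCSS) each time. The sketch below assumes WCSS is the quantity plotted, which the text does not state explicitly. It bundles a compact k-means so that it runs on its own; the helper names and the toy points are hypothetical.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, iterations=100, seed=0):
    """Compact Lloyd's algorithm; returns (labels, centroids)."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    labels = None
    for _ in range(iterations):
        new_labels = [
            min(range(k), key=lambda c: sq_dist(p, centroids[c])) for p in points
        ]
        if new_labels == labels:
            break
        labels = new_labels
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids


def wcss(points, labels, centroids):
    """Sum of squared distances from each point to its assigned centroid."""
    return sum(sq_dist(p, centroids[lab]) for p, lab in zip(points, labels))


def elbow_curve(points, max_k):
    """Return {k: WCSS} for k = 1 .. max_k."""
    curve = {}
    for k in range(1, max_k + 1):
        labels, centroids = kmeans(points, k)
        curve[k] = wcss(points, labels, centroids)
    return curve


# Toy data: two well-separated groups, so the drop flattens after k = 2.
points = [[0, 0], [0, 1], [10, 10], [10, 11]]
curve = elbow_curve(points, max_k=4)
for k, value in curve.items():
    print(k, value)
```

The "elbow" is the k after which the curve flattens. On the toy data that happens at k = 2; in this project the same reading of the plot gives 7.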

We then run k-means clustering again with 7 clusters and plot the results.

[Figure: 7 clusters]

These are our final 7 genre clusters, which group similar movies together and can therefore help in recommending movies.
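One simple way such clusters might be turned into recommendations is to suggest other movies from the same cluster. The sketch below is an assumption about how that could work, not something this project is documented to implement; the `recommend` helper, the titles and the cluster ids are all hypothetical.

```python
def recommend(title, titles, labels, limit=5):
    """Return up to `limit` other titles that share the cluster of `title`."""
    target = labels[titles.index(title)]
    same_cluster = [
        t for t, lab in zip(titles, labels) if lab == target and t != title
    ]
    return same_cluster[:limit]


# Hypothetical titles and cluster ids, e.g. as produced by k-means with k = 7.
titles = ["Movie A", "Movie B", "Movie C", "Movie D"]
labels = [0, 1, 0, 0]

picks = recommend("Movie A", titles, labels)
print(picks)
```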