This project uses Natural Language Processing (NLP) techniques (e.g., regular expressions, tokenization, stemming, and TF-IDF vectorization) and clustering algorithms (KMeans and hierarchical clustering) in Python to model the "similarities" between films based on their plot summaries from IMDb and Wikipedia. The dataset contains the titles of the top 100 movies on IMDb. The steps taken are:
Merging the Wikipedia and IMDb plots
Breaking out the sentences and words in the plots
Stemming the word tokens to their base form
Creating the term frequency-inverse document frequency (TF-IDF) vectorizer object
Computing the Euclidean distance between plots based on their shared terminology, and using the KMeans algorithm to cluster the plots by those distances
Using complete (maximum) linkage to measure how similar groups of film plots are as they are merged hierarchically
Creating a dendrogram to visualize the films clustered together and their respective hierarchies.
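
The steps above can be sketched end to end as follows. This is a minimal illustration and not the project's actual code: the four toy films and their plot text are hypothetical stand-ins for the real top-100 dataset, the helper name `tokenize_and_stem` is made up for this example, and the choice of NLTK's `SnowballStemmer`, scikit-learn's `TfidfVectorizer` and `KMeans`, and SciPy's `linkage` and `dendrogram` is an assumption about suitable libraries. Sentence and word splitting is done with plain regular expressions to keep the sketch free of extra downloads.

```python
import re

from nltk.stem.snowball import SnowballStemmer
from scipy.cluster.hierarchy import dendrogram, linkage
from scipy.spatial.distance import pdist
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy data: title -> (Wikipedia-style plot, IMDb-style plot).
films = {
    "Space Film A": ("Astronauts travel through space to a distant planet.",
                     "A crew explores planets aboard a spaceship."),
    "Space Film B": ("A spaceship crew is stranded on a hostile planet.",
                     "Astronauts fight to survive in deep space."),
    "Crime Film A": ("A detective investigates a murder in the city.",
                     "The police chase a killer through city streets."),
    "Crime Film B": ("Two detectives hunt a killer after a murder.",
                     "A police investigation uncovers a city conspiracy."),
}

stemmer = SnowballStemmer("english")


def tokenize_and_stem(text):
    """Split text into sentences, then words, then stem each word."""
    sentences = re.split(r"(?<=[.!?])\s+", text)
    words = [w for s in sentences for w in re.findall(r"[a-zA-Z]+", s)]
    return [stemmer.stem(w.lower()) for w in words]


titles = list(films)
# Merge the two plot sources into one document per film.
plots = [" ".join(sources) for sources in films.values()]

# Build the TF-IDF matrix using the custom tokenizer (one row per film).
vectorizer = TfidfVectorizer(tokenizer=tokenize_and_stem, token_pattern=None)
tfidf = vectorizer.fit_transform(plots)

# KMeans assigns each film to the nearest centroid by Euclidean distance.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = kmeans.fit_predict(tfidf)

# Pairwise Euclidean distances, then complete (maximum) linkage.
distances = pdist(tfidf.toarray(), metric="euclidean")
merges = linkage(distances, method="complete")

# no_plot=True returns the tree layout without requiring matplotlib;
# drop it (and call plt.show()) to draw the dendrogram.
tree = dendrogram(merges, labels=titles, orientation="left", no_plot=True)

for title, label in zip(titles, labels):
    print(f"{title}: cluster {label}")
print("Dendrogram leaf order:", tree["ivl"])
```

On the real dataset the number of clusters passed to `KMeans` would need to be chosen for 100 films rather than four, and the dendrogram would typically be drawn with matplotlib instead of being returned as a dictionary.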