This project is a notebook about the TMDB 5000 Movie Dataset and propose three clustering models using, respectively, K-Means, Mini Batch K-Means and DBSCAN.
This dataset includes a list of 5000 movies with information of name, genre and other movie information extracted from the imdb.
The dataset common:
homepage
id
original_title
overview
popularity
production_companies
production_countries
release_date
spoken_languages
status
tagline
vote_average
Lost columns:
actor1facebook_likes
actor2facebook_likes
actor3facebook_likes
aspect_ratio
casttotalfacebook_likes
color
content_rating
directorfacebooklikes
facenumberinposter
moviefacebooklikes
movieimdblink
numcriticfor_reviews
numuserfor_reviews
First we create new columns and refactor some others, to use PCA to create an 2-dimensional dataset, which is possible to plot and visualize the clustering separately.
In this jupyter notebook we execute some experiments and compare with silhouette_avg metric.
- Install all requirements and active the conda environment.
conda env create -f environment.yml
conda activate am-t1-credit-card-fraud-detection //name of my last project
jupyter notebook main.ipynb