IMDb toy project
A project started at Luiss for the Machine Learning course of the MABDA2016 class, by Francesco Pastore & Mario Catuogno.
The program uses a set of parameters to cluster and classify movies and to propose one that you haven't seen yet. You can catalogue the movies you have already seen to help the program refine its suggestions.
This is a long-term project: we have only just started the descriptive analysis of the huge IMDb dataset, and we are still in the data cleansing stage.
- The program uses a clustering algorithm to create subsets of movies based on genre/actor/year (in the future, also sentiment adjectives)
- The software classifies movies based on your preferences (assigned scores)
- You can manually add/delete movies that you have watched
- You can use mood labels to find the perfect movie
- The more movies you add, the more precise the algorithm becomes in selecting the perfect movie for you
- Similar to Netflix "Chosen for you" but not limited by the Netflix catalogue
- The software uses a cleansed database from IMDb, uploaded to the web to allow future updates (probably weekly or monthly)
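
The clustering idea in the list above can be illustrated with a small sketch. This is not the project's implementation (which is planned in R and does not exist yet): it is plain k-means in Python over one-hot genre vectors plus a scaled year, used here only as one possible clustering algorithm. The toy records, the `GENRES` list, and the `encode` and `kmeans` helpers are all hypothetical.

```python
from math import dist

# Hypothetical toy records (title, genres, year); not taken from the IMDb dataset.
MOVIES = [
    ("Movie A", {"Action"}, 1995),
    ("Movie B", {"Action", "Thriller"}, 1998),
    ("Movie C", {"Romance"}, 2010),
    ("Movie D", {"Romance", "Comedy"}, 2012),
]
GENRES = ["Action", "Comedy", "Romance", "Thriller"]


def encode(genres, year, min_year=1990, max_year=2020):
    """One-hot encode the genres and append the year scaled to [0, 1]."""
    vec = [1.0 if g in genres else 0.0 for g in GENRES]
    vec.append((year - min_year) / (max_year - min_year))
    return vec


def kmeans(points, k, iterations=10):
    """Plain k-means; the first k points seed the centroids (deterministic)."""
    centroids = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point goes to its nearest centroid.
        labels = [
            min(range(k), key=lambda c: dist(p, centroids[c])) for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


points = [encode(genres, year) for _, genres, year in MOVIES]
labels = kmeans(points, k=2)
for (title, _, _), label in zip(MOVIES, labels):
    print(title, "-> cluster", label)
```

On real data the actor dimension would make the vectors very sparse, so the encoding and the distance measure would probably need to be reconsidered.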
The purpose of the project is to develop a tool that helps the user choose a not-yet-seen movie that they will enjoy with a certain probability.
Goal | Description | Status |
---|---|---|
Import | Import the following datasets: actors.list, actresses.list, directors.list, editors.list, genres.list, movies.list, plot.list, running-times.list | 10% |
Cleansing | Clean the dataset | 0% |
Create DB | Create a SQLite DB containing the data used by the algorithms | 0% |
DB Analysis | Analyze data and create some descriptive statistics reports | 0% |
Function:Update | Create an "update()" function to update data from IMDb for future movies | 0% |
Function:Cluster | Create a clustering function to create k clusters | 0% |
Function:Classification | Create a classification function for movie clusters | 0% |
Function:SentimentAnalysis | Create a function to perform a Sentiment Analysis on movie plots | 0% |
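
As an illustration of the "Create DB" goal, here is a minimal sketch of what a SQLite schema for the cleansed data might look like. It uses Python's standard `sqlite3` module only to stay self-contained; the project itself plans to use RSQLite from R. The table layout and the `create_db` and `add_movie` helpers are assumptions, not a decided design.

```python
import sqlite3


def create_db(path=":memory:"):
    """Open (or create) the database and make sure the tables exist."""
    conn = sqlite3.connect(path)
    conn.executescript(
        """
        CREATE TABLE IF NOT EXISTS movies (
            id    INTEGER PRIMARY KEY,
            title TEXT NOT NULL,
            year  INTEGER
        );
        CREATE TABLE IF NOT EXISTS genres (
            movie_id INTEGER NOT NULL REFERENCES movies(id),
            genre    TEXT NOT NULL
        );
        """
    )
    return conn


def add_movie(conn, title, year, genres):
    """Insert one movie and its genres; return the new movie id."""
    cur = conn.execute(
        "INSERT INTO movies (title, year) VALUES (?, ?)", (title, year)
    )
    movie_id = cur.lastrowid
    conn.executemany(
        "INSERT INTO genres (movie_id, genre) VALUES (?, ?)",
        [(movie_id, g) for g in genres],
    )
    conn.commit()
    return movie_id


conn = create_db()  # in-memory database, for the example only
movie_id = add_movie(conn, "Example Movie", 1999, ["Drama", "Comedy"])
print(conn.execute("SELECT title, year FROM movies").fetchall())
```

Keeping genres in a separate table avoids packing several values into one column, which should make the later clustering queries simpler.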
We chose R as the programming language for this project, in combination with the following additional packages:
- Data Importing: readr, RSQLite
- Data Manipulation: dplyr
- Data Visualization: ggplot2
To convert the dataset from IMDb we will probably use Python with the following package:
- Data Transformation: imdbpy
imdbpy should be able to extract the .list files and convert them into .csv files.
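
To give an idea of the conversion step, the sketch below turns lines of a .list file into CSV using only the Python standard library. It does not use imdbpy, whose conversion interface we have not verified yet. It assumes that each data line in movies.list has the form `Title (Year)`, followed by one or more tabs and the release year, with `????` for an unknown year; real files also contain headers and other variants that would need extra handling. `parse_line` and `to_csv` are hypothetical helper names.

```python
import csv
import io
import re

# Assumed line format: "Title (Year)<TAB...>Year"; "????" marks an unknown year.
TITLE_RE = re.compile(r"^(?P<title>.+?) \((?P<year>\d{4}|\?{4})")


def parse_line(line):
    """Return (title, year) for a data line, or None if it does not match."""
    parts = [p for p in line.rstrip("\n").split("\t") if p]
    if len(parts) != 2:
        return None
    match = TITLE_RE.match(parts[0])
    if match is None:
        return None
    year = match.group("year")
    return match.group("title"), (int(year) if year.isdigit() else None)


def to_csv(lines):
    """Convert an iterable of .list lines into CSV text with a header row."""
    out = io.StringIO()
    writer = csv.writer(out)
    writer.writerow(["title", "year"])
    for line in lines:
        row = parse_line(line)
        if row is not None:
            writer.writerow(row)  # a missing year is written as an empty field
    return out.getvalue()


sample = [
    "Example Movie (1999)\t\t\t1999\n",
    "Unknown Film (????)\t\t????\n",
    "a header line without tabs\n",
]
print(to_csv(sample))
```

Lines that do not match the assumed format are skipped silently here; for the data cleansing stage it would probably be better to log them.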
- get_history
- initial_history
- suggestion_history
- get_mood_social_context
- get_date
- get_trend
- imdb_updates
- sentiment_complete_analysis
- sentiment_cluster
The function get_history() retrieves the user's history of previously watched movies.
The data is downloaded from the IMDb website through its FTP link. There are 52 files, for a total of 12.458.029.182 bytes (12.46 GB of compressed data).
In this initial stage of the project, the datasets used so far are the following:
Dataset | Size | Records |
---|---|---|
actors.list | 1.14GB | 20.152.897 |
actresses.list | 687MB | 12.097.863 |
directors.list | 120MB | 3.043.694 |
editors.list | 87MB | 2.086.225 |
genres.list | 89MB | 2.353.809 |
movies.list | 188MB | 3.955.400 |
plot.list | 402MB | 7.626.679 |
running-times.list | 56MB | 1.313.544 |