IMDb

IMDb toy project

1. Introduction
2. Functions
3. Requirements
4. Performance
5. Data
- 5.1 Dataset description

Introduction

A project started at Luiss for the Machine Learning course of the MABDA2016 class, by Francesco Pastore & Mario Catuogno.

Project overview

The program uses a set of parameters to cluster and classify movies and to propose one that you haven't seen yet. You can catalogue the movies you have already seen to help the program select better results.

The project has a long-term goal: we have just started the descriptive analysis of the huge IMDb dataset, and we are still in the data cleansing stage.

Project features

The program uses a clustering algorithm to create subsets of movies based on genre, actor, and year (and, in the future, sentiment adjectives)
The software classifies movies based on your preferences (the scores you assign)
You can manually add or delete movies that you have watched
You can use mood labels to find the perfect movie
The more movies you add, the more precise the algorithm becomes in selecting the perfect movie for you
Similar to Netflix's "Chosen for you", but not limited to the Netflix catalogue
The software uses a cleansed database from IMDb, uploaded on the web for future updates (probably weekly or monthly)
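
The clustering feature listed above can be illustrated with a minimal sketch. The project itself plans to use R; Python is used here only for illustration. The sketch assumes each movie is a dictionary with hypothetical `genres` and `year` fields, encodes genres as one-hot values plus a scaled year, and runs a plain k-means with a deterministic farthest-first initialization. None of these choices come from the project, which has not yet fixed its algorithm.

```python
def encode(movie, genres, year_min, year_max):
    """One-hot encode the genres and scale the year to the range [0, 1]."""
    vec = [1.0 if g in movie["genres"] else 0.0 for g in genres]
    span = (year_max - year_min) or 1  # avoid dividing by zero
    vec.append((movie["year"] - year_min) / span)
    return vec


def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def init_centroids(points, k):
    """Farthest-first initialization (an assumption, chosen for determinism)."""
    centroids = [list(points[0])]
    while len(centroids) < k:
        farthest = max(points, key=lambda p: min(sq_dist(p, c) for c in centroids))
        centroids.append(list(farthest))
    return centroids


def kmeans(points, k, max_iter=50):
    """Return one cluster label per point."""
    centroids = init_centroids(points, k)
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: sq_dist(p, centroids[c])) for p in points
        ]
        if new_labels == labels:
            break  # assignments are stable, so the algorithm has converged
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


def cluster_movies(movies, k):
    """Encode the movies and cluster them into k groups."""
    genres = sorted({g for m in movies for g in m["genres"]})
    years = [m["year"] for m in movies]
    points = [encode(m, genres, min(years), max(years)) for m in movies]
    return kmeans(points, k)


# Illustrative data only, not taken from the project's dataset.
movies = [
    {"title": "The Matrix", "year": 1999, "genres": {"Action", "Sci-Fi"}},
    {"title": "Die Hard", "year": 1988, "genres": {"Action", "Thriller"}},
    {"title": "Titanic", "year": 1997, "genres": {"Drama", "Romance"}},
    {"title": "Casablanca", "year": 1942, "genres": {"Drama", "Romance"}},
]
labels = cluster_movies(movies, k=2)
```

On this toy input the two action movies share one label and the two romantic dramas share the other. A real run would also need actor features and a way to choose k, both of which are left open here.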

Purpose of the project

The purpose of the project is to develop a tool that helps the user choose a movie they have not seen yet and are likely to enjoy.

Project roadmap

Goal	Description	Status
Import	Import the following datasets: actors.list, actresses.list, directors.list, editors.list, genres.list, movies.list, plot.list, running-times.list	10%
Cleansing	Clean the dataset	0%
Create DB	Create a SQLite DB containing the data used by the algorithms	0%
DB Analysis	Analyze the data and create some descriptive statistics reports	0%
Function:Update	Create an "update()" function to update data from IMDb for future movies	0%
Function:Cluster	Create a clustering function to create k clusters	0%
Function:Classification	Create a classification function for movie clusters	0%
Function:SentimentAnalysis	Create a function to perform a Sentiment Analysis on movie plots	0%
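
As an illustration of the "Create DB" and "DB Analysis" goals, here is a minimal sketch of a SQLite schema with one descriptive query. The project plans to do this in R with RSQLite; this version uses Python's standard `sqlite3` module purely as an example. The table names, column names, and the helper functions `create_db`, `add_movie`, and `genre_counts` are all hypothetical.

```python
import sqlite3


def create_db(path=":memory:"):
    """Create the (hypothetical) schema and return an open connection."""
    conn = sqlite3.connect(path)
    conn.executescript(
        """
        CREATE TABLE IF NOT EXISTS movies (
            id    INTEGER PRIMARY KEY,
            title TEXT NOT NULL,
            year  INTEGER NOT NULL,
            UNIQUE (title, year)
        );
        CREATE TABLE IF NOT EXISTS genres (
            movie_id INTEGER NOT NULL REFERENCES movies(id),
            genre    TEXT NOT NULL,
            PRIMARY KEY (movie_id, genre)
        );
        """
    )
    return conn


def add_movie(conn, title, year, genres):
    """Insert a movie and its genres; duplicates are silently ignored."""
    conn.execute(
        "INSERT OR IGNORE INTO movies (title, year) VALUES (?, ?)", (title, year)
    )
    movie_id = conn.execute(
        "SELECT id FROM movies WHERE title = ? AND year = ?", (title, year)
    ).fetchone()[0]
    conn.executemany(
        "INSERT OR IGNORE INTO genres (movie_id, genre) VALUES (?, ?)",
        [(movie_id, g) for g in genres],
    )
    conn.commit()
    return movie_id


def genre_counts(conn):
    """Descriptive statistic: number of movies per genre."""
    rows = conn.execute(
        "SELECT genre, COUNT(*) FROM genres GROUP BY genre ORDER BY genre"
    )
    return dict(rows.fetchall())


# Illustrative data only.
conn = create_db()
add_movie(conn, "The Matrix", 1999, ["Action", "Sci-Fi"])
add_movie(conn, "Die Hard", 1988, ["Action", "Thriller"])
add_movie(conn, "The Matrix", 1999, ["Action"])  # duplicate, ignored
counts = genre_counts(conn)
```

Keeping genres in a separate table lets one movie carry several genres, which matches how IMDb lists them, but other layouts would work just as well.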

Programming language used

We chose R as the programming language for this project, in combination with the following additional packages:

Data Importing: readr, RSQLite
Data Manipulation: dplyr
Data Visualization: ggplot2

To convert the dataset from IMDb we will probably use Python with the following package:

Data Transformation: imdbpy

imdbpy should be able to parse the .list files and convert them into .csv files.
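
To make the conversion step concrete, here is a minimal sketch that does not use imdbpy at all and relies only on Python's standard library. It assumes that the data lines of a file such as genres.list consist of a title, one or more tab characters, and a genre, and that preamble lines contain no tabs. That layout is an assumption to verify against the real files, and the function name `list_to_csv` is hypothetical.

```python
import csv
import io
import re

# Assumed layout of a data line: "<title><TAB(s)><genre>".
LINE = re.compile(r"^(?P<title>[^\t]+)\t+(?P<genre>[^\t]+)$")


def list_to_csv(lines, out):
    """Write the matching lines to `out` as CSV and return the row count."""
    writer = csv.writer(out)
    writer.writerow(["title", "genre"])
    count = 0
    for raw in lines:
        match = LINE.match(raw.rstrip("\r\n"))
        if match is None:
            continue  # skip preamble, separators, and blank lines
        writer.writerow([match["title"].strip(), match["genre"].strip()])
        count += 1
    return count


# Illustrative input only, shaped like the assumed layout.
sample = io.StringIO(
    "THE GENRES LIST\n"
    "==================\n"
    "\n"
    "The Matrix (1999)\t\t\tAction\n"
    "The Matrix (1999)\t\t\tSci-Fi\n"
    "Titanic (1997)\t\t\t\tDrama\n"
)
buffer = io.StringIO()
written = list_to_csv(sample, buffer)
rows = list(csv.reader(io.StringIO(buffer.getvalue())))
```

For the real files, `lines` would be an open file handle and `out` a file opened with `newline=""`, as the `csv` module documentation recommends.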

Functions

List of functions

get_history
- initial_history
- suggestion_history
get_mood_social_context
get_date
get_trend
- imdb_updates
sentiment_complete_analysis
- sentiment_cluster

Functions description

The function get_history() allows to retrieve

Functional analysis

Non-functional analysis

Requirements

Functional requirements

UI requirements

Usability

Performance

Capacity

Availability

Latency

Monitoring

Maintenance

System interface

Data

The data is downloaded from the IMDb website through its FTP link. There are 52 files, for a total of 12,458,029,182 bytes (12.46 GB of compressed data).

Dataset description

In this initial stage of the project, the datasets used so far are the following:

Dataset	Size	Records
actors.list	1.14 GB	20,152,897
actresses.list	687 MB	12,097,863
directors.list	120 MB	3,043,694
editors.list	87 MB	2,086,225
genres.list	89 MB	2,353,809
movies.list	188 MB	3,955,400
plot.list	402 MB	7,626,679
running-times.list	56 MB	1,313,544
