Primary language: R. License: MIT.

IMDb

IMDb toy project

Introduction

A project started at Luiss for the Machine Learning course of the MABDA2016 class, by Francesco Pastore & Mario Catuogno.

Project overview

The program uses a set of parameters to cluster and classify movies and to suggest one that you haven't seen yet. You can catalogue the movies you have already seen to help the program select better results.

The project has a long-term goal. We have just started the descriptive analysis of the huge IMDb dataset, and we are still in the data cleansing stage.

Project features

  • The program uses a clustering algorithm to create subsets of movies based on genre/actor/year (in the future, also sentiment adjectives)
  • The software classifies movies based on your preferences (assigned scores)
  • You can manually add/delete movies that you have watched
  • You can use mood labels to find the perfect movie
  • The more movies you add, the more precise the algorithm becomes in selecting the perfect movie for the user
  • Similar to Netflix "Chosen for you", but not limited to the Netflix catalogue
  • The software uses a cleansed database from IMDb, uploaded to the web for future updates (probably weekly/monthly)
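
As a rough illustration of the clustering and suggestion features listed above, here is a minimal sketch in base R. It is not the project's implementation: the toy catalogue, the `scores` vector, and the rule "suggest unseen movies from the cluster with the highest mean score" are all assumptions made for the example, and k-means is just one possible clustering algorithm.

```r
# Toy catalogue: titles, years and genre flags are invented for illustration.
movies <- data.frame(
  title  = c("A", "B", "C", "D", "E", "F"),
  year   = c(1975, 1978, 1980, 2010, 2012, 2015),
  action = c(1, 1, 1, 0, 0, 0),
  drama  = c(0, 0, 0, 1, 1, 1),
  stringsAsFactors = FALSE
)

# Scores assigned by the user to the movies already watched (hypothetical).
scores <- c(A = 9, B = 8, D = 3)

# Scale the features so that the year does not dominate the 0/1 genre flags.
features <- scale(movies[, c("year", "action", "drama")])

set.seed(42)  # k-means starts from random centres
fit <- kmeans(features, centers = 2, nstart = 10)
movies$cluster <- fit$cluster

# Attach the user's scores; unseen movies keep NA.
watched <- movies$title %in% names(scores)
movies$score <- NA_real_
movies$score[watched] <- scores[movies$title[watched]]

# Mean score per cluster, computed on watched movies only.
cluster_score <- tapply(movies$score, movies$cluster, mean, na.rm = TRUE)

# Suggest the unseen movies that belong to the best-scored cluster.
best <- as.integer(names(which.max(cluster_score)))
suggestion <- movies$title[!watched & movies$cluster == best]
print(suggestion)
```

In this toy data the user rated two action movies highly, so the only unseen movie in that cluster is suggested. A real version would need many more features (actors, moods) and a proper classifier on top of the clusters.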

Purpose of the project

The purpose of the project is to develop a tool that helps users choose a not-yet-seen movie that they will enjoy with a certain probability.

Project roadmap

Goal | Description | Status
Import | Import the following datasets: actors.list, actresses.list, directors.list, editors.list, genres.list, movies.list, plot.list, running-times.list | 10%
Cleansing | Clean the dataset | 0%
Create DB | Create a SQLite DB containing the data used by the algorithms | 0%
DB Analysis | Analyze the data and create some descriptive statistics reports | 0%
Function:Update | Create an "update()" function to update data from IMDb for future movies | 0%
Function:Cluster | Create a clustering function to create k clusters | 0%
Function:Classification | Create a classification function for movie clusters | 0%
Function:SentimentAnalysis | Create a function to perform a sentiment analysis on movie plots | 0%

Programming language used

We chose R as the programming language for this project, in combination with the following additional packages:

  • Data Importing: readr, RSQLite
  • Data Manipulation: dplyr
  • Data Visualization: ggplot2
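
To show how some of these packages could fit together, here is a small sketch that stores a data frame in SQLite and summarises it with dplyr. The table name `movies`, its columns, and the sample rows are hypothetical; the real schema has not been defined yet.

```r
library(DBI)    # generic database interface used by RSQLite
library(dplyr)

# Hypothetical cleaned records; the real tables would come from the converted files.
movies <- data.frame(
  title = c("Movie A", "Movie B", "Movie C"),
  year  = c(1999L, 1999L, 2005L),
  genre = c("Drama", "Comedy", "Drama"),
  stringsAsFactors = FALSE
)

# In-memory database for the example; a file path would make it persistent.
con <- dbConnect(RSQLite::SQLite(), ":memory:")
dbWriteTable(con, "movies", movies)

# Read the table back and count the movies per genre with dplyr.
per_genre <- dbReadTable(con, "movies") %>%
  count(genre, name = "n_movies") %>%
  arrange(desc(n_movies), genre)

dbDisconnect(con)
print(per_genre)
```

An in-memory database keeps the example self-contained; passing a file name to `dbConnect()` instead would create the SQLite file on disk.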

To convert the IMDb dataset, we will probably use Python with the following package:

  • Data Transformation: imdbpy

imdbpy should be able to read the .list files and convert them into .csv files.
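
To give an idea of the kind of transformation involved, here is a minimal sketch written in R rather than with imdbpy. It assumes a simplified line layout (title, year in parentheses, one or more tabs, then the year again); the sample lines are invented, and the real .list files contain headers and variants that this pattern does not handle.

```r
# Sample lines in an assumed, simplified layout: "Title (Year)<tabs>Year".
lines <- c(
  "Example Movie (1999)\t\t\t1999",
  "Another Film (2005)\t\t2005",
  "this line does not match"
)

# Capture the title and the four-digit year in parentheses.
pattern <- "^(.+) \\(([0-9]{4})\\)\t+[0-9]{4}$"
matched <- grepl(pattern, lines)

# Keep only the lines that match and split them into columns.
movies <- data.frame(
  title = sub(pattern, "\\1", lines[matched]),
  year  = as.integer(sub(pattern, "\\2", lines[matched])),
  stringsAsFactors = FALSE
)

# Write the result as .csv (to a temporary file in this example).
csv_path <- tempfile(fileext = ".csv")
write.csv(movies, csv_path, row.names = FALSE)
print(movies)
```

Lines that do not match the pattern are silently dropped here; during data cleansing it would probably be safer to log them for inspection.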

Functions

List of functions

  • get_history
    • initial_history
    • suggestion_history
  • get_mood_social_context
  • get_date
  • get_trend
    • imdb_updates
  • sentiment_complete_analysis
    • sentiment_cluster

Functions description

The function get_history() allows to retrieve

Functional analysis

Non-functional analysis

Requirements

Functional requirements

UI requirements

Usability

Performance

Capacity

Availability

Latency

Monitoring

Maintenance

System interface

Data

The data is downloaded from the IMDb website via its FTP link. There are 52 files, for a total of 12,458,029,182 bytes (12.46 GB of compressed data).

Dataset description

In this initial stage of the project, the datasets used so far are the following:

Dataset | Size | Records
actors.list | 1.14 GB | 20,152,897
actresses.list | 687 MB | 12,097,863
directors.list | 120 MB | 3,043,694
editors.list | 87 MB | 2,086,225
genres.list | 89 MB | 2,353,809
movies.list | 188 MB | 3,955,400
plot.list | 402 MB | 7,626,679
running-times.list | 56 MB | 1,313,544
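
As a first, very small piece of descriptive statistics, the figures in the table above can be loaded into a data frame and summarised in base R. The numbers are copied from the table; treating 1.14 GB as 1140 MB is an assumption made to keep a single unit.

```r
# Sizes and record counts copied from the table above (sizes in MB,
# with 1.14 GB taken as 1140 MB).
datasets <- data.frame(
  dataset = c("actors.list", "actresses.list", "directors.list", "editors.list",
              "genres.list", "movies.list", "plot.list", "running-times.list"),
  size_mb = c(1140, 687, 120, 87, 89, 188, 402, 56),
  records = c(20152897, 12097863, 3043694, 2086225,
              2353809, 3955400, 7626679, 1313544),
  stringsAsFactors = FALSE
)

# Totals across the eight files used so far.
total_size_mb <- sum(datasets$size_mb)   # 2769
total_records <- sum(datasets$records)   # 52630111

# Share of records per file, largest first.
datasets$share <- round(100 * datasets$records / total_records, 1)
datasets <- datasets[order(-datasets$records), ]
print(datasets[, c("dataset", "records", "share")])
```

The eight files add up to about 2.77 GB and roughly 52.6 million records, with the two cast lists accounting for more than half of them.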