/rmdb

Imports IMDb datasets into a relational database for easy querying.

Primary LanguagePHPMIT LicenseMIT

rmdb

Imports IMDb datasets into a relational database for easy querying.

Don't host this somewhere publicly accessible, otherwise you run afoul of IMDb's licensing terms for their datasets.

Usage:

  • Create MySQL database, import config/schema.sql
  • Enter DB details in config/config.json
  • php rmdb/public/index.php [command]

Commands:

help			- Print this help
download		- Download zipped TSV files from datasets.imdbws.com. Needs 1.5GB+ of storage  
extract			- Extract zipped TSV files. Needs 7GB+ of storage  
import		        - Import everything, replace existing data. Runs all the below commands:  
import-names		- Import cast & crew  
import-titles		- Import movies, TV episodes, short films, ...  
import-titles-akas	- Import foreign names for titles  
import-episodes		- Import episode <> show/series relation  
import-ratings		- Import average ratings for titles  
import-principals	- Import directors, writers. Should be redundant if you import names

Schema

See config/schema.sql which pretty much maps to the IMDb dataset schema. Overview:

Table Description
titles All the movies, short films, episodes
genres All genres, like Action, Sci-Fi, Comedy
titles_genres Maps titles to genres
names All people with birth and death year
categories Main types of jobs like Actor, Actress, Composer
jobs Freeform job titles like Act Three written By, Live Show Editor
principals Maps names to titles, category and job, ordered by ordering
principals_characters Maps principals to a character name
professions Main professions like Producer, Stunts, casting Director
names_primaryprofessions Maps names to professions
names_knownfortitles Maps people to titles they are most known for
titleakas Foreign language / additional titles
titleakaattributes Type of additional titles like Fake Working Title, Berlin Film Festival Title
titleakatypes Smaller list of title types, similar to titleakaattributes, like Alternative, Working, IMDb Display
titleakas_titleakaattributes Maps titleakaattributes to titleakas
titleakas_titleakatypes Maps titleakas to titleakatypes
episodes Connects episode titles to their shows

Todo

Get more data from other databases like TMDB. Country of origin, language and box-office would be interesting. But this probably needs to be imported one-by-one via their API.

Examples

What's the genre distribution per year?

SELECT start_year,
       COUNT(*) AS count
FROM   titles
       LEFT JOIN titles_genres AS tg1
              ON tg1.title_id = titles.id
WHERE  tg1.genre_id = "drama" -- repeat for a few genres

       -- Exclude tv episodes etc.
       AND title_type = "movie"

       AND start_year < YEAR(CURRENT_DATE())

GROUP  BY start_year
ORDER  BY start_year ASC; 

Although after 2000 the amount of movies made, or listed in IMDb, rises rapidly, so the data might need some normalization. Or limit to before 2000 like here:

western-scifi-horror

The 10 most prolific directors of well-rated movies in the 60s

SELECT n.primary_name AS name,
       COUNT(*)       AS movies
FROM   principals
       LEFT JOIN names AS n
              ON n.id = name_id
       LEFT JOIN titles AS t
              ON t.id = title_id
WHERE  category_id = "director"
       AND t.title_type = "movie"
       AND t.start_year >= 1960
       AND t.start_year <= 1969
       AND t.runtime_minutes >= 90
       AND t.average_rating > 6
       AND t.num_votes > 10000
GROUP  BY name_id
ORDER  BY movies DESC
LIMIT  10; 
name movies
Jean-Luc Godard 7
Blake Edwards 6
Akira Kurosawa 5
Michelangelo Antonioni 5
Alfred Hitchcock 5
John Frankenheimer 5
Stanley Kramer 4
Robert Aldrich 4
Sergio Leone 4
Roman Polanski 4

The 50 highest rated horror comedies

SELECT CONCAT("[", primary_title, "](https://www.imdb.com/title/", id, ")") AS primary_title,
       start_year,
       average_rating,
       num_votes
FROM   titles
       LEFT JOIN titles_genres AS tg1
              ON tg1.title_id = titles.id
       LEFT JOIN titles_genres AS tg2
              ON tg2.title_id = titles.id
WHERE  tg1.genre_id = "horror"
       AND tg2.genre_id = "comedy"

       -- Exclude TV episodes etc.
       AND title_type = "movie"
       
       -- Exclude little voted on movies where average_rating is often too high. 
       -- Higher num_votes = more popular
       AND num_votes > 20000

ORDER  BY average_rating DESC,
          num_votes DESC
LIMIT  50; 
primary_title start_year average_rating num_votes
Shaun of the Dead 2004 7.9 547079
Evil Dead II 1987 7.7 161069
Zombieland 2009 7.6 562503
What We Do in the Shadows 2014 7.6 178609
One Cut of the Dead 2017 7.6 23294
Tucker and Dale vs Evil 2010 7.5 176660
An American Werewolf in London 1981 7.5 104122
Dead Alive 1992 7.5 96468
Stree 2018 7.5 33219
Army of Darkness 1992 7.4 174895
The Rocky Horror Picture Show 1975 7.4 148043
Bhool Bhulaiyaa 2007 7.4 26087
Gremlins 1984 7.3 219049
The Return of the Living Dead 1985 7.3 60200
House 1977 7.3 26640
The Lost Boys 1987 7.2 138911
Re-Animator 1985 7.2 63177
Bhoot Police 2021 7.2 26129
Tremors 1990 7.1 135475
The Frighteners 1996 7.1 87546
Little Shop of Horrors 1986 7.1 74525
The Fearless Vampire Killers 1967 7.1 31521
Dellamorte Dellamore 1994 7.1 21375
Bubba Ho-Tep 2002 6.9 48378
Warm Bodies 2013 6.8 229755
Ready or Not 2019 6.8 141645
Odd Thomas 2013 6.8 52720
Creepshow 1982 6.8 46916
Zombieland: Double Tap 2019 6.7 173135
Trick 'r Treat 2007 6.7 90509
Fresh 2022 6.7 38519
Housebound 2014 6.7 34328
Fido 2006 6.7 29187
Behind the Mask: The Rise of Leslie Vernon 2006 6.7 23800
Night of the Creeps 1986 6.7 23355
Happy Death Day 2017 6.6 136327
Death Becomes Her 1992 6.6 115999
May 2002 6.6 36838
Slither 2006 6.5 81269
Arachnophobia 1990 6.5 69144
The Witches of Eastwick 1987 6.5 69079
Bad Taste 1987 6.5 46735
Better Watch Out 2016 6.5 37711
Fright Night 2011 6.4 105215
Gremlins 2: The New Batch 1990 6.4 102085
Severance 2006 6.4 38791
The People Under the Stairs 1991 6.4 34750
Mayhem 2017 6.4 20928
The Babysitter 2017 6.3 89374
Dead Snow 2009 6.3 66839

Popular Sci-Fi movies of the 70s

Note: No Star Wars, it only has Action, Adventure and Fantasy genres.

SELECT CONCAT("[", primary_title, "](https://www.imdb.com/title/", id, ")") AS
       primary_title,
       start_year,
       average_rating,
       num_votes
FROM   titles
       LEFT JOIN titles_genres AS tg1
              ON tg1.title_id = titles.id
WHERE  tg1.genre_id = "sci-fi"

       -- Exclude TV episodes etc.
       AND title_type = "movie"

       -- Exclude little voted on movies where average_rating is often too high. 
       -- Higher num_votes = more popular
       AND num_votes > 10000

       AND start_year >= 1970
       AND start_year <= 1979

ORDER  BY average_rating DESC,
          num_votes DESC; 
primary_title start_year average_rating num_votes
Alien 1979 8.5 855270
A Clockwork Orange 1971 8.3 811076
Ivan Vasilyevich Changes His Profession 1973 8.2 16458
Stalker 1979 8.1 131654
Solaris 1972 8 89827
Fantastic Planet 1973 7.7 30480
Close Encounters of the Third Kind 1977 7.6 198472
Superman 1978 7.4 172371
Invasion of the Body Snatchers 1978 7.4 59190
The Andromeda Strain 1971 7.2 36803
Sleeper 1973 7.1 42854
Time After Time 1979 7.1 18387
Soylent Green 1973 7 64000
The Boys from Brazil 1978 7 27798
Westworld 1973 6.9 57182
The Stepford Wives 1975 6.9 17631
Mad Max 1979 6.8 204229
Logan's Run 1976 6.8 56057
The Brood 1979 6.8 29614
Slaughterhouse-Five 1972 6.8 13000
THX 1138 1971 6.7 51240
Phantasm 1979 6.6 36464
Silent Running 1972 6.6 28965
The Man Who Fell to Earth 1976 6.6 26180
Rollerball 1975 6.6 24752
Horror Express 1972 6.5 10758
Star Trek: The Motion Picture 1979 6.4 89032
The Omega Man 1971 6.4 31325
Shivers 1975 6.4 20837
A Boy and His Dog 1975 6.4 17756
Escape from the Planet of the Apes 1971 6.3 35051
Rabid 1977 6.3 18535
The Fury 1978 6.3 15211
Moonraker 1979 6.2 99575
Death Race 2000 1975 6.2 27669
Dark Star 1974 6.2 24315
Conquest of the Planet of the Apes 1972 6.1 32523
The Crazies 1973 6.1 13134
Beneath the Planet of the Apes 1970 6 46579
The Black Hole 1979 5.9 25455
Piranha 1978 5.9 21507
Zardoz 1974 5.8 22467
Futureworld 1976 5.7 10713
Battle for the Planet of the Apes 1973 5.4 30854

<clickbait>The 10 worst directors that somehow keep making movies</clickbait>

SELECT CONCAT("[", d_name, "](https://www.imdb.com/name/", name_id, ")") AS name,
       ROUND(avg, 2) AS average_rating,
       movies
FROM   (SELECT name_id,
               n.primary_name                       AS d_name,
               SUM(t.average_rating) / COUNT(*)     AS avg,
               COUNT(*)                             AS movies
        FROM   principals
               LEFT JOIN names AS n
                      ON n.id = name_id
               LEFT JOIN titles AS t
                      ON t.id = title_id
        WHERE  category_id = "director"
               AND t.title_type = "movie"
               AND t.num_votes > 1000
               AND n.death_year IS NULL
        GROUP  BY name_id
        ORDER  BY avg ASC) AS t
WHERE  movies > 10
ORDER  BY avg ASC
LIMIT  10; 
name average_rating movies
Uwe Boll 3.603 29
Fred Olen Ray 3.7 11
Albert Pyun 4.146 28
Bert I. Gordon 4.158 12
Timothy Woodward Jr. 4.191 11
Jim Wynorski 4.25 14
Nico Mastorakis 4.307 15
Charles Band 4.515 13
Steven C. Miller 4.636 11
Gregory Hatanaka 4.727 11