TMDB 5000 Movie Recommendation Engine

This project constructs a shiny website with a movie recommendation engine.

Please download the img.zip from https://www.dropbox.com/s/pnbcb5nhfk6qylf/img.zip?dl=0 Then copy this folder into app/www/.

Background

What can we say about the success of a movie before it is released? Are there certain companies (Pixar?) that have found a consistent formula? Given that major films costing over $100 million to produce can still flop, this question is more important than ever to the industry. Film aficionados might have different interests. Can we predict which films will be highly rated, whether or not they are a commercial success?

This is a great place to start digging in to those questions, with data on the plot, cast, crew, budget, and revenues of several thousand films.

Data Source Transfer Summary

We (Kaggle) have removed the original version of this dataset per a DMCA takedown request from IMDB. In order to minimize the impact, we're replacing it with a similar set of films and data fields from The Movie Database (TMDb) in accordance with their terms of use. The bad news is that kernels built on the old dataset will most likely no longer work.

The good news is that:

You can port your existing kernels over with a bit of editing. This kernel offers functions and examples for doing so. You can also find a general introduction to the new format here.

The new dataset contains full credits for both the cast and the crew, rather than just the first three actors.

Actor and actresses are now listed in the order they appear in the credits. It's unclear what ordering the original dataset used; for the movies I spot checked it didn't line up with either the credits order or IMDB's stars order.

The revenues appear to be more current. For example, IMDB's figures for Avatar seem to be from 2010 and understate the film's global revenues by over $2 billion.

Some of the movies that we weren't able to port over (a couple of hundred) were just bad entries. For example, this IMDB entry has basically no accurate information at all. It lists Star Wars Episode VII as a documentary.

Data Source Transfer Details

Several of the new columns contain json. You can save a bit of time by porting the load data functions from this kernel.

Even in simple fields like runtime may not be consistent across versions. For example, previous dataset shows the duration for Avatar's extended cut while TMDB shows the time for the original version.

There's now a separate file containing the full credits for both the cast and crew.

All fields are filled out by users so don't expect them to agree on keywords, genres, ratings, or the like.

Your existing kernels will continue to render normally until they are re-run.

If you are curious about how this dataset was prepared, the code to access TMDb's API is posted here.

New columns:
homepage
id
original_title
overview
popularity
production_companies
production_countries
release_date
spoken_languages
status
tagline
vote_average
Lost columns:
actor_1_facebook_likes
actor_2_facebook_likes
actor_3_facebook_likes
aspect_ratio
cast_total_facebook_likes
color
content_rating
director_facebook_likes
facenumber_in_poster
movie_facebook_likes
movie_imdb_link
num_critic_for_reviews
num_user_for_reviews
Open Questions About the Data

There are some things we haven't had a chance to confirm about the new dataset. If you have any insights, please let us know in the forums!

Are the budgets and revenues all in US dollars? Do they consistently show the global revenues?

This dataset hasn't yet gone through a data quality analysis. Can you find any obvious corrections? For example, in the IMDb version it was necessary to treat values of zero in the budget field as missing. Similar findings would be very helpful to your fellow Kagglers! (It's probably a good idea to keep treating zeros as missing, with the caveat that missing budgets much more likely to have been from small budget films in the first place).

Inspiration

Can you categorize the films by type, such as animated or not? We don't have explicit labels for this, but it should be possible to build them from the crew's job titles.

How sharp is the divide between major film studios and the independents? Do those two groups fall naturally out of a clustering analysis or is something more complicated going on?

Acknowledgements

This dataset was generated from The Movie Database API. This product uses the TMDb API but is not endorsed or certified by TMDb. Their API also provides access to data on many additional movies, actors and actresses, crew members, and TV shows. You can try it for yourself here.

yc3404/tmdb5000

TMDB 5000 Movie Recommendation Engine