Overview

The goal of this project is to get an idea of:

  • Your ability to work with and grok data
  • Your software engineering skill
  • Your system design skill

The data used for this project is The Movies Dataset (pulled from https://www.kaggle.com/rounakbanik/the-movies-dataset). Please use the copy of the dataset provided at https://s3-us-west-2.amazonaws.com/com.guild.us-west-2.public-data/project-data/the-movies-dataset.zip

Requirements

There are three goals to this project:

  • Design a data model that can be used to answer a series of questions.
  • Implement a program that transforms the input data into a form usable by the data model.
  • Design a system that can leverage the data model and program to provide real-time access to the data. (This is a design task only; do not implement it.)

The designed data model must expose the following information:

  • Production Company Details:

    • budget per year
    • revenue per year
    • profit per year
    • releases by genre per year
    • average popularity of produced movies per year
  • Movie Genre Details:

    • most popular genre by year
    • budget by genre by year
    • revenue by genre by year
    • profit by genre by year
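
To make the genre metrics above concrete, here is a minimal Python sketch of how they might be computed from already-parsed records. It rests on assumptions this brief does not confirm: profit is taken to be revenue minus budget, "most popular" is taken to mean highest mean popularity, a movie with several genres counts toward each of them, and the record keys (`year`, `genres`, `budget`, `revenue`, `popularity`) are hypothetical names chosen for illustration.

```python
from collections import defaultdict


def genre_year_totals(movies):
    """Sum budget, revenue, and profit for each (genre, year) pair.

    Assumption: profit = revenue - budget, and a multi-genre movie
    contributes its full amounts to every genre it is tagged with.
    """
    totals = defaultdict(lambda: {"budget": 0, "revenue": 0, "profit": 0})
    for movie in movies:
        for genre in movie["genres"]:
            entry = totals[(genre, movie["year"])]
            entry["budget"] += movie["budget"]
            entry["revenue"] += movie["revenue"]
            entry["profit"] += movie["revenue"] - movie["budget"]
    return dict(totals)


def most_popular_genre_by_year(movies):
    """Return {year: genre} using the highest mean popularity per year.

    Assumption: "most popular" means highest average popularity score;
    other readings (for example, most releases) are equally defensible.
    """
    scores = defaultdict(list)
    for movie in movies:
        for genre in movie["genres"]:
            scores[(movie["year"], genre)].append(movie["popularity"])

    best = {}
    for (year, genre), values in scores.items():
        mean = sum(values) / len(values)
        if year not in best or mean > best[year][1]:
            best[year] = (genre, mean)
    return {year: genre for year, (genre, _) in best.items()}
```

The production company metrics could follow the same pattern with the company name in place of the genre. Whichever definitions you choose, state them alongside the data model.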

Deliverables

Fork this repository and complete all relevant tasks in your fork.

Data Model

Please provide a visual diagram of the chosen data model.

Implementation

The input to the program is the raw data (CSV files) in The Movies Dataset. The output is one or more files that can be used to hydrate the data model.
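
As a starting point, the transformation step might look like the Python sketch below. It assumes the input is the dataset's `movies_metadata.csv`, that the column names are `id`, `release_date`, `budget`, `revenue`, `popularity`, `genres`, and `production_companies`, and that the last two hold a Python-literal list of dicts with a `name` key. Verify all of this against the provided copy of the data; the function names are illustrative only.

```python
import ast
import csv


def parse_names(cell):
    """Extract 'name' values from a stringified list of dicts.

    Returns an empty list for blank or malformed cells instead of raising.
    """
    try:
        items = ast.literal_eval(cell or "[]")
    except (ValueError, SyntaxError, TypeError):
        return []
    if not isinstance(items, list):
        return []
    return [item["name"] for item in items
            if isinstance(item, dict) and "name" in item]


def to_number(cell):
    """Convert a cell to float, or None if it is missing or not numeric."""
    try:
        return float(cell)
    except (TypeError, ValueError):
        return None


def parse_year(cell):
    """Take the year from a YYYY-MM-DD date string, or None if unparseable."""
    try:
        return int((cell or "")[:4])
    except ValueError:
        return None


def transform(csv_file):
    """Yield one cleaned record per input row of an open CSV file."""
    for row in csv.DictReader(csv_file):
        yield {
            "id": row.get("id"),
            "year": parse_year(row.get("release_date")),
            "budget": to_number(row.get("budget")),
            "revenue": to_number(row.get("revenue")),
            "popularity": to_number(row.get("popularity")),
            "genres": parse_names(row.get("genres")),
            "companies": parse_names(row.get("production_companies")),
        }
```

The sketch keeps malformed rows and marks bad values as `None` or `[]` so that the decision to drop or repair them stays explicit in a later step.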

Feel free to use any language you are comfortable with; a JVM language or Python is preferred if possible.

Design

The goal of the design task is to design a system that exposes this data to end users via an HTTP API. When designing the system, focus on the high-level design and on how its parts interact. You don't need to go deep on the API; don't worry about defining routes, types, and so on.

Assumptions:

  • New data files are received monthly
  • The system should be scalable

The design should include:

  • Data transformation - using the program implemented above
  • Data storage - How will the data be stored?
  • Data serving - How will users access the data?

Be sure to discuss issues and trade-offs around scaling, monitoring, failure recovery, authentication, etc.
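
As one illustration of a failure-recovery trade-off, the Python sketch below makes the monthly ingest safe to retry: a file is skipped if its content hash is already recorded in a manifest. This is only one possible approach, not a required part of the design; the manifest file, the `process` callback, and the function names are all hypothetical.

```python
import hashlib
import json
import os
import tempfile


def file_digest(path):
    """Return the SHA-256 hex digest of a file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def ingest_once(input_path, manifest_path, process):
    """Run process(input_path) unless this exact file was already ingested.

    Returns True if the file was processed, False if it was skipped.
    The manifest is updated only after process() returns, so a failed
    run is retried the next time the same file is seen.
    """
    seen = {}
    if os.path.exists(manifest_path):
        with open(manifest_path) as f:
            seen = json.load(f)

    digest = file_digest(input_path)
    if digest in seen:
        return False

    process(input_path)
    seen[digest] = os.path.basename(input_path)

    # Write to a temporary file and rename it so that a crash mid-write
    # cannot leave a truncated manifest behind.
    manifest_dir = os.path.dirname(os.path.abspath(manifest_path))
    fd, tmp_path = tempfile.mkstemp(dir=manifest_dir)
    with os.fdopen(fd, "w") as f:
        json.dump(seen, f)
    os.replace(tmp_path, manifest_path)
    return True
```

In a scaled-out system the manifest would more likely live in a database or object store than on local disk, and `process` would itself need to be idempotent, since a crash between processing and the manifest update causes the file to be processed again.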