Data Warehouse Course Project

The final project for the Data Warehouse course, 2015.

Project Requirements

  • Data Resource
  • Data to be stored
    • movie ID
    • commenter's user ID
    • commenter's profile name
    • comment helpfulness
    • comment score given by each user
    • comment time
    • comment summary
    • comment text
    • movie actors
    • movie show time
    • movie genre
    • movie director
    • movie starring actors
    • movie version
  • Most frequent queries
    • query by time
      • the number of movies in year XX, month XX, season XX
      • how many new movies were released on a Tuesday
    • query by movie name
      • how many versions a movie has
    • query by director / actor / genre
    • combined queries
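
The time-based queries above can be sketched with Python's built-in sqlite3 module. This is only an illustration, not the project's actual schema: the `movie` table, its column names, the sample rows, and the reading of "season" as a calendar quarter are all assumptions.

```python
import sqlite3

# Hypothetical single-table layout; the real design may differ.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE movie (
        movie_id  TEXT PRIMARY KEY,
        title     TEXT,
        genre     TEXT,
        director  TEXT,
        show_date TEXT  -- ISO date, e.g. '2015-02-03'
    )
    """
)
conn.executemany(
    "INSERT INTO movie VALUES (?, ?, ?, ?, ?)",
    [
        ("M1", "Example A", "Thriller", "Director X", "2015-02-03"),  # a Tuesday
        ("M2", "Example B", "Comedy", "Director Y", "2015-02-14"),    # a Saturday
        ("M3", "Example C", "Thriller", "Director X", "2015-07-21"),  # a Tuesday
    ],
)


def count_movies(conn, year, month=None, season=None):
    """Count movies shown in a year, optionally narrowed to a month or season.

    "Season" is assumed to mean a calendar quarter (1..4).
    """
    sql = "SELECT COUNT(*) FROM movie WHERE strftime('%Y', show_date) = ?"
    params = [str(year)]
    if month is not None:
        sql += " AND CAST(strftime('%m', show_date) AS INTEGER) = ?"
        params.append(month)
    if season is not None:
        # Months 1-3 map to 1, 4-6 to 2, and so on (integer division).
        sql += " AND (CAST(strftime('%m', show_date) AS INTEGER) + 2) / 3 = ?"
        params.append(season)
    return conn.execute(sql, params).fetchone()[0]


def count_on_weekday(conn, weekday):
    """Count movies first shown on a weekday (SQLite: 0 = Sunday, 2 = Tuesday)."""
    sql = "SELECT COUNT(*) FROM movie WHERE strftime('%w', show_date) = ?"
    return conn.execute(sql, [str(weekday)]).fetchone()[0]


print(count_movies(conn, 2015, season=1))  # movies in the first quarter of 2015
print(count_on_weekday(conn, 2))           # movies first shown on a Tuesday
```

With the three sample rows, both calls print 2.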

Project Implementation Process

Step 1.
  • Process data from Amazon
    • write a crawler script in Python and simple bash
      • fetched 230 thousand items from Amazon in one night, using three servers each running multiple threads
    • clean the data
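
A minimal sketch of the crawl-and-clean step, assuming a thread pool per server and raw records that arrive as dictionaries of strings. The field names, the "helpful/total" helpfulness format, the Unix-timestamp time format, and the `fake_fetch` stand-in are all assumptions made for illustration; the project's real crawler is not shown in this README.

```python
from concurrent.futures import ThreadPoolExecutor
from datetime import datetime, timezone


def clean_review(raw):
    """Normalise one raw record (hypothetical field names and formats)."""
    helpful, _, total = raw["helpfulness"].partition("/")
    when = datetime.fromtimestamp(int(raw["time"]), tz=timezone.utc)
    return {
        "movie_id": raw["movie_id"].strip(),
        "user_id": raw["user_id"].strip(),
        "helpful_votes": int(helpful),
        "total_votes": int(total),
        "score": float(raw["score"]),
        "time": when.date().isoformat(),
        # Collapse runs of whitespace left over from the page markup.
        "summary": " ".join(raw["summary"].split()),
    }


def crawl(item_ids, fetch, workers=8):
    """Fetch every item with a pool of threads, keeping the input order.

    `fetch` is any callable that takes an item ID and returns a raw record;
    in a real crawler it would issue the HTTP request.
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(fetch, item_ids))


def fake_fetch(item_id):
    """Stand-in for a network call so the sketch runs offline."""
    return {
        "movie_id": f" {item_id} ",
        "user_id": "U1",
        "helpfulness": "7/9",
        "score": "4.0",
        "time": "1420070400",  # 2015-01-01 00:00:00 UTC
        "summary": "Great   movie\n",
    }


raw_records = crawl(["M1", "M2", "M3"], fake_fetch, workers=3)
cleaned = [clean_review(r) for r in raw_records]
print(cleaned[0])
```

Passing `fetch` in as a parameter keeps the threading logic separate from the site-specific request code, so the same function can be run unchanged on each server with its own share of the item IDs.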
Step 2.
  • Design the storage plan

    • the logical storage plan: ERD
    • the physical storage plan: database table design
  • Build the Hive cluster

    • I temporarily rented three servers from https://www.digitalocean.com/ to act as the namenode, edgenode and datanode. Their configurations are shown here:

      [images: configuration of each of the three servers]

    • The edgenode monitors the performance of the cluster. [image: cluster monitoring view]

    • We need to compare the time taken by MySQL and HDFS to run the same complex search.
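
The timing comparison can be sketched as a small harness that runs the same query against each backend and records the wall-clock time. The sketch below uses two in-memory SQLite databases as stand-ins so that it runs anywhere; the backend names, the `compare` helper, and the sample query are assumptions, and in the real setup each callable would submit the query to MySQL or to Hive instead.

```python
import sqlite3
import time


def time_query(run_query, repeats=3):
    """Return the best wall-clock time in seconds over `repeats` runs."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        run_query()
        best = min(best, time.perf_counter() - start)
    return best


def compare(backends, repeats=3):
    """Time the same query on every backend; returns {name: seconds}."""
    return {name: time_query(run, repeats) for name, run in backends.items()}


def make_stand_in(rows):
    """Build an in-memory SQLite database and return a query callable."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE movie (movie_id INTEGER, genre TEXT)")
    conn.executemany(
        "INSERT INTO movie VALUES (?, ?)",
        [(i, "Thriller" if i % 2 else "Comedy") for i in range(rows)],
    )
    query = "SELECT genre, COUNT(*) FROM movie GROUP BY genre"
    return lambda: conn.execute(query).fetchall()


# Stand-ins only: neither of these talks to MySQL or Hive.
timings = compare(
    {
        "mysql_stand_in": make_stand_in(1_000),
        "hive_stand_in": make_stand_in(10_000),
    }
)
for name, seconds in timings.items():
    print(f"{name}: {seconds:.6f} s")
```

Taking the best of several runs reduces the effect of one-off delays such as cold caches; whether that is the right measure for a Hive job, which has a fixed start-up cost, is a design choice to revisit with real data.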

Step 3.
  • Search by multiple conditions, with conditions added automatically

    [image: multi-condition search form]

  • For example, search for thrillers in season 1 of 2015

    [image: example search for thrillers]

  • The results are shown in a table

    [image: results table]

  • You can click on any item to search further

    [image: follow-up search from a result item]

  • The comparison of time taken by MySQL and HDFS is displayed as a histogram and pie charts.

    [image: time comparison charts]
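
One way to implement a multi-condition search like the one in Step 3 is to build the WHERE clause from whichever conditions the user has filled in. The sketch below is an assumption about how this could be done, not the project's code: the `movie` table, the searchable column names, and the `build_search` helper are hypothetical.

```python
import sqlite3

# Columns a user may filter on (hypothetical names). Only whitelisted names
# are ever placed in the SQL text; values always go through placeholders.
SEARCHABLE = ("genre", "director", "actor", "year", "season")


def build_search(conditions):
    """Turn {column: value} into a parameterised SELECT and its parameters."""
    clauses, params = [], []
    for column, value in conditions.items():
        if column not in SEARCHABLE:
            raise ValueError(f"unsupported search field: {column}")
        clauses.append(f"{column} = ?")
        params.append(value)
    sql = "SELECT movie_id, title FROM movie"
    if clauses:
        sql += " WHERE " + " AND ".join(clauses)
    return sql, params


# Small in-memory table to show the generated query running.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE movie (movie_id TEXT, title TEXT, genre TEXT, "
    "director TEXT, actor TEXT, year INTEGER, season INTEGER)"
)
conn.executemany(
    "INSERT INTO movie VALUES (?, ?, ?, ?, ?, ?, ?)",
    [
        ("M1", "Example A", "Thriller", "Director X", "Actor P", 2015, 1),
        ("M2", "Example B", "Comedy", "Director Y", "Actor Q", 2015, 1),
        ("M3", "Example C", "Thriller", "Director X", "Actor P", 2015, 3),
    ],
)

# "Thrillers in season 1 of 2015", expressed as three conditions.
sql, params = build_search({"genre": "Thriller", "year": 2015, "season": 1})
print(sql)
print(conn.execute(sql, params).fetchall())
```

Clicking an item to search further then amounts to adding one more entry to the conditions dictionary and rebuilding the query.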

Development Tools

Conclusions

  • crawling large amounts of data online
  • ETL skills
  • building Hive clusters
  • displaying search results, possibly with https://www.joomla.org/