datawrangling/trendingtopics

Create branch that uses Hive & MySQL partitioned tables instead of JSON

datawrangling opened this issue · 0 comments

For demo simplicity (and fast MySQL queries for sparklines and time series data), I just loaded the time series into Hive and MySQL as JSON strings, one per page id.
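Roughly what that looks like today, as a sketch only (the column names are assumptions, not the actual schema): a single unpartitioned table with the whole timeline serialized per row.

```sql
-- Current JSON-string approach (sketch; names are assumptions):
-- one row per page, with the full time series packed into JSON strings.
CREATE TABLE daily_timelines (
  page_id BIGINT,
  dates STRING,       -- e.g. '[20090601,20090602,...]'
  pageviews STRING    -- e.g. '[342,398,...]'
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE;
```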

For better load performance, try using Hive partitioned/bucketed tables (http://wiki.apache.org/hadoop/Hive/LanguageManual/DDL) and MySQL partitions (http://dev.mysql.com/doc/refman/5.1/en/partitioning-management-range-list.html) to store the data partitioned by date or by date-hour; see the sketches below. Combined with the Cloudera EBS-based Hadoop setup, this should allow fast appends when new data arrives rather than reloading the entire daily timeline and rebuilding indexes.
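A minimal sketch of the Hive side, assuming one row per page per day (table name, columns, and paths are illustrative, not the project's actual schema):

```sql
-- Hive table partitioned by date (sketch; names are assumptions).
CREATE TABLE daily_pageviews (
  page_id BIGINT,
  pageviews BIGINT
)
PARTITIONED BY (dt STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE;

-- Appending a new day only touches the new partition;
-- existing partitions are left untouched.
LOAD DATA INPATH '/user/hive/staging/pageviews/20090801'
OVERWRITE INTO TABLE daily_pageviews PARTITION (dt='20090801');
```

And a corresponding MySQL 5.1 range-partitioning sketch, again with illustrative names:

```sql
-- MySQL table range-partitioned by day (sketch; names are assumptions).
-- In 5.1 the partitioning column must be part of every unique key,
-- hence the composite PRIMARY KEY (page_id, dt).
CREATE TABLE daily_pageviews (
  page_id INT UNSIGNED NOT NULL,
  dt DATE NOT NULL,
  pageviews INT UNSIGNED NOT NULL,
  PRIMARY KEY (page_id, dt)
) ENGINE=MyISAM
PARTITION BY RANGE (TO_DAYS(dt)) (
  PARTITION p20090801 VALUES LESS THAN (TO_DAYS('2009-08-02')),
  PARTITION p20090802 VALUES LESS THAN (TO_DAYS('2009-08-03')),
  PARTITION pmax      VALUES LESS THAN MAXVALUE
);

-- When a new day arrives, carve a partition out of the catch-all and load
-- into it, instead of reloading the whole table and rebuilding indexes.
ALTER TABLE daily_pageviews REORGANIZE PARTITION pmax INTO (
  PARTITION p20090803 VALUES LESS THAN (TO_DAYS('2009-08-04')),
  PARTITION pmax      VALUES LESS THAN MAXVALUE
);
```

Date-hour granularity would follow the same pattern, just with one partition per hour instead of per day.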