# Audiotopsy

Data pipeline for analyzing the Echonest Million Song Dataset

## Introduction

The Million Song Dataset is a collection of one million songs along with metadata such as beats per minute, duration, energy, danceability, a hotttnesss factor, and location information. The dataset is ripe for research, but it is too large to fit on a single system. This data pipeline lets you ingest the entire dataset and perform analysis on it.

For additional information about the Million Song Dataset, see http://labrosa.ee.columbia.edu/millionsong/

### After setting up this pipeline, you will be able to:

  1. Issue Hive queries like:

     ```sql
     SELECT song_hotttnesss, artist_name, title FROM denny_msd_date
     WHERE year = '2009' ORDER BY song_hotttnesss DESC;
     ```

  2. Make RESTful API calls like these (a Java client sketch follows this list):

     ```
     http://hostname/audiotopsy/rest/music/yearwisestats                                            # Year-wise statistics
     http://hostname/audiotopsy/rest/music/countrystats                                             # Country-wise statistics
     http://hostname/audiotopsy/rest/music/yeartoptracksk?year=1970&key=19700.19TROAIBP128F4280AAF  # Top songs for a given year
     ```
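As a quick illustration of consuming these endpoints, here is a minimal Java client sketch using the standard `HttpURLConnection` API. The hostname is a placeholder, and the assumption that the endpoints return JSON is inferred from their REST style rather than confirmed by the source.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;

public class AudiotopsyClient {
    public static void main(String[] args) throws Exception {
        // "hostname" is a placeholder for the host where the web application is deployed.
        URL url = new URL("http://hostname/audiotopsy/rest/music/yearwisestats");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestMethod("GET");
        conn.setRequestProperty("Accept", "application/json");
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(conn.getInputStream()))) {
            String line;
            while ((line = in.readLine()) != null) {
                System.out.println(line); // print the raw response body
            }
        } finally {
            conn.disconnect();
        }
    }
}
```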

A detailed description of the field list is available at http://labrosa.ee.columbia.edu/millionsong/pages/field-list

### The outline of the data pipeline

The dataset is publicly available as an Amazon S3 bucket. A bash script pulls the data from S3 and pushes it to HDFS. Hive tables are built on top of this master dataset to facilitate ad-hoc queries. To serve query results in real time, batch jobs are scheduled to precompute views, which are then pushed to HBase. RESTful APIs expose this data, interacting with HBase internally via the HBase Java Client API. A sketch of such a read path follows.
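To make the read path concrete, a resource in the web application might fetch a precomputed row from HBase roughly as below. This is a minimal sketch using the classic `HTable` client API of the CDH 5 era; the table name `yearwisestats`, column family `stats`, and qualifier `count` are hypothetical placeholders, not the repository's actual schema.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class YearStatsDao {
    /** Fetch one precomputed statistic for the given year from HBase. */
    public String statsForYear(String year) throws Exception {
        Configuration conf = HBaseConfiguration.create(); // reads hbase-site.xml from the classpath
        HTable table = new HTable(conf, "yearwisestats"); // hypothetical table name
        try {
            Get get = new Get(Bytes.toBytes(year));       // row key = year (assumption)
            Result result = table.get(get);
            byte[] value = result.getValue(Bytes.toBytes("stats"),  // hypothetical family
                                           Bytes.toBytes("count")); // hypothetical qualifier
            return value == null ? null : Bytes.toString(value);
        } finally {
            table.close();
        }
    }
}
```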

## Instructions to set up the data pipeline

### Prerequisites

  • A Hadoop cluster running Cloudera CDH 5.0.2
  • Maven installed

  1. Clone this repository.
  2. Fetch the Echonest Million Song Dataset and push the data to HDFS:

     ```bash
     cd bash
     ./get_data.sh
     ```

  3. Create the Hive tables around the master dataset to facilitate ad-hoc querying:

     ```bash
     cd ../hive
     ./hive_driver.sh
     ```

  4. Run the batch jobs that will generate the views:

     ```bash
     ./hive_load_tmp_data_driver.sh
     ```

  5. Create the HBase tables, then load the views into HBase to enable real-time querying (a sketch of the import's write path follows this list):

     ```bash
     cd ../hbase
     ./hbase_driver.sh       # Create the HBase tables
     cd ../HBaseImport
     mvn package
     cd ../bash
     ./load_hbase_data.sh    # Driver script that runs the jar files to load the data into HBase
     ```
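Conceptually, each import job run by `load_hbase_data.sh` reads rows from a Hive-generated view and writes them into HBase. The following is a minimal sketch of that write path using the classic HBase client API of the CDH 5 era; the table name, row key, column family, qualifier, and value are illustrative assumptions, not the actual `HBaseImport` schema.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class ViewImporter {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create(); // picks up hbase-site.xml
        HTable table = new HTable(conf, "yearwisestats"); // hypothetical table name
        try {
            // One Put per view row: row key = year, one cell per statistic.
            Put put = new Put(Bytes.toBytes("2009"));     // hypothetical row key
            put.add(Bytes.toBytes("stats"),               // hypothetical column family
                    Bytes.toBytes("track_count"),         // hypothetical qualifier
                    Bytes.toBytes("34530"));              // hypothetical value
            table.put(put);
        } finally {
            table.close();
        }
    }
}
```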

## Instructions to set up the web application that provides the RESTful API

### Prerequisites

  • Apache Tomcat and Maven installed, to deploy this web application

  • The following JavaScript and CSS files downloaded (maintain the folder structure shown):

    ```text
    // JavaScript dependencies
    audiotopsy/src/main/webapp/js/
    ├── bootstrap.min.js
    ├── d3.v3.js
    ├── datamaps.all.min.js
    ├── jquery-2.1.0.js
    ├── nv.d3.js
    ├── topojson.v1.min.js
    ├── src
    │   ├── core.js
    │   ├── interactiveLayer.js
    │   ├── nv.d3.css
    │   ├── tooltip.js
    │   ├── utils.js
    │   └── models
    │       ├── axis.js
    │       ├── line.js
    │       ├── scatter.js
    │       ├── lineChart.js
    │       ├── legend.js
    │       └── backup
    │           ├── bulletChart.js
    │           └── bullet.js
    └── main.js

    // CSS dependencies
    audiotopsy/src/main/webapp/css/
    ├── bootstrap-combined.min.css
    ├── bootstrap.min.css
    ├── bootstrap-responsive.css
    ├── nv.d3.css
    └── styles.css
    ```
    
  1. Create the war file for the web application:

     ```bash
     cd audiotopsy_webapp
     mvn package
     cd target  # The war file will be created in this directory
     ```

  2. SSH into the system where you wish to deploy the web application and copy the war file to Tomcat's webapps folder. You should now be able to access the data via the RESTful APIs! A sketch of how such an endpoint might be implemented follows.
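For reference, the endpoint URLs above (`/audiotopsy/rest/music/...`) follow a JAX-RS style layout, so a resource class serving them might look roughly like the sketch below. The JAX-RS framework itself is an inference from the URL structure, and the annotations, method bodies, and use of the hypothetical `YearStatsDao` from the earlier sketch are assumptions for illustration, not the repository's actual code.

```java
import javax.ws.rs.GET;
import javax.ws.rs.Path;
import javax.ws.rs.Produces;
import javax.ws.rs.QueryParam;
import javax.ws.rs.core.MediaType;

@Path("/music")
public class MusicResource {

    @GET
    @Path("/yearwisestats")
    @Produces(MediaType.APPLICATION_JSON)
    public String yearWiseStats() throws Exception {
        // Delegate to an HBase-backed DAO (see the read-path sketch above).
        return new YearStatsDao().statsForYear("2009");
    }

    @GET
    @Path("/yeartoptracksk")
    @Produces(MediaType.APPLICATION_JSON)
    public String topTracksForYear(@QueryParam("year") String year,
                                   @QueryParam("key") String key) {
        // Placeholder body; the real implementation would look up the
        // precomputed top-tracks row for this year/key in HBase.
        return "{\"year\": \"" + year + "\", \"key\": \"" + key + "\"}";
    }
}
```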