CSCI-GA.3033-008 NYU Courant Institute of Mathematical Sciences Computer Science Department, Graduate Division Spring 2013
- We use Apache Hadoop/MapReduce to analyze the data sets and extract patterns from them.
- We use Apache HBase to store data that we will later query from our UI.
- We use Apache Mahout to cluster users with the K-Means algorithm (using SequenceFiles of NamedVectors that map each user to the weights of the tags of the songs they listened to).
- We use HDFS to store the large volumes of data, which would not be possible when running Hadoop in local/standalone mode.
- We use SQLite to store data that is not huge in quantity and to retrieve it from the UI.
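The core MapReduce pattern behind the play-count questions can be sketched in plain Python. This is only an illustration of the map/reduce logic, not the project's actual Hadoop job; the `(user, song, plays)` triplet layout and the identifiers are assumptions.

```python
from collections import defaultdict

# Hypothetical listening records of (user_id, song_id, play_count);
# the real job would read the dataset's taste-profile triplets from HDFS.
triplets = [
    ("u1", "SOAAA", 3),
    ("u2", "SOAAA", 5),
    ("u1", "SOBBB", 2),
    ("u3", "SOBBB", 1),
]

def map_phase(records):
    # Map: emit a (song_id, play_count) pair for every listening record.
    for _user, song, plays in records:
        yield song, plays

def reduce_phase(pairs):
    # Reduce: sum the play counts arriving under each song key.
    totals = defaultdict(int)
    for song, plays in pairs:
        totals[song] += plays
    return dict(totals)

play_counts = reduce_phase(map_phase(triplets))
most_listened = max(play_counts, key=play_counts.get)
```

The same shape (map emits a key with a count, reduce sums per key) answers the artist-level questions by keying on artist instead of song.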
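As a rough picture of what the Mahout step does, here is a minimal Lloyd's-algorithm K-Means over small user-to-tag-weight vectors. The tag dimensions, user IDs, and starting centroids are invented for illustration; the real pipeline runs Mahout's K-Means over NamedVectors in SequenceFiles.

```python
import math

# Hypothetical user -> tag-weight vectors (e.g. weights over three tags);
# in the real pipeline these are Mahout NamedVectors in SequenceFiles.
users = {
    "u1": [0.9, 0.1, 0.0],
    "u2": [0.8, 0.2, 0.1],
    "u3": [0.0, 0.9, 0.8],
    "u4": [0.1, 0.8, 0.9],
}

def dist(a, b):
    # Euclidean distance between two weight vectors.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def kmeans(vectors, centroids, iterations=10):
    # Lloyd's algorithm: assign each vector to its nearest centroid,
    # then recompute each centroid as the mean of its cluster.
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for v in vectors:
            i = min(range(len(centroids)), key=lambda i: dist(v, centroids[i]))
            clusters[i].append(v)
        centroids = [
            [sum(col) / len(c) for col in zip(*c)] if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
    return centroids

centroids = kmeans(list(users.values()), [[1.0, 0.0, 0.0], [0.0, 1.0, 1.0]])
assignment = {
    u: min(range(len(centroids)), key=lambda i: dist(v, centroids[i]))
    for u, v in users.items()
}
```

Users with similar tag-weight profiles end up in the same cluster, which is the grouping the project uses downstream.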
We try to answer the following from the Million Song Dataset:
- What's the most listened-to song? (100% Completed)
- Who's the most listened-to artist? (100% Completed)
- What are an artist's top songs? (100% Completed)
- Plot a graph of an artist's song energies (0-100) vs. number of songs. (100% Completed)
- What are an artist's similar artists? (100% Completed)
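The first two questions reduce to simple queries once the aggregated counts are in SQLite. The sketch below uses an in-memory database with an assumed `plays(song, artist, play_count)` table; the project's actual schema and data may differ.

```python
import sqlite3

# In-memory stand-in for the SQLite database the UI queries.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE plays (song TEXT, artist TEXT, play_count INTEGER)")
conn.executemany(
    "INSERT INTO plays VALUES (?, ?, ?)",
    [
        ("Song A", "Artist X", 120),  # made-up rows for illustration
        ("Song B", "Artist X", 80),
        ("Song C", "Artist Y", 150),
    ],
)

# Most listened-to song overall.
top_song = conn.execute(
    "SELECT song FROM plays ORDER BY play_count DESC LIMIT 1"
).fetchone()[0]

# Most listened-to artist, summing plays across all of their songs.
top_artist = conn.execute(
    "SELECT artist FROM plays GROUP BY artist "
    "ORDER BY SUM(play_count) DESC LIMIT 1"
).fetchone()[0]
```

An artist's top songs follow the same shape: filter by artist, order by `play_count` descending.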