CSCI-GA.3033-008 NYU Courant Institute of Mathematical Sciences Computer Science Department, Graduate Division Spring 2013
- We use Apache Hadoop/MapReduce to analyze the data sets and extract patterns from them.
- We use Apache HBase to store data that we will later query from our UI.
- We use Apache Mahout to cluster users with the K-Means algorithm (using SequenceFiles of NamedVectors that map each user to the weights of the tags of the songs they listened to).
- We use HDFS to store the large volumes of data, which would not be possible when running Hadoop in local/standalone mode.
- We use SQLite to store data that is not huge in quantity and to retrieve it from the UI.
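The core MapReduce pattern behind the play-count questions can be sketched in plain Python. This is only an illustration of the map/reduce logic, not the project's actual Hadoop job; the `(user, song, plays)` triplet layout and the identifiers are assumptions.

```python
from collections import defaultdict

# Hypothetical listening records of (user_id, song_id, play_count);
# the real job would read the dataset's taste-profile triplets from HDFS.
triplets = [
    ("u1", "SOAAA", 3),
    ("u2", "SOAAA", 5),
    ("u1", "SOBBB", 2),
    ("u3", "SOBBB", 1),
]

def map_phase(records):
    # Map: emit a (song_id, play_count) pair for every listening record.
    for _user, song, plays in records:
        yield song, plays

def reduce_phase(pairs):
    # Reduce: sum the play counts arriving under each song key.
    totals = defaultdict(int)
    for song, plays in pairs:
        totals[song] += plays
    return dict(totals)

play_counts = reduce_phase(map_phase(triplets))
most_listened = max(play_counts, key=play_counts.get)
```

The same shape (map emits a key with a count, reduce sums per key) answers the artist-level questions by keying on artist instead of song.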
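As a rough picture of what the Mahout step does, here is a minimal Lloyd's-algorithm K-Means over small user-to-tag-weight vectors. The tag dimensions, user IDs, and starting centroids are invented for illustration; the real pipeline runs Mahout's K-Means over NamedVectors in SequenceFiles.

```python
import math

# Hypothetical user -> tag-weight vectors (e.g. weights over three tags);
# in the real pipeline these are Mahout NamedVectors in SequenceFiles.
users = {
    "u1": [0.9, 0.1, 0.0],
    "u2": [0.8, 0.2, 0.1],
    "u3": [0.0, 0.9, 0.8],
    "u4": [0.1, 0.8, 0.9],
}

def dist(a, b):
    # Euclidean distance between two weight vectors.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def kmeans(vectors, centroids, iterations=10):
    # Lloyd's algorithm: assign each vector to its nearest centroid,
    # then recompute each centroid as the mean of its cluster.
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for v in vectors:
            i = min(range(len(centroids)), key=lambda i: dist(v, centroids[i]))
            clusters[i].append(v)
        centroids = [
            [sum(col) / len(c) for col in zip(*c)] if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
    return centroids

centroids = kmeans(list(users.values()), [[1.0, 0.0, 0.0], [0.0, 1.0, 1.0]])
assignment = {
    u: min(range(len(centroids)), key=lambda i: dist(v, centroids[i]))
    for u, v in users.items()
}
```

Users with similar tag-weight profiles end up in the same cluster, which is the grouping the project uses downstream.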
We try to answer the following from the Million Song Dataset:
- What's the most listened-to song? (100% Completed)
- Who's the most listened-to artist? (100% Completed)
- What are an artist's top songs? (100% Completed)
- Plot a graph of an artist's song energies (0-100) vs. number of songs. (100% Completed)
- What are an artist's similar artists? (100% Completed)
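The first two questions reduce to simple queries once the aggregated counts are in SQLite. The sketch below uses an in-memory database with an assumed `plays(song, artist, play_count)` table; the project's actual schema and data may differ.

```python
import sqlite3

# In-memory stand-in for the SQLite database the UI queries.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE plays (song TEXT, artist TEXT, play_count INTEGER)")
conn.executemany(
    "INSERT INTO plays VALUES (?, ?, ?)",
    [
        ("Song A", "Artist X", 120),  # made-up rows for illustration
        ("Song B", "Artist X", 80),
        ("Song C", "Artist Y", 150),
    ],
)

# Most listened-to song overall.
top_song = conn.execute(
    "SELECT song FROM plays ORDER BY play_count DESC LIMIT 1"
).fetchone()[0]

# Most listened-to artist, summing plays across all of their songs.
top_artist = conn.execute(
    "SELECT artist FROM plays GROUP BY artist "
    "ORDER BY SUM(play_count) DESC LIMIT 1"
).fetchone()[0]
```

An artist's top songs follow the same shape: filter by artist, order by `play_count` descending.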