AI 530 MSD project

Make sure the following files/folders are in the same directory:

tutorials/
MSongsDB/
MillionSongSubset/
swagmaster.db
create_track_metadata_db_custom.py

Findings

Metadata features are stronger indicators of song hotttnesss than acoustic features.
Combining a couple of diverse features does better:
- Combination of different energy calculations
- Combination of different metadata features
- Combination of different acoustic features
The raw acoustic features perform fine on the training set.
- They actually perform better than the energy measures on the training set.
- The energy measures generalize better: they do better on the test set.

Master plan

write script to build sample dataset
build another structure (pandas DataFrame?) to hold relevant fields for learning
try to predict song_hotttnesss using other features
- acoustic
  - key int,
  - tempo real,
  - loudness real,
  - time_signature int,
- metadata
  - duration real,
  - artist_familiarity real,
  - artist_hotttnesss real,
- What learning models should we try?
  - Logistic regression
  - SVM
  - kNN
Hopefully the learning models trained on the above feature set perform poorly. We then decide that some of the acoustic features should be combined into energy and danceability.
- Do some googling. Find out that ontologies represent these measures as derived values from other features:
  - energy: function of (loudness, segment stuff)
  - danceability: function of (tempo, time_signature)
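As a rough illustration of the "try a model, compare it to something trivial" step in the plan above, here is a minimal pure-Python sketch of a kNN predictor next to a mean-hotttnesss baseline. Treating the task as regression, the helper names (mean_baseline, knn_predict, mean_squared_error), and the toy rows are all assumptions for illustration, not the project's actual code or data.

```python
import math


def mean_baseline(train_targets):
    """Predict the average training hotttnesss for every song."""
    return sum(train_targets) / len(train_targets)


def knn_predict(train_features, train_targets, query, k=3):
    """Average the targets of the k training rows closest to query."""
    # Rank training rows by Euclidean distance to the query point.
    order = sorted(
        range(len(train_features)),
        key=lambda i: math.dist(train_features[i], query),
    )
    nearest = order[:k]
    return sum(train_targets[i] for i in nearest) / len(nearest)


def mean_squared_error(predictions, targets):
    """Mean of squared differences between predictions and targets."""
    return sum((p - t) ** 2 for p, t in zip(predictions, targets)) / len(targets)


# Hypothetical toy rows: (artist_familiarity, artist_hotttnesss) -> song_hotttnesss.
train_X = [(0.9, 0.8), (0.8, 0.7), (0.2, 0.1), (0.3, 0.2)]
train_y = [0.9, 0.7, 0.1, 0.3]

baseline = mean_baseline(train_y)
prediction = knn_predict(train_X, train_y, (0.85, 0.75), k=2)
print(baseline, prediction)
```

Features on different scales (tempo versus loudness, for example) would need normalizing before distances mean much. Comparing mean_squared_error for the model against the baseline on held-out songs is one way to check whether the features add anything.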

Building our dataset

This is going to be very similar to the subset_track_metadata dataset, just with more fields added.

CREATE TABLE songs (
    track_id            text PRIMARY KEY,
    title               text,
    song_id             text,
    release             text,
    artist_id           text,
    artist_mbid         text,
    artist_name         text,
    duration            real,
    artist_familiarity  real,
    artist_hotttnesss   real,
    year                int,
    track_7digitalid    int,
    shs_perf            int,  -- ???
    shs_work            int,  -- ???
    -- new ones vvv
    song_hotttnesss     real, 
    danceability        real, 
    energy              real, 
    key                 int,
    tempo               real, 
    loudness            real, 
    time_signature      int
);
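
A minimal sketch of creating this table from Python with the standard sqlite3 module. The in-memory connection stands in for swagmaster.db, and the helper names create_songs_table and column_names are assumptions for illustration; create_track_metadata_db_custom.py may do this differently.

```python
import sqlite3

# Same columns as the CREATE TABLE statement above; the last seven are the new fields.
SCHEMA = """
CREATE TABLE IF NOT EXISTS songs (
    track_id            text PRIMARY KEY,
    title               text,
    song_id             text,
    release             text,
    artist_id           text,
    artist_mbid         text,
    artist_name         text,
    duration            real,
    artist_familiarity  real,
    artist_hotttnesss   real,
    year                int,
    track_7digitalid    int,
    shs_perf            int,
    shs_work            int,
    song_hotttnesss     real,
    danceability        real,
    energy              real,
    key                 int,
    tempo               real,
    loudness            real,
    time_signature      int
);
"""


def create_songs_table(conn):
    """Create the songs table if it does not exist yet."""
    conn.execute(SCHEMA)
    conn.commit()


def column_names(conn):
    """Return the column names of songs, in declaration order."""
    # PRAGMA table_info yields one row per column; index 1 is the name.
    return [row[1] for row in conn.execute("PRAGMA table_info(songs)")]


conn = sqlite3.connect(":memory:")  # assumption: swap in a file path for a real run
create_songs_table(conn)
print(column_names(conn))
```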

Energy

energy: The feature mix we use to compute energy includes loudness and segment durations.

Danceability

danceability: We use a mix of features to compute danceability, including beat strength, tempo stability, overall tempo, and more.

Notes

Tutorial notebooks

MSD link to tutorials

tutorial_1

Shows how to iterate over the files within the MillionSongSubset
The AdditionalFiles folder has SQLite databases set up to index the contents of the /data folder
Runs through an exercise to find out which artist has the most songs in the dataset (by artist_id)
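
The walk-and-count exercise can be sketched roughly as below. This is not the tutorial's code: iter_h5_files and most_common_artist are hypothetical helpers, and get_artist_id is a caller-supplied function (in practice it would wrap whatever HDF5 reader is used, such as the getters shipped in MSongsDB).

```python
import os
from collections import Counter


def iter_h5_files(root):
    """Yield the path of every .h5 file below root, in sorted order."""
    for dirpath, dirnames, filenames in os.walk(root):
        dirnames.sort()  # makes the traversal order deterministic
        for name in sorted(filenames):
            if name.endswith(".h5"):
                yield os.path.join(dirpath, name)


def most_common_artist(paths, get_artist_id):
    """Return (artist_id, song_count) for the artist with the most files.

    get_artist_id is a caller-supplied function mapping a file path to an
    artist_id (an assumption; it is not defined here).
    """
    counts = Counter(get_artist_id(path) for path in paths)
    return counts.most_common(1)[0] if counts else None
```

Usage would look something like most_common_artist(iter_h5_files("MillionSongSubset/data"), get_artist_id), assuming the subset keeps its track files under a data folder.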

tutorial_3_track-metadata

Shows how to interface with the dataset (in db form) using sqlite.
- There are .db files in AdditionalFiles. This one uses track_metadata (subset_track_metadata.db)
subset_track_metadata.db
- Contains one table named 'songs'
- Contains the following columns
  - track_id text PRIMARY KEY,
  - title text,
  - song_id text,
  - release text,
  - artist_id text,
  - artist_mbid text,
  - artist_name text,
  - duration real,
  - artist_familiarity real,
  - artist_hotttnesss real,
  - year int
Some useful queries:
- Get all songs without MusicBrainz IDs: SELECT artist_id,artist_mbid FROM songs WHERE artist_mbid=''
- Get all distinct artists: SELECT DISTINCT artist_id, artist_name FROM songs
- Get all artists with a float column above some value: SELECT DISTINCT artist_name, artist_familiarity FROM songs WHERE artist_familiarity>.8
  - The same kind of filter can drop the tracks where hotttnesss is 0 (empty data): WHERE NOT artist_hotttnesss=0
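
A small sketch of running these queries through Python's sqlite3 module. The in-memory table and its three rows are made-up stand-ins so the snippet runs on its own; against the real data the connection would point at subset_track_metadata.db instead.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # assumption: stand-in for subset_track_metadata.db
conn.execute(
    "CREATE TABLE songs ("
    "track_id text PRIMARY KEY, artist_id text, artist_mbid text, "
    "artist_name text, artist_familiarity real, artist_hotttnesss real)"
)
# Hypothetical rows, not real MSD data.
rows = [
    ("TR0001", "AR01", "", "Artist A", 0.9, 0.5),
    ("TR0002", "AR01", "", "Artist A", 0.9, 0.5),
    ("TR0003", "AR02", "mbid-2", "Artist B", 0.4, 0.0),
]
conn.executemany("INSERT INTO songs VALUES (?, ?, ?, ?, ?, ?)", rows)

# Artists whose songs have no MusicBrainz ID.
no_mbid = conn.execute(
    "SELECT DISTINCT artist_id FROM songs WHERE artist_mbid=''"
).fetchall()

# Artists above a familiarity threshold, passed as a bound parameter.
familiar = conn.execute(
    "SELECT DISTINCT artist_name, artist_familiarity FROM songs "
    "WHERE artist_familiarity > ?",
    (0.8,),
).fetchall()

# Tracks whose artist hotttnesss is not 0 (0 is treated as empty data).
nonzero = conn.execute(
    "SELECT track_id FROM songs WHERE NOT artist_hotttnesss=0 ORDER BY track_id"
).fetchall()

print(no_mbid, familiar, nonzero)
```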

Potentially useful links

github repo for the above
"Average mean hotttnesss performs just as well LOL our features dont tell us shit" "'Everything is fucked' njdup committed on Dec 12, 2014"

ronakdpatel/ai-msd