/lastfm-scrobble-analysis

an end-to-end data pipeline extracting music listening habits and producing an insightful dashboard

Primary LanguagePython

In this data engineering project, I develop an end-to-end solution to analyse my listening history by building a pipeline to extract my scrobbles from the last.fm API, process and model the data before finally visualising the results in a Tableau dashboard.

This is a batch pipeline scheduled to run each day by extracting the previous day's scrobbles and loading them into Google Cloud Storage, to be processed further on. The pipeline can also be configured to capture scrobbles between specified date points, such as to retrieve the last few years worth of data. This then feeds into a Tableau dashboard, which aims to provide some insight into my listening habits. Docs for the API can be found here: 🎵

What is last.fm?

A website which records your listening history, with integrations into Spotify, Deezer etc. I've had an account for a very long time, so it was great to use that historic data in a project. Please support them by making an account and syncing it with your streaming app of choice.

last.fm 'scrobbles'

A scrobble is a record of what you listened to, on an individual track level, so each scrobble equals one track played. In addition to the name of the track, each scrobble also contains information about the time you listened to it, name of the artist and album, genre and so on.


After the data has been extracted from the Last.fm API, I aim to model the data in accordance with star schema principles. This involves organising the data into fact and dimension tables to facilitate efficient analysis and querying. For the analysis part, I will create a Tableau dashboard and aim to answer the following questions:

  • What sort of genres do I listen to the most?
  • At what time of the day do I listen to the most music?
  • Has my listening preferences changed over the years?
  • How many tracks do I listen to on average per day?

Data set

The data is extracted from the last.fm API, which is essentially the same listening data held on my last.fm profile. I use the user.getrecenttracks and artist.gettoptags methods, retrieving a record of my scrobbling history including: timestamps, artist, album, track, associated MusicBrainz IDs and artist genre tags.


project


To follow-along and run the pipeline in your own environment, please see the below sections. You may need to adjust some of the instructions if you're not running it on a Linux machine.

Section 1 - Environment Set-up

Section 2 - Infrastructure deployment

Section 3 - last.fm API

Section 4 - Mage Orchestration

Section 5 - Modelling with dbt

Section 6 - Data Visualisation

Thoughts and improvements