The Music Listening Histories Dataset (MLHD) is a large-scale collection of music listening events assembled from more than 27 billion time-stamped logs extracted from Last.fm.
The dataset comprises 583k users, 555k unique artists, 900k albums, and 7M tracks. Each scrobble is represented in the following format: `<timestamp, artist-MBID, release-MBID, recording-MBID>`
Research Paper: https://simssa.ca/assets/files/gabriel-MLHD-ismir2017.pdf
Download the dataset from: https://ddmal.music.mcgill.ca/research/The_Music_Listening_Histories_Dataset_(MLHD)/
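For orientation, here's a minimal sketch of loading one user's listening history with pandas. It assumes the per-user files are gzip-compressed, tab-separated, and use Unix-epoch timestamps; the file name is purely illustrative:

```python
import pandas as pd

COLUMNS = ["timestamp", "artist_mbid", "release_mbid", "recording_mbid"]

# Hypothetical path to one user's scrobble log (MLHD ships one file per user).
history = pd.read_csv(
    "some-user-uuid.txt.gz",
    sep="\t",              # assuming tab-separated fields
    names=COLUMNS,
    compression="gzip",
)

# Assuming Unix-epoch timestamps, convert to datetimes for readability.
history["timestamp"] = pd.to_datetime(history["timestamp"], unit="s")
print(history.head())
```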
Unfortunately, the original MLHD has some significant shortcomings: Last.fm's out-of-date matching against the MusicBrainz database produced frequent mismatches and errors in the MBID data, degrading the quality of the available dataset.
Overall, the goal of this project is to create an updated version of the MLHD in the same format as the original, but with incorrect data resolved and invalid data removed.
- Clone the repo, and set up a Python 3 virtual environment using: `python3 -m venv env`
- Activate the virtual environment using: `. env/bin/activate`
- Install the required packages using: `pip install -r requirements.txt`
- Download the original MLHD dataset from: https://ddmal.music.mcgill.ca/research/The_Music_Listening_Histories_Dataset_(MLHD)/
- Copy `config.py.sample` to `config.py` and update parameters as required.
- Run `python gen_tables.py` to generate the required tables for the dataset.
- Ready to go!
- `clean_master.py` - Cleans the dataset.
- `rec_track_checker.py` - Loops through the dataset, checks whether any artist_mbid appears in the recording_mbid column, and converts every file from CSV+GZIP to CSV+ZSTD. (A sketch of the re-compression step follows this list.)
- `gen_tables.py` - Generates the required tables for the dataset.
- `config.py` - Configuration file for the project. (Can also be run as a script to set up the project in case the scripts don't cover it already.)
- `lib/gen_test_paths.py` - Utility to generate a random set of paths for testing.
  - Usage: `python lib/gen_test_paths.py <num_paths> <output_path>`
- `mapper_gen_names.py` - Utility that cleans mlhd_recording_mbid and fetches (rec_name, artist_credit) for a given set of files.
- `mapper.py` - Takes a set of random file paths and generates a report of the mapping results. (Useful for testing the MBC mapper.)
- `test_arrow_vsd_pandas.ipynb` - Tests the performance of the Arrow library vs. pandas for reading the dataset.
- `test_csv_parser.ipynb` - Experimental notebook for testing custom vectorized CSV parsers in Python.
- `test_file_type_io_testing.ipynb` - Tests read time, write time, and file size for CSV+GZIP, CSV+ZSTD, Parquet+ZSTD, and Parquet+Snappy files.
- `test_file_type_io_testing_sql.ipynb` - Tests read time, write time, and file size for SQL tables dumped as Snappy+Parquet, ZSTD+Parquet, and ZSTD10+Parquet files.
- `test_mapper.py` - An experimental script to test different mappers.
- `test_MLHD_conflation_mapping.ipynb` - A predecessor to `mapper.py`. Uses the MBID-Mapping API to check if a recording_mbid corresponds to a given artist_mbid. Similar to `mapper.py`, but uses the API instead of the local database.
- `test_MLHD_conflation.ipynb` - A predecessor to `mapper.py`. Cleans recording_mbids, and fetches artist name and artist credit for cleaned recording_mbids.
- `test_MLHD_old.ipynb` - Legacy notebook for testing the original MLHD dataset. Surveys many of the issues with the dataset.
- `test_rec_track_checker.ipynb` - Experimental notebook for testing the `rec_track_checker.py` script.
- `test_write_arrow.ipynb` - Compares the performance of writing CSV+ZSTD files using the Arrow library vs. `pd.to_csv()`.
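Since `rec_track_checker.py` re-compresses every file from GZIP to ZSTD, here's a rough sketch of just that step using the third-party `zstandard` package. File names are placeholders, and the real script also validates the MBID columns as it streams each file:

```python
import gzip
import zstandard as zstd  # pip install zstandard

def gzip_csv_to_zstd(src: str, dst: str, level: int = 10) -> None:
    """Stream a .csv.gz file into a .csv.zst file without loading it fully."""
    cctx = zstd.ZstdCompressor(level=level)
    with gzip.open(src, "rb") as f_in, open(dst, "wb") as f_out:
        # copy_stream() reads decompressed bytes from f_in and writes
        # zstd-compressed bytes to f_out in chunks.
        cctx.copy_stream(f_in, f_out)

gzip_csv_to_zstd("listens.csv.gz", "listens.csv.zst")
```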
This project has been pretty fun to work on, and I've learned a lot along the way.
I've compiled all my learnings and the resources I used throughout this project in the following sections. Hope you find them useful!
- Don’t rely on autosave.
- Getting up and moving around every 45 minutes aids in debugging.
- Use `pandas.DataFrame.isin()` to make boolean masks. It uses Cython and reaches C-level speeds, so it's faster than `loc` or `iloc`. (See the first sketch after this list.)
- Use `IPython.display.Markdown` for making dynamic dashboards within Jupyter Notebook!
- [With enough work, you can speed up Python code by ~596915%.](http://shvbsle.in/computers-are-fast-but-you-dont-know-it-p1/)
- Sometimes even replacing Pandas with custom Python functions can speed up the process by ~9900%.
- Numba works best when used mindfully to optimize specific low-level functions; it's not as easy as just slapping a decorator on every function.
- Use the with statement when making connections with SQL (probably a best practice?), e.g. `with sqlite3.connect(config.DB_PATH) as conn:`
- Use `isinstance()` instead of `type()` to check the type of an object. (Again, a best practice; it respects the Liskov substitution principle.)
- Use `is` instead of `==` for comparing against `None`, `True`, and `False` in Python.
- Lambda functions are overpowered.
- Use caching when running slow queries. It saves time and compute power on the API.
- Think before brute-forcing through an issue. It saves time in the long run.
- Use `os.walk()` for directory trees.
- Use `os.path.join()` for path concatenation instead of string concatenation.
- Use `time.monotonic()` instead of `time.time()` for measuring time. (It's more accurate, and doesn't change with system time.)
- Unit tests are powerful (but painful to write).
- Learnt more about Dask, Vaex, and other data processing libraries.
- Apparently, `time.perf_counter()` is more accurate than `time.monotonic()` with only a small performance hit. It's also what the `timeit` module uses by default. (See the timing sketch after this list.)
- Sometimes a simple restart can solve seemingly impossible issues.
- Don’t make calculation mistakes when calculating calculation time.
- Keep a process running after terminating an SSH session using `tmux`.
- The Auto Docstring extension for VS Code is OP.
- Apache Arrow is ~7x faster than Pandas for reading and writing CSV files! (See the Arrow sketch after this list.)
- Use a main guard (`if __name__ == "__main__":`) when running scripts.
- Learnt about Modin as a drop-in replacement for Pandas for faster performance.
- Learnt how to fetch a list of variables from an imported Python module.
- In Pandas/NumPy, numeric dtypes include `int`, `float`, `datetime`, `bool`, and `category`. They exclude the `object` dtype and can be held in contiguous memory blocks, i.e. faster performance. (reference)
- Vectorized loops are a LOT faster than simple for loops. `pandas.DataFrame.apply()` is unvectorized under the hood.
- `DataFrame.at` is way faster than `DataFrame.loc`: `loc` returns the whole row, while `at` returns only a single value. Even running `at` twice to fetch two values is still ~55x faster than running `loc` once to fetch the complete row. (See the indexing sketch after this list.)
- `pandas.DataFrame.to_sql()` is slow. Use `pandas.DataFrame.to_csv()` and `psycopg2` to write to SQL instead. (See the COPY sketch after this list.)
- `pandas.read_sql()` is ridiculously slow for some reason. `pd.read_sql()` with a `psycopg2` connector is 80% faster than `pd.read_sql()` with an `SQLAlchemy` connector.
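First sketch, combining two of the tips above: `os.walk()` for directory trees and `DataFrame.isin()` for boolean masks. The paths, column names, and MBID are illustrative, and reading `.csv.zst` directly assumes a pandas version (>= 1.4) with `zstandard` installed:

```python
import os
import pandas as pd

DATA_DIR = "MLHD/"  # hypothetical dataset root

# os.walk() yields (root, dirs, files) for every directory in the tree;
# os.path.join() handles separators so we never concatenate strings by hand.
paths = [
    os.path.join(root, name)
    for root, _dirs, files in os.walk(DATA_DIR)
    for name in files
    if name.endswith(".csv.zst")
]

df = pd.read_csv(
    paths[0],
    sep="\t",
    names=["timestamp", "artist_mbid", "release_mbid", "recording_mbid"],
)

# isin() builds the boolean mask in Cython, at near-C speed.
known_artists = {"b10bbbfc-cf9e-42e0-be17-e2c3e1d2600d"}  # example MBID (The Beatles)
mask = df["artist_mbid"].isin(known_artists)
print(mask.sum(), "rows matched")
```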
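The timing sketch: `time.perf_counter()` is also what `timeit` uses under the hood (`timeit.default_timer` is `time.perf_counter`):

```python
import time

start = time.perf_counter()
total = sum(i * i for i in range(1_000_000))  # stand-in workload
elapsed = time.perf_counter() - start
print(f"took {elapsed:.4f} s")
```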
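The Arrow sketch, as a minimal CSV round trip. File names are placeholders; wrapping the output in a zstd-compressed stream is how you'd produce the CSV+ZSTD files used elsewhere in this project:

```python
import pyarrow as pa
from pyarrow import csv

table = csv.read_csv("listens.csv")  # multithreaded C++ reader

# Write back out through a zstd-compressed stream to get CSV+ZSTD.
csv.write_csv(table, pa.output_stream("listens.csv.zst", compression="zstd"))

df = table.to_pandas()  # hand off to pandas when a DataFrame is needed
```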
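The indexing sketch: `.at` fetches one scalar while `.loc` materializes a whole row, which is where the speed difference comes from (illustrative frame):

```python
import pandas as pd

df = pd.DataFrame({"a": range(1_000_000), "b": range(1_000_000)})

row = df.loc[500]    # materializes the whole row as a Series
a = df.at[500, "a"]  # fetches a single scalar
b = df.at[500, "b"]  # a second scalar; two .at hits can still beat one .loc
```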
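And the COPY sketch behind the `to_sql()` tip. The connection string, table, and columns are placeholders, not the project's actual schema:

```python
import io
import pandas as pd
import psycopg2

df = pd.DataFrame({
    "artist_mbid": ["00000000-0000-0000-0000-000000000001"],     # dummy MBIDs
    "recording_mbid": ["00000000-0000-0000-0000-000000000002"],
})

# Serialize to an in-memory CSV buffer with to_csv() ...
buf = io.StringIO()
df.to_csv(buf, index=False, header=False)
buf.seek(0)

# ... then bulk-load it with PostgreSQL's COPY, which is far faster than
# the INSERT-based path that DataFrame.to_sql() takes.
with psycopg2.connect("dbname=mlhd user=postgres") as conn:
    with conn.cursor() as cur:
        cur.copy_expert(
            "COPY listens (artist_mbid, recording_mbid) FROM STDIN WITH CSV",
            buf,
        )
```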
- https://www.kdnuggets.com/2021/03/pandas-big-data-better-options.html
- https://towardsdatascience.com/ten-reasons-to-use-staticframe-instead-of-pandas-f368cc81e50a
- https://stackoverflow.com/questions/58595166/how-to-compress-parquet-file-with-zstandard-using-pandas
- https://www.webucator.com/article/python-clocks-explained/
- https://askubuntu.com/questions/8653/how-to-keep-processes-running-after-ending-ssh-session/220880#220880
- https://colab.research.google.com/drive/1UjD5Nsm_2fX2zp5QCu0DOtEGAe1cJy_I#scrollTo=FVmoLlUGd0w3
- https://arrow.apache.org/use_cases/
- Talk - Deepak K Gupta: Speed Up Data Access with PyArrow Apache Arrow Data is the new API - YouTube
- https://towardsdatascience.com/stop-using-pandas-to-read-write-data-this-alternative-is-7-times-faster-893301633475
- G-Research Distinguished Speaker Series: Apache Arrow - High Performance Columnar Data Framework - YouTube
- https://ys-l.github.io/posts/2015/08/28/how-not-to-use-pandas-apply/
- https://towardsdatascience.com/do-you-use-apply-in-pandas-there-is-a-600x-faster-way-d2497facfa66
- https://kanoki.org/2022/02/11/how-to-return-multiple-columns-using-pandas-apply/
- https://stackoverflow.com/questions/19578308/what-is-the-benefit-of-using-main-method-in-python
- https://modin.readthedocs.io/en/latest/
- https://datascience.stackexchange.com/questions/172/is-there-a-straightforward-way-to-run-pandas-dataframe-isin-in-parallel
- https://stackoverflow.com/questions/9759820/how-to-get-a-list-of-variables-in-specific-python-module
- http://shvbsle.in/computers-are-fast-but-you-dont-know-it-p1/
- https://pandas.pydata.org/docs/user_guide/enhancingperf.html
- https://stackoverflow.com/questions/14991710/is-concurrent-futures-a-medicine-of-the-gil
- https://superfastpython.com/threadpoolexecutor-vs-gil/
- https://pnavaro.github.io/big-data/14-FileFormats.html
- https://medium.com/@rbmsingh/to-hdf-or-not-is-the-question-e56b684092b7
- https://www.microsoft.com/en-us/research/publication/columnar-storage-formats/
- https://www.orchest.io/blog/the-great-python-dataframe-showdown-part-1-demystifying-apache-arrow