This project explores the application of Chromagram Self-Similarity Matrices (SSMs) for segmenting music into its structural components such as intro, verse, chorus, and outro. Building on the methodology presented by Mauch, M., Noland, K., & Dixon, S. (2009) in their paper "Using Musical Structure to Enhance Automatic Chord Transcription" from the Proceedings of the 10th International Society for Music Information Retrieval Conference (ISMIR 2009) [1], this project adapts and extends their techniques to focus on the automatic segmentation of musical structures.
The primary goal is to investigate how SSMs, when combined with specific parameters and selection algorithms, can effectively identify and segment whole musical structures. Future work will focus on evaluating the segments generated by our methods through empirical testing against a dataset of ground-truth labels. For now, this repository primarily serves as a proof of concept and a set of visualization tools.
Audio files are preprocessed to extract harmonic components, and beat tracking is used to generate beat and meter grids. Chromagrams are extracted, transformed to be key-invariant using the Krumhansl-Schmuckler algorithm, and then synced to a beat-grid, creating a beat-synchronous key-invariant chromagram.
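The key-invariance step of this pipeline can be sketched with plain NumPy. This is a minimal illustration, not the repository's implementation: it uses only the standard Krumhansl major-key profile (the Krumhansl-Schmuckler algorithm also defines a minor profile), and the rotation convention (tonic mapped to bin 0) is an assumption.

```python
import numpy as np

# Krumhansl-Schmuckler major-key profile (Krumhansl, 1990).
KS_MAJOR = np.array([6.35, 2.23, 3.48, 2.33, 4.38, 4.09,
                     2.52, 5.19, 2.39, 3.66, 2.29, 2.88])

def estimate_key(chroma):
    """Estimate the tonic pitch class (0-11) of a (12, n_frames) chromagram
    by correlating its mean chroma vector against all 12 rotations of the
    Krumhansl-Schmuckler major profile."""
    mean_chroma = chroma.mean(axis=1)
    scores = [np.corrcoef(mean_chroma, np.roll(KS_MAJOR, k))[0, 1]
              for k in range(12)]
    return int(np.argmax(scores))

def key_invariant(chroma):
    """Rotate the chromagram so the estimated tonic lands on bin 0,
    making chromagrams from different keys directly comparable."""
    return np.roll(chroma, -estimate_key(chroma), axis=0)
```

The beat synchronization can then be performed with `librosa.util.sync` over the tracked beat frames.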
We transpose the chromagram and compute the Pearson correlation coefficients between every pair of chroma vectors to construct a beat-wise self-similarity matrix.
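A minimal sketch of this construction, where `np.corrcoef` computes all pairwise Pearson correlations in one call:

```python
import numpy as np

def chroma_ssm(chroma):
    """Beat-wise self-similarity matrix: Pearson correlation between every
    pair of beat-synchronous chroma vectors. `chroma` has shape
    (12, n_beats); the result has shape (n_beats, n_beats)."""
    # np.corrcoef treats rows as variables, so transposing puts each
    # beat's chroma vector on its own row.
    return np.corrcoef(chroma.T)
```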
Two types of SSMs are produced and evaluated against each other:
- Median Filtered SSM: A diagonal median filter of `size=5` is applied to smooth and emphasize structural repetitions. This is in alignment with the methods of [1].
- Enhanced SSM: We use `librosa.segment.path_enhance()` for path enhancement, inspired by the multi-angle path enhancement approach detailed by Müller and Kurth [2]. Unlike their method, which models tempo differences by re-sampling the underlying features before generating the SSM, `librosa.segment.path_enhance()` directly enhances continuity along diagonal paths within the existing similarity matrix. This direct enhancement of the SSM facilitates the identification of repeating structures without altering the chromagram data.
- For each pair of beat indices `(i, j)` within the SSM, where `j` is greater than `i` to prevent self-comparison, we explore potential repeating segments of varying lengths, ranging from `min_length=32` to `max_length=128`.
- For each potential repeating segment, we calculate its similarity score by extracting the diagonal segment `D_{i,j,l}` from the SSM and computing the quantile value of its elements at a predetermined percentile (`p=0.1`). If this value exceeds a similarity threshold (`ts=0.6`), we consider the segment a repetition.
- Identified repetitions are organized into a list, with each entry containing the start index `i`, the corresponding match index `j`, the segment length `l`, and the similarity score (i.e., the quantile of the Pearson correlation coefficients along the diagonal).
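A brute-force sketch of this search, following the description directly (the nested loops are not optimized and the function name is illustrative):

```python
import numpy as np

def find_repetitions(ssm, min_length=32, max_length=128, p=0.1, ts=0.6):
    """Scan diagonal segments of the SSM for likely repetitions.
    For a candidate pair (i, j) with j > i and length l, the score is the
    p-quantile of the diagonal values of ssm[i:i+l, j:j+l]; segments whose
    score exceeds the threshold `ts` are kept as (i, j, l, score)."""
    n = ssm.shape[0]
    repetitions = []
    for i in range(n):
        for j in range(i + 1, n):
            for l in range(min_length, max_length + 1):
                if j + l > n:  # segment would run past the end of the song
                    break
                diag = np.diagonal(ssm[i:i + l, j:j + l])
                score = np.quantile(diag, p)
                if score > ts:
                    repetitions.append((i, j, l, float(score)))
    return repetitions
```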
We utilize two greedy algorithms to select the most prominent repeating segments from the list of repetitions based on the criteria of segment length, similarity score, and non-overlap, with each algorithm adopting a distinct strategy to prioritize and select segments. We adopt parameters aligned with the goal of segmenting musical structures rather than identifying and transcribing chord patterns as in the referenced paper [1].
This algorithm selects segments prioritizing the product of length and similarity score (`l * score`), adjusted by a `length_multiplier` to fine-tune the balance between segment length and its similarity score. The algorithm operates as follows:
- Sorting: Sort segments in descending order based on the product of their length, similarity score, and the length multiplier.
- Selection: Iteratively select segments, ensuring they do not overlap with previously selected segments and lie within the song's total frames. This method aims to capture the most significant and distinct repeating structures throughout the song.
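The steps above can be sketched as follows; treating both occurrences of a repetition as claimed frames for the overlap check is an assumption for illustration:

```python
def select_by_weighted_score(repetitions, n_frames, length_multiplier=1.0):
    """Greedy selection: rank candidate repetitions (i, j, l, score) by
    l * score * length_multiplier, then keep each one whose frames
    (both occurrences) do not overlap anything already selected."""
    ranked = sorted(repetitions,
                    key=lambda r: r[2] * r[3] * length_multiplier,
                    reverse=True)
    covered = set()
    selected = []
    for i, j, l, score in ranked:
        if j + l > n_frames:  # must lie within the song's total frames
            continue
        frames = set(range(i, i + l)) | set(range(j, j + l))
        if frames & covered:  # skip overlapping candidates
            continue
        covered |= frames
        selected.append((i, j, l, score))
    return selected
```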
This algorithm differs in that it prioritizes maximizing the coverage of repeating segments across the song. The selection process favors segments that cover more frames without overlapping with already selected segments.
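A corresponding sketch of the coverage-oriented variant, again under the assumption that both occurrences of a repetition count toward coverage and overlap:

```python
def select_by_coverage(repetitions, n_frames):
    """Greedy coverage-oriented selection: repeatedly pick the candidate
    (i, j, l, score) whose two occurrences cover the most frames,
    discarding anything that overlaps the selection made so far."""
    covered = set()
    selected = []
    candidates = list(repetitions)
    while True:
        # Drop candidates that run past the song or overlap the selection.
        candidates = [
            (i, j, l, s) for i, j, l, s in candidates
            if j + l <= n_frames
            and not ((set(range(i, i + l)) | set(range(j, j + l))) & covered)
        ]
        if not candidates:
            return selected
        # A repetition of length l covers 2 * l frames, so the longest
        # remaining candidate adds the most coverage.
        i, j, l, s = max(candidates, key=lambda r: r[2])
        covered |= set(range(i, i + l)) | set(range(j, j + l))
        selected.append((i, j, l, s))
```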
After identifying and selecting repeating segments within a song using the greedy selection algorithms, the segment boundaries are combined and smoothed out to create a cohesive structural segmentation of the entire song. Below are chromagrams for the various SSM types with segmentation lines produced by the selection algorithms.
References:
[1] Mauch, M., Noland, K., & Dixon, S. (2009). Using Musical Structure to Enhance Automatic Chord Transcription. In Proceedings of the 10th International Society for Music Information Retrieval Conference (ISMIR 2009).
[2] Müller, M., & Kurth, F. (2006). Enhancing Similarity Matrices for Music Audio Analysis. In Proceedings of the 2006 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2006), Vol. 5. IEEE.