This project explores the application of Chromagram Self-Similarity Matrices (SSMs) for segmenting music into its structural components such as intro, verse, chorus, and outro. Building on the methodology presented by Mauch, M., Noland, K., & Dixon, S. (2009) in their paper "Using Musical Structure to Enhance Automatic Chord Transcription" from the Proceedings of the 10th International Society for Music Information Retrieval Conference (ISMIR 2009) [1], this project adapts and extends their techniques to focus on the automatic segmentation of musical structures.
The primary goal is to investigate how SSMs, when combined with specific parameters and selection algorithms, can effectively identify and segment whole musical structures. Future work will focus on evaluating the segments generated by our methods through empirical testing against a dataset of ground-truth labels. For now, this repository primarily serves as a proof of concept and a set of visualization tools.
Audio files are preprocessed to extract harmonic components, and beat tracking is used to generate beat and meter grids. Chromagrams are extracted, transformed to be key-invariant using the Krumhansl-Schmuckler algorithm, and then synced to a beat-grid, creating a beat-synchronous key-invariant chromagram.
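The key-invariance step of this pipeline can be sketched with plain NumPy. This is a minimal illustration, not the repository's implementation: it uses only the standard Krumhansl major-key profile (the Krumhansl-Schmuckler algorithm also defines a minor profile), and the rotation convention (tonic mapped to bin 0) is an assumption.

```python
import numpy as np

# Krumhansl-Schmuckler major-key profile (Krumhansl, 1990).
KS_MAJOR = np.array([6.35, 2.23, 3.48, 2.33, 4.38, 4.09,
                     2.52, 5.19, 2.39, 3.66, 2.29, 2.88])

def estimate_key(chroma):
    """Estimate the tonic pitch class (0-11) of a (12, n_frames) chromagram
    by correlating its mean chroma vector against all 12 rotations of the
    Krumhansl-Schmuckler major profile."""
    mean_chroma = chroma.mean(axis=1)
    scores = [np.corrcoef(mean_chroma, np.roll(KS_MAJOR, k))[0, 1]
              for k in range(12)]
    return int(np.argmax(scores))

def key_invariant(chroma):
    """Rotate the chromagram so the estimated tonic lands on bin 0,
    making chromagrams from different keys directly comparable."""
    return np.roll(chroma, -estimate_key(chroma), axis=0)
```

The beat synchronization can then be performed with `librosa.util.sync` over the tracked beat frames.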
We transpose the chromagram and compute the Pearson correlation coefficients between every pair of chroma vectors to construct a beat-wise self-similarity matrix.
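A minimal sketch of this construction, where `np.corrcoef` computes all pairwise Pearson correlations in one call:

```python
import numpy as np

def chroma_ssm(chroma):
    """Beat-wise self-similarity matrix: Pearson correlation between every
    pair of beat-synchronous chroma vectors. `chroma` has shape
    (12, n_beats); the result has shape (n_beats, n_beats)."""
    # np.corrcoef treats rows as variables, so transposing puts each
    # beat's chroma vector on its own row.
    return np.corrcoef(chroma.T)
```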
Two types of SSMs are produced and evaluated against each other:
- Median Filtered SSM: A diagonal median filter of `size=5` is applied to smooth and emphasize structural repetitions. This is in alignment with the methods of [1].
- Enhanced SSM: We use `librosa.segment.path_enhance()` for path enhancement, inspired by the multi-angle path enhancement approach detailed by Müller and Kurth [2]. Unlike their method, which models tempo differences by re-sampling the underlying features before generating the SSM, `librosa.segment.path_enhance()` directly enhances continuity along diagonal paths within the existing similarity matrix. This direct enhancement of the SSM facilitates the identification of repeating structures without altering the chromagram data.
- For each pair of beat indices `(i, j)` within the SSM, where `j` is greater than `i` to prevent self-comparison, we explore potential repeating segments of varying lengths, ranging from `min_length=32` to `max_length=128`.
- For each potential repeating segment, we calculate its similarity score by extracting the diagonal segment `D_{i,j,l}` from the SSM and computing the quantile value of its elements at a predetermined percentile (`p=0.1`). If this value exceeds a similarity threshold (`ts=0.6`), we consider the segment a repetition.
- Identified repetitions are organized into a list, with each entry containing the start index `i`, the corresponding match index `j`, the segment length `l`, and the similarity score (i.e., the quantile of the Pearson correlation coefficients along the diagonal).
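A brute-force sketch of this search, following the description directly (the nested loops are not optimized and the function name is illustrative):

```python
import numpy as np

def find_repetitions(ssm, min_length=32, max_length=128, p=0.1, ts=0.6):
    """Scan diagonal segments of the SSM for likely repetitions.
    For a candidate pair (i, j) with j > i and length l, the score is the
    p-quantile of the diagonal values of ssm[i:i+l, j:j+l]; segments whose
    score exceeds the threshold `ts` are kept as (i, j, l, score)."""
    n = ssm.shape[0]
    repetitions = []
    for i in range(n):
        for j in range(i + 1, n):
            for l in range(min_length, max_length + 1):
                if j + l > n:  # segment would run past the end of the song
                    break
                diag = np.diagonal(ssm[i:i + l, j:j + l])
                score = np.quantile(diag, p)
                if score > ts:
                    repetitions.append((i, j, l, float(score)))
    return repetitions
```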
We utilize two greedy algorithms to select the most prominent repeating segments from the list of repetitions based on the criteria of segment length, similarity score, and non-overlap, with each algorithm adopting a distinct strategy to prioritize and select segments. We adopt parameters aligned with the goal of segmenting musical structures rather than identifying and transcribing chord patterns as in the referenced paper [1].
This algorithm selects segments prioritizing the product of length and similarity score (`l * score`), adjusted by a `length_multiplier` to fine-tune the balance between segment length and its similarity score. The algorithm operates as follows:
- Sorting: Sort segments in descending order based on the product of their length, similarity score, and the length multiplier.
- Selection: Iteratively select segments, ensuring they do not overlap with previously selected segments and lie within the song's total frames. This method aims to capture the most significant and distinct repeating structures throughout the song.
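The steps above can be sketched as follows; treating both occurrences of a repetition as claimed frames for the overlap check is an assumption for illustration:

```python
def select_by_weighted_score(repetitions, n_frames, length_multiplier=1.0):
    """Greedy selection: rank candidate repetitions (i, j, l, score) by
    l * score * length_multiplier, then keep each one whose frames
    (both occurrences) do not overlap anything already selected."""
    ranked = sorted(repetitions,
                    key=lambda r: r[2] * r[3] * length_multiplier,
                    reverse=True)
    covered = set()
    selected = []
    for i, j, l, score in ranked:
        if j + l > n_frames:  # must lie within the song's total frames
            continue
        frames = set(range(i, i + l)) | set(range(j, j + l))
        if frames & covered:  # skip overlapping candidates
            continue
        covered |= frames
        selected.append((i, j, l, score))
    return selected
```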
This algorithm differs in that it prioritizes maximizing the coverage of repeating segments across the song. The selection process favors segments that cover more frames without overlapping with already selected segments.
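A corresponding sketch of the coverage-oriented variant, again under the assumption that both occurrences of a repetition count toward coverage and overlap:

```python
def select_by_coverage(repetitions, n_frames):
    """Greedy coverage-oriented selection: repeatedly pick the candidate
    (i, j, l, score) whose two occurrences cover the most frames,
    discarding anything that overlaps the selection made so far."""
    covered = set()
    selected = []
    candidates = list(repetitions)
    while True:
        # Drop candidates that run past the song or overlap the selection.
        candidates = [
            (i, j, l, s) for i, j, l, s in candidates
            if j + l <= n_frames
            and not ((set(range(i, i + l)) | set(range(j, j + l))) & covered)
        ]
        if not candidates:
            return selected
        # A repetition of length l covers 2 * l frames, so the longest
        # remaining candidate adds the most coverage.
        i, j, l, s = max(candidates, key=lambda r: r[2])
        covered |= set(range(i, i + l)) | set(range(j, j + l))
        selected.append((i, j, l, s))
```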
After identifying and selecting repeating segments within a song using the greedy selection algorithms, the segment boundaries are combined and smoothed out to create a cohesive structural segmentation of the entire song. Below are chromagrams for the various SSM types with segmentation lines produced by the selection algorithms.
References:
[1] Mauch, M., Noland, K., & Dixon, S. (2009). Using Musical Structure to Enhance Automatic Chord Transcription. In Proceedings of the 10th International Society for Music Information Retrieval Conference (ISMIR 2009).
[2] Müller, M., & Kurth, F. (2006). Enhancing Similarity Matrices for Music Audio Analysis. In Proceedings of the 2006 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2006), Vol. 5. IEEE.