scorealign -- a program for audio-to-audio and audio-to-midi alignment

Last updated 10 May 2013 by RBD

Contributors include:

  • Ning Hu
  • Roger B. Dannenberg
  • Joshua Hailpern
  • Umpei Kurokawa
  • Greg Wakefield
  • Mark Bartsch

scorealign works by computing chromagrams of the two sources. MIDI chromagrams are estimated directly from pitch data without synthesis. A similarity matrix is constructed, and dynamic programming finds the lowest-cost path through the matrix.
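
As an illustration of estimating a chromagram from pitch data without synthesis, here is a minimal sketch. The Note struct, the midi_chroma name, and the choice to weight each pitch class by how long the note sounds within a frame are all assumptions made for this example; they are not taken from gen_chroma.cpp.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical note record: times in seconds, pitch as a MIDI key number.
struct Note { double start; double duration; int pitch; };

using Chroma = std::array<double, 12>;  // one bin per pitch class

// Builds one chroma vector per frame. Each note adds the time it overlaps
// the frame to the bin of its pitch class (pitch modulo 12).
std::vector<Chroma> midi_chroma(const std::vector<Note> &notes,
                                double frame_period,
                                std::size_t frame_count) {
    assert(frame_period > 0.0);
    std::vector<Chroma> frames(frame_count, Chroma{});
    for (const Note &note : notes) {
        assert(note.pitch >= 0);
        const std::size_t bin = static_cast<std::size_t>(note.pitch % 12);
        for (std::size_t f = 0; f < frame_count; ++f) {
            const double frame_start = static_cast<double>(f) * frame_period;
            const double frame_end = frame_start + frame_period;
            const double overlap =
                std::min(frame_end, note.start + note.duration) -
                std::max(frame_start, note.start);
            if (overlap > 0.0) frames[f][bin] += overlap;
        }
    }
    return frames;
}
```

With a frame period of 0.25 seconds, a half-second note on middle C (key 60) would contribute 0.25 to pitch class 0 in each of the first two frames.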

The alignment can optionally skip the initial silence and final silence frames in both files. The "best" path matches from the beginning times (with or without silence) to the end of either sequence but not necessarily to the end of both. In other words, the match will match all of the first file to an initial segment of the second, or it will match all of the second to an initial segment of the first.
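
The path search and its end condition can be sketched as below. This is not the code in scorealign.cpp: the Euclidean distance, the three allowed steps (diagonal, horizontal, vertical) with equal weight, and the names chroma_distance and best_path_cost are assumptions, and the sketch compares raw total costs without any normalization for path length.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <limits>
#include <vector>

using Chroma = std::array<double, 12>;

// Euclidean distance between two chroma frames (an assumed metric).
double chroma_distance(const Chroma &a, const Chroma &b) {
    double sum = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i) {
        const double d = a[i] - b[i];
        sum += d * d;
    }
    return std::sqrt(sum);
}

// Lowest total cost of a path that starts at frame (0, 0) and stops as
// soon as it has consumed all of x or all of y.
double best_path_cost(const std::vector<Chroma> &x,
                      const std::vector<Chroma> &y) {
    assert(!x.empty() && !y.empty());
    const std::size_t n = x.size(), m = y.size();
    const double inf = std::numeric_limits<double>::infinity();
    std::vector<std::vector<double>> cost(n, std::vector<double>(m, inf));
    for (std::size_t i = 0; i < n; ++i) {
        for (std::size_t j = 0; j < m; ++j) {
            const double d = chroma_distance(x[i], y[j]);
            if (i == 0 && j == 0) { cost[i][j] = d; continue; }
            double best = inf;  // cheapest predecessor cell
            if (i > 0) best = std::min(best, cost[i - 1][j]);
            if (j > 0) best = std::min(best, cost[i][j - 1]);
            if (i > 0 && j > 0) best = std::min(best, cost[i - 1][j - 1]);
            cost[i][j] = d + best;
        }
    }
    // The path may end anywhere in the last row (all of x matched) or
    // anywhere in the last column (all of y matched).
    double best_end = inf;
    for (std::size_t i = 0; i < n; ++i) best_end = std::min(best_end, cost[i][m - 1]);
    for (std::size_t j = 0; j < m; ++j) best_end = std::min(best_end, cost[n - 1][j]);
    return best_end;
}
```

If y is identical to the first frames of x, the cheapest path ends in the last column at zero cost, which corresponds to matching all of the second file to an initial segment of the first.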

Output includes a map from one version to the other. If one file is MIDI, output also includes (1) an estimated transcript in ASCII format with time, pitch, MIDI channel, and duration of each note in the audio file, (2) a time-aligned MIDI file, and (3) a text file with beat times.

scorealign uses libsndfile (http://www.mega-nerd.com/libsndfile/). You must install libsndfile to build scorealign.

For Macintosh OS X, use Xcode to open scorealign.xcodeproj.

For Linux, use "make -f Makefile.linux".

For Windows, open scorealign-vc2010.sln. (This is set up to use my locally built copies of libsndfile, libogg, libvorbis, and libFLAC. These are such a pain on Windows, that I actually used a different Visual C++ solution file for Nyquist that includes projects to build all these libraries. You can find Nyquist on SourceForge, or you can build the libraries some other way. Note that my projects are set up to use 8-bit ASCII rather than Unicode or other.)

Command line parameters:

scorealign [-<flags> [<period> <windowsize> <path> <smooth> 
           <trans> <midi> <beatmap> <image>]] 
                 <file1> [<file2>]

Specifying only <file1> simply transcribes the MIDI in <file1> to transcription.txt. Otherwise, align <file1> and <file2>. Flags are all listed together, e.g. -hwrstm, followed by filenames and arguments corresponding to the flags in the order the flags are given. Do not try something like "-h 0.1 -w 0.25". Instead, use "-hw 0.1 0.25". The flags are:

  • -h 0.25 indicates a frame period of 0.25 seconds
  • -w 0.25 indicates a window size of 0.25 seconds.
  • -r indicates filename to write raw alignment path to (default path.data)
  • -s is filename to write smoothed alignment path (default is smooth.data)
  • -t is filename to write the time aligned transcription (default is transcription.txt)
  • -m is filename to write the time aligned midi file (default is midi.mid)
  • -b is filename to write the time aligned beat times (default is beatmap.txt)
  • -i is filename to write an image of the distance matrix (default is distance.pnm)
  • -o 2.0 indicates a smoothing window of 2.0s
  • -p 3.0 means pre-smooth with a 3s window
  • -x 6.0 indicates 6s line segment approximation
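
The convention of grouping all flags into one word and then listing their arguments in the same order can be sketched as follows. pair_flags is a hypothetical helper written for this example, not the parser in main.cpp, and it assumes that every flag takes exactly one argument, as the list above suggests.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <stdexcept>
#include <string>
#include <vector>

// Pairs the i-th flag letter of a grouped flag word such as "-hw" with the
// i-th entry of `args`. Entries beyond the flags (the file names) are left
// for the caller.
std::map<char, std::string> pair_flags(const std::string &flag_word,
                                       const std::vector<std::string> &args) {
    // The caller is expected to pass only a word that starts with '-'.
    assert(flag_word.size() >= 2 && flag_word[0] == '-');
    const std::size_t flag_count = flag_word.size() - 1;
    if (args.size() < flag_count)
        throw std::invalid_argument("too few arguments for the given flags");
    std::map<char, std::string> values;
    for (std::size_t i = 0; i < flag_count; ++i)
        values[flag_word[i + 1]] = args[i];
    return values;
}
```

For the word "-hw" followed by "0.1" and "0.25", the sketch maps h to "0.1" and w to "0.25", which matches the "-hw 0.1 0.25" form recommended above.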

A bit more detail:

The -o flag (smoothing) controls a post-process on the path. Since the path is discrete, it will have small jumps ahead or pauses whenever it differs from the diagonal. A linear regression is performed at each frame using a set of points whose size is determined by the -o parameter, and the discrete time indicated by the path is replaced by a continuous time estimated from neighboring points. This smooths out local irregularities in the time map.
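
A minimal sketch of this kind of smoothing follows. It assumes the path is a list of (x, y) frame positions ordered by x and that the window is given as a number of points on each side of the current one; Point, smooth_path, and half_window are names invented for the example, and how the program converts the -o value in seconds into a point count is not shown.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

struct Point { double x; double y; };  // frame position in file 1 and file 2

// Replaces each y with the value of a least-squares line fitted to the
// points within `half_window` positions on either side.
std::vector<Point> smooth_path(const std::vector<Point> &path,
                               std::size_t half_window) {
    assert(std::is_sorted(path.begin(), path.end(),
                          [](const Point &a, const Point &b) { return a.x < b.x; }));
    std::vector<Point> smoothed(path.size());
    for (std::size_t k = 0; k < path.size(); ++k) {
        const std::size_t lo = k >= half_window ? k - half_window : 0;
        const std::size_t hi = std::min(path.size() - 1, k + half_window);
        const double n = static_cast<double>(hi - lo + 1);
        double sx = 0.0, sy = 0.0, sxx = 0.0, sxy = 0.0;
        for (std::size_t i = lo; i <= hi; ++i) {
            sx += path[i].x;
            sy += path[i].y;
            sxx += path[i].x * path[i].x;
            sxy += path[i].x * path[i].y;
        }
        const double denom = n * sxx - sx * sx;
        double y = sy / n;  // fallback when all x values coincide
        if (denom != 0.0) {
            const double slope = (n * sxy - sx * sy) / denom;
            const double intercept = (sy - slope * sx) / n;
            y = intercept + slope * path[k].x;
        }
        smoothed[k] = Point{path[k].x, y};
    }
    return smoothed;
}
```

A path that already lies on a straight line is returned unchanged, while an isolated jump is pulled toward the trend of its neighbors.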

The -p flag (presmoothing) operates on the discrete path. It tries to fit a straight line segment (length is set by -p) to the path. If the line fits the path well in both the first half and the second half of the segment, the path within the segment is replaced with the straight line approximation. To "fit well", half of the path points must fall very close to the straight line (currently, within 1.5 frames). For example, if the line segment spans 40 frames, then 10 path points must be close to the line in the first 20 frames and 10 path points must be close to the line in the last 20 frames. The step is repeated on overlapping windows through the whole piece. This presmoothing step is designed to detect places where dynamic programming "wanders off" from the true path and then realigns to the true path. The off-track points are replaced, so they do not adversely affect the smoothing step. This approach does not seem to be robust, but sometimes works well.
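
The "fit well" test can be sketched as below. The 1.5-frame tolerance and the requirement of one quarter of the span in each half (10 of 40 in the example) come from the description above; everything else is assumed for illustration, including the names Point, Line, and fits_well, the use of vertical distance to the line, and the fact that the candidate line is supplied by the caller.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

struct Point { double x; double y; };
struct Line { double slope; double intercept; };

// Returns true when enough path points lie close to `line` in both halves
// of the window [x_start, x_start + span).
bool fits_well(const std::vector<Point> &path, const Line &line,
               double x_start, double span, double tolerance = 1.5) {
    assert(span > 0.0 && tolerance >= 0.0);
    const double middle = x_start + span / 2.0;
    std::size_t near_first = 0, near_second = 0;
    for (const Point &p : path) {
        if (p.x < x_start || p.x >= x_start + span) continue;  // outside window
        const double error = std::fabs(p.y - (line.slope * p.x + line.intercept));
        if (error > tolerance) continue;  // off-track point
        if (p.x < middle) ++near_first; else ++near_second;
    }
    const double needed = span / 4.0;  // half of the frames in each half
    return static_cast<double>(near_first) >= needed &&
           static_cast<double>(near_second) >= needed;
}
```

A window whose middle wanders off the line still passes as long as enough points at both ends stay close, which is the case presmoothing is meant to repair; a window whose entire second half is off the line does not pass.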

The -x flag is another approach to deal with dynamic programming errors. It divides the entire piece into segments whose lengths are about equal and about the length specified by the -x parameter. The line segments are fit to the path by linear regression, and their endpoints are joined by averaging their linear regression values. Next, a hill-climbing search is performed to minimize the total distance along the path. This is like dynamic programming except that each line spans many frames, so the resulting path is forced to be fairly straight. Linear interpolation is used to estimate chroma distance since the lines do not always pass through integer frame locations. This approach is probably good when the audio is known to have a steady tempo or be performed with tempo changes that match those in the midi file.
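
To illustrate the interpolation step only, the sketch below scores one line segment against a precomputed distance matrix. It assumes the line is sampled once per frame of the first file and interpolated between the two nearest frames of the second file; the names Matrix, interpolated_distance, and segment_cost are hypothetical, and the hill-climbing search over segment endpoints is not shown.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

using Matrix = std::vector<std::vector<double>>;  // dist[x][y]

// Distance at frame x of file 1 and fractional frame y of file 2,
// linearly interpolated between the two nearest integer frames.
double interpolated_distance(const Matrix &dist, std::size_t x, double y) {
    assert(x < dist.size() && !dist[x].empty());
    const std::vector<double> &column = dist[x];
    const double top = static_cast<double>(column.size() - 1);
    y = std::max(0.0, std::min(y, top));  // keep y inside the matrix
    const std::size_t below = static_cast<std::size_t>(std::floor(y));
    const std::size_t above = std::min(below + 1, column.size() - 1);
    const double fraction = y - static_cast<double>(below);
    return (1.0 - fraction) * column[below] + fraction * column[above];
}

// Total distance along the straight line from (x_begin, y_begin) to
// (x_end, y_end), excluding the end point so joined segments do not
// count their shared endpoint twice.
double segment_cost(const Matrix &dist, std::size_t x_begin, double y_begin,
                    std::size_t x_end, double y_end) {
    assert(x_begin < x_end && x_end < dist.size());
    const double slope = (y_end - y_begin) / static_cast<double>(x_end - x_begin);
    double total = 0.0;
    for (std::size_t x = x_begin; x < x_end; ++x)
        total += interpolated_distance(
            dist, x, y_begin + slope * static_cast<double>(x - x_begin));
    return total;
}
```

A search over endpoints could call segment_cost repeatedly while nudging y_begin and y_end, keeping a change whenever the summed cost of the affected segments goes down.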

Some notes on the software architecture of scorealign:

scorealign was originally implemented as a fairly monolithic program in MATLAB. It was ported to C++. To incorporate this code into Audacity, the code was restructured so that audio input is obtained from Audio_reader, an abstract class that calls on a subclass to implement read(). The subclass just copies floats into the provided buffer. It is responsible for sample format conversion, stereo-to-mono conversion, etc. The Audio_reader returns possibly overlapping buffers of floats. The Audio_file_reader subclass uses libsndfile to read in samples and convert them to float. It does its own conversion to mono.

When scorealign is used in Audacity, a different subclass of Audio_reader will call into Audacity using a Mixer object to retrieve samples from selected tracks.
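
The division of labor between the abstract reader and its subclasses can be sketched as follows. The class names Reader_sketch and Memory_reader, the next_frame method, and the signature of read() are all hypothetical and chosen for this example; the real Audio_reader interface in audioreader.cpp may differ. The base class produces overlapping frames, and the subclass only supplies mono float samples.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical abstract reader: hands out overlapping frames and leaves
// the source of the samples to a subclass.
class Reader_sketch {
public:
    Reader_sketch(std::size_t window, std::size_t hop)
        : window_(window), hop_(hop), primed_(false), buffer_(window) {
        assert(hop > 0 && hop <= window);
    }
    virtual ~Reader_sketch() {}

    // Copies the next frame of `window` samples into `frame`. Returns
    // false once the source cannot supply a full frame.
    bool next_frame(std::vector<float> &frame) {
        const std::size_t keep = primed_ ? window_ - hop_ : 0;
        // Slide the samples shared with the previous frame to the front.
        std::copy(buffer_.end() - keep, buffer_.end(), buffer_.begin());
        const std::size_t wanted = window_ - keep;
        if (read(buffer_.data() + keep, wanted) < wanted) return false;
        primed_ = true;
        frame = buffer_;
        return true;
    }

protected:
    // Subclasses copy up to `count` mono float samples into `out` and
    // return how many they wrote.
    virtual std::size_t read(float *out, std::size_t count) = 0;

private:
    std::size_t window_, hop_;
    bool primed_;
    std::vector<float> buffer_;
};

// Example subclass that serves samples already held in memory.
class Memory_reader : public Reader_sketch {
public:
    Memory_reader(const std::vector<float> &samples, std::size_t window,
                  std::size_t hop)
        : Reader_sketch(window, hop), samples_(samples), position_(0) {}

protected:
    std::size_t read(float *out, std::size_t count) override {
        const std::size_t n = std::min(count, samples_.size() - position_);
        std::copy(samples_.begin() + position_,
                  samples_.begin() + position_ + n, out);
        position_ += n;
        return n;
    }

private:
    std::vector<float> samples_;
    std::size_t position_;
};
```

A file-backed or Audacity-backed source would replace Memory_reader and do its format and channel conversion inside read(), while the alignment code would keep calling next_frame without knowing where the samples come from.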

For use from the command line, scorealign has a module main.cpp that parses command line arguments. A lot of parameters and options that were formerly globals are now stored in a Scorealign object that is passed around to many routines and methods. main.cpp creates a (global) Scorealign object and uses code in the module alignfiles.cpp to do the work. The purpose of alignfiles is to provide an API that does not depend upon a command line interface, but which assumes you are aligning files. Finally, alignfiles.cpp uses an Audio_file_reader to offer samples to the main score alignment algorithm.

To summarize:

  • scorealign.cpp and gen_chroma.cpp do most of the pure alignment work
  • audioreader.cpp abstracts the source of audio, whether it comes from a file or some other source
  • alignfiles.cpp opens files and invokes the modules above
  • main.cpp parses the command line and invokes alignfiles.