Audio Fingerprinting Benchmark Toolkit
Introduction
The Audio Fingerprinting Benchmark Toolkit evaluates the accuracy of different audio matching solutions. The aim of audio matching is to identify which pieces of the reference audios are present in the query audios. This is usually achieved by first creating unique audio fingerprints for both the reference and query audio files, and then using these fingerprints to compute an audio similarity confidence.
Audio fingerprints are typically created by using audio spectral analysis methods and are unique to particular audio samples. Fingerprint matching attempts to measure the similarity between fingerprints created from the reference and the query audio samples.
The aim of the evaluation process implemented in this toolkit is to measure how accurate any given audio fingerprinting and matching algorithm is. Specific evaluation metrics, described in the following sections of this document, are used to assess the accuracy.
In this toolkit, the generated query audio files and the corresponding reference audio files are provided for evaluation. You may also generate your own query files using your reference audio files.
The Audio Fingerprinting Benchmark Toolkit supports both file-level evaluation, for matchers that report only which files match, and segment-level evaluation, for matchers that also report the matching ranges within the audio files.
Basic Evaluation
Requirements
- Python 3.6+
- Python module `portion`
Data Files
The following datasets of various sizes were generated from the FMA dataset using audios with permissive licenses. The provided datasets contain reference and query audio files created using artificially applied audio modifications and distortions.
Dataset | Difficulty | File | Archive size | Query chunks | Query audios | Reference audios |
---|---|---|---|---|---|---|
small | easy | pexafb_easy_small.zip | 657 MiB | 100 | 22 | 99 |
small | medium | pexafb_medium_small.zip | 717 MiB | 100 | 27 | 100 |
small | hard | pexafb_hard_small.zip | 630 MiB | 100 | 22 | 99 |
medium | easy | pexafb_easy_medium.zip | 6.8 GiB | 1000 | 221 | 960 |
medium | medium | pexafb_medium_medium.zip | 7.0 GiB | 1000 | 220 | 950 |
medium | hard | pexafb_hard_medium.zip | 7.2 GiB | 1000 | 219 | 953 |
large | easy | pexafb_easy_large.zip | 69 GiB | 20000 | 4491 | 9004 |
large | medium | pexafb_medium_large.zip | 69 GiB | 20000 | 4360 | 8998 |
large | hard | pexafb_hard_large.zip | 69 GiB | 20000 | 4400 | 9044 |
The contents of each of these archives:

- `fma_tracks.csv` - FMA tracks with suitable licenses; the reference audios were randomly chosen from them
- `queries` - generated query audio files
- `references` - used reference audio files
- `annotations.csv` - annotation file describing the contents of the queries; it is used for evaluation of the found matches
The difficulty level influences which audio distortions may be used in the queries:
- easy: high-pass, low-pass, changes in pitch and/or tempo mostly <= 6%, rarely > 10%, chunks are just concatenated, noise with Signal-to-Noise-Ratio (SNR) >= +10 dB
- medium: all of the above, echo, changes in pitch and/or tempo mostly <= 18%, rarely > 30%, chunks merged using overlap or fade in/fade out, noise with SNR >= +5 dB
- hard: all of the above, reverb, changes in pitch and/or tempo mostly <= 37%, rarely > 62%, noise with SNR >= +0 dB
Usage
- install the requirements (`pip3 install -r requirements.txt`)
- download and unzip one or more provided data files (see the previous subsection)
- calculate the fingerprints of reference audios and query audios using your audio fingerprinter
- calculate the matches between the reference audios and the query audios using your audio matcher
- convert the found matches to the CSV format described in the section Matches File
- run `evaluate_matches.py --annotation-file annotations.csv --matches-file matches.csv`
- check the results
File Formats
The Audio Fingerprinting Benchmark Toolkit uses CSV files. The delimiter is a comma (`,`) and the quote character is a double quote (`"`).
Matches File
The audio matching software should provide a CSV file with the found matches, containing the following fields:
Field | Type | Mandatory | Description |
---|---|---|---|
`reference_id` | String | yes | ID of the reference |
`query_id` | String | yes | ID of the query |
`reference_begin` | Integer | segment matches | Beginning of the segment in the reference, in seconds, inclusive |
`reference_end` | Integer | segment matches | End of the segment in the reference, in seconds, exclusive |
`query_begin` | Integer | segment matches | Beginning of the segment in the query, in seconds, inclusive |
`query_end` | Integer | segment matches | End of the segment in the query, in seconds, exclusive |
Each line describes one matched segment (if segments are reported), or one matched file (if segments are not reported).
The segment ranges are in whole seconds; the interval is `[begin, end)`, i.e. `begin` belongs to the range but `end` does not.
The segment-level evaluation will be calculated only if the segment begin/end are present.
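As a sketch, a matches file in this format can be written with Python's `csv` module. The IDs and segment values below are illustrative only (they borrow names from the example evaluator output elsewhere in this document):

```python
import csv

# Hypothetical matches reported by an audio matcher; the column names
# follow the Matches File format above, the values are made up.
matches = [
    {"reference_id": "053963", "query_id": "query3627",
     "reference_begin": 30, "reference_end": 45,
     "query_begin": 33, "query_end": 51},
]

with open("matches.csv", "w", newline="") as f:
    writer = csv.DictWriter(
        f,
        fieldnames=["reference_id", "query_id", "reference_begin",
                    "reference_end", "query_begin", "query_end"],
        delimiter=",", quotechar='"')
    writer.writeheader()
    writer.writerows(matches)
```

For file-level matches, the four segment columns would simply be left empty.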
Annotation File
The annotation file describes what was put into the query files by the query generator script `generate_queries.py`. Therefore, it describes the expected matches.
It contains the following fields:
Field | Type | Field Group | Description |
---|---|---|---|
`reference_id` | String | track | ID of the reference |
`query_id` | String | track | ID of the query |
`reference_begin` | Integer | segment | Beginning of the segment in the reference, in seconds, inclusive |
`reference_end` | Integer | segment | End of the segment in the reference, in seconds, exclusive |
`query_begin` | Integer | segment | Beginning of the segment in the query, in seconds, inclusive |
`query_end` | Integer | segment | End of the segment in the query, in seconds, exclusive |
`tempo` | Integer | modification | Tempo in percent of the original |
`pitch` | Integer | modification | Pitch change in cents |
`echo_delay` | Integer | modification | Delay of the reflected sound in milliseconds |
`echo_decay` | Float | modification | Decay (loudness) of the reflected sound |
`high_pass` | Integer | modification | Cutoff frequency for the high-pass filter |
`low_pass` | Integer | modification | Cutoff frequency for the low-pass filter |
`reverb` | Integer | modification | 1 if reverb was applied |
`noise_type` | String | noise | Noise type ('continuous', 'pulsating', or 'sample') |
`noise_file` | String | noise | File name of the noise sample (for noise_type = 'sample') |
`noise_color` | String | noise | Noise color ('brown', 'pink', or 'white') |
`noise_seed` | Integer | noise | Random seed for the FFmpeg noise generator |
`noise_snr` | Integer | noise | Signal-to-Noise Ratio of the noise, in decibels |
`merge_prev` | String | connection | Type of merging with the previous chunk |
`merge_prev_duration` | Float | connection | Duration of the overlap with the previous chunk, in seconds |
`merge_next` | String | connection | Type of merging with the next chunk |
`merge_next_duration` | Float | connection | Duration of the overlap with the next chunk, in seconds |
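As a sketch, the expected matches can be loaded from this file with Python's `csv` module and grouped by query. Only the track and segment field groups are read here, and `load_expected_matches` is an illustrative helper, not part of the toolkit:

```python
import csv
from collections import defaultdict

def load_expected_matches(path):
    """Group the annotated (expected) segments by query_id."""
    expected = defaultdict(list)
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            expected[row["query_id"]].append((
                row["reference_id"],
                int(row["reference_begin"]), int(row["reference_end"]),
                int(row["query_begin"]), int(row["query_end"]),
            ))
    return expected
```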
Evaluation
The matches are evaluated by the script `evaluate_matches.py`. The found matches must be in the CSV format; see the section Matches File describing the format and fields.
Some audio matching solutions report only which files matched; others also report the pairs of matching segments in the reference audio and in the query audio. The evaluation script supports both modes. The segment-level evaluation is performed only if the reported matches contain the segment information.
The evaluator script calculates recall (R), precision (P), a precision-weighted F-score (F), and the numbers of true positives (TP), false positives (FP), false negatives (FN) and unknown positives (UP) - an unknown positive is a positive for which it cannot be determined whether it is true or false. An explanation of these terms can be found on Wikipedia.
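The basic metrics can be sketched as follows. The precision-weighted F-score is assumed here to be an F-beta score with beta < 1; beta = 1/3 happens to reproduce the example evaluator output later in this document, but the exact weighting used by `evaluate_matches.py` should be checked against the script itself:

```python
def scores(tp, fp, fn, beta=1/3):
    """Recall, precision and a precision-weighted F-beta score.

    beta < 1 weights precision higher than recall; beta = 1/3 is an
    assumption that matches the example output in this document.
    """
    recall = tp / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    b2 = beta * beta
    f = ((1 + b2) * precision * recall / (b2 * precision + recall)
         if precision + recall else 0.0)
    return recall, precision, f
```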
The evaluator script uses the following methods to evaluate the results:
- file-level evaluation - checking if the expected file pairs matched
- segment-level evaluation similar to the evaluation of bounding boxes, see the article A Large-scale Comprehensive Dataset and Copy-overlap Aware Evaluation Protocol for Segment-level Video Copy Detection, chapter 4.2
- segment-level evaluation based on segment lengths
Example 1: annotation and reported match on the same reference-query pair
- the upper line is a timeline for the reference audio R
- the lower line is a timeline for the query audio Q
- the annotated (expected) segments are blue
- the matched segments are red
- if the query has different tempo, the ranges are normalized before the following calculations
- True Positive seconds TP is the minimum of the overlapped ranges: TP = min(40 - 30, 45 - 33) = min(10, 12) = 10
- False Negative seconds FN is the maximum of the ranges in the annotation but not in the match: FN = max(30 - 15, 33 - 20) = max(15, 13) = 15
- False Positive seconds FP is the maximum of the ranges in the match but not in the annotation: FP = max(45 - 40, 51 - 45) = max(5, 6) = 6
- Recall = TP / (TP + FN) = 10 / (10 + 15) = 10 / 25 = 0.40
- Precision = TP / (TP + FP) = 10 / (10 + 6) = 10 / 16 = 0.625
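The calculation in this example can be sketched in Python. This is a minimal illustration of the length-based formulas above; `segment_scores` is a hypothetical helper, and tempo normalization is assumed to have been done already:

```python
def overlap(a_begin, a_end, b_begin, b_end):
    # Length of the intersection of two [begin, end) second ranges.
    return max(0, min(a_end, b_end) - max(a_begin, b_begin))

def segment_scores(ann, match):
    """TP/FN/FP seconds for one annotated and one reported segment.

    ann and match are (ref_begin, ref_end, query_begin, query_end)
    tuples of tempo-normalized, whole-second positions.
    """
    ov_ref = overlap(ann[0], ann[1], match[0], match[1])
    ov_query = overlap(ann[2], ann[3], match[2], match[3])
    tp = min(ov_ref, ov_query)
    fn = max((ann[1] - ann[0]) - ov_ref, (ann[3] - ann[2]) - ov_query)
    fp = max((match[1] - match[0]) - ov_ref,
             (match[3] - match[2]) - ov_query)
    return tp, fn, fp
```

Applied to the annotation (reference 15-40, query 20-45) and the match (reference 30-45, query 33-51) from this example, it yields TP = 10, FN = 15, FP = 6.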
Example 2: the reported match is with a different reference audio R2
- this is not an expected match, therefore the whole match segment is a False Positive: FP = max(45 - 30, 51 - 33) = max(15, 18) = 18
- nothing matched with the reference audio R1, so everything is a False Negative: FN = max(40 - 15, 45 - 20) = max(25, 25) = 25
Example 3: query matched also refrains in the reference audio
- the chunk used in the query may contain a refrain that occurs also somewhere else in the reference audio
- however, it cannot be determined whether it really is a refrain and thus a correct match
- therefore the corresponding segment is calculated as Unknown Positive instead of True Positive or False Positive
- Unknown Positive seconds UP = max(62 - 50, 45 - 33) = max(12, 12) = 12
- False Positive seconds FP = max(65 - 62, 51 - 45) = max(3, 6) = 6
Evaluator output
The evaluator script calculates the results for each reference-query pair. It also aggregates these individual results to larger groups:
- all queries using the same reference audio (denoted `REF`)
- groups according to the modification, noise, or merging type (denoted `TAG`)
- overall results (denoted `TOTAL`)
By default, the results are printed to standard output (see the example below).
The evaluator script also supports output to a CSV file specified by the command-line option `--output-csv-file`.
Example output (shortened)
```
R 93.10 P 100.00 F 99.26 TP 27 UP 16 FP 0 FN 2 query3627 053963 echo, merge_next:end, merge_prev:concat, noise:-10dB, noise:sample, pitch:exact, speed:exact, tempo:exact
R 95.45 P 95.45 F 95.45 TP 21 UP 0 FP 1 FN 1 query2485 053963 merge_next:overlap, merge_prev:concat, noise:none, tempo:small
R 96.67 P 100.00 F 99.66 TP 29 UP 1 FP 0 FN 1 query3538 053963 merge_next:concat, merge_prev:concat, noise:none, pitch:exact, speed:exact, tempo:exact
R 95.07 P 98.48 F 98.13 TP 77 UP 17 FP 1 FN 4 REF 053963
R 93.10 P 100.00 F 99.26 TP 27 UP 16 FP 0 FN 2 TAG echo
R 95.45 P 95.45 F 95.45 TP 21 UP 0 FP 1 FN 1 TAG tempo:small
R 95.07 P 98.48 F 98.13 TP 77 UP 17 FP 1 FN 4 TOTAL
```
Generating New Query Audio Files
This section describes the process of generating new query audio files. The provided data files were created using this process.
The Audio Fingerprinting Benchmark Toolkit allows for the generation of augmented (modified) queries. Several types of audio modifications can be applied to query audio samples to simulate real-world use-case scenarios and measure the robustness of audio matching technology to such modifications. Different audio modification severity and control settings (e.g. cut-off frequencies, or echo delay) are used for each processed audio file.
The applied audio modifications include:
- tempo modification,
- pitch change,
- echo,
- reverberation,
- high and low-pass filtering, and
- noise addition.
Requirements
- librubberband
- LADSPA with the TAP plugins
- FFmpeg built with mp3/aac decoding/encoding and the following configure options: `--enable-indev=lavfi --enable-librubberband --enable-ladspa`
List of Reference Audio Files
First, a CSV file with the list of reference audio files must be prepared.
When using the FMA dataset, use the script `generate_track_list_from_fma_dataset.py`.

- The option `--fma-path=PATH` specifies the location of the FMA dataset audio files. It is a path to the `fma_full` or `fma_large` directory.
- The first argument is the path to the `fma_metadata/raw_tracks.csv` file from the FMA metadata.
- The second argument is the output file name.
When using files other than the FMA dataset, use the script `generate_track_list_from_media_files.py` to generate the list of reference audio files.

- The option `--output=FILE` specifies the output file name.
- The arguments are the file names of the media files that should be used for generating the queries.
The reference audio files should not match each other. To ensure this, match all reference audios against each other and then use the script `filter_non_matching_reference_audios.py` to keep only the non-matching reference audios in the list.
Generating the Query Audios
The script `generate_query_audios.py` generates the query audio files from the list of reference audio files:
- The maximum difficulty of the generated queries may be set by the `--difficulty` parameter.
- The total number of chosen chunks is 15% of the number of reference audios, or it may be set by the `--num-queries` parameter.
- Until all chunks have been generated:
  - Randomly choose the number of chunks to put into the next query file.
  - For each chunk:
    - Randomly choose the reference audio.
    - Randomly choose the duration of the chunk and its position in the reference audio.
    - Randomly choose the modifications (none, pitch, tempo, tempo & pitch, echo, noise, reverb, high-pass, low-pass, ...) and their parameters.
    - Get the chunk from the reference audio and apply the modifications.
  - Randomly choose how to connect the chunks (concatenation, fade in/out, overlap) and the connection parameters.
  - Connect the chunks into a query file.
- Generate the `annotations.csv` file describing the expected matches, see section Annotation File.
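The chunk-selection loop described above can be sketched as follows. The reference list, the chunks-per-query and duration bounds, and the modification set are illustrative assumptions; the real `generate_query_audios.py` draws its parameters differently:

```python
import random

def plan_chunks(reference_ids, num_chunks, seed=0):
    """Plan queries as lists of randomly chosen chunk specifications."""
    rng = random.Random(seed)
    plan, planned = [], 0
    while planned < num_chunks:
        # how many chunks go into the next query file (bounds assumed)
        chunks_in_query = min(rng.randint(1, 5), num_chunks - planned)
        query = []
        for _ in range(chunks_in_query):
            query.append({
                "reference_id": rng.choice(reference_ids),
                "duration": rng.randint(10, 30),  # seconds, assumed bounds
                "modification": rng.choice(
                    ["none", "pitch", "tempo", "echo", "noise"]),
                "merge_next": rng.choice(["concat", "overlap", "fade"]),
            })
        plan.append(query)
        planned += chunks_in_query
    return plan
```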
Creating the Audio Fingerprinting Benchmark Toolkit Data Files
In order to create the Data Files, the used reference audio files must be added to the package(s). This is done by the script `symlink_reference_audios.py`.
- The option `--annotation-file` specifies the annotation file describing the queries.
- The option `--reference-dir` specifies the path to the directory that will contain the symlinks to the reference audio files.
- The argument(s) specify the list(s) of reference audio file(s).
Finally, the contents of the data files are zipped into one archive.
All required steps, starting with Generating the Query Audios, are performed by the script `generate_audio_benchmark_datasets.sh`. It was used to create the provided Data Files.