FCN-f0

Code for running monophonic pitch (F0) estimation using the fully-convolutional neural network models described in the publication:

L. Ardaillon and A. Roebel, "Fully-Convolutional Network for Pitch Estimation of Speech Signals", Proc. Interspeech, 2019.

We kindly request that academic publications making use of our FCN models cite this paper, which can be downloaded from the following URL: https://hal.archives-ouvertes.fr/hal-02439798/document

Description

The code provided in this repository performs monophonic pitch (F0) estimation using fully-convolutional neural networks. It is partly based on the code from the CREPE repository [1]: https://github.com/marl/crepe

The provided code allows running pitch estimation on given sound files using the provided pretrained models; no code is currently provided to train the models on new data. Three different pre-trained fully-convolutional models are provided. These models have been trained exclusively on (synthetic) speech data and may thus not perform as well on other types of sounds, such as musical instruments. Note that the output F0 values are also limited to the target range [30, 1000] Hz, which is suitable for vocal signals (including high-pitched soprano singing).

The models, algorithm, training, and evaluation procedures have been described in a publication entitled "Fully-Convolutional Network for Pitch Estimation of Speech Signals", presented at the Interspeech 2019 conference (https://www.isca-speech.org/archive/Interspeech_2019/pdfs/2815.pdf).

Below are the results of our evaluations comparing our models to the SWIPE algorithm and the CREPE model, in terms of Raw Pitch Accuracy (average value and standard deviation, on both a test database of synthetic speech, "PAN-synth", and a database of real speech samples with manually-corrected ground truth, "manual"). For this evaluation, the CREPE model has been evaluated both with the pretrained model provided in the CREPE repository ("CREPE" in the table) and with a model retrained from scratch on our synthetic database ("CREPE-speech"). The FCN models have been evaluated on 8 kHz audio, while CREPE and SWIPE have been trained and evaluated on 16 kHz audio.

|                       | FCN-1953      | FCN-993       | FCN-929       | CREPE         | CREPE-speech  | SWIPE          |
|-----------------------|---------------|---------------|---------------|---------------|---------------|----------------|
| PAN-synth (25 cents)  | 93.62 ± 3.34% | 94.31 ± 3.15% | 93.50 ± 3.43% | 77.62 ± 9.31% | 86.92 ± 8.28% | 84.56 ± 11.68% |
| PAN-synth (50 cents)  | 98.37 ± 1.62% | 98.53 ± 1.54% | 98.27 ± 1.73% | 91.23 ± 6.00% | 97.27 ± 2.09% | 93.10 ± 7.26%  |
| PAN-synth (200 cents) | 99.81 ± 0.64% | 99.79 ± 0.65% | 99.77 ± 0.73% | 95.65 ± 5.17% | 99.25 ± 1.07% | 97.51 ± 4.90%  |
| manual (50 cents)     | 88.32 ± 6.33% | 88.57 ± 5.77% | 88.88 ± 5.73% | 87.03 ± 7.35% | 88.45 ± 5.70% | 85.93 ± 7.62%  |
| manual (200 cents)    | 97.35 ± 3.02% | 97.31 ± 2.56% | 97.36 ± 2.51% | 92.57 ± 5.22% | 96.63 ± 2.91% | 95.03 ± 4.04%  |
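
For reference, the Raw Pitch Accuracy at a given threshold is the fraction of voiced frames whose estimate falls within that many cents of the ground truth. A minimal sketch of the metric in Python (a hypothetical helper, not part of this repository):

```python
import numpy as np

def raw_pitch_accuracy(f0_est, f0_ref, threshold_cents=50.0):
    """Fraction of reference-voiced frames where the estimated F0
    lies within `threshold_cents` of the reference F0 (both in Hz)."""
    voiced = f0_ref > 0  # only frames with a reference pitch count
    diff_cents = 1200.0 * np.abs(np.log2(f0_est[voiced] / f0_ref[voiced]))
    return float(np.mean(diff_cents < threshold_cents))
```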

Our synthetic speech database has been created by resynthesizing the BREF [2] and TIMIT [3] databases using the PAN synthesis engine, described in [4, Section 3.5.2].

We also compared the different models and algorithms in terms of potential latency (with a real-time implementation in mind), where the latency corresponds to half the duration of the (minimal) input window, and in terms of computation time on both a GPU and a single-core CPU:

|                             | FCN-1953 | FCN-993 | FCN-929 | CREPE | SWIPE |
|-----------------------------|----------|---------|---------|-------|-------|
| Latency (s)                 | 0.122    | 0.062   | 0.058   | 0.032 | 0.128 |
| Computation time on GPU (s) | 0.016    | 0.010   | 0.021   | 0.092 | X     |
| Computation time on CPU (s) | 1.65     | 0.89    | 3.34    | 14.79 | 0.63  |
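
These latency figures follow directly from the model input sizes and sample rates (latency = half the minimal input duration); assuming the FCN input sizes match the model names and a 1024-sample input for CREPE:

```python
# Latency = half the duration of the minimal input window.
for name, n_samples, sr in [("FCN-1953", 1953, 8000),
                            ("FCN-993", 993, 8000),
                            ("FCN-929", 929, 8000),
                            ("CREPE", 1024, 16000)]:
    print(name, round(n_samples / sr / 2, 3), "s")
# -> 0.122, 0.062, 0.058, and 0.032 s, matching the table above
```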

Example command-line usage (using provided pretrained models)

Default analysis: this runs the FCN-993 model and outputs the result as a CSV file in the same folder as the input file (replacing the file extension by ".csv"):

python /path_to/FCN-f0/FCN-f0.py /path_to/test.wav
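
The resulting CSV can then be loaded for further processing. A minimal sketch, assuming one analysis frame per row with a time column (in seconds) and an F0 column (in Hz); check the actual header of your output file:

```python
import csv

times, f0s = [], []
with open("/path_to/test.csv") as f:
    for row in csv.reader(f):
        try:
            t, f0 = float(row[0]), float(row[1])
        except ValueError:
            continue  # skip the header line, if present
        times.append(t)
        f0s.append(f0)
```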

Run the analysis on a whole folder of audio files:

python /path_to/FCN-f0/FCN-f0.py /path_to/audio_files

Specify an output directory or file name with the "-o" option (if the directory doesn't exist, it will be created):

python /path_to/FCN-f0/FCN-f0.py /path_to/test.wav -o /path_to/output.f0.csv
python /path_to/FCN-f0/FCN-f0.py /path_to/audio_files -o /path_to/output_dir

Choose a specific model for running the analysis (default is FCN-993):

Use the FCN-929 model:

python /path_to/FCN-f0/FCN-f0.py /path_to/test.wav -m 929 -o /path_to/output.f0-929.csv

Use the FCN-993 model:

python /path_to/FCN-f0/FCN-f0.py /path_to/test.wav -m 993 -o /path_to/output.f0-993.csv

Use the FCN-1953 model:

python /path_to/FCN-f0/FCN-f0.py /path_to/test.wav -m 1953 -o /path_to/output.f0-1953.csv

Use the CREPE-speech model:

python /path_to/FCN-f0/FCN-f0.py /path_to/test.wav -m CREPE -o /path_to/output.f0-CREPE.csv

Apply Viterbi smoothing to the output:

python /path_to/FCN-f0/FCN-f0.py /path_to/test.wav -vit
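
Viterbi smoothing decodes the most likely pitch trajectory through the framewise network activations instead of picking each frame's maximum independently, which suppresses isolated octave jumps. A generic sketch of the idea (a simple distance-based transition model, not necessarily the one used by this repository):

```python
import numpy as np

def viterbi_path(posteriors, scale=2.0):
    """posteriors: (n_frames, n_bins) framewise pitch-bin probabilities.
    Returns the most likely bin sequence under a transition model
    that penalizes large jumps between consecutive frames."""
    n_frames, n_bins = posteriors.shape
    # Transition log-probabilities: cost grows with the size of the jump.
    dist = np.abs(np.arange(n_bins)[:, None] - np.arange(n_bins)[None, :])
    log_trans = -dist / scale
    log_trans -= np.log(np.exp(log_trans).sum(axis=1, keepdims=True))

    log_post = np.log(posteriors + 1e-12)
    score = log_post[0].copy()
    backptr = np.zeros((n_frames, n_bins), dtype=int)
    for t in range(1, n_frames):
        cand = score[:, None] + log_trans       # (from_bin, to_bin)
        backptr[t] = np.argmax(cand, axis=0)
        score = cand[backptr[t], np.arange(n_bins)] + log_post[t]
    path = [int(np.argmax(score))]              # backtrace from the end
    for t in range(n_frames - 1, 0, -1):
        path.append(backptr[t, path[-1]])
    return np.asarray(path[::-1])
```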

Output the result in SDIF format (requires the eaSDIF Python library; the default format is CSV):

python /path_to/FCN-f0/FCN-f0.py /path_to/test.wav -f sdif

Deactivate the fully-convolutional mode (for comparison purposes only; not recommended otherwise, as it makes the computation much slower):

python /path_to/FCN-f0/FCN-f0.py /path_to/test.wav -FC 0
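
In fully-convolutional mode the network processes the whole signal in a single pass, whereas the deactivated mode evaluates the network once per analysis frame and therefore recomputes all of the overlapping samples. The equivalence (and the cost difference) can be illustrated with a toy 1-D convolution:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(16)         # toy "signal"
kernel = np.array([0.2, 0.3, 0.5])  # toy learned filter

# Fully-convolutional: one pass over the entire signal.
full = np.convolve(x, kernel[::-1], mode="valid")

# Frame-wise: re-apply the same filter to every sliding frame.
framed = np.array([x[i:i + 3] @ kernel for i in range(len(x) - 2)])

assert np.allclose(full, framed)  # identical outputs, but the frame-wise
# version redoes the overlapping computation at every step
```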

References

[1] J. W. Kim, J. Salamon, P. Li, and J. P. Bello, “CREPE: A Convolutional Representation for Pitch Estimation,” in Proc. IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2018.

[2] J. L. Gauvain, L. F. Lamel, and M. Eskenazi, “Design Considerations and Text Selection for BREF, a Large French Read-Speech Corpus,” in Proc. 1st International Conference on Spoken Language Processing (ICSLP), 1990, pp. 1097–1100. http://www.limsi.fr/~lamel/kobe90.pdf

[3] V. Zue, S. Seneff, and J. Glass, “Speech Database Development at MIT: TIMIT and Beyond,” Speech Communication, vol. 9, no. 4, pp. 351–356, 1990.

[4] L. Ardaillon, “Synthesis and expressive transformation of singing voice,” Ph.D. dissertation, EDITE; UPMC-Paris 6, Sorbonne Universités, 2017.