Zero-data (yet trainable) probabilistic fundamental frequency estimator.
Paper available on arXiv, to be presented at Interspeech 2018.
Proudly implemented in GNU Octave. Probably won't work in MATLAB.
License: GPL v3
Nebula is a low-gross-error pitch estimator designed for speech synthesis systems. The main idea is to define a very rough probability distribution over speech signals, generate synthetic signals from that distribution by Monte-Carlo simulation, and train an estimator on the synthetic data.
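As a rough illustration of the sampling step only, the sketch below draws random parameters and synthesizes harmonic signals in additive noise. The parameter ranges, the sampling rate, and every variable name here are assumptions made for the example; Nebula's actual signal distribution is the one defined in the paper and in `make_random_dataset`.

```octave
% Toy Monte-Carlo signal generator (illustrative, not Nebula's actual model).
fs = 8000;                         % sampling rate in Hz (assumed)
n  = (0:fs/10 - 1)' / fs;          % 100 ms time axis, 800 samples
N  = 5;                            % number of synthetic examples

f0     = 40 * (1000 / 40) .^ rand(N, 1);  % log-uniform F0 between 40 and 1000 Hz
snr_db = 40 * rand(N, 1) - 10;            % SNR between -10 and 30 dB

X = zeros(numel(n), N);
for i = 1:N
  H   = floor(fs / 2 / f0(i));     % number of harmonics below Nyquist
  h   = 1:H;
  amp = rand(1, H);                % random harmonic amplitudes
  ph  = 2 * pi * rand(1, H);       % random harmonic phases
  s   = sum(amp .* sin(2 * pi * f0(i) * n * h + ph), 2);
  noise = randn(size(s));
  % Scale the noise so that the signal-to-noise ratio equals snr_db(i).
  g = sqrt(mean(s .^ 2) / mean(noise .^ 2)) * 10 ^ (-snr_db(i) / 20);
  X(:, i) = s + g * noise;         % column i is one synthetic example
end
```

Each column of `X` is paired with a known `f0(i)`, which is what makes training without recorded speech possible.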
The difficulty with this type of data-free approach is poor generalization from synthetic speech to real speech. Nebula tackles this by factorizing the problem into many small pieces that are local in time and frequency, and training a frequency-dependent model for each band. The predictions from all frequency-dependent models are fused by taking the average log posterior.
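The fusion step can be illustrated with a minimal sketch. It assumes every band's model yields a posterior over the same grid of F0 candidates; the matrix `P`, its size, and the final renormalization are illustrative assumptions rather than Nebula's internals.

```octave
% Hypothetical posteriors: 3 bands (rows) over 4 F0 candidates (columns).
P = [0.10 0.60 0.20 0.10;
     0.25 0.25 0.25 0.25;          % an uninformative band
     0.05 0.70 0.15 0.10];

% Average the log posteriors across bands (the floor avoids log(0)).
L = mean(log(max(P, eps)), 1);

% Renormalize into a distribution, subtracting the maximum for stability.
fused = exp(L - max(L));
fused = fused / sum(fused);

[~, best] = max(fused);            % index of the most likely F0 candidate
```

Averaging log posteriors amounts to a normalized geometric mean, so a band with a flat posterior adds the same constant to every candidate and leaves the ranking unchanged.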
This factorization is made possible by reusing Hideki Kawahara's SNR and instantaneous frequency feature extractors, which have very good time-frequency resolution. The models are simply Gaussian Mixture Models (GMMs). These low-dimensional GMMs can be efficiently converted into conditional form when all but one of the variables are known.
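The conversion to conditional form follows from standard Gaussian conditioning applied per component. The sketch below uses a toy two-dimensional GMM and conditions on the first variable; the weights, means, and covariances are made up for illustration, and which variables Nebula actually conditions on is not shown here.

```octave
% Hypothetical 2-component GMM over [x1; x2].
w  = [0.5 0.5];                            % mixture weights
mu = [0 3; 0 3];                           % column k is the mean of component k
S  = cat(3, [1 0.8; 0.8 1], [1 0; 0 1]);   % S(:,:,k) is the covariance of component k

a = 1;                                     % observed value of x1

K  = numel(w);
cw = zeros(1, K);                          % conditional weights
cm = zeros(1, K);                          % conditional means of x2
cv = zeros(1, K);                          % conditional variances of x2
for k = 1:K
  s11 = S(1, 1, k); s12 = S(1, 2, k); s22 = S(2, 2, k);
  % Weight times the marginal likelihood of x1 = a under component k.
  cw(k) = w(k) * exp(-0.5 * (a - mu(1, k))^2 / s11) / sqrt(2 * pi * s11);
  % Gaussian conditioning of x2 on x1 = a.
  cm(k) = mu(2, k) + s12 / s11 * (a - mu(1, k));
  cv(k) = s22 - s12^2 / s11;
end
cw = cw / sum(cw);                         % normalize the conditional weights
```

The result is a one-dimensional GMM over `x2` with weights `cw`, means `cm`, and variances `cv`. With few dimensions, the per-component work is a handful of scalar operations, which is why the conversion is cheap.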
H. Kawahara, Y. Agiomyrgiannakis, and H. Zen, "Using instantaneous frequency and aperiodicity detection to estimate F0 for high-quality speech synthesis," in 9th ISCA Workshop on Speech Synthesis, Sunnyvale, 2016.
The Monte-Carlo simulation generates a large number of points in a 6-dimensional space. After dimensionality reduction and visualization as a 3D scatter plot, the resulting image resembles a glowing celestial object, hence the name Nebula.
Like many of my other projects on GitHub, some C-implemented functions depend on ciglet (a library of DSP snippets) and libgvps (a library implementing many variants of the Viterbi algorithm).
- `cd` into `ciglet` and run `make single-file`. This creates `ciglet.h` and `ciglet.c` under `ciglet/single-file/`. Copy and rename this directory to `nebula/external/ciglet`.
- Put `libgvps` under `nebula/external/` and run `make` from `nebula/external/libgvps/`.
- Finally, launch Octave from `nebula/` and run `startup` in Octave.
To estimate F0 from a given audio signal,
[x fs] = audioread('./test.wav');
M = load_model('./model/'); % load from directory ./model
f0 = nebula_est(M, x, fs, 0.005); % estimate F0 at a 0.005s interval
A pretrained model is included in `nebula/model`. It contains 36 GMMs for all frequency bands and a calibration file `Lcal`.
You can also let Nebula output the F0 posterior map,
[f0 v pv lmap] = nebula_est(M, x, fs, 0.005);
imagesc(log(lmap));
The training part depends on SPTK.
make_random_dataset; % this is going to take a while
train_gmm; % the actual GMM training and calibration
K. Hua, "Nebula: F0 estimation and voicing detection by modeling the statistical properties of feature extractors," in Interspeech, Hyderabad, 2018 [to be presented].