This repository contains the Perceptimatic dataset presented in the paper Perceptimatic: A human speech perception benchmark for unsupervised subword modelling (Juliette Millet and Ewan Dunbar, submitted to Interspeech 2020), along with all the code needed to perform the analyses presented in the paper.
- python 3.6/7
- numpy
- scipy
- pandas
- statsmodels
We provide the cleaned stimuli in the form of onsets and offsets for the 2017 ZeroSpeech one-second French and English stimuli. The onsets, offsets and labels of the cleaned French triphones are in DATA/french/all_aligned_clean_french.csv; the English ones are in DATA/english/all_aligned_clean_english.csv. The files have the following columns:
index | #file | onset | offset | #phone | prev-phone | next-phone | speaker |
---|---|---|---|---|---|---|---|
index is how, together with the language, we refer to each triphone in the rest of the files; #file is the original 2017 ZeroSpeech wavfile; onset and offset are the beginning and end of the triphone, in seconds; #phone is the centre phone; prev-phone and next-phone are the surrounding phones; speaker is the reference number of the speaker.
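As a quick sanity check, these files can be inspected with pandas, for example (a sketch, assuming comma-separated files; adjust `sep` if needed):

```python
import pandas as pd

# Load the cleaned French triphone list (the English file has the same columns)
triphones = pd.read_csv("DATA/french/all_aligned_clean_french.csv")

# Duration of each triphone in seconds, and the inventory of centre phones
print((triphones["offset"] - triphones["onset"]).describe())
print(sorted(triphones["#phone"].unique()))
```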
We provide the list of triplets used for the human and model experiments in the file DATA/all_triplets.csv:
filename | TGT | OTH | prev_phone | next_phone | TGT_item | OTH_item | X_item | TGT_first | speaker_tgt_oth | speaker_x |
---|---|---|---|---|---|---|---|---|---|---|
filename is the name of the file containing the triplet (used for the human experiment); it can be seen as the id of the triplet. If we consider the triplet as A, B and X stimuli, with A and X in the same category, then TGT is the centre phone of A and X, OTH is the centre phone of B, and prev_phone and next_phone are the surrounding phones. TGT_item, OTH_item and X_item refer to the indexes of the stimuli used as A, B and X. TGT_first indicates whether A comes first in the file or not. Note that each set of three extracted triphones appears in four distinct items, corresponding to the orders AB--A (that is, X is another instance of the three-phone sequence A), BA--B, AB--B, and BA--A.
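As an illustration, here is a minimal sketch (assuming comma-separated files and the column names above) of how the presentation order of each item can be reconstructed from TGT_first:

```python
import pandas as pd

triplets = pd.read_csv("DATA/all_triplets.csv")

# TGT_first may be read back as the strings "True"/"False"; turn it into a boolean
tgt_first = triplets["TGT_first"].astype(str) == "True"

# Presentation order of each item: TGT-OTH-X if TGT_first, otherwise OTH-TGT-X
order = pd.DataFrame({
    "first": triplets["TGT_item"].where(tgt_first, triplets["OTH_item"]),
    "second": triplets["OTH_item"].where(tgt_first, triplets["TGT_item"]),
    "third": triplets["X_item"],
})
print(order.head())
```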
We give all the code to perform the human experiments, and we provide the triplets used on demand (contact juliette.millet@cri-paris.org). For each triplet, the delay between the first and second stimuli is 500 milliseconds, and between the second and third 650 milliseconds, as pilot subjects reported having difficulty recalling the reference stimuli when the delays were exactly equal.
Human results are in DATA/humans_and_models.csv; this file also contains the delta values for the models we evaluate in the paper. Each line corresponds to a (triplet, participant) pair. This file has the following columns:
- individual: code of the participant (unique within a language group)
- language: language group of the participant (French-speaking or English-speaking)
- filename: triplet id (see section on the cleaned stimuli)
- TGT: same as in DATA/all_triplets.csv
- OTH: same as in DATA/all_triplets.csv
- prev_phone: same as in DATA/all_triplets.csv
- next_phone: same as in DATA/all_triplets.csv
- TGT_item: same as in DATA/all_triplets.csv
- OTH_item: same as in DATA/all_triplets.csv
- X_item: same as in DATA/all_triplets.csv
- TGT_first: same as in DATA/all_triplets.csv (True or False)
- speaker_tgt_oth: same as in DATA/all_triplets.csv
- speaker_x: same as in DATA/all_triplets.csv
- correct_answer: the human answer, either -3, -2, -1, 1, 2 or 3. If it is negative, the participant chose the OTH item instead of the (correct) TGT item.
- binarized_answer: binarized version of correct_answer: -1 if correct_answer < 0, 1 otherwise
- nb_stimuli: number of triplets heard so far by the participant, this one included (between 1 and ~190)
- TGT_first_code: 1 if TGT_first is True, 0 otherwise
- language_code: 1 for French participants, 0 for English participants
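For example, a simple per-language accuracy check on the human responses (a sketch, assuming the file is comma-separated):

```python
import pandas as pd

results = pd.read_csv("DATA/humans_and_models.csv")

# Proportion of correct responses (binarized_answer == 1) per participant language group
print((results["binarized_answer"] == 1).groupby(results["language"]).mean())
```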
In this section we describe all the steps to evaluate any model with our methods.
First of all, you need to extract your model's representations of the 2017 ZeroSpeech one-second stimuli. The original wavfiles can be downloaded here: https://download.zerospeech.com/.
Our evaluation system requires that your system output a vector of feature values for each frame. For each utterance in the set (e.g. s2801a.wav), an ASCII feature file with the same name as the utterance (e.g. s2801a.fea) should be generated with the following format (separator = ' '):
    time1 val1 ... valN
    time2 val1 ... valN

example:

    0.0125 12.3 428.8 -92.3 0.021 43.23
    0.0225 19.0 392.9 -43.1 10.29 40.02
Note
The time is given in seconds and corresponds to the centre of each frame. In this example, there is a frame every 10 ms and the first frame spans a duration of 25 ms starting at the beginning of the file; hence the first frame is centred at 0.0125 seconds and the second 10 ms later. The frames are not required to be regularly spaced.
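As an illustration, a small helper (hypothetical, and assuming regularly spaced frames) that writes a feature matrix to this ASCII format:

```python
import numpy as np

def write_fea(path, features, frame_shift=0.010, frame_length=0.025):
    """Write a (n_frames, n_dims) array to the ASCII format described above.

    Frame times are the centres of the analysis windows: the first frame is
    centred at frame_length / 2, then one frame every frame_shift seconds.
    """
    times = frame_length / 2 + frame_shift * np.arange(len(features))
    with open(path, "w") as out:
        for t, row in zip(times, features):
            out.write(" ".join([f"{t:.4f}"] + [f"{v:.5f}" for v in row]) + "\n")

# Hypothetical example: 100 frames of 39-dimensional features for one utterance
write_fea("s2801a.fea", np.random.randn(100, 39))
```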
Once your features are in the right format, you need to put them in a global folder (called M here): put your English features in M/english/1s/ and your French features in M/french/1s/. Then run:
python script_get_file_distance.py M/ DATA/all_triplets.csv $file_delta.csv$ $distance$ DATA/english/all_aligned_clean_english.csv DATA/french/all_aligned_clean_french.csv False False
If your features are in the h5f format, replace the 'False False' at the end with 'True False'.
$file_delta.csv$ is the file created by the script: it contains all the delta values for each triplet in DATA/all_triplets.csv.
$distance$ can be 'euclidean', 'kl' or 'cosine': it is the distance used inside the DTW. This can be adapted if your representations are not numerical.
The script also prints the ABX error over the Native Perceptimatic dataset (for English and French).
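As a rough sketch of what the delta values are (see the paper for the exact definition and any normalization the script applies): for each triplet, delta is the DTW distance between X and the OTH stimulus minus the DTW distance between X and the TGT stimulus, so positive values correspond to a correct ABX decision. A schematic version:

```python
import numpy as np
from scipy.spatial.distance import cdist

def dtw_distance(a, b, metric="cosine"):
    """Plain DTW between two (n_frames, n_dims) feature matrices.

    The 'kl' option of the script would need a custom divergence instead of cdist.
    """
    costs = cdist(a, b, metric=metric)
    acc = np.full((len(a) + 1, len(b) + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            acc[i, j] = costs[i - 1, j - 1] + min(acc[i - 1, j], acc[i, j - 1], acc[i - 1, j - 1])
    return acc[len(a), len(b)]

def delta_value(tgt, oth, x, metric="cosine"):
    # delta > 0: X is closer to TGT than to OTH, i.e. the correct ABX decision
    return dtw_distance(x, oth, metric) - dtw_distance(x, tgt, metric)
```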
In order to perform the rest of the analysis easily, you can add your model's delta values to our existing file, which contains the human results and the delta values of all the models we evaluated. To do that, run:
python concatenate_results.py $file_delta.csv$ $name_model$ DATA/humans_and_models.csv $file_all.csv$
$name_model$ is the name of the new column added to the original file containing human results. You obtain a file ($file_all.csv$) containing all the data in humans_and_models.csv plus the delta values you computed.
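Conceptually (a sketch only, with assumed column names for the delta file), this step merges your per-triplet delta values into the human results file on the triplet id:

```python
import pandas as pd

# Assumed layout of the delta file: one line per triplet, columns 'filename' and 'delta'
deltas = pd.read_csv("file_delta.csv").rename(columns={"delta": "my_model"})
humans = pd.read_csv("DATA/humans_and_models.csv")

# One new column named after your model, matched on the triplet id
merged = humans.merge(deltas[["filename", "my_model"]], on="filename", how="left")
merged.to_csv("file_all.csv", index=False)
```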
Once $file_all.csv$ is created, you can compute the normal ABX accuracies over the Native Perceptimatic dataset:
python compute_results_unweighted.py $file_all.csv$ $name_model$
It will print the ABX accuracy over the Perceptimatic dataset (for French and English).
You can also compute the ABX accuracy reweighted by the human results:
python compute_results_weighted_human.py $file_all.csv$ binarized_answer $name_model$
It will print the ABX accuracy reweighted by the human results (for French and English).
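As a sketch of the unweighted computation (assuming the delta sign convention above and a model column called my_model; the actual scripts also split the results by language and handle the reweighting by human answers):

```python
import pandas as pd

data = pd.read_csv("file_all.csv")

# The file has one line per (triplet, participant); keep one line per triplet
triplets = data.drop_duplicates("filename")

# Unweighted ABX accuracy: fraction of triplets where the model's delta is positive
print((triplets["my_model"] > 0).mean())
```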
In our paper, to compare human perceptual space and models' representational space, we study how well the delta values obtained by the models' features can predict individual human results.
To do that we fit a probit regression per model using different inputs (see the paper for details), and instead of fitting it on all human results, we resample them multiple times
(in order to obtain confidence intervals on the differences, see the last section of this README). To do the same thing with your new model, and compare it to the models we used,
run:
python probit_model_bootstrap.py $file_all.csv$ $file_log.csv$ $nb_it$
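For reference, a minimal sketch of the kind of probit fit involved, using statsmodels (the actual script uses the predictors described in the paper and the resampling scheme discussed below; the columns used here are only illustrative):

```python
import pandas as pd
import statsmodels.api as sm

data = pd.read_csv("file_all.csv")

# Predict the binary human answer (recoded to 0/1) from the model's delta values
y = (data["binarized_answer"] > 0).astype(int)
X = sm.add_constant(data[["my_model", "nb_stimuli", "TGT_first_code", "language_code"]])
fit = sm.Probit(y, X).fit()
print(fit.llf)  # log-likelihood, the quantity compared across models
```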
To obtain average log-likelihoods along with 95% intervals do:
python compute_log_interval.py $file_log.csv$ $final_average_log.csv$
To obtain average log-likelihood differences along with 95% intervals, do:
python compute_log_interval.py $file_log.csv$ $final_average_diff_log.csv$
The log-likelihood difference is loglik(model named first) - loglik(model named second).
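The 95% intervals come from the resampled fits; here is a minimal sketch of the computation (assuming simple percentile intervals and that $file_log.csv$ holds one column per model and one line per resampling):

```python
import numpy as np
import pandas as pd

logs = pd.read_csv("file_log.csv")

# Mean log-likelihood and 95% percentile interval over the resamplings, per model
for model in logs.columns:
    values = logs[model].to_numpy()
    low, high = np.percentile(values, [2.5, 97.5])
    print(model, values.mean(), (low, high))
```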
The delta values obtained for the different models can be found in the file DATA/humans_and_models.csv, one column per model, with the codenames given in the paper. If you want to extract the features yourself in order to recompute the delta values, you can follow these instructions:
The features used for the ZS2017 models are the ones submitted to the 2017 ZeroSpeech challenge. They can be found online via the Zenodo links listed here: https://zerospeech.com/2017/results.html
Here is the list of the models we evaluated and their corresponding Zenodo numbers (we also indicate whether the downloaded features are already in .fea format, in h5f format, or need to be modified):
Model | Zenodo number | type | distance used |
---|---|---|---|
S2 | 823695 | fea | cosine |
S1 | 815089 | fea | cosine |
H | 821246 | fea | kl |
P1 | 819892 | h5f | cosine |
P2 | 820409 | h5f | cosine |
A3 | 823546 | h5f | cosine |
Y1 | 814335 | need modif | cosine |
Y2 | 814579 | fea | cosine |
Y3 | 814566 | fea | cosine |
C1 | 822737 | fea | cosine |
C2 | 808915 | need modif | cosine |
The topline used is a supervised GMM-HMM model with a bigram language model, trained with a Kaldi recipe. We used exactly the same model as for the 2017 ZeroSpeech challenge. We cannot provide the models themselves (one trained on the 2017 ZeroSpeech French training set, the other on the English training set), but we provide the extracted posteriorgrams on demand (contact juliette.millet@cri-paris.org).
The MFCCs used in the paper were extracted with the Kaldi toolkit, using the default parameters and adding the first and second derivatives, for a total of 39 dimensions; we apply mean-variance normalization over a moving 300 millisecond window. We provide the extracted MFCCs on demand (contact juliette.millet@cri-paris.org).
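A sketch of the kind of moving-window mean-variance normalization applied (the MFCC extraction itself is done with Kaldi; the edge handling here is an assumption):

```python
import numpy as np

def mvn_moving_window(features, frame_shift=0.010, window=0.300):
    """Normalize each frame with the mean and std of a centred 300 ms window."""
    half = max(1, int(round(window / frame_shift)) // 2)
    out = np.empty_like(features)
    for i in range(len(features)):
        chunk = features[max(0, i - half):i + half + 1]
        out[i] = (features[i] - chunk.mean(axis=0)) / (chunk.std(axis=0) + 1e-8)
    return out
```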
We used the Shennong package (https://github.com/bootphon/shennong) to extract the multilingual bottleneck features described in [2].
To extract these features you need the Shennong package installed (in addition to the requirements listed above). To extract features from the wavfiles in a folder F to a folder G, run:
python extract_from_shennong_bottleneck.py F G BabelMulti
We use the Kaldi toolkit to extract MFCCs and apply the same VTLN as in [1] (the VTLN-MFCCs can be provided on demand, contact juliette.millet@cri-paris.org); then we extract the posteriorgrams from the French and English models from [1], following the instructions at https://github.com/geomphon/CogSci-2019-Unsupervised-speech-and-human-perception
To study how well each model is able to predict human results, we fit, for each model, a probit regression on the human results using delta values and a set of other parameters. Instead of fitting it on all human results, we fit it on multiple subsamples (N=13682; for each stimulus, we draw three observations -- human binary answers -- without replacement). In the paper, we provide only the mean log-likelihood obtained by this resampling, but here we present the mean differences between models and indicate whether the difference is significant. The following table contains the means of the differences. The values are in bold if the difference is significant (i.e., if the 95% interval is above zero).
 | MFCC | P2 | P1 | Y2 | Y1 | C2 | Y3 | C1 | A3 | H | S1 | Bot | DP | S2 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
topline | 63.3 | 91.1 | 85.9 | 106.8 | 100.0 | 107.1 | 110.8 | 124.3 | 160.5 | 177.4 | 212.0 | 216.3 | 236.3 | 252.1 |
MFCC | - | 27.8 | 22.5 | 43.5 | 36.7 | 43.8 | 47.5 | 61.0 | 97.2 | 114.1 | 148.7 | 153.0 | 173.0 | 188.8 |
P2 | | - | -5.2 | 15.7 | 8.9 | 16.0 | 19.7 | 33.1 | 69.4 | 86.3 | 120.9 | 125.2 | 145.1 | 160.9 |
P1 | | | - | 20.9 | 14.1 | 21.2 | 24.9 | 38.3 | 74.6 | 91.5 | 126.1 | 130.4 | 150.4 | 166.2 |
Y2 | | | | - | -6.7 | 0.3 | 4.0 | 17.5 | 53.7 | 70.6 | 105.2 | 109.6 | 129.5 | 145.2 |
Y1 | | | | | - | 7.1 | 10.8 | 24.2 | 60.4 | 77.4 | 112.0 | 116.3 | 136.2 | 152.0 |
C2 | | | | | | - | 3.7 | 17.1 | 53.4 | 70.3 | 104.9 | 109.2 | 129.1 | 144.9 |
Y3 | | | | | | | - | 13.4 | 49.7 | 66.5 | 101.2 | 105.5 | 125.4 | 141.2 |
C1 | | | | | | | | - | 36.2 | 53.2 | 87.7 | 92.1 | 112.0 | 127.8 |
A3 | | | | | | | | | - | 16.8 | 51.5 | 55.8 | 75.7 | 91.5 |
H | | | | | | | | | | - | 34.6 | 38.9 | 58.8 | 74.6 |
S1 | | | | | | | | | | | - | 4.4 | 24.3 | 40.1 |
Bot | | | | | | | | | | | | - | 19.9 | 35.7 |
DP | | | | | | | | | | | | | - | 15.8 |
[1] Millet, J., Jurov, N., & Dunbar, E. (2019). Comparing unsupervised speech learning directly to human performance in speech perception. In Proceedings of the 41st Annual Meeting of the Cognitive Science Society (CogSci 2019).

[2] Fer, R., Matějka, P., Grézl, F., Plchot, O., Veselý, K., & Černocký, J. H. (2017). Multilingually trained bottleneck features in spoken language recognition. Computer Speech & Language, 46, 252-267.