EMD-MFCC-SVM Speech Repository

Repository with some of the data and code related to the speech experiments conducted in the paper "Machine Learning Mitigants for Speech Based Cyber Risk", which can be found at https://ieeexplore.ieee.org/document/9555610

Data sets

In the experiments we consider three datasets. Two of the three are provided in this repo, since they were designed ad hoc for the paper. In particular, we use these two to test our novel methodology within a text-dependent and speaker-dependent verification system (TD-SD-SV), relevant to ASV challenges characterised by these conditions. The first dataset involves a set of sentences constructed to be challenging and to reflect a real ASV setting in which sentences are not phonetically balanced. We obtained them from the first part (Inferno) of Dante Alighieri's "The Divine Comedy". The second dataset is a reference set based on the IEEE Recommended Practices for Speech Quality Measurements, as described in [48], extensively used in speech analysis and speaker verification testing. It sets out seventy-two lists of ten phrases described as the 1965 Revised List of Phonetically Balanced Sentences, otherwise known as the 'Harvard Sentences'. These are widely used in telecommunications, speech, and acoustics research, where standardised and repeatable speech sequences are needed.

In both datasets, two real-language sources were used, from a female (Speaker 1) and a male (Speaker 2); for the synthetic speech, five corresponding sources (T1, T2, T3, T4, T5, described in Table 10) were employed for the female case and one source (T1) for the male case. The synthetic voices of all TTS algorithms were selected to have an English accent. The voice recordings were sampled at 44.1 kHz without significant channel or background noise, to develop a text-dependent scenario relevant to speaker verification tasks [49]. The recording environments of the training and testing voice samples were identical, to avoid mismatched conditions (see [14] and [49]). Common sentences were used for each speaker and the synthetic voice. The data were then partitioned into training and testing sets.

Note that no recording laboratory or specialised microphone was used, and the utterances were recorded in noisy, reverberant environments. This is particularly relevant since it reproduces the adverse environments commonly encountered in ASV challenges; the obtained results therefore carry the added feature of robustness to these kinds of speech settings.

The duration of each sentence's speech recording was approximately 15 seconds to at most 1 minute, producing between 661k and 2,646k samples per spoken sentence. The start and end of each sample were trimmed to remove any non-speech segments, and the signal was decimated to a set of 60k total samples. For the IMF extraction procedure, each set of 60k samples for one sentence was then windowed into non-overlapping collections of 5,000 samples and passed to the EMD sifting procedure. Afterwards, the features presented in Table 1 were extracted. We note that in some cases, for high-frequency instantaneous frequency features, it was advantageous also to apply a median filter (we used a window of 2 ms).
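For orientation, the sketch below illustrates this windowed EMD step in R. The 5,000-sample frame length and the ~2 ms median filter come from the description above; the file name, variable names and the choice of the tuneR and EMD packages are illustrative assumptions, not necessarily the repository's exact code.

```r
## Illustrative sketch of the windowed EMD/IMF extraction described above.
## "sentence_01.wav" is a hypothetical file name; tuneR/EMD are one possible
## package choice, not necessarily the one used in the repository scripts.
library(tuneR)   # read the .wav recordings
library(EMD)     # empirical mode decomposition (sifting)

wav <- readWave("sentence_01.wav")
x   <- as.numeric(wav@left)        # assume already trimmed/decimated to ~60k samples

frame_len <- 5000                  # non-overlapping windows of 5,000 samples
n_frames  <- floor(length(x) / frame_len)

imfs_per_frame <- vector("list", n_frames)
for (i in seq_len(n_frames)) {
  frame <- x[((i - 1) * frame_len + 1):(i * frame_len)]
  dec   <- emd(frame)              # EMD sifting; returns dec$imf, dec$nimf, dec$residue
  imfs_per_frame[[i]] <- dec$imf   # matrix with one IMF per column
}

## Optional median filter (~2 ms window) on a high-frequency instantaneous
## frequency feature; 89 is an illustrative odd window length in samples.
## inst_freq_smoothed <- runmed(inst_freq, k = 89)
```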

In the first dataset, the total number of recorded sentences was 960, with equally proportioned samples of the same sentences across all voice recordings; 80% were randomly selected for training and the rest for testing. In the second dataset, the training set was constructed from the first sentence of each of the seventy-two Harvard Sentence lists, and the testing set from the second sentence of each list. This led to 1,152 utterances split equally between training and testing sets.
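A minimal R sketch of the 80/20 random split used for the first dataset is given below; `features` and `labels` are placeholder objects, not variables from the repository code.

```r
## Hypothetical 80/20 random split (first dataset, 960 utterances).
set.seed(1)                                   # for reproducibility
n         <- nrow(features)                   # placeholder feature matrix
train_idx <- sample(n, size = floor(0.8 * n)) # 80% of utterances for training

train_x <- features[train_idx, ];  train_y <- labels[train_idx]
test_x  <- features[-train_idx, ]; test_y  <- labels[-train_idx]
```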

Repository Organisation

The folders within this repository are organised as follows:

  • Data: the speech signals are provided within two different folders, one for each dataset. Both in-sample and out-of-sample data are provided. In the body of the paper we report results for Speaker 1 (the female voice) only; therefore these folders contain the female voices. Male speech signals may also be made available upon request.

  • Code:
    * For each speaker, both synthetic and real voices, we provide a file creating the required model or time series in R, called, for example, "SPEAKER1_MODEL" for the real speaker or "SYNTHETIC_MODEL" for the synthetic one. Afterwards, the EMD is applied and the IMFs are extracted. Note that when the code is run, results need to be saved by the user. A second file, called "Speaker1_Extraxtion_feeature" or "Synthetic_Extraxtion_feeature", then extracts the features required for the SVM: the instantaneous frequencies, the statistics and the spline coefficients. These features are then passed through the file "Preprocessing_1", which standardises the features. For the EMD-MFCCs, another file is provided, called "SYNT_SPEAK1_CEPSTRUM_IMF_FEATURE". Note that depending on the experiment, the directory needs to be updated along with the m parameter representing the number of sentences (i.e. for Ex.1 m = 100, while for Ex.2 m = 72). This also applies to the in-sample analysis (as before) and the out-of-sample analysis (i.e. for Ex.1 m = 20, while for Ex.2 m = 72). Once the features are extracted, they can be passed through the code for the SVM provided in the following folders (a minimal sketch of this final step is given after this list).
    * InSample_Code: this folder contains the code used for the in-sample analysis. As above, the directory and the m parameter have to be updated when running the code. The folder contains one file for the SVM related to each feature, plus R files generating the final tables.
    * OutOfSample_Code: this folder contains the code used for the out-of-sample analysis. As above, the directory and the m parameter have to be updated when running the code. The folder contains one file for the SVM related to each feature, plus R files generating the final tables.
    * MultiKernelLearning: this folder contains the code used for the Multi-Kernel Learning experiments, whose results are provided in the body of the paper. The folder is split by experiment, since the weights and the kernel functions used for the MKL are computed separately for each; these functions are then passed to the main SVM function.
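As referenced in the Code item above, here is a minimal, hedged sketch of the final classification step in R: standardising the extracted features (as in "Preprocessing_1") and fitting an SVM. The e1071 package, the radial kernel and the variable names are illustrative assumptions; consult the files in the folders above for the exact settings.

```r
## Illustrative final step: standardise features and fit an SVM.
## e1071 and the radial kernel are assumptions, not necessarily the
## repository's exact choices.
library(e1071)

## Standardise with the training-set statistics.
mu  <- colMeans(train_x)
sdv <- apply(train_x, 2, sd)
train_s <- scale(train_x, center = mu, scale = sdv)
test_s  <- scale(test_x,  center = mu, scale = sdv)

## Binary verification task: real vs. synthetic speech.
fit  <- svm(x = train_s, y = as.factor(train_y), kernel = "radial")
pred <- predict(fit, test_s)
mean(pred == test_y)   # out-of-sample accuracy
```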

Cite

If you use this code in your project, please cite:

@article{campi2021machine,
  title     = {Machine learning mitigants for speech based cyber risk},
  author    = {Campi, Marta and Peters, Gareth W and Azzaoui, Nourddine and Matsui, Tomoko},
  journal   = {IEEE Access},
  volume    = {9},
  pages     = {136831--136860},
  year      = {2021},
  publisher = {IEEE}
}

Queries

For any queries, I am available at marta.campi.15@ucl.ac.uk or marta.campi.11@gmail.com