Vowel Classification for Parkinson's Disease

About the Dataset:

What is the 'Saarbruecken Voice Database'?

The Saarbruecken Voice Database is a collection of voice recordings from more than 2000 people. Each recording session contains the following:

Recordings of the vowels [i, a, u] produced at normal, high and low pitch.
Recordings of the vowels [i, a, u] with rising-falling pitch.

Methodology

Feature Extraction:

Feature extraction is an important step in analyzing the data and finding relations within it. Models cannot work with raw audio directly, so feature extraction is used to convert the recordings into a format they can process. We convert the audio files to Mel Frequency Cepstral Coefficients (MFCCs), i.e. short-term spectral features of a signal, which the models accept as input. MFCCs concisely describe the overall shape of the signal's spectral envelope.
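
To make the MFCC step concrete, here is a minimal NumPy sketch of the usual pipeline: framing, windowing, power spectrum, mel filterbank, log, and DCT. It is not the extraction code used in this project. The function names, the sample rate, the frame sizes, and the number of filters and coefficients are all illustrative assumptions, and a synthetic tone stands in for a real recording.

```python
import numpy as np

def hz_to_mel(hz):
    return 2595.0 * np.log10(1.0 + hz / 700.0)

def mel_to_hz(mel):
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

def mel_filterbank(n_filters, n_fft, sample_rate):
    """Triangular filters spaced evenly on the mel scale."""
    mel_points = np.linspace(hz_to_mel(0.0), hz_to_mel(sample_rate / 2.0), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_points) / sample_rate).astype(int)
    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        left, center, right = bins[i - 1], bins[i], bins[i + 1]
        for k in range(left, center):      # rising edge of the triangle
            fbank[i - 1, k] = (k - left) / (center - left)
        for k in range(center, right):     # falling edge of the triangle
            fbank[i - 1, k] = (right - k) / (right - center)
    return fbank

def mfcc(signal, sample_rate, n_mfcc=13, n_filters=26, n_fft=512,
         frame_len=400, hop=160):
    """Return an array of shape (n_frames, n_mfcc). All defaults are assumptions."""
    # Slice the signal into overlapping frames and apply a Hamming window.
    n_frames = 1 + (len(signal) - frame_len) // hop
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    frames = signal[idx] * np.hamming(frame_len)
    # Power spectrum of each frame (frames are zero-padded to n_fft).
    power = np.abs(np.fft.rfft(frames, n=n_fft)) ** 2 / n_fft
    # Mel filterbank energies, followed by log compression.
    energies = power @ mel_filterbank(n_filters, n_fft, sample_rate).T
    log_energies = np.log(energies + 1e-10)
    # DCT-II of the log energies; keep the first n_mfcc coefficients.
    n = np.arange(n_filters)
    basis = np.cos(np.pi * np.outer(np.arange(n_mfcc), 2 * n + 1) / (2 * n_filters))
    return log_energies @ basis.T

# Synthetic one-second tone as a stand-in for a vowel recording (assumption).
sr = 16000
t = np.arange(sr) / sr
tone = np.sin(2 * np.pi * 220.0 * t)
features = mfcc(tone, sr)
print(features.shape)
```

In practice an audio library would normally handle this step. The sketch only shows what the coefficients are computed from: each row of the result summarizes the spectral envelope of one short frame.
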
The categories are the three vowels a, i, u:

Vowel a

Vowel i

Vowel u

Recurrent Neural Network

For this data we used a stateful LSTM, which allows us to simplify the overall network structure. All we need is an LSTM layer followed by a Dense layer. A single audio sample is fed to the network and a single sample is predicted. There is no need for a range of samples, since the necessary information about the past signal is stored in the LSTM’s recurrent state. It’s important to note that a skip connection is used: the input sample value is added to the output value. This way, the network only has to learn the difference between the output and the input.
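
The structure described above can be sketched as a forward pass in plain NumPy: one LSTM cell whose state persists between calls, a Dense layer on top, and the input added back to the output. This is an illustration only, not the trained model from this project. The class name, the hidden size, and the random untrained weights are assumptions, and the deep learning framework actually used is not shown here.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class StatefulLSTMSketch:
    """LSTM layer + Dense layer + input skip connection (forward pass only)."""

    def __init__(self, hidden_size=8, seed=0):
        rng = np.random.default_rng(seed)
        # Stacked weights for the input, forget, candidate and output gates.
        self.W = rng.normal(0.0, 0.1, (4 * hidden_size, 1))            # input weights
        self.U = rng.normal(0.0, 0.1, (4 * hidden_size, hidden_size))  # recurrent weights
        self.b = np.zeros(4 * hidden_size)
        # Dense layer mapping the hidden state to a single value.
        self.dense_w = rng.normal(0.0, 0.1, hidden_size)
        self.dense_b = 0.0
        self.hidden_size = hidden_size
        self.reset_state()

    def reset_state(self):
        """Clear the recurrent state, e.g. before starting a new recording."""
        self.h = np.zeros(self.hidden_size)
        self.c = np.zeros(self.hidden_size)

    def step(self, x):
        """Consume one sample and return one predicted sample."""
        z = self.W[:, 0] * x + self.U @ self.h + self.b
        i, f, g, o = np.split(z, 4)
        # The state is kept on the object, so past samples influence later calls.
        self.c = sigmoid(f) * self.c + sigmoid(i) * np.tanh(g)
        self.h = sigmoid(o) * np.tanh(self.c)
        residual = float(self.dense_w @ self.h + self.dense_b)
        # Skip connection: the network output is only the difference to the input.
        return x + residual

model = StatefulLSTMSketch()
first = model.step(0.5)
second = model.step(0.5)   # same input, but the carried-over state changes the output
model.reset_state()
again = model.step(0.5)    # after a reset the first result is reproduced
```

Because the state lives inside the model, each call needs only the current sample. Resetting the state between independent recordings keeps one recording from influencing the next.
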

Training: Loss: 0.0267 Accuracy: 99.13%

Testing: Loss: 0.0334 Accuracy: 99.37%