WaveNet

TensorFlow implementation of the WaveNet model


Table of contents

General info
Technologies
Setup
Dataset
Usage
Demo
Results
Reference
Citation

Speech-to-Text-WaveNet : End-to-end sentence-level English speech recognition

The architecture is shown in the following figure, which is cropped from [WaveNet: A Generative Model for Raw Audio](https://arxiv.org/abs/1609.03499).

The WaveNet neural network architecture directly generates a raw audio waveform, showing excellent results in text-to-speech and general audio generation.

The network models the conditional probability to generate the next sample in the audio waveform, given all previous samples and possibly additional parameters.
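Concretely, as given in the WaveNet paper, the joint probability of a waveform $\mathbf{x} = (x_1, \ldots, x_T)$ is factorized as a product of per-sample conditionals:

$$p(\mathbf{x}) = \prod_{t=1}^{T} p(x_t \mid x_1, \ldots, x_{t-1})$$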

After an audio preprocessing step, the input waveform is quantized to a fixed integer range. The integer amplitudes are then one-hot encoded to produce a tensor of shape (num_samples, num_channels).
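As an illustration, here is a minimal NumPy sketch of that step, assuming the 8-bit mu-law companding used in the WaveNet paper (the helper name `mu_law_quantize` and the exact parameters are illustrative, not taken from this repository):

```python
import numpy as np

def mu_law_quantize(audio, quantization_channels=256):
    """Map a float waveform in [-1, 1] to integers in [0, quantization_channels - 1]
    via mu-law companding, then one-hot encode the result."""
    mu = quantization_channels - 1
    # Mu-law companding compresses the amplitude range non-linearly.
    signal = np.sign(audio) * np.log1p(mu * np.abs(audio)) / np.log1p(mu)
    quantized = ((signal + 1) / 2 * mu + 0.5).astype(np.int32)
    # One-hot encode: shape (num_samples, num_channels).
    one_hot = np.eye(quantization_channels, dtype=np.float32)[quantized]
    return quantized, one_hot

# Example: one second of a 440 Hz sine at 16 kHz.
t = np.linspace(0, 1, 16000, endpoint=False)
_, encoded = mu_law_quantize(np.sin(2 * np.pi * 440 * t))
print(encoded.shape)  # (16000, 256)
```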

A causal convolutional layer, which only accesses the current and previous inputs, then reduces the channel dimension.

The core of the network is a stack of causal dilated layers: each is a dilated convolution (a convolution with holes) that only accesses current and past audio samples.
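For example, with kernel size 2 and dilation rates 1, 2, 4, ..., 2^(L-1), a stack of L such layers has a receptive field of 2^L samples, so the amount of past context grows exponentially with depth while the number of layers grows only linearly.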

The outputs of all layers are combined and extended back to the original number of channels by a series of dense postprocessing layers, followed by a softmax function to transform the outputs into a categorical distribution.
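As an illustration, here is a minimal tf.keras sketch of the pipeline just described (initial causal convolution, dilated stack, dense postprocessing, softmax). The layer sizes are placeholders, and the residual and skip connections of the full WaveNet are omitted; see wavenet.py for the actual implementation:

```python
import tensorflow as tf

def build_toy_wavenet(num_channels=256, residual_channels=32, num_layers=8):
    """Toy causal/dilated stack in the spirit of the description above."""
    inputs = tf.keras.Input(shape=(None, num_channels))  # one-hot waveform
    # Initial causal convolution reduces the channel dimension.
    x = tf.keras.layers.Conv1D(residual_channels, kernel_size=2,
                               padding='causal')(inputs)
    # Dilated causal convolutions; the dilation doubles at each layer.
    for i in range(num_layers):
        x = tf.keras.layers.Conv1D(residual_channels, kernel_size=2,
                                   padding='causal', dilation_rate=2 ** i,
                                   activation='tanh')(x)
    # Dense (1x1 convolution) postprocessing layers expand back to the
    # original number of channels; softmax yields a categorical distribution.
    x = tf.keras.layers.Conv1D(num_channels, kernel_size=1, activation='relu')(x)
    outputs = tf.keras.layers.Conv1D(num_channels, kernel_size=1,
                                     activation='softmax')(x)
    return tf.keras.Model(inputs, outputs)

model = build_toy_wavenet()
model.summary()
```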

The loss function is the cross-entropy between the output for each timestep and the input at the next timestep.
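Under the same assumptions as the sketches above, that loss can be written as follows: the target at timestep t is simply the one-hot input at timestep t + 1.

```python
import tensorflow as tf

def next_sample_loss(one_hot_inputs, predictions):
    """Cross-entropy between the prediction at timestep t and the
    quantized input at timestep t + 1 (inputs shifted left by one)."""
    targets = one_hot_inputs[:, 1:, :]    # input at the next timestep
    outputs = predictions[:, :-1, :]      # prediction for that timestep
    return tf.reduce_mean(
        tf.keras.losses.categorical_crossentropy(targets, outputs))
```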

In this repository, the network implementation can be found in wavenet.py.

Technologies

TensorFlow needs to be installed before running the training script. The code is tested on TensorFlow 2 with Python 3.10.

In addition, glog (used for logging) and librosa (used for loading and processing audio) must be installed.

To install the required Python packages, run

pip install -r requirements.txt

Setup

To run this project, clone the repository and install it locally, as shown in the example below.
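For example, clone the repository (the URL is taken from the citation below) and then install the requirements as shown above:

git clone https://github.com/LounesAl/WaveNet.git
cd WaveNet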

Dataset

You can use any corpus containing .wav files.

Usage

Create dataset

  1. Download and extract the dataset (only VCTK is supported).
  2. Assuming the VCTK dataset directory is C:/speech_to_text, execute the following to create the records for training or testing:
python tools/create_tf_record.py -input_dir='C:/speech_to_text'

Execute the following to train the model.

python train.py

Execute the following to evaluate the model.

python test.py

Demo

1. Download the pretrained model (Best model) and extract it to the 'release' directory.

Link to the best weights of the pretrained WaveNet

2. Execute the following to transcribe a speech wave file into an English sentence. The result will be printed to the console.

python demo.py -input_path <wave_file path>

For example, try the following command.

python demo.py -input_path=data/demo.wav -ckpt_model=release/<name of the model>

Results

Running the demo on a WAV file of the sentence

"Ask her to bring these things with her from the store"

gives the result shown in the figure below:

Reference

Ibab. tensorflow-wavenet, 2016. GitHub repository. https://github.com/ibab/tensorflow-wavenet/.

Citation

L. Allioui, S. Brahami, B. Ghoul, A. Mezemate. WaveNet, 2022. GitHub repository. https://github.com/LounesAl/WaveNet.