Keras implementation of the method described in the paper 'LipNet: End-to-End Sentence-level Lipreading' by Yannis M. Assael, Brendan Shillingford, Shimon Whiteson, and Nando de Freitas (https://arxiv.org/abs/1611.01599).
| Scenario | Epoch | CER | WER | BLEU |
|---|---|---|---|---|
| Unseen speakers [C] | N/A | N/A | N/A | N/A |
| Unseen speakers | 178 | 6.19% | 14.19% | 88.21% |
| Overlapped speakers [C] | N/A | N/A | N/A | N/A |
| Overlapped speakers | 368 | 1.56% | 3.38% | 96.93% |
Notes:
- [C] means using curriculum learning.
- N/A means the training is either still in progress or has not been performed.
- Your contribution in sharing the results of this model is highly appreciated :)
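CER and WER are character- and word-level edit-distance error rates, and BLEU is the standard n-gram overlap score. The sketch below shows one common way to compute CER/WER; it assumes the third-party `editdistance` package and is not necessarily the exact evaluation code used in this repository.

```python
# Minimal CER/WER: Levenshtein distance normalized by the reference length.
import editdistance  # assumed third-party dependency, not necessarily the repo's own metric code

def cer(reference, hypothesis):
    return editdistance.eval(list(reference), list(hypothesis)) / max(len(reference), 1)

def wer(reference, hypothesis):
    ref_words, hyp_words = reference.split(), hypothesis.split()
    return editdistance.eval(ref_words, hyp_words) / max(len(ref_words), 1)

print(wer("place blue at f two now", "place blue at f too now"))  # ~0.167: 1 of 6 words wrong
```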
This package requires:
- Keras 2.0+
- TensorFlow 1.0+
- pip (for package installation)

plus several other libraries listed in `setup.py`.
To use the model, first you need to clone the repository:
```
git clone https://github.com/rizkiarm/LipNet
```
Then you can install the package:
```
cd LipNet/
pip install -e .
```
Note: if you don't want to use CUDA, you need to edit `setup.py` and change `tensorflow-gpu` to `tensorflow`.
You're done!
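As a quick sanity check that the installation worked (assuming the package installs under the name `lipnet`, as in this repository's `setup.py`):

```python
# Verify that the package and its Keras/TensorFlow backend import cleanly.
import lipnet  # noqa: F401  -- package name assumed from setup.py
import keras

print("Keras backend:", keras.backend.backend())  # should print "tensorflow"
```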
Here are some ideas on what you can do next:
- Modify the package and make some improvements to it.
- Train the model using predefined training scenarios.
- Make your own training scenarios.
- Use the pre-trained weights to do lipreading.
- Go crazy and experiment on other datasets by changing some hyperparameters or modifying the model.
This model uses the GRID corpus (http://spandh.dcs.shef.ac.uk/gridcorpus/).
There are five different training scenarios that are (going to be) available. Before running any of them, prepare the data as follows:
- Download all videos (normal) and alignments from the GRID Corpus website.
- Extract all the videos and alignments.
- Create a `datasets` folder in each training scenario folder.
- Create an `align` folder inside the `datasets` folder.
- All current `train.py` scripts expect the videos to be 100x50px mouth-crop image frames. You can change this by adding `vtype = "face"` and `face_predictor_path` (which can be found in `evaluation/models`) to the instantiation of `Generator` inside `train.py`; see the sketch after this list.
- Alternatively, extract the mouth crops using `scripts/extract_mouth_batch.py` (usage can be found inside the script).
- Create a symlink from each `training/*/datasets/align` to your align folder.
- You can change the training parameters by modifying `train.py` inside its respective scenario.
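For the face-frame option mentioned above (`vtype = "face"` plus `face_predictor_path`), the change is a small edit to the existing `Generator(...)` call in `train.py`. The sketch below is illustrative only: keep whatever arguments your scenario's `train.py` already passes (shown as `...`), and note that the predictor filename is an assumption, so point it at whichever dlib predictor file ships under `evaluation/models`.

```python
# Sketch of the edit inside train.py -- not a standalone script.
# The predictor filename below is an assumption; use the dlib face landmark
# predictor found under evaluation/models.
FACE_PREDICTOR_PATH = "evaluation/models/shape_predictor_68_face_landmarks.dat"

lip_gen = Generator(...,                        # keep all existing arguments as-is
                    vtype="face",               # feed full face frames instead of 100x50 mouth crops
                    face_predictor_path=FACE_PREDICTOR_PATH)
```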
Create a symlink from `training/random_split/datasets/video` to your video dataset folder (which contains the `s*` directories).
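If you prefer to do the symlink step from Python instead of `ln -s`, a minimal sketch (the source path `/data/GRID/video` is an assumption; point it at your own video dataset folder):

```python
# Create training/random_split/datasets/video as a symlink to your GRID video folder.
from pathlib import Path

video_source = Path("/data/GRID/video").resolve()   # assumed location of the s* directories
link = Path("training/random_split/datasets/video")
link.parent.mkdir(parents=True, exist_ok=True)
if not link.exists():
    link.symlink_to(video_source, target_is_directory=True)
```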
Train the model using the following command:
```
./train random_split [GPUs (optional)]
```
Note: you can change the validation split value by modifying the `val_split` argument inside `train.py`.
Create the following folders:
```
training/unseen_speakers/datasets/train
training/unseen_speakers/datasets/val
```
Then, create symlinks from `training/unseen_speakers/datasets/[train|val]/s*` to your selection of `s*` inside the video dataset folder.
The paper used `s1`, `s2`, `s20`, and `s22` for evaluation and the remainder for training.
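A minimal sketch of setting up that split programmatically (the source path `/data/GRID/video` is an assumption; `s1`, `s2`, `s20`, and `s22` go to `val` as in the paper, everything else to `train`):

```python
# Symlink each GRID speaker folder into train/ or val/ for the unseen_speakers scenario.
from pathlib import Path

video_source = Path("/data/GRID/video").resolve()   # assumed location of the s* directories
val_speakers = {"s1", "s2", "s20", "s22"}           # evaluation speakers used in the paper

for speaker in sorted(video_source.glob("s*")):
    split = "val" if speaker.name in val_speakers else "train"
    link = Path("training/unseen_speakers/datasets") / split / speaker.name
    link.parent.mkdir(parents=True, exist_ok=True)
    if not link.exists():
        link.symlink_to(speaker, target_is_directory=True)
```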
Train the model using the following command:
```
./train unseen_speakers [GPUs (optional)]
```
Prepare the datasets the same way as for the unseen speakers scenario. Train the model using the following command:
```
./train unseen_speakers_curriculum [GPUs (optional)]
```
Note: you can change the curriculum by modifying the `curriculum_rules` method inside `train.py`.
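The exact signature and return keys of `curriculum_rules` depend on the scenario's `train.py`; purely as an illustration, a curriculum here is just a mapping from the current epoch to difficulty/augmentation settings, for example:

```python
# Illustrative sketch only -- the real method lives in the scenario's train.py,
# and its return keys must match what that script's curriculum expects.
def curriculum_rules(epoch):
    # Start with short, heavily augmented samples, then move to full sentences.
    if epoch < 10:
        return {"sentence_length": 1, "flip_probability": 0.5, "jitter_probability": 0.05}
    if epoch < 30:
        return {"sentence_length": 2, "flip_probability": 0.5, "jitter_probability": 0.05}
    return {"sentence_length": -1, "flip_probability": 0.5, "jitter_probability": 0.05}
```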
Run the preparation script:
```
python prepare.py [Path to video dataset] [Path to align dataset] [Number of samples]
```
Notes:
- `[Path to video dataset]` should be a folder with structure: `/s{i}/[video]`
- `[Path to align dataset]` should be a folder with structure: `/[align].align`
- `[Number of samples]` should be less than or equal to `min(len(ls '/s{i}/*'))`
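To find the largest valid `[Number of samples]`, count the videos available per speaker; a minimal sketch (the dataset path is an assumption):

```python
# Compute the maximum usable [Number of samples] for prepare.py: the smallest
# number of videos found in any s{i} speaker folder.
from pathlib import Path

video_root = Path("/data/GRID/video")               # assumed location of the s* directories
counts = {d.name: sum(1 for _ in d.iterdir())
          for d in sorted(video_root.glob("s*")) if d.is_dir()}
print(counts)
print("Max usable [Number of samples]:", min(counts.values()))
```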
Then run training for each speaker:
```
python training/overlapped_speakers/train.py s{i}
```
Copy `prepare.py` from the `overlapped_speakers` folder to the `overlapped_speakers_curriculum` folder, and run it as previously described in the overlapped speakers training explanation.
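The copy step is just a plain file copy, e.g. (paths relative to the repository root):

```python
# Copy prepare.py from the overlapped_speakers scenario into its curriculum counterpart.
import shutil

shutil.copy("training/overlapped_speakers/prepare.py",
            "training/overlapped_speakers_curriculum/prepare.py")
```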
Then run training for each speaker:
```
python training/overlapped_speakers_curriculum/train.py s{i}
```
Note: as always, you can change the curriculum by modifying the `curriculum_rules` method inside `train.py`.
To evaluate and visualize the trained model on a single video / image frames, you can execute the following command:
```
./predict [path to weight] [path to video]
```
Example:
```
./predict evaluation/models/overlapped-weights368.h5 evaluation/samples/id2_vcd_swwp2s.mpg
```
This is a work in progress, so errors are to be expected. If you find any implementation errors, please report them by submitting an issue or making a PR. Thanks!
Some todos:
- Use ~~Stanford-CTC~~ TensorFlow CTC beam search
- Auto spelling correction
- Overlapped speakers (and its curriculum) training
- Integrate language model for beam search
- RGB normalization over the dataset.
- Validate CTC implementation in training.
- Proper documentation
- Unit tests
- (Maybe) better curriculum learning.
- (Maybe) some proper scripts to do dataset stuff.
MIT License