This is the repository for LipLearner: Customizable Silent Speech Interactions on Mobile Devices (CHI 2023). It contains PyTorch scripts for contrastive pretraining and the source code of the iOS application. Please refer to our paper for more details: arXiv preprint, ACM Digital Library.
- LipLearner: Customizable Silent Speech Interactions on Mobile Devices
- Content
- Learn Visual Speech Representations with Contrastive Learning
- On-device Silent Speech Recognizer with In-situ command customization
- Citation
- License
The pretraining
folder contains the contrastive pretraining scripts based on Feng et al.'s 3D convolutional neural networks. We use the public lipreading dataset LRW to learn efficient visual speech representations, which serve as the cornerstone of our few-shot silent speech command customization framework.
- PyTorch 1.12
- Torchvision
- OpenCV-Python
- SciPy
- TurboJPEG and PyTurboJPEG
- Install dependencies
cd pretraining
pip install -r requirements.txt
- Download LRW Dataset and link
lrw_mp4
in the root of this repository:
ln -s PATH_TO_DATA/lrw_mp4 .
- Run
scripts/prepare_lrw.py
to generate the LRW training samples:
python scripts/prepare_lrw.py
Processed data will be saved in the lrw_roi_63_99_191_227_size128_npy_gray_pkl_jpeg
directory. The lip images are cropped to a 128x128 ROI.
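The numbers in the output directory name can be related to the 128x128 crop with a short sketch. This is an illustration only, not the code in scripts/prepare_lrw.py: it assumes the four numbers after `roi_` are two start bounds followed by two end bounds, and the helper names `parse_roi`, `roi_size`, and `crop` are hypothetical.

```python
import re


def parse_roi(dir_name):
    """Extract the four ROI numbers from a directory name such as
    'lrw_roi_63_99_191_227_size128_npy_gray_pkl_jpeg'."""
    match = re.search(r"roi_(\d+)_(\d+)_(\d+)_(\d+)", dir_name)
    if match is None:
        raise ValueError(f"no ROI bounds found in {dir_name!r}")
    return tuple(int(group) for group in match.groups())


def roi_size(bounds):
    """Size of the crop along both axes.

    Assumption: bounds are ordered (start_a, start_b, end_a, end_b).
    Which axis is rows and which is columns is not stated in the README.
    """
    start_a, start_b, end_a, end_b = bounds
    return (end_a - start_a, end_b - start_b)


def crop(frame, bounds):
    """Crop a frame stored as a list of rows, under the same assumption."""
    start_a, start_b, end_a, end_b = bounds
    return [row[start_b:end_b] for row in frame[start_a:end_a]]


bounds = parse_roi("lrw_roi_63_99_191_227_size128_npy_gray_pkl_jpeg")
print(roi_size(bounds))
```

Under this reading both axes span 128 pixels (191 - 63 and 227 - 99), which is consistent with the `size128` part of the directory name.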
We provide pretrained weights here (Google Drive).
bash train.sh
More training details and settings can be found in our paper.
We developed an iOS application that allows people to experience silent speech interaction on commodity smartphones. This application provides fully on-device, real-time lipreading, a visual keyword spotting (KWS) system for hands-free activation, and an online incremental learning scheme that learns continuously during use. The following diagram explains how it works:
User experience and interface design. (A) The interface of the initialization phase. The user first needs to record keyword and non-speaking samples to enable KWS activation. (B) The user says a command aloud for command registration. The voice signal will be leveraged to label the silent speech, allowing fast command registration (Voice2Lip). (C) The interface for querying the right label in the active learning mode. Users can slide through the existing commands sorted by similarity to select and add a new sample to the model. Users can update the model at any time by using the button at the upper-right corner, which usually takes around 2 seconds on an iPhone. (D) An example showing the command "play some music" is recognized correctly and executed successfully by the pre-set shortcut. (E) The interface for correcting the predictions in on-demand learning mode. The user can review recent utterances displayed as a GIF animation.
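The "commands sorted by similarity" step in (C) can be sketched in a few lines of Python. This is a minimal illustration and not the app's Swift implementation: it assumes each registered command keeps a list of embedding vectors from the lipreading encoder, and that similarity is the cosine similarity to the closest stored sample. The function names `cosine_similarity` and `rank_commands` and the toy 3-dimensional vectors are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_commands(query, samples):
    """Return (command, score) pairs sorted from most to least similar.

    samples maps a command label to the list of embeddings recorded for it.
    Each command is scored by its best-matching stored sample (an assumption).
    """
    scores = {
        command: max(cosine_similarity(query, emb) for emb in embeddings)
        for command, embeddings in samples.items()
        if embeddings
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Toy embeddings; real encoder outputs have many more dimensions.
samples = {
    "play some music": [[1.0, 0.0, 0.0]],
    "take a photo": [[0.0, 1.0, 0.0]],
}
query = [0.9, 0.1, 0.0]
ranked = rank_commands(query, samples)

# Once the user confirms the label, the new utterance is stored as a sample.
samples[ranked[0][0]].append(query)
```

Appending the confirmed utterance to the chosen command is what lets the sample set grow during use in this sketch; the classifier the app actually updates is described in the paper.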
To get started, download the Xcode project from the LipLearner
folder. The Core ML lipreading encoder model has been compressed into a .tar.gz file to stay within the file size limit on GitHub, so you will need to extract the weights on your computer before building the iOS app.
cd LipLearner_iOS/LipEncoder.mlpackage/Data/com.apple.CoreML/weights
tar -xzvf weight.bin.tar.gz
Please note that our testing has shown that the app works best on iPhone 11 or newer models. If you experience overheating or frequent crashes, we recommend turning off the camera view from the settings menu, as video rendering can be taxing on the CPU. On older iPhone models, inference may take longer than the sliding window length of our visual KWS function. In such cases, it's best to turn off the KWS function and use the recording button (long-press) to start recognition.
To avoid overheating, we added a silent speech activity detection (SSAD) function that works like the voice activity detection (VAD) function in speech recognition systems. It detects the keyword only when the user’s mouth is open. Note that this trick was not used in the user study in our paper.
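The gating idea behind SSAD can be sketched as follows. This is not the app's implementation: it assumes that lip landmark coordinates are available for each frame, that "mouth open" is judged by the ratio of lip gap to mouth width, and that 0.1 is a reasonable threshold. The function names and the threshold are hypothetical.

```python
def mouth_open_ratio(upper_lip_y, lower_lip_y, left_corner_x, right_corner_x):
    """Vertical lip gap divided by mouth width (hypothetical openness measure)."""
    width = abs(right_corner_x - left_corner_x)
    if width == 0:
        return 0.0
    return abs(lower_lip_y - upper_lip_y) / width


def should_run_kws(ratio, threshold=0.1):
    """Run keyword spotting only when the mouth looks open.

    The threshold is an illustrative value, not one taken from the app.
    """
    return ratio > threshold


# Per-frame landmarks: (upper_lip_y, lower_lip_y, left_corner_x, right_corner_x).
frames = [
    (100, 102, 50, 100),  # mouth nearly closed
    (100, 110, 50, 100),  # mouth open
    (100, 101, 50, 100),  # mouth nearly closed
]
kws_calls = sum(should_run_kws(mouth_open_ratio(*frame)) for frame in frames)
```

In this toy sequence only one of the three frames triggers keyword spotting, which shows how such a gate can reduce the number of encoder inferences.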
In the free use mode, you can use your silent speech commands to activate different functions. You need to create your own shortcuts, such as “play some music”, and register a silent speech command that matches the shortcut’s name exactly.
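One way to see why the names must match exactly: if a shortcut is launched by name, for example through the `shortcuts://run-shortcut?name=...` URL scheme of Apple's Shortcuts app, the recognized command text is used verbatim as the lookup key. Whether the app uses this particular mechanism is an assumption; the sketch below only shows how such a URL would be built, and `shortcut_url` is a hypothetical helper.

```python
from urllib.parse import quote


def shortcut_url(command):
    """Build a Shortcuts URL that runs the shortcut named exactly `command`.

    Assumption: shortcuts are triggered by name via the Shortcuts URL scheme.
    """
    return "shortcuts://run-shortcut?name=" + quote(command)


print(shortcut_url("play some music"))
```

A command registered as "play music" would produce a URL for a different shortcut name, so it would not trigger a shortcut called "play some music".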
We will provide a tutorial video to help you get started in the future.
If you find this codebase useful for your research, please consider citing our CHI 2023 paper and Feng et al.'s papers:
@inproceedings{10.1145/3544548.3581465,
author = {Su, Zixiong and Fang, Shitao and Rekimoto, Jun},
title = {LipLearner: Customizable Silent Speech Interactions on Mobile Devices},
year = {2023},
isbn = {9781450394215},
publisher = {Association for Computing Machinery},
address = {New York, NY, USA},
url = {https://doi.org/10.1145/3544548.3581465},
doi = {10.1145/3544548.3581465},
booktitle = {Proceedings of the 2023 CHI Conference on Human Factors in Computing Systems},
articleno = {696},
numpages = {21},
keywords = {Customization, Lipreading, Few-shot Learning, Silent Speech Interface},
location = {Hamburg, Germany},
series = {CHI '23}
}
@inproceedings{feng2021efficient,
title={An Efficient Software for Building LIP Reading Models Without Pains},
author={Feng, Dalu and Yang, Shuang and Shan, Shiguang},
booktitle={2021 IEEE International Conference on Multimedia \& Expo Workshops (ICMEW)},
pages={1--2},
year={2021},
organization={IEEE}
}
@article{feng2020learn,
author = "Feng, Dalu and Yang, Shuang and Shan, Shiguang and Chen, Xilin",
title = "Learn an Effective Lip Reading Model without Pains",
journal = "arXiv preprint arXiv:2011.07557",
year = "2020",
}
The MIT License