
See and Tell

AI-driven Assistant to Experience Visual Content

🎯 Goal

Our service aims to make visual content more accessible for individuals with visual impairments. We provide detailed audio descriptions of movies, TV shows, images, and more, allowing visually impaired users to fully experience and enjoy these media. Additionally, our solution caters to situations where active viewing is not possible, like when driving, providing an immersive audio experience instead. Our mission is to promote inclusivity and ensure that everyone, regardless of their visual abilities, can engage with and appreciate visual content.

💻 Service

Our service operates through a streamlined pipeline consisting of five essential components. First, the Describe component uses an image-to-text model to generate textual descriptions of what is happening on screen. Next, the Listen component identifies dialogue moments in the video so that audio descriptions do not overlap with speech. The Recognize component employs face detection to identify characters, enriching captions with their names. The Say component uses text-to-speech technology to voice the generated captions. Finally, the Mixer component combines the voiced captions with the original video, producing a final video in which the audio descriptions blend seamlessly with the visual content.
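As an illustration, the overall flow can be pictured as a small sequential pipeline. The sketch below is illustrative only; the function names and data shapes are hypothetical and do not mirror the actual cntell API:

# A minimal, self-contained sketch of the five-stage flow.
# All names here are illustrative, not the actual cntell API.
from dataclasses import dataclass

@dataclass
class Caption:
    time: float  # timestamp of the described frame, in seconds
    text: str

def describe(timestamps):
    # Describe: stand-in for the image-to-text model
    return [Caption(t, "a placeholder caption") for t in timestamps]

def listen(video_path):
    # Listen: stand-in returning (start, end) dialogue intervals
    return [(2.0, 5.5), (12.0, 18.0)]

def overlaps(caption, spans):
    return any(start <= caption.time <= end for start, end in spans)

def run(video_path, duration=30):
    captions = describe(range(duration))  # one frame per second
    dialogue = listen(video_path)
    # Recognize, Say, and Mixer would follow; the key scheduling idea
    # is to keep only captions that do not collide with dialogue:
    return [c for c in captions if not overlaps(c, dialogue)]

print(len(run("episode.mp4")))  # number of caption slots left to voice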

Most of the components build on Hugging Face models, such as SpeechT5, GIT, and an audio segmentation model, together with the Stanza NLP pipeline from Stanford and Facenet as a face recognition library.
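For example, the captioning and voicing steps can be reproduced with the transformers library. The checkpoint names below ('microsoft/git-large-coco', 'microsoft/speecht5_tts') are the standard public ones and are an assumption; the service may pin different checkpoints:

# Captioning a frame with GIT and voicing text with SpeechT5.
# Checkpoint names are assumed public defaults, not necessarily
# the ones the service uses.
import torch
from PIL import Image
from transformers import (
    AutoProcessor, AutoModelForCausalLM,
    SpeechT5Processor, SpeechT5ForTextToSpeech, SpeechT5HifiGan,
)

# Describe: image -> caption
git_processor = AutoProcessor.from_pretrained("microsoft/git-large-coco")
git_model = AutoModelForCausalLM.from_pretrained("microsoft/git-large-coco")
pixel_values = git_processor(images=Image.open("frame.png"), return_tensors="pt").pixel_values
ids = git_model.generate(pixel_values=pixel_values, max_length=50)
caption = git_processor.batch_decode(ids, skip_special_tokens=True)[0]

# Say: caption -> waveform
tts_processor = SpeechT5Processor.from_pretrained("microsoft/speecht5_tts")
tts_model = SpeechT5ForTextToSpeech.from_pretrained("microsoft/speecht5_tts")
vocoder = SpeechT5HifiGan.from_pretrained("microsoft/speecht5_hifigan")
inputs = tts_processor(text=caption, return_tensors="pt")
# SpeechT5 expects a 512-dim speaker embedding; a zero vector is enough
# for a smoke test (real use would load an x-vector, e.g. from the
# cmu-arctic-xvectors dataset).
speaker = torch.zeros((1, 512))
speech = tts_model.generate_speech(inputs["input_ids"], speaker, vocoder=vocoder)
print(caption, speech.shape)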

🚀 Demo

To provide a demo of our work, we took a 30-second fragment from The Big Bang Theory TV series and processed it.

The work done so far can be summarized as follows:

  • [+] Audio descriptions are embedded without interrupting dialogue
  • [+] The voice is clear and pleasant
  • [+] Captions are mostly correct and descriptive
  • [+] Characters are recognized correctly in most cases

However, after watching the demo you might notice the following issues:

  • [!] Leonard was recognized as Sheldon because Sheldon's face was more visible, even though the central figure in the frame was Leonard. As a result, the service produced the caption 'Sheldon points to a brick wall' when it was actually Leonard who pointed.
  • [!] At the very end, the model describes the scene as 'A man in green jacket and red shirt ...', but the video is paused on a different frame at that moment. The reason is that we describe one frame per second, while the source video's frame rate is not an integer, so the sampled frames do not align exactly with whole seconds. The frame actually described as 'A man in ...' immediately follows the one the video was paused on (see the sampling sketch below).
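To make this off-by-one-frame effect concrete, here is a small sampling sketch. OpenCV (cv2) is our choice for illustration; the service's actual frame extraction may differ:

# One-frame-per-second sampling with OpenCV. With a non-integer frame
# rate (e.g. 23.976 fps), round(second * fps) can land on the frame
# *after* the one a player shows when paused at that second -- which is
# exactly the mismatch visible at the end of the demo.
import cv2

cap = cv2.VideoCapture("demo.mp4")
fps = cap.get(cv2.CAP_PROP_FPS)
total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
for second in range(int(total / fps)):
    index = round(second * fps)  # nearest frame to this second
    cap.set(cv2.CAP_PROP_POS_FRAMES, index)
    ok, frame = cap.read()
    if ok:
        pass  # hand `frame` to the Describe component
cap.release()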

‼️ Limitations

  • The current approach to caption generation is based on statistical ideas applied on top of captions generated by image-to-text models, combined with face recognition (see the sketch after this list). The latter has its limitations, mainly with frames where faces are not particularly visible. Also, to have enough statistical grounds, we are forced to describe a large number of frames, which is computationally expensive.
  • More sophisticated techniques (such as video-level analysis or a deeper dive into image-to-text models) could produce higher-quality, more personalized captions.
  • The face recognition module was built to be extendable to any series or TV show. Nevertheless, the current implementation still requires providing data for the actors of the series in question.
  • A great addition would be an automatic mechanism that retrieves high-quality data for any given series from the web.
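As a deliberately simplified illustration of the statistical aggregation mentioned above, one could pick a scene caption by majority vote over per-frame captions; the aggregation the service actually performs is likely more involved:

# Illustrative only: choose one caption per scene by majority vote
# over the captions produced for its frames.
from collections import Counter

def scene_caption(frame_captions):
    normalized = [c.strip().lower() for c in frame_captions]
    return Counter(normalized).most_common(1)[0][0]

print(scene_caption([
    "a man points to a brick wall",
    "a man points to a brick wall",
    "two men stand in a living room",
]))  # -> "a man points to a brick wall"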

🔎 Implementation

For a deeper dive into the code, please refer to the documentation for each module.

🛠️ Reproduce

To reproduce and run the service locally, you are encouraged to use Docker:

git clone https://github.com/teexone/see-and-tell/ seeandtell
cd seeandtell
docker build -t seeandtell .
docker run --rm -v /path/to/video/folder:/video seeandtell python -m cntell --help

Install

The library can also be installed as a Python package and CLI tool using the following command:

pip install https://github.com/teexone/see-and-tell/archive/refs/heads/main.zip

License

This work is licensed under the Creative Commons Attribution-NonCommercial 4.0 International License. To view a copy of this license, visit http://creativecommons.org/licenses/by-nc/4.0/ or send a letter to Creative Commons, PO Box 1866, Mountain View, CA 94042, USA.