This Virtual Machine implements a system that can transcribe and/or subtitle nearly any audio or video file. It separates speech from non-speech, identifies individual speakers, and converts speech to text. It supports a wide range of input and output formats. If you give it a full corpus of files to transcribe, it lets you browse and search the results.
If the results are not good, it is easy to update and change several parts of the system, for example the segmentation parameters (too few words in the output? too much noise?) or the language model (poor recognition quality).
Internally, the system uses EESEN RNN-based decoding, with acoustic models trained on the TED-LIUM corpus and the Cantab-TEDLIUM language model from Cantab Research. In addition, it includes an adapted version of Tanel Alumae's Kaldi Offline Transcriber, which accepts nearly any audio/video format and produces transcriptions as subtitles, plain text, and more. Speaker diarization is performed with the LIUM toolkit. Lastly, the VM provides a web-based video browser in which transcriptions appear as video subtitles and are searchable by keyword across videos.
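For orientation, the overall pipeline can be sketched roughly as below. This is a simplified illustration only: the script names (`diarize.sh`, `decode.sh`, `ctm2srt.sh`) and file layout are placeholders, not the exact scripts shipped in the VM, which drives these stages through the transcriber's own tooling.

```python
# Simplified sketch of the pipeline stages (hypothetical script names).
import subprocess
from pathlib import Path

def transcribe(video: Path, workdir: Path) -> Path:
    workdir.mkdir(parents=True, exist_ok=True)
    wav = workdir / "audio.wav"

    # 1. Extract and resample the audio track with FFmpeg (mono, 16 kHz).
    subprocess.run(
        ["ffmpeg", "-y", "-i", str(video), "-ac", "1", "-ar", "16000", str(wav)],
        check=True,
    )

    # 2. Speech/non-speech segmentation and speaker diarization (LIUM).
    #    "diarize.sh" stands in for the LIUM diarization step.
    segments = workdir / "segments.seg"
    subprocess.run(["diarize.sh", str(wav), str(segments)], check=True)

    # 3. EESEN RNN-based decoding of the speech segments
    #    ("decode.sh" is likewise a placeholder).
    hyp = workdir / "hypotheses.ctm"
    subprocess.run(["decode.sh", str(wav), str(segments), str(hyp)], check=True)

    # 4. Turn the time-stamped hypotheses into subtitles, plain text, etc.
    srt = workdir / (video.stem + ".srt")
    subprocess.run(["ctm2srt.sh", str(hyp), str(srt)], check=True)
    return srt
```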
This VM runs either locally with Vagrant/VirtualBox or remotely as an Amazon Machine Image on AWS.
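As a rough illustration of the local workflow, the sketch below drives the standard Vagrant CLI from Python. The command run inside the guest (`speech2text.sh`) and the `/vagrant/...` media path are placeholders; substitute the entry point and synced-folder path that your copy of the VM actually provides.

```python
# Minimal local bring-up sketch using the Vagrant CLI.
import subprocess

def run_local(media_path_in_vm: str) -> None:
    # Boot (or resume) the VirtualBox VM defined by the project's Vagrantfile.
    subprocess.run(["vagrant", "up"], check=True)

    # Run a command inside the guest over SSH.
    # "speech2text.sh" is a placeholder for the VM's transcription entry point.
    subprocess.run(
        ["vagrant", "ssh", "-c", f"speech2text.sh {media_path_in_vm}"],
        check=True,
    )

run_local("/vagrant/example.mp4")
```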
The Eesen software has been released under the Apache 2.0 license. The LIUM speaker diarization toolkit is GPL, but LIUM offers to work with you if that license is too strict (see the LIUM license). The Eesen transcriber uses and extends the Kaldi Offline Transcriber, which has been released under a very liberal license (see the Kaldi Offline Transcriber license). The transcriber also uses various other tools, such as ATLAS, SoX, and FFmpeg, that are distributed as Ubuntu packages. Some of these have their own licenses; if one of them poses a problem, it would not be hard to remove it specifically.
Thanks to NVIDIA for their Academic GPU Grant.