Primary language: C#. License: GNU General Public License v3.0 (GPL-3.0).

Victor - Voice Interactive Controller

Victor is a free, cross-platform, programmable voice control framework for desktops. It was started during my entry into the Mozilla Voice Challenge to test some ideas for an integrated open-source voice stack, and then served as the base client platform for Victor CX, the auditory CUI that was my entry into Red Hat's ReBoot Customer Experience Hackathon. The following videos demo and document some aspects of Victor (click the screenshot):

Victor Test 1

Victor Test 2

Architecture

Victor currently uses the following open-source projects:

Julius (ASR)

Julius is a high-speed, accurate and flexible LVCSR (large-vocabulary continuous speech recognition) library which can decode and recognize speech in real time using a variety of models built for different languages such as Japanese, English and Polish. Unlike other ASR libraries, including Facebook's wav2letter++, Julius is fully supported on Windows, and unlike Mozilla's own DeepSpeech, Julius can decode speech waveform input from a system mic device in real time with appropriate handling of silences and pauses. Julius is used in the Simon voice control program for KDE. Leslaw Pawlaczyk has created Julius models based on the Mozilla corpus and has modified Julius to support DNN-HMM models as well as GMM-HMM.

Julius can be built as a statically-linked binary and run as a sub-process of Victor. Victor communicates with Julius by monitoring its stdout stream and detecting the different states the program is in:

Victor Debug Mode
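The stdout monitoring described above can be sketched roughly as follows. This is a minimal illustration, not Victor's actual code: the `JuliusState` enum, the `JuliusMonitor` class and its methods are hypothetical names, and the marker strings (`<<< please speak >>>`, `pass1_best:`, `sentence1:`) are assumptions based on typical Julius console output that may differ between Julius versions and configurations.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

// Hypothetical states inferred from lines on Julius's stdout.
public enum JuliusState { Unknown, Listening, Pass1, Recognized }

public static class JuliusMonitor
{
    // Marker strings are assumptions based on typical Julius console output.
    private const string ListeningMarker = "<<< please speak >>>";
    private const string Pass1Marker = "pass1_best:";
    private const string SentenceMarker = "sentence1:";

    // Map one line of output to a state; unrecognized lines are Unknown.
    public static JuliusState Classify(string line)
    {
        if (line == null) return JuliusState.Unknown;
        string t = line.Trim();
        if (t.StartsWith(ListeningMarker, StringComparison.Ordinal)) return JuliusState.Listening;
        if (t.StartsWith(Pass1Marker, StringComparison.Ordinal)) return JuliusState.Pass1;
        if (t.StartsWith(SentenceMarker, StringComparison.Ordinal)) return JuliusState.Recognized;
        return JuliusState.Unknown;
    }

    // Strip the "sentence1:" prefix and the <s> ... </s> boundary tokens.
    public static string ExtractSentence(string line)
    {
        string text = line.Trim().Substring(SentenceMarker.Length);
        return text.Replace("<s>", "").Replace("</s>", "").Trim();
    }

    // Read lines until the stream ends, yielding only lines that signal a state.
    // Any TextReader works here, including Process.StandardOutput.
    public static IEnumerable<(JuliusState State, string Line)> Watch(TextReader stdout)
    {
        string line;
        while ((line = stdout.ReadLine()) != null)
        {
            JuliusState state = Classify(line);
            if (state != JuliusState.Unknown) yield return (state, line);
        }
    }
}
```

Because `Watch` accepts any `TextReader`, the same loop could be fed from the `StandardOutput` of a Julius sub-process started with redirected output, or from a `StringReader` holding captured log text when testing.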

The desired Julius configuration is specified in a plain text file and passed to the Julius executable as a startup argument. In this way Julius can be used by any program on any hardware or operating system platform that Julius supports. Julius's portability and real-time input recognition capabilities make it a good choice for the ASR component of an integrated voice stack.
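As a rough sketch of how such a launch might be set up from C#, the helper below builds the start-up settings for a Julius sub-process. The `JuliusLauncher` class, the executable name and the config file name are hypothetical; `-C` is Julius's option for loading a jconf configuration file, but whether Victor passes the file in exactly this way is an assumption.

```csharp
using System;
using System.Diagnostics;

public static class JuliusLauncher
{
    // Build start-up settings for running Julius as a child process.
    // juliusExe and jconfPath are supplied by the caller (hypothetical values below).
    public static ProcessStartInfo BuildStartInfo(string juliusExe, string jconfPath)
    {
        if (string.IsNullOrWhiteSpace(juliusExe)) throw new ArgumentException("Executable path required.", nameof(juliusExe));
        if (string.IsNullOrWhiteSpace(jconfPath)) throw new ArgumentException("Config path required.", nameof(jconfPath));

        return new ProcessStartInfo
        {
            FileName = juliusExe,
            // Quote the path so configuration files in directories with spaces still load.
            Arguments = "-C \"" + jconfPath + "\"",
            RedirectStandardOutput = true, // lets the parent read Julius's stdout
            UseShellExecute = false,       // required for stream redirection
            CreateNoWindow = true
        };
    }
}
```

A caller could then run something like `Process.Start(JuliusLauncher.BuildStartInfo("julius", "victor.jconf"))` and read the resulting process's `StandardOutput`. Keeping the configuration in a file means the C# side only needs to know one path, not the individual Julius options.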

SnipsNLU (NLU)

Snips NLU is a high-speed, accurate open-source NLU inference engine which can recognize intents and entities in utterances for a particular domain in real time. It is written in Rust and has an FFI allowing it to be used by any language that can call C libraries. Victor interfaces with the Snips NLU engine using its C FFI, e.g. in C# calling a SnipsNLU function in a native DLL looks like:

    // Requires: using System; using System.Runtime.InteropServices;
    [DllImport("snips_nlu_ffi", CallingConvention = CallingConvention.Cdecl, CharSet = CharSet.Ansi)]
    internal static extern SNIPS_RESULT snips_nlu_engine_create_from_dir(
        [In, MarshalAs(UnmanagedType.LPStr)] string root_dir, [In, Out] ref IntPtr client);

Abstractions over the lower-level Snips functions are built up so that other code does not have to manage the details of calling the library. This is the standard procedure used for Snips bindings to other languages like Python. The ability to interface with the Snips library directly removes the need for an intermediate Python interpreter or REST API, which makes SnipsNLU a good choice for the NLU component of an integrated voice stack.
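One way such an abstraction might look is sketched below, covering only the engine's lifecycle (create and release). It is not Victor's actual wrapper: `SnipsEngine`, `SnipsNative`, `CreateEngine` and `DestroyEngine` are hypothetical names, the `SNIPS_RESULT` values are assumed to mirror the Snips C header, and `snips_nlu_engine_destroy_client` is assumed to be the matching release function in the Snips FFI. The native calls are passed in as delegates so the wrapper can be exercised without the native DLL.

```csharp
using System;
using System.Runtime.InteropServices;

// Values assumed to mirror the SNIPS_RESULT enum in the Snips C header.
public enum SNIPS_RESULT { SNIPS_RESULT_OK = 0, SNIPS_RESULT_KO = 1 }

// Hypothetical delegate types standing in for the native entry points.
public delegate SNIPS_RESULT CreateEngine(string rootDir, ref IntPtr client);
public delegate SNIPS_RESULT DestroyEngine(IntPtr client);

internal static class SnipsNative
{
    [DllImport("snips_nlu_ffi", CallingConvention = CallingConvention.Cdecl, CharSet = CharSet.Ansi)]
    internal static extern SNIPS_RESULT snips_nlu_engine_create_from_dir(
        [In, MarshalAs(UnmanagedType.LPStr)] string root_dir, [In, Out] ref IntPtr client);

    // Assumed release function; check the FFI header of the Snips version in use.
    [DllImport("snips_nlu_ffi", CallingConvention = CallingConvention.Cdecl)]
    internal static extern SNIPS_RESULT snips_nlu_engine_destroy_client(IntPtr client);
}

// Owns the native engine pointer so callers never handle an IntPtr directly.
public sealed class SnipsEngine : IDisposable
{
    private readonly DestroyEngine _destroy;
    private IntPtr _client = IntPtr.Zero;

    public SnipsEngine(string rootDir, CreateEngine create, DestroyEngine destroy)
    {
        if (create == null) throw new ArgumentNullException(nameof(create));
        _destroy = destroy ?? throw new ArgumentNullException(nameof(destroy));
        SNIPS_RESULT result = create(rootDir, ref _client);
        if (result != SNIPS_RESULT.SNIPS_RESULT_OK || _client == IntPtr.Zero)
            throw new InvalidOperationException("Could not load Snips engine from '" + rootDir + "'.");
    }

    // Convenience factory bound to the real native library.
    public static SnipsEngine Load(string rootDir) =>
        new SnipsEngine(rootDir,
            SnipsNative.snips_nlu_engine_create_from_dir,
            SnipsNative.snips_nlu_engine_destroy_client);

    public bool IsLoaded => _client != IntPtr.Zero;

    // Release the native engine exactly once, even if Dispose is called repeatedly.
    public void Dispose()
    {
        if (_client == IntPtr.Zero) return;
        _destroy(_client);
        _client = IntPtr.Zero;
    }
}
```

With this shape, calling code would write `using (var engine = SnipsEngine.Load(modelDir)) { ... }` and never touch `IntPtr` or result codes itself; parsing functions could be added to the same class following the same pattern.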

Mimic (TTS)

Victor can use the Mimic TTS engine, but it is generally better to rely on the operating system's narrator or TTS capabilities, or on the user's installed screen reader.