CAVEVOC

(C) 2013 by Jason Leigh, Electronic Visualization Laboratory, University of Illinois at Chicago

CAVEVOC is a means to get speech recognition into the CAVE. CAVEVOC has two components: the CAVEVOC client and the CAVEVOC Python module. The CAVEVOC client runs on a PC, records audio samples, and sends them to Google for translation. The translated text and its corresponding confidence level are then transmitted to the CAVE for application use. On the CAVE, the CAVEVOC Python module reads this data and applies it to a user-defined callback function within your program.
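
To give a feel for the data flow, here is a minimal sketch of the receiving side written as a plain UDP listener: it parses a message carrying the confidence level and the recognized text, then hands both to a user-defined callback. The port number and the "confidence|text" message layout are assumptions made for this example; in practice cavevoc.py handles the wire format for you, and your application only supplies the callback.

    # Conceptual sketch only -- the real wire format is handled by cavevoc.py.
    # The port (7000) and the "confidence|text" message layout are assumptions.
    import socket

    def on_speech(confidence, text):
        # User-defined callback: react to recognized speech here.
        print("heard (confidence %.2f): %s" % (confidence, text))

    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind(("", 7000))  # assumed port; must match the client's setting

    while True:
        data, addr = sock.recvfrom(4096)
        confidence, text = data.decode().split("|", 1)  # assumed format
        on_speech(float(confidence), text)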

Trivia: CAVEVOC was the name given to the first speech recognition system I developed for the old CAVE in the mid-90s for CALVIN [Video]. At the time, IBM's speech engine was used. Arguably, this new CAVEVOC is much more accurate, using Google's crowd-sourced recognition engine.

You will need to download the following to use CAVEVOC:

  1. Processing
  2. CAVEVOC-Processing.zip - The client zip already bundles the STT (Speech To Text) translation library and the UDP (networking) library, but you can also download them separately from their original sources via the respective links. These library folders need to be installed in your Processing user library folder.
  3. CAVEVOC Python Module - cavevoc.py - This is the python module you will use to incorporate speech recognition into your CAVE application.
  4. CAVEVOC CAVE Demo Application - demo.py - This is a simple CAVE application to show you how to use the CAVEVOC Python API.

Simple Example

This demo simply takes any recognized speech from CAVEVOC and prints it on the screen in the CAVE, including the confidence level reported by Google.

  1. Install Items 1 and 2 on the PC that will access the microphone.
  2. Install Items 3 and 4 in the location where you normally install your CAVE applications.
  3. First, launch the CAVE demo application, e.g.: orun -s demo.py
    The environment should just be gray and blank until recognized text is received. Whenever recognized text is received, the demo displays the confidence level (reported by Google) and the text at a random position on the screen in front of you.

  4. Launch Processing and open CAVEVOC-PTT.pde.
  5. You will first need to edit the IP address in the code to reflect the IP address of the CAVE (e.g. lyra.evl.uic.edu).
  6. Now RUN the CAVEVOC-PTT.pde application.

  7. Hold down any key on your keyboard (like the SPACEBAR) and start talking.
  8. Release the key when you are done talking.
  9. Audio recording begins when you hold the key down; when you release it, the audio sample is transmitted to Google for translation.
  10. Once the translation is received, you should see feedback in the CAVEVOC-PTT window.
  11. That text is also sent to the CAVE application, where it should show up in the CAVE.

Note: There is also a CAVEVOC-Auto.pde file containing a version of the CAVEVOC client that continually listens for audio without requiring you to hold a key down. Both the CAVEVOC-PTT and CAVEVOC-Auto sketches are kept to a minimum so you can further customize them for your needs.

Second Example

The second demo (called Ideation) lets you create boxes and spheres, color them, and move them, all via voice command. To select an object, simply turn your head toward it.
Launch the program using: orun -s ideation.py

Ideation is an example that shows how you can use Pyparsing to parse incoming voice commands. With Pyparsing you can very quickly develop a parser for very complex grammars. The demo includes the module pyparsing.py, so you don't need to download and install Pyparsing yourself. A minimal parsing sketch is shown after the command list below.

The following are example voice commands:

  • MAKE | BUILD | CREATE A BOX | CUBE - creates a 1-foot cube
  • MAKE | BUILD | CREATE A SPHERE | BALL - creates a 1-foot-diameter sphere
  • PAINT | COLOR | MAKE IT RED | GREEN | BLUE | ... - colors the object your head is pointing at red. Other colors include green, blue, magenta, orange, yellow, black... you get the idea.
  • NAME IT | THIS JASON - gives the object a name (here, JASON)
  • PLACE JASON HERE - after navigating through the space, saying this takes the object named JASON and places it in front of you
  • AGAIN | REPEAT - performs the last command again; e.g. if you last said MAKE A BOX, saying AGAIN or REPEAT makes a second box
  • MAKE A BOX AND PAINT IT RED AND PLACE JASON HERE - commands can be chained together with the AND operator

To see the full extent of the grammar and how it is used to activate parts of your code, you will need to read the ideation.py code.
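
As a taste of the approach, below is a minimal Pyparsing sketch that handles a few of the commands above. It is illustrative only; the grammar fragments and the ('action', value) tuples are not taken from ideation.py:

    from pyparsing import (CaselessKeyword, Optional, Word, alphas,
                           delimitedList, oneOf)

    # Synonymous verbs collapse into single alternatives.
    make    = oneOf("MAKE BUILD CREATE", caseless=True)
    paint   = oneOf("PAINT COLOR MAKE", caseless=True)
    article = Optional(CaselessKeyword("A") | CaselessKeyword("AN"))
    it      = CaselessKeyword("IT") | CaselessKeyword("THIS")

    shape = oneOf("BOX CUBE SPHERE BALL", caseless=True)("shape")
    color = oneOf("RED GREEN BLUE MAGENTA ORANGE YELLOW BLACK", caseless=True)("color")

    # Each alternative tags its result so the application can dispatch on it.
    make_cmd  = (make + article + shape).setParseAction(
        lambda t: [("make", t.shape)])
    paint_cmd = (paint + CaselessKeyword("IT") + color).setParseAction(
        lambda t: [("paint", t.color)])
    name_cmd  = (CaselessKeyword("NAME") + it + Word(alphas)("name")).setParseAction(
        lambda t: [("name", t.name)])

    command  = make_cmd | paint_cmd | name_cmd
    sentence = delimitedList(command, delim=CaselessKeyword("AND"))  # AND chains commands

    print(sentence.parseString("MAKE A BOX AND PAINT IT RED AND NAME IT JASON"))
    # -> [('make', 'BOX'), ('paint', 'RED'), ('name', 'JASON')]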

Some General Tips for Effective Use of Speech Recognition

  • If possible, get a wireless microphone with a push-to-talk button; it will help cut out unnecessary attempts at translation. Even better, find a wireless Bluetooth microphone. For testing purposes, you can just start with your laptop's built-in microphone.
  • Use speech recognition for interactions where the physical manipulation takes more time or dexterity than saying the phrase. For example, it may be challenging to accurately position a CAVE object in space using physical interactions, but it is easy to say: Move object to 2.5 3 5.5, or Rotate 23.5 degrees along X axis, or Make a box 1 by 2.5 by 3 meters and put it at 3 4 5.
  • Give the user feedback whether or not text is recognized, for example with an audible cue. You've seen sci-fi movies like Iron Man; use your imagination!
  • It is sometimes helpful to prefix every command with a name, e.g. "Jarvis, move the object". This prefix lets the system know when you are talking to the CAVE rather than to someone else in the audience. Another approach is to create a virtual character/avatar that only interprets your commands when you are facing it.
  • Consider creating a grammar to describe your commands and then using something like Pyparsing to implement the grammar parser.
  • If you want to be more advanced, you can also use the Natural Language Toolkit (NLTK). The main advantage of natural language processing techniques is that you can minimize the need for the user to remember a specific grammar.
  • But if you don't have much experience with NLP, keep the number of speech utterances low to minimize the need to remember them. In any case, it may be helpful to provide a "dropdown" cheat sheet in the CAVE to help the user remember the commands. Better still, create a cheat sheet that unfolds to show the next word in a phrase that can be spoken; a good example of this is used in the video game End War.
  • Lastly, consider implementing a state-machine-based conversation engine so that follow-up commands are possible (a sketch follows the dialogue below):
    • User: Computer, make me a cube
    • Computer: Where would you like me to put it?
    • User: Put it at 5 5 5
    • Computer: How big would you like it?
    • User: How about 3 by 5 by 2 meters
    • Computer: Coming right up. Cube at 5 5 5 of size 3 by 5 by 2 meters.
    • User: Take me to the other side of the cube.
    • etc....
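
A sketch of how such an engine might be structured is below. Everything here (the class name, the callback, the crude number extraction) is illustrative and not part of the CAVEVOC API; each dialogue is a small state machine that consumes recognized text and returns the system's next spoken prompt:

    class CubeDialog:
        """Walks the user through creating a cube via follow-up questions."""

        def __init__(self):
            self.state = "idle"
            self.position = None
            self.size = None

        def numbers(self, words):
            # Crude number extraction; a real grammar (e.g. Pyparsing) is better.
            return [w for w in words if w.replace(".", "").isdigit()]

        def on_speech(self, text):
            # Feed recognized text in; get back the next prompt (or None).
            words = text.lower().split()
            if self.state == "idle" and "cube" in words:
                self.state = "await_position"
                return "Where would you like me to put it?"
            if self.state == "await_position":
                self.position = self.numbers(words)
                self.state = "await_size"
                return "How big would you like it?"
            if self.state == "await_size":
                self.size = self.numbers(words)
                self.state = "idle"
                return "Coming right up. Cube at %s of size %s meters." % (
                    " ".join(self.position), " by ".join(self.size))
            return None

    dialog = CubeDialog()
    print(dialog.on_speech("Computer, make me a cube"))      # asks for a position
    print(dialog.on_speech("Put it at 5 5 5"))               # asks for a size
    print(dialog.on_speech("How about 3 by 5 by 2 meters"))  # confirms the cube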

Release Notes

  • 7/26/2013 - Revised to include example using Pyparsing as the command parser.
  • 7/20/2013 - First version released.