
Video and Audio Speech-to-Text using Google Cloud



This plug-in uses Google Cloud Speech-to-Text version v1p1beta1 to transcribe an audio file to text. It calls the service through the REST API (not the Java Google Speech-to-Text SDK).


  1. Please read the Support part, below.
  2. The plugin is an example showing how to connect to a Google Cloud service. It is not meant to be used with audio files lasting dozens of minutes or more.
  3. So, there are known limitations:
  • The call is synchronous
  • Audio files can last at most 60 seconds
    • If the file is too big or too long, Google returns an error: "Sync input too long. For audio longer than 1 min use LongRunningRecognize with a 'uri' parameter."
  • Please read Google's best practices for Speech to Text API to check what is supported. For example, mp3 files are not supported and must be converted, ideally to FLAC.

WARNING: Using Google beta version of Speech to Text

  • In this implementation, the plugin uses Google Speech to Text API in its BETA VERSION.
  • Google makes it clear that billing for some beta API features may change, for example access to punctuated text. See the quota documentation.

Authentication to Google Cloud Service

As of today, the plug-in only uses an API key (not a Service Account file). To set up the credentials, the plugin looks:

  1. First for a google.speechtotext.apikey parameter in nuxeo.conf
  2. If not found there, it checks for an environment variable named GOOGLE_SPEECHTOTEXT_APIKEY

The same applies to unit testing: set the GOOGLE_SPEECHTOTEXT_APIKEY environment variable in your terminal before running the unit tests (either via Maven or Eclipse/IntelliJ).
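
The lookup order described above can be sketched as follows. This is a minimal illustration, not the plugin's actual code: the ApiKeyLookup class is hypothetical, a plain Properties object stands in for nuxeo.conf, and treating a blank value as "not found" is an assumption.

```java
import java.util.Map;
import java.util.Optional;
import java.util.Properties;

/** Hypothetical helper (not the plugin's class) illustrating the lookup order. */
final class ApiKeyLookup {

    static final String CONF_KEY = "google.speechtotext.apikey";
    static final String ENV_KEY = "GOOGLE_SPEECHTOTEXT_APIKEY";

    /**
     * Returns the API key from the configuration if present, else from the
     * environment, else an empty Optional.
     * Assumption: a blank value is treated the same as a missing one.
     */
    static Optional<String> resolve(Properties conf, Map<String, String> env) {
        String key = conf.getProperty(CONF_KEY);
        if (key == null || key.trim().isEmpty()) {
            // Not found in the configuration: fall back to the environment variable
            key = env.get(ENV_KEY);
        }
        if (key == null || key.trim().isEmpty()) {
            return Optional.empty();
        }
        return Optional.of(key.trim());
    }

    public static void main(String[] args) {
        Properties conf = new Properties(); // stands in for nuxeo.conf in this sketch
        Optional<String> key = resolve(conf, System.getenv());
        System.out.println(key.isPresent() ? "API key found" : "No API key configured");
    }
}
```

Passing the configuration and the environment as arguments keeps the sketch easy to test without touching the real environment.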


Please read Google's best practices for Speech to Text API (for example, mp3 files are not supported and must be converted, ideally to FLAC).

The plugin exposes an API that lets you:

  • Automatically convert the audio file to FLAC before sending it to the service (the plugin contributes a commandLine-based converter)
  • Or specify the encoding and sample rate (in Hz) of the input file, in which case the plugin performs no conversion

The plugin does not automatically convert Audio (or Video) files to text. You need to add listeners, buttons, etc. that call one of the following operations:


  • Category: Conversion
  • Input: A single Blob
  • Output: The same Blob, unchanged
  • Runs the speech-to-text service and sets the result in a Context variable whose name is passed as a parameter. This variable holds the SpeechToTextResponse Java object built from the service response, and its methods can be called:
    • getText() returns the first transcript, with or without punctuation (depending on the parameters)
    • getWordTimeOffsets() returns a JSON array of objects. Each object has "word", "start" and "end" fields, where "start" and "end" are expressed in seconds. (Google can also return nanoseconds; the plugin keeps it simple and returns only seconds.)
    • getNativeResponse() returns the native response encapsulated in a JSONObject. In the current implementation, the plugin uses only REST to call the service. The result is described in the Google documentation.
  • Parameters:
    • languageCode (String, required): The language code of the audio file (see Google documentation for supported languages)
    • audioEncoding (String): The audio encoding as String. See Google documentation for supported encodings.
      • See Google's enumeration. As of 2018-11-03, we have: "FLAC", "LINEAR16", "MULAW", "AMR", "AMR_WB", "OGG_OPUS", and "SPEEX_WITH_HEADER_BYTE".
      • Optional. If the audio file is FLAC or WAV, the parameter is not required.
    • sampleRateHertz (integer): The sample rate of the audio file, in Hz. Optional. If the audio file is FLAC or WAV, the parameter is not required.
    • withPunctuation: A boolean, optional, default value is true. If false, the text will be returned with no punctuation.
    • withWordTimeOffets: A boolean, optional, default value is false. If true, getWordTimeOffsets() will return a JSON array of objects, each object having the word and its start/end time (in seconds). This array is available in the response stored in the resultVarName variable.
    • moreOptionsJSONStr (String, optional): Additional configuration parameters to send to the service. The plug-in does not encapsulate and handle each and every feature of the provider. Passing more parameters is a way to get results the plug-in does not fetch by default. See the provider REST API documentation (for the current version, see above).
    • resultVarName (String, required): The Name of a Context Variable that will contain the SpeechToTextResponse object (see above)
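
As an illustration of the word time offsets, Google's REST response expresses each word's startTime and endTime as duration strings such as "1.300s". The sketch below shows one way such values could be turned into the "word"/"start"/"end" shape described above. It is not the plugin's code: the WordOffsetSeconds class and its methods are hypothetical, and keeping fractional seconds (rather than rounding to whole seconds) is an assumption.

```java
import java.util.Locale;

/** Hypothetical helper, not the plugin's code: converts REST duration strings to seconds. */
final class WordOffsetSeconds {

    /** Parses a value such as "1.300s" (JSON form of a Duration) into seconds. */
    static double toSeconds(String duration) {
        if (duration == null || !duration.endsWith("s")) {
            throw new IllegalArgumentException("Expected a value like \"1.300s\", got: " + duration);
        }
        // Drop the trailing "s" and parse the remaining decimal number
        return Double.parseDouble(duration.substring(0, duration.length() - 1));
    }

    /**
     * Builds one entry with "word", "start" and "end" fields.
     * Assumption: the word needs no JSON escaping (fine for a sketch, not for production).
     */
    static String toEntry(String word, String startTime, String endTime) {
        return String.format(Locale.ROOT, "{\"word\":\"%s\",\"start\":%.3f,\"end\":%.3f}",
                word, toSeconds(startTime), toSeconds(endTime));
    }

    public static void main(String[] args) {
        // prints {"word":"hello","start":1.300,"end":1.700}
        System.out.println(toEntry("hello", "1.300s", "1.700s"));
    }
}
```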

Reminder: Before calling this operation, you can, if needed, convert the audio (or video) to FLAC using the "audio-to-flac" commandLine converter provided by the plugin.
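
For reference, here is a minimal sketch of the kind of JSON body a synchronous v1p1beta1 speech:recognize call expects. The field names (languageCode, encoding, sampleRateHertz, enableAutomaticPunctuation, enableWordTimeOffsets, audio.content) come from Google's REST API. The RecognizeRequestSketch class is hypothetical, and the way the operation parameters are mapped onto these fields is an assumption about the plugin, not a description of its implementation.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

/** Hypothetical sketch, not the plugin's code: builds a speech:recognize request body. */
final class RecognizeRequestSketch {

    static final String ENDPOINT = "https://speech.googleapis.com/v1p1beta1/speech:recognize";

    /**
     * Builds the JSON body by hand to stay dependency-free.
     * Assumption: languageCode and encoding contain no characters needing JSON escaping.
     * encoding and sampleRateHertz may be null (e.g. for FLAC or WAV input) and are then omitted.
     */
    static String buildBody(byte[] audio, String languageCode, String encoding,
            Integer sampleRateHertz, boolean withPunctuation, boolean withWordTimeOffsets) {
        StringBuilder config = new StringBuilder();
        config.append("\"languageCode\":\"").append(languageCode).append('"');
        if (encoding != null) {
            config.append(",\"encoding\":\"").append(encoding).append('"');
        }
        if (sampleRateHertz != null) {
            config.append(",\"sampleRateHertz\":").append(sampleRateHertz);
        }
        config.append(",\"enableAutomaticPunctuation\":").append(withPunctuation);
        config.append(",\"enableWordTimeOffsets\":").append(withWordTimeOffsets);
        // The audio bytes are sent inline, Base64-encoded
        String content = Base64.getEncoder().encodeToString(audio);
        return "{\"config\":{" + config + "},\"audio\":{\"content\":\"" + content + "\"}}";
    }

    public static void main(String[] args) {
        byte[] fakeAudio = "not real audio".getBytes(StandardCharsets.UTF_8);
        // The API key is passed as the "key" query parameter
        System.out.println("POST " + ENDPOINT + "?key=<your API key>");
        System.out.println(buildBody(fakeAudio, "en-US", "FLAC", null, true, false));
    }
}
```

The sketch only builds the request; sending it and handling errors such as "Sync input too long" are left out.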


Converts a blob of the input document and saves the transcript to a field of the input document. The file is automatically converted to FLAC if needed, before being sent to the service.

  • Category: Conversion
  • Input: A Document
  • Output: The modified Document
  • Extracts the blob stored at the blobXpath parameter (default "file:content"), transcribes it using the required languageCode parameter, and stores the transcription in the transcriptXpath field of the Document. Always uses punctuation. See below for more details on parameters.
  • Parameters:
    • languageCode (String, required): The language code of the audio file (see Google documentation for supported languages)
    • blobXpath: Source blob to convert
    • transcriptXpath: Destination String field to store the result of the transcript
    • withPunctuation: A boolean, optional, default value is true. If false, the text will be returned with no punctuation.
    • withWordTimeOffets: A boolean, optional, default value is false. If true, getWordTimeOffsets() will return a JSON array of objects, each object having the word and its start/end time (in seconds). This array is available in the response stored in the resultVarName variable.
    • moreOptionsJSONStr (String, optional): Additional configuration parameters to send to the service. The plug-in does not encapsulate and handle each and every feature of the provider. Passing more parameters is a way to get results the plug-in does not fetch by default. See the provider REST API documentation (for the current version, see above).
    • saveDocument (optional). A boolean. If true, Document is saved (default is false).
    • resultVarName (optional): The name of a Context Variable that will contain the SpeechToTextResponse object (see above)


Building requires the following software:

  • git
  • maven

Running the plugin requires a Google Cloud API key to access Google's Cloud Services.


git clone https://github.com/nuxeo-sandbox/nuxeo-speechtotext.git
cd nuxeo-speechtotext

mvn clean install

Note: See Authentication to Google Cloud Service. If no Google API Key is provided, the unit tests calling the service are ignored.


These features are not part of the Nuxeo Production platform, and they are not supported.

These solutions are provided for inspiration and we encourage customers to use them as code samples and learning resources.

This is a moving project (no API maintenance, no deprecation process, etc.). If any of these solutions are found to be useful for the Nuxeo Platform in general, they will be integrated directly into the platform, not maintained here.


Apache License, Version 2.0

About Nuxeo

Nuxeo, developer of the leading Content Services Platform, is reinventing enterprise content management (ECM) and digital asset management (DAM). Nuxeo is fundamentally changing how people work with data and content to realize new value from digital information. Its cloud-native platform has been deployed by large enterprises, mid-sized businesses and government agencies worldwide. Customers like Verizon, Electronic Arts, ABN Amro, and the Department of Defense have used Nuxeo's technology to transform the way they do business. Founded in 2008, the company is based in New York with offices across the United States, Europe, and Asia.

Learn more at www.nuxeo.com.