
Spring Speech To Text

Java implementation for converting speech to text using AWS Transcribe and Azure Speech Services.


# Clone the repository
git clone https://github.com/luizveronesi/spring-speech-text.git

# Navigate to the project directory
cd spring-speech-text

# Install dependencies
mvn install

# Docker installation
mvn clean package -f pom.xml -U
docker build . -t spring-speech-text-example:latest
docker create --name spring-speech-text-example --network your-network --ip x.x.x.x --restart unless-stopped spring-speech-text-example:latest bash
docker start spring-speech-text-example


# Run the application
java -jar target/api.jar

Open Swagger: http://localhost:8080/swagger-ui/index.html


All configuration parameters must be set in the file src/main/resources/application.yml.

AWS Transcribe

This implementation uploads the audio file to an S3 bucket and processes it as a batch job.

The operation returns a unique identifier, which can then be used to check whether the job has finished and to retrieve the transcribed text.
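
The check-then-fetch flow described above can be sketched without the AWS SDK. In this minimal sketch, `TranscriptionJobLookup`, `TranscriptPoller` and `awaitTranscript` are hypothetical names, not classes from this project; the status strings mirror the job statuses AWS Transcribe reports, and a real implementation would query the service for the job identified by the uid.

```java
import java.util.Optional;

/** Hypothetical view of a batch transcription job; not part of this project or the AWS SDK. */
interface TranscriptionJobLookup {
    /** Status of the job identified by uid, e.g. "IN_PROGRESS", "COMPLETED" or "FAILED". */
    String status(String uid);

    /** Transcribed text; only meaningful once the status is "COMPLETED". */
    String transcript(String uid);
}

final class TranscriptPoller {
    private TranscriptPoller() {}

    /**
     * Polls the job up to maxAttempts times, sleeping delayMillis between attempts.
     * Returns the transcript when the job completes, or an empty Optional when the
     * job is still running after the last attempt or the thread is interrupted.
     */
    static Optional<String> awaitTranscript(TranscriptionJobLookup lookup, String uid,
                                            int maxAttempts, long delayMillis) {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            String status = lookup.status(uid);
            if ("COMPLETED".equals(status)) {
                return Optional.of(lookup.transcript(uid));
            }
            if ("FAILED".equals(status)) {
                throw new IllegalStateException("Transcription job failed: " + uid);
            }
            if (attempt < maxAttempts) {
                try {
                    Thread.sleep(delayMillis);
                } catch (InterruptedException e) {
                    // Restore the interrupt flag and stop polling.
                    Thread.currentThread().interrupt();
                    return Optional.empty();
                }
            }
        }
        return Optional.empty();
    }
}
```

An empty result only means the job had not finished within the attempts allowed; the same uid can be checked again later.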

        static: us-east-1
        access-key: ${AWS_ACCESS_KEY}
        secret-key: ${AWS_SECRET_KEY}
        endpoint: https://transcribe.us-east-1.amazonaws.com
        bucket: YOUR_BUCKET

Azure Speech Services

Attention: this service is not working properly.

For the same audio, the results differ substantially from those of AWS Transcribe and the quality is worse.

I recommend using AWS Transcribe until this code is improved.

          subscription-key: ${AZURE_SPEECH_SERVICES_KEY}
          region: eastus
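
Both configuration fragments read their credentials from environment variables (`AWS_ACCESS_KEY`, `AWS_SECRET_KEY`, `AZURE_SPEECH_SERVICES_KEY`). As a minimal sketch, a check like the one below could report which of them are unset before the application starts. The `RequiredEnvCheck` class is hypothetical and not part of this project; if only one engine is used, the list can be trimmed accordingly.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;

/** Hypothetical helper: lists credential variables that are unset or blank. */
final class RequiredEnvCheck {
    // Variable names taken from the application.yml placeholders in this README.
    static final List<String> REQUIRED = Arrays.asList(
            "AWS_ACCESS_KEY", "AWS_SECRET_KEY", "AZURE_SPEECH_SERVICES_KEY");

    private RequiredEnvCheck() {}

    /** Returns the required names that are absent or blank in the given environment. */
    static List<String> missing(Map<String, String> env) {
        List<String> missing = new ArrayList<>();
        for (String name : REQUIRED) {
            String value = env.get(name);
            if (value == null || value.trim().isEmpty()) {
                missing.add(name);
            }
        }
        return missing;
    }
}
```

Calling `RequiredEnvCheck.missing(System.getenv())` returns an empty list when every variable is set.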



Upload an audio file and extract its text.


| Parameter | Type | Description |
| --- | --- | --- |
| file | MultipartFile | The audio file itself. |
| type | option | The engine used to extract the text from the audio file. Available options: AWS, AZURE. |
| language | string | The language code. Available: [en-IE, ar-AE, pa-IN, be-BY, te-IN, zh-TW, en-US, uk-UA, sw-KE, gu-IN, ta-IN, en-AB, ug-CN, su-ID, bn-IN, hy-AM, en-IN, sl-SI, ab-GE, zh-CN, ar-SA, eu-ES, en-ZA, gd-GB, cy-WL, uz-UZ, tl-PH, so-SO, sk-SK, rw-RW, ro-RO, pl-PL, no-NO, mt-MT, mr-IN, mn-MN, mk-MK, lv-LV, lt-LT, is-IS, hu-HU, hr-HR, ha-NG, fi-FI, et-ET, bg-BG, az-AZ, th-TH, tr-TR, ru-RU, pt-PT, nl-NL, it-IT, id-ID, fr-FR, es-ES, de-DE, sw-RW, sw-TZ, sr-RS, ps-AF, or-IN, kn-IN, ga-IE, af-ZA, wo-SN, tt-RU, sw-BI, en-NZ, ko-KR, el-GR, ba-RU, hi-IN, de-CH, vi-VN, cy-GB, ml-IN, ms-MY, he-IL, cs-CZ, ka-GE, si-LK, gl-ES, lg-IN, kab-DZ, da-DK, en-AU, zu-ZA, mhr-RU, ast-ES, pt-BR, en-WL, sw-UG, ky-KG, ckb-IQ, bs-BA, fa-IR, kk-KZ, ckb-IR, sv-SE, ja-JP, mi-NZ, ca-ES, es-US, fr-CA, en-GB]. Note: not all languages have been tested. |
| numParticipants | number | The number of distinct participants in the recording. |
| mediaFormat | string | The media format of the uploaded file. Tested with mp3 and wav. |
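
The request parameters above can be sanity-checked on the client before uploading. The following is a minimal sketch: `TranscriptionParams` and `validate` are hypothetical names, not part of this project's API. It checks only the shape of the language code (not membership in the list above) and assumes that `numParticipants`, when provided, must be at least 1.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

/** Hypothetical client-side check of the upload parameters. */
final class TranscriptionParams {
    /** Engines listed for the "type" parameter. */
    enum Engine { AWS, AZURE }

    /** Media formats this README reports as tested. */
    static final List<String> TESTED_FORMATS = Arrays.asList("mp3", "wav");

    private TranscriptionParams() {}

    /** Returns a list of problems; an empty list means the parameters look plausible. */
    static List<String> validate(String type, String language,
                                 Integer numParticipants, String mediaFormat) {
        List<String> problems = new ArrayList<>();
        if (!isKnownEngine(type)) {
            problems.add("type must be AWS or AZURE");
        }
        // Shape check only: two or three lowercase letters, a hyphen, two uppercase letters.
        if (language == null || !language.matches("[a-z]{2,3}-[A-Z]{2}")) {
            problems.add("language must look like en-US or kab-DZ");
        }
        // Assumption: the value is optional, but must be positive when present.
        if (numParticipants != null && numParticipants < 1) {
            problems.add("numParticipants must be at least 1");
        }
        if (mediaFormat == null
                || !TESTED_FORMATS.contains(mediaFormat.toLowerCase(Locale.ROOT))) {
            problems.add("mediaFormat has only been tested with mp3 and wav");
        }
        return problems;
    }

    private static boolean isKnownEngine(String type) {
        if (type == null) {
            return false;
        }
        for (Engine engine : Engine.values()) {
            if (engine.name().equals(type)) {
                return true;
            }
        }
        return false;
    }
}
```

Formats other than mp3 and wav are flagged here only because they are untested, not because they are known to fail.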


| Parameter | Type | Description |
| --- | --- | --- |
| uid | string | Unique identifier for the transcription operation. |
| results | list of objects | The raw object from each engine's response. If type is AWS, it is a Transcribe (dev.luizveronesi.speech.model.Transcribe). If type is AZURE, it is a list of SpeechRecognitionResult (https://learn.microsoft.com/en-us/java/api/com.microsoft.cognitiveservices.speech.speechrecognitionresult). |
| sentences | list of sentences | List of sentence objects (dev.luizveronesi.speech.model.Sentence) with the extracted text, start and end times (in seconds) and the participant. |
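
To illustrate how the `sentences` list might be consumed, here is a minimal sketch that renders it as a readable transcript. `SentenceView` is a hypothetical stand-in for dev.luizveronesi.speech.model.Sentence: its field names and types are assumptions based on the description above, not the project's actual model.

```java
import java.util.List;
import java.util.Locale;

/** Hypothetical stand-in for the Sentence model; field names and types are assumed. */
final class SentenceView {
    final String text;
    final double start;       // seconds
    final double end;         // seconds
    final String participant;

    SentenceView(String text, double start, double end, String participant) {
        this.text = text;
        this.start = start;
        this.end = end;
        this.participant = participant;
    }
}

final class TranscriptFormatter {
    private TranscriptFormatter() {}

    /** Renders one line per sentence: "[start-end] participant: text". */
    static String format(List<SentenceView> sentences) {
        StringBuilder out = new StringBuilder();
        for (SentenceView s : sentences) {
            // Locale.ROOT keeps the decimal separator a dot regardless of the system locale.
            out.append(String.format(Locale.ROOT, "[%.2f-%.2f] %s: %s",
                    s.start, s.end, s.participant, s.text));
            out.append('\n');
        }
        return out.toString();
    }
}
```

For example, a sentence "Hello there." spoken by a participant labelled `spk_0` from 0 to 1.5 seconds is rendered as `[0.00-1.50] spk_0: Hello there.`.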

Next steps

Implement unit tests.

Improve the Azure Speech Services implementation: add batch processing as in AWS, check whether other formats are allowed, and get the duration and participant for each sentence.