Java implementation for converting speech to text using AWS Transcribe and Azure Speech Services.
# Clone the repository
git clone https://github.com/luizveronesi/spring-speech-text.git
# Navigate to the project directory
cd spring-speech-text
# Install dependencies
mvn install
# Docker installation
mvn clean package -f pom.xml -U
docker build . -t spring-speech-text-example:latest
docker create --name spring-speech-text-example --network your-network --ip x.x.x.x --restart unless-stopped spring-speech-text-example:latest bash
docker start spring-speech-text-example
# Run the application
java -jar target/api.jar
Open Swagger: http://localhost:8080/swagger-ui/index.html
All configuration parameters must be set in the file src/main/resources/application.yml.
This implementation uploads the audio file to an S3 bucket and processes it as a batch job.
The operation's unique identifier can then be used to check whether the job has finished and to retrieve the transcribed text.
spring:
  cloud:
    aws:
      region:
        static: us-east-1
      credentials:
        access-key: ${AWS_ACCESS_KEY}
        secret-key: ${AWS_SECRET_KEY}
      transcribe:
        endpoint: https://transcribe.us-east-1.amazonaws.com
        bucket: YOUR_BUCKET
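
The check-then-fetch step of the batch flow described above can be sketched as a small polling loop. This is a minimal illustration using only the Java standard library, not the project's actual code: `TranscriptionPoller`, `awaitText`, and the `lookup` function are hypothetical names, and `lookup` stands in for whatever call reports the job status for a given uid.

```java
import java.time.Duration;
import java.util.Optional;
import java.util.function.Function;

/** Hypothetical helper: polls a batch transcription job by uid until its text is available. */
final class TranscriptionPoller {

    private TranscriptionPoller() {}

    /**
     * Calls lookup up to maxAttempts times, sleeping for delay between attempts.
     * lookup is an assumed status check: it returns the transcribed text once the
     * job has finished and an empty Optional while the job is still running.
     */
    static Optional<String> awaitText(String uid,
                                      Function<String, Optional<String>> lookup,
                                      int maxAttempts,
                                      Duration delay) {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            Optional<String> text = lookup.apply(uid);
            if (text.isPresent()) {
                return text;
            }
            if (attempt < maxAttempts) {
                try {
                    Thread.sleep(delay.toMillis());
                } catch (InterruptedException e) {
                    // Preserve the interrupt flag and stop polling.
                    Thread.currentThread().interrupt();
                    return Optional.empty();
                }
            }
        }
        // The job did not finish within the allowed number of attempts.
        return Optional.empty();
    }
}
```

In a real client, `lookup` would wrap the status request for the job (for example, a call built around AWS Transcribe's GetTranscriptionJob operation), and the delay would typically be several seconds, since batch jobs rarely finish immediately.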
Attention: this service is not working properly.
Its results differ substantially from those of AWS Transcribe for the same audio, and the quality is worse.
I recommend using AWS Transcribe until this code is improved.
spring:
  cloud:
    azure:
      speech:
        services:
          subscription-key: ${AZURE_SPEECH_SERVICES_KEY}
          region: eastus
Upload an audio file and extract its text.
Parameter | Type | Description |
---|---|---|
file | MultipartFile | The audio file itself. |
type | option | The engine used to extract the text from the audio file. Available options: AWS, AZURE. |
language | string | The language code. Available: [en-IE, ar-AE, pa-IN, be-BY, te-IN, zh-TW, en-US, uk-UA, sw-KE, gu-IN, ta-IN, en-AB, ug-CN, su-ID, bn-IN, hy-AM, en-IN, sl-SI, ab-GE, zh-CN, ar-SA, eu-ES, en-ZA, gd-GB, cy-WL, uz-UZ, tl-PH, so-SO, sk-SK, rw-RW, ro-RO, pl-PL, no-NO, mt-MT, mr-IN, mn-MN, mk-MK, lv-LV, lt-LT, is-IS, hu-HU, hr-HR, ha-NG, fi-FI, et-ET, bg-BG, az-AZ, th-TH, tr-TR, ru-RU, pt-PT, nl-NL, it-IT, id-ID, fr-FR, es-ES, de-DE, sw-RW, sw-TZ, sr-RS, ps-AF, or-IN, kn-IN, ga-IE, af-ZA, wo-SN, tt-RU, sw-BI, en-NZ, ko-KR, el-GR, ba-RU, hi-IN, de-CH, vi-VN, cy-GB, ml-IN, ms-MY, he-IL, cs-CZ, ka-GE, si-LK, gl-ES, lg-IN, kab-DZ, da-DK, en-AU, zu-ZA, mhr-RU, ast-ES, pt-BR, en-WL, sw-UG, ky-KG, ckb-IQ, bs-BA, fa-IR, kk-KZ, ckb-IR, sv-SE, ja-JP, mi-NZ, ca-ES, es-US, fr-CA, en-GB]. Note: not all languages have been tested. |
numParticipants | number | The number of distinct participants (speakers) in the recording. |
mediaFormat | string | The media format of the uploaded file. Tested with mp3 and wav. |
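
A client has to send the request parameters listed above as a multipart/form-data body. The following is a minimal sketch of how that could be assembled with the JDK's `java.net.http` API, which has no built-in multipart support. `TranscribeRequestSketch` and its methods are hypothetical names, and the endpoint path is not stated in this README, so the URI passed to `build` is an assumption to be replaced with the path shown in Swagger.

```java
import java.io.ByteArrayOutputStream;
import java.net.URI;
import java.net.http.HttpRequest;
import java.nio.charset.StandardCharsets;
import java.util.Map;

/** Hypothetical client-side helper; not part of the project. */
final class TranscribeRequestSketch {

    private static final String CRLF = "\r\n";

    private TranscribeRequestSketch() {}

    /** Encodes the text fields followed by the audio bytes as one multipart/form-data body. */
    static byte[] multipartBody(String boundary, Map<String, String> fields,
                                String fileName, byte[] audio) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        // One part per plain parameter (type, language, numParticipants, mediaFormat).
        for (Map.Entry<String, String> field : fields.entrySet()) {
            write(out, "--" + boundary + CRLF
                    + "Content-Disposition: form-data; name=\"" + field.getKey() + "\"" + CRLF
                    + CRLF + field.getValue() + CRLF);
        }
        // The audio goes in the part named "file", matching the parameter table.
        write(out, "--" + boundary + CRLF
                + "Content-Disposition: form-data; name=\"file\"; filename=\"" + fileName + "\"" + CRLF
                + "Content-Type: application/octet-stream" + CRLF + CRLF);
        out.write(audio, 0, audio.length);
        // Closing delimiter marks the end of the body.
        write(out, CRLF + "--" + boundary + "--" + CRLF);
        return out.toByteArray();
    }

    /** Wraps the body in a POST request; the endpoint URI is supplied by the caller. */
    static HttpRequest build(URI endpoint, String boundary, byte[] body) {
        return HttpRequest.newBuilder(endpoint)
                .header("Content-Type", "multipart/form-data; boundary=" + boundary)
                .POST(HttpRequest.BodyPublishers.ofByteArray(body))
                .build();
    }

    private static void write(ByteArrayOutputStream out, String text) {
        byte[] bytes = text.getBytes(StandardCharsets.UTF_8);
        out.write(bytes, 0, bytes.length);
    }
}
```

The resulting request can be sent with `HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString())`. The boundary should be a string that cannot occur in the audio data, for example a random UUID.
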
Parameter | Type | Description |
---|---|---|
uid | string | Unique identifier for the transcription operation. |
results | list of objects | The object from each engine response. If type is AWS, it is a Transcribe (dev.luizveronesi.speech.model.Transcribe). If type is AZURE, it is a list of SpeechRecognitionResult (https://learn.microsoft.com/en-us/java/api/com.microsoft.cognitiveservices.speech.speechrecognitionresult). |
sentences | list of sentences | List of sentence objects (dev.luizveronesi.speech.model.Sentence) with the extracted text, start and end times (in seconds), and the participant. |
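
To illustrate how the `sentences` list might be consumed, here is a minimal sketch that prints a readable transcript line and totals the speaking time per participant. The nested `Sentence` class is a hypothetical stand-in for dev.luizveronesi.speech.model.Sentence: its field names and types (including a string participant label) are assumptions based only on the description above.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

/** Hypothetical consumer of the response's sentence list; not part of the project. */
final class SentenceSketch {

    private SentenceSketch() {}

    /** Assumed shape of a sentence: text, start and end in seconds, and a participant label. */
    static final class Sentence {
        final String text;
        final double start;
        final double end;
        final String participant;

        Sentence(String text, double start, double end, String participant) {
            this.text = text;
            this.start = start;
            this.end = end;
            this.participant = participant;
        }
    }

    /** Formats one transcript line, e.g. "[0.00-1.50] spk_0: Hello". */
    static String format(Sentence s) {
        return String.format(Locale.ROOT, "[%.2f-%.2f] %s: %s",
                s.start, s.end, s.participant, s.text);
    }

    /** Sums the duration of each participant's sentences, keeping first-appearance order. */
    static Map<String, Double> speakingTime(List<Sentence> sentences) {
        Map<String, Double> totals = new LinkedHashMap<>();
        for (Sentence s : sentences) {
            totals.merge(s.participant, s.end - s.start, Double::sum);
        }
        return totals;
    }
}
```

Because the Azure implementation does not yet provide duration or participant for each sentence, a summary like this is only meaningful for results produced with the AWS engine.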
Implement unit tests.
Improve Azure Speech Service implementation: add batch processing as in AWS, check if other formats are allowed and get duration and participant for each sentence.