Speech Recognition Client/Server and Desktop Application

The software developed in this project uses SYSTRAN's Faster Whisper to turn speech into text and make the resulting text available in any other application.

Speech recognition is resource-intensive, and a graphics card (GPU) greatly speeds up the processing of recorded speech files. The software is therefore designed to process speech in one of two ways:

Local processing: The speech is recorded on the local computer, processed there, and passed to local applications.
Decentralized processing: The speech is recorded locally, converted into a character string on a remote computer (GPU server), and sent back to the source computer as a character string. There, the text is passed to local applications.
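As a rough illustration of the local processing path, the following sketch transcribes one recorded file with the `faster-whisper` package. The helper names `to_whisper_device` and `transcribe_file` are hypothetical and not taken from this project, and the mapping from this README's `gpu` to faster-whisper's `cuda` is an assumption about how the names could be translated.

```python
from typing import Optional

# Assumption: this README calls the GPU device "gpu", while faster-whisper
# expects "cuda". The mapping below is illustrative, not the project's code.
DEVICE_MAP = {"cpu": "cpu", "gpu": "cuda"}


def to_whisper_device(device: str) -> str:
    """Translate this README's device names into faster-whisper device names."""
    try:
        return DEVICE_MAP[device]
    except KeyError:
        raise ValueError(
            f"device must be one of {sorted(DEVICE_MAP)}, got {device!r}"
        ) from None


def transcribe_file(path: str, model_size: str = "small", device: str = "cpu",
                    language: Optional[str] = None,
                    task: str = "transcribe") -> str:
    """Transcribe one recorded audio file and return the text as one string."""
    # Imported lazily so the helper above works without faster-whisper installed.
    from faster_whisper import WhisperModel

    model = WhisperModel(model_size, device=to_whisper_device(device))
    # transcribe() returns a lazy generator of segments plus an info object.
    segments, _info = model.transcribe(path, language=language, task=task)
    return "".join(segment.text for segment in segments)
```

In the decentralized case, the same function would conceptually run on the GPU server, with only the audio file and the resulting string travelling over the network.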

Installation

Microsoft Windows

The following command must be executed in a terminal window (cmd):

.\bin\prepare.bat

Linux

Run the following command in a shell:

bash ./bin/prepare.sh

Start the application

Microsoft Windows

Use the following command in a terminal window (cmd):

.\run.bat

Linux

Run the following command in a shell:

bash ./run.sh

Currently the client only works with the X Window System. Wayland is not supported.

Advanced setup options

To change the program behavior, edit run.bat (Windows) or run.sh (Linux). For example, if you are on a Linux machine and never want to use the GPU, change the Python program call in run.sh to

python src/srcsd/tkclient.py --device=cpu

The following options are available:

setting	explanation	default value
`device`	computing resource used for processing; one of [`cpu`, `gpu`]; ignored if `local=false`	`gpu` if available, else `cpu`
`local`	whether the client processes audio data locally; one of [`true`, `false`]; if `false`, processing is delegated to a remote GPU server, which must be running	`true`
`host`	hostname to send requests to; only used if `local=false`	`localhost`
`port`	port to use for requests to the host; only used if `local=false`	`8001`
`ssl_selfsigned`	whether the server uses a self-signed certificate; if so, the client needs a copy of the certificate in order to trust it	`true`
`ssl_cert`	path of the SSL certificate file to use; only used if `local=false`; required if `ssl_selfsigned=true`	`./keys/cert.pem`
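The client options and their defaults can be pictured as a small command-line parser. This is a minimal sketch using `argparse`; the function names `str_to_bool` and `build_client_parser` are hypothetical, and the actual parsing in tkclient.py may differ.

```python
import argparse


def str_to_bool(value: str) -> bool:
    """Parse 'true' or 'false' (case-insensitive), as used in the options table."""
    lowered = value.lower()
    if lowered in ("true", "false"):
        return lowered == "true"
    raise argparse.ArgumentTypeError(f"expected true or false, got {value!r}")


def build_client_parser() -> argparse.ArgumentParser:
    """Build a parser that mirrors the documented client options and defaults."""
    parser = argparse.ArgumentParser(description="Speech recognition client (sketch)")
    # None stands for "gpu if available, else cpu"; detection is omitted here.
    parser.add_argument("--device", choices=["cpu", "gpu"], default=None)
    parser.add_argument("--local", type=str_to_bool, default=True)
    parser.add_argument("--host", default="localhost")
    parser.add_argument("--port", type=int, default=8001)
    parser.add_argument("--ssl_selfsigned", type=str_to_bool, default=True)
    parser.add_argument("--ssl_cert", default="./keys/cert.pem")
    return parser
```

With this sketch, `build_client_parser().parse_args(["--local=false", "--host=192.168.0.100"])` yields a namespace in which `local` is `False` and all other options keep their documented defaults.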

Start the server

The server is only needed if you start the app with `local` set to `false` (see "Advanced setup options").

An SSL certificate and a private key are required to use the server. The key and certificate included in this repository are only valid for accessing the server via the following addresses: localhost, 0.0.0.0, 127.0.0.1 and 192.168.0.100.

If you want to access the server via a different IP address or DNS name, you can either use your own existing certificate and key, or add the necessary addresses to ./keys/cert.conf by appending them to the [ sans ] section or by replacing any of the existing addresses in that section. You can then generate the certificate and private key using OpenSSL with the following command:

openssl req -x509 -out ./keys/cert.pem -keyout ./keys/privkey.pem -newkey rsa:4096 -sha256 -days 365 -extensions ext -config ./keys/cert.conf
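On the client side, trusting a self-signed certificate generated this way typically means loading it as the only trusted authority. The sketch below shows one way to do that with Python's standard `ssl` module; the function name `build_client_ssl_context` is hypothetical, and the project's client may set up TLS differently.

```python
import ssl


def build_client_ssl_context(ssl_selfsigned: bool = True,
                             ssl_cert: str = "./keys/cert.pem") -> ssl.SSLContext:
    """Return a TLS context for talking to the speech recognition server."""
    if ssl_selfsigned:
        # Trust only the certificates in this file. Hostname checking stays
        # enabled, so the address used to reach the server must be listed
        # among the certificate's subject alternative names.
        return ssl.create_default_context(cafile=ssl_cert)
    # Otherwise rely on the system's trusted certificate authorities.
    return ssl.create_default_context()
```

Because hostname checking remains active, connecting via an address that is missing from the certificate fails even though the certificate itself is trusted, which is why the address list in cert.conf matters.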

Microsoft Windows

Use the following command in a terminal window (cmd):

.\run_server.bat

Linux

Run the following command in a shell:

bash ./run_server.sh

Advanced setup options

To change the server behavior, edit run_server.bat (Windows) or run_server.sh (Linux). For example, if you are on a Linux machine and want to host the server on port 4000, change the Python program call in run_server.sh to

python src/srcsd/server.py --port=4000

The following options are available:

setting	explanation	default value
`port`	port to listen for requests on	`8001`
`ssl_key`	path of the SSL private key file to use	`./keys/privkey.pem`
`ssl_cert`	path of the SSL certificate file to use	`./keys/cert.pem`
`device`	computing resource used for processing; one of [`cpu`, `gpu`]	`gpu` if available, else `cpu`
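The default "`gpu` if available, else `cpu`" implies some form of GPU detection. The following sketch shows one possible way to resolve it, assuming detection via `ctranslate2.get_cuda_device_count()` (CTranslate2 is the inference backend of faster-whisper). The function names are hypothetical, and the project may detect the GPU differently.

```python
from typing import Optional


def detect_cuda_device_count() -> int:
    """Return the number of CUDA devices, or 0 if it cannot be determined."""
    try:
        import ctranslate2  # inference backend used by faster-whisper
        return ctranslate2.get_cuda_device_count()
    except Exception:
        # Missing package or broken driver: treat as "no GPU available".
        return 0


def resolve_device(requested: Optional[str] = None,
                   cuda_device_count: Optional[int] = None) -> str:
    """Return 'cpu' or 'gpu', honoring an explicit request if one is given."""
    if requested is not None:
        if requested not in ("cpu", "gpu"):
            raise ValueError(f"device must be 'cpu' or 'gpu', got {requested!r}")
        return requested
    if cuda_device_count is None:
        cuda_device_count = detect_cuda_device_count()
    return "gpu" if cuda_device_count > 0 else "cpu"
```

Passing `cuda_device_count` explicitly keeps the decision logic testable on machines without a GPU.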

Usage

The program offers the following settings:

Setting	Usage
Model	Selects the Whisper model. Smaller models are faster but less accurate.
Language	The language of the spoken input.
Task	`transcribe` means that the text is created in the original language; `translate` means that the text is translated to English.
Format	`normal` takes the text from the Whisper model as is, while `stripped` removes leading and trailing whitespace. `stripped` is preferred when working with spreadsheets or presentation programs; `normal` keeps the whitespace and is therefore preferred for running text.
Pause	Speech processing starts after a short break in the speech (e.g. the pause between two sentences). This parameter sets the duration of that break.
Active	Determines whether audio data should be processed or not.
Insert via CTR-V	Defines whether the recognized text is automatically copied to the system clipboard and the CTRL-V key combination is automatically pressed. On Linux systems, this requires xclip.
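To make the Pause setting more concrete, the sketch below splits a stream of audio frames into utterances whenever the signal stays quiet for at least the configured pause. It is an illustration only: the function name, the per-frame level input, and the silence threshold are assumptions, and the program's actual recording logic is not documented here.

```python
from typing import Iterable, List, Tuple


def split_on_pauses(frame_levels: Iterable[float], frame_seconds: float,
                    pause_seconds: float,
                    silence_threshold: float) -> List[Tuple[int, int]]:
    """Return (start, end) frame index pairs of utterances; end is exclusive.

    An utterance is closed once the level has stayed below silence_threshold
    for at least pause_seconds.
    """
    utterances = []
    start = None        # index of the first speech frame of the open utterance
    last_speech = None  # index of the most recent speech frame
    for i, level in enumerate(frame_levels):
        if level >= silence_threshold:
            if start is None:
                start = i
            last_speech = i
        elif start is not None:
            silent_frames = i - last_speech
            if silent_frames * frame_seconds >= pause_seconds:
                utterances.append((start, last_speech + 1))
                start = None
    # Audio that ends mid-utterance stays open: no pause has been observed yet.
    return utterances
```

A longer pause value makes the program wait longer before processing, which avoids cutting sentences apart but delays the text output.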

The text input field contains the recognized text.

Data privacy

Local data processing

The program stores recorded audio files on the computer in the directory 'audio_data'. Each audio file is deleted immediately after it has been converted into text. If the program is killed, residual files may remain in the audio_data directory. They can safely be deleted manually; otherwise they are deleted at the next program start.
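The cleanup at program start can be pictured as follows. This is a minimal sketch with a hypothetical function name, assuming that every regular file in the directory is a leftover recording; it is not the project's actual implementation.

```python
from pathlib import Path


def remove_residual_audio(directory: str = "audio_data") -> int:
    """Delete leftover files in the directory and return how many were removed."""
    folder = Path(directory)
    if not folder.is_dir():
        return 0  # nothing was ever recorded, so there is nothing to clean up
    removed = 0
    for entry in folder.iterdir():
        if entry.is_file():
            entry.unlink()
            removed += 1
    return removed
```

Subdirectories are left untouched in this sketch, so only files placed directly in the directory are removed.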

Known Issues/Restrictions

Background noise picked up during recording can sometimes produce random text output.

Client/Server based audio data processing

The client stores recorded audio files in the directory audio_data, and the server stores received audio files in the directory .uploads. Audio files are deleted after they have been transferred or used. If the program is killed, residual files may remain in those directories. They can safely be deleted manually; otherwise they are deleted at the next program start.
