A simple text-to-speech client for the Azure TTS API. 😆
You can try the Azure TTS API online: https://azure.microsoft.com/en-us/services/cognitive-services/text-to-speech
Starting from version 4.0.0, aspeak is rewritten in Rust. The old Python version is available at the python branch.
By default, we try to use a trial endpoint that doesn't require authentication. But its availability is not guaranteed and its capability is restricted by Microsoft.
You can sign up for an Azure account and then choose a payment plan as needed (or stick to free tier). The free tier includes a quota of 0.5 million characters per month, free of charge.
Please refer to the Authentication section to learn how to set up authentication for aspeak.
Download the latest release from here.
After downloading, extract the archive and you will get a binary executable file.
You can put it in a directory that is in your PATH environment variable so that you can run it from anywhere.
From v4.1.0, you can install aspeak-bin from AUR.
Installing from PyPI will also install the Python binding of aspeak for you. Check Library Usage#Python for more information on using the Python binding.
pip install -U aspeak==4.1.0
Currently, prebuilt wheels are only available for the x86_64 architecture. Due to some technical issues, I haven't uploaded the source distribution to PyPI yet, so to build a wheel from source you need to follow the instructions in Install from Source.
Because of manylinux compatibility issues, the wheels for Linux are not available on PyPI. (But you can still build them from source.)
The easiest way to install aspeak from source is to use cargo:
cargo install aspeak
Alternatively, you can also install aspeak from AUR.
To build the Python wheel, you need to install maturin first:
pip install maturin
After cloning the repository and cd into the directory, you can build the wheel by running:
maturin build --release --strip -F python --bindings pyo3 --interpreter python --manifest-path Cargo.toml --out dist-pyo3
maturin build --release --strip --bindings bin --interpreter python --manifest-path Cargo.toml --out dist-bin
bash merge-wheel.bash
If everything goes well, you will get a wheel file in the dist directory.
Run aspeak help to see the help message.
Run aspeak help <subcommand> to see the help message of a subcommand.
The authentication options should be placed before any subcommand.
For example, to utilize your authentication token and an official endpoint designated by a region, run the following command:
$ aspeak --region <YOUR_REGION> --token <YOUR_AUTH_TOKEN> text "Hello World"
If you are using a custom endpoint, you can use the --endpoint option instead of --region.
In the future, authentication by Azure subscription key will be supported. For now, I don't have a subscription key to test with.
To avoid repetition, you can store your authentication details in your aspeak profile. Read the following section for more details.
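When scripting aspeak, the ordering rule above (authentication options before the subcommand) can be sketched in Python. This is only an illustration: the helper name build_aspeak_command is hypothetical and not part of aspeak, it uses only the --region, --endpoint and --token options described above, and treating --region and --endpoint as mutually exclusive is an assumption. It builds the argument list and does not run anything.

```python
from typing import List, Optional


def build_aspeak_command(
    text: str,
    token: Optional[str] = None,
    region: Optional[str] = None,
    endpoint: Optional[str] = None,
) -> List[str]:
    """Build an argv list with auth options placed before the subcommand.

    Hypothetical helper for illustration; it is not part of aspeak.
    """
    # Assumption: --endpoint is used instead of --region, never together.
    if region and endpoint:
        raise ValueError("pass either region or endpoint, not both")
    argv = ["aspeak"]
    # Authentication options go first, before any subcommand.
    if endpoint:
        argv += ["--endpoint", endpoint]
    elif region:
        argv += ["--region", region]
    if token:
        argv += ["--token", token]
    # The subcommand and its own arguments come last.
    argv += ["text", text]
    return argv


command = build_aspeak_command(
    "Hello World", token="<YOUR_AUTH_TOKEN>", region="<YOUR_REGION>"
)
print(command)
```

Once aspeak is on your PATH, a list like this could be handed to subprocess.run(command, check=True).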
aspeak v4 introduces the concept of profiles. A profile is a configuration file where you can specify default values for the command line options.
Run the following command to create your default profile:
$ aspeak config init
To edit the profile, run:
$ aspeak config edit
If you have trouble running the above command, you can edit the profile manually:
First get the path of the profile by running:
$ aspeak config where
Then edit the file with your favorite text editor.
The profile is a TOML file. Check the comments in the config file for more information about available options. The default profile looks like this:
# Profile for aspeak
# GitHub: https://github.com/kxxt/aspeak
# Output verbosity
# 0 - Default
# 1 - Verbose
# The following output verbosity levels are only supported on debug build
# 2 - Debug
# >=3 - Trace
verbosity = 0
#
# Authentication configuration
#
[auth]
# Endpoint for TTS
# endpoint = "wss://eastus.api.speech.microsoft.com/cognitiveservices/websocket/v1"
# Alternatively, you can specify the region if you are using official endpoints
# region = "eastus"
# Azure Subscription Key
# key = "YOUR_KEY"
# Authentication Token
# token = "Your Authentication Token"
# Extra http headers (for experts)
# headers = [["X-My-Header", "My-Value"], ["X-My-Header2", "My-Value2"]]
#
# Configuration for text subcommand
#
[text]
# Voice to use. Note that it takes precedence over the locale
# voice = "en-US-JennyNeural"
# Locale to use
locale = "en-US"
# Rate
rate = 0
# Pitch
pitch = 0
# Role
role = "Boy"
# Style, "general" by default
style = "general"
# Style degree, a floating-point number between 0.1 and 2.0
# style_degree = 1.0
#
# Output Configuration
#
[output]
# Container Format, Only wav/mp3/ogg/webm is supported.
container = "wav"
# Audio Quality. Run `aspeak list-qualities` to see available qualities.
#
# If you choose a container format that does not support the quality level you specified here,
# we will automatically select the closest level for you.
quality = 0
# Audio Format(for experts). Run `aspeak list-formats` to see available formats.
# Note that it takes precedence over container and quality!
# format = "audio-16khz-128kbitrate-mono-mp3"
If you want to use a profile other than your default profile, you can use the --profile argument:
aspeak --profile <PATH_TO_A_PROFILE> text "Hello"
- rate: The speaking rate of the voice.
  - If you use a float value (say 0.5), the value will be multiplied by 100% and become 50.00%.
  - You can use the following values as well: x-slow, slow, medium, fast, x-fast, default.
  - You can also use percentage values directly: +10%.
  - You can also use a relative float value (with f postfix), 1.2f:
    - According to the Azure documentation,
    - A relative value, expressed as a number that acts as a multiplier of the default.
    - For example, a value of 1f results in no change in the rate. A value of 0.5f results in a halving of the rate. A value of 3f results in a tripling of the rate.
- pitch: The pitch of the voice.
  - If you use a float value (say -0.5), the value will be multiplied by 100% and become -50.00%.
  - You can use the following values as well: x-low, low, medium, high, x-high, default.
  - You can also use percentage values directly: +10%.
  - You can also use a relative value (e.g. -2st or +80Hz):
    - According to the Azure documentation,
    - A relative value, expressed as a number preceded by "+" or "-" and followed by "Hz" or "st" that specifies an amount to change the pitch.
    - The "st" indicates the change unit is semitone, which is half of a tone (a half step) on the standard diatonic scale.
  - You can also use an absolute value: e.g. 600Hz
Note: Unreasonably high or low values will be clipped to reasonable values by Azure Cognitive Services.
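The float-to-percentage rule for rate and pitch can be illustrated with a small Python sketch. It is not aspeak's implementation: the function name normalize and the two keyword sets are hypothetical names that mirror the lists above, and every value that is not a plain float (keywords, percentages, 1.2f, -2st, +80Hz, 600Hz) is simply passed through unchanged here.

```python
RATE_KEYWORDS = {"x-slow", "slow", "medium", "fast", "x-fast", "default"}
PITCH_KEYWORDS = {"x-low", "low", "medium", "high", "x-high", "default"}


def normalize(value: str, keywords: set) -> str:
    """Turn a plain float into a percentage string; leave other forms alone.

    Hypothetical helper for illustration; it is not part of aspeak.
    """
    if value in keywords:
        return value  # named level such as "fast" or "x-high"
    try:
        number = float(value)
    except ValueError:
        # "+10%", "1.2f", "-2st", "+80Hz", "600Hz" are not plain floats.
        return value
    # A plain float is multiplied by 100% and printed with two decimals.
    return f"{number * 100:.2f}%"


print(normalize("0.5", RATE_KEYWORDS))    # 50.00%
print(normalize("-0.5", PITCH_KEYWORDS))  # -50.00%
print(normalize("1.2f", RATE_KEYWORDS))   # 1.2f
print(normalize("+80Hz", PITCH_KEYWORDS)) # +80Hz
```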
$ aspeak text "Hello, world"
$ aspeak ssml << EOF
<speak version='1.0' xmlns='http://www.w3.org/2001/10/synthesis' xml:lang='en-US'><voice name='en-US-JennyNeural'>Hello, world!</voice></speak>
EOF
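If the SSML is generated by a script rather than typed by hand, the text needs XML escaping first. The following Python sketch builds a document of the same shape as the one above; build_ssml is a hypothetical helper name, and the default voice and language are simply the ones used in this example.

```python
from xml.sax.saxutils import escape, quoteattr


def build_ssml(
    text: str, voice: str = "en-US-JennyNeural", lang: str = "en-US"
) -> str:
    """Wrap text in a minimal <speak><voice> document.

    Hypothetical helper for illustration; it is not part of aspeak.
    """
    return (
        "<speak version='1.0' "
        "xmlns='http://www.w3.org/2001/10/synthesis' "
        # quoteattr adds the surrounding quotes and escapes the value.
        f"xml:lang={quoteattr(lang)}>"
        # escape() replaces &, < and > so the text cannot break the markup.
        f"<voice name={quoteattr(voice)}>{escape(text)}</voice>"
        "</speak>"
    )


print(build_ssml("Tom & Jerry say: 1 < 2"))
```

The resulting string can be written to the standard input of aspeak ssml, in the same way as the heredoc above.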
$ aspeak list-voices
$ aspeak list-voices -l zh-CN
$ aspeak list-voices -v en-US-SaraNeural
Output
Microsoft Server Speech Text to Speech Voice (en-US, SaraNeural)
Display name: Sara
Local name: Sara @ en-US
Locale: English (United States)
Gender: Female
ID: en-US-SaraNeural
Voice type: Neural
Status: GA
Sample rate: 48000Hz
Words per minute: 157
Styles: ["angry", "cheerful", "excited", "friendly", "hopeful", "sad", "shouting", "terrified", "unfriendly", "whispering"]
$ aspeak text "Hello, world" -o output.wav
If you prefer mp3/ogg/webm, you can use the -c mp3/-c ogg/-c webm option.
$ aspeak text "Hello, world" -o output.mp3 -c mp3
$ aspeak text "Hello, world" -o output.ogg -c ogg
$ aspeak text "Hello, world" -o output.webm -c webm
$ aspeak list-qualities
Output
Qualities for MP3:
3: audio-48khz-192kbitrate-mono-mp3
2: audio-48khz-96kbitrate-mono-mp3
-3: audio-16khz-64kbitrate-mono-mp3
1: audio-24khz-160kbitrate-mono-mp3
-2: audio-16khz-128kbitrate-mono-mp3
-4: audio-16khz-32kbitrate-mono-mp3
-1: audio-24khz-48kbitrate-mono-mp3
0: audio-24khz-96kbitrate-mono-mp3
Qualities for WAV:
-2: riff-8khz-16bit-mono-pcm
1: riff-24khz-16bit-mono-pcm
0: riff-24khz-16bit-mono-pcm
-1: riff-16khz-16bit-mono-pcm
Qualities for OGG:
0: ogg-24khz-16bit-mono-opus
-1: ogg-16khz-16bit-mono-opus
1: ogg-48khz-16bit-mono-opus
Qualities for WEBM:
0: webm-24khz-16bit-mono-opus
-1: webm-16khz-16bit-mono-opus
1: webm-24khz-16bit-24kbps-mono-opus
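The profile comments state that an unsupported quality level is replaced by the closest one for the chosen container. How aspeak picks that level is not described here, so the Python sketch below is only one plausible reading (nearest available level), using the level numbers from the listing above. The names QUALITY_LEVELS and closest_quality are hypothetical.

```python
# Quality levels per container, copied from the `aspeak list-qualities` output.
QUALITY_LEVELS = {
    "mp3": [-4, -3, -2, -1, 0, 1, 2, 3],
    "wav": [-2, -1, 0, 1],
    "ogg": [-1, 0, 1],
    "webm": [-1, 0, 1],
}


def closest_quality(container: str, requested: int) -> int:
    """Return the available level nearest to the requested one.

    Assumption: "closest" means smallest absolute difference. This is an
    illustration, not aspeak's actual selection code.
    """
    levels = QUALITY_LEVELS[container]
    return min(levels, key=lambda level: abs(level - requested))


print(closest_quality("mp3", 3))   # 3 (supported as-is)
print(closest_quality("ogg", 3))   # 1 (highest level for ogg)
print(closest_quality("wav", -4))  # -2 (lowest level for wav)
```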
$ aspeak list-formats
Output
amr-wb-16000hz
audio-16khz-128kbitrate-mono-mp3
audio-16khz-16bit-32kbps-mono-opus
audio-16khz-32kbitrate-mono-mp3
audio-16khz-64kbitrate-mono-mp3
audio-24khz-160kbitrate-mono-mp3
audio-24khz-16bit-24kbps-mono-opus
audio-24khz-16bit-48kbps-mono-opus
audio-24khz-48kbitrate-mono-mp3
audio-24khz-96kbitrate-mono-mp3
audio-48khz-192kbitrate-mono-mp3
audio-48khz-96kbitrate-mono-mp3
ogg-16khz-16bit-mono-opus
ogg-24khz-16bit-mono-opus
ogg-48khz-16bit-mono-opus
raw-16khz-16bit-mono-pcm
raw-16khz-16bit-mono-truesilk
raw-22050hz-16bit-mono-pcm
raw-24khz-16bit-mono-pcm
raw-24khz-16bit-mono-truesilk
raw-44100hz-16bit-mono-pcm
raw-48khz-16bit-mono-pcm
raw-8khz-16bit-mono-pcm
raw-8khz-8bit-mono-alaw
raw-8khz-8bit-mono-mulaw
riff-16khz-16bit-mono-pcm
riff-22050hz-16bit-mono-pcm
riff-24khz-16bit-mono-pcm
riff-44100hz-16bit-mono-pcm
riff-48khz-16bit-mono-pcm
riff-8khz-16bit-mono-pcm
riff-8khz-8bit-mono-alaw
riff-8khz-8bit-mono-mulaw
webm-16khz-16bit-mono-opus
webm-24khz-16bit-24kbps-mono-opus
webm-24khz-16bit-mono-opus
# Less than default quality.
$ aspeak text "Hello, world" -o output.mp3 -c mp3 -q=-1
# Best quality for mp3
$ aspeak text "Hello, world" -o output.mp3 -c mp3 -q=3
$ cat input.txt | aspeak text
or
$ aspeak text -f input.txt
with custom encoding:
$ aspeak text -f input.txt -e gbk
$ aspeak text
Or, if you prefer:
$ aspeak text -l zh-CN << EOF
我能吞下玻璃而不伤身体。
EOF
$ aspeak text "你好,世界!" -l zh-CN
$ aspeak text "你好,世界!" -v zh-CN-YunjianNeural
$ aspeak text "你好,世界!" -v zh-CN-XiaoxiaoNeural -p 1.5 -r 0.5 -S sad
$ aspeak text "你好,世界!" -v zh-CN-XiaoxiaoNeural -p=-10% -r=+5% -S cheerful
$ aspeak text "你好,世界!" -v zh-CN-XiaoxiaoNeural -p=+40Hz -r=1.2f -S fearful
$ aspeak text "你好,世界!" -v zh-CN-XiaoxiaoNeural -p=high -r=x-slow -S calm
$ aspeak text "你好,世界!" -v zh-CN-XiaoxiaoNeural -p=+1st -r=-7% -S lyrical
Note: Some audio formats are not supported when outputting to the speaker.
$ aspeak text "Hello World" -F riff-48khz-16bit-mono-pcm -o high-quality.wav
The new version of aspeak is written in Rust, and the Python binding is provided by PyO3.
Here is a simple example:
from aspeak import SpeechService
service = SpeechService()
service.connect()
service.speak_text("Hello, world")
First you need to create a SpeechService instance.
When creating a SpeechService instance, you can specify the following parameters:
- audio_format: The audio format of the output audio. Default is AudioFormat.Riff24KHz16BitMonoPcm.
  - You can get an audio format by providing a container format and a quality level: AudioFormat("mp3", 2).
- endpoint: The endpoint of the speech service. We will use a trial endpoint by default.
- region: Alternatively, you can specify the region of the speech service instead of typing the boring endpoint url.
- subscription_key: The subscription key of the speech service.
- token: The auth token for the speech service. If you provide a token, the subscription key will be ignored.
- headers: Additional HTTP headers for the speech service.
Then you need to call connect() to connect to the speech service.
After that, you can call speak_text() to speak the text or speak_ssml() to speak the SSML.
Or you can call synthesize_text() or synthesize_ssml() to get the audio data.
For synthesize_text() and synthesize_ssml(), if you provide an output, the audio data will be written to that file and the function will return None. Otherwise, the function will return the audio data.
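The two return behaviours of synthesize_text() can be sketched as follows. This is a hedged illustration rather than reference usage: the wrapper name synthesize is hypothetical, passing output as a keyword argument is an assumption, and so is the bytes return type in the annotation. The import sits inside the function so the snippet can be loaded without aspeak installed; calling it needs aspeak and network access.

```python
from typing import Optional


def synthesize(text: str, output: Optional[str] = None) -> Optional[bytes]:
    """Return the audio data, or write it to `output` and return None.

    Hypothetical wrapper for illustration; it is not part of aspeak.
    """
    # Imported lazily so this sketch loads even when aspeak is missing.
    from aspeak import SpeechService

    service = SpeechService()  # no arguments: the trial endpoint is used
    service.connect()
    if output is None:
        # No output given: the audio data is returned to the caller.
        return service.synthesize_text(text)
    # Assumption: `output` is accepted as a keyword argument.
    # With an output, the data goes to that file and None is returned.
    return service.synthesize_text(text, output=output)
```

For example, synthesize("Hello, world", output="hello.wav") would be expected to write the file and return None, while synthesize("Hello, world") would return the audio data.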
Here are the common options for speak_text() and synthesize_text():
- locale: The locale of the voice. Default is en-US.
- voice: The voice name. Default is en-US-JennyNeural.
- rate: The speaking rate of the voice. It must be a string that fits the requirements as documented in this section: Pitch and Rate
- pitch: The pitch of the voice. It must be a string that fits the requirements as documented in this section: Pitch and Rate
- style: The style of the voice.
  - You can get a list of available styles for a specific voice by executing aspeak list-voices -v <VOICE_ID>
  - The default value is general.
- style_degree: The degree of the style.
  - According to the Azure documentation, style degree specifies the intensity of the speaking style. It is a floating point number between 0.01 and 2, inclusive.
  - At the time of writing, style degree adjustments are supported for Chinese (Mandarin, Simplified) neural voices.
- role: The role of the voice.
  - According to the Azure documentation, role specifies the speaking role-play. The voice acts as a different age and gender, but the voice name isn't changed.
  - At the time of writing, role adjustments are supported for these Chinese (Mandarin, Simplified) neural voices: zh-CN-XiaomoNeural, zh-CN-XiaoxuanNeural, zh-CN-YunxiNeural, and zh-CN-YunyeNeural.
Add aspeak to your Cargo.toml:
$ cargo add aspeak
Then follow the documentation of the aspeak crate.