gosling

Natural sounding text-to-speech in the terminal (and more).

Pre-requisites

This is NOT intended to be a completely free, pick-up-and-use TTS solution. It is simply a wrapper around Google's Cloud Text-to-Speech API.

You will need:

  • A GCP account with billing enabled.
    • Google gives you 1 million characters free every month. That's nearly 10 books a month. It's essentially free for personal use.
    • Once you have a GCP account, enable the Cloud Text-to-Speech API and create a service account with a JSON key.
    • Export the service account credentials in your shell. You will need to do this every time you open a new shell, so add the export to your shell configuration or write a small script that sets it and then runs gosling.
      export GOOGLE_APPLICATION_CREDENTIALS="/path/to/your/service-account.json"
  • Internet connection every time you need some text spoken to you.
  • I have only tested this on Linux. Commands for playing audio will be different on other platforms.

Examples

Simple text with default options

defaults.mp4

Numbers and punctuation with default options

Multiple exclamation marks are something I have seen other TTS engines struggle with:

Welcome to gosling!!! It has options such as "Pitch adjustment" in the range -20.0 to 20.0, "Speaking rate/speed" in the range 0.25 to 4.0 and "Volume gain" (in dB) in the range -96.0 to 16.0.
numbers_punc.mp4

Other languages

Kannada:

kannada.mp4

Check out the full voice list. Use WaveNet or Neural2 based voices for better quality.

Installation

Pre-built binaries

Go to the latest release, scroll down to "Assets" and download the correct file for your platform. Unzip the file and run the gosling binary inside:

./gosling

If you have Go installed

go install github.com/Samyak2/gosling@latest

Usage

Text file

gosling input.txt output.mp3

Play the resulting output.mp3 file using your audio player.

Standard input

echo "hello there" | gosling - output.mp3

Play audio directly

If you have the play command, which is usually a part of the sox package (sudo dnf install sox on Fedora):

echo "hello there" | gosling - - | play -t mp3 -

If you have the ffplay command, which is a part of ffmpeg:

echo "hello there" | gosling - - | ffplay -nodisp -autoexit -
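If you don't want to remember which player is installed, the two pipelines above can be wrapped in a small shell function. This is only a minimal sketch and not part of gosling: pick_player and speak are hypothetical names, and it assumes a POSIX shell with gosling on your PATH.

```shell
# Hypothetical helper: print the player command to pipe MP3 data into.
# Prefers play (sox) and falls back to ffplay (ffmpeg).
pick_player() {
    if command -v play >/dev/null 2>&1; then
        echo "play -t mp3 -"
    elif command -v ffplay >/dev/null 2>&1; then
        echo "ffplay -nodisp -autoexit -"
    else
        echo "no supported audio player found (install sox or ffmpeg)" >&2
        return 1
    fi
}

# Hypothetical wrapper: read text from standard input and speak it.
speak() {
    player=$(pick_player) || return 1
    # $player is left unquoted on purpose so it splits into command + arguments.
    gosling - - | $player
}
```

With this sourced in your shell, echo "hello there" | speak should behave like whichever of the two pipelines above works on your machine.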

Options

gosling has many options for language, voice and audio settings.

See gosling --help for all options.

Usage: gosling <input-file> <output-file>

Arguments:
  <input-file>     Text file to read from. Use - for standard input.
  <output-file>    Audio file to write to. Use - for standard output.

Flags:
  -h, --help                            Show context-sensitive help.
  -l, --language-code="en-US"           Language code to use for the synthesis. See full list at: https://cloud.google.com/text-to-speech/docs/voices
  -v, --voice-name="en-US-Wavenet-A"    Voice name to use for the synthesis. Use an empty string to let the GCP API choose. See full list at: https://cloud.google.com/text-to-speech/docs/voices
      --pitch=-3                        Pitch adjustment in the range [-20.0, 20.0]. Use a negative number to decrease the pitch. See:
                                        https://cloud.google.com/text-to-speech/docs/reference/rest/v1/text/synthesize#audioconfig
  -r, --speaking-rate=1.0               Speaking rate/speed in the range [0.25, 4.0]. See: https://cloud.google.com/text-to-speech/docs/reference/rest/v1/text/synthesize#audioconfig
      --volume-gain=0.0                 Volume gain (in dB) in the range [-96.0, 16.0]. See: https://cloud.google.com/text-to-speech/docs/reference/rest/v1/text/synthesize#audioconfig
  -s, --[no-]ssml                       Use if text has SSML. Default is plain text. See: https://cloud.google.com/text-to-speech/docs/basics#speech_synthesis_markup_language_ssml_support
      --service-endpoint=STRING         GCP Service Endpoint. You'll need to set this if you want a Neural2 voice. See: https://cloud.google.com/text-to-speech/docs/endpoints.
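As an illustration of the --ssml flag, plain text has to be XML-escaped before it can be wrapped in an SSML <speak> element. The helper below is a rough sketch, not something gosling ships: text_to_ssml is a hypothetical name, and it only escapes &, < and > without adding any other SSML markup.

```shell
# Hypothetical helper: escape XML special characters in the text on standard
# input and wrap the result in a <speak> element.
text_to_ssml() {
    # "&" must be escaped first so the other replacements are not double-escaped.
    printf '<speak>%s</speak>\n' \
        "$(sed -e 's/&/\&amp;/g' -e 's/</\&lt;/g' -e 's/>/\&gt;/g')"
}
```

A possible use, assuming flags are accepted after the positional arguments as in the other examples here: echo "Tom & Jerry" | text_to_ssml | gosling - output.mp3 --ssml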

FAQ

The voice sounds too robotic

WaveNet

For the default language (en-US), gosling uses a WaveNet based voice (en-US-Wavenet-A). If you're using a different language, make sure to switch to a WaveNet based voice for that language too. Use --voice-name for this.

Neural2

If WaveNet is not good enough, try using a Neural2 voice type (search for Neural2 in the voice list if you need other languages):

gosling input.txt output.mp3 --service-endpoint 'https://us-central1-texttospeech.googleapis.com' -v en-US-Neural2-A

TODO: this endpoint is currently timing out for all TTS requests, not sure why.

If Neural2 isn't good enough either, well... you'll have to take this up with Google.

Why am I getting the error "google: could not find default credentials"?

Either:

  • You did not read the Pre-requisites section.
  • You forgot to export the GOOGLE_APPLICATION_CREDENTIALS environment variable in your shell.
  • Something is wrong with your GCP service account. See this page that is also linked from the error.
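A quick way to rule out the second cause is to check the variable before running gosling. This is a minimal sketch: check_gcp_credentials is a hypothetical helper, not part of gosling, and it only checks that the file is readable, not that the key inside is valid.

```shell
# Hypothetical helper: succeed only if GOOGLE_APPLICATION_CREDENTIALS is set
# and points to a readable file.
check_gcp_credentials() {
    if [ -z "${GOOGLE_APPLICATION_CREDENTIALS:-}" ]; then
        echo "GOOGLE_APPLICATION_CREDENTIALS is not set" >&2
        return 1
    fi
    if [ ! -r "$GOOGLE_APPLICATION_CREDENTIALS" ]; then
        echo "cannot read $GOOGLE_APPLICATION_CREDENTIALS" >&2
        return 1
    fi
    return 0
}
```

For example: check_gcp_credentials && gosling input.txt output.mp3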

Why don't --pitch and --volume-gain have short versions?

These options can take negative values, and the command-line parser I use behaves unpredictably when negative numbers are combined with short flags. I have removed the short versions to avoid that pitfall.
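If you script around these options, writing the value in the --flag=value form (as the help output does with --pitch=-3) keeps a negative number attached to its flag. The sketch below also checks the documented range first. in_range and pitch_flag are hypothetical helper names, and the snippet assumes awk is available.

```shell
# Hypothetical helper: succeed only if $1 is a plain decimal number in [$2, $3].
in_range() {
    awk -v v="$1" -v lo="$2" -v hi="$3" 'BEGIN {
        if (v !~ /^-?[0-9]+(\.[0-9]+)?$/) exit 1
        exit !(v + 0 >= lo + 0 && v + 0 <= hi + 0)
    }'
}

# Hypothetical helper: print a --pitch flag in the "=" form for a valid value.
pitch_flag() {
    if in_range "$1" -20 20; then
        printf '%s\n' "--pitch=$1"
    else
        echo "pitch must be in [-20.0, 20.0], got: $1" >&2
        return 1
    fi
}
```

For example: gosling input.txt output.mp3 "$(pitch_flag -5)". The same in_range check can be reused for --speaking-rate (0.25 to 4.0) and --volume-gain (-96.0 to 16.0).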

How do I use this with foliate?

I use this script:

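#!/bin/bash
# requires gosling and sox
export GOOGLE_APPLICATION_CREDENTIALS="/path/to/your/service-account.json"
# Synthesize the text from standard input and play it in the background.
gosling - - | play -t mp3 - &
# On SIGINT, stop the player and exit cleanly.
trap 'kill $!; exit 0' INT
wait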

Save this to a file and make it executable with chmod +x /path/to/foliate-gosling.sh.

TODO: this only works with English text. I need to figure out a way to convert FOLIATE_TTS_LANG_LOWER to Google's format.
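One possible direction for this TODO, as an untested sketch: it assumes FOLIATE_TTS_LANG_LOWER holds a lowercase tag such as en-us and that upper-casing the region part is enough to get Google's form (en-US). to_google_lang is a hypothetical helper, and tags with more than two parts are not handled correctly.

```shell
# Hypothetical helper: turn a lowercase tag like "en-us" into "en-US".
# Tags without a region ("en") are printed unchanged.
to_google_lang() {
    lang=${1%%-*}    # everything before the first hyphen
    case $1 in
        *-*)
            region=${1#*-}    # everything after the first hyphen
            printf '%s-%s\n' "$lang" \
                "$(printf '%s' "$region" | tr '[:lower:]' '[:upper:]')"
            ;;
        *)
            printf '%s\n' "$lang"
            ;;
    esac
}
```

In the script above, the gosling call could then become gosling - - -l "$(to_google_lang "$FOLIATE_TTS_LANG_LOWER")" -v "". The empty voice name lets the GCP API choose a voice for that language instead of the default en-US one.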

But why?

When I'm too lazy to read an article, I use Google Assistant's "read me this article" feature on my phone. It's extremely good, especially with text-only articles. I could not find an alternative on desktop (specifically, Linux).

Yes, there are quite a few text-to-speech apps on Linux. Most of them sound either like R2D2 or like something from the depths of the void. The only one I found that sounds bearable uses an undocumented Google Translate API (probably a ToS violation?). There are also some pre-trained neural-network based models, but they sound like a person speaking through a very low-bandwidth voice call, and they skip over numbers and abbreviations as if they never existed.

The only text-to-speech that sounded good was Google's. So I thought - "they must have a GCP API for this". And they did. And I hacked this together.

TODO

  • speech-dispatcher support. This will allow using it in Firefox's reader mode, for example.
  • Some pre-processing of raw text - remove extra/unnecessary punctuation, better formatting for numbers, etc.
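Until pre-processing is built in, some of it can be done in the pipeline before the text reaches gosling. A minimal sketch, assuming that collapsing runs of "!", "?" and spaces is what you want. squeeze_punctuation is a hypothetical name, and it does nothing for numbers.

```shell
# Hypothetical pre-processing filter: collapse runs of repeated "!", "?" and
# spaces on standard input into a single character each.
squeeze_punctuation() {
    tr -s '!? '
}
```

For example: squeeze_punctuation < input.txt | gosling - output.mp3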

License

MIT