Text-to-speech in (partially) C++ using Tacotron model + TensorFlow
Runs the Tacotron model through the TensorFlow C++ API.
This is useful for running TTS on mobile or embedded devices.
Code is based on keithito's tacotron implementation: https://github.com/keithito/tacotron
Status
Experimental.
Python preprocessing is required to generate sequence data from text.
Requirements
- TensorFlow r1.8+
- Ubuntu 16.04 or later
- C++ compiler + cmake
Dump graph
In keithito's tacotron repo, add a tf.train.write_graph call to Synthesizer.load to save the TensorFlow graph.
class Synthesizer:
  def load(self, checkpoint_path, model_name='tacotron'):
    ...
    # write graph
    tf.train.write_graph(self.session.graph.as_graph_def(), "models/", "graph.pb")
Freeze graph
Freeze the graph, for example:
freeze_graph \
--input_graph=models/graph.pb \
--input_checkpoint=./tacotron-20180906/model.ckpt \
--output_graph=models/tacotron_frozen.pb \
--output_node_names=model/griffinlim/Squeeze
An example frozen graph file is included in this repo.
Build
Edit the libtensorflow_cc.so path in bootstrap.sh (this assumes you built TensorFlow from source), then run:
$ ./bootstrap.sh
$ cd build
$ make
Note on libtensorflow_cc
Make sure to build libtensorflow_cc with --config=monolithic. Otherwise you will get undefined symbol errors at the linking stage.
https://www.tensorflow.org/install/source#preconfigured_configurations
Run
Prepare a sequence JSON file.
The sequence can be generated with the text_to_sequence() function in keithito's tacotron repo.
See sample/sequence01.json for a generated example.
Then,
$ ./tts -i ../sample/sequence01.json -g ../tacotron_frozen.pb output.wav
Example output01.wav and processed01.wav files are included in sample/.
Optional parameter
You can specify hyperparameter settings (JSON format) using the -h option.
See sample/hparams.json for an example.
$ ./tts -i ../sample/sequence01.json -h ../sample/hparams.json -g ../tacotron_frozen.pb output.wav
Performance
Currently the TensorFlow C++ code path uses only a single CPU core, so it is slow. On a 2018-era CPU, synthesis takes roughly 10x the length of the synthesized audio (e.g. 60 secs for 6 secs of audio).
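The ratio quoted above is commonly called the real-time factor (synthesis time divided by audio length). A minimal sketch of how it could be measured follows. The helper names TimeSeconds, AudioSeconds and RealTimeFactor are hypothetical and not part of this repo, and the sample rate in the usage note is only an illustrative value.

```cpp
#include <chrono>
#include <cstddef>

// Hypothetical helper: wall-clock seconds taken by fn().
template <typename Fn>
double TimeSeconds(Fn&& fn) {
  const auto start = std::chrono::steady_clock::now();
  fn();
  const auto end = std::chrono::steady_clock::now();
  return std::chrono::duration<double>(end - start).count();
}

// Hypothetical helper: length in seconds of a mono sample buffer.
double AudioSeconds(std::size_t num_samples, int sample_rate) {
  return static_cast<double>(num_samples) / sample_rate;
}

// Real-time factor: values above 1.0 mean slower than real time.
double RealTimeFactor(double synthesis_seconds, double audio_seconds) {
  return synthesis_seconds / audio_seconds;
}
```

With the figures from the text, RealTimeFactor(60.0, 6.0) gives 10.0. In practice the first argument would come from wrapping the synthesis call in TimeSeconds, and the second from AudioSeconds(samples.size(), sample_rate) with the sample rate taken from the hyperparameters in use.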
TODO
- Write the whole TTS pipeline in C++
- Text to sequence (Issue #1)
- Convert to lower case
- Expand abbreviations
- Normalize numbers (number_to_words; equivalent of Python's inflect)
- Remove extra whitespace
- Use CPU implementation of Griffin-Lim
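As a rough illustration of the text-cleaning items in the list above, here is a minimal C++ sketch covering lower-casing, abbreviation expansion and whitespace collapsing. It is not this repo's implementation. The function names and the tiny abbreviation table are hypothetical, only ASCII is handled, and number normalization is left out.

```cpp
#include <cctype>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

// Lower-case ASCII letters; other bytes pass through unchanged
// in the default "C" locale.
std::string ToLowerAscii(const std::string& in) {
  std::string out = in;
  for (char& c : out) {
    c = static_cast<char>(std::tolower(static_cast<unsigned char>(c)));
  }
  return out;
}

// Hypothetical, deliberately tiny abbreviation table for illustration.
const std::vector<std::pair<std::string, std::string>>& Abbreviations() {
  static const std::vector<std::pair<std::string, std::string>> table = {
      {"mr.", "mister"}, {"dr.", "doctor"}};
  return table;
}

// Lower-case, expand whole-token abbreviations, and collapse any run of
// whitespace into a single space (leading/trailing whitespace is dropped).
std::string NormalizeText(const std::string& text) {
  std::istringstream iss(ToLowerAscii(text));
  std::string token;
  std::string out;
  while (iss >> token) {
    for (const auto& kv : Abbreviations()) {
      if (token == kv.first) {
        token = kv.second;
        break;
      }
    }
    if (!out.empty()) out += ' ';
    out += token;
  }
  return out;
}
```

Splitting on whitespace keeps the sketch short, at the cost of only matching abbreviations that stand alone as a token. A fuller version would still need number expansion and the mapping from characters to symbol IDs before it could replace the Python preprocessing.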
License
MIT license.
The pretrained model used for freezing the graph was obtained from keithito's repo.
Third party licenses
- json.hpp : MIT license
- cxxopts.hpp : MIT license
- dr_wav : Public domain