eval form checkpoints
marymirzaei opened this issue · 30 comments
Thanks for your nice work. I have trained the model on the Blizzard 2013 dataset. The synthesized files from the 185k and 385k checkpoints are available at the link below. I used samples from LJ-Speech (LJ001-0001.wav) and Nancy (nancy.wav) as reference files to check the performance. I also included the model checkpoint files and the audio files at each step (step-185000-audio.wav, step-385000-audio.wav).
https://www.dropbox.com/sh/jhcynw65o1tmj7r/AABJN4cBotdbs-A5-Rk89vt0a?dl=0
Any idea how to improve the shaky voice?
Yes, the voice from the link is a little shaky. Can you share the hyper-parameters and the alignments of the test sentences of your experiments?
Besides, I found that character-level inputs work better than phoneme-level inputs in my earlier experiments, though the paper used phoneme inputs.
@lapwing I updated some code today, and the quality of the generated speech is much better than before (87K steps). It also works with a small reduce factor.
eval-87k-r2.zip
I will also evaluate the performance in the coming days.
The new update seems to set cmu-dict to false. Is that what you used to get those results?
Thank you very much! Sorry for the delay... I have uploaded the files you wanted to the link above, if you still want to take a look.
The quality is very good! Do you have any plans to release the checkpoints?
@fazlekarim Yes, the hparams in the repo match my experiments exactly.
@lapwing I didn't see the settings in your link; anyway, I suggest you try the new code.
Since I only have a single GPU, it will take several days to test the new code. I would be very grateful if you could help test the performance. For now, I find the quality is better, but the style is learned more slowly than before (100K steps). I will continue training to see whether it gets stable results with and without style attention. I will also upload the checkpoints and new samples once I finish these experiments in a few days.
Sure! I will do the same and will upload the results so that we can compare. Thanks for your nice work!
@syang1993 Are those samples generated directly from Tacotron? The audio quality is amazing.
@butterl Which sample do you mean? The samples attached in this issue were generated directly from the gst-tacotron repo using Blizzard 2011 data. The samples on the demo page were also generated directly from gst-tacotron using Blizzard 2013 data. I also did experiments with Tacotron using the BC2011 data; those samples can be found in keithito/tacotron#182
@syang1993 Thanks for reaching out. I've tried keithito/tacotron and Rayhane-mamah/Tacotron-2; both generate wavs with shaking and echo like @lapwing's samples, or even worse (even with a WaveNet vocoder at 300K steps), while your attached samples are much clearer. You posted "I updated some codes today" 15 days ago, but I can't find the exact patch.
I will try this amazing repo to reproduce the results.
@butterl Maybe you can try the modified keithito tacotron in my repo, which is forked from the original and fixes the issues to support a small reduce factor. @fazlekarim may have tried this repo; I'm not sure whether he got good results. The commit of "I updated some code today" is ba10ee1
@butterl I was satisfied with my results. I can show them to you if you are interested.
@fazlekarim Thanks for reaching out. I'm very interested in your samples, because mine are much worse with other repos even when trained to 400K steps. I will now switch to this one and give feedback.
This is the only one I have saved in this computer. Let me know what you think about it.
@fazlekarim Thanks for reaching out; the wav is good, but it seems to have more shaking than the eval-87k-r2.zip @syang1993 shared.
@syang1993 I trained on my machine and the result is good, but eval fails sometimes (2/3).
And with use_gst=False, eval returns an error:
Use random weight for GST.
Traceback (most recent call last):
File "eval.py", line 65, in <module>
main()
File "eval.py", line 61, in main
run_eval(args)
File "eval.py", line 25, in run_eval
synth.load(args.checkpoint, args.reference_audio)
File "/home/public/gst-tacotron/synthesizer.py", line 29, in load
self.model.initialize(inputs, input_lengths, mel_targets=mel_targets, reference_mel=reference_mel)
File "/home/public/gst-tacotron/models/tacotron.py", line 88, in initialize
style_embeddings = tf.matmul(random_weights, tf.nn.tanh(gst_tokens))
UnboundLocalError: local variable 'gst_tokens' referenced before assignment
@butterl How many steps did you train? Did you also use the BC2013 or BC2011 data?
If you set use_gst=False, it means you will not use the style attention, so you must feed reference_audio to the model during eval.
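The two modes can be sketched as follows (a minimal NumPy stand-in with hypothetical names, not the repo's actual API): with use_gst=True the style embedding is an attention-weighted sum over the learned tokens, while with use_gst=False the reference encoder is the only source of a style embedding, which is why eval without a reference audio hits the UnboundLocalError above.

```python
import numpy as np

# Minimal sketch (hypothetical names, NumPy stand-in for the TF graph) of
# how the style embedding is produced in the two modes.
def style_embedding(use_gst, reference_encoding=None,
                    gst_tokens=None, attention_weights=None):
    if use_gst:
        # attention-weighted sum over the learned global style tokens,
        # mirroring tf.matmul(random_weights, tf.nn.tanh(gst_tokens))
        return attention_weights @ np.tanh(gst_tokens)
    if reference_encoding is None:
        # without tokens, a reference encoding is the only style source
        raise ValueError("use_gst=False requires a reference audio encoding")
    return reference_encoding
```

So an eval run with use_gst=False and no reference audio has no way to build a style embedding at all.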
@syang1993 The training step is 77k.
I tried two experiments on eval:
- use_gst=True, feeding a wav from the training set: the output sometimes fails (not aligned and the wav is quiet)
- use_gst=False, with a reference_audio path fed in: the error below appears; it seems the network shapes don't match
Loading checkpoint: ./logs-tacotron/model.ckpt-77000
Traceback (most recent call last):
File "/home/public/.local/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 1361, in _do_call
return fn(*args)
File "/home/public/.local/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 1340, in _run_fn
target_list, status, run_metadata)
File "/home/public/.local/lib/python3.5/site-packages/tensorflow/python/framework/errors_impl.py", line 516, in __exit__
c_api.TF_GetCode(self.status.status))
tensorflow.python.framework.errors_impl.InvalidArgumentError: Assign requires shapes of both tensors to match. lhs shape= [384,256] rhs shape= [512,256]
[[Node: save/Assign_152 = Assign[T=DT_FLOAT, _class=["loc:@model/inference/memory_layer/kernel"], use_locking=true, validate_shape=true, _device="/job:localhost/replica:0/task:0/device:GPU:0"](model/inference/memory_layer/kernel, save/RestoreV2/_213)]]
[[Node: save/RestoreV2/_154 = _Send[T=DT_FLOAT, client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device_incarnation=1, tensor_name="edge_160_save/RestoreV2", _device="/job:localhost/replica:0/task:0/device:CPU:0"](save/RestoreV2:169)]]
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "eval.py", line 65, in <module>
main()
File "eval.py", line 61, in main
run_eval(args)
File "eval.py", line 25, in run_eval
synth.load(args.checkpoint, args.reference_audio)
File "/home/public/gst-tacotron/synthesizer.py", line 37, in load
saver.restore(self.session, checkpoint_path)
File "/home/public/.local/lib/python3.5/site-packages/tensorflow/python/training/saver.py", line 1755, in restore
{self.saver_def.filename_tensor_name: save_path})
File "/home/public/.local/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 905, in run
run_metadata_ptr)
File "/home/public/.local/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 1137, in _run
feed_dict_tensor, options, run_metadata)
File "/home/public/.local/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 1355, in _do_run
options, run_metadata)
File "/home/public/.local/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 1374, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InvalidArgumentError: Assign requires shapes of both tensors to match. lhs shape= [384,256] rhs shape= [512,256]
[[Node: save/Assign_152 = Assign[T=DT_FLOAT, _class=["loc:@model/inference/memory_layer/kernel"], use_locking=true, validate_shape=true, _device="/job:localhost/replica:0/task:0/device:GPU:0"](model/inference/memory_layer/kernel, save/RestoreV2/_213)]]
[[Node: save/RestoreV2/_154 = _Send[T=DT_FLOAT, client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device_incarnation=1, tensor_name="edge_160_save/RestoreV2", _device="/job:localhost/replica:0/task:0/device:CPU:0"](save/RestoreV2:169)]]
Caused by op 'save/Assign_152', defined at:
File "eval.py", line 65, in <module>
main()
File "eval.py", line 61, in main
run_eval(args)
File "eval.py", line 25, in run_eval
synth.load(args.checkpoint, args.reference_audio)
File "/home/public/gst-tacotron/synthesizer.py", line 36, in load
saver = tf.train.Saver()
File "/home/public/.local/lib/python3.5/site-packages/tensorflow/python/training/saver.py", line 1293, in __init__
self.build()
File "/home/public/.local/lib/python3.5/site-packages/tensorflow/python/training/saver.py", line 1302, in build
self._build(self._filename, build_save=True, build_restore=True)
File "/home/public/.local/lib/python3.5/site-packages/tensorflow/python/training/saver.py", line 1339, in _build
build_save=build_save, build_restore=build_restore)
File "/home/public/.local/lib/python3.5/site-packages/tensorflow/python/training/saver.py", line 796, in _build_internal
restore_sequentially, reshape)
File "/home/public/.local/lib/python3.5/site-packages/tensorflow/python/training/saver.py", line 471, in _AddRestoreOps
assign_ops.append(saveable.restore(saveable_tensors, shapes))
File "/home/public/.local/lib/python3.5/site-packages/tensorflow/python/training/saver.py", line 161, in restore
self.op.get_shape().is_fully_defined())
File "/home/public/.local/lib/python3.5/site-packages/tensorflow/python/ops/state_ops.py", line 280, in assign
validate_shape=validate_shape)
File "/home/public/.local/lib/python3.5/site-packages/tensorflow/python/ops/gen_state_ops.py", line 58, in assign
use_locking=use_locking, name=name)
File "/home/public/.local/lib/python3.5/site-packages/tensorflow/python/framework/op_def_library.py", line 787, in _apply_op_helper
op_def=op_def)
File "/home/public/.local/lib/python3.5/site-packages/tensorflow/python/framework/ops.py", line 3271, in create_op
op_def=op_def)
File "/home/public/.local/lib/python3.5/site-packages/tensorflow/python/framework/ops.py", line 1650, in __init__
self._traceback = self._graph._extract_stack() # pylint: disable=protected-access
InvalidArgumentError (see above for traceback): Assign requires shapes of both tensors to match. lhs shape= [384,256] rhs shape= [512,256]
[[Node: save/Assign_152 = Assign[T=DT_FLOAT, _class=["loc:@model/inference/memory_layer/kernel"], use_locking=true, validate_shape=true, _device="/job:localhost/replica:0/task:0/device:GPU:0"](model/inference/memory_layer/kernel, save/RestoreV2/_213)]]
[[Node: save/RestoreV2/_154 = _Send[T=DT_FLOAT, client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device_incarnation=1, tensor_name="edge_160_save/RestoreV2", _device="/job:localhost/replica:0/task:0/device:CPU:0"](save/RestoreV2:169)]]
@butterl Since the model is more complex than Tacotron, it may need more data and training steps to converge. The use_gst flag means two different models; you must train a new model with the use_gst=False setting.
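The shape error in the traceback illustrates this concretely: the first dimension of memory_layer/kernel is the encoder output width plus the style embedding width, and that differs between the two variants, so a checkpoint from one setting cannot be restored into a graph built with the other. A hedged sketch follows; the 256/128 split is an assumption chosen to match the [384,256] vs [512,256] shapes in the log, not verified hparams.

```python
# Why the restore fails: the attention memory layer's input width depends
# on the size of the style embedding concatenated to the encoder outputs.
# All dimensions below are assumptions matching the logged shapes.
ENCODER_DIM = 256

def memory_layer_input_dim(style_dim):
    # encoder outputs are concatenated with the style embedding
    return ENCODER_DIM + style_dim

graph_dim = memory_layer_input_dim(style_dim=128)  # graph built for eval
ckpt_dim = memory_layer_input_dim(style_dim=256)   # checkpoint on disk
# mismatch -> "Assign requires shapes of both tensors to match"
assert (graph_dim, ckpt_dim) == (384, 512) and graph_dim != ckpt_dim
```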
@syang1993 Thanks! Will wait to see good results.
BTW, could we feed the eval mel to r9y9's WaveNet?
@syang1993 Tried with the 100K model; the output is good, but the eval text gets cut at ",".
E.g. "he'd like to help the girl, who's wearing the red coat." only outputs the wav before the ", ", and outputs everything when the "," is removed. I added some prints:
wav = audio.inv_preemphasis(wav)
print("wav len=" + str(len(wav)))
end_point = audio.find_endpoint(wav)
wav = wav[:end_point]
print("wav len=" + str(len(wav)))
wav len=400600
wav len=102400
It seems the wav is cut here by the silence detection.
@butterl Without the third line, does the generated wav contain the latter speech? If it does, maybe we need to increase the value of min_silence_sec (default 0.8) in the find_endpoint function. Thanks for pointing it out.
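For reference, the endpoint logic can be sketched like this (a simplified re-implementation, not the repo's exact code; SAMPLE_RATE and the window/hop choices are assumptions). Raising min_silence_sec lets synthesized pauses at commas stay below the silence cutoff without truncating the wav:

```python
import numpy as np

SAMPLE_RATE = 16000  # assumption; use the value from your hparams

def find_endpoint(wav, threshold_db=-40.0, min_silence_sec=0.8):
    """Return the sample index just past the first stretch of
    min_silence_sec whose peak amplitude stays below threshold_db."""
    window = int(SAMPLE_RATE * min_silence_sec)
    hop = window // 4
    threshold = 10.0 ** (threshold_db / 20.0)  # dB -> linear amplitude
    for start in range(hop, len(wav) - window, hop):
        if np.max(np.abs(wav[start:start + window])) < threshold:
            return start + hop
    return len(wav)
```

With a longer min_silence_sec, the detector requires a longer stretch of quiet before declaring the end of speech, so a comma pause no longer triggers the cut.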
I think the new code works very well. I trained up to 437k steps, and you can find the samples generated using your reference_audio-2.wav file at the following link:
https://www.dropbox.com/sh/8cbrog2mtc8h8xw/AABOTLi0j8-06At3zdrHeQNra?dl=0
However, I found that every time I run eval from the same checkpoint I get different results. Why is that?
@lapwing Thanks for sharing; it sounds good. I'm not sure why it generates different results; there may be a generation issue. I'm on summer vacation these weeks and cannot test it. I will test it later to find what causes this problem. If you figure it out, could you let me know? Thanks.
@syang1993 - is it possible to get the trained model that you used to generate the samples for eval-87k-r2.zip?
@ZohaibAhmed Hi, since I'm on summer vacation these weeks, I will send it to you after I go back to school. Besides, you can reproduce this model using the Blizzard 2011 database; it will not take that long.
@lapwing Could you share the hyper-parameters?
The pretrained model couldn't be reloaded with the default hyper-parameters.
Thank you!
@syang1993 Hi, Could you send me the trained model that you used to generate the samples for eval-87k-r2.zip? Thank you!
@lapwing can you share the 437k model ?
@lapwing Thanks for sharing , it sounds good. I'm not sure why it generate different results, there may exist a generation issue. [...]
At the top of eval.py, before anything else is imported, I put:
import random
random.seed(42)
import numpy
numpy.random.seed(42)
from tensorflow import set_random_seed
set_random_seed(42)
This sets a fixed seed for all the random number generators that could be involved, and it does the trick. I don't see any random numbers used in the gst-tacotron code itself that would cause randomness at inference time, but maybe something is going on in an imported library. In any case, the fixed seeds lead to reproducible results.
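A minimal demonstration of the effect (NumPy only; the seeds above fix the randomness in the TF graph the same way):

```python
import numpy as np

def run_once(seed=42):
    # stand-in for one eval run: fix the seed, then draw "random" numbers
    rng = np.random.default_rng(seed)
    return rng.standard_normal(5)

# same seed -> bit-identical draws, so two eval runs give the same output
assert np.array_equal(run_once(), run_once())
# different seeds -> different draws, like the unseeded behaviour
assert not np.array_equal(run_once(seed=42), run_once(seed=7))
```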
Hello! Thank you for your work! Could you send me the pretrained model please? luantunez95@gmail.com