f5-tts is producing garbled dubbing, as you can hear in the provided audio and SRT. It works fine in the WebUI, so why can't pyvideotrans create the audio properly?
Error message
f5-tts is producing garbled dubbing, as you can hear in the provided audio and SRT. It works fine in the WebUI, so why can't pyvideotrans create the audio properly?
srt:
1
00:00:00,000 --> 00:00:02,366
In a world where everyone has awakened, a world of advanced talents,
2
00:00:02,500 --> 00:00:05,716
a man chooses to become a jobless wanderer. His classmates mercilessly mock him,
3
00:00:05,783 --> 00:00:08,366
saying this talent can't compare to the advanced skills gained after a job change.
4
00:00:08,433 --> 00:00:09,933
They tell Yun Chen to quickly find a place to work.
5
00:00:09,933 --> 00:00:12,783
Even Teacher Rose advises Yun Chen to choose a professional talent soon,
6
00:00:12,833 --> 00:00:14,816
because the benefits after changing jobs are much greater.
Audio created by f5-tts:
https://drive.google.com/file/d/1ZRgKFunyf-LQiLfpvqs5fhCC3kNj2v-x/view?usp=sharing
Steps to reproduce
- Which feature was used
- faster mode / openai mode?
- Model name used
Operating system
- Windows
It works fine in the WebUI and creates the audio properly.
Audio created in the WebUI: https://drive.google.com/file/d/1cr6RbZR7rwc9G7NS74KNWKoUP0Qgpo8n/view?usp=sharing
Figures in English are not normalized. Will change it later
Still not working after the update to 3.20.
Still not working in pyvideotrans 3.21, as you can hear in the audio linked below. When will it be fixed?
srt:
1
00:00:00,000 --> 00:00:02,366
In a world where everyone has awakened, a world of advanced talents,
2
00:00:02,500 --> 00:00:05,716
a man chooses to become a jobless wanderer. His classmates mercilessly mock him,
3
00:00:05,783 --> 00:00:08,366
saying this talent can't compare to the advanced skills gained after a job change.
4
00:00:08,433 --> 00:00:09,933
They tell Yun Chen to quickly find a place to work.
5
00:00:09,933 --> 00:00:12,783
Even Teacher Rose advises Yun Chen to choose a professional talent soon,
6
00:00:12,833 --> 00:00:14,816
because the benefits after changing jobs are much greater.
audio created : https://drive.google.com/file/d/1uDIe3hgjYU2vt1XKkFIid3-C_jKKeXou/view?usp=sharing
Please use plain text or a valid SRT file for dubbing. Do not add other characters before the subtitles, or the timestamps will be dubbed as well.
Use the import function to load a valid local SRT file directly for dubbing.
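A minimal sketch, in Python, of what "valid SRT" means in the advice above. The `parse_srt` helper is hypothetical and is not part of pyvideotrans. It assumes each cue is an index line, a timing line, and caption text, with cues separated by blank lines, and it returns only the caption text that should reach the TTS engine.

```python
import re

# Matches an SRT timing line such as "00:00:02,500 --> 00:00:05,716".
TIMING = re.compile(
    r"^(\d{2}):(\d{2}):(\d{2}),(\d{3}) --> (\d{2}):(\d{2}):(\d{2}),(\d{3})$"
)


def _to_ms(h, m, s, ms):
    """Convert hours, minutes, seconds, milliseconds to total milliseconds."""
    return ((h * 60 + m) * 60 + s) * 1000 + ms


def parse_srt(text):
    """Return a list of (start_ms, end_ms, caption) tuples.

    Raises ValueError when a cue does not have an index line, a timing
    line, and at least one caption line.
    """
    cues = []
    # Cues are separated by one or more blank lines.
    blocks = re.split(r"\n\s*\n", text.replace("\r\n", "\n").strip())
    for block in blocks:
        lines = [ln.strip() for ln in block.split("\n")]
        if len(lines) < 3 or not lines[0].isdigit():
            raise ValueError(f"malformed cue: {block!r}")
        match = TIMING.match(lines[1])
        if match is None:
            raise ValueError(f"bad timing line: {lines[1]!r}")
        nums = [int(g) for g in match.groups()]
        start, end = _to_ms(*nums[:4]), _to_ms(*nums[4:])
        # Only the caption text is kept; index and timestamps are metadata.
        cues.append((start, end, " ".join(lines[2:])))
    return cues
```

In this sketch, a file that begins with stray text such as `srt : 1` fails the index check and raises `ValueError`, which is one way to catch extra characters before the first subtitle.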
It is still not working. I am doing everything correctly, as you can see in the video I uploaded. Please do something. You can hear the audio it created at 2:34.
Link: https://drive.google.com/file/d/1nkhXfwMBTCQrj5E_Tnabs533AgsYL8Rr/view?usp=sharing
Recording.2024-11-26.172353.mp4
Explain in words what the problem is
Is it reading out the line numbers and the time lines as well?
Delete <b> and other HTML tags from the SRT file.
No, I have shown audio created both with the <b> tag and without it (plain SRT), but it generates weird sounds instead of reading the SRT.
Make sure the SRT is valid and contains no HTML tags or similar, then rename the subtitle file to exp-01.srt and test it again!
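One way to remove the tags before dubbing, sketched in Python. The `strip_tags` helper is hypothetical and not a pyvideotrans function. It uses a simple regular expression that is adequate for the inline tags typically found in subtitles (`<b>`, `<i>`, `<font ...>`), not a full HTML parser, and it should be applied to caption text only, not to timing lines.

```python
import re

# An opening or closing tag: "<", optional "/", a letter, then anything up to ">".
TAG = re.compile(r"</?[A-Za-z][^<>]*>")


def strip_tags(caption):
    """Remove simple HTML-style tags from one caption and tidy whitespace."""
    without_tags = TAG.sub("", caption)
    # Collapse any runs of whitespace left behind by removed tags.
    return re.sub(r"\s+", " ", without_tags).strip()
```

For example, `strip_tags("<b>They tell Yun Chen</b> to quickly find a place to work.")` returns the sentence without the `<b>` pair.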
I tried again as you suggested. You can hear the sound it created at 02:58. Does it support any format other than SRT?
You can listen to it:
video2_2.mp4
srt:
0
00:00:00,000 --> 00:00:02,366
In a world where everyone has awakened, a world of advanced talents,
1
00:00:02,500 --> 00:00:05,716
a man chooses to become a jobless wanderer. His classmates mercilessly mock him,
2
00:00:05,783 --> 00:00:08,366
saying this talent can't compare to the advanced skills gained after a job change.
3
00:00:08,433 --> 00:00:09,933
They tell Yun Chen to quickly find a place to work.
4
00:00:09,933 --> 00:00:12,783
Even Teacher Rose advises Yun Chen to choose a professional talent soon,
5
00:00:12,833 --> 00:00:14,816
because the benefits after changing jobs are much greater.
It always sounds like a mix of English and French.
I did that, but it still doesn't work and produces audio that sounds weird. Does it sound all right on your PC?
Here you can listen to the audio it created:
audio.mp4
1
00:00:00,000 --> 00:00:02,366
In a world where everyone has awakened, a world of advanced talents,
2
00:00:02,500 --> 00:00:05,716
a man chooses to become a jobless wanderer. His classmates mercilessly mock him,
3
00:00:05,783 --> 00:00:08,366
saying this talent can't compare to the advanced skills gained after a job change.
4
00:00:08,433 --> 00:00:09,933
They tell Yun Chen to quickly find a place to work.
5
00:00:09,933 --> 00:00:12,783
Even Teacher Rose advises Yun Chen to choose a professional talent soon,
6
00:00:12,833 --> 00:00:14,816
because the benefits after changing jobs are much greater.
I tested it and found no problem.
I tested the cloned voice using f5-tts in pyvideotrans 3.25, but the issue of the voice being unrecognizable is still unresolved. Interestingly, the same voice works perfectly in WEBUI, but not in pyvideotrans.
To rule out system-specific issues, I also tested it on my friend's PC. Unfortunately, it didn’t work in pyvideotrans there either, though it still worked fine in WEBUI.
The WebUI performs recognition with openai-whisper's large-v3-turbo model and cuts the audio with VAD before recognition.
Through the API, pyvideotrans performs recognition with the model you specified, and the audio is cut differently.
Differences between the two are normal: the models are different and the cutting parameters are different, so the results cannot be the same.
When will this issue be solved? I can't clone voices in pyvideotrans.
I don't understand what you mean. If you mean that it works well in the WebUI and poorly through the API, that is normal.
If you mean that it works fine in the WebUI but the audio cloned through the API doesn't correspond to the actual text at all, I haven't tested that!
I gave it an SRT of only 22 seconds, but it created audio made up of repeated nonsense lasting 2 minutes 26 seconds.
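The mismatch described above can be put into numbers with a short Python sketch. Both helpers are hypothetical and not part of pyvideotrans. The sketch reads the latest cue end time from the SRT timing lines and divides the generated audio length by it. Some difference between subtitle and dubbing duration is to be expected, so a ratio far above 1 is only a hint that the output has little to do with the input.

```python
import re

# Captures the end timestamp of a timing line: "--> HH:MM:SS,mmm".
END_STAMP = re.compile(r"--> (\d{2}):(\d{2}):(\d{2}),(\d{3})")


def srt_end_seconds(srt_text):
    """Return the latest cue end time in seconds."""
    ends = [
        int(h) * 3600 + int(m) * 60 + int(s) + int(ms) / 1000
        for h, m, s, ms in END_STAMP.findall(srt_text)
    ]
    if not ends:
        raise ValueError("no timing lines found")
    return max(ends)


def length_ratio(srt_text, audio_seconds):
    """Return how many times longer the audio is than the subtitle span."""
    return audio_seconds / srt_end_seconds(srt_text)
```

With the SRT from this comment (last cue ending at 00:00:22,060) and an audio length of 146 seconds (2 minutes 26 seconds), the ratio comes out well above 6.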
The SRT I gave it to clone the voice:
1
00:00:00,000 --> 00:00:05,660
在入学典礼上,一群充满期待的学生挤满了体育场。
2
00:00:06,020 --> 00:00:08,440
他们的注意力全都集中在舞台上。
3
00:00:08,600 --> 00:00:14,240
一位宿舍老师在台上宣布,让我们热烈欢迎军队武术教官——
4
00:00:14,240 --> 00:00:18,680
张凯教官上台。
5
00:00:18,880 --> 00:00:22,060
他既是你们的校长,也是你们的老师。
Transcription of the cloned voice it created:
1
00:00:00,000 --> 00:00:10,940
一位宿舍是内安中,
2
00:00:11,799 --> 00:00:12,740
Hello, my friend,
3
00:00:12,740 --> 00:00:13,980
我是盲学生店里上进。
4
00:00:16,400 --> 00:00:17,860
并发财能出,
5
00:00:18,440 --> 00:00:21,440
仅满的学生充满学育群。
6
00:00:22,020 --> 00:00:23,640
又存在my friend,
7
00:00:24,520 --> 00:00:27,160
一台在入学的入学店里的
8
00:00:27,160 --> 00:00:27,840
老朋友。
9
00:00:30,000 --> 00:00:30,960
他需要联系
10
00:00:30,960 --> 00:00:33,360
我们这一封说愿素脚本
11
00:00:33,360 --> 00:00:35,320
但牛顿需要确定的词
12
00:00:36,140 --> 00:00:36,700
一
13
00:00:37,699 --> 00:00:39,060
How do you
14
00:00:40,239 --> 00:00:42,340
Is Lynn all my friend?
15
00:00:43,640 --> 00:00:44,720
Cassandra 应
16
00:00:45,400 --> 00:00:46,900
一全都集中在
17
00:00:47,980 --> 00:00:50,540
他们居然一全在舞台上
18
00:00:51,300 --> 00:00:52,740
天然的人生
19
00:00:52,740 --> 00:00:54,420
You know my dear Freddie Frank
20
00:00:54,420 --> 00:00:54,640
E
21
00:00:56,339 --> 00:00:57,680
The English Beeman
22
00:00:57,680 --> 00:00:59,880
一郎出现了
23
00:01:00,540 --> 00:01:01,920
一位宿舍是
24
00:01:02,900 --> 00:01:03,820
内安中
25
00:01:04,660 --> 00:01:06,860
Hello 吗是盲学生店里上进
26
00:01:09,060 --> 00:01:11,200
被引发财云抽而上
27
00:01:11,200 --> 00:01:12,720
挤满了学生
28
00:01:12,720 --> 00:01:14,280
充满了山雨群
29
00:01:14,979 --> 00:01:16,520
又存在my friend
en.mp4
1
00:00:01,950 --> 00:00:04,430
Several molecules have been found in the Five Elder Star Systems,
2
00:00:04,720 --> 00:00:06,780
We are still a long way from the third kind of contact.
3
00:00:07,260 --> 00:00:09,880
We have really started the photography mission on Weibo for a year,
4
00:00:10,140 --> 00:00:12,920
Recently,many photos that were difficult to take in the past have been uploaded.
5
00:00:13,440 --> 00:00:17,500
In early June,astronomers published this photo in Nature Periodicals,
I tested it without any problem.
Please make sure that f5-tts-api has downloaded the patch package and that pyvideotrans is upgraded to 3.26, and please make sure the reference audio and reference text are correct.
It is normal for the subtitle duration to be inconsistent with the dubbing duration.
The 5s.wav in the above picture is the reference audio, and the text after # is the text corresponding to that reference audio.
5s.wav is stored in the f5-tts folder in the same directory as sp.exe.
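A small Python sketch of the reference entry described above. It assumes the entry is written as `5s.wav#reference text`, which is a reading of this description rather than confirmed pyvideotrans syntax, and the `parse_reference_entry` helper is hypothetical. The point it illustrates is that the part after `#` is a transcript of the reference audio and must not be empty or arbitrary.

```python
from pathlib import Path


def parse_reference_entry(entry, audio_dir):
    """Split 'file.wav#reference text' into (audio_path, reference_text).

    audio_dir is the folder holding the reference audio (the f5-tts folder
    next to sp.exe, according to the description above).
    """
    name, sep, text = entry.partition("#")
    name, text = name.strip(), text.strip()
    if not sep or not name or not text:
        raise ValueError("expected '<audio file>#<text spoken in that audio>'")
    return Path(audio_dir) / name, text
```

A caller could then check `audio_path.is_file()` and compare the returned text against what is actually spoken in the clip. The sketch cannot verify that match itself.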
f5-tts-api patch update
https://github.com/jianchang512/f5-tts-api/releases/tag/v0.1
https://github.com/jianchang512/f5-tts-api/releases/download/v0.1/2024-1127-buding.7z
It's solved! I realized I was making one fatal mistake, which is why the cloned audio was pronouncing words beyond recognition. The mistake was that I was putting whatever I wanted after the #. I thought it was only there to test whether the API worked.
However, I realized from the recent solution you provided that the text after the # should correspond to the reference audio.