jianchang512/pyvideotrans

f5-tts produces weird dubbing in pyvideotrans, as the provided audio and SRT show, but it works fine in the WebUI. Why can't pyvideotrans create the audio properly?

Opened this issue · 25 comments

Error message
f5-tts produces weird dubbing; the provided audio and SRT show how bad it is. It works fine in the WebUI, so why can't pyvideotrans create the audio properly?

srt:
1
00:00:00,000 --> 00:00:02,366
In a world where everyone has awakened, a world of advanced talents,

2
00:00:02,500 --> 00:00:05,716
a man chooses to become a jobless wanderer. His classmates mercilessly mock him,

3
00:00:05,783 --> 00:00:08,366
saying this talent can't compare to the advanced skills gained after a job change.

4
00:00:08,433 --> 00:00:09,933
They tell Yun Chen to quickly find a place to work.

5
00:00:09,933 --> 00:00:12,783
Even Teacher Rose advises Yun Chen to choose a professional talent soon,

6
00:00:12,833 --> 00:00:14,816
because the benefits after changing jobs are much greater.

Link to the audio created by f5-tts:
https://drive.google.com/file/d/1ZRgKFunyf-LQiLfpvqs5fhCC3kNj2v-x/view?usp=sharing

Steps to reproduce

  1. Which feature was used
  2. faster mode / openai mode?
  3. Model name used

Operating system

Numbers in English text are not normalized yet. I will change that later.

It is still not working after the update to 3.20.

It is still not working in pyvideotrans version 3.21, as you can hear in the audio linked below. When will it be fixed?
srt:
1
00:00:00,000 --> 00:00:02,366
In a world where everyone has awakened, a world of advanced talents,

2
00:00:02,500 --> 00:00:05,716
a man chooses to become a jobless wanderer. His classmates mercilessly mock him,

3
00:00:05,783 --> 00:00:08,366
saying this talent can't compare to the advanced skills gained after a job change.

4
00:00:08,433 --> 00:00:09,933
They tell Yun Chen to quickly find a place to work.

5
00:00:09,933 --> 00:00:12,783
Even Teacher Rose advises Yun Chen to choose a professional talent soon,

6
00:00:12,833 --> 00:00:14,816
because the benefits after changing jobs are much greater.

Audio created: https://drive.google.com/file/d/1uDIe3hgjYU2vt1XKkFIid3-C_jKKeXou/view?usp=sharing

Please use plain text or a valid SRT file for dubbing. Do not add other characters before the subtitles, otherwise the timestamps will be dubbed as well.

Use the import function to load a valid local SRT file directly for dubbing.
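One way to check a file before importing it is sketched below. This is not pyvideotrans code; `parse_srt` is a hypothetical helper that assumes the common SRT layout (a number line, a timing line, then caption text, with blank lines between cues) and rejects anything else, such as a label typed in front of the first cue.

```python
import re

# Matches an SRT timing line such as "00:00:02,500 --> 00:00:05,716".
TIMING = re.compile(r"^\d{2}:\d{2}:\d{2},\d{3} --> \d{2}:\d{2}:\d{2},\d{3}$")


def parse_srt(text):
    """Return a list of (index, timing, caption) tuples, or raise ValueError.

    Hypothetical helper for illustration; not part of pyvideotrans.
    """
    cues = []
    # Cues are separated by one or more blank lines.
    for block in re.split(r"\n\s*\n", text.strip()):
        lines = [ln.strip() for ln in block.splitlines()]
        if len(lines) < 3:
            raise ValueError(f"incomplete cue: {block!r}")
        if not lines[0].isdigit():
            raise ValueError(f"cue must start with a number, got {lines[0]!r}")
        if not TIMING.match(lines[1]):
            raise ValueError(f"bad timing line: {lines[1]!r}")
        # Join multi-line captions into one string of text to be spoken.
        cues.append((int(lines[0]), lines[1], " ".join(lines[2:])))
    return cues
```

Only the third element of each tuple is caption text. If a file fails this check, the numbers and timing lines may end up being read aloud.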

It is not working. I am doing everything correctly, as you can see in the video I uploaded. Please do something. You can hear the audio it created at 2:34.
Link: https://drive.google.com/file/d/1nkhXfwMBTCQrj5E_Tnabs533AgsYL8Rr/view?usp=sharing

Recording.2024-11-26.172353.mp4

Explain in words what the problem is

Is it reading out the line numbers and the time lines as well?

Delete <b> and other HTML tags from the SRT file.
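A minimal sketch of that cleanup, assuming the tags are simple inline markup such as <b>, <i> or <font ...>. `strip_tags` is a hypothetical helper, not a pyvideotrans function, and the regular expression is deliberately narrow so that the `-->` in timing lines is left alone.

```python
import re

# An opening or closing tag: "<", optional "/", a letter, then anything up to ">".
TAG = re.compile(r"</?[A-Za-z][^>]*>")


def strip_tags(line):
    """Remove inline HTML-style tags from one line of an SRT file."""
    return TAG.sub("", line)
```

Applying it line by line keeps the cue numbers and timestamps unchanged, for example `"\n".join(strip_tags(ln) for ln in text.splitlines())`.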

No. I have shown the audio created both with the <b> tag and without it, from a plain SRT, but it generates weird sounds instead of reading the SRT.

Make sure the SRT is valid and contains no HTML tags or similar markup, then rename the subtitle file to exp-01.srt and test again.

I tried again as you said. You can hear the sound it created at 02:58. Does it support any format other than SRT?
You can listen to it here:

video2_2.mp4

srt :
0
00:00:00,000 --> 00:00:02,366
In a world where everyone has awakened, a world of advanced talents,

1
00:00:02,500 --> 00:00:05,716
a man chooses to become a jobless wanderer. His classmates mercilessly mock him,

2
00:00:05,783 --> 00:00:08,366
saying this talent can't compare to the advanced skills gained after a job change.

3
00:00:08,433 --> 00:00:09,933
They tell Yun Chen to quickly find a place to work.

4
00:00:09,933 --> 00:00:12,783
Even Teacher Rose advises Yun Chen to choose a professional talent soon,

5
00:00:12,833 --> 00:00:14,816
because the benefits after changing jobs are much greater.

It always sounds like a mix of English and French.

You could have just typed the text in like this.

image

If it's not a formatting problem, but just a pronunciation problem, that won't solve it.

Or you can open the api.py file under f5-tts-api and refer to the source code to modify it.

image
You can directly enter text for dubbing

If dubbing produces audio without errors and the only issue is that it is not pronounced correctly, as you describe, like a mix of English and French, then it is not a bug.

I did that, but it still doesn't work and produces audio that sounds weird. Does it sound all right on your PC?
Here you can listen to the audio it created:

audio.mp4
1
00:00:00,000 --> 00:00:02,366
In a world where everyone has awakened, a world of advanced talents,

2
00:00:02,500 --> 00:00:05,716
a man chooses to become a jobless wanderer. His classmates mercilessly mock him,

3
00:00:05,783 --> 00:00:08,366
saying this talent can't compare to the advanced skills gained after a job change.

4
00:00:08,433 --> 00:00:09,933
They tell Yun Chen to quickly find a place to work.

5
00:00:09,933 --> 00:00:12,783
Even Teacher Rose advises Yun Chen to choose a professional talent soon,

6
00:00:12,833 --> 00:00:14,816
because the benefits after changing jobs are much greater.

image

I tested it and had no problem.

I tested the cloned voice using f5-tts in pyvideotrans 3.25, but the issue of the voice being unrecognizable is still unresolved. Interestingly, the same voice works perfectly in WEBUI, but not in pyvideotrans.

To rule out system-specific issues, I also tested it on my friend's PC. Unfortunately, it didn’t work in pyvideotrans there either, though it still worked fine in WEBUI.

In the WebUI, recognition uses openai-whisper's large-v3-turbo model, and the audio is cut with VAD before recognition.

With the API, recognition in pyvideotrans uses the model you specify, and the audio is cut differently.

Differences between the two are normal. The models are different and the cutting parameters are different, so the results cannot be the same.

When will this issue be solved? I can't clone a voice in pyvideotrans.

I don't understand what you mean. If you mean that it works well in the WebUI and poorly through the API, that is normal.

If you mean that it works fine in the WebUI but the voice cloned through the API does not correspond to the actual text at all, I have not tested that.

I gave it an SRT of only 22 seconds, but it created a voice track of repeated nonsense lasting 2 minutes 26 seconds.
The SRT I gave it for voice cloning:
1
00:00:00,000 --> 00:00:05,660
在入学典礼上,一群充满期待的学生挤满了体育场。

2
00:00:06,020 --> 00:00:08,440
他们的注意力全都集中在舞台上。

3
00:00:08,600 --> 00:00:14,240
一位宿舍老师在台上宣布,让我们热烈欢迎军队武术教官——

4
00:00:14,240 --> 00:00:18,680
张凯教官上台。

5
00:00:18,880 --> 00:00:22,060
他既是你们的校长,也是你们的老师。

Transcription of the cloned voice it created:
1
00:00:00,000 --> 00:00:10,940
一位宿舍是内安中,

2
00:00:11,799 --> 00:00:12,740
Hello, my friend,

3
00:00:12,740 --> 00:00:13,980
我是盲学生店里上进。

4
00:00:16,400 --> 00:00:17,860
并发财能出,

5
00:00:18,440 --> 00:00:21,440
仅满的学生充满学育群。

6
00:00:22,020 --> 00:00:23,640
又存在my friend,

7
00:00:24,520 --> 00:00:27,160
一台在入学的入学店里的

8
00:00:27,160 --> 00:00:27,840
老朋友。

9
00:00:30,000 --> 00:00:30,960
他需要联系

10
00:00:30,960 --> 00:00:33,360
我们这一封说愿素脚本

11
00:00:33,360 --> 00:00:35,320
但牛顿需要确定的词

12
00:00:36,140 --> 00:00:36,700

13
00:00:37,699 --> 00:00:39,060
How do you

14
00:00:40,239 --> 00:00:42,340
Is Lynn all my friend?

15
00:00:43,640 --> 00:00:44,720
Cassandra 应

16
00:00:45,400 --> 00:00:46,900
一全都集中在

17
00:00:47,980 --> 00:00:50,540
他们居然一全在舞台上

18
00:00:51,300 --> 00:00:52,740
天然的人生

19
00:00:52,740 --> 00:00:54,420
You know my dear Freddie Frank

20
00:00:54,420 --> 00:00:54,640
E

21
00:00:56,339 --> 00:00:57,680
The English Beeman

22
00:00:57,680 --> 00:00:59,880
一郎出现了

23
00:01:00,540 --> 00:01:01,920
一位宿舍是

24
00:01:02,900 --> 00:01:03,820
内安中

25
00:01:04,660 --> 00:01:06,860
Hello 吗是盲学生店里上进

26
00:01:09,060 --> 00:01:11,200
被引发财云抽而上

27
00:01:11,200 --> 00:01:12,720
挤满了学生

28
00:01:12,720 --> 00:01:14,280
充满了山雨群

29
00:01:14,979 --> 00:01:16,520
又存在my friend

en.mp4
1
00:00:01,950 --> 00:00:04,430
Several molecules have been found in the Five Elder Star Systems,

2
00:00:04,720 --> 00:00:06,780
We are still a long way from the third kind of contact.

3
00:00:07,260 --> 00:00:09,880
We have really started the photography mission on Weibo for a year,

4
00:00:10,140 --> 00:00:12,920
Recently,many photos that were difficult to take in the past have been uploaded.

5
00:00:13,440 --> 00:00:17,500
In early June,astronomers published this photo in Nature Periodicals,


I tested it without any problem.

Please make sure that f5-tts-api has downloaded the patch package and that pyvideotrans is upgraded to 3.26, and please make sure the reference audio and reference text are correct.

It is normal for the subtitle duration to be inconsistent with the dubbing duration.
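To put a number on the mismatch reported earlier in this thread (an SRT ending at 00:00:22,060 against 2 minutes 26 seconds of audio), a small sketch follows. `to_ms` and `stretch_ratio` are hypothetical helpers, and the thread does not say what ratio should be considered abnormal.

```python
def to_ms(stamp):
    """Convert an SRT timestamp like '00:00:22,060' to integer milliseconds."""
    clock, millis = stamp.split(",")
    hours, minutes, seconds = (int(part) for part in clock.split(":"))
    return ((hours * 60 + minutes) * 60 + seconds) * 1000 + int(millis)


def stretch_ratio(last_cue_end, audio_ms):
    """How many times longer the dubbed audio is than the subtitle span."""
    return audio_ms / to_ms(last_cue_end)


# Figures taken from the report above: last cue ends at 22.06 s, audio runs 2 min 26 s.
ratio = stretch_ratio("00:00:22,060", (2 * 60 + 26) * 1000)
print(f"audio is {ratio:.1f}x the subtitle span")
```

Integer milliseconds are used so that the comparison does not depend on floating-point rounding of the timestamps.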

image

The 5s.wav in the above picture is the reference audio, and the text after # is the corresponding text of the reference audio.

5s.wav is stored in the f5-tts folder in the same directory as sp.exe
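A rough sketch of how such an entry could be checked before dubbing. The screenshot is not reproduced here, so the exact entry syntax is an assumption: this code assumes a single string of the form `5s.wav#text spoken in 5s.wav`. `split_reference` and `reference_path` are hypothetical names, not pyvideotrans functions.

```python
from pathlib import Path


def split_reference(entry):
    """Split 'audio.wav#reference text' into (audio name, reference text).

    Assumed entry format; raises ValueError if either part is missing.
    """
    audio, sep, ref_text = entry.partition("#")
    audio, ref_text = audio.strip(), ref_text.strip()
    if not sep or not audio or not ref_text:
        raise ValueError("expected '<audio file>#<text spoken in that audio>'")
    return audio, ref_text


def reference_path(entry, app_dir):
    """Build the expected location of the reference audio.

    app_dir is the directory that contains sp.exe; the audio is expected
    in its f5-tts subfolder, as described above.
    """
    audio, _ = split_reference(entry)
    return Path(app_dir) / "f5-tts" / audio
```

A script like this can only confirm that both parts are present and where the file should be, for example with `reference_path(entry, app_dir).is_file()`. It cannot confirm that the text after # is what is actually spoken in the audio; that has to be checked by listening.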

f5-tts-api patch update

https://github.com/jianchang512/f5-tts-api/releases/tag/v0.1

https://github.com/jianchang512/f5-tts-api/releases/download/v0.1/2024-1127-buding.7z

It's solved! I realized I was making one fatal mistake, which is why the cloned audio pronounced words beyond recognition. The mistake was that I was putting whatever I wanted after the #. I thought it was just for testing whether the API worked or not.

However, I realized from the recent solution you provided that the text after the # should correspond to the reference audio.

I'm sorry for causing extra work due to my mistake.😔😔😭
Screenshot 2024-11-29 221615