agermanidis/autosub

Language code set to zh-TW, but output is Simplified Chinese (zh-CN) on Ubuntu

chiangandy opened this issue · 2 comments

I ran the following command on Ubuntu to generate subtitles:
autosub -S zh-TW -D zh-TW andy-good_food_sub1.mp4
It finished with the following output and no errors:

Converting speech regions to FLAC files: 100% |#########################################################################| Time:  0:00:02
Performing speech recognition: 100% |###################################################################################| Time:  0:00:05
Subtitles file created at andy-good_food_sub1.srt

But when I open andy-good_food_sub1.srt, the content is in Simplified Chinese (zh-CN):
2
00:00:03,072 --> 00:00:04,096
营养师

3
00:00:04,352 --> 00:00:05,632
今天跟大家分享

4
00:00:06,144 --> 00:00:08,448
非常多人问我的有问题


9
00:00:14,080 --> 00:00:17,152
不同的年龄就是儿童青少年男士

10
00:00:17,408 --> 00:00:23,040
现在成年人拿孕妇和老年人的更年期妇女每个人的营养需求都不一样

I am sure this is Simplified Chinese, not Traditional Chinese.
This is a tricky one. Has anyone run into this issue before? I cannot figure out how to solve it.

Any information will be appreciated.
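
For anyone who wants to check the script of a subtitle file without reading it by eye, here is a rough heuristic sketch in Python. It relies on the assumption that characters which encode in GB2312 but not in Big5 are Simplified-only forms, and the reverse for Traditional-only forms. The function names are made up for this example, and lines that use only characters common to both scripts (such as 今天跟大家分享) will match neither check.

```python
def _encodable(ch, codec):
    """Return True if the single character can be encoded with the codec."""
    try:
        ch.encode(codec)
        return True
    except UnicodeEncodeError:
        return False


def looks_simplified(text):
    """True if any character exists in GB2312 but not in Big5 (heuristic)."""
    return any(_encodable(ch, "gb2312") and not _encodable(ch, "big5")
               for ch in text)


def looks_traditional(text):
    """True if any character exists in Big5 but not in GB2312 (heuristic)."""
    return any(_encodable(ch, "big5") and not _encodable(ch, "gb2312")
               for ch in text)
```

Calling `looks_simplified(line)` on each text line of the .srt file gives a quick indication of which script the recognizer returned.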

Andy

More information about the issue:
When I run the same command in my local Mac OS X environment, the output is correct Traditional Chinese, which is very strange.

After more testing, I found that the problem is in the Google Speech API itself, because the text the API returns is already wrong. I then called the API directly with curl.
Strangely, this is what I got.

When I try it in my Mac environment:

curl -X POST \
--data-binary @'andy-sample1_foutput_sub1.flac' \
--header 'Content-Type: audio/x-flac; rate=44100;' \
'http://www.google.com/speech-api/v2/recognize?output=json&lang=zh-TW&key=<<avoid>>'
{"result":[]}
{"result":[{"alternative":[{"transcript":"雨傘認識太厲害了是盡情的發揮他的相應的合作處著手","confidence":0.89017951},{"transcript":"雨傘認識太厲害的是盡情的發揮他的相應的合作處著手"},{"transcript":"雨傘認識太厲害了盡情的發揮他的相應的合作處著手"},{"transcript":"雨傘認識太厲害了是盡情的發揮它的相應的合作處著手"},{"transcript":"雨傘認識太厲害的是盡情的發揮它的相應的合作處著手"}],"final":true}],"result_index":0}

Then I uploaded the audio file and ran the same command on Ubuntu:

curl -X POST \
--data-binary @'andy-sample1_foutput_sub1.flac' \
--header 'Content-Type: audio/x-flac; rate=44100;' \
'https://www.google.com/speech-api/v2/recognize?output=json&lang=<<avoid>>'
{"result":[]}
{"result":[{"alternative":[{"transcript":"雨伞认识太厉害了自尽情的发挥它的上映然后作出这首","confidence":0.89187473},{"transcript":"雨伞认识太厉害了是尽情的发挥它的上映然后做出正手"},{"transcript":"雨伞认识太厉害了自尽情的发挥它的上映然后做出正手"},{"transcript":"雨伞认识太厉害了些尽情的发挥它的上映然后做出正手"},{"transcript":"雨伞认是太厉害了自尽情的发挥它的上映然后作出这首"}],"final":true}],"result_index":0}

Obviously, the output text is different. Why? Does the API depend on the environment?

Does anyone have any comments on this?
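
To compare the two machines with an identical request, the curl call can be wrapped in a small script that is run unchanged on both. The following is only a sketch in Python: the endpoint, query parameters, and Content-Type header are copied from the curl commands in this thread, while the function names and the `YOUR_KEY` placeholder are made up for illustration. It builds the URL with an explicit `lang` parameter and reads the newline-separated JSON objects that the API returns.

```python
import json
import urllib.parse
import urllib.request

# Endpoint as used in the curl commands in this thread.
API_URL = "http://www.google.com/speech-api/v2/recognize"


def build_url(lang, key):
    """Build the request URL with an explicit language code and API key."""
    query = urllib.parse.urlencode({"output": "json", "lang": lang, "key": key})
    return API_URL + "?" + query


def parse_transcripts(body):
    """Return the first alternative transcript of every non-empty result.

    The response body holds one JSON object per line; the first line is
    often an empty result, as in the outputs shown in this thread.
    """
    transcripts = []
    for line in body.splitlines():
        line = line.strip()
        if not line:
            continue
        data = json.loads(line)
        for result in data.get("result", []):
            alternatives = result.get("alternative", [])
            if alternatives:
                transcripts.append(alternatives[0]["transcript"])
    return transcripts


def recognize(flac_path, lang, key, rate=44100):
    """POST a FLAC file to the API and return the parsed transcripts."""
    with open(flac_path, "rb") as f:
        audio = f.read()
    request = urllib.request.Request(
        build_url(lang, key),
        data=audio,  # supplying data makes urllib send a POST request
        headers={"Content-Type": "audio/x-flac; rate=%d" % rate},
    )
    with urllib.request.urlopen(request) as response:
        return parse_transcripts(response.read().decode("utf-8"))
```

Printing `build_url("zh-TW", "YOUR_KEY")` on each machine before calling `recognize("andy-sample1_foutput_sub1.flac", "zh-TW", "YOUR_KEY")` shows exactly which URL was sent, so any remaining difference in the transcripts cannot come from the request itself.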