kaegi/alass

Feature request: allow subtitle charsets other than UTF-8 to be processed

Closed this issue · 4 comments

I'm having some issues: I have to use iconv to force a UTF-8 conversion of my .srt files, as alass will not process them otherwise.

Could other charsets be handled by alass itself, at the code level?

Thanks

kaegi commented

Have you tried --encoding-inc and --encoding-ref?

Actually I hadn't, but it seems the user is expected to know the charset of the incorrect subtitle file (or of the reference one).

It is nice to see this was "considered", but it still needs manual interaction, which is what I was trying to avoid.
I'm wondering whether alass could work out the charset automatically (e.g. as the file command does).

Thanks!


Edit:

# Pick the first video and the first .srt in the current directory.
Video=$(ls -1 | grep -Ei '\.(avi|mkv|asf|wmv|mp4|mpg|mpeg|divx|m4v)$' | head -1)
SubName=$(ls -1 | grep -Ei '\.srt$' | head -1)
# Ask file(1) for the charset, e.g. "iso-8859-1".
CharSet=$(file -bi ./"$SubName" | cut -f2 -d "=")
# Write the resynced subtitles next to the original.
SubNameResync=".alass_$SubName"
alass --encoding-inc "$CharSet" "$Video" "$SubName" "$SubNameResync"

Running the above script recursively on my video folders does the job reasonably well; I still think alass could handle this better internally with a charset autodetection routine.
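For anyone wondering what "recursively" could look like in practice, here is a minimal sketch. It assumes the script above has been saved somewhere as a file (the path /path/to/resync.sh is hypothetical) and that find supports -iname (GNU and BSD find do). The helper name list_srt_dirs is made up for illustration.

```shell
# Hypothetical helper: print every directory under "$1" that contains
# at least one .srt file (case-insensitive), one directory per line.
list_srt_dirs() {
    find "$1" -type f -iname '*.srt' -exec dirname {} \; | sort -u
}

# Example use (commented out; /path/to/resync.sh is an assumed location
# for the script above, not something alass ships):
# list_srt_dirs /path/to/videos | while IFS= read -r dir; do
#     (cd "$dir" && sh /path/to/resync.sh)
# done
```

Running the script in a subshell after cd keeps each folder independent, so a failure in one folder does not change the working directory for the next.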

P.S. file occasionally returns "unknown-8bit", which alass does not accept as a charset name.
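One way to guard against that case is to fall back to a charset of your own choosing whenever file gives up. The sketch below is only an illustration: pick_charset is a hypothetical helper, the windows-1252 default is an assumption (use whatever your subtitles usually come in), and I have not checked which charset names alass accepts.

```shell
# Hypothetical helper: choose the charset name to hand to --encoding-inc.
#   $1  raw charset string as reported by file(1)
#   $2  fallback to use when detection failed (assumed default: windows-1252)
pick_charset() {
    detected=$1
    fallback=${2:-windows-1252}
    case "$detected" in
        # file(1) could not tell: use the fallback instead.
        ''|unknown-8bit|binary) printf '%s\n' "$fallback" ;;
        # Plain ASCII is also valid UTF-8, so report it as such.
        us-ascii)               printf '%s\n' utf-8 ;;
        # Anything else is passed through unchanged.
        *)                      printf '%s\n' "$detected" ;;
    esac
}

# Example use (commented out so the block has no side effects):
# CharSet=$(pick_charset "$(file -bi ./"$SubName" | cut -f2 -d "=")")
```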

As a matter of fact, I have found that file -bi is NOT reliable enough. To anybody facing this problem, I would strongly suggest forcing the source .srt to UTF-8 like this (you'll need vim installed):

vim +'set nobomb | set fenc=utf8 | x' <filename>

The above opens the file in whatever charset vim detects and saves it as UTF-8 (without a BOM) transparently.
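Because the vim command rewrites the file in place, it may be worth skipping files that are already valid UTF-8. A minimal sketch, assuming iconv is installed; is_utf8 is a hypothetical helper name. It relies on iconv exiting with a non-zero status when the input contains byte sequences that are not valid in the source encoding.

```shell
# Hypothetical helper: succeed (exit 0) when "$1" is already valid UTF-8,
# fail otherwise. The converted text is discarded; only the exit status
# of iconv matters here.
is_utf8() {
    iconv -f UTF-8 -t UTF-8 "$1" >/dev/null 2>&1
}

# Example use (commented out; converts only when needed):
# is_utf8 "$SubName" || vim +'set nobomb | set fenc=utf8 | x' "$SubName"
```

Note that a file in a single-byte charset containing only ASCII characters also passes this check, which is harmless because such a file is valid UTF-8 as it stands.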

So the above script is further developed into:

# Pick the first video and the first .srt in the current directory.
Video=$(ls -1 | grep -Ei '\.(avi|mkv|asf|wmv|mp4|mpg|mpeg|divx|m4v)$' | head -1)
SubName=$(ls -1 | grep -Ei '\.srt$' | head -1)
# Re-save the subtitles as UTF-8 without a BOM before resyncing.
vim +'set nobomb | set fenc=utf8 | x' "$SubName"
CharSet=$(file -bi ./"$SubName" | cut -f2 -d "=")
SubNameResync=".alass_$SubName"
alass --encoding-inc "$CharSet" "$Video" "$SubName" "$SubNameResync"

HTH

kaegi commented

Auto-detection of character encoding using https://github.com/thuleqaid/rust-chardet implemented in 874f02d.