buzz/mediainfo.js

Character encoding issue

Closed this issue · 9 comments

Checklist

  • I have searched the issue tracker for any duplicate issues and confirmed that this bug has not been reported before.
  • I have tested the issue with the upstream project MediaInfo and can confirm that the problem only exists in mediainfo.js.
  • I have attached all necessary test files, if applicable, so that the issue can be easily reproduced by the developers.
  • I have added a reproduction repository or a code sandbox that clearly illustrates the issue. Providing a minimal example will greatly help the developers in understanding and resolving the problem.

Bug Description

The metadata of an MP3 file has character encoding issues.
It's listing the artist as �me instead of Âme

Steps to Reproduce

const mediainfo = await MediaInfoFactory({
    coverData: covers,
    locateFile: () => {
      return "../../wasm/media-info.wasm";
    },
})

const result = await mediainfo.analyzeData(
    getSize(headUrl),
    readChunk(getUrl)
)

getSize and readChunk are defined here: https://github.com/icidasset/diffuse/blob/b11d5b0a8d204ed3db401c0eedb8da25e4076131/src/Javascript/processing.ts#L116-L155

Using the file:
https://www.dropbox.com/scl/fi/u7qw9rgo7xjmcajsh1oex/211-me_-_rej.mp3?rlkey=tmwckk19p4n7r7gd5g8ngk7cz&dl=0

You can try it out in the app itself using this branch:
https://github.com/icidasset/diffuse/tree/encoding-issue-mediainfo

You'll have to upload the file to a supported service in order to test it though.
I can do that for you if you want, let me know.

Expected Behavior

Expected to see the artist Âme

Actual Behavior

Got the artist �me

Environment

  • mediainfo.js version: 0.2.1
  • Operating System: macOS Sonoma
  • Browser (if applicable): Chromium 121

Additional Information

I tested this with the mediainfo CLI tool installed via Homebrew (it uses MediaInfoLib v24.01), and it parsed the metadata correctly. Other metadata parsers show the info correctly too.

Thanks for this great project!

buzz commented

That seems to be an old file that uses latin-1 encoding.

$ mid3v2 --list-raw 211-me_-_rej.mp3
Raw IDv2 tag info for 211-me_-_rej.mp3
[...]
TPE1(encoding=<Encoding.LATIN1: 0>, text=['Âme'])
[...]
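
As an illustration of how a Latin-1 tag can turn into �me: in Latin-1, Â is the single byte 0xC2, which on its own is not a complete UTF-8 sequence. The sketch below is only an assumption about the failure mode, not a confirmed diagnosis of mediainfo.js. It decodes the same three bytes both ways in plain JavaScript.

```javascript
// Bytes of "Âme" as they would be stored in a Latin-1 (ISO-8859-1) text frame.
const latin1Bytes = new Uint8Array([0xc2, 0x6d, 0x65]);

// Latin-1 maps each byte directly to the code point with the same value.
const asLatin1 = String.fromCharCode(...latin1Bytes);

// Decoding the same bytes as UTF-8: 0xC2 starts a two-byte sequence, but
// 0x6D is not a continuation byte, so the decoder emits U+FFFD instead.
const asUtf8 = new TextDecoder("utf-8").decode(latin1Bytes);

console.log(asLatin1); // Latin-1 interpretation
console.log(asUtf8);   // UTF-8 interpretation, starts with the replacement character
```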

When I open the file in a tag editor and recreate the tags, mediainfo.js shows them properly:

Before:

$ mediainfo.js 211-me_-_rej.mp3 | grep Performer
Performer                                : �me

After:

$ mediainfo.js 211-me_-_rej.mp3 | grep Performer
Performer                                : Âme

Interestingly, the mediainfo.js web page shows them correctly. Modern browsers usually auto-detect the encoding.


I think the mediainfo CLI converts those strings to UTF-8 before printing them to the screen. A lot of flags are disabled for the emscripten build. You could test by building mediainfo.js with those flags enabled.

Another approach would be to solve this outside of mediainfo.js. There are libraries that detect and convert between encodings. Or you could even make UTF-8 mandatory as the character encoding in your app.
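
A minimal sketch of the "convert outside of mediainfo.js" idea, assuming you have access to the raw tag bytes (mediainfo.js itself returns already-decoded strings, so this would apply to a separate tag reader). The helper name `decodeTagBytes` is hypothetical. It tries strict UTF-8 first and falls back to Latin-1, which is a heuristic rather than real encoding detection: Latin-1 text that happens to form valid UTF-8 would be decoded as UTF-8.

```javascript
// Hypothetical helper, not part of mediainfo.js: decode raw tag bytes,
// preferring UTF-8 and falling back to Latin-1 for invalid UTF-8 input.
function decodeTagBytes(bytes) {
  try {
    // fatal: true makes the decoder throw a TypeError instead of emitting U+FFFD.
    return new TextDecoder("utf-8", { fatal: true }).decode(bytes);
  } catch {
    // Latin-1 maps every byte to the code point with the same value.
    return Array.from(bytes, (b) => String.fromCharCode(b)).join("");
  }
}

// The same artist name stored as Latin-1 and as UTF-8.
const fromLatin1 = decodeTagBytes(new Uint8Array([0xc2, 0x6d, 0x65]));
const fromUtf8 = decodeTagBytes(new Uint8Array([0xc3, 0x82, 0x6d, 0x65]));
console.log(fromLatin1 === fromUtf8);
```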

Thanks for investigating! 🙏

So weird that it does work in your demo. I guess the difference is that I am reading chunks using the fetch API (file is hosted on Amazon S3) and the Range header, whereas your demo reads the whole file using FileReader. And the demo is an older version of mediainfo. Maybe I need to set some encoding option/header somewhere?

> even make UTF-8 mandatory as character encoding in your app.

I have <meta charset="utf-8" /> set in my html file, I assume that's all I need?

I also tried converting to UTF-8 manually, but no luck.
I'll take a look at the different mediainfo/emscripten flags.

> When I open the file in a tag editor and recreate the tags, mediainfo.js shows them properly

Could you share the resulting file?
Because I don't get how it could be different. Maybe they change the encoding to UTF-8? But MediaInfoLib should handle the difference and return the same MediaInfo::Get() output.
Could you debug the MediaInfo::Get() result and post a hex dump of the output for the 2 files?

> TPE1(encoding=<Encoding.LATIN1: 0>, text=['Âme'])

MediaInfoLib parses it as Latin1 too and provides the string in Unicode or UTF-8, so no reason that mediainfo.js shows "�me" with this file on the command line.

> So weird that it does work in your demo. I guess the difference is that I am reading chunks using the fetch API (file is hosted on Amazon S3) and the Range header, whereas your demo reads the whole file using FileReader

Input kind should not change MediaInfoLib behavior.

> I think the mediainfo CLI converts those strings to UTF-8 before printing them to the screen.

It converts to the local code page. But if the OS is misconfigured, e.g. without LC_CTYPE set to something like "en_US.UTF-8", it may convert the internal Unicode to the C locale (so Latin-1... but from a UTF-8 string, hence the issue). Maybe mediainfo.js has a similar issue?
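
To check the locale point from a Node.js process, here is a small sketch. The helper name `effectiveCtype` is hypothetical. It follows the POSIX precedence (LC_ALL overrides LC_CTYPE, which overrides LANG) and only inspects the environment of the Node process itself; whether a WebAssembly build ever sees these variables is a separate question that this does not answer.

```javascript
// Hypothetical helper: resolve the effective LC_CTYPE from environment
// variables using POSIX precedence: LC_ALL, then LC_CTYPE, then LANG.
function effectiveCtype(env) {
  // Empty values count as unset; with nothing set, the locale is "C".
  const value = env.LC_ALL || env.LC_CTYPE || env.LANG || "C";
  // Locale names usually spell the charset as "UTF-8" or "utf8".
  return { value, isUtf8: /utf-?8/i.test(value) };
}

console.log(effectiveCtype(process.env));
```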

> A lot of flags are disabled for the emscripten build.

Should not have any impact. There is a known issue with modern platforms (having UTF-8 encoding for the terminal) and such MP3 files with Latin1 encoding only if MediaInfoLib is compiled in non Unicode mode (MEDIAINFO_UNICODE_NO).

If possible, a first point of debug would be the result of MediaInfo::Get() and see how it is encoded (Unicode chars? UTF-8?) so you know if the issue is from MediaInfoLib or something else (mediainfo.js or the platform config somewhere not handling UTF-8).
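
For the hex dump suggested above, a sketch of what could be run on the JavaScript side. The helper name `describeString` is hypothetical, and it inspects a string after it has reached JavaScript, not the raw MediaInfo::Get() buffer inside the WebAssembly module. It lists the code points of the string and the bytes it has when encoded as UTF-8, which is enough to tell a correct Â (U+00C2) from a replacement character (U+FFFD).

```javascript
// Hypothetical debugging helper: list a string's code points and the bytes
// it produces when encoded as UTF-8.
function describeString(str) {
  const hex = (n, width) => n.toString(16).toUpperCase().padStart(width, "0");
  // Array.from iterates by code point, so surrogate pairs stay together.
  const codePoints = Array.from(str, (ch) => "U+" + hex(ch.codePointAt(0), 4));
  const utf8Bytes = Array.from(new TextEncoder().encode(str), (b) => hex(b, 2));
  return { codePoints: codePoints.join(" "), utf8Bytes: utf8Bytes.join(" ") };
}

const good = describeString("\u00C2me"); // the expected artist name
const bad = describeString("\uFFFDme");  // the value reported in this issue
console.log(good);
console.log(bad);
```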

Thanks for chiming in @JeromeMartinez
I'd love to help out more with this but I'm having a difficult time getting the project to build 🙈


Anyhow! I did figure out that it seems to be a problem with mediainfo.js v0.2.
I've updated the demo from the gh-pages-src branch to v0.2.1 (from v0.1.9) and then the encoding problem showed up there too.
So I guess we can exclude any app-specific code.

buzz commented

I created an MP3 file with tags in different encodings. Most media players and tag editors display the tags just fine (one exception: EasyTAG fails to display the UTF-16BE tag).

A test case and the mp3 test file can be found in the 150-id3-character-encodings branch.

Python script used to create the test file
"""
Create id3 tags in different character encodings.

$ mid3v2 --list-raw char_enc_tags.mp3
Raw IDv2 tag info for char_enc_tags.mp3
TIT2(encoding=<Encoding.UTF8: 3>, text=['utf-8 〃𐍈'])
TPE1(encoding=<Encoding.LATIN1: 0>, text=['latin-1 Ã£â¬Æ'])
TALB(encoding=<Encoding.UTF16: 1>, text=['utf-16 〃𐍈'])
TCON(encoding=<Encoding.UTF16BE: 2>, text=['utf-16be 〃𐍈'])
"""

from mutagen.id3 import Encoding, ID3, TALB, TCON, TIT2, TPE1

tags = ID3()
# performer
tags.add(TPE1(encoding=Encoding.LATIN1, text=["latin-1 Ã£â¬Æ"]))
# title
tags.add(TIT2(encoding=Encoding.UTF8, text=["utf-8 〃𐍈"]))
# album
tags.add(TALB(encoding=Encoding.UTF16, text=["utf-16 〃𐍈"]))
# genre
tags.add(TCON(encoding=Encoding.UTF16BE, text=["utf-16be 〃𐍈"]))

tags.save("char_enc_tags.mp3")

Findings for different flavors of mediainfo:

✅ https://mediainfo.js.org/ - mediainfo.js v0.1.9 (MediaInfoLib v22.09)


✅ mediainfo.js CLI v0.1.9 (MediaInfoLib v22.09)
$ node dist/cli.js --format JSON ../char_encoding_issue_150/char_enc_tags.mp3
"Title": "utf-8 〃𐍈",
"Album": "utf-16 〃𐍈",
"Track": "utf-8 〃𐍈",
"Performer": "latin-1 Ã£â¬Æ",
"Genre": "utf-16be 〃𐍈",

❌ mediainfo.js CLI v0.2.1 (MediaInfoLib v24.01)
$ pnpm exec node dist/cjs/cli.cjs --format JSON __tests__/fixtures/char_enc_tags.mp3
"Title":"utf-8 〃𐍈",
"Album":{"@dt":"binary.base64","#value": "dXRmLTE2IMOj4hqsxhk="},
"Track":"utf-8 〃𐍈",
"Performer":"latin-1 �",
"Genre":{"@dt":"binary.base64","#value": "dXRmLTE2YmUgw6PiGqzGGQ=="},

❌ https://mediaarea.net/MediaInfoOnline - Mediainfo official WASM build (MediaInfoLib v24.01)
"Title":"utf-8 〃𐍈",
"Album":{"@dt":"binary.base64","#value": "dXRmLTE2IMOj4hqsxhk="},
"Track":"utf-8 〃𐍈",
"Performer":"latin-1 �",
"Genre":{"@dt":"binary.base64","#value": "dXRmLTE2YmUgw6PiGqzGGQ=="},

❌ mediainfo CLI (MediaInfoLib v24.01)

Displays the Latin-1 tag correctly, but not the UTF-16/UTF-16BE...

$ mediainfo __tests__/fixtures/char_enc_tags.mp3
[...]
Album                                    : utf-16 〃??
Track name                               : utf-8 〃𐍈
Performer                                : latin-1 Ã£â¬Æ
Genre                                    : utf-16be 〃??
[...]
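
The base64-wrapped Album and Genre values in the v0.2.1 output can be inspected byte by byte. This is a small Node.js sketch (it uses Node's `Buffer`, so it is not browser code) that decodes the Album value copied from the output above and prints it as hex. The first seven bytes are plain ASCII; the bytes after that are the interesting part to compare against the expected encoding.

```javascript
// Value copied verbatim from the "Album" field in the v0.2.1 output.
const albumBase64 = "dXRmLTE2IMOj4hqsxhk=";

// Decode base64 into raw bytes (Node.js Buffer).
const albumBytes = Buffer.from(albumBase64, "base64");

// Space-separated hex dump of all bytes.
const albumHex = albumBytes.toString("hex").match(/../g).join(" ");

// The leading bytes are ASCII, so any single-byte decoding shows them as text.
const asciiPrefix = albumBytes.subarray(0, 7).toString("latin1");

console.log(albumHex);
console.log(JSON.stringify(asciiPrefix));
```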

These are just some preliminary findings. I'll have to see when I find the time to do some version bisecting on mediainfo.js.

Thanks @buzz
I managed to make a build with a Github action.
I've removed the --disable-unicode flag from the zenlib compilation, which I assume also put mediainfolib in non-unicode mode, and added the LC_CTYPE=en_US.UTF-8 env var.
But... no luck.

> ❌ https://mediaarea.net/MediaInfoOnline - Mediainfo official WASM build (MediaInfoLib v24.01)
> ❌ mediainfo CLI (MediaInfoLib v24.01)

This is annoying... Weird, the Windows version is fine, so maybe there is a locale misconfiguration somewhere.
Note that MediaInfo CLI v22.09 also has the issue, so it seems nothing changed there on our side.

IMO there are 2 issues:

  • Something changed in mediainfo.js regarding locale configuration between v0.1.9 and v0.2.1, for @buzz
  • The MediaInfo CLI has an issue, behaving the same as mediainfo.js v0.2.1, and MediaInfoOnline is even weirder, for @g-maxime

This issue is stale because it has been open for 30 days with no activity.

This issue was closed because it has been inactive for 30 days since being marked as stale.