Character encoding issue
Closed this issue · 9 comments
Checklist
- I have searched the issue tracker for any duplicate issues and confirmed that this bug has not been reported before.
- I have tested the issue with the upstream project MediaInfo and can confirm that the problem only exists in mediainfo.js.
- I have attached all necessary test files, if applicable, so that the issue can be easily reproduced by the developers.
- I have added a reproduction repository or a code sandbox that clearly illustrates the issue. Providing a minimal example will greatly help the developers in understanding and resolving the problem.
Bug Description
The metadata of an MP3 file has character encoding issues.
It's listing the artist as �me instead of Âme.
Steps to Reproduce
const mediainfo = await MediaInfoFactory({
  coverData: covers,
  locateFile: () => {
    return "../../wasm/media-info.wasm";
  },
})
const result = await mediainfo.analyzeData(
  getSize(headUrl),
  readChunk(getUrl)
)
Where getSize and readChunk are defined here: https://github.com/icidasset/diffuse/blob/b11d5b0a8d204ed3db401c0eedb8da25e4076131/src/Javascript/processing.ts#L116-L155
Using the file:
https://www.dropbox.com/scl/fi/u7qw9rgo7xjmcajsh1oex/211-me_-_rej.mp3?rlkey=tmwckk19p4n7r7gd5g8ngk7cz&dl=0
You can try it out in the app itself using this branch:
https://github.com/icidasset/diffuse/tree/encoding-issue-mediainfo
You'll have to upload the file to a supported service in order to test it though.
I can do that for you if you want, let me know.
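For readers who don't want to follow the link, the chunked-reading setup described above can be sketched roughly as follows. This is not the code from the linked repository: rangeHeader and makeRangeReader are hypothetical names, and the sketch assumes that the callback passed to analyzeData is invoked as readChunk(chunkSize, offset) and that the server honours HTTP Range requests.

```javascript
// Hypothetical helper: format an HTTP Range header value for one chunk.
// Range end offsets are inclusive, hence the "- 1".
function rangeHeader(offset, chunkSize) {
  return `bytes=${offset}-${offset + chunkSize - 1}`;
}

// Hypothetical factory: returns a readChunk(chunkSize, offset) callback that
// fetches only the requested bytes. fetchImpl is injectable for testing.
function makeRangeReader(url, fetchImpl = globalThis.fetch) {
  return async function readChunk(chunkSize, offset) {
    if (chunkSize === 0) return new Uint8Array(0);
    const response = await fetchImpl(url, {
      headers: { Range: rangeHeader(offset, chunkSize) },
    });
    if (!response.ok) {
      throw new Error(`Range request failed with status ${response.status}`);
    }
    // The bytes are handed over untouched; no text decoding happens here.
    return new Uint8Array(await response.arrayBuffer());
  };
}
```

In this sketch the callback only passes raw bytes along, so no character decoding takes place at this stage.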
Expected Behavior
Expected to see the artist Âme
Actual Behavior
Got the artist �me
Environment
- mediainfo.js version: 0.2.1
- Operating System: macOS Sonoma
- Browser (if applicable): Chromium 121
Additional Information
I tested this with the mediainfo CLI tool installed via Homebrew (it uses MediaInfoLib v24.01), and it parsed the metadata correctly. Other metadata parsers show the info correctly too.
Thanks for this great project!
That seems to be an old file that uses latin-1 encoding.
$ mid3v2 --list-raw 211-me_-_rej.mp3
Raw IDv2 tag info for 211-me_-_rej.mp3
[...]
TPE1(encoding=<Encoding.LATIN1: 0>, text=['Âme'])
[...]
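To illustrate why a Latin-1 tag can end up as "�me": the small sketch below decodes the three Latin-1 bytes of "Âme" once as Latin-1 and once as UTF-8. Whether this is what actually happens inside mediainfo.js is not established here; the sketch only shows that the symptom matches a Latin-1 byte sequence being interpreted as UTF-8.

```javascript
// "Âme" as stored in a Latin-1 (ISO-8859-1) ID3 frame: one byte per character.
const latin1Bytes = new Uint8Array([0xc2, 0x6d, 0x65]);

// Latin-1 maps each byte directly to the code point of the same value.
const asLatin1 = Array.from(latin1Bytes, (b) => String.fromCharCode(b)).join("");

// Decoding the same bytes as UTF-8: 0xC2 starts a two-byte sequence, but 0x6D
// is not a continuation byte, so the decoder emits U+FFFD instead.
const asUtf8 = new TextDecoder("utf-8").decode(latin1Bytes);

console.log(asLatin1); // Âme
console.log(asUtf8); // U+FFFD followed by "me", rendered as "�me"
```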
When I open the file in a tag editor and recreate the tags, mediainfo.js shows them properly:
Before:
$ mediainfo.js 211-me_-_rej.mp3 | grep Performer
Performer : �me
After:
$ mediainfo.js 211-me_-_rej.mp3 | grep Performer
Performer : Âme
Interestingly, the mediainfo.js web page shows them correctly. Modern browsers usually auto-detect the encoding.
I think the mediainfo CLI converts those strings to UTF-8 before printing them to the screen. A lot of flags are disabled for the emscripten build. You could test by building mediainfo.js with those flags enabled.
Another approach would be to solve this outside of mediainfo.js. There are libs that detect and convert between encodings. Or even make UTF-8 mandatory as character encoding in your app.
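As a rough idea of what handling this outside of mediainfo.js could look like, here is a minimal sketch of a "strict UTF-8 first, Latin-1 as fallback" decoder. decodeTagBytes is a hypothetical helper, not part of mediainfo.js or of any specific library, and it assumes you have access to the raw tag bytes, which the mediainfo.js result may not expose.

```javascript
// Hypothetical helper (not part of mediainfo.js): decode raw tag bytes as
// UTF-8 when they are valid UTF-8, otherwise fall back to Latin-1.
function decodeTagBytes(bytes) {
  try {
    // fatal: true makes the decoder throw instead of inserting U+FFFD.
    return new TextDecoder("utf-8", { fatal: true }).decode(bytes);
  } catch {
    // Latin-1 fallback: every byte maps to the code point of the same value.
    return Array.from(bytes, (b) => String.fromCharCode(b)).join("");
  }
}

console.log(decodeTagBytes(new Uint8Array([0xc2, 0x6d, 0x65]))); // Âme (Latin-1 input)
console.log(decodeTagBytes(new TextEncoder().encode("Âme"))); // Âme (UTF-8 input)
```

This is only a heuristic: some Latin-1 byte sequences happen to be valid UTF-8 as well, and those would be decoded as UTF-8.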
Thanks for investigating! 🙏
So weird that it does work in your demo. I guess the difference is that I am reading chunks using the fetch API (file is hosted on Amazon S3) and the Range header, whereas your demo reads the whole file using FileReader. And the demo is an older version of mediainfo. Maybe I need to set some encoding option/header somewhere?
> even make UTF-8 mandatory as character encoding in your app.
I have <meta charset="utf-8" /> set in my HTML file; I assume that's all I need?
I also tried converting to UTF8 manually, but no luck.
I'll take a look at the different mediainfo/emscripten flags.
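One possible reason the manual conversion did not help, assuming it was applied to the string returned by mediainfo.js rather than to the raw tag bytes: once a byte has been replaced by U+FFFD, the original value is gone. The sketch below simulates that situation.

```javascript
// Simulate a Latin-1 "Âme" that was already decoded as UTF-8 somewhere upstream.
const damaged = new TextDecoder("utf-8").decode(new Uint8Array([0xc2, 0x6d, 0x65]));

// Re-encoding the damaged string yields the bytes of U+FFFD (EF BF BD),
// not the original 0xC2, so the "Â" is no longer recoverable from the string.
const reencoded = Array.from(new TextEncoder().encode(damaged), (b) =>
  b.toString(16).padStart(2, "0")
).join(" ");

console.log(reencoded); // ef bf bd 6d 65
```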
> When I open the file in a tag editor and recreate the tags, mediainfo.js shows them properly
Could you share the resulting file?
Because I don't get how it could be different. Maybe they changed the encoding to UTF-8? But MediaInfoLib should handle the difference and show the same MediaInfo::Get() output.
Could you debug the MediaInfo::Get() result and output a hex dump of the output for the two files?
> TPE1(encoding=<Encoding.LATIN1: 0>, text=['Âme'])
MediaInfoLib parses it as Latin-1 too and provides the string in Unicode or UTF-8, so there is no reason for mediainfo.js to show "�me" with this file on the command line.
> So weird that it does work in your demo. I guess the difference is that I am reading chunks using the fetch API (file is hosted on Amazon S3) and the Range header, whereas your demo reads the whole file using FileReader
Input kind should not change MediaInfoLib behavior.
> I think the mediainfo CLI converts those strings to UTF-8 before printing them to the screen.
It converts to the local code page. But if the OS is misconfigured (e.g. LC_CTYPE is not set to something like "en_US.UTF-8"), it may convert the internal Unicode to the C locale (so Latin-1... but from a UTF-8 string, hence the issue). Maybe there is a similar issue with mediainfo.js?
> A lot of flags are disabled for the emscripten build.
This should not have any impact. There is a known issue on modern platforms (where the terminal uses UTF-8) with such Latin-1 encoded MP3 files, but only if MediaInfoLib is compiled in non-Unicode mode (MEDIAINFO_UNICODE_NO).
If possible, a first debugging step would be to look at the result of MediaInfo::Get() and see how it is encoded (Unicode chars? UTF-8?), so you know whether the issue comes from MediaInfoLib or from something else (mediainfo.js, or a platform configuration somewhere not handling UTF-8).
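A minimal sketch of such an inspection on the JavaScript side, assuming the value in question is available as a JS string (for example a field taken from the mediainfo.js result; it does not inspect MediaInfo::Get() inside the C++ library itself). dumpString is a hypothetical helper name.

```javascript
// Hypothetical debugging helper: show a string's code points and UTF-8 bytes.
function dumpString(str) {
  // Array.from iterates by code point, so astral characters stay intact.
  const codePoints = Array.from(str, (ch) =>
    "U+" + ch.codePointAt(0).toString(16).toUpperCase().padStart(4, "0")
  ).join(" ");
  const utf8 = Array.from(new TextEncoder().encode(str), (b) =>
    b.toString(16).padStart(2, "0")
  ).join(" ");
  return { codePoints, utf8 };
}

console.log(dumpString("Âme")); // codePoints: U+00C2 U+006D U+0065, utf8: c3 82 6d 65
console.log(dumpString("\uFFFDme")); // codePoints: U+FFFD U+006D U+0065, utf8: ef bf bd 6d 65
```

If the string already contains U+FFFD at this point, the replacement happened before the value reached application code.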
Thanks for chiming in @JeromeMartinez
I'd love to help out more with this but I'm having a difficult time getting the project to build 🙈
Anyhow! I did figure out that it seems to be a problem with mediainfo.js v0.2.
I've updated the demo from the gh-pages-src branch to v0.2.1 (from v0.1.9), and then the encoding problem showed up there too.
So I guess we can exclude any app specific code.
I created an MP3 file with tags in different encodings. Most media players and tag editors display the tags just fine (one exception: EasyTAG fails to display the UTF-16BE tag).
A test case and the mp3 test file can be found in the 150-id3-character-encodings branch.
Python script used to create the test file
"""
Create id3 tags in different character encodings.
$ mid3v2 --list-raw char_enc_tags.mp3
Raw IDv2 tag info for char_enc_tags.mp3
TIT2(encoding=<Encoding.UTF8: 3>, text=['utf-8 〃𐍈'])
TPE1(encoding=<Encoding.LATIN1: 0>, text=['latin-1 Ã£â¬Æ'])
TALB(encoding=<Encoding.UTF16: 1>, text=['utf-16 〃𐍈'])
TCON(encoding=<Encoding.UTF16BE: 2>, text=['utf-16be 〃𐍈'])
"""
from mutagen.id3 import Encoding, ID3, TALB, TCON, TIT2, TPE1
tags = ID3()
# performer
tags.add(TPE1(encoding=Encoding.LATIN1, text=["latin-1 Ã£â¬Æ"]))
# title
tags.add(TIT2(encoding=Encoding.UTF8, text=["utf-8 〃𐍈"]))
# album
tags.add(TALB(encoding=Encoding.UTF16, text=["utf-16 〃𐍈"]))
# genre
tags.add(TCON(encoding=Encoding.UTF16BE, text=["utf-16be 〃𐍈"]))
tags.save("char_enc_tags.mp3")
Findings for different flavors of mediainfo:
✅ mediainfo.js CLI v0.1.9 (MediaInfoLib v22.09)
$ node dist/cli.js --format JSON ../char_encoding_issue_150/char_enc_tags.mp3
"Title": "utf-8 〃𐍈",
"Album": "utf-16 〃𐍈",
"Track": "utf-8 〃𐍈",
"Performer": "latin-1 Ã£â¬Æ",
"Genre": "utf-16be 〃𐍈",
❌ mediainfo.js CLI v0.2.1 (MediaInfoLib v24.01)
$ pnpm exec node dist/cjs/cli.cjs --format JSON __tests__/fixtures/char_enc_tags.mp3
"Title":"utf-8 〃𐍈",
"Album":{"@dt":"binary.base64","#value": "dXRmLTE2IMOj4hqsxhk="},
"Track":"utf-8 〃𐍈",
"Performer":"latin-1 �",
"Genre":{"@dt":"binary.base64","#value": "dXRmLTE2YmUgw6PiGqzGGQ=="},
❌ https://mediaarea.net/MediaInfoOnline - Mediainfo official WASM build (MediaInfoLib v24.01)
"Title":"utf-8 〃𐍈",
"Album":{"@dt":"binary.base64","#value": "dXRmLTE2IMOj4hqsxhk="},
"Track":"utf-8 〃𐍈",
"Performer":"latin-1 �",
"Genre":{"@dt":"binary.base64","#value": "dXRmLTE2YmUgw6PiGqzGGQ=="},
❌ mediainfo CLI (MediaInfoLib v24.01)
Displays the Latin-1 tag correctly, but not the UTF-16/UTF-16BE...
$ mediainfo __tests__/fixtures/char_enc_tags.mp3
[...]
Album : utf-16 〃??
Track name : utf-8 〃𐍈
Performer : latin-1 Ã£â¬Æ
Genre : utf-16be 〃??
[...]
These are just some preliminary findings. I'll have to see when I find the time to do some version bisecting on mediainfo.js.
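To see which bytes are hiding behind the base64 values in the v0.2.1 output above, the payload can be decoded and printed as hex. This is only an inspection aid; base64ToHex is a hypothetical helper, and the sketch draws no conclusion about where the bytes come from.

```javascript
// Decode a base64 payload into a space-separated hex string.
// atob is available in browsers and in recent Node.js versions.
function base64ToHex(b64) {
  const binary = atob(b64); // one character per decoded byte
  return Array.from(binary, (ch) =>
    ch.charCodeAt(0).toString(16).padStart(2, "0")
  ).join(" ");
}

// Album value from the v0.2.1 output above.
console.log(base64ToHex("dXRmLTE2IMOj4hqsxhk="));
// 75 74 66 2d 31 36 20 c3 a3 e2 1a ac c6 19
```

The first seven bytes are the ASCII text "utf-16 "; the bytes after that contain 0x1a and 0x19, which are control characters rather than printable text.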
Thanks @buzz
I managed to make a build with a GitHub Action.
I've removed the --disable-unicode flag from the zenlib compilation, which I assume also put mediainfolib in non-Unicode mode, and added the LC_CTYPE=en_US.UTF-8 env var.
But... no luck.
> ❌ https://mediaarea.net/MediaInfoOnline - Mediainfo official WASM build (MediaInfoLib v24.01)
> ❌ mediainfo CLI (MediaInfoLib v24.01)
This is annoying... Weird, the Windows version is fine; maybe there is a locale misconfiguration somewhere.
Note that MediaInfo CLI v22.09 also has the issue, so it seems that nothing changed there on our side.
IMO 2 issues:
This issue is stale because it has been open for 30 days with no activity.
This issue was closed because it has been inactive for 30 days since being marked as stale.