evanmiller/fmptools

Support other character sets

Closed this issue · 2 comments

FP3 files are assumed to be MacRoman. FP7 and later files are assumed to be Windows-1252. Either locate the file bytes that indicate the encoding, or offer the user an option to specify the encoding. (Note that strings returned to the client are always UTF-8.)
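To illustrate the behavior described above, here is a minimal Python sketch (not fmptools code) of picking an assumed encoding per file format, allowing a user override, and always handing back UTF-8. The names `ASSUMED_ENCODINGS`, `to_utf8`, and the `override` parameter are hypothetical; `mac_roman` and `cp1252` are Python's built-in codec names.

```python
from typing import Optional

# Hypothetical table: the encoding assumed for each file format label.
ASSUMED_ENCODINGS = {
    "FP3": "mac_roman",  # assumed MacRoman
    "FP7": "cp1252",     # assumed Windows-1252
}

def to_utf8(raw: bytes, file_format: str, override: Optional[str] = None) -> bytes:
    """Decode raw string bytes and re-encode them as UTF-8.

    `override` stands in for a user-supplied encoding option; when it is
    absent, the per-format assumption is used.
    """
    encoding = override or ASSUMED_ENCODINGS[file_format]
    return raw.decode(encoding).encode("utf-8")
```

The same byte decodes differently under each assumption, which is why a wrong guess shows up as mojibake: `0x8E` is "é" in MacRoman but "Ž" in Windows-1252.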

I've discovered (through extensive tinkering) that FP7 and later use the Standard Compression Scheme for Unicode:

https://www.unicode.org/reports/tr6/tr6-4.html

This is a clever byte-oriented encoding that looks a lot like Latin-1 by default, but it switches to other ranges of the code space (called "windows") using tag bytes in the C0 control range. I went ahead and implemented SCSU according to the specification. It seems to work well with my test files containing Greek and Japanese characters, which means that the FP7 and FMP12 readers now enjoy full Unicode support – including characters in the supplementary planes (emoji, anyone?).
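For readers unfamiliar with SCSU, the following Python sketch shows how the window mechanism works. It is not the fmptools implementation; it covers only a subset of single-byte mode (pass-through bytes, the active dynamic window, and the SQn, SQU, SCn, and SDn tags) and rejects everything else, including Unicode mode (SCU) and extended windows (SDX). The offset tables are transcribed from the specification linked above; the function names are my own. Truncated input raises `IndexError` in this sketch.

```python
# Window offsets as listed in the SCSU specification (UTS #6).
STATIC_OFFSETS = [0x0000, 0x0080, 0x0100, 0x0300, 0x2000, 0x2080, 0x2100, 0x3000]
INITIAL_DYNAMIC_OFFSETS = [0x0080, 0x00C0, 0x0400, 0x0600, 0x0900, 0x3040, 0x30A0, 0xFF00]
SPECIAL_OFFSETS = {0xF9: 0x00C0, 0xFA: 0x0250, 0xFB: 0x0370, 0xFC: 0x0530,
                   0xFD: 0x3040, 0xFE: 0x30A0, 0xFF: 0xFF60}

def window_offset(index: int) -> int:
    """Map the byte that follows an SDn tag to a window start offset."""
    if 0x01 <= index <= 0x67:
        return index * 0x80
    if 0x68 <= index <= 0xA7:
        return index * 0x80 + 0xAC00
    if index in SPECIAL_OFFSETS:
        return SPECIAL_OFFSETS[index]
    raise ValueError(f"reserved window offset index 0x{index:02X}")

def decode_scsu_subset(data: bytes) -> str:
    """Decode a subset of SCSU single-byte mode into a Python string."""
    dynamic = list(INITIAL_DYNAMIC_OFFSETS)
    active = 0          # window 0 (Latin-1 Supplement) is active at the start
    units = []          # UTF-16 code units, joined and decoded at the end
    i = 0
    while i < len(data):
        b = data[i]
        i += 1
        if b >= 0x80:                                   # active dynamic window
            units.append(dynamic[active] + (b - 0x80))
        elif b >= 0x20 or b in (0x00, 0x09, 0x0A, 0x0D):  # passed through as-is
            units.append(b)
        elif 0x01 <= b <= 0x08:                         # SQn: quote one character
            n, q = b - 0x01, data[i]
            i += 1
            units.append(STATIC_OFFSETS[n] + q if q < 0x80
                         else dynamic[n] + (q - 0x80))
        elif b == 0x0E:                                 # SQU: quote a UTF-16 unit
            units.append((data[i] << 8) | data[i + 1])
            i += 2
        elif 0x10 <= b <= 0x17:                         # SCn: switch active window
            active = b - 0x10
        elif 0x18 <= b <= 0x1F:                         # SDn: define and switch
            active = b - 0x18
            dynamic[active] = window_offset(data[i])
            i += 1
        else:                                           # SDX, SCU, reserved 0x0C
            raise ValueError(f"tag 0x{b:02X} is not handled by this sketch")
    return b"".join(u.to_bytes(2, "big") for u in units).decode("utf-16-be")
```

With the default window, Latin-1 text decodes unchanged (`b"caf\xe9"` gives "café"). Defining a window at U+0380 with `SD0 0x07` makes `0xB1 0xB2` decode as "αβ", and switching to the predefined Hiragana window with `SC5` makes `0x82` decode as "あ".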

I'll leave this issue open since I haven't figured out whether FP3 and FP5 indicate their encoding or not. A non-Latin test file would help a lot.

Based on this discussion:

qwesda/fp5dump#5

It seems that MacRoman is the predominant FP5 file encoding, and nobody appears to have an example of a non-MacRoman file. So I'm going to close this issue until someone comes along with a file that is not being decoded correctly.