windows-1255 encoding: add mapping for 0xCA
bhaible opened this issue · 11 comments
The windows-1255 encoding, as currently specified in the spec, does NOT map the byte 0xCA.
However, the main use of windows-1255 is as a codepage on Windows, and the native Windows converter (function MultiByteToWideChar) maps 0xCA to U+05BA. It has done so since Windows 2000, i.e. for 15 years.
On the other hand, the codepage chart at Microsoft https://msdn.microsoft.com/en-us/library/cc195057.aspx marks this position as "not used", and the majority of non-Windows conversion software does not map the byte 0xCA.
For details of these mapping tables, see
http://haible.de/bruno/charsets/conversion-tables/index.html
http://haible.de/bruno/charsets/conversion-tables/CP1255.html
The implementation of the change would be to edit index-windows-1255.txt, adding a line
74 0x05BA (HEBREW POINT HOLAM HASER FOR VAV)
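For illustration, here is a minimal Python sketch of where the pointer 74 comes from and what the added line would change for a decoder. It assumes the usual single-byte index layout, in which pointers count up from byte 0x80. The names `pointer_for_byte`, `decode_byte`, and `PROPOSED_ENTRY` are hypothetical, and the index shown is a one-entry fragment, not the real index-windows-1255.txt.

```python
# Sketch only: a one-entry fragment standing in for index-windows-1255.txt.
# All names here are hypothetical and chosen for illustration.

REPLACEMENT = "\ufffd"  # what a decoder emits for an unmapped byte


def pointer_for_byte(byte: int) -> int:
    """Return the index pointer for a non-ASCII byte (0x80..0xFF)."""
    if not 0x80 <= byte <= 0xFF:
        raise ValueError("only bytes 0x80..0xFF are looked up in the index")
    return byte - 0x80


def decode_byte(byte: int, index: dict) -> str:
    """Decode one byte: ASCII passes through, others go via the index."""
    if byte < 0x80:
        return chr(byte)
    code_point = index.get(pointer_for_byte(byte))
    return REPLACEMENT if code_point is None else chr(code_point)


# The line proposed in this issue: pointer 74 -> U+05BA.
PROPOSED_ENTRY = {74: 0x05BA}

if __name__ == "__main__":
    print(pointer_for_byte(0xCA))                      # pointer for byte 0xCA
    print(repr(decode_byte(0xCA, {})))                 # without the new line
    print(repr(decode_byte(0xCA, PROPOSED_ENTRY)))     # with the new line
```

Without the entry, byte 0xCA has no mapping and falls back to U+FFFD; with it, the byte decodes to U+05BA.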
Per https://www.w3.org/International/tests/repo/results/encoding-sb-dec#windows-1255 it's indeed only Microsoft that has failures here. However, I can't seem to run the test in Edge, and the note indicates it's mostly about PUA code points. @r12a?
(Note that to implement this change we'd update the JSON resource and run tools-index.py, but it's not entirely clear to me that we want to, given that the majority of implementations are aligned.)
it's not entirely clear to me that we want to, given that the majority of implementations are aligned
Yes, usually I follow this "majority of implementations" argument. But here, given that the main use of windows-1255 is as "a code page used under Microsoft Windows" [see https://en.wikipedia.org/wiki/Windows-1255], I would follow what the implementation of MultiByteToWideChar under Windows does: it maps 0xCA to U+05BA.
@annevk I had no problem running the test. If you continue to have a problem, let me know.
Here's a snap of the results.
0xCA is mapped to U+05BA and called out as an error.
The "best fit" mappings for windows-1255 (http://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WindowsBestFit/bestfit1255.txt) have the 0xCA to U+05BA mapping, by the way.
I'm OK with adding the mapping to Gecko's implementation.
On the other hand, the codepage chart at Microsoft https://msdn.microsoft.com/en-us/library/cc195057.aspx marks this position as "not used"
This is an outdated archive and should no longer be treated as a reference. For example, it does not contain a mapping for the euro sign.
Recently Microsoft removed the former reference site and put a link to the "best fit" mappings on unicode.org. So the "best fit" mappings should be considered as the latest reference now.
The "best fit" mappings for windows-1255 (http://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WindowsBestFit/bestfit1255.txt) have the 0xCA to U+05BA mapping, by the way.
Thank you for the pointer to these tables. I've updated the mapping table comparison in http://haible.de/bruno/charsets/conversion-tables/CP1255.html.
FWIW, I made the corresponding change in GNU libiconv: http://git.savannah.gnu.org/gitweb/?p=libiconv.git;a=commitdiff;h=500b967b8f4bcb2bd656c293c5412dc611c5720b
I'm OK with adding this mapping.
Unless @jungshik objects, it seems this is ready to be merged.
I created a PR, let me know if you see any problems. I plan on merging by end-of-day.
I don't have any objection. I'll add that to Blink's mapping.