windows-1255 encoding: add mapping for 0xCA

Question

bhaible opened this issue 8 years ago · 11 comments

The windows-1255 encoding, as defined in the spec, does NOT map the byte 0xCA.

However, the main use of windows-1255 is as a codepage on Windows, and the native Windows converter (function MultiByteToWideChar) maps 0xCA to U+05BA. It has done so since Windows 2000, i.e. for about 15 years.

On the other hand, the codepage chart at Microsoft https://msdn.microsoft.com/en-us/library/cc195057.aspx marks this position as "not used", and the majority of non-Windows conversion software does not map the byte 0xCA.

For details of these mapping tables, see
http://haible.de/bruno/charsets/conversion-tables/index.html
http://haible.de/bruno/charsets/conversion-tables/CP1255.html
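
One way to check how a particular converter treats this byte is to decode it in isolation. Below is a minimal sketch in Python. `probe` is a hypothetical helper name, and the answer Python's own `cp1255` codec gives depends on the mapping table it ships, so no result is claimed here.

```python
from typing import Optional


def probe(codec: str, byte: int) -> Optional[int]:
    """Return the code point a codec assigns to a single byte, or None if unmapped."""
    try:
        text = bytes([byte]).decode(codec)  # strict mode: unmapped bytes raise
    except UnicodeDecodeError:
        return None
    return ord(text)


if __name__ == "__main__":
    result = probe("cp1255", 0xCA)
    # Prints either "unmapped" or the code point in U+XXXX form.
    print("unmapped" if result is None else f"U+{result:04X}")
```

Running the same probe against several converters gives the kind of side-by-side comparison shown in the tables linked above.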

To implement the change, edit index-windows-1255.txt and add the line
74 0x05BA (HEBREW POINT HOLAM HASER FOR VAV)
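
For illustration: the pointer 74 is 0xCA - 0x80, which is how a single-byte decoder looks up bytes above 0x7F in its index. The sketch below assumes a tiny hand-written slice of the index. The dictionaries and `decode_byte` are hypothetical names, not the real index file or any browser's code. It shows the effect of adding the line, with U+FFFD standing in for a decoder error.

```python
# Partial, illustrative slice of the windows-1255 index (pointer -> code point).
# Only pointer 74 is the entry proposed in this issue; 73 and 75 are its
# neighbours, included for context.
INDEX_BEFORE = {
    73: 0x05B9,  # byte 0xC9, HEBREW POINT HOLAM
    75: 0x05BB,  # byte 0xCB, HEBREW POINT QUBUTS
}
INDEX_AFTER = {**INDEX_BEFORE, 74: 0x05BA}  # HEBREW POINT HOLAM HASER FOR VAV


def decode_byte(byte: int, index: dict) -> str:
    """Decode one byte the way a single-byte decoder would, using `index`."""
    if byte < 0x80:
        return chr(byte)  # ASCII bytes map to themselves
    code_point = index.get(byte - 0x80)  # pointer = byte - 0x80
    if code_point is None:
        return "\uFFFD"  # unmapped: treated as an error, shown as U+FFFD
    return chr(code_point)


if __name__ == "__main__":
    for name, index in (("before", INDEX_BEFORE), ("after", INDEX_AFTER)):
        ch = decode_byte(0xCA, index)
        print(f"{name}: 0xCA -> U+{ord(ch):04X}")
```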

Answer 1 · 2016-10-04T07:35:15.000Z

Per https://www.w3.org/International/tests/repo/results/encoding-sb-dec#windows-1255 it's indeed only Microsoft that has failures here. However, I can't seem to run the test in Edge, and the note indicates the failures are mostly about PUA code points. @r12a?

(Note that to implement this change we'd update the JSON resource and run tools-index.py, but it's not entirely clear to me that we want to, given that the majority of implementations are aligned.)

Answer 2 · 2016-10-04T09:56:11.000Z

> it's not entirely clear to me that we want to, given that the majority of implementations are aligned

Yes, usually I follow this "majority of implementations" argument. But here, given that the main use of windows-1255 is as "a code page used under Microsoft Windows" [see https://en.wikipedia.org/wiki/Windows-1255], I would follow what the implementation of MultiByteToWideChar under Windows does: it maps 0xCA to U+05BA.

Answer 3 · 2016-10-04T10:34:33.000Z

@annevk I had no problem running the test. If you continue to have a problem, let me know.

Here's a snap of the results.

0xCA is mapped to U+05BA and called out as an error.

Answer 4 · 2016-10-04T11:21:58.000Z

The "best fit" mappings for windows-1255 (http://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WindowsBestFit/bestfit1255.txt) have the 0xCA to U+05BA mapping, by the way.

I'm OK with adding the mapping to Gecko's implementation.

Answer 5 · 2016-10-05T10:49:43.000Z

> On the other hand, the codepage chart at Microsoft https://msdn.microsoft.com/en-us/library/cc195057.aspx marks this position as "not used"

This is an outdated archive and should no longer be treated as a reference. For example, it does not contain a mapping for the euro sign.

Recently Microsoft removed the former reference site and put a link to the "best fit" mappings on unicode.org. So the "best fit" mappings should be considered as the latest reference now.

Answer 6 · 2016-10-05T11:03:20.000Z

@jungshik @hsivonen okay with you too?

Answer 7 · 2016-10-05T19:22:15.000Z

> The "best fit" mappings for windows-1255 (http://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WindowsBestFit/bestfit1255.txt) have the 0xCA to U+05BA mapping, by the way.

Thank you for the pointer to these tables. I've updated the mapping table comparison in http://haible.de/bruno/charsets/conversion-tables/CP1255.html.

FWIW, I made the corresponding change in GNU libiconv: http://git.savannah.gnu.org/gitweb/?p=libiconv.git;a=commitdiff;h=500b967b8f4bcb2bd656c293c5412dc611c5720b

Answer 8 · 2016-10-10T07:29:11.000Z

I'm OK with adding this mapping.

Answer 9 · 2016-10-23T17:21:38.000Z

Unless @jungshik objects, it seems this is ready to be merged.

Answer 10 · 2016-10-24T08:12:42.000Z

I created a PR; let me know if you see any problems. I plan on merging by end-of-day.

Answer 11 · 2016-10-24T17:00:21.000Z

I don't have any objection. I'll add that to Blink's mapping.