jamesturk/jellyfish

String Functions should operate on wchar_t instead of char

stevvooe opened this issue · 6 comments

First, I'd like to say that this library is awesome. It has all the greats in one place and doesn't try to do anything too fancy. The implementations are clean as well.

Currently, the library doesn't really have support for unicode strings. If I try to pass in a unicode string with a non-ASCII character, I get a traceback:

>>> jellyfish.hamming_distance(u'\u725b'.encode('utf8'), u'\u4faf')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\u4faf' in position 0: ordinal not in range(128)

In addition, when one encodes the strings first so that they will be accepted (i.e. passes encoded byte strings rather than unicode objects), the answer is incorrect. This example should give 1, rather than 3:

>>> jellyfish.hamming_distance(u'\u725b'.encode('utf8'), u'\u4faf'.encode('utf8'))
3
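(For reference, the 3 comes from comparing the UTF-8 bytes position by position; both characters encode to three bytes, and all three bytes differ:)

>>> u'\u725b'.encode('utf8')
'\xe7\x89\x9b'
>>> u'\u4faf'.encode('utf8')
'\xe4\xbe\xaf'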

This obviously doesn't make sense for things like soundex or other English-only algorithms, but as a general rule, python libraries should accept only unicode objects for string operations.

Patching this library to support unicode objects rather than string objects shouldn't be too bad. You just need to replace PyString_FromString with PyUnicode_FromWideChar and update the functions to use wchar_t. Here is the python C API unicode reference:

http://docs.python.org/c-api/unicode.html

If you want, I can fork this and send you a pull request. Just let me know.

Thanks for the feedback. I would definitely like to see jellyfish work with unicode strings, but there are still some complications that I haven't worked out.

First, I don't think we can just convert char to wchar_t wherever it occurs. As I understand the current state of Unicode and Python:

  • sizeof(wchar_t) varies from platform to platform - it's 4 on most Unix-like systems but 2 on Windows
  • sizeof(Py_UNICODE), python's internal unicode character representation, varies based on a compiler flag. It defaults to 2 (i.e. UCS2) but many Linux distributions override it and ship w/ UCS4 python builds.
  • PyUnicode_AsWideChar does not perform any encoding conversion.

As a result, a straightforward char->wchar_t find-and-replace could result in our functions returning different results on different platforms and python installations (when called on strings with characters outside the Basic Multilingual Plane, at least).
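(As an aside, sys.maxunicode is a quick way to check which kind of build you're running; on my UCS2 build, for example:)

>>> import sys
>>> sys.maxunicode   # 0xFFFF on a UCS2 build, 0x10FFFF on a UCS4 build
65535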

Instead, I'd rather our C functions accept uint32_t* and always encode the strings we get from python as UTF-32. This raises some more issues, however. Consider the grapheme 'à'. This can be represented by the single unicode code point 0x00E0 or by the pair 0x0061 0x0300 (an 'a' codepoint followed by a combining grave codepoint). This can result in a situation like this:

>>> s1 = u"\u00E0"
>>> s2 = u"\u0061\u0300"
>>> print s1
à
>>> print s2
à
>>> jellyfish.levenshtein_distance(s1, s2)
2

which is probably not ideal for most real-world use cases of Levenshtein distance. One solution is to first normalize unicode strings. The Unicode standard defines a normal form, NFC, which will collapse u"\u0061\u0300" (and many similar cases) down to its single-codepoint equivalent u"\u00E0". However, there are still many graphemes which do not have a single-codepoint equivalent and will cause weird results. 'и̏' must be represented as a series of 2 codepoints (an 'и' codepoint followed by a combining double grave codepoint). As a result, even with unicode normalization you will get:

>>> jellyfish.levenshtein_distance(u"и̏", u"j")
2

Again, probably not what the user was actually looking for.
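To make that concrete, here is what NFC does to each of those sequences, using the standard library's unicodedata module:

>>> import unicodedata
>>> len(unicodedata.normalize('NFC', u'\u0061\u0300'))   # collapses to u'\u00e0'
1
>>> len(unicodedata.normalize('NFC', u'\u0438\u030f'))   # no single-codepoint equivalent exists
2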

  • sizeof(wchar_t) varies from platform to platform - it's 4 on most Unix-like systems but 2 on Windows
  • sizeof(Py_UNICODE), python's internal unicode character representation, varies based on a compiler flag. It defaults to 2 (i.e. UCS2) but many Linux distributions override it and ship w/ UCS4 python builds.

If you look at the python source, in Objects/unicodeobject.c, you will see that this situation is handled: when Py_UNICODE_SIZE is 2 bytes, PyUnicode_FromWideChar converts to UTF-16 as it copies out of the wchar_t buffer.

  • PyUnicode_AsWideChar does not perform any encoding conversion.

This conversion may not be necessary. The python unicode strings are already stored as UTF-16 or UTF-32. More libraries would have this issue if it wasn't handled correctly in the Python library. Perhaps shooting a mail over to the python-dev list might help?

As a result, a straightforward char->wchar_t find-and-replace could result in our functions returning different results on different platforms and python installations (when called on strings with characters outside the Basic Multilingual Plane, at least).

We should just build up a number of test cases for different strings that may expose possible bugs. There is already some protection against using non BMP characters on a UCS2 system.
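Something along these lines, for example (just a sketch; the expected value of 1 is what a codepoint-based implementation should return, while code comparing UTF-16 code units one at a time would return 2 for both on a UCS2 build):

    import jellyfish

    s1 = u'\U0001d11e'   # MUSICAL SYMBOL G CLEF, outside the BMP
    s2 = u'\U00010400'   # DESERET CAPITAL LETTER LONG I, outside the BMP
    assert jellyfish.hamming_distance(s1, s2) == 1
    assert jellyfish.levenshtein_distance(s1, s2) == 1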

Instead, I'd rather our C functions accept uint32_t* and always encode the strings we get from python as UTF-32. This raises some more issues, however.

I would shy away from this approach. There would be an overhead associated with supporting this decoding and encoding every time a function is called. Since this is a python library, maybe you could just operate on the Py_UNICODE type.

The Unicode standard defines a normal form...

This is a problem I have run into before. Rather than have your library modify the user's precious strings, I would just include a note about this in the documentation, along with a best practice for handling this situation.

We should just build up a number of test cases for different strings that may expose possible bugs. There is already some protection against using non BMP characters on a UCS2 system.

Whether python uses UCS2 or UTF-16 internally is blurry - surrogate pairs can definitely get introduced into Py_UNICODE buffers, as PyUnicode_FromWideChar explicitly does when sizeof(wchar_t) == 4 and sizeof(Py_UNICODE) == 2, and as the various PyUnicode_Decode functions will on 'UCS2' python installations (see http://www.mail-archive.com/python-dev@python.org/msg03325.html for discussion, or the functions in unicodeobject.c). As a result, the output of PyUnicode_AsWideChar can also contain surrogate pairs, which will completely throw off our distance functions.
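You don't even need the C API to see this; on a narrow (sizeof(Py_UNICODE) == 2) build:

>>> s = u'\U0001d11e'          # MUSICAL SYMBOL G CLEF, codepoint above 0xFFFF
>>> len(s)                     # stored internally as a surrogate pair
2
>>> [hex(ord(c)) for c in s]
['0xd834', '0xdd1e']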

I would shy away from this approach. There would be an overhead associated with supporting this decoding and encoding every time a function is called. Since this is a python library, maybe you could just operate on the Py_UNICODE type.

The encoding/decoding would only actually be necessary when sizeof(Py_UNICODE) != 4. Relying on Py_UNICODE has the same problems as relying on wchar_t - namely, the possible existence of surrogate pairs.

Whether python uses UCS2 or UTF-16 internally is blurry

I'm pretty sure it's UCS-2 or UCS-4, not UTF-16. The Py_UNICODE type is a wide character, whereas UTF-16 is defined for a byte stream.

As a result, the output of PyUnicode_AsWideChar can also contain surrogate pairs, which will completely throw off our distance functions.

So, for surrogates, I was saying that it would be best to just treat these as characters and notify users via documentation that they should normalize if they want truly correct results.

The encoding/decoding would only actually be necessary when sizeof(Py_UNICODE) != 4. Relying on Py_UNICODE has the same problems as relying on wchar_t - namely, the possible existence of surrogate pairs.

Actually, it turns out that Py_UNICODE is wchar_t:

configure:12358:    PY_UNICODE_TYPE="wchar_t"
Include/unicodeobject.h:137:typedef PY_UNICODE_TYPE Py_UNICODE;

As a result, a straightforward char->wchar_t find-and-replace could result in our functions returning different results on different platforms and python installations (when called on strings with characters outside the Basic Multilingual Plane, at least).

I agree that you won't be able to do a straight find-and-replace; the wide-character string functions will have to be used. Again, if you want, I have no problem rolling a patch for this.

I apologize that the below gets into a lot of gritty details; for a previous project I had to become intimately familiar with large parts of the Unicode spec. The short summary is that I agree with you that we shouldn't be normalizing strings behind the user's back, but I think the only sensible, cross-platform way to do this is by dealing with UTF-32 internally (converting from python's internal UTF-16 when necessary).

I'm pretty sure it's UCS-2 or UCS-4, not UTF-16. The Py_UNICODE type is a wide character, whereas UTF-16 is defined for a byte stream.

The (first) definition of PyUnicode_FromWideChar (which is used when converting from 4-byte wchar_t's on a sizeof(Py_UNICODE) == 2 installation) is illustrative of why this is not the case. First, you've got the comment that prefaces it ("Here sizeof(wchar_t) is 4 but Py_UNICODE_SIZE == 2, so we need to convert from UTF32 to UTF16."), but then if you look inside the definition:

    {
        register Py_UNICODE *u;
        u = PyUnicode_AS_UNICODE(unicode);
        for (i = size; i > 0; i--) {
            if (*w > 0xFFFF) {
                wchar_t ordinal = *w++;
                ordinal -= 0x10000;
                *u++ = 0xD800 | (ordinal >> 10);
                *u++ = 0xDC00 | (ordinal & 0x3FF);
            }
            else
                *u++ = *w++;
        }
    }

0xFFFF is the highest codepoint that UCS2 can represent. When FromWideChar comes across a codepoint higher than that it does some bit manipulation and inserts 2 Py_UNICODEs into the internal string - that's a UTF-16 surrogate pair that it's constructing. Surrogate pairs are not legal UCS2 (their presence is one of the defining differences between UCS2 and UTF-16); Python's internal encoding is thus not strictly UCS2 on sizeof(Py_UNICODE) == 2 installations. Similar code can be found in python's other decoding functions.
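Running that arithmetic by hand for a codepoint above 0xFFFF, say U+1D11E, shows the surrogate pair being built:

>>> ordinal = 0x1D11E - 0x10000
>>> hex(0xD800 | (ordinal >> 10))    # high surrogate
'0xd834'
>>> hex(0xDC00 | (ordinal & 0x3FF))  # low surrogate
'0xdd1e'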

So, for surrogates, I was saying that it would be best to just treat these as characters and notify users via documentation that they should normalize if they want truly correct results.

I think we're conflating the issue of surrogate pairs (two UTF-16 code-units being used to represent a single unicode codepoint) with grapheme clusters (multiple unicode codepoints being used to represent a single grapheme, or "user-perceived character" as the standard calls it).

Unicode normalization can be used to collapse a grapheme cluster such as u"\u0061\u0300" (à) into a single codepoint u"\u00E0", when one exists. This conversion is encoding-independent because normalization operates on codepoints, not encoded bytes. I'm willing to grant that applying this kind of normalization behind the user's back is probably not a good idea - and, in fact, an edit distance function that operates on a codepoint basis w/o any normalization does make sense in many circumstances.

Surrogate pairs, however, are a different story. Outside of a few specific situations (e.g. determining exactly how much storage space a string will take), it is almost never semantically correct for a function that accepts UTF-16 to just treat it as a series of 16-bit blobs (i.e. to ignore surrogate pairs) -- it certainly doesn't make sense for any of the functions in jellyfish.

Unfortunately there's not a straightforward way to just leave it up to the user to 'normalize' any strings that might contain surrogate pairs if they desire correct behavior. Consider a user on a sizeof(Py_UNICODE) == 2 system. The user has some unicode objects with high codepoints in them that he wants to compare, but we wrote our functions to accept internal python strings without regard for surrogate pairs. The user thinks "well, I'll re-encode these strings as UTF-32 to get rid of the surrogate pairs so that the jellyfish functions will get me reasonable answers." Calling s.encode('UTF-32') on a string doesn't change its internal representation, however -- it returns a new str (python < 3) or bytes (python >= 3) instance. Now he passes these UTF-32 encoded objects to our functions, which have no way of knowing that the PyString (or PyBytes) that they just got are supposed to be interpreted as UTF-32 instead of ASCII.
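A quick demonstration of the encode behavior mentioned above (output from a python 2 session):

>>> s = u'\u00e0'
>>> t = s.encode('utf-32')
>>> type(t)                 # a brand new byte string; s itself is unchanged
<type 'str'>
>>> isinstance(t, unicode)
False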

We're left with a few options:

  • Accept wchar_t or Py_UNICODE and just ignore the issue of surrogate pairs. This will produce results that are just plain wrong on some platforms (mostly OS X and Windows, as it works out). There's no reasonable way for the user to circumvent our choice.
  • Accept wchar_t or Py_UNICODE and handle surrogate pairs ourselves. This is feasible, but would require sprinkling this logic across a bunch of functions and doesn't carry many advantages over the option below.
  • Convert unicode strings to UTF-32 before processing. This will incur a very small performance hit on some systems (mostly OS X and Windows systems, again).

I'm inclined towards the last option.
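In Python terms, the conversion I'm describing amounts to something like this (just a sketch; the real work would happen in C while filling the uint32_t buffer, and I'd want tests confirming that the UTF-32 codec joins surrogate pairs on narrow builds):

>>> import struct
>>> s = u'\U0001d11e\u00e0'          # one non-BMP codepoint plus one BMP codepoint
>>> buf = s.encode('utf-32-le')      # roughly what the C functions would see as uint32_t*
>>> len(buf) // 4                    # two codepoints, even though len(s) == 3 on a UCS2 build
2
>>> struct.unpack('<2I', buf) == (0x1D11E, 0x00E0)
True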

Actually, it turns out that Py_UNICODE is wchar_t:

That's installation-dependent. On my machine (OS X), with 4-byte wchar_t's and 2-byte Py_UNICODE's, the two are not equivalent. Note that an alternative branch inside of the configure script sets PY_UNICODE_TYPE="unsigned short". From the docs:

On platforms where wchar_t is available and compatible with the chosen Python Unicode build variant, Py_UNICODE is a typedef alias for wchar_t to enhance native platform compatibility. On all other platforms, Py_UNICODE is a typedef alias for either unsigned short (UCS2) or unsigned long (UCS4).

In fact, python can be compiled on systems without a wchar.h header. Note that the definitions of PyUnicode_AsWideChar and PyUnicode_FromWideChar are contained within #ifdef HAVE_WCHAR_H blocks - on systems without wchar.h these functions are simply unavailable, as the docs further illustrate, beginning the section explaining them with "wchar_t support for platforms which support it:".

For posterity:

Jellyfish 0.5 makes the decision to use Unicode for all strings. Given the shift to Python 3 this felt like the right thing to do, and while there will certainly be quite a few edge cases as discussed above, Python 3.3+ representing all strings as full code points (PEP 393) hopefully takes care of some of the major objections.

Some bugs may not be practical to fix on Python 2, so I'd expect that as issues arise Jellyfish may start treating Python 2 as a second-class citizen, and I'd venture a guess that, whatever form it takes, Jellyfish 1.0 will not have official support for Python 2.