String encoding by platformID

Question

chrissimpkins opened this issue 8 years ago · 42 comments

Current approach:

if record.platformID == 0:
    record.string = version_string.encode('utf_16_be')  # Unicode platform ID gets UTF16 big endian
elif record.platformID == 1:
    record.string = version_string                      # Mac platform ID
elif record.platformID == 3:
    record.string = version_string.encode('utf_16_be')  # Windows platform ID gets UTF16 big endian

TODO:

add from __future__ import unicode_literals to module
let fonttools encode the proper string by platformID type (see #1 (comment))
use unicode alias and tounicode function in fonttools py23 module to encode properly before use of fonttools library for the string write (see #1 (comment))
test encoding of strings and exception handling for string encoding in Py 2 + 3

Per conversations with @davelab6 and @anthrotype, the platformID 1 version strings are not used to any significant degree any longer so the platformID writes should see limited use out there. It sounds as though there is some legacy use of the platformID 1 records on old versions of Mac (maybe even pre OS X) applications.

Review of numerous fonts (including commercial) shows that the platformID 1 version strings are almost universally present so we should still support the correct modification of this record here.

Answer 1 · 2017-09-05T14:02:50.000Z

You can just set the record.string to a unicode string (i.e. unicode type in Python 2, or str type in Python 3), and let fonttools encode it automatically for you with the correct platform encoding for that name record.
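As a rough illustration of what that automatic encoding amounts to, the sketch below maps each platformID to a codec and encodes the same text for each one. The `PLATFORM_CODECS` table and the `encode_for_platform` helper are hypothetical stand-ins written for this example, not fontTools code, and they ignore the encodingID and languageID that a real name record also carries.

```python
# Hypothetical stand-in for the per-platform encoding step; not fontTools code.
# Assumes the common cases only: Unicode (0) and Windows (3) records hold
# UTF-16 big endian data, Macintosh (1) records hold Mac Roman data.
PLATFORM_CODECS = {0: "utf_16_be", 1: "mac_roman", 3: "utf_16_be"}


def encode_for_platform(text, platform_id):
    """Encode a text string to the bytes stored for the given platformID."""
    return text.encode(PLATFORM_CODECS[platform_id])


version_string = "Version 1.000; DEV"
windows_bytes = encode_for_platform(version_string, 3)
mac_bytes = encode_for_platform(version_string, 1)

# The UTF-16 record is twice as long as the text; the Mac Roman one is not.
print(len(version_string), len(windows_bytes), len(mac_bytes))
```

The point of handing fonttools a unicode string is that this lookup stays in the library rather than in the calling code.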

I also recommend to use from __future__ import unicode_literals whenever you can, so your string literals are implicitly treated as unicode strings (without needing u"" in front of them) like in Python 3.

The fontTools.misc.py23 module has some convenience utilities and type aliases to correctly work with unicode/str/bytes in a py2/py3 environment.

Remember that the command line arguments that you get from the console (sys.argv) are bytes strings on Python 2, encoded with the console's encoding (UTF-8 on most Unix-based systems). On Python 3, on the other hand, they are already decoded to unicode strings (str). On Python 2, you'll need to know what the console's encoding is and then use it to decode the arguments to get unicode strings. sys.stdout.encoding gives you a hint, but it may be None if the standard streams have been redirected to a pipe; you can google for "python 2 sys.argv encoding" or, if you feel lazy, use "utf-8" all the time!
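A minimal sketch of that decoding step, assuming a UTF-8 fallback when the console encoding is unknown. The `decode_console_arg` helper is a hypothetical name used only for this example.

```python
import sys


def decode_console_arg(arg, fallback="utf-8"):
    """Return a command line argument as text on both Python 2 and 3."""
    if not isinstance(arg, bytes):
        return arg  # Python 3 already decoded it
    # sys.stdout.encoding may be None when output is redirected to a pipe.
    encoding = getattr(sys.stdout, "encoding", None) or fallback
    return arg.decode(encoding)


print(decode_console_arg(b"2_000"))
print(decode_console_arg("2_000"))
```

Applied to every item of sys.argv[1:], this gives text on either interpreter before anything is compared or concatenated.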

There's a tounicode function in fontTools.misc.py23 that decodes a bytes string to unicode with a given encoding or returns the string as is if it is already a unicode string; useful in these kinds of situations where the input may be either bytes or unicodes and you want unicodes. (also check out tobytes and tostr functions in same module).

Answer 2 · 2017-09-05T14:07:31.000Z

Thank you! Very helpful!

Answer 3 · 2017-09-05T14:11:05.000Z

You can just set the record.string to a unicode string (i.e. unicode type in Python 2, or str type in Python 3), and let fonttools encode it automatically for you with the correct platform encoding for that name record.

Does the use of the unicode type (even with Python interpreter checking prior to execution of the code) raise exceptions on Python 3? i.e. do I catch this exception and ignore when Python 3 interpreter in use?

Answer 4 · 2017-09-05T14:13:08.000Z

There's a tounicode function in fontTools.misc.py23 that decodes a bytes string to unicode with a given encoding or returns the string as is if it is already a unicode string; useful in these kinds of situations where the input may be either bytes or unicodes and you want unicodes. (also check out tobytes and tostr functions in same module).

Will check out these functions. Very helpful. Thank you!

Answer 5 · 2017-09-05T14:25:23.000Z

The unicode type in Python 2 is the equivalent of the str type in Python 3. Hence the py23 module in fonttools exports a unicode alias that points to unicode in Python 2 and to str in Python 3. You can import that unicode type from py23 module and use it, e.g., with isinstance.

If you want to convert a bytes string to a unicode string (in Python 2, which is the same as a str string in Python 3), you can use the tounicode function from the fonttools py23 module, which will just pass it on as is if it's already a unicode string, or decode the bytes string to a unicode string using the provided encoding (it defaults to ascii but you should provide the actual bytes string encoding; if you read the bytes from a text file, that would usually be UTF-8, if you read them from the console (in Python 2) it will be the console's encoding, etc.).
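To make the described behaviour concrete, here is a small sketch that mimics it with the standard library only. The `text_type` alias and the `to_text` function are hypothetical names for this example; they imitate what is said above about the py23 helpers and are not the fonttools implementation.

```python
import sys

# One name for "the text type" on either interpreter (hypothetical alias).
if sys.version_info[0] >= 3:
    text_type = str
else:
    text_type = unicode  # noqa: F821 (only evaluated on Python 2)


def to_text(value, encoding="ascii", errors="strict"):
    """Pass text through unchanged; decode bytes with the given encoding."""
    if isinstance(value, text_type):
        return value
    return value.decode(encoding, errors)


print(to_text(b"hello"))                    # bytes in, text out
print(to_text("hello"))                     # text passes through as is
print(to_text(b"caf\xc3\xa9", "utf-8"))     # non-ASCII needs the real encoding
```

With the default ascii encoding, the last call would raise UnicodeDecodeError, which is why passing the actual encoding matters.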

If you want to work with strings in a codebase that's meant to support both python 2 and 3, I'm afraid you need to become familiar with terms like bytes, unicode, str, encode, decode, etc.
In an ideal world where everybody is using python 3, then one would just use str for text, and bytes for binary data, without confusing one with the other like python 2 used to do. ;)

Answer 6 · 2017-09-05T14:30:17.000Z

Also note that the stuff in fonttools py23 module is nothing fancy. Other popular python packages that are not afraid of adding dependencies just use the six module (https://pythonhosted.org/six/) which provides the same (and more) functionalities. But since your library already depends on fonttools, you may well just use that.

Answer 7 · 2017-09-05T14:45:55.000Z

Hence the py23 module in fonttools exports a unicode alias that points to unicode in Python 2 and to str in Python 3. You can import that unicode type from py23 module and use it, e.g., with isinstance.

But since your library already depends on fonttools, you may well just use that.

👍 Thanks Cosimo. This is very helpful. I will tinker in Py2/3 once I get the Travis testing set up for the project.

Answer 8 · 2017-09-05T14:54:10.000Z

Added the following to the TODO list based upon above conversation:

add from __future__ import unicode_literals to module
let the fonttools library encode the proper string by platformID type, ignore this on the font-v end (see #1 (comment))
use unicode alias and tounicode function in fonttools py23 module to encode properly before use of fonttools library for the string write - i.e. before record.string definitions (see #1 (comment))
test encoding of strings and exception handling for string encoding in Py 2 + 3 on Travis

Answer 9 · 2017-09-05T20:42:12.000Z

👍 tests that are Py2 + 3 compatible...

from __future__ import unicode_literals

import sys  # needed for the sys.version_info checks below

from fontTools.misc.py23 import unicode, tounicode, tobytes, tostr


def test_fontv_fonttools_lib_unicode():
    test_string = tobytes("hello")
    test_string_str = tostr("hello")
    test_string_unicode = tounicode(test_string, 'utf-8')
    test_string_str_unicode = tounicode(test_string_str, 'utf-8')

    assert (isinstance(test_string, unicode)) is False
    if sys.version_info[0] == 2:
        assert (isinstance(test_string_str, unicode)) is False     # str != unicode in Python 2
    elif sys.version_info[0] == 3:
        assert (isinstance(test_string_str, unicode)) is True      # str = unicode in Python 3
    assert (isinstance(test_string_unicode, unicode)) is True      # after cast with fonttools function, Py2+3 = unicode
    assert (isinstance(test_string_str_unicode, unicode)) is True  # ditto
    assert test_string_unicode == "hello"

Will cast everything that comes from command line to unicode with tounicode function.

Answer 10 · 2017-09-06T01:53:42.000Z

My head is spinning after a lengthy attempt to perform a string equality comparison between the name record record.string that is read in from fonttools and a string literal across combinations of the unicode_literals import, attempts to cast to unicode with fonttools tounicode function, use of str.decode('utf-8'), etc. I need to do this in order to confirm that I am not saving a previous sha1 string/dev string/release string after I split the complete version string on semicolons to a list.

What I am looking to do is something along these lines:

version_list = record.string.split(";")
keep_list = []
for substring in version_list[1:]: # exclude the Version X.X string that is held in another variable
    if substring.strip() == "DEV" or substring.strip() == "RELEASE":
        pass
    else:
        keep_list.append(substring)

post_version_string = ";".join(keep_list)

The full version string is concatenated from variables above this level, new content based upon the user's command line request, and the above post_version_string, which is appended to maintain anything that the user previously had following a semicolon.
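The steps above can be sketched as one function that works on already decoded text. `rebuild_version_string`, `STATE_TAGS` and `SHA1_PATTERN` are hypothetical names, the short sha1 pattern (seven hex digits with an optional -dev or -release suffix) is an assumption, and the ttfautohint substring is made-up sample data.

```python
import re

STATE_TAGS = {"DEV", "RELEASE"}
# Assumption: a previously written short sha1 is seven hex digits,
# optionally followed by -dev or -release.
SHA1_PATTERN = re.compile(r"^[0-9a-f]{7}(-dev|-release)?$")


def rebuild_version_string(old_version_string, new_head):
    """Replace the head of a version string, keeping unrelated metadata."""
    keep = []
    # Skip item 0: the "Version X.XXX" part is supplied through new_head.
    for part in old_version_string.split(";")[1:]:
        stripped = part.strip()
        if stripped in STATE_TAGS or SHA1_PATTERN.match(stripped):
            continue  # drop state tags and sha1 strings from an earlier run
        keep.append(stripped)
    return "; ".join([new_head] + keep)


old = "Version 1.000; DEV; ttfautohint (v1.6)"
print(rebuild_version_string(old, "Version 1.001; 8ca53c6-dev"))
```

Unlike the loop above, this variant strips each kept substring and rejoins with "; ", so the spacing after semicolons is normalized.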

Thoughts? Despite being able to cast everything to same type in Python 2 str v str and unicode v unicode, the string equality comparison always yields False. String lengths (len(s)) differ despite being of the same type.

This is what I am seeing as the default for the version_list with data read from fonttools following the split on semicolons for a font with a version string that reads Version 1.000; DEV:

['\x00\x00\x00V\x00\x00\x00e\x00\x00\x00r\x00\x00\x00s\x00\x00\x00i\x00\x00\x00o\x00\x00\x00n\x00\x00\x00 \x00\x00\x001\x00\x00\x00.\x00\x00\x000\x00\x00\x000\x00\x00\x000\x00', '\x00 \x00D\x00E\x00V']

and here is a list of string literals with same data using the from __future__ import unicode_literals import without any special string formatting in the list (simply between double quotes):

[u'Version 1.000', u' DEV']

This is coming from nameID 5, platformID 3.

Answer 11 · 2017-09-06T07:01:08.000Z

When fonttools decompiles a font, the record.string you get is of type bytes; if you want to decode it to a Unicode string you need to call the NameRecord's toUnicode() method. That's used e.g. when fonttools dumps the name record to TTX.
Check the inline documentation in the name table module.
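A small sketch of why the comparison fails on the raw bytes and works after decoding. It decodes by hand to stay self-contained and assumes the record holds UTF-16 big endian data; with fonttools itself the decoding step would be the toUnicode() call mentioned above.

```python
# Bytes of the kind a platformID 3 name record holds (assumed UTF-16 BE).
raw = "Version 1.000; DEV".encode("utf_16_be")

# Splitting the raw bytes leaves NUL bytes interleaved with the characters,
# so neither the comparison nor the length matches the plain text.
raw_parts = raw.split(b";")
print(raw_parts[1].strip() == b"DEV")  # False
print(len(raw_parts[1]), len(" DEV"))

# Decode first, then split and compare as text.
text = raw.decode("utf_16_be")
text_parts = [part.strip() for part in text.split(";")]
print(text_parts[1] == "DEV")  # True
```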

Answer 12 · 2017-09-06T07:07:04.000Z

https://github.com/fonttools/fonttools/blob/master/Lib/fontTools/ttLib/tables/_n_a_m_e.py#L326

Answer 13 · 2017-09-06T13:02:25.000Z

I think that this was the problem. The data must be UTF-16 big endian encoded (or in some other non-UTF-8 encoding). I was trying to cast to unicode with the fontTools.misc.py23 tounicode function and specifying utf8. Will check it today with the table specific unicode casting function. Thanks Cosimo.

Answer 14 · 2017-09-06T13:05:15.000Z

It looks like the name table specific toUnicode function uses tounicode internally so I am working under the assumption that we are still going to be dealing with str types on Py3 and unicode types on Py 2. We will see...

Answer 15 · 2017-09-06T13:06:46.000Z

yes, toUnicode() method of NameRecord returns a unicode string in Python 2 (which is the same as a str string in Python 3). It automatically decodes it using the right encoding.

Answer 16 · 2017-09-06T13:09:34.000Z

Perfect! Thank you Cosimo!

A very valuable (and painful) lesson in Py2/3 string handling :)

Answer 17 · 2017-09-06T13:13:25.000Z

Are the OpenType tables stored in binaries as C structs with a defined byte length of padding? This appears to be how they are being unpacked in fonttools.

Answer 18 · 2017-09-06T13:26:30.000Z

You ask out of curiosity? That's an implementation detail.
You can trust FontTools is Doing The Right Thing (TM).

Answer 19 · 2017-09-06T13:27:46.000Z

I'm joking of course. Have a look at the compile method of the name table class if you are interested in understanding how that's done.
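For the curious, a rough sketch of the fixed-size part of a name record as the OpenType specification lays it out: six big-endian unsigned 16-bit fields with no padding between them. This illustrates the on-disk layout with the standard struct module and is not the fonttools compile code; the field values are example data.

```python
import struct

# platformID, encodingID, languageID, nameID, length, string offset
NAME_RECORD_FORMAT = ">6H"

string_data = "Version 1.000; DEV".encode("utf_16_be")

# Example values: Windows platform, Unicode BMP encoding, en-US, nameID 5.
record = struct.pack(NAME_RECORD_FORMAT, 3, 1, 0x409, 5, len(string_data), 0)

print(struct.calcsize(NAME_RECORD_FORMAT))        # bytes per record
print(struct.unpack(NAME_RECORD_FORMAT, record))  # fields read back
```

The string bytes themselves live in a separate storage area of the table; each record only stores a length and an offset into it.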

Answer 20 · 2017-09-06T13:28:26.000Z

Oh yes, simply for my own knowledge. I fully trust fonttools. Was going to dust off my C and tinker a bit to better understand the binary.

Answer 21 · 2017-09-06T13:52:17.000Z

Your suggestions addressed the issue. Fixed!

Answer 22 · 2017-09-06T13:55:19.000Z

Nice! if your terminal supports UTF-8, you should even be able to put an emoji in your version string 😄 (well, not in the Mac Roman one of course but that's legacy)

Answer 23 · 2017-09-06T14:01:07.000Z

sorry, actually I just realised that the tool is not meant to take arbitrary strings from the console, but just numeric strings like "2_000", so all my ramblings on the encoding of sys.argv are misplaced since you're guaranteed to receive only ascii characters like digits and underscores.
You can decode the arguments with tounicode(s, encoding="ascii") and if it fails with UnicodeDecodeError, it's the user's fault ;)
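Something along these lines, as a sketch using only the standard library. `parse_version_arg` is a hypothetical helper, and turning the decode failure into a ValueError with a readable message is just one way to report it.

```python
def parse_version_arg(arg):
    """Return a version argument such as "2_000" as text, ASCII only."""
    try:
        if isinstance(arg, bytes):
            return arg.decode("ascii")  # Python 2 sys.argv items are bytes
        arg.encode("ascii")             # text input: only check it is ASCII
        return arg
    except (UnicodeDecodeError, UnicodeEncodeError):
        raise ValueError("version argument must contain only ASCII characters")


print(parse_version_arg(b"2_000"))
```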

Answer 24 · 2017-09-06T14:01:10.000Z

Nice! if your terminal supports UTF-8, you should even be able to put an emoji in your version string 😄 (well, not in the Mac Roman one of course but that's legacy)

haha! for a future release perhaps?

Answer 25 · 2017-09-06T14:02:06.000Z

tounicode(s, encoding="ascii") and if it fails with UnicodeDecodeError, it's the user's fault ;)

I like it when it is not my fault...

Answer 26 · 2017-09-06T14:02:51.000Z

Is the default 'ascii' encoding on tounicode stable? necessary to specify it explicitly?

Answer 27 · 2017-09-06T14:03:31.000Z

it's the default, it won't change. you can just call tounicode(s)

Answer 28 · 2017-09-06T14:03:50.000Z

but explicit is better than implicit ;)

Answer 29 · 2017-09-06T14:04:26.000Z

sorry, actually I just realised that the tool is not meant to take arbitrary strings from the console

Currently supports the following:

modification of the version integers based upon console entry
addition of a DEV or RELEASE tag in the string as additional (post semicolon) meta data
addition of the repository git commit short sha1 string with optional -dev or -release tag added to the sha1

Answer 30 · 2017-09-06T14:07:09.000Z

So you can make the following:

Version 1.001
Version 1.001; DEV or Version 1.001; RELEASE
Version 1.001; 8ca53c6
Version 1.001; 8ca53c6-dev or Version 1.001; 8ca53c6-release
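A sketch of how those four shapes could be produced from their parts. `build_version_string` and its parameters are hypothetical names for this example and not the tool's actual code.

```python
def build_version_string(version, sha1=None, state=None):
    """Assemble a version string; state is "dev", "release" or None."""
    head = "Version " + version
    if sha1 and state:
        return head + "; " + sha1 + "-" + state.lower()
    if sha1:
        return head + "; " + sha1
    if state:
        return head + "; " + state.upper()
    return head


print(build_version_string("1.001"))
print(build_version_string("1.001", state="dev"))
print(build_version_string("1.001", sha1="8ca53c6"))
print(build_version_string("1.001", sha1="8ca53c6", state="release"))
```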

Answer 31 · 2017-09-06T14:07:36.000Z

cool. all ascii-safe

Answer 32 · 2017-09-06T14:07:42.000Z

We are going to use it to add the build git commit SHA1 to Hack

Answer 33 · 2017-09-06T14:09:23.000Z

Thanks again for all of your help Cosimo. I really appreciate all of your time. This was helpful.

Answer 34 · 2017-09-07T17:54:31.000Z

@anthrotype

Cosimo, is any of this code of interest in fontmake? If you have any interest in supporting this version (dev/release/sha1) approach, it probably makes more sense to have in a compiler rather than yet one more post compilation adjustment tool. The tool here could be used for one off changes and for those who do not use fontmake.

Answer 35 · 2017-09-07T18:01:54.000Z

I'm not particularly interested myself, but you are welcome to open an issue on fontmake's repo and see whether others would be interested, or you can propose a pull request.

Answer 36 · 2017-09-07T19:15:55.000Z

@anthrotype no worries. if there isn't buy-in from maintainer won't be worth the time at this stage. will forge ahead here and see if the approach is attractive to anyone else out there. if so will pitch it on fontmake repo down the road :)

thanks!

Answer 37 · 2017-10-02T01:52:19.000Z

@anthrotype Cosimo, I am getting the following error when I concatenate strings pulled in from fontTools nameID 5 with the git sha1 strings and string literals (e.g. '-dev', '-release') here.

#12

It seems that the parts (i.e. the concatenated portions between semicolons) are not encoded in the same way. This leads to odd spacing between characters, shown in the image in that issue report, and more importantly, the version string is not displayed in FontBook on OS X.

The current approach is to pull the "pre" version string as built from source via the fontTools namerecord.toUnicode() method:

namerecord_list = tt['name'].names
for record in namerecord_list:
    if record.nameID == 5:
        # ... full nameID record string is cast with .toUnicode method from fontTools library

then I split the string on semicolons into a Python list. The list parts are concatenated to a new nameID 5 version string with a combination of string literals and the git SHA1 short code that is returned by the gitpython library. It seems that the second list item that includes the sha1 string +/- any string literals is properly formatted but the Version X.XXX string (direct from fontTools library and maintained in the list) and the ttfautohint... string (direct from fontTools library and maintained in the list prior to concatenation) do not appear to have proper formatting following the write out to the font binary.

Do I need to encode the entire concatenated string in some way before I send it to fontTools for the write? I am assuming that I am dealing with different string formats for some reason here, but thought that Python (during concatenation) or fontTools would automate these casts in a way that it does not need to be explicit on my end. Thoughts?

Previously I was using the following for platformID 3 prior to the write with fontTools:

record.string = version_string.encode('utf_16_be')
# then write with fontTools

Answer 38 · 2017-10-02T14:38:05.000Z

the git SHA1 short code that is returned by the gitpython library

Are you sure that is a unicode string and not a bytes string? Use the tounicode() function to ensure it is a unicode string (import it from the fontTools.misc.py23 module).

Concatenating unicode and bytes in Python 2 would automatically decode the latter (upgrade them to unicode) using the default ascii encoding. In Python 3, attempting to do u"abc" + b"abc" raises TypeError (rightly so). So, in a py2/py3 world, it's better to make sure you are dealing with one or the other, and not mix them up.
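A short sketch of the Python 3 behaviour and the explicit fix. The bytes value stands in for whatever a library hands back; whether gitpython actually returns bytes here is not established in this thread.

```python
sha1 = b"8ca53c6"  # assumed bytes value, for illustration only

try:
    combined = "Version 1.001; " + sha1  # str + bytes
except TypeError:
    # Python 3 refuses to mix the two types: decode explicitly, then join.
    combined = "Version 1.001; " + sha1.decode("ascii")

print(combined)
```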

The NameRecord string attribute can take either a unicode string or a bytes string. In the first case, it will encode it automatically for you next time you compile; in the latter case, it will assume you already encoded the bytes string with the right encoding and will just write it as is. If the bytes were not encoded with the right encoding, you may get garbage like the one in the screenshot.
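To see what "garbage" can look like, this sketch encodes the same text two ways and reads each back with the wrong codec. It only demonstrates the general effect with standard codecs; whether either mismatch is exactly what produced the screenshot cannot be told from the thread.

```python
version_string = "Version 1.000; DEV"

correct = version_string.encode("utf_16_be")  # what a platformID 3 record expects
wrong = version_string.encode("ascii")        # bytes in the wrong encoding

# ASCII bytes read as UTF-16 BE: pairs of bytes fuse into unrelated characters.
fused = wrong.decode("utf_16_be")
print(len(version_string), len(fused))

# UTF-16 BE bytes read one byte at a time: a NUL sits before every character,
# which some viewers render as odd spacing.
spaced = correct.decode("latin-1")
print(repr(spaced[:8]))
```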

Answer 39 · 2017-10-02T15:02:50.000Z

Hmmm... that may be the case. Interestingly that is the portion of the string (the git sha1) that does not seem to be affected in the font's version string after write. Will attempt to cast the string returned by gitpython to unicode with tounicode from the py23 module. It should not be necessary to cast the string literals explicitly before concatenation correct? My understanding is that the addition of the line

from __future__ import unicode_literals

should force string literals in Py2 to unicode and Py3 as str? If so, this should be consistent with where we are with the other substrings from the original font when we import from the record.string with the .toUnicode() method?

Answer 40 · 2017-10-02T15:07:48.000Z

correct, if you use unicode_literals.

Answer 41 · 2017-10-02T15:17:15.000Z

Thank you! Will give it a shot this evening. Hopefully the fix is this simple... :)

Answer 42 · 2017-10-03T03:28:37.000Z

Seems to have fixed it!

Thanks again for your help Cosimo! Greatly appreciated!!