String encoding by platformID
chrissimpkins opened this issue · 42 comments
Current approach:
if record.platformID == 0:
    record.string = version_string.encode('utf_16_be')  # Unicode platform ID gets UTF16 big endian
elif record.platformID == 1:
    record.string = version_string  # Mac platform ID
elif record.platformID == 3:
    record.string = version_string.encode('utf_16_be')  # Windows platform ID gets UTF16 big endian
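A minimal sketch of the same explicit approach with the per-platform codecs kept in a lookup table. The names `PLATFORM_CODECS` and `encode_for_platform` are hypothetical, and the `mac_roman` codec for platformID 1 is an assumption that only fits the common Roman script case; the snippet above writes that record unencoded.

```python
# Hypothetical lookup table: name record platformID -> Python codec name.
PLATFORM_CODECS = {
    0: "utf_16_be",  # Unicode platform
    1: "mac_roman",  # Macintosh platform (assumed: Roman script)
    3: "utf_16_be",  # Windows platform
}


def encode_for_platform(version_string, platform_id):
    """Encode a text version string for the given name record platformID."""
    try:
        codec = PLATFORM_CODECS[platform_id]
    except KeyError:
        raise ValueError("unsupported platformID: %r" % platform_id)
    return version_string.encode(codec)
```

An unknown platformID raises `ValueError` rather than silently writing an unencoded string.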
TODO:
- add from __future__ import unicode_literals to module
- let fonttools encode proper string by platformID type (see #1 (comment))
- use unicode alias and tounicode function in fonttools py23 module to encode properly before use of fonttools library for the string write (see #1 (comment))
- test encoding of strings and exception handling for string encoding in Py 2 + 3
Per conversations with @davelab6 and @anthrotype, the platformID 1 version strings are no longer used to any significant degree, so the platformID 1 writes should see limited use out there. It sounds as though there is some legacy use of the platformID 1 records in applications on old versions of Mac (maybe even pre OS X).
Review of numerous fonts (including commercial) shows that the platformID 1 version strings are almost universally present so we should still support the correct modification of this record here.
You can just set the record.string to a unicode string (i.e. unicode type in Python 2, or str type in Python 3), and let fonttools encode it automatically for you with the correct platform encoding for that name record.
I also recommend using from __future__ import unicode_literals whenever you can, so your string literals are implicitly treated as unicode strings (without needing u"" in front of them) like in Python 3.
The fontTools.misc.py23 module has some convenience utilities and type aliases to correctly work with unicode/str/bytes in a py2/py3 environment.
Remember that the list of command line arguments that you get from the console (sys.argv) are bytes strings on Python 2, and these are encoded with the console's encoding (UTF-8 on most Unix based systems). On Python 3 on the other hand, they are already decoded to unicode strings (str). In the case of python 2, you'll need to know what the console's encoding is (sys.stdout.encoding gives you a hint, but the latter may be None if standard streams have been redirected to a pipe; you can google for "python 2 sys.argv encoding" or if you feel lazy use "utf-8" all the time!), and then use the console's encoding to decode the arguments to get the unicode strings.
There's a tounicode function in fontTools.misc.py23 that decodes a bytes string to unicode with a given encoding or returns the string as is if it is already a unicode string; useful in these kinds of situations where the input may be either bytes or unicodes and you want unicodes. (also check out the tobytes and tostr functions in the same module).
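A rough sketch of the behaviour described above, written with the standard library only so it runs without fonttools. `to_text` and `decode_args` are hypothetical stand-ins, not the fontTools implementation, and the UTF-8 fallback is the "lazy" assumption mentioned above.

```python
import sys


def to_text(value, encoding="ascii", errors="strict"):
    """Return text: decode bytes with `encoding`, pass text through unchanged."""
    if isinstance(value, bytes):
        return value.decode(encoding, errors)
    return value


def decode_args(args, encoding=None):
    """Decode command line arguments that may be bytes (Python 2) or text (Python 3)."""
    # sys.stdout.encoding may be None when output is piped; fall back to UTF-8.
    encoding = encoding or getattr(sys.stdout, "encoding", None) or "utf-8"
    return [to_text(arg, encoding) for arg in args]
```

Typical use would be `args = decode_args(sys.argv[1:])` at the top of the command line entry point, so everything downstream only sees text.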
Thank you! Very helpful!
You can just set the record.string to a unicode string (i.e. unicode type in Python 2, or str type in Python 3), and let fonttools encode it automatically for you with the correct platform encoding for that name record.
Does the use of the unicode type (even with Python interpreter checking prior to execution of the code) raise exceptions on Python 3? i.e. do I catch this exception and ignore when Python 3 interpreter in use?
There's a tounicode function in fontTools.misc.py23 that decodes a bytes string to unicode with a given encoding or returns the string as is if it is already a unicode string; useful in these kinds of situations where the input may be either bytes or unicodes and you want unicodes. (also check out tobytes and tostr functions in same module).
Will check out these functions. Very helpful. Thank you!
The unicode type in Python 2 is the equivalent of the str type in Python 3. Hence the py23 module in fonttools exports a unicode alias that points to unicode in Python 2 and to str in Python 3. You can import that unicode type from py23 module and use it, e.g., with isinstance.
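For illustration only, an alias of this kind can be sketched in a few lines. The names `text_type` and `is_text` are hypothetical; with fonttools installed you would import its `unicode` alias instead.

```python
try:
    text_type = unicode  # Python 2: the separate unicode type
except NameError:
    text_type = str  # Python 3: str is already the text type


def is_text(value):
    """True for text strings, False for bytes strings."""
    return isinstance(value, text_type)
```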
If you want to convert a bytes string to a unicode string (in Python 2, which is the same as a str string in Python 3), you can use the tounicode function from the fonttools py23 module, which will just pass it on as is if it's already a unicode string, or decode the bytes string to a unicode string using the provided encoding (it defaults to ascii but you should provide the actual bytes string encoding; if you read the bytes from a text file, that would usually be UTF-8, if you read them from the console (in Python 2) it will be the console's encoding, etc.).
If you want to work with strings in a codebase that's meant to support both python 2 and 3, I'm afraid you need to become familiar with terms like bytes, unicode, str, encode, decode, etc.
In an ideal world where everybody is using python 3, then one would just use str for text, and bytes for binary data, without confusing one with the other like python 2 used to do. ;)
Also note that the stuff in fonttools py23 module is nothing fancy. Other popular python packages that are not afraid of adding dependencies just use the six module (https://pythonhosted.org/six/) which provides the same (and more) functionalities. But since your library already depends on fonttools, you may well just use that.
Hence the py23 module in fonttools exports a unicode alias that points to unicode in Python 2 and to str in Python 3. You can import that unicode type from py23 module and use it, e.g., with isinstance.
But since your library already depends on fonttools, you may well just use that.
Thanks Cosimo. This is very helpful. I will tinker in Py2/3 once I get the Travis testing set up for the project.
Added the following to the TODO list based upon above conversation:
- add from __future__ import unicode_literals to module
- let fonttools library encode proper string by platformID type, ignore this on font-v end (see #1 (comment))
- use unicode alias and tounicode function in fonttools py23 module to encode properly before use of fonttools library for the string write - i.e. before record.string definitions (see #1 (comment))
- test encoding of strings and exception handling for string encoding in Py 2 + 3 on Travis
tests that are Py2 + 3 compatible...
from __future__ import unicode_literals

import sys  # needed for the interpreter version checks below

from fontTools.misc.py23 import unicode, tounicode, tobytes, tostr


def test_fontv_fonttools_lib_unicode():
    test_string = tobytes("hello")
    test_string_str = tostr("hello")
    test_string_unicode = tounicode(test_string, 'utf-8')
    test_string_str_unicode = tounicode(test_string_str, 'utf-8')
    assert (isinstance(test_string, unicode)) is False
    if sys.version_info[0] == 2:
        assert (isinstance(test_string_str, unicode)) is False  # str != unicode in Python 2
    elif sys.version_info[0] == 3:
        assert (isinstance(test_string_str, unicode)) is True  # str = unicode in Python 3
    assert (isinstance(test_string_unicode, unicode)) is True  # after cast with fonttools function, Py2+3 = unicode
    assert (isinstance(test_string_str_unicode, unicode)) is True  # ditto
    assert test_string_unicode == "hello"
Will cast everything that comes from command line to unicode with tounicode function.
My head is spinning after a lengthy attempt to perform a string equality comparison between the name record record.string that is read in from fonttools and a string literal across combinations of the unicode_literals import, attempts to cast to unicode with fonttools tounicode function, use of str.decode('utf-8'), etc. I need to do this in order to confirm that I am not saving a previous sha1 string/dev string/release string after I split the complete version string on semicolons to a list.
What I am looking to do is something along these lines:
version_list = record.string.split(";")
keep_list = []
for substring in version_list[1:]:  # exclude the Version X.X string that is held in another variable
    if substring.strip() == "DEV" or substring.strip() == "RELEASE":
        pass
    else:
        keep_list.append(substring)
post_version_string = ";".join(keep_list)
The full version string is concatenated from variables above this level, new content based upon the user's command line request, and the above post_version_string, which is appended to maintain anything that the user previously had following a semicolon.
Thoughts? Despite being able to cast everything to same type in Python 2 (str v str and unicode v unicode), the string equality comparison always yields False. String lengths (len(s)) differ despite being of the same type.
This is what I am seeing as the default for the version_list with data read from fonttools following the split on semicolons for a font with a version string that reads Version 1.000; DEV:
['\x00\x00\x00V\x00\x00\x00e\x00\x00\x00r\x00\x00\x00s\x00\x00\x00i\x00\x00\x00o\x00\x00\x00n\x00\x00\x00 \x00\x00\x001\x00\x00\x00.\x00\x00\x000\x00\x00\x000\x00\x00\x000\x00', '\x00 \x00D\x00E\x00V']
and here is a list of string literals with same data using the from __future__ import unicode_literals import without any special string formatting in the list (simply between double quotes):
[u'Version 1.000', u' DEV']
This is coming from nameID 5, platformID 3.
When fonttools decompiles a font, the record.string you get is of type bytes; if you want to decode it to a Unicode string you need to call the NameRecord's toUnicode() method. That's used e.g. when fonttools dumps the name record to TTX.
Check the inline documentation in the name table module.
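To illustrate why the comparison fails on the raw bytes, here is a small standard-library sketch. It simulates a platformID 3 record by encoding a literal as UTF-16 big endian, and the plain `decode("utf_16_be")` call stands in for the NameRecord's toUnicode() method, which picks the encoding for you.

```python
# Simulate the raw bytes of a Windows (platformID 3) name record.
raw = "Version 1.000; DEV".encode("utf_16_be")

# Splitting the raw bytes cuts the two-byte ";" in half, so no part matches.
raw_parts = raw.split(b";")
raw_has_dev = any(part.strip() == b"DEV" for part in raw_parts)

# Decoding first gives ordinary text that splits and compares as expected.
text = raw.decode("utf_16_be")
text_parts = text.split(";")
text_has_dev = any(part.strip() == "DEV" for part in text_parts)
```

The raw form is also twice as long as the decoded text, which is consistent with the differing `len(s)` values reported above.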
I think that this was the problem. They must be utf16 big endian encoded data (or some other non-utf8 encoding). I was trying to cast to unicode with the fontTools.misc.py23 tounicode function and specifying utf8. Will check it today with the table specific unicode casting function. Thanks Cosimo.
It looks like the name table specific toUnicode function uses tounicode internally so I am working under the assumption that we are still going to be dealing with str types on Py3 and unicode types on Py 2. We will see...
yes, toUnicode() method of NameRecord returns a unicode string in Python 2 (which is the same as a str string in Python 3). It automatically decodes it using the right encoding.
Perfect! Thank you Cosimo!
A very valuable (and painful) lesson in Py2/3 string handling :)
Are the OpenType tables stored in binaries as C structs with a defined byte length of padding? This appears to be how they are being unpacked in fonttools.
You ask out of curiosity? That's an implementation detail.
You can trust FontTools is Doing The Right Thing (TM).
I'm joking of course. Have a look at the compile method of the name table class if you are interested in understanding how that's done.
Oh yes, simply for my own knowledge. I fully trust fonttools. Was going to dust off my C and tinker a bit to better understand the binary.
Nice! if your terminal supports UTF-8, you should even be able to put an emoji in your version string (well, not in the Mac Roman one of course but that's legacy)
sorry, actually I just realised that the tool is not meant to take arbitrary strings from the console, but just numeric strings like "2_000", so all my ramblings on the encoding of the sys.argv is misplaced since you're guaranteed to receive only ascii characters like digits and underscores.
You can decode the arguments with tounicode(s, encoding="ascii") and if it fails with UnicodeDecodeError, it's the user's fault ;)
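A sketch of that strict ASCII check using only built-in string methods. `decode_ascii_arg` is a hypothetical helper, and turning the decode failure into a `ValueError` with a readable message is a design assumption rather than anything fonttools does.

```python
def decode_ascii_arg(arg):
    """Return `arg` as text, raising ValueError if it is not pure ASCII."""
    try:
        if isinstance(arg, bytes):
            return arg.decode("ascii")  # bytes input, as on Python 2
        arg.encode("ascii")  # text input: encode only to validate
        return arg
    except (UnicodeDecodeError, UnicodeEncodeError):
        raise ValueError("version argument must contain only ASCII characters")
```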
Nice! if your terminal supports UTF-8, you should even be able to put an emoji in your version string (well, not in the Mac Roman one of course but that's legacy)
haha! for a future release perhaps?
tounicode(s, encoding="ascii") and if it fails with UnicodeDecodeError, it's the user's fault ;)
I like it when it is not my fault...
Is the default 'ascii' encoding on tounicode stable? necessary to specify it explicitly?
it's the default, it won't change. you can just call tounicode(s) but explicit is better than implicit ;)
sorry, actually I just realised that the tool is not meant to take arbitrary strings from the console
Currently supports the following:
- modification of the version integers based upon console entry
- addition of a DEV or RELEASE tag in the string as additional (post semicolon) meta data
- addition of the repository git commit short sha1 string with optional -dev or -release tag added to the sha1
So you can make the following:
Version 1.001
Version 1.001; DEV or Version 1.001; RELEASE
Version 1.001; 8ca53c6
Version 1.001; 8ca53c6-dev or Version 1.001; 8ca53c6-release
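The formats listed above could be assembled along these lines. This is only a sketch: the function name `build_version_string` and its parameters are hypothetical and do not reflect the tool's actual code.

```python
def build_version_string(version, tag=None, sha1=None):
    """Assemble a version string such as 'Version 1.001; 8ca53c6-dev'.

    `tag` is None, 'DEV' or 'RELEASE'; `sha1` is an optional short commit hash.
    """
    base = "Version " + version
    if sha1:
        # The tag is lower-cased and attached to the hash with a hyphen.
        meta = sha1 if tag is None else sha1 + "-" + tag.lower()
    elif tag:
        meta = tag.upper()
    else:
        return base
    return base + "; " + meta
```

Every piece here is a plain ASCII text string, so the result can be handed to the name record as a single text value.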
cool. all ascii-safe
We are going to use it to add the build git commit SHA1 to Hack
Thanks again for all of your help Cosimo. I really appreciate all of your time. This was helpful.
Cosimo, is any of this code of interest in fontmake? If you have any interest in supporting this version (dev/release/sha1) approach, it probably makes more sense to have in a compiler rather than yet one more post compilation adjustment tool. The tool here could be used for one off changes and for those who do not use fontmake.
I'm not particularly interested myself, but you are welcome to open an issue on fontmake's repo and see whether others would be interested, or you can propose a pull request.
@anthrotype no worries. if there isn't buy-in from maintainer won't be worth the time at this stage. will forge ahead here and see if the approach is attractive to anyone else out there. if so will pitch it on fontmake repo down the road :)
thanks!
@anthrotype Cosimo, I am getting the following error when I concatenate strings pulled in from fontTools nameID 5 with the git sha1 strings and string literals (e.g. '-dev', '-release') here.
It seems that the parts (i.e. the concatenated portions between semicolons) are not encoded in the same way. This leads to odd spacing between characters (displayed in the image in that IR) and, more importantly, the version string is not displayed in FontBook on OS X.
The current approach is to pull the "pre" version string as built from source via the fontTools namerecord.toUnicode() method:
namerecord_list = tt['name'].names
for record in namerecord_list:
    if record.nameID == 5:
        # ... full nameID record string is cast with .toUnicode method from fontTools library
then I split the string on semicolons into a Python list. The list parts are concatenated to a new nameID 5 version string with a combination of string literals and the git SHA1 short code that is returned by the gitpython library. It seems that the second list item that includes the sha1 string +/- any string literals is properly formatted but the Version X.XXX string (direct from fontTools library and maintained in the list) and the ttfautohint... string (direct from fontTools library and maintained in the list prior to concatenation) do not appear to have proper formatting following the write out to the font binary.
Do I need to encode the entire concatenated string in some way before I send it to fontTools for the write? I am assuming that I am dealing with different string formats for some reason here, but thought that Python (during concatenation) or fontTools would automate these casts in a way that it does not need to be explicit on my end. Thoughts?
Previously I was using the following for platformID 3 prior to the write with fontTools:
record.string = version_string.encode('utf_16_be')
# then write with fontTools
the git SHA1 short code that is returned by the gitpython library
are you sure that is a unicode string and not a bytes string? Use tounicode() function to ensure it is a unicode string (import it from fontTools.misc.py23 module).
Concatenating unicode and bytes in Python 2 would automatically decode the latter (upgrade them to unicode) using the default ascii encoding. In Python 3, attempting to do u"abc" + b"abc" raises TypeError (rightly so). So, in a py2/py3 world, it's better you make sure you are dealing with either one or the other, and not mix them up.
The NameRecord string attribute can take either a unicode string or a bytes string. In the first case, it will encode it automatically for you next time you compile; in the latter case, it will assume you already encoded the bytes string with the right encoding and will just write it as is. If the bytes were not encoded with the right encoding, you may get garbage like the one in the screenshot.
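A short standard-library sketch of the mixing problem under Python 3. The `sha1` bytes value is a hypothetical stand-in for a hash that arrives as bytes; whether gitpython actually returns bytes is not established in this thread.

```python
sha1 = b"8ca53c6"  # hypothetical: a short commit hash that arrived as bytes
prefix = "Version 1.001; "

try:
    prefix + sha1  # Python 3 refuses to mix text and bytes
    mixed_ok = True
except TypeError:
    mixed_ok = False

# Normalizing to text first keeps every part the same type.
sha1_text = sha1.decode("ascii") if isinstance(sha1, bytes) else sha1
version_string = prefix + sha1_text + "-dev"
```

With every part normalized to text, the whole string can be assigned to record.string and left for fonttools to encode at compile time.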
Hmmm... that may be the case. Interestingly that is the portion of the string (the git sha1) that does not seem to be affected in the font's version string after write. Will attempt to cast the string returned by gitpython to unicode with tounicode from the py23 module. It should not be necessary to cast the string literals explicitly before concatenation correct? My understanding is that the addition of the line from __future__ import unicode_literals should force string literals in Py2 to unicode and Py3 as str? If so, this should be consistent with where we are with the other substrings from the original font when we import from the record.string with the .toUnicode() method?
correct, if you use unicode_literals.
Thank you! Will give it a shot this evening. Hopefully the fix is this simple... :)