/encodings

Internal MIG presentation about Character Encodings

Character Sets and Character Encodings.

ASCII: http://ascii-table.com/ 7 bits (128 characters)

Different uses for that remaining bit (code pages):
http://www.i18nguy.com/unicode/codepages.html#msftdos

This kind of worked for a while, as long as documents were used on the same machine
they were created on, and you only had to deal with one language.

As soon as the Internet happened it became quite common to move strings from one
machine to another.

Unicode:

It's an effort to create a character set that includes every writing system on the
planet, including dead and fictitious languages.

Each letter is mapped to a number (its code point). A code point is just a concept: it says
nothing about how the letter is physically represented in bytes.

Some of the problems in defining this standard have to do with identifying which
letters in different languages are actually the same.

open U0000.pdf
open U0B80.pdf
http://www.unicode.org/charts/

Hello
U+0048 U+0065 U+006C U+006C U+006F
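
The mapping above can be checked from Ruby 1.9 or later. This is only a minimal sketch; the helper name code_points is made up for illustration:

```ruby
# Hypothetical helper: list the Unicode code points of a string in U+XXXX form.
def code_points(text)
  text.each_codepoint.map { |cp| format("U+%04X", cp) }
end

puts code_points("Hello").join(" ")
# prints: U+0048 U+0065 U+006C U+006C U+006F
```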

Encodings:

One obvious way of encoding those code points would be using 2 bytes per code point:
00 48 00 65 00 6C 00 6C 00 6F

But this could also be:
48 00 65 00 6C 00 6C 00 6F 00

This is actually UTF-16/UCS-2, which comes in both big endian and little endian forms
(the two byte sequences above, in that order).
There is a convention to store a Byte Order Mark (BOM) at the beginning of the file:
FE FF (big endian) or FF FE (little endian), but it's not always there.
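
A rough sketch of both byte orders and of BOM sniffing, assuming Ruby 2.0 or later. The helpers hex_bytes and utf16_byte_order are hypothetical names, not part of any library:

```ruby
# Hypothetical helper: show the bytes of a string as space-separated hex.
def hex_bytes(string)
  string.bytes.map { |b| format("%02X", b) }.join(" ")
end

# Hypothetical helper: guess the UTF-16 byte order from the first two bytes.
def utf16_byte_order(string)
  case string.bytes.first(2)
  when [0xFE, 0xFF] then :big_endian
  when [0xFF, 0xFE] then :little_endian
  else :unknown # no BOM; the byte order has to be known some other way
  end
end

big    = "Hello".encode("UTF-16BE")
little = "Hello".encode("UTF-16LE")

puts hex_bytes(big)    # prints: 00 48 00 65 00 6C 00 6C 00 6F
puts hex_bytes(little) # prints: 48 00 65 00 6C 00 6C 00 6F 00

# Prepending U+FEFF before encoding writes the BOM in the chosen byte order.
with_bom = "\uFEFFHello".encode("UTF-16LE")
puts hex_bytes(with_bom)[0, 5] # prints: FF FE
p utf16_byte_order(with_bom)   # prints: :little_endian
p utf16_byte_order(big)        # prints: :unknown
```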

Encoding English this way wastes a lot of space, since most of its characters are below U+00FF
and the high byte is always zero.

Unicode Transformation Format-8 (UTF-8):
UTF-8 encodes each Unicode character as a variable number of 1 to 4 octets, where the number of octets depends on
the integer value assigned to the Unicode character. It is an efficient encoding of Unicode documents that use
mostly US-ASCII characters because it represents each character in the range U+0000 through U+007F as a single octet.
UTF-8 is the default encoding for XML.

Hello - 48 65 6C 6C 6F
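
The variable length can be seen with one sample character per size. A small sketch, assuming Ruby 1.9 or later; utf8_octets is a made-up helper name and the sample characters are my own choice:

```ruby
# Hypothetical helper: number of octets UTF-8 needs for one character.
def utf8_octets(char)
  char.encode("UTF-8").bytesize
end

# One sample character for each UTF-8 length, written with \u escapes so the
# snippet does not depend on the encoding of the source file.
samples = {
  "A"         => "U+0041, plain ASCII",
  "\u00E9"    => "U+00E9, e with acute accent",
  "\u20AC"    => "U+20AC, euro sign",
  "\u{1F600}" => "U+1F600, an emoji outside the 16-bit range"
}

samples.each do |char, description|
  puts "#{utf8_octets(char)} octet(s): #{description}"
end
# prints:
# 1 octet(s): U+0041, plain ASCII
# 2 octet(s): U+00E9, e with acute accent
# 3 octet(s): U+20AC, euro sign
# 4 octet(s): U+1F600, an emoji outside the 16-bit range
```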

There is also UCS-4/UTF-32, which always uses 4 bytes per code point - really inefficient for most text.

UTF-8 is more efficient for western languages but for other languages UTF-16 can be more efficient.
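
To put numbers on that, here is a sketch that compares the size of the same text in three encodings. It assumes Ruby 2.1 or later; encoded_sizes is a hypothetical helper and the Japanese sample is my own:

```ruby
# Hypothetical helper: bytes needed for the same text in three encodings.
def encoded_sizes(text)
  %w[UTF-8 UTF-16BE UTF-32BE].map { |name| [name, text.encode(name).bytesize] }.to_h
end

english  = "Hello"
japanese = "\u3053\u3093\u306B\u3061\u306F" # "konnichiwa" in hiragana, 5 characters

[english, japanese].each do |text|
  puts encoded_sizes(text).map { |name, size| "#{name}: #{size}" }.join(", ")
end
# prints:
# UTF-8: 5, UTF-16BE: 10, UTF-32BE: 20
# UTF-8: 15, UTF-16BE: 10, UTF-32BE: 20
```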

Open hello.latin1 and hello.utf8 in TextViewer
Open hello.latin1 and hello.utf8 in Chrome, change encoding of both files.

victor@victor ~/encodings (master)*$ xxd hello.utf8
0000000: 4865 6c6c 6f20 4920 616d 2056 c3ad 6374  Hello I am V..ct
0000010: 6f72                                     or
victor@victor ~/encodings (master)*$ xxd hello.latin1
0000000: 4865 6c6c 6f20 4920 616d 2056 ed63 746f  Hello I am V.cto
0000010: 72                                       r


See how ASCII characters are encoded the same in utf8 and latin1.
See how UTF8 encoding takes more space for some characters.
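
The same comparison can be made without the files. A sketch assuming Ruby 1.9 or later, with the file contents rebuilt from the xxd output above; hex_dump is a made-up helper:

```ruby
# Hypothetical helper: hex dump of a string's raw bytes.
def hex_dump(string)
  string.unpack("C*").map { |b| format("%02x", b) }.join(" ")
end

text   = "Hello I am V\u00EDctor"   # UTF-8 string literal
latin1 = text.encode("ISO-8859-1")  # same characters, ISO-8859-1 bytes

puts hex_dump(text)
# prints: 48 65 6c 6c 6f 20 49 20 61 6d 20 56 c3 ad 63 74 6f 72
puts hex_dump(latin1)
# prints: 48 65 6c 6c 6f 20 49 20 61 6d 20 56 ed 63 74 6f 72
puts text.bytesize - latin1.bytesize
# prints: 1
```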

Text without knowing its encoding doesn't mean anything, it's just bytes.
How to specify encoding:
HTTP   - Content Type Header.
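HTML   - Content Type meta tag. This looks tricky (you have to read the document to learn its encoding) but it
works in practice, because almost every encoding in use represents the ASCII characters of the tag the same way.
The meta tag should be the first thing in the head section. Without it, browsers try to guess based on the
frequency of bytes.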
E-mail - Content Type Header
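
On the receiving side, the declared charset has to be applied to the raw bytes. A rough sketch assuming Ruby 2.0 or later; both helper names are hypothetical and the regular expression only covers the common "charset=..." form, not every legal header:

```ruby
# Hypothetical helper: pull the charset parameter out of a Content-Type value.
# Returns nil when no charset is declared.
def charset_from_content_type(value)
  match = value.match(/;\s*charset\s*=\s*"?([^\s";]+)"?/i)
  match && match[1]
end

# Hypothetical helper: tag raw body bytes with the declared encoding,
# falling back to binary when the charset is missing or unknown to Ruby.
def tag_body(raw_bytes, content_type)
  charset = charset_from_content_type(content_type)
  encoding = begin
    charset ? Encoding.find(charset) : Encoding::BINARY
  rescue ArgumentError
    Encoding::BINARY
  end
  raw_bytes.dup.force_encoding(encoding)
end

body = tag_body("V\xEDctor".b, "text/html; charset=ISO-8859-1")
puts body.encoding                 # prints: ISO-8859-1
puts body.encode("UTF-8").bytesize # prints: 7
```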

Ruby 1.8:

victor@victor ~/encodings (master)*$ rvm 1.8.6
victor@victor ~/encodings (master)*$ irb
ruby-1.8.6-p399 > latin1 = File.open("hello.latin1").read
 => "Hello I am V\355ctor"
ruby-1.8.6-p399 > 0355
 => 237
ruby-1.8.6-p399 > 237.to_s(16)
 => "ed"
ruby-1.8.6-p399 > utf8 = File.open("hello.utf8").read
 => "Hello I am V\303\255ctor"
ruby-1.8.6-p399 > 0303
 => 195
ruby-1.8.6-p399 > 195.to_s(16)
 => "c3"
ruby-1.8.6-p399 > 0255
 => 173
ruby-1.8.6-p399 > 173.to_s(16)
 => "ad"

ruby-1.8.6-p399 > latin1 << utf8
 => "Hello I am V\355ctorHello I am V\303\255ctor"
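
Ruby 1.8 joins the two strings without complaint, but the result is not valid in either encoding. The sketch below reproduces that blind concatenation on Ruby 2.0 or later (it will not run on 1.8); the helper name is made up:

```ruby
# Hypothetical helper: join two byte strings the way Ruby 1.8 did, with no
# encoding checks, then ask whether the result is valid UTF-8.
def blind_concat_valid_utf8?(first_bytes, second_bytes)
  mixed = first_bytes.b + second_bytes.b
  mixed.force_encoding("UTF-8").valid_encoding?
end

latin1 = "Hello I am V\xEDctor".b      # bytes of hello.latin1
utf8   = "Hello I am V\xC3\xADctor".b  # bytes of hello.utf8

p blind_concat_valid_utf8?(utf8, utf8)   # prints: true
p blind_concat_valid_utf8?(latin1, utf8) # prints: false
```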

Ruby 1.8 has some support for Encodings:

victor@victor ~/encodings (master)*$ irb
ruby-1.8.6-p399 > utf8 = File.open("hello.utf8").read
 => "Hello I am V\303\255ctor"
ruby-1.8.6-p399 > $KCODE = "U"
 => "U"
ruby-1.8.6-p399 > utf8 = File.open("hello.utf8").read
 => "Hello I am Víctor"

There are 4 possible values for $KCODE:
NONE: "N"
EUC: "E" (EUC-JP, a Japanese encoding)
Shift-JIS: "S" (a Japanese encoding)
UTF-8: "U"

Support in regular expressions:

ruby -e 'p "Résumé".scan(/./m)'
["R", "\303", "\251", "s", "u", "m", "\303", "\251"]

ruby -e 'p "Résumé".scan(/./mu)'
["R", "\303\251", "s", "u", "m", "\303\251"]

ruby -e 'p "Résumé".size'
8

ruby -e 'p "Résumé".scan(/./mu).size'
6

ruby -e 'p "Résumé".unpack("U*")'
[82, 233, 115, 117, 109, 233]

ruby -e 'p "Résumé"'
"R\303\251sum\303\251"

ruby -KUe 'p "Résumé"'
"Résumé"

ruby -KUe 'p "Résumé".scan(/./m)'
["R", "é", "s", "u", "m", "é"]

#!/usr/bin/env ruby -wKU

Iconv: C library to handle character conversion

iconv --list

irb
ruby-1.8.6-p399 > $KCODE = "U"
 => "U"
ruby-1.8.6-p399 > require "iconv"
 => true
ruby-1.8.6-p399 > latin1 = File.open("hello.latin1").read
 => "Hello I am V?ctor"
ruby-1.8.6-p399 > utf8 = File.open("hello.utf8").read
 => "Hello I am Víctor"
ruby-1.8.6-p399 > latin1_in_utf8 = Iconv.conv("UTF8", "LATIN1", latin1)
 => "Hello I am Víctor"
ruby-1.8.6-p399 > latin1_in_utf8 + utf8
 => "Hello I am VíctorHello I am Víctor"
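
Iconv was removed from the standard library in Ruby 2.0. On Ruby 1.9 and later the same conversion can be done with String#encode. A minimal sketch, assuming Ruby 2.0 or later for String#b; latin1_to_utf8 is a hypothetical helper:

```ruby
# Hypothetical helper: convert bytes known to be ISO-8859-1 into UTF-8.
# The second argument to encode names the source encoding, so the tag
# currently attached to the string does not matter.
def latin1_to_utf8(bytes)
  bytes.encode("UTF-8", "ISO-8859-1")
end

latin1 = "Hello I am V\xEDctor".b
utf8   = "Hello I am V\u00EDctor"

converted = latin1_to_utf8(latin1)
puts converted.encoding            # prints: UTF-8
puts converted == utf8             # prints: true
puts((converted + utf8).bytesize)  # prints: 36
```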

Problems with 1.8 encoding support:

Not enough encodings supported
Regexp-only support just isn't comprehensive enough
$KCODE is a global setting for all encodings

Ruby 1.9:

In Ruby 1.9 a String is raw bytes plus information about their encoding.
This is different from other languages, which favour a single type of encoding (such as UTF-8).

p __ENCODING__

ruby encoding-1.9.rb
#<Encoding:US-ASCII>

ruby encoding-1.9_comment.rb
#<Encoding:UTF-8>
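
The two scripts are not shown here. Presumably (this is an assumption) the second one differs only by a magic comment on its first line, along these lines:

```ruby
# encoding: utf-8
# The comment on the first line is the "magic comment". Ruby 1.9 reads it to
# decide the encoding of string literals in this file; it must be on the first
# line (or the second, after a shebang).

# Hypothetical helper: report the source encoding of this file.
def source_encoding_name
  __ENCODING__.name
end

puts source_encoding_name       # prints: UTF-8
puts "R\u00E9sum\u00E9".length  # prints: 6
```

From Ruby 2.0 on, UTF-8 is the default source encoding, so the comment only changes the result on 1.9.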

-e gets the encoding from the environment:

echo $LANG
en_GB.UTF-8

ruby -e 'p __ENCODING__'
#<Encoding:UTF-8>

ruby -e 'p "Résumé".scan(/./m)'
["R", "é", "s", "u", "m", "é"]

ruby -e 'p "Résumé".size'
6

ruby -e 'p "Résumé".encoding'
#<Encoding:UTF-8>

ruby -e 'p "Résumé".each_byte{|b| p b}'
82
195
169
115
117
109
195
169
"Résumé"

ruby -e 'p "Résumé".each_char{|c| p c}'
"R"
"é"
"s"
"u"
"m"
"é"
"Résumé"

ruby -e 'p "Résumé".each_codepoint{|c| p c}'
82
233
115
117
109
233
"Résumé"

ruby -e 'p "Résumé".bytes.to_a'
[82, 195, 169, 115, 117, 109, 195, 169]

Encode Method: (Changes encoding metadata + raw bytes)

ruby -e 'p "Résumé".encode("ISO-8859-1")'
"R?sum?"

ruby -e 'p "Résumé".encode("ISO-8859-1").size'
6

ruby -e 'p "Résumé".encode("ISO-8859-1").encoding'
#<Encoding:ISO-8859-1>

ruby -e 'p "Résumé".encode("ISO-8859-1").bytes.to_a'
[82, 233, 115, 117, 109, 233]

Force Encoding: (only changes metadata)

ruby -e 'p [82, 233, 115, 117, 109, 233].map{|c| c.to_s(16)}'
["52", "e9", "73", "75", "6d", "e9"]

ruby -e 'p "\x52\xe9\x73\x75\x6d\xe9"'
"R\xE9sum\xE9"

ruby -e 'p "\x52\xe9\x73\x75\x6d\xe9".encoding'
#<Encoding:UTF-8>

ruby -e 'p "\x52\xe9\x73\x75\x6d\xe9".force_encoding("ISO-8859-1")'
"R?sum?"

ruby -e 'p "\x52\xe9\x73\x75\x6d\xe9".force_encoding("ISO-8859-1").encode("UTF-8")'
"Résumé"

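When the origin of the bytes is uncertain, valid_encoding? can help decide between retagging and transcoding. A sketch under a loud assumption: anything that is not valid UTF-8 is treated as ISO-8859-1. Requires Ruby 2.0 or later; to_utf8 is a made-up helper:

```ruby
# Hypothetical helper: turn bytes of uncertain origin into UTF-8.
# Assumption: anything that is not valid UTF-8 is ISO-8859-1, which is only
# a guess because every byte sequence is valid ISO-8859-1.
def to_utf8(bytes)
  candidate = bytes.dup.force_encoding("UTF-8") # retag only, bytes untouched
  return candidate if candidate.valid_encoding?

  bytes.dup.force_encoding("ISO-8859-1").encode("UTF-8") # retag, then transcode
end

from_latin1 = to_utf8("\x52\xE9\x73\x75\x6D\xE9".b)
from_utf8   = to_utf8("R\xC3\xA9sum\xC3\xA9".b)

p from_latin1.bytes.to_a   # prints: [82, 195, 169, 115, 117, 109, 195, 169]
p from_latin1 == from_utf8 # prints: true
```
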
Read a file specifying external and internal encoding:

rvm 1.9.2
irb

ruby-1.9.2-preview1 > File.open("hello.latin1").read
 => "Hello I am V\xEDctor"
ruby-1.9.2-preview1 > File.open("hello.latin1", "r:ISO-8859-1").read
 => "Hello I am V?ctor"
ruby-1.9.2-preview1 > File.open("hello.latin1", "r:ISO-8859-1").read.encoding
 => #<Encoding:ISO-8859-1>
ruby-1.9.2-preview1 > File.open("hello.latin1", "r:ISO-8859-1:UTF-8").read
 => "Hello I am Víctor"
ruby-1.9.2-preview1 > File.open("hello.latin1", "r:ISO-8859-1:UTF-8").read.encoding
 => #<Encoding:UTF-8>
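
The session above depends on hello.latin1 being present. A self-contained sketch of the same idea, assuming Ruby 2.1 or later (for Tempfile.create); read_with_mode is a hypothetical helper:

```ruby
require "tempfile"

# Hypothetical helper: write raw bytes to a temporary file, then read them
# back with the given mode string (for example "r:ISO-8859-1:UTF-8").
def read_with_mode(raw_bytes, mode)
  Tempfile.create("hello") do |file|
    file.binmode
    file.write(raw_bytes)
    file.flush
    File.open(file.path, mode) { |io| io.read }
  end
end

latin1_bytes = "Hello I am V\xEDctor".b

tagged     = read_with_mode(latin1_bytes, "r:ISO-8859-1")       # external only
transcoded = read_with_mode(latin1_bytes, "r:ISO-8859-1:UTF-8") # external:internal

puts tagged.encoding     # prints: ISO-8859-1
puts tagged.bytesize     # prints: 17
puts transcoded.encoding # prints: UTF-8
puts transcoded.bytesize # prints: 18
```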


Internal Encoding and External Encoding (defaults):

ruby-1.9.2-preview1 > Encoding.default_internal = "UTF-8"
 => "UTF-8"
ruby-1.9.2-preview1 > Encoding.default_external = "ISO-8859-1"
 => "ISO-8859-1"
ruby-1.9.2-preview1 > File.open("hello.latin1").read
 => "Hello I am Víctor"

Exceptions:

ruby-1.9.2-preview1 > "Hello".encode("ASCII-8BIT")
 => "Hello"
ruby-1.9.2-preview1 > "Hello I am Víctor".encode("ASCII-8BIT")
Encoding::UndefinedConversionError: U+00ED from UTF-8 to ASCII-8BIT
	from (irb):19:in `encode'
	from (irb):19
	from /Users/victor/.rvm/rubies/ruby-1.9.2-preview1/bin/irb:17:in `<main>'

ruby-1.9.2-preview1 > b =  File.open("hello.utf8", "r:binary").read
 => "Hello I am V\xC3\xADctor"
ruby-1.9.2-preview1 > b.encoding
 => #<Encoding:ASCII-8BIT>
ruby-1.9.2-preview1 > b << "Hello I am Víctor"
Encoding::CompatibilityError: incompatible character encodings: ASCII-8BIT and UTF-8
	from (irb):3
	from /Users/victor/.rvm/rubies/ruby-1.9.2-preview1/bin/irb:17:in `<main>'
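
Conversion errors can be rescued, or avoided with the replacement options of String#encode. A minimal sketch, assuming Ruby 1.9 or later; both helper names are made up:

```ruby
# Hypothetical helper: transcode, substituting "?" for anything the target
# encoding cannot represent instead of raising.
def lossy_encode(text, target)
  text.encode(target, invalid: :replace, undef: :replace, replace: "?")
end

# Hypothetical helper: transcode, returning nil when the conversion fails.
def try_encode(text, target)
  text.encode(target)
rescue Encoding::UndefinedConversionError, Encoding::InvalidByteSequenceError
  nil
end

name = "V\u00EDctor"

puts lossy_encode(name, "US-ASCII")          # prints: V?ctor
p try_encode(name, "US-ASCII")               # prints: nil
puts try_encode(name, "ISO-8859-1").bytesize # prints: 6
```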

Erlang:

A string is just a list of numbers:

[65, 66, 67, 68].
"ABCD"

[65, 66, 67, 68] == "ABCD".
true

Erlang assumes by default ISO-8859-1:

[82, 233, 115, 117, 109, 233].
"Résumé"

[82, 233, 115, 117, 109, 233,256].
[82,233,115,117,109,233,256]

A list of integers takes 8 bytes per element: 4 for the integer and 4 for the pointer to the next element (double that on a 64-bit architecture).

A binary takes just 1 byte per character:

list_to_binary([82, 233, 115, 117, 109, 233]).
<<"Résumé">>

Unicode support:

io:getopts().
[{expand_fun,#Fun<group.0.120017273>},
 {echo,true},
 {binary,false},
 {encoding,unicode}]

U = unicode:characters_to_binary([82,233,115,117,109,233], utf8).
<<"Résumé">>

io:format("~s~n",[U]).
Résumé
ok

io:format("~ts~n",[U]).
Résumé

unicode:characters_to_list(<<"Résumé">>).
"Résumé"

<<First/utf8, Second/utf8, Rest/binary>> = <<"Résumé">>.
<<"Résumé">>

First.
82

Second.
233

[First, Second].
"Ré"

Modules that are Unicode aware: unicode, io, file, re, wx, and string (except to_upper and to_lower).

GSM:
http://www.dreamfabric.com/sms/default_alphabet.html
It contains 128 + 10 characters, each representable in 7 bits (the 10 extra ones need an escape character first, so they take two 7-bit units).
That's why an SMS can carry a maximum of 140 bytes but 160 characters (140 * 8 / 7 = 160, as long as none of them is escaped).

Some operators accept data already encoded in GSM; others only accept a default alphabet that they translate into GSM, so not all GSM characters can be sent to all operators.

Handsets also support UCS-2, but then the message is limited to 70 characters (140 bytes at 2 bytes per character).
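
The 160 and 70 character limits follow from the 140 byte payload. A sketch of the arithmetic, assuming Ruby 1.9 or later; the helper names are made up, and gsm_septets assumes every character of the text is in the GSM alphabet (it does not check):

```ruby
# Characters from the GSM extension table; each one is sent as the escape
# character followed by a second septet, so it costs two septets.
def gsm_extended_chars
  ["\f", "^", "{", "}", "\\", "[", "~", "]", "|", "\u20AC"]
end

# Maximum characters per message for a given number of bits per character.
def sms_capacity(bits_per_char, payload_bytes = 140)
  (payload_bytes * 8) / bits_per_char
end

# Septets needed for a text, assuming every character is in the GSM alphabet
# (this sketch does not check that).
def gsm_septets(text)
  extended = gsm_extended_chars
  text.each_char.inject(0) { |total, char| total + (extended.include?(char) ? 2 : 1) }
end

puts sms_capacity(7)         # prints: 160
puts sms_capacity(16)        # prints: 70
puts gsm_septets("Hello")    # prints: 5
puts gsm_septets("a[1] = 2") # prints: 10
```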
