samuel/python-bert

Unicode String Encoding

Closed this issue · 4 comments

The BERT term generated by encoding a Python unicode string cannot be parsed by the current Ruby BERT implementation. The way this library encodes it is not part of the 1.0 spec at http://bert-rpc.org/

Here is an example:

>>> import bert
>>> bert.encode(u"Unicode is fun!")
'\x83h\x04d\x00\x04bertd\x00\x06stringl\x00\x00\x00\x05aUaTaFa-a8jm\x00\x00\x00\x0fUnicode is fun!'
>>> bert.encode("Unicode is fun!")
'\x83m\x00\x00\x00\x0fUnicode is fun!'

> require 'bert'
> BERT.decode("\x83h\x04d\x00\x04bertd\x00\x06stringl\x00\x00\x00\x05aUaTaFa-a8jm\x00\x00\x00\x0fUnicode is fun!")
 => nil
> BERT.decode("\x83m\x00\x00\x00\x0fUnicode is fun!")
 => "Unicode is fun!"
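
For reference, the first payload can be unpacked by hand. The sketch below is a minimal Python 3 decoder covering only the Erlang external term format tags that appear in these examples (SMALL_INTEGER_EXT, ATOM_EXT, SMALL_TUPLE_EXT, LIST_EXT, NIL_EXT, BINARY_EXT). The names `decode` and `decode_term` are hypothetical helpers written for this illustration; they are not part of the Python or Ruby bert libraries.

```python
import struct


def decode_term(data, pos):
    """Decode one term starting at data[pos]; return (value, next_pos)."""
    tag = data[pos:pos + 1]
    pos += 1
    if tag == b"a":  # SMALL_INTEGER_EXT: one unsigned byte
        return data[pos], pos + 1
    if tag == b"d":  # ATOM_EXT: 2-byte big-endian length, then the name
        (n,) = struct.unpack(">H", data[pos:pos + 2])
        pos += 2
        return ("atom", data[pos:pos + n].decode("latin-1")), pos + n
    if tag == b"m":  # BINARY_EXT: 4-byte big-endian length, then raw bytes
        (n,) = struct.unpack(">I", data[pos:pos + 4])
        pos += 4
        return data[pos:pos + n], pos + n
    if tag == b"h":  # SMALL_TUPLE_EXT: 1-byte arity, then the elements
        arity = data[pos]
        pos += 1
        items = []
        for _ in range(arity):
            item, pos = decode_term(data, pos)
            items.append(item)
        return tuple(items), pos
    if tag == b"l":  # LIST_EXT: 4-byte length, the elements, then a tail
        (n,) = struct.unpack(">I", data[pos:pos + 4])
        pos += 4
        items = []
        for _ in range(n):
            item, pos = decode_term(data, pos)
            items.append(item)
        _tail, pos = decode_term(data, pos)  # tail is read and discarded
        return items, pos
    if tag == b"j":  # NIL_EXT: the empty list
        return [], pos
    raise ValueError("tag %r is not handled by this sketch" % tag)


def decode(data):
    """Check the version byte (131 = 0x83) and decode the single term after it."""
    if data[0] != 131:
        raise ValueError("not an external term format payload")
    value, _ = decode_term(data, 1)
    return value


payload = (b"\x83h\x04d\x00\x04bertd\x00\x06stringl\x00\x00\x00\x05"
           b"aUaTaFa-a8jm\x00\x00\x00\x0fUnicode is fun!")
term = decode(payload)
print(term)
print(bytes(term[2]))  # the third element, reassembled from its integers
```

Read this way, the payload is a 4-tuple `{bert, string, Encoding, Binary}` whose third element is a list of five small integers spelling `UTF-8`, followed by the string bytes as a binary.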

I have two questions:

  • Is this library meant to implement the BERT 1.0 spec?
  • If so, is this a bug in the Python version, the Ruby version, or are the two not meant to be compatible?

Thanks for your time.

The Python library follows the proposed Unicode support in BERT-RPC 2.0 found at http://groups.google.com/group/bert-rpc/browse_thread/thread/b3ccda7b76a3a631

It seems the Ruby library doesn't currently support the same proposal, though it's quite possible that the Python library has a bug in its implementation as well. I don't actively use the libraries at the moment, but if you want to look into the implementations I will gladly accept patches.

Thanks,
Samuel

The encoding is supposed to be an atom: "Where Encoding is an atom that specifies the character encoding." It is currently expressed as a list in this implementation. Also, lists of bytes should probably be encoded as STRING_EXT.

ref: http://groups.google.com/group/bert-rpc/browse_thread/thread/b3ccda7b76a3a631
ref: http://www.erlang.org/doc/apps/erts/erl_ext_dist.html#id85596
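
To make the difference concrete, here is a minimal Python 3 sketch that builds the same term twice, once with the encoding as a list of bytes (what the library emitted in the issue above) and once as an atom. The helper names are hypothetical, and the atom name `UTF-8` is an assumption taken from the bytes in the original report rather than from the proposal text.

```python
import struct

VERSION = b"\x83"  # external term format version byte (131)


def atom(name):
    """ATOM_EXT: tag 'd', 2-byte big-endian length, then the name."""
    raw = name.encode("latin-1")
    return b"d" + struct.pack(">H", len(raw)) + raw


def binary(raw):
    """BINARY_EXT: tag 'm', 4-byte big-endian length, then the bytes."""
    return b"m" + struct.pack(">I", len(raw)) + raw


def small_tuple(*items):
    """SMALL_TUPLE_EXT: tag 'h', 1-byte arity, then the encoded elements."""
    return b"h" + struct.pack(">B", len(items)) + b"".join(items)


def byte_list(raw):
    """LIST_EXT of SMALL_INTEGER_EXT elements, closed by NIL_EXT ('j')."""
    body = b"".join(b"a" + bytes([b]) for b in raw)
    return b"l" + struct.pack(">I", len(raw)) + body + b"j"


def string_ext(raw):
    """STRING_EXT: tag 'k', 2-byte big-endian length, then the bytes."""
    return b"k" + struct.pack(">H", len(raw)) + raw


def unicode_term(text, encoding_field):
    """Build {bert, string, Encoding, Binary} with a caller-chosen Encoding."""
    return VERSION + small_tuple(
        atom("bert"), atom("string"), encoding_field,
        binary(text.encode("utf-8")))


as_list = unicode_term("Unicode is fun!", byte_list(b"UTF-8"))
as_atom = unicode_term("Unicode is fun!", atom("UTF-8"))
print(as_list)
print(as_atom)
print(byte_list(b"UTF-8"), string_ext(b"UTF-8"))
```

The list form reproduces the bytes from the report, while the atom form replaces `l\x00\x00\x00\x05aUaTaFa-a8j` with `d\x00\x05UTF-8`. The last line shows how the same five bytes would look as STRING_EXT, which is shorter than the LIST_EXT form.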

Thanks @davepage, you're right. I've changed the encoding field from a list of bytes to an atom.

As for encoding lists of bytes as STRING_EXT, that's more of an issue for erlastic. I didn't add explicit checking for that because I feel the overhead of detecting it isn't worth it, and if you encode/decode through erlastic it can take the simple route of saying anything that's STRING_EXT is a string. There's really no great solution. However, if you use bert rather than erlastic, then one option would be to modify 'bert' to encode ALL strings, not just unicode, as the bert representation. This would be less compatible, though, as the unicode representation isn't in the spec.

Sorry I never replied back on this one. What I ended up doing was forking and backing out Unicode support altogether, as I currently don't need it. Also, I am talking to a Ruby Ernie server, so I need it to play nice with the Ruby implementation. With that change everything is humming along nicely.

My Fork: https://github.com/teaminsight/python-bert/tree/insight-bert
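
As a rough illustration of what "backing out Unicode support" means on the wire, the Python 3 sketch below encodes every string as a plain BINARY_EXT, which is the form the Ruby decoder accepted in the original report. This is not the code from the fork; `encode_plain_binary` is a hypothetical helper, and the choice of UTF-8 as the default is an assumption that the receiving side would have to share.

```python
import struct


def encode_plain_binary(text, encoding="utf-8"):
    """Encode str or bytes as version byte + BINARY_EXT, with no bert tuple."""
    raw = text.encode(encoding) if isinstance(text, str) else bytes(text)
    # 0x83 version byte, tag 'm', 4-byte big-endian length, then the bytes
    return b"\x83m" + struct.pack(">I", len(raw)) + raw


plain = encode_plain_binary("Unicode is fun!")
print(plain)
```

The trade-off is that the encoding name is no longer carried in the term, so both ends have to agree on it out of band.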