msiemens/tinydb

Unicode in queries

djordjeglbvc opened this issue · 12 comments

When querying a db field with a unicode string, a UnicodeEncodeError exception is raised.
The line that causes the exception:

db.get(where('name') == u'žir')

Inserting unicode data went without problems:

db.insert({'name': 'žir'})

I have made a quick hack which fixes the problem for my little hobby project, but I will examine this more when I find time.

In queries.py, I've changed Query._update_repr function body to:

self._repr = u'\'{0}\' {1} {2}'.format(self._key, operator, value)

and Query.__hash__ to:

return hash(repr(unicode(self)))

Basically adding the string prefix "u" in _update_repr, and a "unicode" call in __hash__...

Using tinydb from git on python 2.7.6, ubuntu 14.04

Is it possible to normalize the data before inserting? I know there is a function called unicodedata.normalize that should help. Then you can query easily with:

db.get(where('name') == 'zir')

Can you provide the full traceback information? (Just copy + paste from your Python interpreter session)

@zelenikotao Can you please post a full traceback?

Sorry for not responding earlier, I didn't have any free time over the weekend.

@eugene-eeo I have tried unicodedata.normalize; the result is the same.

@eugene-eeo @msiemens, here is the traceback:

$ python example.py 

Traceback (most recent call last):
  File "example.py", line 13, in <module>
    db.get(where('name') == unicodedata.normalize('NFKC', u'žir'))
  File "/usr/local/lib/python2.7/dist-packages/tinydb/queries.py", line 184, in __eq__
    self._update_repr('==', other)
  File "/usr/local/lib/python2.7/dist-packages/tinydb/queries.py", line 310, in _update_repr
    self._repr = '\'{0}\' {1} {2}'.format(self._key, operator, value)
UnicodeEncodeError: 'ascii' codec can't encode character u'\u017e' in position 0: ordinal not in range(128)
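
A minimal sketch of the failure in this traceback, written for Python 3, where the ASCII encoding step that Python 2 performed implicitly has to be spelled out. The variable names are illustrative, and the template string is copied from the traceback:

```python
# Python 2 implicitly encoded a unicode argument to ASCII when it was
# formatted into a byte-string template. Python 3 never does this
# implicitly, so the encoding step is written out to reproduce the error.
value = '\u017eir'  # the text 'žir'

try:
    value.encode('ascii')
    ascii_failed = False
except UnicodeEncodeError as exc:
    ascii_failed = True
    print(exc)

# With a text template there is no encoding step, so formatting succeeds.
query_repr = '\'{0}\' {1} {2}'.format('name', '==', value)
print(ascii(query_repr))
```

This is why prefixing the template with "u" avoids the exception on Python 2: both the template and the value are then text, and no ASCII conversion takes place.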

Here is the test script which causes the exception:
https://gist.github.com/zelenikotao/b23d79edc80bcea3b511.js

@zelenikotao You've mixed up unicode strings and byte strings. It should work if you use byte strings only, e.g.:

db.insert({'name': 'žir'})
db.search(where('name') == 'žir')

@eugene-eeo I wouldn't recommend normalizing the data that way as you will lose information. Say you insert both {'name': 'zir'} and {'name': 'žir'}, TinyDB will regard them as equal while they probably shouldn't be.
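
To illustrate the loss of information, here is a small sketch in Python 3 syntax. The helper strip_accents is hypothetical (it is not part of TinyDB or unicodedata) and assumes that "normalizing" means decomposing with NFKD and dropping the combining marks, since NFKC on its own leaves 'žir' unchanged:

```python
import unicodedata

def strip_accents(text):
    """Hypothetical helper: decompose (NFKD) and drop combining marks."""
    decomposed = unicodedata.normalize('NFKD', text)
    return ''.join(ch for ch in decomposed if not unicodedata.combining(ch))

plain = 'zir'
accented = '\u017eir'  # the text 'žir'

# NFKC keeps the caron, so it does not turn 'žir' into 'zir'.
nfkc_unchanged = unicodedata.normalize('NFKC', accented) == accented

# After stripping accents, two distinct names can no longer be told apart.
collide = strip_accents(accented) == strip_accents(plain)

print(nfkc_unchanged, collide)
```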

@msiemens when I use byte strings, as you've proposed, the db holds a unicode string for the value of the inserted document, and using search raises this warning:

/usr/local/lib/python2.7/dist-packages/tinydb/queries.py:183: UnicodeWarning: Unicode equal comparison failed to convert both arguments to Unicode - interpreting them as being unequal
  self._cmp = lambda value: value == other

Of course, search doesn't return the document I was searching for, only a None value.
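
The mismatch behind this warning can be sketched in Python 3 syntax, assuming the stored value is text and the query value is its UTF-8 byte string. Python 3 compares the two as unequal without the UnicodeWarning that Python 2 emits; the variable names are illustrative:

```python
stored = '\u017eir'                    # text, as held by the db
query_bytes = stored.encode('utf-8')   # the same name as a UTF-8 byte string

# Text and bytes never compare equal, so the query finds nothing.
bytes_match = (stored == query_bytes)

# Decoding the query value first makes both sides text, and they match.
text_match = (stored == query_bytes.decode('utf-8'))

print(bytes_match, text_match)
```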

What's the exact code you've used? If I use byte strings for both inserting and searching, it works...

>>> from tinydb import TinyDB, where
>>> from tinydb.storages import MemoryStorage
>>> db = TinyDB(storage=MemoryStorage)
>>> db.insert({'name': 'žir'})
1
>>> db.search(where('name') == 'žir')
[{'name': '?ir'}]

(Note: the question mark in the db.search result is caused by the Windows CMD terminal; it shouldn't be a bug in TinyDB)

This is the code I've used:

>>> from tinydb import TinyDB, where
>>> db = TinyDB('db.json')
>>> db.insert({'name': 'žir'})
1
>>> db.search(where('name') == 'žir')
/usr/local/lib/python2.7/dist-packages/tinydb/queries.py:183: UnicodeWarning: Unicode equal comparison failed to convert both arguments to Unicode - interpreting them as being unequal
  self._cmp = lambda value: value == other
[]

But I've tried with MemoryStorage as in your example, and it is working. Could the problem be somewhere in the file storage handling?

Could be, I'm investigating.

EDIT: This doesn't seem to have a trivial non-hacky solution, I'll work a bit on this.

@msiemens I think you should read this as well http://stackoverflow.com/questions/11759070/python-json-loads-dumps-break-unicode#11759156

UPDATE: It works:

>>> from ujson import dumps, loads
>>> d = dumps({"name": "ålpha"}, ensure_ascii=False)
>>> d
'{"name":"\xc3\xa5lpha"}'
>>> loads(d)
{u'name': u'\xe5lpha'}
>>> 

I was wrong, there is a trivial solution, see 6b518b8. Test cases for unicode data included.
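
For reference, the effect of ensure_ascii shown in the ujson session above can be reproduced with the standard-library json module, swapped in here for ujson and written in Python 3 syntax. This is only a sketch of the round trip and makes no claim about what the commit changes:

```python
import json

doc = {'name': '\u00e5lpha'}  # the text 'ålpha'

# Default: non-ASCII characters are written as \uXXXX escapes.
escaped = json.dumps(doc)

# ensure_ascii=False: the character is written out literally.
literal = json.dumps(doc, ensure_ascii=False)

# Either form decodes back to the same text value.
round_trip_ok = json.loads(escaped) == doc and json.loads(literal) == doc

print(escaped)
print(round_trip_ok)
```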

@zelenikotao Could you test if it works in the latest development version?

Tested it, works great for me!
Thanks!

Thanks for reporting!