Unicode in queries

Question

djordjeglbvc opened this issue 10 years ago · 12 comments

Querying a db field with a unicode string raises a UnicodeEncodeError exception.
The line that causes the exception:

db.get(where('name') == u'žir')

Inserting unicode data went without problems:

db.insert({'name': 'žir'})

I have made a quick hack which fixes the problem for my little hobby project, but I will examine this problem more when I find time.

In queries.py, I've changed Query._update_repr function body to:

self._repr = u'\'{0}\' {1} {2}'.format(self._key, operator, value)

and Query.__hash__ to:

return hash(repr(unicode(self)))

Basically, this adds the string prefix "u" in _update_repr and a "unicode" call in __hash__...

Using tinydb from git on python 2.7.6, ubuntu 14.04
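
For illustration, a minimal sketch of why the original line fails, assuming Python 2's implicit conversion rules: formatting a unicode value into a byte-string template forces an ASCII encode of that value. The snippet performs the encode step explicitly so that it also runs on Python 3. `render_repr` and `ascii_encode_fails` are hypothetical helpers written for this sketch, not TinyDB code.

```python
# Hypothetical stand-in for Query._update_repr; not TinyDB's actual code.
def render_repr(key, operator, value):
    # A text (unicode) template keeps non-ASCII values intact.
    return u"'{0}' {1} {2}".format(key, operator, value)


def ascii_encode_fails(text):
    # Mimics the implicit ASCII encode that Python 2 applies when a
    # unicode value is formatted into a byte-string template.
    try:
        text.encode('ascii')
    except UnicodeEncodeError:
        return True
    return False


value = u'\u017eir'  # the name from the report, written with an escape

print(ascii_encode_fails(value))    # the non-ASCII value cannot be encoded
print(ascii_encode_fails(u'zir'))   # a plain ASCII value can
print(ascii(render_repr('name', '==', value)))  # ascii() keeps output terminal-safe
```

With a unicode template no encode step is needed, which is presumably why the "u" prefix makes the error go away.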

Answer 1 · 2014-09-12T12:47:54.000Z

Is it possible to normalize the data first before inserting? I.e. I know that there is a function called unicodedata.normalize that should help. Then you can query easily with:

db.get(where('name') == 'zir')

Can you provide the full traceback information? (Just copy + paste from your Python interpreter session)

Answer 2 · 2014-09-15T02:38:49.000Z

@zelenikotao Can you please post a full traceback?

Answer 3 · 2014-09-15T08:49:21.000Z

Sorry for not responding earlier, I didn't have any free time over the weekend.

@eugene-eeo I have tried with unicodedata.normalize; the result is the same.

@eugene-eeo @msiemens, here is the traceback:

$ python example.py 

Traceback (most recent call last):
  File "example.py", line 13, in <module>
    db.get(where('name') == unicodedata.normalize('NFKC', u'žir'))
  File "/usr/local/lib/python2.7/dist-packages/tinydb/queries.py", line 184, in __eq__
    self._update_repr('==', other)
  File "/usr/local/lib/python2.7/dist-packages/tinydb/queries.py", line 310, in _update_repr
    self._repr = '\'{0}\' {1} {2}'.format(self._key, operator, value)
UnicodeEncodeError: 'ascii' codec can't encode character u'\u017e' in position 0: ordinal not in range(128)

Here is the test script which causes the exception:
https://gist.github.com/zelenikotao/b23d79edc80bcea3b511.js

Answer 4 · 2014-09-16T00:15:06.000Z

@zelenikotao You've mixed up unicode strings and byte strings. It should work if you use byte strings only, e.g.:

db.insert({'name': 'žir'})
db.search(where('name') == 'žir')

@eugene-eeo I wouldn't recommend normalizing the data that way, as you will lose information. Say you insert both {'name': 'zir'} and {'name': 'žir'}: TinyDB will regard them as equal while they probably shouldn't be.
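
A small sketch of that information loss, assuming "normalizing" here means folding accented letters down to ASCII. Note that `unicodedata.normalize` on its own does not turn "ž" into "z"; the folding only happens once the combining marks are dropped. `strip_accents` is a hypothetical helper written for this example.

```python
import unicodedata


def strip_accents(text):
    # Decompose characters, then drop the combining marks (lossy).
    decomposed = unicodedata.normalize('NFKD', text)
    return u''.join(ch for ch in decomposed if not unicodedata.combining(ch))


accented = u'\u017eir'  # "zir" with a caron on the z
plain = u'zir'

# NFKC alone leaves the caron in place, so the two names stay distinct.
print(unicodedata.normalize('NFKC', accented) == plain)   # False

# After stripping accents both names collapse to the same value.
print(strip_accents(accented) == strip_accents(plain))    # True
```

Once both values collapse to "zir", a query can no longer tell the two documents apart.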

Answer 5 · 2014-09-16T09:55:10.000Z

@msiemens when I use byte strings, as you've proposed, the db holds a unicode string as the value in the inserted document, and using search raises this warning:

/usr/local/lib/python2.7/dist-packages/tinydb/queries.py:183: UnicodeWarning: Unicode equal comparison failed to convert both arguments to Unicode - interpreting them as being unequal
  self._cmp = lambda value: value == other

Of course, search doesn't return the document I was searching for, only a None value.

Answer 6 · 2014-09-16T10:57:46.000Z

What's the exact code you've used? If I use byte strings for both inserting and searching, it works...

>>> from tinydb import TinyDB, where
>>> from tinydb.storages import MemoryStorage
>>> db = TinyDB(storage=MemoryStorage)
>>> db.insert({'name': 'žir'})
1
>>> db.search(where('name') == 'žir')
[{'name': '?ir'}]

(Note: the question mark in the db.search result is caused by the Windows CMD terminal; it shouldn't be a bug in TinyDB.)

Answer 7 · 2014-09-16T11:05:18.000Z

This is the code I've used:

>>> from tinydb import TinyDB, where
>>> db = TinyDB('db.json')
>>> db.insert({'name': 'žir'})
1
>>> db.search(where('name') == 'žir')
/usr/local/lib/python2.7/dist-packages/tinydb/queries.py:183: UnicodeWarning: Unicode equal comparison failed to convert both arguments to Unicode - interpreting them as being unequal
  self._cmp = lambda value: value == other
[]

But I've tried with MemoryStorage as in your example, and it is working. Could the problem be somewhere in the file storage handling?
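
One plausible reading of the difference, offered as an assumption rather than a confirmed diagnosis: an in-memory store hands back the inserted object unchanged, while a JSON file round trip always returns unicode text, so a byte-string query value no longer matches. The sketch below simulates that round trip with the standard `json` module on Python 3 and does not use TinyDB itself.

```python
import json

# Simulate a file-storage round trip: JSON decoding always yields text.
stored = json.loads(json.dumps({'name': u'\u017eir'}))['name']

# A byte-string query value, as a UTF-8 encoded Python 2 str literal would be.
query_bytes = u'\u017eir'.encode('utf-8')

print(stored == query_bytes)                  # False: text compared with bytes
print(stored == query_bytes.decode('utf-8'))  # True: both sides are text
```

Decoding the query value before comparing (or using unicode literals throughout) makes both sides the same type.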

Answer 8 · 2014-09-16T11:15:06.000Z

Could be, I'm investigating.

EDIT: This doesn't seem to have a trivial non-hacky solution; I'll work a bit on this.

Answer 9 · 2014-09-16T13:46:42.000Z

@msiemens I think you should read this as well http://stackoverflow.com/questions/11759070/python-json-loads-dumps-break-unicode#11759156

UPDATE: It works:

>>> from ujson import dumps, loads
>>> d = dumps({"name": "ålpha"}, ensure_ascii=False)
>>> d
'{"name":"\xc3\xa5lpha"}'
>>> loads(d)
{u'name': u'\xe5lpha'}
>>>
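
For comparison, a sketch of the same round trip with the standard library `json` module on Python 3 (an assumption; the session above uses ujson on Python 2). It shows that `ensure_ascii` only changes how the text is written out, not what is read back.

```python
import json

doc = {'name': u'\u00e5lpha'}  # same document as above, written with an escape

escaped = json.dumps(doc)                  # default: non-ASCII becomes \uXXXX
raw = json.dumps(doc, ensure_ascii=False)  # non-ASCII characters are kept as-is

print(escaped)
# Both encodings decode back to the same document.
print(json.loads(escaped) == doc and json.loads(raw) == doc)  # True
```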

Answer 10 · 2014-09-17T13:50:35.000Z

I was wrong, there is a trivial solution, see 6b518b8. Test cases for unicode data included.

@zelenikotao Could you test if it works in the latest development version?

Answer 11 · 2014-09-17T14:29:48.000Z

Tested it, works great for me!
Thanks!

Answer 12 · 2014-09-17T14:31:41.000Z

Thanks for reporting!