Unicode in queries
djordjeglbvc opened this issue · 12 comments
When testing a db field against a unicode string, a UnicodeEncodeError
exception is raised.
The line which causes the exception:
db.get(where('name') == u'žir')
Inserting unicode data went without problems:
db.insert({'name': 'žir'})
I have made a quick hack which fixes the problem for my little hobby project, but I will examine this problem further when I find time.
In queries.py, I've changed the Query._update_repr function body to:
self._repr = u'\'{0}\' {1} {2}'.format(self._key, operator, value)
and Query.__hash__ to:
return hash(repr(unicode(self)))
Basically, this adds the string prefix "u" in _update_repr and a "unicode" call in __hash__.
Using TinyDB from git on Python 2.7.6, Ubuntu 14.04.
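The error can be reproduced in isolation. On Python 2, calling .format() on a byte-string template with a unicode argument implicitly encodes that argument with the ASCII codec. The sketch below performs that encoding step explicitly, so it behaves the same on Python 3, and shows that a unicode template avoids it. The variable names are illustrative and are not taken from TinyDB.

```python
value = u'\u017eir'  # the name from the report, written with an escape for z-caron

# Python 2 does the equivalent of this when a byte-string template
# receives a unicode argument; z-caron has no ASCII representation.
try:
    value.encode('ascii')
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False

# A unicode template never needs the ASCII round trip.
rendered = u"'{0}' {1} {2}".format(u'name', u'==', value)

print(ascii_ok)
print(rendered == u"'name' == \u017eir")
```

This is why prefixing the template with "u" makes the exception go away.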
Is it possible to normalize the data first before inserting? I.e. I know that there is a function called unicodedata.normalize that should help. Then you can query easily with:
db.get(where('name') == 'zir')
Can you provide the full traceback information? (Just copy + paste from your Python interpreter session)
@zelenikotao Can you please post a full traceback?
Sorry for not responding earlier, I didn't have any free time over the weekend.
@eugene-eeo I have tried with unicodedata.normalize, and the result is the same.
@eugene-eeo @msiemens, here is the traceback:
$ python example.py
Traceback (most recent call last):
  File "example.py", line 13, in <module>
    db.get(where('name') == unicodedata.normalize('NFKC', u'žir'))
  File "/usr/local/lib/python2.7/dist-packages/tinydb/queries.py", line 184, in __eq__
    self._update_repr('==', other)
  File "/usr/local/lib/python2.7/dist-packages/tinydb/queries.py", line 310, in _update_repr
    self._repr = '\'{0}\' {1} {2}'.format(self._key, operator, value)
UnicodeEncodeError: 'ascii' codec can't encode character u'\u017e' in position 0: ordinal not in range(128)
Here is the test script which causes the exception:
https://gist.github.com/zelenikotao/b23d79edc80bcea3b511.js
@zelenikotao You've mixed up unicode strings and byte strings. It should work if you use byte strings only, e.g.:
db.insert({'name': 'žir'})
db.search(where('name') == 'žir')
@eugene-eeo I wouldn't recommend normalizing the data that way as you will lose information. Say you insert both {'name': 'zir'} and {'name': 'žir'}: TinyDB will regard them as equal while they probably shouldn't be.
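To illustrate the point about lost information: unicodedata.normalize on its own does not turn 'žir' into 'zir'. Getting there requires decomposing the string and dropping the combining marks, and after that the two names can no longer be told apart. strip_accents is a hypothetical helper written for this sketch, not part of TinyDB.

```python
import unicodedata


def strip_accents(text):
    """Decompose text and drop combining marks (hypothetical helper)."""
    decomposed = unicodedata.normalize('NFKD', text)
    return u''.join(ch for ch in decomposed if not unicodedata.combining(ch))


accented = u'\u017eir'  # name with z-caron
plain = u'zir'

# NFKC recomposes the string, so the non-ASCII character is still there.
nfkc_unchanged = unicodedata.normalize('NFKC', accented) == accented

# After stripping, the two distinct names collide.
collide = strip_accents(accented) == strip_accents(plain)

print(nfkc_unchanged)
print(collide)
```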
@msiemens When I use byte strings, as you've proposed, the db holds a unicode string as the value of the inserted document, and using search raises this warning:
/usr/local/lib/python2.7/dist-packages/tinydb/queries.py:183: UnicodeWarning: Unicode equal comparison failed to convert both arguments to Unicode - interpreting them as being unequal
  self._cmp = lambda value: value == other
Of course, search doesn't return the document I was searching for, only a None value.
What's the exact code you've used? If I use byte strings for both inserting and searching, it works...
>>> from tinydb import TinyDB, where
>>> from tinydb.storages import MemoryStorage
>>> db = TinyDB(storage=MemoryStorage)
>>> db.insert({'name': 'žir'})
1
>>> db.search(where('name') == 'žir')
[{'name': '?ir'}]
(Note: the question mark in the db.search result is caused by the Windows CMD terminal; it shouldn't be a bug in TinyDB.)
This is the code I've used:
>>> from tinydb import TinyDB, where
>>> db = TinyDB('db.json')
>>> db.insert({'name': 'žir'})
1
>>> db.search(where('name') == 'žir')
/usr/local/lib/python2.7/dist-packages/tinydb/queries.py:183: UnicodeWarning: Unicode equal comparison failed to convert both arguments to Unicode - interpreting them as being unequal
  self._cmp = lambda value: value == other
[]
But I've tried with MemoryStorage as in your example, and it works. Could the problem be somewhere in the file storage handling?
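The mismatch can be reproduced without TinyDB. The sketch below assumes that the file storage hands back unicode strings after the JSON round trip (which matches the behaviour described here) and that the byte-string literal is UTF-8 encoded. Under those assumptions the query never compares equal to the stored value until it is decoded; on Python 2 the first comparison also emits the UnicodeWarning shown above.

```python
stored = u'\u017eir'            # assumed: value as read back from the JSON file
query = stored.encode('utf-8')  # assumed: bytes held by a UTF-8 byte-string literal

raw_match = (stored == query)                      # bytes vs. unicode
decoded_match = (stored == query.decode('utf-8'))  # unicode vs. unicode

print(raw_match)
print(decoded_match)
```

With MemoryStorage nothing is serialized, so the inserted byte string stays a byte string and compares equal to the byte-string query.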
Could be, I'm investigating.
EDIT: This doesn't seem to have a trivial non-hacky solution, I'll work a bit on this.
@msiemens I think you should read this as well http://stackoverflow.com/questions/11759070/python-json-loads-dumps-break-unicode#11759156
UPDATE: It works:
>>> from ujson import dumps, loads
>>> d = dumps({"name": "ålpha"}, ensure_ascii=False)
>>> d
'{"name":"\xc3\xa5lpha"}'
>>> loads(d)
{u'name': u'\xe5lpha'}
>>>
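The same round trip can be checked with the standard library's json module, swapped in here for ujson so the sketch has no third-party dependency. It only illustrates that loads returns the same unicode text whether or not dumps escaped the non-ASCII characters; it is not the change made in TinyDB.

```python
import json

doc = {u'name': u'\u00e5lpha'}  # the example document, a-ring written as an escape

escaped = json.dumps(doc)                  # default: non-ASCII becomes \uXXXX escapes
raw = json.dumps(doc, ensure_ascii=False)  # non-ASCII characters are kept as-is

# Both serializations decode back to the original unicode document.
same_after_load = json.loads(escaped) == json.loads(raw) == doc

print('\\u00e5' in escaped)
print(same_after_load)
```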
I was wrong, there is a trivial solution, see 6b518b8. Test cases for unicode data included.
@zelenikotao Could you test if it works in the latest development version?
Tested it, works great for me!
Thanks!
Thanks for reporting!