johnnykv/mnemosyne

Error normalizing long content into the dork collection.

Opened this issue · 2 comments

I found some errors in mnemosyne.err, shown below.

OperationFailure: Btree::insert: key too large to index, failing mnemosyne.dork.$content_1 1233 { : "/999999.9+/%2A%2A/uNiOn/%2A%2A/aLl+/%2A%2A/sElEcT+0x393133353134353632312e39,0x393133353134353632322e39,0x393133353134353632332e39,0x39313335313435363..." }

It seems the content is too long to be indexed (MongoDB rejects index keys over 1024 bytes, and long attack URLs easily exceed that).
I'm using a hashed index on the content instead of indexing the raw text:

https://github.com/johnnykv/mnemosyne/blob/master/persistance/mnemodb.py#L48

from pymongo import MongoClient, HASHED

# Index the hash of the content instead of the raw string, so the index
# key stays a fixed size no matter how long the dork content is.
self.db.dork.ensure_index([('content', HASHED)], unique=False, background=True)
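
A hashed index only serves equality matches, but that's all that is needed here, since the dork lookup filters on exact content and type values. A minimal sketch to confirm the lookup still works (the values and database name are hypothetical examples, not taken from the code):

from pymongo import MongoClient
db = MongoClient()['mnemosyne']
# Equality filters like the one the dork upsert uses can still be
# answered through the hashed index on 'content'.
doc = db.dork.find_one({'content': '/index.php?id=1', 'type': 'inurl'})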

Now it seems to work fine.
If you have any suggestions, please let me know.

sh4t commented

@sidra-asa

Were you still seeing errors afterwards, related to the upsert?

Traceback (most recent call last):
  File "/opt/mnemosyne/env/local/lib/python2.7/site-packages/gevent/greenlet.py", line 327, in run
    result = self._run(*self.args, **self.kwargs)
  File "/opt/mnemosyne/normalizer/normalizer.py", line 125, in inserter
    self.database.insert_normalized(norm, id, identifier)
  File "/opt/mnemosyne/persistance/mnemodb.py", line 97, in insert_normalized
    upsert=True)
  File "/opt/mnemosyne/env/local/lib/python2.7/site-packages/pymongo/collection.py", line 552, in update
    _check_write_command_response(results)
  File "/opt/mnemosyne/env/local/lib/python2.7/site-packages/pymongo/helpers.py", line 205, in _check_write_command_response
    raise OperationFailure(error.get("errmsg"), error.get("code"), error)
OperationFailure: insertDocument :: caused by :: 17280 Btree::insert: key too large to index, failing mnemosyne.dork.$content_1 1127 { : "/suse/include/components/com_artlinks/support/mailling/maillist/inc/include/control/999999.9+%0BuNiOn%0BaLl+%0BsElEcT+0x393133353134353632312e39,0x393..." }
<Greenlet at 0x7f9f7d9db7d0: <bound method Normalizer.inserter of <normalizer.normalizer.Normalizer object at 0x7f9f7d9c5f90>>([([{'session': {'_id': ObjectId('57aee159e5645d38e)> failed with OperationFailure

I'm attempting the hashed index as well, though without recreating the entire collection; it's still failing, though I believe that's because of the upsert on the update method.

mnemosyne/persistance/mnemodb.py, line 97 onward:

                # Bump lasttime/count for an existing dork, or insert a new one.
                elif collection == 'dork':
                    self.db[collection].update({'content': document['content'], 'type': document['type']},
                                               {'$set': {'lasttime': document['timestamp']},
                                                '$inc': {'count': document['count']}},
                                               upsert=True)
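
If the old content_1 btree index was never dropped, the upsert's insert path still has to write the full content value into it, which would explain why adding the hashed index alone doesn't help: when no document matches, upsert inserts a new document, and that insert must produce an entry for every index on the collection. A quick way to check which indexes survive (a sketch; assumes a direct connection to the default mnemosyne database):

from pymongo import MongoClient
db = MongoClient()['mnemosyne']
# If 'content_1' still shows up here, the oversized btree index is
# still live and inserts through the upsert will keep failing.
print(db.dork.index_information())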

@sh4t

I dropped the index on dork content and created a hashed one.
I just checked the log, and there's no error like yours.
Could you give it a try and see if the error still occurs?
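
For reference, the drop-and-recreate step looks roughly like this (a sketch; the index name content_1 is taken from the error message, and the database name assumes the default):

from pymongo import MongoClient, HASHED
db = MongoClient()['mnemosyne']
# Remove the btree index that enforces the ~1 KB key limit, then
# rebuild it as a hashed index that stores fixed-size hashes instead.
db.dork.drop_index('content_1')
db.dork.ensure_index([('content', HASHED)], unique=False, background=True)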

If you have any suggestions, please let me know.