casmlab/stack

mongo insert stops when duplicate encountered

libbyh opened this issue · 7 comments

Here's an example from the mil2 project:

Traceback (most recent call last):
  File "__main__.py", line 448, in <module>
    c.process_command(command)
  File "/home/libbyh/github/casmlab/stack/app/controller.py", line 144, in process_command
    self.restart()
  File "/home/libbyh/github/casmlab/stack/app/controller.py", line 304, in restart
    self.start()
  File "/home/libbyh/github/casmlab/stack/app/controller.py", line 177, in start
    self.run()
  File "/home/libbyh/github/casmlab/stack/app/controller.py", line 317, in run
    mongoBatchInsert.go(self.project_id, self.rawdir, self.insertdir, self.logdir)
  File "/home/libbyh/github/casmlab/stack/app/twitter/mongoBatchInsert.py", line 214, in go
    inserted_ids_list = insert_tweet_list(insert_db, tweets_list, line_number, processedTweetsFile, data_db)
  File "/home/libbyh/github/casmlab/stack/app/twitter/mongoBatchInsert.py", line 67, in insert_tweet_list
    inserted_ids_list = mongoCollection.insert(tweets_list, continue_on_error=True)
  File "/home/libbyh/anaconda3/envs/stack/lib/python2.7/site-packages/pymongo/collection.py", line 410, in insert
    _check_write_command_response(results)
  File "/home/libbyh/anaconda3/envs/stack/lib/python2.7/site-packages/pymongo/helpers.py", line 198, in _check_write_command_response

The inserter should just skip the duplicate gracefully instead of stopping.

See /.../stack/out/mil2-58e844bb21e38548ecb86364/std/mil2-insert-twitter-58e844bb21e38548ecb86364-stderr.txt

I think the inserter actually skips the duplicates. Could you please paste the actual error message (the last line of traceback)?

This is from the error log of one of our projects:

Traceback (most recent call last):
  File "__main__.py", line 448, in <module>
    c.process_command(command)
  File "/home/bits/stack/app/controller.py", line 144, in process_command
    self.restart()
  File "/home/bits/stack/app/controller.py", line 304, in restart
    self.start()
  File "/home/bits/stack/app/controller.py", line 177, in start
    self.run()
  File "/home/bits/stack/app/controller.py", line 317, in run
    mongoBatchInsert.go(self.project_id, self.rawdir, self.insertdir, self.logdir)
  File "/home/bits/stack/app/twitter/mongoBatchInsert.py", line 228, in go
    inserted_ids_list = insert_tweet_list(deleteCollection, deleted_tweets_list, line_number, processedTweetsFile, delete_db)
  File "/home/bits/stack/app/twitter/mongoBatchInsert.py", line 66, in insert_tweet_list
    inserted_ids_list = mongoCollection.insert(tweets_list, continue_on_error=True)
  File "/usr/local/lib/python2.7/dist-packages/pymongo/collection.py", line 409, in insert
    gen(), check_keys, self.uuid_subtype, client)
  File "/usr/local/lib/python2.7/dist-packages/pymongo/mongo_client.py", line 1111, in _send_message
    sock_info = self.__socket(member)
  File "/usr/local/lib/python2.7/dist-packages/pymongo/mongo_client.py", line 919, in __socket
    "%s %s" % (host_details, str(why)))
pymongo.errors.AutoReconnect: could not connect to localhost:27017: [Errno 111] Connection refused

Traceback (most recent call last):
  File "__main__.py", line 448, in <module>
    c.process_command(command)
  File "/home/libbyh/github/casmlab/stack/app/controller.py", line 144, in process_command
    self.restart()
  File "/home/libbyh/github/casmlab/stack/app/controller.py", line 304, in restart
    self.start()
  File "/home/libbyh/github/casmlab/stack/app/controller.py", line 177, in start
    self.run()
  File "/home/libbyh/github/casmlab/stack/app/controller.py", line 317, in run
    mongoBatchInsert.go(self.project_id, self.rawdir, self.insertdir, self.logdir)
  File "/home/libbyh/github/casmlab/stack/app/twitter/mongoBatchInsert.py", line 214, in go
    inserted_ids_list = insert_tweet_list(insert_db, tweets_list, line_number, processedTweetsFile, data_db)
  File "/home/libbyh/github/casmlab/stack/app/twitter/mongoBatchInsert.py", line 67, in insert_tweet_list
    inserted_ids_list = mongoCollection.insert(tweets_list, continue_on_error=True)
  File "/home/libbyh/anaconda3/envs/stack/lib/python2.7/site-packages/pymongo/collection.py", line 410, in insert
    _check_write_command_response(results)
  File "/home/libbyh/anaconda3/envs/stack/lib/python2.7/site-packages/pymongo/helpers.py", line 198, in _check_write_command_response
    raise DuplicateKeyError(error.get("errmsg"), 11000, error)
pymongo.errors.DuplicateKeyError: E11000 duplicate key error collection: potus45_5886bdea21e38564ac1ccfd8.tweets index: id_str_1 dup key: { : "931464828660715521" }

Did you create a unique index on this field?

You can get the info by using index_information()
http://api.mongodb.com/python/current/api/pymongo/collection.html#pymongo.collection.Collection.index_information
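
As a rough illustration of that check, the sketch below filters the dictionary that index_information() returns and keeps only the indexes flagged as unique. The helper name unique_index_fields and the sample dictionary are hypothetical; the sample only mimics the shape of the real return value, so the snippet runs without a MongoDB server.

```python
def unique_index_fields(index_info):
    """Return {index_name: [field, ...]} for every index flagged unique.

    index_info is expected to look like the dict returned by
    Collection.index_information().
    """
    unique = {}
    for name, spec in index_info.items():
        if spec.get("unique"):
            # spec["key"] is a list of (field, direction) pairs.
            unique[name] = [field for field, _direction in spec["key"]]
    return unique


# Illustrative sample only; a real call would pass
# mongoCollection.index_information() instead.
sample_info = {
    "_id_": {"v": 1, "key": [("_id", 1)]},
    "id_str_1": {"v": 1, "key": [("id_str", 1)], "unique": True},
}

print(unique_index_fields(sample_info))
```

Against a live collection this would be called as unique_index_fields(mongoCollection.index_information()). An entry for id_str_1 in the result would match the index named in the E11000 message above.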

Yes, we have a couple of unique indices set so that we don't keep throwing in dups.

Hi everyone, I remember adding a unique key in MongoDB to avoid duplicate tweet entries when I was working on this in April. The existing duplicate tweets were removed at that time.

I see.

I could not test this myself, as none of our collections have a unique index defined. Could you add this right after line 77 of mongoBatchInsert.py?

    except pymongo.errors.DuplicateKeyError as e:
        # Log the duplicate and carry on instead of letting the inserter crash.
        print("Exception during mongo insert")
        logger.warning("Duplicate error during mongo insert at or before file line number %d (%s)" % (line_number, processedTweetsFile))
        logging.exception(e)
        print(traceback.format_exc())
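
For anyone who wants to try the idea without a running MongoDB, here is a minimal self-contained sketch of "skip the duplicate and keep going". It is not the project's code: DuplicateKeyError is a local stand-in for pymongo.errors.DuplicateKeyError, FakeCollection is a hypothetical in-memory collection that enforces a unique id_str, and insert_skipping_duplicates is a made-up helper that inserts one document at a time, which would be slower than a batch insert.

```python
import logging

logger = logging.getLogger(__name__)


class DuplicateKeyError(Exception):
    """Stand-in for pymongo.errors.DuplicateKeyError (assumption, for illustration)."""


class FakeCollection(object):
    """Hypothetical in-memory collection with a unique constraint on id_str."""

    def __init__(self):
        self.seen = set()

    def insert(self, doc):
        if doc["id_str"] in self.seen:
            raise DuplicateKeyError("E11000 duplicate key error: %s" % doc["id_str"])
        self.seen.add(doc["id_str"])
        return doc["id_str"]


def insert_skipping_duplicates(collection, tweets_list, duplicate_error=DuplicateKeyError):
    """Insert documents one by one, logging and skipping any duplicate."""
    inserted_ids = []
    skipped = 0
    for tweet in tweets_list:
        try:
            inserted_ids.append(collection.insert(tweet))
        except duplicate_error:
            # A duplicate is expected with a unique index; note it and move on.
            skipped += 1
            logger.warning("Skipping duplicate tweet %s", tweet.get("id_str"))
    return inserted_ids, skipped


tweets = [
    {"id_str": "931464828660715521"},
    {"id_str": "931464828660715521"},  # same id as above, should be skipped
    {"id_str": "931464828660715522"},
]
inserted_ids, skipped = insert_skipping_duplicates(FakeCollection(), tweets)
```

With a real collection, the duplicate_error argument would be pymongo.errors.DuplicateKeyError. The point of the design is that only the duplicate error is swallowed, so a connection failure such as the AutoReconnect shown earlier would still surface.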

I'm not running any right now but will try to get to this before I talk to you on Monday.