
mongo insert stops when duplicate encountered

libbyh opened this issue · 7 comments

Here's an example from the mil2 project:

Traceback (most recent call last):
  File "", line 448, in <module>
  File "/home/libbyh/github/casmlab/stack/app/", line 144, in process_command
  File "/home/libbyh/github/casmlab/stack/app/", line 304, in restart
  File "/home/libbyh/github/casmlab/stack/app/", line 177, in start
  File "/home/libbyh/github/casmlab/stack/app/", line 317, in run
    mongoBatchInsert.go(self.project_id, self.rawdir, self.insertdir, self.logdir)
  File "/home/libbyh/github/casmlab/stack/app/twitter/", line 214, in go
    inserted_ids_list = insert_tweet_list(insert_db, tweets_list, line_number, processedTweetsFile, data_db)
  File "/home/libbyh/github/casmlab/stack/app/twitter/", line 67, in insert_tweet_list
    inserted_ids_list = mongoCollection.insert(tweets_list, continue_on_error=True)
  File "/home/libbyh/anaconda3/envs/stack/lib/python2.7/site-packages/pymongo/", line 410, in insert
  File "/home/libbyh/anaconda3/envs/stack/lib/python2.7/site-packages/pymongo/", line 198, in _check_write_command_response

The insert should just skip the duplicate gracefully instead of stopping.

See /.../stack/out/mil2-58e844bb21e38548ecb86364/std/mil2-insert-twitter-58e844bb21e38548ecb86364-stderr.txt

I think the inserter actually skips the duplicates. Could you please paste the actual error message (the last line of traceback)?

This is from the error log of one of our projects:

Traceback (most recent call last):
  File "", line 448, in <module>
  File "/home/bits/stack/app/", line 144, in process_command
  File "/home/bits/stack/app/", line 304, in restart
  File "/home/bits/stack/app/", line 177, in start
  File "/home/bits/stack/app/", line 317, in run
    mongoBatchInsert.go(self.project_id, self.rawdir, self.insertdir, self.logdir)
  File "/home/bits/stack/app/twitter/", line 228, in go
    inserted_ids_list = insert_tweet_list(deleteCollection, deleted_tweets_list, line_number, processedTweetsFile, delete_db)
  File "/home/bits/stack/app/twitter/", line 66, in insert_tweet_list
    inserted_ids_list = mongoCollection.insert(tweets_list, continue_on_error=True)
  File "/usr/local/lib/python2.7/dist-packages/pymongo/", line 409, in insert
    gen(), check_keys, self.uuid_subtype, client)
  File "/usr/local/lib/python2.7/dist-packages/pymongo/", line 1111, in _send_message
    sock_info = self.__socket(member)
  File "/usr/local/lib/python2.7/dist-packages/pymongo/", line 919, in __socket
    "%s %s" % (host_details, str(why)))
pymongo.errors.AutoReconnect: could not connect to localhost:27017: [Errno 111] Connection refused

Traceback (most recent call last):
  File "", line 448, in <module>
  File "/home/libbyh/github/casmlab/stack/app/", line 144, in process_command
  File "/home/libbyh/github/casmlab/stack/app/", line 304, in restart
  File "/home/libbyh/github/casmlab/stack/app/", line 177, in start
  File "/home/libbyh/github/casmlab/stack/app/", line 317, in run
    mongoBatchInsert.go(self.project_id, self.rawdir, self.insertdir, self.logdir)
  File "/home/libbyh/github/casmlab/stack/app/twitter/", line 214, in go
    inserted_ids_list = insert_tweet_list(insert_db, tweets_list, line_number, processedTweetsFile, data_db)
  File "/home/libbyh/github/casmlab/stack/app/twitter/", line 67, in insert_tweet_list
    inserted_ids_list = mongoCollection.insert(tweets_list, continue_on_error=True)
  File "/home/libbyh/anaconda3/envs/stack/lib/python2.7/site-packages/pymongo/", line 410, in insert
  File "/home/libbyh/anaconda3/envs/stack/lib/python2.7/site-packages/pymongo/", line 198, in _check_write_command_response
    raise DuplicateKeyError(error.get("errmsg"), 11000, error)
pymongo.errors.DuplicateKeyError: E11000 duplicate key error collection: potus45_5886bdea21e38564ac1ccfd8.tweets index: id_str_1 dup key: { : "931464828660715521" }

Did you create a unique index on this field?

You can check by calling index_information() on the collection.
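As a rough sketch of that check: the helper below (`unique_indexes` is a made-up name, not part of STACK or PyMongo) reads the dictionary that PyMongo's `Collection.index_information()` returns and keeps only the entries flagged `unique`. The default `_id_` index normally does not carry that flag even though it is unique, so it would not show up here.

```python
def unique_indexes(collection):
    """Return {index_name: [field, ...]} for indexes flagged as unique.

    `collection` is anything exposing PyMongo's index_information(),
    which maps index names to dicts such as
    {"key": [("id_str", 1)], "unique": True, ...}.
    """
    result = {}
    for name, spec in collection.index_information().items():
        if spec.get("unique"):
            # Keep only the field names; drop the sort direction.
            result[name] = [field for field, _direction in spec["key"]]
    return result


# Usage against a live database (names taken from the error message in this
# thread; assumes a reachable mongod):
#
#   from pymongo import MongoClient
#   tweets = MongoClient()["potus45_5886bdea21e38564ac1ccfd8"]["tweets"]
#   print(unique_indexes(tweets))
```

If the `id_str_1` index from the traceback is unique, it should appear in the result.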

Yes, we have a couple of unique indices set so that we don't keep throwing in dups.

Hi everyone, I remember adding a unique key in MongoDB to avoid duplicate tweet entries, and the existing duplicate tweets were removed, when I was working on this in April.

I see.

I could not test this myself as none of our collections had any unique index defined. Could you add this right after line 77 of

except pymongo.errors.DuplicateKeyError as e:
    print("Exception during mongo insert")
    logger.warning("Duplicate error during mongo insert at or before file line number %d (%s)" % (line_number, processedTweetsFile))
    print(traceback.format_exc())
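For context, here is a minimal sketch of how that clause could sit around the insert call. It is not the actual STACK code: the function body is reduced to the one call shown in the traceback, the parameter names are simplified, and the `ImportError` fallback exists only so the sketch runs without PyMongo installed. It assumes the legacy `Collection.insert(..., continue_on_error=True)` API from the PyMongo 2.x line used in this thread, where the server keeps processing the batch and the driver reports the most recent error afterwards.

```python
import logging

try:
    from pymongo.errors import DuplicateKeyError
except ImportError:
    # Stand-in so this sketch can run without PyMongo installed.
    class DuplicateKeyError(Exception):
        pass

logger = logging.getLogger(__name__)


def insert_tweet_list(mongo_collection, tweets_list, line_number, processed_file):
    """Insert a batch and keep going if a unique index rejects a duplicate."""
    inserted_ids = []
    try:
        inserted_ids = mongo_collection.insert(tweets_list, continue_on_error=True)
    except DuplicateKeyError:
        # The ids of documents that did get inserted are not returned when
        # the call raises, so this returns an empty list in that case.
        logger.warning(
            "Duplicate error during mongo insert at or before file line "
            "number %d (%s)", line_number, processed_file, exc_info=True)
    return inserted_ids
```

With this shape the caller in `go` sees an empty list instead of an exception, so the batch loop can move on to the next file.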

I'm not running any right now but will try to get to this before I talk to you on Monday.