richardwilly98/elasticsearch-river-mongodb

Suggestion: Initial sync

calexandre opened this issue · 24 comments

Hey Richard,
I would like to suggest some sort of initial sync functionality (optional).

Something like: when you create the river via the PUT API, you could pass some additional options describing how the initial sync should be performed.

This would be a one-time operation. I don't even know if it is possible...

The main issue is that not everything is on the oplog, especially for really large and stale collections...
So it would be nice to implement a set of options that would allow the user to tell the river to pull all data from mongo (much like a GetAll operation).

Of course we could discuss different strategies for pulling the data, such as:

  1. GetAll (easy, but cumbersome for large collections)
  2. via MongoDump, MongoExport or BsonDump
  3. Others..?

It would be nice to support different import strategies, much like plugins for this river (see the sketch just below for what the settings might look like).
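As a rough sketch, the river settings for such an option might look like this. The "type", "mongodb", and "index" fields follow the river's existing settings format; the "initial_import" block is purely hypothetical and only illustrates the suggestion:

    PUT /_river/mongodb/_meta
    {
        "type": "mongodb",
        "mongodb": {
            "db": "mydb",
            "collection": "mycollection",
            "initial_import": {
                "enabled": true,
                "strategy": "getall"
            }
        },
        "index": {
            "name": "myindex",
            "type": "mycollection"
        }
    }

(Again, only "type", "mongodb", and "index" exist today; "initial_import" is invented here for illustration.)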

Keep up the good work :)

xma commented

Hello,

I agree. If that could be exposed via an API or some other way (config file, etc.), that would be really great.

For stale collections (at least for my use case) I've made a patch:
#45

However, I don't like the idea of running a personal fork, suited only to my use case, in production.

Regards,

Yes, I'd also love to see this feature implemented. After working with the MySQL river (which slurps in the table initially), I thought I was doing something wrong when my collection wasn't being slurped. If there is no plan to implement this, it might be worth mentioning in the wiki.

+1. Once we have changed the mapping, we need to clean out the old index and then need a "re-pull" function that pulls all data from MongoDB into Elasticsearch. Hope to see this feature.

+1. Would love to see this feature supported

+1 would love this too!

+1 for this!

A good workaround for this would be to simply do a BULK-UPDATE on the collection after the mongo rivers are set up. I use this for millions of records and it works great.

+1 Very useful!

+1

subratbasnet:
what do you mean by a BULK-UPDATE? If A is the large, stale collection, do you mean setting up another empty collection B and doing this:

    db.collectionA.find().forEach(function(i) {
        i.ts_imported = new Date();
        db.collectionB.insert(i);
    });

Then setting up the river on collectionB?

Yevesx:

What I meant was: when you have a stale collection in Mongo, first you would set up the river for that collection. This will NOT automatically start moving the data from the stale collection to Elasticsearch.

To trigger that, you could simply perform a bulk update on the collection with a condition that matches all the records. For example, in my case I simply change the "updated" field inside all the documents in my collection, and this triggers the river, which moves the affected documents to Elasticsearch.
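In mongo shell terms, that bulk update might look something like this (a minimal sketch; the collection and field names are just placeholders):

    db.mycollection.update(
        {},                                 // match every document
        { $set: { updated: new Date() } },  // touch a field so each doc hits the oplog
        { multi: true }                     // apply to all matches, not just the first
    );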

I see! This is a clever idea.

Updating every document in MongoDB to get them to appear in the oplog and be copied by the river is very clever. Unfortunately, I believe it will make the initial import much slower: writes to MongoDB are much slower for me than writes to Elasticsearch (because MongoDB stores data less efficiently than ES, and because of MongoDB's unfortunate DB-level lock). Do you think it would work if we copied over all documents from MongoDB and then iterated over the oplog? I think that's what this issue is requesting, and it doesn't sound much more difficult than what we have today. A nice optimization would be to read the latest oplog timestamp, import the collection, then import the oplog only from that start timestamp.

There are a few challenges with your suggestion:

  • A MongoDB collection does not have an out-of-the-box timestamp field, so the query used to read the collection will need to be defined in the river settings.
  • The collection will need to be locked during the initial import, maybe using [1]. Is that acceptable?
  • There is already an initial timestamp setting, but it would need to be dynamically calculated based on the end of the initial import.

[1] - http://docs.mongodb.org/manual/reference/method/db.fsyncLock/
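For reference, [1] from the mongo shell would look like this (note it blocks writes to the whole server, not just the one collection):

    db.fsyncLock();     // flush pending writes and block new ones
    // ... run the initial import of the collection here ...
    db.fsyncUnlock();   // release the lock once the import completes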

Thanks for the feedback. Could you clarify? I'm not sure that those things are true. E.g. why would the collection need to be locked? Yes, copying without locking could result in an inconsistent state, but then once the oplog is applied wouldn't that fix it?

The main reason for #47 is to synchronize data that is not available in oplog.rs.
So I am thinking of the following scenario:

  1. The collection has been created and populated before the replica set is set up (so at this point there is no replica set and no oplog yet).
  2. Run the initial import.
  3. Once the initial import has completed, set up the replica set.
  4. The river will then import data from oplog.rs.

In step 1, we will need to ensure no new data is written to the collection.

If you import without locking you could get an inconsistent state / inconsistent data, and there is no guarantee that it will be fixed when the oplog is processed.

What if we just make it so that you can only run the initial import on a replica set?
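That precondition would be easy to check; for example, from the mongo shell (a sketch; rs.status() returns ok: 0 when the node is not running as part of a replica set):

    var status = rs.status();
    if (status.ok !== 1) {
        // no replica set means no oplog.rs for the river to tail
        throw new Error("initial import requires a replica set: " + status.errmsg);
    }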

nfx commented

+1

I have posted the question here [1]. Let's see what the MongoDB experts say...

[1] - https://groups.google.com/forum/#!topic/mongodb-user/sOKlhD_E2ns

Response from William Zola at 10gen for @richardwilly98's question:

The way that MongoDB does initial sync internally is:
 - Record the latest timestamp (call it time 'T') from the oplog
 - Copy all of the documents from the collection
 - Apply all of the operations in the oplog starting from time 'T'

You could use the same strategy in your plugin.
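In mongo shell terms, that strategy would look roughly like this (a sketch only; indexIntoEs and applyToEs are hypothetical placeholders for the river's indexing logic):

    // 1. record the latest oplog timestamp (time 'T')
    var oplog = db.getSiblingDB("local").oplog.rs;
    var startTs = oplog.find().sort({ $natural: -1 }).limit(1).next().ts;

    // 2. copy all of the documents from the collection
    db.mycollection.find().forEach(indexIntoEs);

    // 3. apply all oplog operations starting from time 'T'
    oplog.find({ ts: { $gte: startTs } }).forEach(applyToEs);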

@richardwilly98 Awesome! I'm really excited about the ability to do an initial import!

One thing I'm not very sure about is how to handle the river being stopped and then started again during the initial import. We have to restart the initial import in that case. Should we drop the index and start the initial import again? I'm hesitant to drop an index though. Maybe we should just stop the river from doing anything and post a warning to the admin UI and logs that the index needs to be dropped?

@benmccann I agree dropping the index is not a good option.

We need a flag to indicate the initial import is in progress. If the flag has not been cleared and the timestamp is null, then stop the river and send a warning as you suggested.
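In pseudocode, that startup check might look like this (a sketch; the helper names and status fields are hypothetical, not the river's actual API):

    var status = getRiverStatus();  // hypothetical: reads the river's stored state
    if (status.initialImportInProgress && status.lastTimestamp === null) {
        stopRiver();                // hypothetical: halts processing
        log("WARN: initial import was interrupted; drop the index and " +
            "recreate the river to restart the import");
    }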

@cggaurav this is already implemented and released

this issue should probably be closed