richardwilly98/elasticsearch-river-mongodb

Too many indexed docs - drop database

lukaszpy opened this issue · 13 comments

my versions:

  • elasticsearch 0.90.2
    plugins:
    -- mapper attachments 1.4.0
    -- river jdbc 2.2.1 (driver: postgresql-9.2-1002.jdbc4)
    -- elasticsearch-river-mongodb-1.6.11

Problem:
When I create an index for the Postgres db everything works fine; in the head plugin for ES I see:
structure - name of index
size: 1mb (1mb)
docs: 3587 (3587)
But when I create an index on the mongo db I get:
type - index name
size: 642.6kb (642.6kb)
docs: 10495 (10495)

But the docs field shows the wrong number of docs, because in my db I have only 3936 docs.
This problem exists with every index on the mongo db - the count of indexed docs doesn't match the count of docs in the db.

I'm creating the index with (it's a Windows version):
curl -XPUT "http://localhost:9200/_river/body/_meta" -d "{ \"type\": \"mongodb\", \"mongodb\": { \"servers\": [{ \"host\": \"localhost\", \"port\": \"27017\" }], \"options\": { \"secondary_read_preference\": true }, \"credentials\": [{ \"db\": \"fis-bps\", \"user\": \"guest\", \"password\": \"guest\" }], \"db\": \"fis-bps\", \"collection\": \"body\", \"gridfs\": \"false\" }, \"index\": { \"name\": \"body\", \"throttle_size\": 2000 } }"

This problem only exists on Windows. On Ubuntu the problem doesn't exist.

I noticed one more thing: I dump my DB and remove all data for the dbs (from the data directory for the primary and the slave). I create the databases, create the index, and then restore the database from the dump.
Now I have the correct count of indexed docs.

It looks like elasticsearch reads deeper into mongo db than just the collection: a normal drop of the DB and recreating it still leaves some data behind that elasticsearch uses to build the index.

The river gets the data from oplog.rs, not directly from the collection.
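
For example (assuming the "fis-bps" db and "body" collection names from your river definition), you can see what the river actually replays by querying the oplog on the replica-set member it reads from:

    use local
    // last few oplog entries for the indexed namespace
    db.oplog.rs.find({ "ns": "fis-bps.body" }).sort({ "$natural": -1 }).limit(5)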

Did you by any chance drop the collection?

I dropped the collection oplog.rs on the PRIMARY but not on the secondary (replica). Is that a mistake?

In that case you should use the options/drop_collection parameter (for more details see [1]).

[1] - https://github.com/richardwilly98/elasticsearch-river-mongodb/wiki#configuration
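
For example, a river definition with that option could look like this (just a sketch, using Unix-style quoting and leaving out the servers/credentials blocks from your original command):

    curl -XPUT "http://localhost:9200/_river/body/_meta" -d '{
      "type": "mongodb",
      "mongodb": {
        "db": "fis-bps",
        "collection": "body",
        "options": { "drop_collection": true }
      },
      "index": { "name": "body" }
    }'

With drop_collection enabled the river should also process the collection drop it sees in the oplog instead of keeping the old documents indexed.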

Ok, but I'm not sure, so correct me if I'm wrong.
The drop_collection flag doesn't work when the whole db is dropped, yes?

So if I drop the whole db, recreate it, and then restore the data, I should get more indexed docs than exist in my DB?
Because the collection is not dropped (the whole DB is dropped), and ES will see new docs when reading the oplog.

I just checked that case on my Windows workstation.
I think it's a bug.

options/drop_collection will work with a collection drop - probably not with a database drop.

So I think it's a bug and should be corrected, because we end up with incoherent index states.

Can you please clarify?
Which MongoDB command do you use to drop the database or collection?
I believe dropping a database or collection does not usually apply to a production environment.

To drop the db:

  1. use test-db
  2. db.dropDatabase()

To drop a collection:

  1. use test-db
  2. db.getCollection("test-collection").drop() (getCollection is needed because of the hyphen in the name)

I believe that bug could exist in a production environment.
For example, we have:
machine1 and machine2 (with the same application; the databases are duplicated too - each machine has its own mongodb)
For some reason we want to move the data from machine1 to machine2.
We connect to machine1 and make a dump of the database. Then we go to machine2, drop the whole db, and restore the db dumped from machine1.
The index state on machine2 will be incoherent, because ES will get old data from the oplog (but the collections are empty) and new data from the restore.

@lukaszpy

I will create a new feature request to support drop_database.

{
        "ts" : Timestamp(1380107544, 1),
        "h" : NumberLong("4469577380503976492"),
        "v" : 2,
        "op" : "c",
        "ns" : "mydb97.$cmd",
        "o" : {
                "dropDatabase" : 1
        }
}
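
Entries like the one above can be found in the oplog with a query along these lines (a sketch, run against the local database):

    use local
    // find database-drop commands recorded in the oplog
    db.oplog.rs.find({ "op": "c", "o.dropDatabase": 1 })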

@lukaszpy I will postpone this feature to release 1.7.2

  • The coming release 1.7.1 uses a different technique for the initial import, using the collection data (see #47).
  • That could be a good workaround for this issue: before restoring the data on machine2, just drop the index and the river in ES, and recreate the river once the restore has completed (see the sketch below).
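
A sketch of that workaround, assuming the "body" river/index names from the definition above:

    # on machine2, before running the restore
    curl -XDELETE "http://localhost:9200/_river/body"
    curl -XDELETE "http://localhost:9200/body"
    # ... restore the MongoDB dump ...
    # then recreate the river with the same _meta document as before
    curl -XPUT "http://localhost:9200/_river/body/_meta" -d '...'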

Please provide feedback.

Is this way like in mongodb? After large node downtime.

@mahnunchik can you please clarify?