richardwilly98/elasticsearch-river-mongodb

Adding a field to the JSON of a PDF in MongoDB => NullPointerException for the river

antoinecarton opened this issue · 5 comments

Hi,

First of all, here is the exception from ElasticSearch:

Exception in thread "elasticsearch[Nathaniel Richards][mongodb_river_slurper][T#1]" java.lang.NullPointerException
at org.elasticsearch.river.mongodb.MongoDBRiver$Slurper.processOplogEntry(MongoDBRiver.java:1074)
at org.elasticsearch.river.mongodb.MongoDBRiver$Slurper.run(MongoDBRiver.java:986)
at java.lang.Thread.run(Thread.java:679)

Here is my configuration:
River : 1.6.9
ElasticSearch : 0.90.1
MongoDB : 2.4.4

Configuration used for MongoDB:
http://docs.mongodb.org/manual/tutorial/deploy-replica-set/, section "Deploy a Development or Test Replica Set"

Next, in a console:

mongo --port 27017
use pdf_database5

In a second console, I add a PDF file:

mongofiles --host localhost:27017 --db pdf_database5 --collection fs --type applicaton/pdf put /PATH_TO_A_PDF

After that, I create a MongoDB river for ElasticSearch:

curl -XPUT "${host}/_river/mongodb/_meta" -d '{
"type": "mongodb",
"mongodb": {
"db": "pdf_database5",
"collection": "fs",
"gridfs": true
},
"index": {
"name": "mongoindex",
"type": "files"
}
}'

Up to this point everything works: the PDF file is correctly indexed and full-text search returns it.

However, the problem appears once I add a field to the JSON document of the PDF file. First, in the MongoDB console, I list the files:

db.fs.files.find({});

(for instance, 51c05f881a13d534df7463c4 is the ID of my PDF).

I then add a "titleDoc" field to the object with ID 51c05f881a13d534df7463c4 using the following command:

db.fs.files.update({"_id": ObjectId("51c05f881a13d534df7463c4")}, {$set: {"titleDoc":"MY TITLE DOC"}})

The exception then appears in the ElasticSearch log. I tried editing the _mapping in ElasticSearch, but the error persists.

Maybe I forgot some configuration that the river needs in order to map new fields on raw files such as PDFs in MongoDB.

Thanks in advance,

Antoine

Hi Antoine,

Additional GridFS metadata should be stored in the metadata attribute (see [1]).

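// Load the file document first (the _id is the one from the output below)
var doc = db.fs.files.findOne({"_id": ObjectId("51c78a054ce10426a81a3e27")})
doc.metadata = {}
doc.metadata.title = "woww"
db.fs.files.save(doc)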
{
        "_id" : ObjectId("51c78a054ce10426a81a3e27"),
        "filename" : "test-document.pdf",
        "chunkSize" : 262144,
        "uploadDate" : ISODate("2013-06-23T23:51:33.229Z"),
        "md5" : "947090a3e9cac07c13adabb25b9a3fa9",
        "length" : 50573,
        "contentType" : "applicaton/pdf",
        "title" : "test",
        "metadata" : {
                "title" : "woww"
        }
}

Does it help?

[1] - http://docs.mongodb.org/manual/reference/gridfs/#gridfs-files-collection

Thanks,
Richard.

Hi,

Thank you for your answer.

You are right about the metadata attribute. However, I had already tried it, and I still have the problem with the following steps:

My initial object:

{ "_id" : ObjectId("51c7f5dc71f6549c212cae37"), "filename" : "/home/acarton/Téléchargements/Cairngorm.pdf", "chunkSize" : 262144, "uploadDate" : ISODate("2013-06-24T07:31:41.611Z"), "md5" : "2d7d1f636a4e07b675eebb873330205e", "length" : 661649, "contentType" : "applicaton/pdf" }

The update command:

db.fs.files.update({"_id": ObjectId("51c7f5dc71f6549c212cae37")}, {$set: {"metadata.titleDoc":"Framework CAIRNGORM"}})

And the final object:

{ "_id" : ObjectId("51c7f5dc71f6549c212cae37"), "chunkSize" : 262144, "contentType" : "applicaton/pdf", "filename" : "/home/acarton/Téléchargements/Cairngorm.pdf", "length" : 661649, "md5" : "2d7d1f636a4e07b675eebb873330205e", "metadata" : { "titleDoc" : "Framework CAIRNGORM" }, "uploadDate" : ISODate("2013-06-24T07:31:41.611Z") }

I still have the NullPointerException with this update command.
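
For reference, the change from the initial to the final object can be modelled in plain JavaScript (runnable with Node.js, no MongoDB needed). This is only a simplified sketch of how a $set on a dotted path creates the nested "metadata" sub-document. The helper name applySet is hypothetical, it is not MongoDB's implementation, and it ignores arrays and all other update operators.

```javascript
// Hypothetical, simplified model of {$set: {...}} applied to a plain object.
function applySet(doc, fields) {
  for (const [path, value] of Object.entries(fields)) {
    const keys = path.split(".");
    let target = doc;
    // Walk every intermediate sub-document, creating it if it is missing.
    for (const key of keys.slice(0, -1)) {
      if (target[key] === undefined) {
        target[key] = {};
      } else if (typeof target[key] !== "object" || target[key] === null) {
        throw new Error("Cannot create field in non-document: " + key);
      }
      target = target[key];
    }
    // Assign the value at the last path segment.
    target[keys[keys.length - 1]] = value;
  }
  return doc;
}

// Reduced copy of the initial object shown above.
const file = {
  filename: "/home/acarton/Téléchargements/Cairngorm.pdf",
  contentType: "applicaton/pdf",
};
applySet(file, { "metadata.titleDoc": "Framework CAIRNGORM" });
console.log(JSON.stringify(file.metadata));
```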

However, the steps you gave work fine. What is the difference between the "update" and "save" commands?

Thank you in advance,

Antoine

Hi,

The oplog entry is different for a $set operation.

The entry for the "save" operation is:

{
        "ts" : {
                "t" : 1372032972,
                "i" : 1
        },
        "h" : NumberLong("2162081457563127592"),
        "v" : 2,
        "op" : "u",
        "ns" : "mydb91.fs.files",
        "o2" : {
                "_id" : ObjectId("51c78a054ce10426a81a3e27")
        },
        "o" : {
                "_id" : ObjectId("51c78a054ce10426a81a3e27"),
                "filename" : "test-document.pdf",
                "chunkSize" : 262144,
                "uploadDate" : ISODate("2013-06-23T23:51:33.229Z"),
                "md5" : "947090a3e9cac07c13adabb25b9a3fa9",
                "length" : 50573,
                "contentType" : "applicaton/pdf",
                "title" : "test",
                "metadata" : {
                        "title" : "woww"
                }
        }
}

For the $set operation:

{
        "ts" : {
                "t" : 1372065805,
                "i" : 1
        },
        "h" : NumberLong("8302104313737943305"),
        "v" : 2,
        "op" : "u",
        "ns" : "mydb91.fs.files",
        "o2" : {
                "_id" : ObjectId("51c78a07ae251a257e0e4d3e")
        },
        "o" : {
                "$set" : {
                        "metadata.titleDoc" : "test91"
                }
        }
}

The object id was extracted from "o", but with $set it is only available in "o2". I will fix the code soon.
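
To illustrate the difference, here is a minimal sketch in plain JavaScript (runnable with Node.js, no MongoDB needed) of reading the document id from an oplog entry. It is not the river's actual code, which is written in Java. The function name extractDocumentId and the use of plain strings in place of ObjectId values are assumptions made for the example. The idea is to prefer "o2._id" for update entries and to fall back to "o._id" otherwise.

```javascript
// Hypothetical helper, not part of the river: returns the _id of the
// document an oplog entry refers to, or null if none can be found.
function extractDocumentId(entry) {
  // Update entries ("op": "u") carry the target _id in "o2".
  if (entry.op === "u" && entry.o2 && entry.o2._id !== undefined) {
    return entry.o2._id;
  }
  // Other entries (for example inserts) carry the _id in "o".
  if (entry.o && entry.o._id !== undefined) {
    return entry.o._id;
  }
  return null;
}

// Reduced copies of the two entries above (ObjectId values shown as strings).
const saveEntry = {
  op: "u",
  ns: "mydb91.fs.files",
  o2: { _id: "51c78a054ce10426a81a3e27" },
  o: { _id: "51c78a054ce10426a81a3e27", filename: "test-document.pdf" },
};
const setEntry = {
  op: "u",
  ns: "mydb91.fs.files",
  o2: { _id: "51c78a07ae251a257e0e4d3e" },
  o: { $set: { "metadata.titleDoc": "test91" } },
};

console.log(extractDocumentId(saveEntry));
console.log(extractDocumentId(setEntry));
// For $set the id is absent from "o", so reading only "o" finds nothing.
console.log(setEntry.o._id === undefined);
```

Reading the id from "o2" first handles both shapes of update entry, since a full-document replacement still repeats the same _id there.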

Perfect ! Thank you !

Fix is available in release 1.6.11.

Thanks,
Richard.