Adding a field to the JSON of a PDF in MongoDB => NullPointerException for the river

Question

antoinecarton opened this issue 11 years ago · 5 comments

Hi,

First of all, here is the exception from ElasticSearch:

Exception in thread "elasticsearch[Nathaniel Richards][mongodb_river_slurper][T#1]" java.lang.NullPointerException
at org.elasticsearch.river.mongodb.MongoDBRiver$Slurper.processOplogEntry(MongoDBRiver.java:1074)
at org.elasticsearch.river.mongodb.MongoDBRiver$Slurper.run(MongoDBRiver.java:986)
at java.lang.Thread.run(Thread.java:679)

Here is my configuration:
River: 1.6.9
ElasticSearch: 0.90.1
MongoDB: 2.4.4

Configuration used for MongoDB:
http://docs.mongodb.org/manual/tutorial/deploy-replica-set/, section "Deploy a Development or Test Replica Set"

Next, in a console:

mongo --port 27017
use pdf_database5

In a second console, I add a PDF file:

mongofiles --host localhost:27017 --db pdf_database5 --collection fs --type applicaton/pdf put /PATH_TO_A_PDF

After that, I create a MongoDB river for ElasticSearch:

curl -XPUT "${host}/_river/mongodb/_meta" -d '{
  "type": "mongodb",
  "mongodb": {
    "db": "pdf_database5",
    "collection": "fs",
    "gridfs": true
  },
  "index": {
    "name": "mongoindex",
    "type": "files"
  }
}'

Up to this point everything is OK: my PDF file is correctly indexed and full-text search works.

However, the problem appears once I add a field to the JSON document of the PDF file. In the MongoDB console, I first look up the file:

db.fs.files.find({});

(For instance, 51c05f881a13d534df7463c4 is the ID of my PDF.)

I then add a field "titleDoc" to the object with the ID 51c05f881a13d534df7463c4 using the following command:

db.fs.files.update({"_id": ObjectId("51c05f881a13d534df7463c4")}, {$set: {"titleDoc":"MY TITLE DOC"}})

The exception then appears in the ElasticSearch log. I tried editing the _mapping in ElasticSearch, but the error persists.

Maybe the error comes from something I forgot to configure so that the river maps new fields on raw files such as PDFs in MongoDB.

Thanks in advance,

Antoine

Answer 1 · 2013-06-24T00:19:26.000Z

Hi Antoine,

Additional GridFS metadata should be stored in the metadata attribute (see [1]).

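// Fetch the GridFS file document first (same _id as in the result shown below)
var doc = db.fs.files.findOne({"_id": ObjectId("51c78a054ce10426a81a3e27")})
doc.metadata = {}
doc.metadata.title = "woww"
db.fs.files.save(doc)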
{
        "_id" : ObjectId("51c78a054ce10426a81a3e27"),
        "filename" : "test-document.pdf",
        "chunkSize" : 262144,
        "uploadDate" : ISODate("2013-06-23T23:51:33.229Z"),
        "md5" : "947090a3e9cac07c13adabb25b9a3fa9",
        "length" : 50573,
        "contentType" : "applicaton/pdf",
        "title" : "test",
        "metadata" : {
                "title" : "woww"
        }
}

Does it help?

[1] - http://docs.mongodb.org/manual/reference/gridfs/#gridfs-files-collection

Thanks,
Richard.

Answer 2 · 2013-06-24T08:01:20.000Z

Hi,

Thank you for your answer.

You are right about the metadata attribute. However, I had already tried using it, and I still get the problem with the following steps:

My initial object:

{ "_id" : ObjectId("51c7f5dc71f6549c212cae37"), "filename" : "/home/acarton/Téléchargements/Cairngorm.pdf", "chunkSize" : 262144, "uploadDate" : ISODate("2013-06-24T07:31:41.611Z"), "md5" : "2d7d1f636a4e07b675eebb873330205e", "length" : 661649, "contentType" : "applicaton/pdf" }

The update command:

db.fs.files.update({"_id": ObjectId("51c7f5dc71f6549c212cae37")}, {$set: {"metadata.titleDoc":"Framework CAIRNGORM"}})

And the final object:

{ "_id" : ObjectId("51c7f5dc71f6549c212cae37"), "chunkSize" : 262144, "contentType" : "applicaton/pdf", "filename" : "/home/acarton/Téléchargements/Cairngorm.pdf", "length" : 661649, "md5" : "2d7d1f636a4e07b675eebb873330205e", "metadata" : { "titleDoc" : "Framework CAIRNGORM" }, "uploadDate" : ISODate("2013-06-24T07:31:41.611Z") }

I still get the NullPointerException with this update command.

However, the steps you gave work fine. What is the difference between the "update" and "save" commands?

Thank you in advance,

Antoine

Answer 3 · 2013-06-24T10:34:01.000Z

Hi,

The oplog entry is different for a $set operation.

The entry for a "save" operation is:

{
        "ts" : {
                "t" : 1372032972,
                "i" : 1
        },
        "h" : NumberLong("2162081457563127592"),
        "v" : 2,
        "op" : "u",
        "ns" : "mydb91.fs.files",
        "o2" : {
                "_id" : ObjectId("51c78a054ce10426a81a3e27")
        },
        "o" : {
                "_id" : ObjectId("51c78a054ce10426a81a3e27"),
                "filename" : "test-document.pdf",
                "chunkSize" : 262144,
                "uploadDate" : ISODate("2013-06-23T23:51:33.229Z"),
                "md5" : "947090a3e9cac07c13adabb25b9a3fa9",
                "length" : 50573,
                "contentType" : "applicaton/pdf",
                "title" : "test",
                "metadata" : {
                        "title" : "woww"
                }
        }
}

For a $set operation:

{
        "ts" : {
                "t" : 1372065805,
                "i" : 1
        },
        "h" : NumberLong("8302104313737943305"),
        "v" : 2,
        "op" : "u",
        "ns" : "mydb91.fs.files",
        "o2" : {
                "_id" : ObjectId("51c78a07ae251a257e0e4d3e")
        },
        "o" : {
                "$set" : {
                        "metadata.titleDoc" : "test91"
                }
        }
}

The object id was extracted from "o", but with $set it is only available in "o2". I will fix the code soon.
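
To illustrate the difference between the two entries, here is a minimal sketch in plain JavaScript (runnable in Node.js) of an id lookup that falls back to "o2". It is not the river's actual Java code: the function name extractObjectId is hypothetical, the entries are trimmed-down copies of the two shown above, and the ids are plain strings instead of ObjectId values so that the snippet runs outside the mongo shell.

```javascript
// Hypothetical helper (not the river's real code): return the _id of the
// document that an oplog entry refers to, or null if none can be found.
function extractObjectId(entry) {
  const o = entry.o || {};
  // Full-document updates (e.g. save) carry the _id inside "o".
  if (o._id !== undefined && o._id !== null) {
    return o._id;
  }
  // Modifier updates such as $set only carry the operators in "o";
  // the target document is identified by "o2".
  const o2 = entry.o2 || {};
  if (o2._id !== undefined && o2._id !== null) {
    return o2._id;
  }
  return null; // the caller decides how to handle an entry without an id
}

// Trimmed-down copies of the two oplog entries from this thread.
const saveEntry = {
  op: "u",
  o2: { _id: "51c78a054ce10426a81a3e27" },
  o: { _id: "51c78a054ce10426a81a3e27", filename: "test-document.pdf" }
};
const setEntry = {
  op: "u",
  o2: { _id: "51c78a07ae251a257e0e4d3e" },
  o: { $set: { "metadata.titleDoc": "test91" } }
};

console.log(extractObjectId(saveEntry)); // prints 51c78a054ce10426a81a3e27
console.log(extractObjectId(setEntry));  // prints 51c78a07ae251a257e0e4d3e
```

Reading only "o" would yield no id for the second entry, which matches the behaviour reported in this issue.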

Answer 4 · 2013-06-24T12:13:03.000Z

Perfect! Thank you!

Answer 5 · 2013-07-16T13:12:45.000Z

The fix is available in release 1.6.11.

Thanks,
Richard.