richardwilly98/elasticsearch-river-mongodb

How does it handle Chinese in MongoDB?

fzxu opened this issue · 4 comments

fzxu commented

My MongoDB data is UTF-8 and stores ordinary documents with Chinese content. But when the river indexes that content into Elasticsearch, I get a parse error (it seems every Chinese character arrives as '?'):

[2013-06-23 04:51:44,512][DEBUG][action.bulk ] [Omen] [test][1] failed to execute bulk item (index) index {[test][question][51c6d1f14b90c3be18174882], source[{"_id":"51c6d1f14b90c3be18174882","_class":"me.test.entities.Question","title":"??????????????????","content":"?????????????????????????????? ????????75??????????(Morgan Freeman)??????????????????(Michael Caine)??????????????????...http://t.cn/zHtoIH1","answers":[{"_id":"51c6d1f14b90c3be18174881","content":"@?????","createdAt":"2013-05-25T10:40:20.000Z","updatedAt":"2013-06-23T10:46:09.890Z","votesCount":0,"source":{"providerId":"weibo","referenceId":"3581914474562346"},"createdBy":"{ "$ref" : "users", "$id" : "51c6d1f14b90c3be1817487d" }"}],"createdAt":"2013-05-25T09:20:02.000Z","updatedAt":"2013-06-23T10:46:09.888Z","tags":[],"viewsCount":0,"votesCount":0,"source":{"providerId":"weibo","referenceId":"3581894262156440"},"createdBy":"{ "$ref" : "users", "$id" : "51c6d1f14b90c3be18174880" }"}]}
org.elasticsearch.index.mapper.MapperParsingException: failed to parse
at org.elasticsearch.index.mapper.DocumentMapper.parse(DocumentMapper.java:553)
at org.elasticsearch.index.mapper.DocumentMapper.parse(DocumentMapper.java:450)
at org.elasticsearch.index.shard.service.InternalIndexShard.prepareIndex(InternalIndexShard.java:327)
at org.elasticsearch.action.bulk.TransportShardBulkAction.shardIndexOperation(TransportShardBulkAction.java:381)
at org.elasticsearch.action.bulk.TransportShardBulkAction.shardOperationOnPrimary(TransportShardBulkAction.java:155)
at org.elasticsearch.action.support.replication.TransportShardReplicationOperationAction$AsyncShardOperationAction.performOnPrimary(TransportShardReplicationOperationAction.java:532)
at org.elasticsearch.action.support.replication.TransportShardReplicationOperationAction$AsyncShardOperationAction$1.run(TransportShardReplicationOperationAction.java:430)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
at java.lang.Thread.run(Thread.java:722)
Caused by: org.elasticsearch.common.jackson.core.JsonParseException: Failed to decode VALUE_STRING as base64 (MIME-NO-LINEFEEDS): Illegal character '?' (code 0xe3) in base64 content
at [Source: [B@13c519a5; line: 1, column: 152]
at org.elasticsearch.common.jackson.core.JsonParser._constructError(JsonParser.java:1369)
at org.elasticsearch.common.jackson.core.json.UTF8StreamJsonParser.getBinaryValue(UTF8StreamJsonParser.java:428)
at org.elasticsearch.common.jackson.core.JsonParser.getBinaryValue(JsonParser.java:1048)
at org.elasticsearch.common.xcontent.json.JsonXContentParser.binaryValue(JsonXContentParser.java:183)
at org.elasticsearch.index.mapper.attachment.AttachmentMapper.parse(AttachmentMapper.java:276)
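
A note on reading the trace above: the exception is raised from AttachmentMapper.parse via getBinaryValue, so the parser was trying to base64-decode a string value, and the reported code 0xe3 is a lead byte of a three-byte UTF-8 sequence (the U+3000 to U+3FFF range, which includes CJK punctuation such as 【). The sketch below only illustrates those two facts. The sample text and the helper name is_valid_base64 are hypothetical, since the real document content cannot be recovered from the log, and this is one plausible reading of the error rather than a confirmed diagnosis.

```python
import base64
import binascii

# Hypothetical sample text; the original Chinese content is not recoverable from the log.
sample = "【测试】"

# Encode as UTF-8, the encoding the reporter says MongoDB holds.
utf8_bytes = sample.encode("utf-8")
first_byte = utf8_bytes[0]  # lead byte of U+3010 in UTF-8


def is_valid_base64(data: bytes) -> bool:
    """Return True if data consists only of valid base64 (strict check)."""
    try:
        base64.b64decode(data, validate=True)
        return True
    except binascii.Error:
        return False


# The text itself is intact: it survives a UTF-8 round trip.
round_trip_ok = utf8_bytes.decode("utf-8") == sample

# Raw UTF-8 text is not base64, so a strict decoder rejects it.
raw_is_base64 = is_valid_base64(utf8_bytes)

# A lossy ASCII rendering replaces each character with '?',
# which is one way '?' can show up in a log line.
ascii_view = sample.encode("ascii", errors="replace")

print(hex(first_byte), round_trip_ok, raw_is_base64, ascii_view)
```

If this reading is right, the '?' characters in the log line may be a display artifact, and the failure itself would come from a field being treated as a base64 attachment rather than from corrupted text.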

Hi,

At this point I am not sure where the encoding issue comes from, but I will investigate.

Could you please try to index another document with trace logging enabled?

To enable it, edit $ES_HOME\config\logging.yml and add this line in the logger: section:
river.mongodb: TRACE

Then restart ES.
Please post the ES log file.

Thanks,
Richard.

As a note to help reproduce this, I found that my logging file was located at /etc/elasticsearch/logging.yml.

The test to reproduce this issue was broken. However, after fixing it (#130), I don't see any problem with Chinese characters. Can you try again with the latest versions of elasticsearch and elasticsearch-river-mongodb?

We have a test ensuring Chinese works now, so I think it's probably safe to close this issue. @arkxu, let us know if you still have any trouble.