logstash-plugins/logstash-input-elasticsearch

Plugin crashes when search returns docs containing invalid UTF-8 byte sequences.


This is a rephrasing of elastic/logstash#10516, opened by @matteogrolla on 2019-03-06.

I have a document in Elasticsearch that crashes the Logstash elasticsearch input plugin when it tries to read it; the document is included at the end of this message, along with the error log reported by Logstash.
I'm using Logstash to migrate documents from Elasticsearch to MongoDB, but when Logstash encounters the problematic document, the input plugin is restarted and starts over from the beginning.
I'd like at least to be able to skip the documents that can't be parsed, but I can't find a way to do so.
Can you help me?

P.S. If I create a new document in ES using curl and the textual representation of the problematic document given here, I don't get a parse error from Logstash on this new document.
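For context, a minimal illustration of the underlying condition (a sketch; the byte `\xE9` is just an example of a Latin-1 byte that is not valid UTF-8 — it is not taken from the actual document):

```ruby
# A single Latin-1 byte such as \xE9 is not a valid UTF-8 sequence,
# which is the condition the JSON parser trips over.
raw  = "caf\xE9".b                          # raw bytes, as read off the wire
utf8 = raw.force_encoding(Encoding::UTF_8)  # relabel without transcoding
puts utf8.valid_encoding?                   # prints false
puts utf8.scrub                             # the invalid byte becomes U+FFFD
```

This also explains the P.S.: any textual round-trip (copy/paste, re-indexing via a shell) tends to replace or drop the offending bytes, so the re-created document no longer reproduces the error.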

-------Error log-------

[2019-03-06T12:43:47,696][ERROR][logstash.pipeline        ] A plugin had an unrecoverable error. Will restart this plugin.
  Pipeline_id:main
  Plugin: <LogStash::Inputs::Elasticsearch index=>"fulltextmg_33", id=>"3d2d80a0e02debd1b54d39b3e6b88b54a1ea45fe2c8ae8ddf2b0ec42e080ff61", hosts=>["pbauci01"], query=>"{ \"query\": { \"term\": { \"_id\": \"http://www.facebook.com/114701051917886_2073179089403396\"} } }", enable_metric=>true, codec=><LogStash::Codecs::JSON id=>"json_149580ae-80e8-4f8f-8728-66db3890cf1f", enable_metric=>true, charset=>"UTF-8">, size=>1000, scroll=>"1m", docinfo=>false, docinfo_target=>"@metadata", docinfo_fields=>["_index", "_type", "_id"], ssl=>false>
  Error: invalid byte sequence in UTF-8
  Exception: MultiJson::ParseError
  Stack: /opt/logstash-6.6.1/vendor/bundle/jruby/2.3.0/gems/jrjackson-0.4.6-java/lib/jrjackson/jrjackson.rb:91:in `is_time_string?'
/opt/logstash-6.6.1/vendor/bundle/jruby/2.3.0/gems/jrjackson-0.4.6-java/lib/jrjackson/jrjackson.rb:36:in `load'
/opt/logstash-6.6.1/vendor/bundle/jruby/2.3.0/gems/multi_json-1.13.1/lib/multi_json/adapters/jr_jackson.rb:11:in `load'
/opt/logstash-6.6.1/vendor/bundle/jruby/2.3.0/gems/multi_json-1.13.1/lib/multi_json/adapter.rb:21:in `load'
/opt/logstash-6.6.1/vendor/bundle/jruby/2.3.0/gems/multi_json-1.13.1/lib/multi_json.rb:122:in `load'
/opt/logstash-6.6.1/vendor/bundle/jruby/2.3.0/gems/elasticsearch-transport-5.0.5/lib/elasticsearch/transport/transport/serializer/multi_json.rb:24:in `load'
/opt/logstash-6.6.1/vendor/bundle/jruby/2.3.0/gems/elasticsearch-transport-5.0.5/lib/elasticsearch/transport/transport/base.rb:322:in `perform_request'
/opt/logstash-6.6.1/vendor/bundle/jruby/2.3.0/gems/elasticsearch-transport-5.0.5/lib/elasticsearch/transport/transport/http/faraday.rb:20:in `perform_request'
/opt/logstash-6.6.1/vendor/bundle/jruby/2.3.0/gems/elasticsearch-transport-5.0.5/lib/elasticsearch/transport/client.rb:131:in `perform_request'
/opt/logstash-6.6.1/vendor/bundle/jruby/2.3.0/gems/elasticsearch-api-5.0.5/lib/elasticsearch/api/actions/search.rb:183:in `search'
/opt/logstash-6.6.1/vendor/bundle/jruby/2.3.0/gems/logstash-input-elasticsearch-4.2.1/lib/logstash/inputs/elasticsearch.rb:200:in `do_run'
/opt/logstash-6.6.1/vendor/bundle/jruby/2.3.0/gems/logstash-input-elasticsearch-4.2.1/lib/logstash/inputs/elasticsearch.rb:188:in `run'
/opt/logstash-6.6.1/logstash-core/lib/logstash/pipeline.rb:426:in `inputworker'
/opt/logstash-6.6.1/logstash-core/lib/logstash/pipeline.rb:420:in `block in start_input'

[...]

Unfortunately, the document pasted into the original bug report is valid UTF-8, likely because the GitHub form auto-coerced the pasted bytes to UTF-8.

@matteogrolla would you be able to save the raw response to a file and upload that file as-is, without any character-encoding conversion?
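One way to verify before uploading that the bad bytes actually survived in the saved file (a sketch; the file path and the demo payload are placeholders, not the real document):

```ruby
require 'tempfile'

def valid_utf8_file?(path)
  # Read the raw bytes with no transcoding, then ask Ruby whether
  # they form valid UTF-8.
  File.binread(path).force_encoding(Encoding::UTF_8).valid_encoding?
end

# Demo with a throwaway file containing one invalid byte:
Tempfile.create('doc') do |f|
  f.binmode
  f.write("{\"text\":\"caf\xE9\"}".b)
  f.flush
  puts valid_utf8_file?(f.path)   # prints false: the bad byte survived
end
```

If this returns true for the saved response, the invalid sequence was lost somewhere along the way and the file will not reproduce the crash.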


Potentially related:

Hi Ry,
I don't understand why you stripped my workaround when you moved the issue.
At a minimum, it clearly shows where the problem comes from.
The workaround isn't the proper solution, since it modifies jrjackson, but it works and could help those who need an urgent fix.
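The stripped workaround itself is not recoverable from this thread, but for anyone needing an urgent stopgap, one lossy approach (a sketch, not the original jrjackson patch; `load_lossy` is a hypothetical helper) is to scrub invalid byte sequences to U+FFFD before JSON parsing:

```ruby
require 'json'

# Lossy stopgap (not the original patch): replace invalid byte
# sequences with the U+FFFD replacement character so that parsing
# no longer raises.
def load_lossy(body)
  JSON.parse(body.dup.force_encoding(Encoding::UTF_8).scrub)
end

doc = load_lossy(%({"text":"caf\xE9"}).b)
puts doc["text"]   # the \xE9 byte has become the replacement character
```

The trade-off is that the original bytes are irrecoverably replaced, which is acceptable for a migration that would otherwise loop forever, but not a general fix.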

@matteogrolla there was no malicious intent on my part; the issue was initially filed in the wrong place, and I attempted to move it and link to it where it would be better addressed, but failed to copy the commentary along with it.

We are still waiting on a follow-up from you with a document that exhibits the symptom:

Unfortunately, the document pasted into the original bug report is valid UTF-8, likely because the GitHub form auto-coerced the pasted bytes to UTF-8.

@matteogrolla would you be able to save the raw response to a file and upload that file as-is, without any character-encoding conversion?

I've downloaded the content with

curl -X POST 'http://pbauci01:9200/fulltext_33/_search' -H 'cache-control: no-cache' -d '{
  "query": {
    "term": { "url": "http://www.facebook.com/114701051917886_2073179089403396" }
  }
}' > logstash_problematic_doc.json

and edited the file to keep only the _source field value

logstash_problematic_doc.txt