janmg/logstash-input-azure_blob_storage

Log processing stopped while reading a corrupted blob

acidmind opened this issue · 4 comments

Hi,
we collect logs in .gz format, which is why we use the gzip_lines codec plugin.
This is our input config:

    azure_blob_storage {
        storageaccount => "myacc"
        access_key => "password"
        container => "logs"
        interval => 30
        file_head => '{"@t"'
        file_tail => '"}'
        registry_path => "logstash/registry.dat"
        codec => gzip_lines { charset => "ASCII-8BIT"}
    }

Log processing stopped and the following error occurred when Logstash tried to process a corrupted blob:

[2020-03-12T03:30:32,730][ERROR][logstash.javapipeline    ][main] A plugin had an unrecoverable error. Will restart this plugin.
  Pipeline_id:main
  Plugin: <LogStash::Inputs::AzureBlobStorage container=>"logs", codec=><LogStash::Codecs::GzipLines charset=>"ASCII-8BIT", id=>"ea144f5d-e64d-4341-b297-2dcfc7f1cf2d", enable_metric=>true>, file_head=>"{\"@t\"", storageaccount=>"myacc", access_key=><password>, file_tail=>"\"}", interval=>30, id=>"7a23ccd608a25d6b67053c55689b3080829ddf0bb25017crty50b27ebec539fc", registry_path=>"logstash/registry.dat", enable_metric=>true, logtype=>"raw", dns_suffix=>"core.windows.net", registry_create_policy=>"resume", debug_until=>0, path_filters=>["**/*"]>
  Error: Broken pipe - Unexpected end of ZLIB input stream
  Exception: Errno::EPIPE
  Stack: org/jruby/ext/zlib/JZlibRubyGzipReader.java:652:in `each'
org/jruby/ext/zlib/JZlibRubyGzipReader.java:662:in `each_line'
/usr/share/logstash/vendor/bundle/jruby/2.5.0/gems/logstash-codec-gzip_lines-3.0.4/lib/logstash/codecs/gzip_lines.rb:38:in `decode'
/usr/share/logstash/vendor/bundle/jruby/2.5.0/gems/logstash-input-azure_blob_storage-0.11.2/lib/logstash/inputs/azure_blob_storage.rb:227:in `block in run'
org/jruby/RubyHash.java:1428:in `each'
/usr/share/logstash/vendor/bundle/jruby/2.5.0/gems/logstash-input-azure_blob_storage-0.11.2/lib/logstash/inputs/azure_blob_storage.rb:201:in `run'
/usr/share/logstash/logstash-core/lib/logstash/java_pipeline.rb:328:in `inputworker'
/usr/share/logstash/logstash-core/lib/logstash/java_pipeline.rb:320:in `block in start_input'
[2020-03-12T03:33:27,090][ERROR][logstash.javapipeline    ][main] A plugin had an unrecoverable error. Will restart this plugin.
  Pipeline_id:main

I think in this situation the plugin should not raise an error but just skip this blob and move on to the next one.

janmg commented

This plugin supports the codecs json and line. Using gzip_lines is complex, because the plugin can read partial files while they are still being written to the blob blocks. For gzip you can't do that, hence the "unexpected end of ZLIB input stream" from the gzip_lines codec. So unless you can cut and paste the gzip headers to fix up the partial reads ... that would be an impressive achievement.
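The "unexpected end of ZLIB input stream" is inherent to gzip: a stream only decodes end-to-end, so a blob captured mid-write is truncated and decompression fails. A minimal standalone Ruby sketch (the blob contents are made up) reproduces the same failure:

```ruby
require 'zlib'
require 'stringio'

# Decompress a gzip byte string. A complete stream decodes fine;
# a truncated one raises on the missing tail of the deflate stream.
def gunzip(bytes)
  Zlib::GzipReader.new(StringIO.new(bytes)).read
end

full    = Zlib.gzip("line1\nline2\nline3\n")
partial = full.byteslice(0, full.bytesize / 2)  # simulates a half-written blob

gunzip(full)       # the complete blob decodes without error
begin
  gunzip(partial)  # the partial blob cannot be decoded
rescue Zlib::Error, EOFError => e
  warn "partial blob failed to decode: #{e.class}"
end
```

Line-oriented codecs don't have this problem because every complete line read so far is usable on its own.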

I can add a validation on the codec config so that it only accepts json and line, potentially blocking out other valid codecs, but that has the same effect as not configuring a codec outside the supported range... Next version I'll add it for good measure.
:validate => ['json', 'line']
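In plain Ruby, such a guard could look like the sketch below. The helper name and error message are illustrative only, not the plugin's actual validation code:

```ruby
# Illustrative guard: accept only the codecs the plugin can safely re-read
# from partially written blobs.
SUPPORTED_CODECS = ['json', 'line'].freeze

def validate_codec!(name)
  return name if SUPPORTED_CODECS.include?(name)
  raise ArgumentError,
        "codec '#{name}' is not supported, use one of: #{SUPPORTED_CODECS.join(', ')}"
end
```

Failing fast at register time like this turns a confusing runtime crash into a clear configuration error.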

I can do some reshuffling of the internal logic to accommodate gzip_lines as a codec, but it would take me some time and I don't have much time at the moment.

acidmind commented

Yes, but we use blobs with a single block smaller than 100MB, and your plugin works great with them. The only thing is that it crashes when it meets a corrupted blob file. It would be great if the plugin just logged "oh, I met a corrupted file" and kept processing other blobs while we fix the corrupted one.

janmg commented

Ah, I understand now that it normally works ... I'll add a begin/rescue around line 227, which deals with the codec, so that a corrupted file just produces a log message and the plugin skips it and continues processing the other files without crashing. That's easy and I'll do it today.
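The fix described above can be sketched as: wrap the per-blob decode in begin/rescue so one bad blob is logged and skipped while the rest keep flowing. A self-contained Ruby illustration (the `decode_blob` helper, the in-memory queue, and the blob names are made up, not the plugin's internals):

```ruby
require 'zlib'
require 'stringio'

# Decode one gzipped blob into events on the queue. A corrupted blob is
# logged and skipped instead of killing the input worker.
def decode_blob(name, bytes, queue)
  Zlib::GzipReader.new(StringIO.new(bytes)).each_line do |line|
    queue << { 'message' => line.chomp, 'blob' => name }
  end
rescue Zlib::Error, EOFError => e
  warn "skipping corrupted blob #{name}: #{e.class}: #{e.message}"
end

queue = []
good = Zlib.gzip("a\nb\n")
bad  = good.byteslice(0, good.bytesize / 2)  # simulate a corrupted blob

decode_blob('good.gz', good, queue)
decode_blob('bad.gz',  bad,  queue)          # logged and skipped
decode_blob('also-good.gz', Zlib.gzip("c\n"), queue)
```

After the middle call fails, the third blob is still processed, which is exactly the behavior requested in this issue.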

acidmind commented

Thanks a lot!!! 👍