logstash-plugins/logstash-input-file

bug in file plugin for sincedb with gz file

skorzan opened this issue · 4 comments

Logstash information:

  1. Logstash version: 8.2.1
  2. Logstash installation source: Docker (docker --version reports Docker version 19.03.2, build 6a30dfc)
  3. How is Logstash being run: Docker
  4. How was the Logstash plugin installed: built into the Logstash container

OS version (uname -a if on a Unix-like system):
Linux SRV40990KAB-B01 4.4.155-1.el7.elrepo.x86_64 #1 SMP Sun Sep 9 16:08:40 EDT 2018 x86_64 x86_64 x86_64 GNU/Linux

Description of the problem including expected versus actual behavior:

Steps to reproduce:

I'm facing an issue with *.xml.gz files; more generally, it reproduces with any gzip-compressed file, even on the latest version of Logstash, 8.2.1.
When I bring the Logstash service back up with sincedb configured, the queued files are processed once again.
The sincedb mechanism is designed to prevent exactly this. In my opinion the sincedb file is also written incorrectly: only the last line contains the full file name, and once that file has been parsed the name disappears. This is quite an important feature (and it would be nice to have coverage for compressed logs). Please fix it.

Provide logs (if relevant):
6443347244 0 64768 76277 1653479279.183906
6443347246 0 64768 73283 1653479281.760788
6443347248 0 64768 73315 1653479283.950932
6443347250 0 64768 75366 1653479286.382374
6443347252 0 64768 73616 1653479289.040152
6443347254 0 64768 76145 1653479291.547809
6443347256 0 64768 76333 1653479294.038769
6443347258 0 64768 76901 1653479296.130042
6443347260 0 64768 76592 1653479298.502644
6443347262 0 64768 74610 1653479300.670604
6443348418 0 64768 75824 1653479302.671524
6443348421 0 64768 77399 1653479305.086715
6443348434 0 64768 79737 1653479307.621136
6443348446 0 64768 76225 1653479309.9653728
6443348448 0 64768 77433 1653479312.1789799
6443348464 0 64768 76415 1653479314.736031
6443348466 0 64768 80364 1653479316.9134922
6443348468 0 64768 78276 1653479319.3346992
6443348470 0 64768 77413 1653479321.901298
6443348474 0 64768 77112 1653479324.234174
6443348476 0 64768 80971 1653479326.536776
6443348705 0 64768 83687 1653479329.428642 /opt/data/input/A20220525.0530+0200-0535+0200_HSS40.xml.gz
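
For anyone who wants to check their own sincedb file for the same symptom, here is a minimal sketch of a parser that lists the records with no stored path. It assumes one record per line and the field order described in the file input documentation (inode, major device number, minor device number, byte offset, last-active timestamp, optional last known path). The names SincedbEntry, parse_sincedb_line and entries_without_path are made up for this example and are not part of Logstash.

```python
from typing import List, NamedTuple, Optional


class SincedbEntry(NamedTuple):
    inode: str
    major: int
    minor: int
    position: int          # byte offset already read
    last_active: float     # epoch seconds
    path: Optional[str]    # None when the record has no path field


def parse_sincedb_line(line: str) -> SincedbEntry:
    # The path comes last and may contain spaces, so split at most 5 times.
    parts = line.strip().split(None, 5)
    if len(parts) < 5:
        raise ValueError(f"unexpected sincedb line: {line!r}")
    path = parts[5] if len(parts) == 6 else None
    return SincedbEntry(parts[0], int(parts[1]), int(parts[2]),
                        int(parts[3]), float(parts[4]), path)


def entries_without_path(text: str) -> List[SincedbEntry]:
    # Parse every non-blank line and keep the records that lack a path.
    entries = [parse_sincedb_line(l) for l in text.splitlines() if l.strip()]
    return [e for e in entries if e.path is None]


if __name__ == "__main__":
    # Last two records from the log above.
    sample = (
        "6443348476 0 64768 80971 1653479326.536776\n"
        "6443348705 0 64768 83687 1653479329.428642 "
        "/opt/data/input/A20220525.0530+0200-0535+0200_HSS40.xml.gz\n"
    )
    for entry in entries_without_path(sample):
        print(f"inode {entry.inode}: offset {entry.position}, no path recorded")
```

Run against the records above, this would report every entry except the last one, which matches the observation that only the most recent file keeps its name.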

Any update?

Also affects us with gzipped json lines data.

Sample restore pipeline config /opt/data/restore/restore-ecorp.conf:

input {
  # We read from backup files. Note that the start of the path MUST be absolute
  file {
    # uncompressed works 
    #path => "/opt/data/backup/ecorp-fluffy-data-2022.33.json"
    # Compressed does not work
    path => "/opt/data/backup/ecorp-fluffy-data-2022.33.json.gz"
    sincedb_path => "/tmp/sincedb_restore-ecorp.db"
    start_position => "beginning"
    file_chunk_size => 268435456
    mode => "read"
    codec => "json"
    file_completed_action => "log"
    file_completed_log_path => "/opt/data/logstash/logs/restore-ecorp-fluffy-data.log"
  }
}
filter {
  if "_jsonparsefailure" in [tags] {
    drop { }
  }
# sort
  json {
    source => "message"
  }
}

output {
  # We write to the "new" cluster
  elasticsearch {
    manage_template => false
    sniffing => false
    ilm_enabled => true
    ilm_rollover_alias => "ecorp-fluffy-data"
    ilm_policy => "ecorp-fluffy-data"
    doc_as_upsert => true
    action => "update"
    document_id => "%{identifier}"
    http_compression => true
    ssl => true
    ssl_certificate_verification => false
    hosts => ["https://localhost:9200"]
    api_key => " not going to tell you"
  }
  # We print dots to see it in action
  stdout {
    codec => "dots"
  }
}

We start Logstash as a command-line tool with:

#!/bin/bash
rm -f /tmp/sincedb_restore-ecorp.db
logstash --pipeline.ecs_compatibility disabled -f /opt/data/restore/restore-ecorp.conf --path.data /opt/data/logstash/restore-recorp -w 1 -b 20

It fails on an Intel Mac, on WSL2 on Windows, and on Arm-based AWS machines.
The latest 7.x release does work.
And yes, the gzip file is neither corrupt nor empty.
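
To rule out a damaged archive independently of Logstash, one way is to decompress the whole file and count its lines. This is only a sketch using Python's standard gzip module; the helper names count_gzip_lines and check_gzip are made up for this example.

```python
import gzip
import zlib


def count_gzip_lines(path: str) -> int:
    """Decompress the whole file and return the number of lines in it."""
    lines = 0
    with gzip.open(path, "rb") as handle:
        for _ in handle:
            lines += 1
    return lines


def check_gzip(path: str) -> str:
    """Return a short verdict: 'corrupt: ...', 'empty', or 'ok: N lines'."""
    try:
        count = count_gzip_lines(path)
    except (OSError, EOFError, zlib.error) as exc:
        # OSError covers gzip.BadGzipFile (not a gzip stream) as well as
        # unreadable or missing files; EOFError signals a truncated archive.
        return f"corrupt: {exc}"
    return "empty" if count == 0 else f"ok: {count} lines"
```

Calling check_gzip("/opt/data/backup/ecorp-fluffy-data-2022.33.json.gz") on a healthy backup should return a verdict starting with "ok:", which separates a file problem from a plugin problem.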

We believe this issue has now been fixed in v4.4.4 of the plugin. The relevant commit is in #315.

As such, I'm going to go ahead and close this issue. If you believe the problem persists after upgrading to the latest version, please re-open it.

Hi, the issue still persists with Logstash v8.5.2; the behavior is the same. Please reopen the issue.