logstash-plugins/logstash-input-file

unable to read the whole file when the pipeline gets reloaded

kaisecheng opened this issue · 0 comments

When Logstash starts with --config.reload.automatic, the file input ingests all data as long as no reload occurs.
However, if the pipeline is reloaded in the middle of ingestion, say after 300 out of 600 lines have been read, Logstash reads the first 300 lines again and leaves the rest unread.

  • Version: 4.2.4
  • LS Version: 7.12
  • Operating System: macOS
  • Config File (if you have sensitive info, please remove it):
- pipeline.id: SDH_650
  pipeline.workers: 1
  pipeline.batch.size: 5
  config.string: |
    input {
        file {
            path => "/650/merged.csv"
            mode => "read"
            start_position => "beginning"
        }
    }

    filter {
        csv {
            separator => ","
            columns => ["id", "host", "fqdn", "IP", "mac", "role", "type", "make", "model", "oid", "fid", "time"]
            remove_field => ["path", "host", "message", "@version" ]   
        }
    }

    output {
        elasticsearch { index => "650" }
        stdout { codec => rubydebug }
    }
  • Sample Data:
"464783b9468bed39b19aff0c98128af4f26c3b972092cb26ede33b28ace57bad","aff4.bc","aff4.bc.org","127.0.0.1","cb:91:bc:28:3b:be","MOBILE DEVICE","TABLET","make","model","DHS","","2000-03-09 02:36:17.154791"
"464783b9468bed39b19aff0c98128af4f26c3b972092cb26ede33b28ace57bad","aff4.bc","aff4.bc.org","127.0.0.1","cb:91:bc:28:3b:be","MOBILE DEVICE","TABLET","make","model","DHS","","2000-03-10 02:36:17.154791"
"464783b9468bed39b19aff0c98128af4f26c3b972092cb26ede33b28ace57bad","aff4.bc","aff4.bc.org","127.0.0.1","cb:91:bc:28:3b:be","MOBILE DEVICE","TABLET","make","model","DHS","","2000-03-11 02:36:17.154791"
  • Steps to Reproduce:
  1. run the pipeline in 7.12 with auto-reload: bin/logstash --config.reload.automatic (a sketch for generating a large enough test file follows these steps)
  2. change pipeline.workers from 1 to 2 during ingestion
  3. change pipeline.workers a few more times during ingestion
  4. check the data in Elasticsearch. You will find the head of the csv duplicated, while the tail of the csv is missing
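
To produce a test file large enough for ingestion to span a reload, a throwaway generator pipeline like the sketch below can be used. It reuses the /650/merged.csv path from the config above; the single repeated row is just a minimal sketch, and the three distinct sample rows can be listed in lines instead to make head duplication easier to spot:

input {
    generator {
        # count iterates over the whole lines list,
        # so one line x 600 iterations = 600 lines
        lines => [
            '"464783b9468bed39b19aff0c98128af4f26c3b972092cb26ede33b28ace57bad","aff4.bc","aff4.bc.org","127.0.0.1","cb:91:bc:28:3b:be","MOBILE DEVICE","TABLET","make","model","DHS","","2000-03-09 02:36:17.154791"'
        ]
        count => 600
    }
}
output {
    file {
        # write the raw line instead of the default JSON encoding
        path => "/650/merged.csv"
        codec => line { format => "%{message}" }
    }
}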

Currently the workaround is to use tail mode.
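
A tail-mode equivalent of the read-mode input above looks like the sketch below; the sincedb_path value is an assumption, any writable location works. In tail mode the plugin persists the byte offset in the sincedb, so a reload resumes from where ingestion stopped instead of restarting at the head:

input {
    file {
        path => "/650/merged.csv"
        mode => "tail"
        start_position => "beginning"
        # assumed location; tracks the read offset across reloads
        sincedb_path => "/650/merged.sincedb"
    }
}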