logstash-plugins/logstash-input-file

unable to read the whole file when the pipeline gets reloaded

kaisecheng opened this issue · 0 comments

When Logstash starts with --config.reload.automatic, the file input ingests all data as long as no reload occurs.
However, if the pipeline is reloaded in the middle of ingestion, say after 300 out of 600 lines have been read, Logstash reads the first 300 lines again and leaves the rest unread.

  • Version: 4.2.4
  • LS Version: 7.12
  • Operating System: macOS
  • Config File (if you have sensitive info, please remove it):
- pipeline.id: SDH_650
  pipeline.workers: 1
  pipeline.batch.size: 5
  config.string: |
    input {
        file {
            path => "/650/merged.csv"
            mode => "read"
            start_position => "beginning"
        }
    }

    filter {
        csv {
            separator => ","
            columns => ["id", "host", "fqdn", "IP", "mac", "role", "type", "make", "model", "oid", "fid", "time"]
            remove_field => ["path", "host", "message", "@version" ]   
        }
    }

    output {
        elasticsearch { index => "650" }
        stdout { codec => rubydebug }
    }
  • Sample Data:
"464783b9468bed39b19aff0c98128af4f26c3b972092cb26ede33b28ace57bad","aff4.bc","aff4.bc.org","127.0.0.1","cb:91:bc:28:3b:be","MOBILE DEVICE","TABLET","make","model","DHS","","2000-03-09 02:36:17.154791"
"464783b9468bed39b19aff0c98128af4f26c3b972092cb26ede33b28ace57bad","aff4.bc","aff4.bc.org","127.0.0.1","cb:91:bc:28:3b:be","MOBILE DEVICE","TABLET","make","model","DHS","","2000-03-10 02:36:17.154791"
"464783b9468bed39b19aff0c98128af4f26c3b972092cb26ede33b28ace57bad","aff4.bc","aff4.bc.org","127.0.0.1","cb:91:bc:28:3b:be","MOBILE DEVICE","TABLET","make","model","DHS","","2000-03-11 02:36:17.154791"
  • Steps to Reproduce:
  1. run the pipeline in 7.12 with auto-reload: bin/logstash --config.reload.automatic (a sketch for generating a large enough test file follows these steps)
  2. change pipeline.workers from 1 to 2 during ingestion
  3. change pipeline.workers a few more times during ingestion
  4. check the data in Elasticsearch. You will find the head of the csv duplicated, while the tail of the csv is missing
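
To produce a test file large enough for ingestion to span a reload, a throwaway generator pipeline like the sketch below can be used. It reuses the /650/merged.csv path from the config above; the single repeated row is just a minimal sketch, and the three distinct sample rows can be listed in lines instead to make head duplication easier to spot:

input {
    generator {
        # count iterates over the whole lines list,
        # so one line x 600 iterations = 600 lines
        lines => [
            '"464783b9468bed39b19aff0c98128af4f26c3b972092cb26ede33b28ace57bad","aff4.bc","aff4.bc.org","127.0.0.1","cb:91:bc:28:3b:be","MOBILE DEVICE","TABLET","make","model","DHS","","2000-03-09 02:36:17.154791"'
        ]
        count => 600
    }
}
output {
    file {
        # write the raw line instead of the default JSON encoding
        path => "/650/merged.csv"
        codec => line { format => "%{message}" }
    }
}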

Currently the workaround is to use tail mode.
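
A tail-mode equivalent of the read-mode input above looks like the sketch below; the sincedb_path value is an assumption, any writable location works. In tail mode the plugin persists the byte offset in the sincedb, so a reload resumes from where ingestion stopped instead of restarting at the head:

input {
    file {
        path => "/650/merged.csv"
        mode => "tail"
        start_position => "beginning"
        # assumed location; tracks the read offset across reloads
        sincedb_path => "/650/merged.sincedb"
    }
}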