logstash-plugins/logstash-input-file

UTF-16LE file creates garbled output

TomonoriSoejima opened this issue · 4 comments

  • Version: 6.3.2
  • Operating System: OS X
  • Config File (if you have sensitive info, please remove it):

With the first config below (charset UTF-16LE), the output is garbled.
The workaround is to save the sample log as UTF-8 and set the codec charset to UTF-8 as well, as in the second config; a minimal conversion sketch follows the two configs.

input {
  file {
    path => "/tmp/sample.utf16.log"
    start_position => "beginning"
    ignore_older => -1
    codec => plain {
      charset => "UTF-16LE"
    }
  }
}

filter {
}

output {
  stdout {
    codec => rubydebug
  }
}
input {
  file {
    path => "/tmp/sample.utf8.log"
    start_position => "beginning"
    ignore_older => -1
    codec => plain {
      charset => "UTF-8"
    }
  }
}

filter {
}

output {
  stdout {
    codec => rubydebug
  }
}
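
Here is that conversion step as a minimal sketch (plain Python, purely illustrative; iconv -f UTF-16LE -t UTF-8 does the same from the command line). The paths are the example paths from the configs above.

# Sketch of the workaround's conversion step: re-encode the log to UTF-8
# before Logstash reads it. Paths are the example paths from the configs above.
with open("/tmp/sample.utf16.log", "r", encoding="utf-16-le") as src, \
        open("/tmp/sample.utf8.log", "w", encoding="utf-8") as dst:
    dst.write(src.read())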
  • Sample Data:

Archive.zip

I also found some similar reports, linked below for reference.

https://discuss.elastic.co/t/logstash-invalid-character-for-utf-16-unicode-encoding/56702/7
https://discuss.elastic.co/t/utf-16-broken-since-logstash-6/135558/2

Agreed. As mentioned here, the splitting into lines happens before the codec is applied, so the NUL byte that is the second half of the UTF-16LE line ending on the first line is never consumed. That stray byte shifts the byte alignment of everything that follows, effectively turning the rest of the file into UTF-16BE.
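
A minimal sketch of that mechanism (plain Python, not the plugin's actual code):

# Sketch only: the file input splits the raw bytes on "\n" (0x0A) *before*
# the plain codec decodes them. In UTF-16LE a newline is 0x0A 0x00, so the
# 0x00 is left at the start of the next chunk and every later line decodes
# one byte out of alignment, i.e. effectively as UTF-16BE.
data = "2020-11-23,ec.gob.asi.android\n1337,0\n".encode("utf-16-le")
first, second, tail = data.split(b"\n")   # tail is the stray b"\x00" of the final newline

print(first.decode("utf-16-le"))                     # fine: 2020-11-23,ec.gob.asi.android
print(second.decode("utf-16-le", errors="replace"))  # stray leading NUL: CJK-looking garbage,
                                                     # ending in a replacement character
                                                     # because of the odd byte count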

I am having the same problem:

input {
  file {
    path => "/log/playstore/installs_random_playstore_app_202011_overview.csv"
    sincedb_path => ["/var/log/since.db"]
    codec => plain { charset => "UTF-16LE" }
    type => "playstore-installs"  # a type to identify those logs (will need this later)
    start_position => "beginning"
  }
}
filter {
  csv {
    separator => ","
    skip_header => "true"
    columns => ["Date","Package Name","Daily Device Installs","Daily Device Uninstalls","Daily Device Upgrades","Total User Installs","Daily User Installs","Daily User Uninstalls","Active Device Installs","Install events","Update events","Uninstall events"]
  }
}
output {
  elasticsearch {
    hosts => "http://localhost:9200"
    index => "playstore"
  }
  stdout {
    codec => rubydebug
  }
}

I verified the file's encoding using

file -i /log/playstore/installs_random_playstore_app_202011_overview.csv

The output is: application/csv; charset=utf-16le

If I import it as is, this is what I get in Elasticsearch in each row:

{
          "type" => "playstore-installs",
       "column1" => "㈀ ㈀ ⴀ\u3100\u3100ⴀ㈀㌀Ⰰ攀挀⸀最漀戀⸀愀猀椀⸀愀渀搀爀漀椀搀Ⰰ\u3100㌀㌀㜀Ⰰ Ⰰ Ⰰ Ⰰ\u3100\u3100㠀\u3100Ⰰ\u3100㔀㠀 Ⰰ\u3100㠀㈀ 㜀㈀Ⰰ\u3100㐀㜀㔀Ⰰ㈀㐀Ⰰ\u3100㘀㈀㌀�",
      "@version" => "1",
       "message" => "㈀ ㈀ ⴀ\u3100\u3100ⴀ㈀㌀Ⰰ攀挀⸀最漀戀⸀愀猀椀⸀愀渀搀爀漀椀搀Ⰰ\u3100㌀㌀㜀Ⰰ Ⰰ Ⰰ Ⰰ\u3100\u3100㠀\u3100Ⰰ\u3100㔀㠀 Ⰰ\u3100㠀㈀ 㜀㈀Ⰰ\u3100㐀㜀㔀Ⰰ㈀㐀Ⰰ\u3100㘀㈀㌀�",
    "@timestamp" => 2021-01-15T01:58:28.754Z,
          "host" => "hostname",
          "path" => "/log/playstore/installs_random_playstore_app_202011_overview.csv"
}

If I import it with a wrong codec, this is what I get (at least I get all the fields):

 {
    "Daily Device Uninstalls" => "\u00000\u0000",
                       "path" => "/log/playstore/installs_random_playstore_app_202011_overview.csv",
        "Daily User Installs" => "\u00001\u00000\u00008\u00007\u0000",
                       "type" => "playstore-installs",
                 "@timestamp" => 2021-01-15T02:10:19.956Z,
     "Active Device Installs" => "\u00001\u00007\u00008\u00007\u00007\u00004\u0000",
      "Daily User Uninstalls" => "\u00001\u00003\u00005\u00004\u0000",
                    "message" => "\u00002\u00000\u00002\u00000\u0000-\u00001\u00001\u0000-\u00003\u00000\u0000,\u0000e\u0000c\u0000.\u0000g\u0000o\u0000b\u0000.\u0000a\u0000s\u0000i\u0000.\u0000a\u0000n\u0000d\u0000r\u0000o\u0000i\u0000d\u0000,\u00001\u00002\u00001\u00005\u0000,\u00000\u0000,\u00000\u0000,\u00000\u0000,\u00001\u00000\u00008\u00007\u0000,\u00001\u00003\u00005\u00004\u0000,\u00001\u00007\u00008\u00007\u00007\u00004\u0000,\u00001\u00003\u00003\u00000\u0000,\u00001\u00009\u0000,\u00001\u00004\u00002\u00005\u0000",
      "Daily Device Upgrades" => "\u00000\u0000",
                       "host" => "hostname",
           "Uninstall events" => "\u00001\u00004\u00002\u00005\u0000",
        "Total User Installs" => "\u00000\u0000",
             "Install events" => "\u00001\u00003\u00003\u00000\u0000",
               "Package Name" => "\u00001\u00003\u00003\u00000\u0000",
      "Daily Device Installs" => "\u00001\u00002\u00001\u00005\u0000",
              "Update events" => "\u00001\u00009\u0000",
                   "@version" => "1",
                       "Date" => "\u00002\u00000\u00002\u00000\u0000-\u00001\u00001\u0000-\u00003\u00000\u0000"
}
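
For what it's worth, that interleaved \u0000 pattern is what UTF-16LE looks like when every byte is decoded as its own character: ASCII is stored as <char><NUL>, and splitting on the one-byte "," leaves the previous delimiter's NUL at the front of each field. A small sketch (latin-1 here is just a stand-in for a wrong single-byte decode):

# UTF-16LE stores ASCII as <char><NUL>; a single-byte decode keeps the NULs,
# and splitting on the one-byte "," puts the previous delimiter's NUL at the
# start of each field.
raw = "0,1087,1354".encode("utf-16-le")
line = raw.decode("latin-1")
print(line.split(","))   # ['0\x00', '\x001\x000\x008\x007\x00', '\x001\x003\x005\x004\x00']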

Any ideas?

Edit:

Here's a sample of the csv file:

Date,Package Name,Daily Device Installs,Daily Device Uninstalls,Daily Device Upgrades,Total User Installs,Daily User Installs,Daily User Uninstalls,Active Device Installs,Install events,Update events,Uninstall events
2021-01-01,com.package,1203,0,0,0,1045,2168,186444,1320,17,2214
2021-01-02,com.package,1276,0,0,0,1124,2164,185313,1395,7,2222

Same bug here.

file -i log.txt
log.txt: text/plain; charset=utf-16le

I have tried

codec => plain {
    charset => "UTF-16LE"
}

and

codec => line {
    charset => "UTF-16LE"
}

Both produce unreadable output.

Has anybody found a fix for this in 3 years?