logstash-plugins/logstash-filter-xml

Error caused by empty xml elements being represented as an empty hash and then output to elasticsearch

ChrisMagnuson opened this issue · 1 comment

Quick summary

The default behavior of the xml plugin is to represent empty xml elements as an empty hash, {}.

When used with the Elasticsearch output plugin, this results in empty xml elements being mapped as properties of type object.

When a subsequent document that has that same xml element populated with a value is output, you get the following error: "object mapping for [fieldName] tried to parse field [null] as object, but found a concrete value".

I have submitted pull request #23 with the necessary code changes to provide a suppress_empty option, including unit tests, to resolve this issue.

Detailed summary

The default behavior of the xml plugin is to represent empty xml elements as an empty hash, {}.
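For illustration, here is a minimal sketch of that default behavior using the xml-simple gem directly (the library the filter wraps); the exact options the filter passes internally may differ:

require 'xmlsimple'

xml = '<Address><AddressLine1>555 Some Address</AddressLine1>' \
      '<AddressLine2></AddressLine2></Address>'

# With XmlSimple's defaults, the empty AddressLine2 element is parsed
# into an empty hash rather than being dropped or set to nil.
p XmlSimple.xml_in(xml)
# => {"AddressLine1"=>["555 Some Address"], "AddressLine2"=>[{}]}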

When you are using the elasticsearch output plugin with logstash- as the prefix of the index name, it uses this index template when it creates the index.

Using this logstash configuration:

input {
    stdin {
    }
}
filter {
    xml {
        target => "ParsedXML"
        source => "message"
    }
}
output {
    stdout { codec => rubydebug }

    elasticsearch {
        hosts => "localhost"
        index => "logstash-test"
    }
}

and then pasting this xml sample into the terminal window, followed by pressing enter once:

<Address><AddressLine1>555 Some Address</AddressLine1><AddressLine2></AddressLine2></Address>
<Address><AddressLine1>333 Some Address</AddressLine1><AddressLine2>Apartment 12</AddressLine2></Address>

results in this error message:

Failed action.  {:status=>400, :action=>["index", {:_id=>nil, :_index=>"logstash-test", :_type=>"logs", :_routing=>
nil}, #<LogStash::Event:0x68f2a46a @metadata_accessors=#<LogStash::Util::Accessors:0x68905b47 @store={}, @lut={}>, @canc
elled=false, @data={"message"=>"<Address><AddressLine1>333 Some Address</AddressLine1><AddressLine2>Apartment 12</Addres
sLine2></Address>\r", "@version"=>"1", "@timestamp"=>"2016-01-28T14:18:43.967Z", "host"=>"cmagnuson-lt", "ParsedXML"=>{"
AddressLine1"=>["333 Some Address"], "AddressLine2"=>["Apartment 12"]}}, @metadata={}, @accessors=#<LogStash::Util::Acce
ssors:0x466367b @store={"message"=>"<Address><AddressLine1>333 Some Address</AddressLine1><AddressLine2>Apartment 12</Ad
dressLine2></Address>\r", "@version"=>"1", "@timestamp"=>"2016-01-28T14:18:43.967Z", "host"=>"cmagnuson-lt", "ParsedXML"
=>{"AddressLine1"=>["333 Some Address"], "AddressLine2"=>["Apartment 12"]}}, @lut={"host"=>[{"message"=>"<Address><Addre
ssLine1>333 Some Address</AddressLine1><AddressLine2>Apartment 12</AddressLine2></Address>\r", "@version"=>"1", "@timest
amp"=>"2016-01-28T14:18:43.967Z", "host"=>"cmagnuson-lt", "ParsedXML"=>{"AddressLine1"=>["333 Some Address"], "AddressLi
ne2"=>["Apartment 12"]}}, "host"], "message"=>[{"message"=>"<Address><AddressLine1>333 Some Address</AddressLine1><Addre
ssLine2>Apartment 12</AddressLine2></Address>\r", "@version"=>"1", "@timestamp"=>"2016-01-28T14:18:43.967Z", "host"=>"cm
agnuson-lt", "ParsedXML"=>{"AddressLine1"=>["333 Some Address"], "AddressLine2"=>["Apartment 12"]}}, "message"], "Parsed
XML"=>[{"message"=>"<Address><AddressLine1>333 Some Address</AddressLine1><AddressLine2>Apartment 12</AddressLine2></Add
ress>\r", "@version"=>"1", "@timestamp"=>"2016-01-28T14:18:43.967Z", "host"=>"cmagnuson-lt", "ParsedXML"=>{"AddressLine1
"=>["333 Some Address"], "AddressLine2"=>["Apartment 12"]}}, "ParsedXML"], "type"=>[{"message"=>"<Address><AddressLine1>
333 Some Address</AddressLine1><AddressLine2>Apartment 12</AddressLine2></Address>\r", "@version"=>"1", "@timestamp"=>"2
016-01-28T14:18:43.967Z", "host"=>"cmagnuson-lt", "ParsedXML"=>{"AddressLine1"=>["333 Some Address"], "AddressLine2"=>["
Apartment 12"]}}, "type"]}>>], :response=>{"create"=>{"_index"=>"logstash-test", "_type"=>"logs", "_id"=>"AVKImcTNxEzXzs
2VmWC5", "status"=>400, "error"=>{"type"=>"mapper_parsing_exception", "reason"=>"object mapping for [ParsedXML.AddressLi
ne2] tried to parse field [null] as object, but found a concrete value"}}}, :level=>:warn}

The most important snippet from this appears to be "error"=>{"type"=>"mapper_parsing_exception", "reason"=>"object mapping for [ParsedXML.AddressLine2] tried to parse field [null] as object, but found a concrete value"}.

The first xml record passed to Elasticsearch resulted in the following properties because of the index template applied by the elasticsearch output plugin:

 "properties": {
      "ParsedXML": {
        "properties": {
          "AddressLine2": {
            "type": "object"
          },
          "AddressLine1": {
            "fielddata": {
              "format": "disabled"
            },
            "norms": {
              "enabled": false
            },
            "type": "string",
            "fields": {
              "raw": {
                "ignore_above": 256,
                "index": "not_analyzed",
                "type": "string"
              }
            }
          }
        }
      }
 }

When this record is output to stdout with the rubydebug codec, it looks like this:

       "message" => "<Address><AddressLine1>555 Some Address</AddressLine1><AddressLine2></AddressLine2></Address>\r",
      "@version" => "1",
    "@timestamp" => "2016-01-28T14:18:43.078Z",
          "host" => "cmagnuson-lt",
     "ParsedXML" => {
        "AddressLine1" => [
            [0] "555 Some Address"
        ],
        "AddressLine2" => [
            [0] {}
        ]
    }
}

You can see that AddressLine2 is represented as an empty hash {} and that the resulting property in Elasticsearch is "type": "object".

When the next xml record is sent to Elasticsearch, it results in an error because AddressLine2 now has a string value and Elasticsearch cannot change a field's mapping from an object to a string.
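The conflict can be reproduced without Logstash at all. Below is a minimal sketch, assuming a local Elasticsearch node; the index name mapping-demo and the helper index_doc are hypothetical, chosen only for this demonstration:

require 'net/http'
require 'json'

# Hypothetical throwaway index used only to demonstrate the mapping conflict.
uri = URI('http://localhost:9200/mapping-demo/logs')

def index_doc(uri, doc)
  res = Net::HTTP.post(uri, doc.to_json, 'Content-Type' => 'application/json')
  puts res.body
end

# First document: AddressLine2 is an empty object, so dynamic mapping
# creates the field with type "object".
index_doc(uri, 'AddressLine2' => {})

# Second document: AddressLine2 is a concrete string value. Elasticsearch
# refuses to change the existing object mapping and rejects the document
# with a mapper_parsing_exception, just like the Logstash error above.
index_doc(uri, 'AddressLine2' => 'Apartment 12')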

The underlying XmlSimple library has an option to suppress empty elements so that they don't show up in the output.
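In xml-simple that option is spelled SuppressEmpty. A minimal sketch of the difference, using the same sample document as above:

require 'xmlsimple'

xml = '<Address><AddressLine1>555 Some Address</AddressLine1>' \
      '<AddressLine2></AddressLine2></Address>'

# With SuppressEmpty enabled, empty elements simply do not appear in the
# parsed result, so no empty hash ever reaches Elasticsearch.
p XmlSimple.xml_in(xml, 'SuppressEmpty' => true)
# => {"AddressLine1"=>["555 Some Address"]}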

I have updated the xml filter to support a suppress_empty boolean option that allows for the following logstash configuration:

input {
    stdin {
    }
}
filter {
    xml {
        target => "ParsedXML"
        source => "message"
        suppress_empty => true
    }
}
output {
    stdout { codec => rubydebug }

    elasticsearch {
        hosts => "localhost"
        index => "logstash-test"
    }
}

Now, after deleting the index to get rid of the incorrect mapping, processing the xml records again produces the following output with no errors:

Logstash startup completed
<Address><AddressLine1>555 Some Address</AddressLine1><AddressLine2></AddressLine2></Address>
<Address><AddressLine1>333 Some Address</AddressLine1><AddressLine2>Apartment 12</AddressLine2></Address>
{
       "message" => "<Address><AddressLine1>555 Some Address</AddressLine1><AddressLine2></AddressLine2></Address>\r",
      "@version" => "1",
    "@timestamp" => "2016-01-28T15:25:31.657Z",
          "host" => "cmagnuson-lt",
     "ParsedXML" => {
        "AddressLine1" => [
            [0] "555 Some Address"
        ]
    }
}

{
       "message" => "<Address><AddressLine1>333 Some Address</AddressLine1><AddressLine2>Apartment 12</AddressLine2></Ad
dress>\r",
      "@version" => "1",
    "@timestamp" => "2016-01-28T15:25:32.623Z",
          "host" => "cmagnuson-lt",
     "ParsedXML" => {
        "AddressLine1" => [
            [0] "333 Some Address"
        ],
        "AddressLine2" => [
            [0] "Apartment 12"
        ]
    }
}

Both records were properly parsed and stored as documents in Elasticsearch.

The resulting properties now look like you would expect:

"properties": {
      "ParsedXML": {
        "properties": {
          "AddressLine2": {
            "fielddata": {
              "format": "disabled"
            },
            "norms": {
              "enabled": false
            },
            "type": "string",
            "fields": {
              "raw": {
                "ignore_above": 256,
                "index": "not_analyzed",
                "type": "string"
              }
            }
          },
          "AddressLine1": {
            "fielddata": {
              "format": "disabled"
            },
            "norms": {
              "enabled": false
            },
            "type": "string",
            "fields": {
              "raw": {
                "ignore_above": 256,
                "index": "not_analyzed",
                "type": "string"
              }
            }
          }
        }
      }
}

Pull request #23 has been submitted to add this feature and resolve this error.

As an aside, I think something other than an empty hash should be the default, as I would not expect to have to configure anything special to output xml to Elasticsearch when some documents have empty elements and some do not.

Fixed by #32 and #34

@suyograo can you close please?