Error caused by empty xml elements being represented as an empty hash and then output to elasticsearch
ChrisMagnuson opened this issue · 1 comment
Quick summary
The default behavior of the xml plugin is to represent empty xml elements as an empty hash, {}.
When used with the elasticsearch output plugin, this results in empty xml elements being mapped as properties of type object.
When a subsequent document with that same xml element populated with a value is output, you get the following error: object mapping for [fieldName] tried to parse field [null] as object, but found a concrete value.
I have submitted pull request #23 with the necessary code changes, including unit tests, to provide a suppress_empty option that resolves this issue.
Detailed summary
The default behavior of the xml plugin is to represent empty xml elements as an empty hash, {}.
When you use the elasticsearch output plugin with an index name prefixed with logstash-, it applies this index template when it creates the index.
Using this logstash configuration:
input {
  stdin {
  }
}
filter {
  xml {
    target => "ParsedXML"
    source => "message"
  }
}
output {
  stdout { codec => rubydebug }
  elasticsearch {
    hosts => "localhost"
    index => "logstash-test"
  }
}
and then pasting this xml sample into the terminal window followed by hitting enter once:
<Address><AddressLine1>555 Some Address</AddressLine1><AddressLine2></AddressLine2></Address>
<Address><AddressLine1>333 Some Address</AddressLine1><AddressLine2>Apartment 12</AddressLine2></Address>
results in this error message:
Failed action. {:status=>400, :action=>["index", {:_id=>nil, :_index=>"logstash-test", :_type=>"logs", :_routing=>
nil}, #<LogStash::Event:0x68f2a46a @metadata_accessors=#<LogStash::Util::Accessors:0x68905b47 @store={}, @lut={}>, @canc
elled=false, @data={"message"=>"<Address><AddressLine1>333 Some Address</AddressLine1><AddressLine2>Apartment 12</Addres
sLine2></Address>\r", "@version"=>"1", "@timestamp"=>"2016-01-28T14:18:43.967Z", "host"=>"cmagnuson-lt", "ParsedXML"=>{"
AddressLine1"=>["333 Some Address"], "AddressLine2"=>["Apartment 12"]}}, @metadata={}, @accessors=#<LogStash::Util::Acce
ssors:0x466367b @store={"message"=>"<Address><AddressLine1>333 Some Address</AddressLine1><AddressLine2>Apartment 12</Ad
dressLine2></Address>\r", "@version"=>"1", "@timestamp"=>"2016-01-28T14:18:43.967Z", "host"=>"cmagnuson-lt", "ParsedXML"
=>{"AddressLine1"=>["333 Some Address"], "AddressLine2"=>["Apartment 12"]}}, @lut={"host"=>[{"message"=>"<Address><Addre
ssLine1>333 Some Address</AddressLine1><AddressLine2>Apartment 12</AddressLine2></Address>\r", "@version"=>"1", "@timest
amp"=>"2016-01-28T14:18:43.967Z", "host"=>"cmagnuson-lt", "ParsedXML"=>{"AddressLine1"=>["333 Some Address"], "AddressLi
ne2"=>["Apartment 12"]}}, "host"], "message"=>[{"message"=>"<Address><AddressLine1>333 Some Address</AddressLine1><Addre
ssLine2>Apartment 12</AddressLine2></Address>\r", "@version"=>"1", "@timestamp"=>"2016-01-28T14:18:43.967Z", "host"=>"cm
agnuson-lt", "ParsedXML"=>{"AddressLine1"=>["333 Some Address"], "AddressLine2"=>["Apartment 12"]}}, "message"], "Parsed
XML"=>[{"message"=>"<Address><AddressLine1>333 Some Address</AddressLine1><AddressLine2>Apartment 12</AddressLine2></Add
ress>\r", "@version"=>"1", "@timestamp"=>"2016-01-28T14:18:43.967Z", "host"=>"cmagnuson-lt", "ParsedXML"=>{"AddressLine1
"=>["333 Some Address"], "AddressLine2"=>["Apartment 12"]}}, "ParsedXML"], "type"=>[{"message"=>"<Address><AddressLine1>
333 Some Address</AddressLine1><AddressLine2>Apartment 12</AddressLine2></Address>\r", "@version"=>"1", "@timestamp"=>"2
016-01-28T14:18:43.967Z", "host"=>"cmagnuson-lt", "ParsedXML"=>{"AddressLine1"=>["333 Some Address"], "AddressLine2"=>["
Apartment 12"]}}, "type"]}>>], :response=>{"create"=>{"_index"=>"logstash-test", "_type"=>"logs", "_id"=>"AVKImcTNxEzXzs
2VmWC5", "status"=>400, "error"=>{"type"=>"mapper_parsing_exception", "reason"=>"object mapping for [ParsedXML.AddressLi
ne2] tried to parse field [null] as object, but found a concrete value"}}}, :level=>:warn}
The most important snippet from this is "error"=>{"type"=>"mapper_parsing_exception", "reason"=>"object mapping for [ParsedXML.AddressLine2] tried to parse field [null] as object, but found a concrete value"}.
The first xml record passed to elasticsearch resulted in the following properties because of the elasticsearch index template used by the elasticsearch output plugin:
"properties": {
"ParsedXML": {
"properties": {
"AddressLine2": {
"type": "object"
},
"AddressLine1": {
"fielddata": {
"format": "disabled"
},
"norms": {
"enabled": false
},
"type": "string",
"fields": {
"raw": {
"ignore_above": 256,
"index": "not_analyzed",
"type": "string"
}
}
}
}
}
When this record is output to stdout with the rubydebug codec, it looks like this:
{
    "message" => "<Address><AddressLine1>555 Some Address</AddressLine1><AddressLine2></AddressLine2></Address>\r",
    "@version" => "1",
    "@timestamp" => "2016-01-28T14:18:43.078Z",
    "host" => "cmagnuson-lt",
    "ParsedXML" => {
        "AddressLine1" => [
            [0] "555 Some Address"
        ],
        "AddressLine2" => [
            [0] {}
        ]
    }
}
You can see that AddressLine2 is represented as an empty hash, {}, and that the resulting property in Elasticsearch is "type": "object".
When the next xml record is sent to Elasticsearch it results in an error, because AddressLine2 now has a string value and Elasticsearch cannot change the property from an object to a string.
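The conflict can be sketched in a few lines of Ruby. This is only an illustrative model of Elasticsearch's dynamic mapping, not the plugin's or Elasticsearch's code (the infer_type and index! names are invented for this sketch): the first document pins the field's type, and a later document with a different shape is rejected.

```ruby
# Toy model of Elasticsearch dynamic mapping for a single field:
# the first document fixes the field's type; later documents must match.
# (infer_type and index! are hypothetical names for this sketch only.)
def infer_type(value)
  value.is_a?(Hash) ? :object : :concrete
end

def index!(mapping, doc)
  doc.each do |field, value|
    type = infer_type(value)
    mapping[field] ||= type            # first sighting pins the type
    next if mapping[field] == type
    raise "object mapping for [#{field}] tried to parse field [null] " \
          "as object, but found a concrete value"
  end
end

mapping = {}
index!(mapping, "ParsedXML.AddressLine2" => {})                # first record: empty hash -> :object
begin
  index!(mapping, "ParsedXML.AddressLine2" => "Apartment 12")  # second record: string -> conflict
rescue => e
  puts e.message
end
```

The order matters: if the populated record had arrived first, the field would have been pinned as a string and the empty-hash record would have failed instead.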
The underlying XmlSimple library has a SuppressEmpty option so that empty elements don't show up in the output.
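A rough approximation of the two behaviors, using only Ruby's stdlib REXML rather than XmlSimple itself (parse_address and its suppress_empty: keyword are invented for this sketch, not part of any library):

```ruby
require 'rexml/document'

# Approximates how the filter flattens an <Address> element: by default an
# empty child becomes [{}]; with suppression it is dropped entirely.
# (parse_address and suppress_empty: are hypothetical names for this sketch.)
def parse_address(xml, suppress_empty: false)
  result = {}
  REXML::Document.new(xml).root.elements.each do |el|
    if el.text.nil?                                  # empty element
      result[el.name] = [{}] unless suppress_empty
    else
      result[el.name] = [el.text]
    end
  end
  result
end

xml = '<Address><AddressLine1>555 Some Address</AddressLine1>' \
      '<AddressLine2></AddressLine2></Address>'
parse_address(xml)                        # includes "AddressLine2" => [{}]
parse_address(xml, suppress_empty: true)  # omits AddressLine2 entirely
```

With suppression enabled, the first record never creates an AddressLine2 property at all, so the later populated record is free to map it as a string.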
I have updated the xml filter to support a suppress_empty boolean option that allows for the following logstash configuration:
input {
  stdin {
  }
}
filter {
  xml {
    target => "ParsedXML"
    source => "message"
    suppress_empty => true
  }
}
output {
  stdout { codec => rubydebug }
  elasticsearch {
    hosts => "localhost"
    index => "logstash-test"
  }
}
Now after deleting the index to get rid of the incorrect mapping, if I process the xml records again I get the following output with no errors:
Logstash startup completed
<Address><AddressLine1>555 Some Address</AddressLine1><AddressLine2></AddressLine2></Address>
<Address><AddressLine1>333 Some Address</AddressLine1><AddressLine2>Apartment 12</AddressLine2></Address>
{
    "message" => "<Address><AddressLine1>555 Some Address</AddressLine1><AddressLine2></AddressLine2></Address>\r",
    "@version" => "1",
    "@timestamp" => "2016-01-28T15:25:31.657Z",
    "host" => "cmagnuson-lt",
    "ParsedXML" => {
        "AddressLine1" => [
            [0] "555 Some Address"
        ]
    }
}
{
    "message" => "<Address><AddressLine1>333 Some Address</AddressLine1><AddressLine2>Apartment 12</AddressLine2></Address>\r",
    "@version" => "1",
    "@timestamp" => "2016-01-28T15:25:32.623Z",
    "host" => "cmagnuson-lt",
    "ParsedXML" => {
        "AddressLine1" => [
            [0] "333 Some Address"
        ],
        "AddressLine2" => [
            [0] "Apartment 12"
        ]
    }
}
Both records were properly parsed and stored as documents in Elasticsearch.
The resulting properties now look as you would expect:
"properties": {
  "ParsedXML": {
    "properties": {
      "AddressLine2": {
        "fielddata": {
          "format": "disabled"
        },
        "norms": {
          "enabled": false
        },
        "type": "string",
        "fields": {
          "raw": {
            "ignore_above": 256,
            "index": "not_analyzed",
            "type": "string"
          }
        }
      },
      "AddressLine1": {
        "fielddata": {
          "format": "disabled"
        },
        "norms": {
          "enabled": false
        },
        "type": "string",
        "fields": {
          "raw": {
            "ignore_above": 256,
            "index": "not_analyzed",
            "type": "string"
          }
        }
      }
    }
  }
}
Pull request #23 has been submitted to add this feature and resolve this error.
As an aside, I think something other than an empty hash should be the default, since I would not expect to have to configure anything special to output xml to Elasticsearch when some documents have empty elements and some do not.