logstash-plugins/logstash-filter-xml

Parsing of XML with inconsistent structure results in incompatible mapping

suyograo opened this issue · 2 comments

From: @gingerwizard elastic/logstash#3570


Consider parsing the following XML with logstash:

<description>CA Local Sales Tax</description>
<description lang="spanish">Impuesto local sobre las ventas de CA</description>

The above is a valid XML fragment.
The elements are, however, parsed sequentially in logstash and converted to JSON. The first "description" element is converted to a plain string because it has no attributes. The second "description" element has an attribute and is therefore converted to an object. The result is:

"description" => [
"CA Local Sales Tax",
{
"lang" => "spanish",
"content" => "Impuesto local sobre las ventas de CA"
}
}

The above is invalid for Elasticsearch - a field requires a consistent type across documents, but here the same field holds both an object and a string. Indexing the document therefore fails with a mapping error.
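The sequential conversion described above can be sketched in Ruby. This is an approximation of the rule, not the plugin's actual code (which relies on the XmlSimple gem); the `element_to_json` helper and the `<item>` wrapper element are assumptions for illustration:

```ruby
require 'rexml/document'

# Approximation of the conversion rule described above: an element with no
# attributes becomes a plain string, while an element with attributes becomes
# a hash whose text is stored under a "content" key.
def element_to_json(el)
  return el.text if el.attributes.empty?

  el.attributes.each_with_object('content' => el.text) do |(name, value), h|
    h[name] = value
  end
end

# Hypothetical <item> root added so the fragment is a well-formed document.
xml = <<~XML
  <item>
    <description>CA Local Sales Tax</description>
    <description lang="spanish">Impuesto local sobre las ventas de CA</description>
  </item>
XML

doc = REXML::Document.new(xml)
parsed = doc.root.get_elements('description').map { |el| element_to_json(el) }

# The same field now mixes a string and a hash - exactly the shape that
# Elasticsearch rejects:
p parsed
```

Running this shows the mixed-type array from the issue: the first entry is a bare string, the second a hash with "content" and "lang" keys.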

A basic configuration reproduces the issue, e.g.:

input {
  stdin {
    type => "myxml"
  }
}

filter {
  xml {
    source => "message"
    target => "parsed"
  }
}

output {
  stdout { codec => json }
}

The suggestion is to convert all XML elements into objects, irrespective of whether they contain attributes or sub-elements. This could be made a setting, thus allowing inconsistent XML structures to be handled.

Fixed by the new force_content config option, which ensures consistency
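Assuming the fix shipped as described, enabling the option in the reproduction config above would look like this; with it set, every element's text lands under a "content" key, so both "description" values become objects of the same shape:

```
filter {
  xml {
    source => "message"
    target => "parsed"
    force_content => true
  }
}
```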

@suyograo can you close please

See also #24, which describes a similar problem with a better fix