Parsing XML with inconsistent structure results in an incompatible mapping
suyograo opened this issue · 2 comments
From: @gingerwizard elastic/logstash#3570
Consider parsing the following XML with logstash:
<doc>
  <description>CA Local Sales Tax</description>
  <description lang="spanish">Impuesto local sobre las ventas de CA</description>
</doc>
The above is valid XML.
The elements are, however, parsed sequentially by logstash and converted to JSON. The first "description" element is converted to a string because it has no attributes. The second "description" element has an attribute and is therefore converted to an object. The result is:
"description" => [
"CA Local Sales Tax",
{
"lang" => "spanish",
"content" => "Impuesto local sobre las ventas de CA"
}
}
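The conversion logic described above can be sketched in Python with the standard library's `xml.etree`. This is a hypothetical illustration of the behavior, not the actual plugin code: an element with no attributes and no children collapses to a plain string, while one with attributes becomes an object, so two sibling elements of the same name end up with different JSON types.

```python
from xml.etree import ElementTree

def element_to_value(elem):
    """Mimic the described conversion (hypothetical helper, not the
    plugin's code): attribute-less, child-less elements become plain
    strings; anything else becomes a dict/object."""
    if not elem.attrib and len(elem) == 0:
        return elem.text
    value = dict(elem.attrib)
    if elem.text and elem.text.strip():
        value["content"] = elem.text
    for child in elem:
        value.setdefault(child.tag, []).append(element_to_value(child))
    return value

doc = ElementTree.fromstring(
    '<doc>'
    '<description>CA Local Sales Tax</description>'
    '<description lang="spanish">Impuesto local sobre las ventas de CA</description>'
    '</doc>'
)
values = [element_to_value(d) for d in doc.findall("description")]

# The two sibling elements map to different JSON types: a bare string
# and an object -- exactly the mixed mapping Elasticsearch rejects.
print(values)
```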
The above is invalid for Elasticsearch: a field must have a consistent type across documents, but here the same field holds both a string and an object. Indexing the document therefore fails with a mapping error.
A basic configuration reproduces this, e.g.:
input {
  stdin {
    type => "myxml"
  }
}

filter {
  xml {
    source => "message"
    target => "parsed"
  }
}

output {
  stdout { codec => json }
}
The suggestion is to convert all XML elements into objects, irrespective of whether they contain attributes or sub-elements. This could be made a setting, allowing inconsistent XML structures to be parsed.
Fixed by using the new force_content config option to ensure consistency.
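Assuming force_content is a boolean option on the xml filter (per the fix referenced here), the reproducing configuration above would become something like:

```
filter {
  xml {
    source        => "message"
    target        => "parsed"
    force_content => true
  }
}
```

With this setting, both "description" elements should be emitted as objects with a "content" key, giving the field a consistent type.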
@suyograo can you close please
See also #24, which describes a similar problem with a better fix.