logstash-plugins/logstash-filter-xml

Unexpected poor performance handling CDATA section

tsouza opened this issue · 1 comment

I ran an xml filter benchmark against two types of xml log entries:

  1. One that wraps a nested xml text inside a CDATA section.
  2. One that escapes a nested xml text.

In both cases, this nested xml text is ~18k chars.
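To make the two variants concrete, here is a minimal sketch (in Python, purely for illustration — the actual benchmark payload is ~18k chars and its content is not shown here) of how the same nested xml text can be carried either inside a CDATA section or entity-escaped, and how both decode to identical character data once parsed:

```python
from xml.sax.saxutils import escape
import xml.etree.ElementTree as ET

# Hypothetical nested payload; the real benchmark entry is ~18k chars.
nested = "<inner><value>42</value></inner>"

# Variant 1: nested xml wrapped in a CDATA section.
cdata_entry = f"<sample_xml_log><![CDATA[{nested}]]></sample_xml_log>"

# Variant 2: nested xml with < > & entity-escaped.
escaped_entry = f"<sample_xml_log>{escape(nested)}</sample_xml_log>"

# Both forms yield the same character data after parsing.
assert ET.fromstring(cdata_entry).text == nested
assert ET.fromstring(escaped_entry).text == nested
```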

I ran the benchmark on a dedicated Core i5 (4 cores) with 16GB RAM.

I used Logstash v2.4.0.

JVM is:

openjdk version "1.8.0_102"
OpenJDK Runtime Environment (build 1.8.0_102-8u102-b14.1-2-b14)
OpenJDK 64-Bit Server VM (build 25.102-b14, mixed mode)

Linux:

Linux homeserver 4.8.0-26-generic #28-Ubuntu SMP Tue Oct 18 14:39:52 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux

The logstash configuration is as follows:
(from https://www.elastic.co/guide/en/logstash/2.4/plugins-filters-metrics.html)

input {
  generator {
    lines => ['<sample_xml_log>']
  }
}

filter {
  xml {
    source => "message"
    target => "entry"
  }
  metrics {
    meter => "events"
    add_tag => "metric"
  }
}

output {
  # only emit events with the 'metric' tag
  if "metric" in [tags] {
    stdout {
      codec => line {
        format => "rate: %{[events][rate_1m]}"
      }
    }
  }
}

The output for the log entry that uses CDATA section is:

rate: 6.032965896569965
rate: 6.232386633063851
rate: 6.245788257933181
rate: 6.276051178133002
rate: 6.295079587467865
rate: 6.101150213348986
rate: 6.070909269120961
rate: 5.995443066741132
rate: 5.983002341013909
rate: 6.01280267908362
rate: 6.02451769874094
rate: 6.011403109806091
rate: 6.006735744554418
rate: 6.028854466363812
rate: 6.044576382685371
rate: 6.034575149976681
rate: 6.01545334110413
rate: 6.018715406521946
rate: 6.001227888163898
rate: 6.033517878763309
rate: 6.01778815696872
rate: 6.016365894465621
rate: 5.774201799793936
rate: 5.851427472899567
rate: 5.863306676273883
rate: 5.960459293207369
rate: 5.969307896129195
rate: 5.97401969497492
rate: 6.007417476593245
rate: 6.037758970727505
rate: 6.018748813045858
rate: 6.062558991100919
rate: 6.073548167421381
rate: 6.051676463568129
rate: 6.039506676632154
rate: 6.033441679780281

The output for the log entry that does NOT use a CDATA section is:

rate: 47.8
rate: 50.27862314649097
rate: 58.27311268621327
rate: 69.11691806009138
rate: 74.19274503773141
rate: 79.10259805424599
rate: 87.34742691392835
rate: 89.3185378259186
rate: 92.84309693804262
rate: 94.80655849725238
rate: 100.21004309947467
rate: 100.21004309947467
rate: 103.13561452214401
rate: 109.14742413686035
rate: 111.83813553922879
rate: 114.31370953635731
rate: 116.92716102397382
rate: 118.96385677535498
rate: 120.90167179400903
rate: 122.92441443461854
rate: 124.48159627893531
rate: 125.96224608858356
rate: 127.27653632474998
rate: 129.9340955963991
rate: 129.9340955963991
rate: 130.78689422649668
rate: 132.60297900678245
rate: 133.78608348203736
rate: 134.41084975126867
rate: 135.11359140431682
rate: 135.9680294590949
rate: 136.37036360926098
rate: 137.03312405112845
rate: 137.75007586656673
rate: 137.9779432189174
rate: 138.25155577202057
rate: 138.8710871659807
rate: 139.07328787197613
rate: 139.3223786224067

I believe it's rather odd that the use of CDATA actually makes things worse. In theory the parser should just skip ahead until it finds the closing ]]>.
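The regression itself lives in the plugin's Ruby XML stack, so a Python sketch cannot reproduce it, but a micro-benchmark like the following (stdlib only, with a synthetic ~18k-char filler payload standing in for the real log content) shows how one could isolate the parser from the rest of the Logstash pipeline when testing the CDATA-should-be-faster hypothesis:

```python
import timeit
from xml.sax.saxutils import escape
import xml.etree.ElementTree as ET

# Synthetic ~18k-char nested payload, a stand-in for the real log entry.
nested = "<item>" + "x" * 18000 + "</item>"

cdata_doc = f"<log><![CDATA[{nested}]]></log>"
escaped_doc = f"<log>{escape(nested)}</log>"

# Time only the XML parsing step, outside any pipeline overhead.
t_cdata = timeit.timeit(lambda: ET.fromstring(cdata_doc), number=200)
t_escaped = timeit.timeit(lambda: ET.fromstring(escaped_doc), number=200)
print(f"CDATA:   {t_cdata:.4f}s")
print(f"escaped: {t_escaped:.4f}s")
```

Both documents decode to the same character data, so any timing gap is attributable to how the parser handles the two encodings rather than to the content itself.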

The performance difference here is very stark (~6 events per second vs ~140 events per second). I agree this is very strange!