Performance regression when using Handlebars with large documents in 4.3.1 release (cause: memory re-allocation behavior change in commons text 10)

Question

Performance regression when using Handlebars with large documents in 4.3.1 release (cause: memory re-allocation behavior change in commons text 10)

aschwarte10 opened this issue 2 years ago · 4 comments

After upgrading from Handlebars from 4.3.0 to 4.3.1 we stumbled into a performance regression when parsing large documents.

The analysis revealed that this is a side effect of an actual bug fix in commons-text 10: in the org.apache.commons.text.TextStringBuilder(which is used as a basis of FastStringWriter) the ensureCapacity method was fixes to not double the buffer size (on each re-allocation). The effect is that with each append through handlebars rendering heavy memory re-allocation takes place (which gets slower with increasing size of the buffer / document).

In our case we tried the following workarounds / solutions:

a) switch to regular Java StringWriter

Instead of com.github.jknack.handlebars.Template.apply(Object) we now invoke the variant as follows

Context context = params instanceof Context ? (Context) params : Context.newContext(params);
StringWriter writer = new StringWriter();
template.apply(context, writer);
return writer.toString();

b) use a modified FastStringWriter implementation that in each of its "append" methods invokes the following method to increase the buffer size (with some additional capacity)

/**
     * Make sure that there is enough capacity in the buffer. If there isn't
     * increase the capacity for the new content plus some additional space
     * corresponding to {@link #BUFFER_SIZE}.
     * 
     * @param newContentLength
     */
    private void ensureCapacity(int newContentLength) {
        if (buffer.capacity() - buffer.length() > newContentLength) {
            return;
        }
        int currentSize = buffer.capacity();
        // make sure to reallocate memory for the new content + some additional space
        int desiredCapacity = currentSize + newContentLength + BUFFER_SIZE;
        buffer.ensureCapacity(desiredCapacity);
    }

We found that the plain StringWriter was still better in performance in our concrete test case.

Maybe there are further optimizations in ensureCapacity possible (i..e using the knowledge about the compile template and increasing the buffer size close to what is actually needed)

Answer 1 · 2022-11-29T07:38:09.000Z

Quick addition to @aschwarte10 's comment above: the good results for the Java StringWriter are with JDK 17. There have been some changes between e.g. JDK 11 and 17 there, so the performance of the plain StringWriter might depend on the underlying Java version...

Answer 2 · 2023-06-07T16:27:49.000Z

I think we may be seeing something similar... we've got a large template (4MB).

In a docker container running in Docker Desktop on my box, rendering it precompiled takes > 2,000 milliseconds.

On the same box but running the JVM outside docker, rendering it precompiled takes < 30 milliseconds.

Handlebars 4.3.1

Docker env:

openjdk version "17.0.7" 2023-04-18
OpenJDK Runtime Environment (build 17.0.7+7-alpine-r1)
OpenJDK 64-Bit Server VM (build 17.0.7+7-alpine-r1, mixed mode, sharing)

I'm surprised that it can be 2 orders of magnitude difference between in and out of a docker container... but taking a JFR recording it does look like a lot of the time is spent in reallocate:

java.util.Arrays.copyOf(Arrays.java:3633)
  at org.apache.commons.text.TextStringBuilder.reallocate(TextStringBuilder.java:2610)
  at org.apache.commons.text.TextStringBuilder.ensureCapacity(TextStringBuilder.java:1829)
  at org.apache.commons.text.TextStringBuilder.append(TextStringBuilder.java:646)
  at org.apache.commons.text.TextStringBuilder.append(TextStringBuilder.java:619)
  at org.apache.commons.text.TextStringBuilder.append(TextStringBuilder.java:531)
  at com.github.jknack.handlebars.internal.FastStringWriter.append(FastStringWriter.java:55)
  at com.github.jknack.handlebars.internal.Variable.merge(Variable.java:148)
  at com.github.jknack.handlebars.internal.BaseTemplate.apply(BaseTemplate.java:126)
  at com.github.jknack.handlebars.internal.TemplateList.merge(TemplateList.java:95)
  at com.github.jknack.handlebars.internal.BaseTemplate.apply(BaseTemplate.java:126)
  at com.github.jknack.handlebars.internal.BaseTemplate.apply(BaseTemplate.java:114)
  at com.github.jknack.handlebars.internal.ForwardingTemplate.apply(ForwardingTemplate.java:100)

Answer 3 · 2023-06-08T16:04:03.000Z

I'm an idiot, I wasn't actually evaluating the template locally - when I did it had similar performance characteristics to the behaviour in Docker.

The issue is that the buffer char[] is only 1,600 characters by default, and as you evaluate a template it only increases by the amount you can append to it each time. In our case we end up with a 4,000,000 chars, and it mostly increments by 10 to 50 characters, so we literally copy the data tens of thousands of times.

I can "fix" it by setting the system property -Dhbs.buffer=4200000 - but allocating a 4MB buffer for every template seems very wasteful when most are much much smaller. If the buffer doubled each time it was too small that might be a better solution? Or if it used the template size as a starting point?

Answer 4 · 2023-06-08T20:40:44.000Z

I've raised #1066 as a suggested fix for this