oracle-samples/clara-rules

[performance] Reduce number of calls to eval by the compiler

mrrodriguez opened this issue · 7 comments

In #378 there was discussion about eval becoming a large bottleneck in the Clara compiler when it comes to large rulesets.
Some relevant comments are:

(1) As Ryan mentioned there, I think that one avenue we could explore would be to "merge" all of the eval code into one unified structure prior to calling eval so that we end up with only one batch call.

(2) I also have noticed that accumulator forms are always evaled with no attempt to avoid duplicate forms being evaled independently. If a large ruleset used the same accumulator form often, this may add up to a lot of unnecessary evals.
A duplicate accumulator could be detected by analyzing the form along with any lexical environment being captured at the time.

(3) There may even be ways to generalize (2) in an attempt to get more sharing of join filter functions on other beta nodes. These forms are a bit more complex though, so care would have to be taken. One particular thing that would need to be handled carefully is the use of metadata on parts of the form, such as type hints. Type hints could cause two forms that compare as equal to not actually be equivalent for the purposes of eval.
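
For example (just an illustration, not actual Clara condition forms): Clojure's = ignores metadata, so a naive equality check would conflate a hinted and an unhinted form even though they would compile differently.

;;; Two otherwise-identical forms, one carrying a type hint on the argument.
(def plain  '(fn [x] (.length x)))
(def hinted '(fn [^String x] (.length x)))

;;; `=` ignores metadata, so naive deduplication would treat them as the same form...
(= plain hinted)
;;= true

;;; ...even though the hint is still there as metadata and changes the compiled
;;; result (a direct method call vs. a reflective one).
(meta (first (second hinted)))
;;= {:tag String}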

It should be noted that any change to the eval'ed structures would likely affect Clara durability as well.

I think this is on the right track. We might consider approaching this with an additional step in our compilation process. Right now we convert to the beta-graph structure, then run compile-beta-graph on that.

For instance, what if we had something like an "extract-beta-functions-to-eval" step that gathered everything we needed to eval from the beta graph, then eval'ed it all at once? The existing compile-beta-graph would then look up the corresponding function and use it, rather than calling eval each time. This would eliminate runtime overhead since the node still calls the function directly...we're just eval'ing everything together into a larger structure that is used at compile time.
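
As a rough sketch of that idea (hypothetical names, not the actual compiler code): collect each node's form, eval them all in one call, and hand each node its compiled function by position.

;;; Hypothetical helper: eval all node fn forms in one call and return a
;;; lookup from node id to compiled function.
(defn batch-eval-node-fns [node-id->form]
  (let [node-ids (vec (keys node-id->form))
        ;; One eval compiles every form; the result is a vector of fns in
        ;; the same order as node-ids.
        compiled (eval (mapv node-id->form node-ids))]
    (zipmap node-ids compiled)))

;;; e.g. (batch-eval-node-fns {:node-1 '(fn [x] (inc x))
;;;                            :node-2 '(fn [x] (dec x))})
;;; compile-beta-graph would then look up the fn for a node id instead of
;;; calling eval per node.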

@rbrush

For instance, what if we had something like an "extract-beta-functions-to-eval" step that gathered everything we needed to eval from the beta graph, then eval'ed it all at once? The existing compile-beta-graph would then look up the corresponding function and use it, rather than calling eval each time. This would eliminate runtime overhead since the node still calls the function directly...we're just eval'ing everything together into a larger structure that is used at compile time.

Thanks for that clarification! I was not thinking about it correctly. I overlooked that these evals could be made in batch, but still directly "linked" with their appropriate nodes during the compilation cycle (and not at runtime). I'll edit the description to remove that misleading note.

For reference if anyone else is curious about these comments, I removed this note from (1) in the description:

There may be a disadvantage to this approach in that it may be hard to do without adding runtime overhead. For example, if all functions were kept in some sort of "indexed" data structure, each call from a node would first have to look up the fn by index and then call it. This is one more step to reach the fn than what is done right now. This may be OK, and the degradation could probably be made negligible.

The strategy described would probably reduce session compilation times if making more calls to eval takes more time than a single call with a data structure of all the same elements. This may very well be the case. One thing that jumps out at me is that Clojure's compiler apparently creates a new DynamicClassLoader every time eval is called. Since the class cache for DynamicClassLoader is a class variable, this might not make much difference for class access times, but having that many ClassLoader instances created seems like it might not be a usage pattern expected by the JVM developers and so might not be particularly optimized.
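
One quick way to observe those per-eval loaders from a REPL, assuming that behavior holds: functions produced by separate eval calls report different defining class loaders.

(def f1 (eval '(fn [] 1)))
(def f2 (eval '(fn [] 2)))

;;; Each compiled fn class was defined by the DynamicClassLoader created for
;;; its own eval call, so the loaders differ.
(identical? (.getClassLoader (class f1))
            (.getClassLoader (class f2)))
;;= false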

All this being said, I also think it is possible that this has nothing to do with the number of eval calls in and of itself. Other possibilities include:

  • That compiling the amount of code that we do simply takes a lot of time.
  • That some pattern in our generated Clojure functions is time-consuming to compile.
  • That eval takes longer to run than normal compilation of equivalent Clojure code for some reason.
  • That the JVM perceived memory pressure upon creation of an enormous number of classes via Clojure eval and for some reason tried to GC them; this is unlikely based on profiling showing plenty of memory for the example at Cerner but isn't something I'd entirely rule out since perhaps the profiling is misleading in some way.

I'd suggest gathering concrete benchmarks showing that evaluation of Clara rules forms is indeed faster in a large batch than one at a time before spending lots of time and effort implementing the change described above. This can be done manually in a Clojure REPL without actually creating a session, possibly by repeated evaluation of the same form.

@WilliamParker

I figured I'd make a quick REPL demo to see if the idea has any potential.

;;; Mixing up forms just for the fun of it
(def parts (cycle ['(+ x y)
                   '(list 1 x y z)
                   '[x y z]]))

;;; Sort of similar to (but smaller than) something Clara might compile
(defn mk-form [part n]
  `(fn ~'[x y]
     (do
       (let [~'z ~n]
         ~part))))

;;; Tweak to see differences as number of forms grows
(def fcount 1000)

(def forms
  (vec (for [n (range fcount)
             :let [p (nth parts n)]]
         (mk-form p n))))

;;; Eval all individually
(time (doseq [f forms] (eval f)))
;; "Elapsed time: 924.176176 msecs"

;;; Eval all in a "big vector"
(time (do (eval forms) :done))
;; "Elapsed time: 350.762857 msecs"

Of course, there may be problems with this benchmark in terms of how representative it really is. It looks promising though. I believe the computed evals are the same; I can verify by doing something like:

(def many-evals (vec (for [f forms] (eval f)))) ; Instead of `doseq`, `vec` + `for`
(def one-eval (eval forms))

(= ((apply juxt many-evals) 1 2) ((apply juxt one-eval) 1 2))
;;= true

ADDED/EDITED:

I was hoping to get some good insight from profiling the above to see why compiling in batches is faster than compiling individually. After asking a bit on the #clojure Slack, I believe I realized what I was overlooking.

In profiling this I noticed that the slower, "many individual eval calls" approach was spending more time in creating classes. I didn't get why at first.

It turns out each call to eval will create a class for the result of the eval. (This is fairly obvious given how the JVM works: you basically have to have a class to represent executable code.) So there is a class-creation overhead for each eval call. If you batch up the forms, you reduce the number of classes needed. This aligns with Ryan's initial hunch that there was some "minimal" overhead for a call to eval.

You can see this in the REPL by counting the class files written under the *compile-path* directory after running:

(binding [*compile-files* true]
  (eval '(fn [x] x))
  (eval '(fn [y] y))
  (eval '(fn [z] z)))

;;; 6 classes

;;; vs

(binding [*compile-files* true]
  (eval '[(fn [x] x)
          (fn [y] y)
          (fn [z] z)]))

;;; 4 classes
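
A small helper could count those emitted class files instead of eyeballing the directory (just a convenience sketch, assuming the *compile-path* directory exists):

(require '[clojure.java.io :as io])

(defn class-file-count []
  ;; Count the .class files the compiler wrote under *compile-path*.
  (->> (file-seq (io/file *compile-path*))
       (filter #(.endsWith (.getName ^java.io.File %) ".class"))
       count))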

NOTE on bytecode size:

I should also add that there is a risk of eval'ing forms that are too large and getting "Method code too large" compilation errors. This can be seen if you take what I had above and change it to:

(def fcount 10000)

;;; Redefine `forms` as above

(time (do (eval forms) :done))
;; CompilerException java.lang.RuntimeException: Method code too large!, compiling:(*cider-repl <etc>) 

Batching still performs better due to fewer classes being created overall (as explained above). To stay under the method size limit, the batching can be modified to something like:

(def batches
  (mapv vec (partition-all 500 forms)))

This does 500 forms at a time. The batched evaluation is still significantly faster:

(time (doseq [f forms] (eval f)))
;; "Elapsed time: 18207.116637 msecs"

(time (doseq [b batches] (eval b)))
;; "Elapsed time: 8898.366552 msecs"

Mini example of trying this out

I have a really rough demo branch where I messed around with batching evals.
There'd be a lot of work to do to make it realistic though.
https://github.com/mrrodriguez/clara-rules/tree/iss381-batch-evals

Issues that have to be addressed:

  1. Clara durability likely needs to batch its eval calls in a similar way.
  2. The error messages will be worse if we do not work to relate them back to their original line numbers and columns. I think there are ways to approach this. One may be to ensure the correct line/col/file metadata is on the forms that are being eval'ed in batch.
  3. There is likely a reasonable number to choose for the "eval batch size". However, it should be configurable in case the consumer happens to have unusually large chunks of code that get spliced into the forms being evaled.

I agree with @mrrodriguez's diagnosis of what is happening here. For completeness, I'll include some decompiled classes to demonstrate this.

(def form `(fn [a#] (let [b# (:b a#)] (inc b#))))
(eval form)

This creates two classes: one which is clearly the actual implementation of the function, since it has a call to "inc", and one like the following:

package eval_test;

import clojure.lang.AFunction;

public final class core$eval40040
  extends AFunction
{
  public Object invoke()
  {
    return invokeStatic();
  }

  public static Object invokeStatic()
  {
    return new core.eval40040.fn__40041();
  }
}

The following creates three classes, two of which are actual function implementations and one of which is similar in purpose to the class above:

(eval [form form])

package eval_test;

import clojure.lang.AFunction;
import clojure.lang.Tuple;

public final class core$eval40047
  extends AFunction
{
  public Object invoke()
  {
    return invokeStatic();
  }

  public static Object invokeStatic()
  {
    return Tuple.create(new core.eval40047.fn__40048(), new core.eval40047.fn__40051());
  }
}

Basically, each eval call on a body that creates a new function will create at least two classes: one that defines the actual function and one that creates an instance of the function. However, if multiple functions are defined at once, a single class can handle creating instances of all of them. This single class will also be smaller in bytecode than the individual classes that would otherwise fulfill its purpose, due to decreased overhead in the "header" of each class. This explains why, for a variety of benchmarks like the one in this comment, I found that evaluating each form individually roughly doubled the time compared with evaluating the forms in a single collection.

On a related note, it appears that the creation of a new DynamicClassLoader for each eval call isn't impacting performance much. I tried creating a patched Clojure with a way to manually call clojure.lang.Compiler/eval such that the existing value of LOADER would be used, and there were no meaningful performance improvements. The reason Clojure behaves this way is so that classes can be redefined with the same name. For example, when I replaced

public static Object eval(Object form, boolean freshLoader) {
	boolean createdLoader = false;
	if(true)//!LOADER.isBound())
// Rest of method

with

public static Object eval(Object form, boolean freshLoader) {
	boolean createdLoader = false;
	if(!LOADER.isBound()) {
// Rest of method

I got errors like the following

compile-clojure:
     [java] Exception in thread "main" java.lang.ExceptionInInitializerError
     [java] 	at clojure.lang.Compile.<clinit>(Compile.java:29)
     [java] Caused by: java.lang.LinkageError: loader (instance of  clojure/lang/DynamicClassLoader): attempted  duplicate class definition for name: "clojure/core$dotimes", compiling:(clojure/core.clj:3305:1)

Note that dotimes is defined multiple times, namely here and here. When I inserted some debugging into the Clojure codebase I found that the same class name "clojure/core$dotimes" was being used to define a class multiple times. I'm still not certain where the exception is being thrown from - it seems to come from somewhere in defineClass1 in the JVM native code - but the behavior empirically seems to be that if the same classloader defines a class with the same name twice, an exception is thrown. In Clojure, the semantics are that the new class replaces the old one for all callers that haven't yet resolved the class name in question and stored the result, but this isn't a normal Java classloading pattern and I can see why the JVM would consider it invalid for more typical Java applications.

Regarding the implementation here, I think the approach outlined by @mrrodriguez is basically on target. Some points on that:

  • We will indeed need to fix durability. The motivating use-case at Cerner is actually occurring during deserialization of a session, so just fixing the initial compilation wouldn't fix that problem. It could potentially be done in separate PRs, perhaps even deserialization first.
  • The benefit of having larger batches of eval'ed items will diminish quickly for decently large numbers. For example, if we have 1000 classes, going from 1 at a time to 100 prevents the creation of 990 classes, but going from 100 to 1000 at a time only prevents the creation of 9 classes. I'd be inclined to be conservative with the number we batch to stay under the JVM method size limit, but we can still make it user-configurable.
  • Regarding losing context information on failed eval calls, we could just try eval'ing as a batch first, and then individually eval with full context if that throws an exception. It would take extra time, but it would only happen in an error case so I don't think it would matter in practice.
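
A rough sketch of the fallback idea in that last bullet (hypothetical helper, not Clara code): try the whole batch first, and only on failure re-eval forms one at a time so the failing form surfaces with its own context.

(defn eval-batch-with-fallback [forms]
  (try
    ;; Fast path: one eval for the whole batch.
    (eval (vec forms))
    (catch Exception _
      ;; Slow path, only hit on error: eval individually so the exception
      ;; can be tied back to the specific form that failed.
      (mapv (fn [form]
              (try
                (eval form)
                (catch Exception e
                  (throw (ex-info "Failed to eval rule form" {:form form} e)))))
            forms))))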

So overall, it looks like we can cut times significantly, although they're still going to be higher than we'd like. I wonder if there is some other source of inefficiency as well, since even accounting for this problem it seems like eval is performing poorly relative to compiling a huge number of Clojure files. Perhaps Clojure itself being AOT'ed mitigates many of those problems though.

@WilliamParker

Those were some good details above at #381 (comment)

All of what you said makes sense. I think a durability related change would be similar in nature to the changes to the compiler. It does seem that they could potentially be split into two issues if that is helpful for whatever reason. Perhaps some of the batching mechanism could be shared across the two.

A few notes:
(1)

Regarding losing context information on failed eval calls, we could just try eval'ing as a batch first, and then individually eval with full context if that throws an exception. It would take extra time, but it would only happen in an error case so I don't think it would matter in practice.

Re-running one-by-one may be a simple enough brute-force sort of approach. It may also be possible to add the appropriate clj compiler metadata to forms to "map back" to the original code from a given compiler exception. I haven't spent any time digging into that possibility though.
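
A sketch of that metadata idea, under the assumption that the Clojure compiler consults :line/:column metadata on list forms when reporting errors (hypothetical helper, not something from the demo branch):

(defn with-source-location
  "Re-attach the original source location to a form before it is spliced
  into a batch, so a compiler error can point back at the rule source."
  [form {:keys [line column]}]
  (if (instance? clojure.lang.IObj form)
    (vary-meta form assoc :line line :column column)
    form))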

(2)

Yes, Clojure uses a different class loader to be able to redefine classes with the same name. This is a common technique on the JVM for doing this sort of thing. Class loaders shouldn't/can't redefine classes in a mutable fashion. This is laid out in the JVM spec:

Given the same name, a good class loader should always return the same Class object.

(3)

In relation to the class loaders in (2), Will and I spoke briefly offline about this. I'll add that the new class loader created per eval will (I believe) be the defining class loader of the eval "shell" class and all other classes compiled in that eval call. So each eval may also add the memory overhead of not only an extra class, but also a class loader that must remain on the heap as long as classes it defines are in use.

As #396 has been merged, I am going to close this issue.