HigherOrderCO/hvm-64

The compiled routines dilemma. When should we implement it?


Consider the following programs:

// sum_ctr.hvm1

(Sum (Leaf x)) = x
(Sum (Both a b)) = (+ (Sum a) (Sum b))

(Gen 0 x) = (Leaf x)
(Gen n x) = (Both (Gen (- n 1) (* x 2)) (Gen (- n 1) (+ 1 (* x 2))))

Main = (% (Sum (Gen 22 0)) (<< 1 24))
// sum_lam.hvm1

(Match 0 case_zero case_succ) = case_zero
(Match n case_zero case_succ) = (case_succ (- n 1))

Leaf = λx    λleaf λnode (leaf x)
Node = λa λb λleaf λnode (node a b)

Sum = λtree
  let case_leaf = λx x
  let case_node = λa λb (+ ((Sum) a) ((Sum) b))
  (tree case_leaf case_node)

Gen = λn
  let case_zero =    λx ((Leaf) x)
  let case_succ = λm λx ((Node) ((Gen) m (* x 2)) ((Gen) m (+ (* x 2) 1)))
  (Match n case_zero case_succ)

Main = (% ((Sum) ((Gen) 22 0)) (<< 1 24))
// sum_lam.hvm2

leaf = λx    λleaf λnode (leaf x)
node = λa λb λleaf λnode (node a b)

sum = λtree
  let case_leaf = λx x
  let case_node = λa λb (+ (sum a) (sum b))
  (tree case_leaf case_node)

gen = λn
  let case_zero =    λx (leaf x)
  let case_succ = λm λx (node (gen m (* x 2)) (gen m (+ (* x 2) 1)))
  match n { 0: case_zero; 1+m: (case_succ m) }

Main = (sum (gen 22 0))

We implement a micro-benchmark that allocates and sums all elements of a perfect binary tree, in 3 different ways:

    1. Using constructors + equations on HVM1
    2. Using definitions + lambdas on HVM1
    3. Using definitions + lambdas on HVM2

Benchmarking all three on an Apple M1, we have:

v@v ~/vic/dev/hvm-test$ time hvm run -f sum_lam.hvm -t 1 "Main"
14680064
hvm run -f sum_lam.hvm -t 1 "Main"  3.69s user 0.61s system 99% cpu 4.309 total

v@v ~/vic/dev/hvm-test/sum_lam$ time ./target/release/sum_lam run -t 1 "Main"
14680064
./target/release/sum_lam run -t 1 "Main"  2.43s user 0.57s system 99% cpu 3.007 total

v@v ~/vic/dev/hvm-test$ time hvm run -f sum_ctr.hvm -t 1 "Main"
14680064
hvm run -f sum_ctr.hvm -t 1 "Main"  1.22s user 0.37s system 99% cpu 1.596 total

v@v ~/vic/dev/hvm-test/sum_ctr$ time ./target/release/sum_ctr run -t 1 "Main"
14680064
./target/release/sum_ctr run -t 1 "Main"  0.25s user 0.19s system 98% cpu 0.447 total

v@v ~/vic/dev/hvm-test$ hvm-lang compile sum_lam.hvm2 >> sum_lam.hvmc; time hvmc run sum_lam.hvmc
#14680064
hvmc run sum_lam.hvmc  1.23s user 0.29s system 99% cpu 1.524 total

So, we can make the following ranking:

- HVM1-COMP, sum-ctr: 0.447 s
- HVM2-INTR, sum-lam: 1.524 s
- HVM1-INTR, sum-ctr: 1.596 s
- HVM1-COMP, sum-lam: 3.007 s
- HVM1-INTR, sum-lam: 4.309 s

So, as we can see, on the identical sum-lam program, HVM2 (interpreted) outperforms HVM1 (both interpreted and compiled) by 2x-3x, all in single-core runs. Since the current CUDA implementation is ~35x faster than the single-core one, we can expect HVM2 to complete this task in about 0.04s (1.524s / 35), i.e., almost 100x faster than HVM1's interpreted run. That is really solid and promising.

Yet, thanks to its equational notation, HVM1 allowed us to dispatch calls by pattern-matching on constructors, which, in turn, allowed us to compile routines. This was never explored to its full potential, yet it already made a huge difference, as can be seen in the compiled sum-ctr case, which outperforms all the others: the numeric operations are compiled to actual machine code and never allocate nodes (OP1, OP2, etc.). In its current design, HVM2 features neither constructors nor pattern-matching, which means compiling isn't possible. While this won't make much difference for highly symbolic computations, for most real-world applications the impact would be huge.
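To make the difference concrete, here is a minimal sketch, in plain Rust, of roughly what the compiled sum-ctr routine amounts to. This is not HVM1's actual generated code; the names (Tree, Leaf, Both, sum, gen) just mirror sum_ctr.hvm1. The point is that dispatch is a branch on the constructor tag and the additions are native machine arithmetic, with no OP1/OP2 nodes allocated along the way:

enum Tree {
    Leaf(u64),
    Both(Box<Tree>, Box<Tree>),
}

// Compiled dispatch: a branch on the constructor tag and native arithmetic,
// instead of allocating and reducing interaction-net nodes.
fn sum(tree: &Tree) -> u64 {
    match tree {
        Tree::Leaf(x) => *x,
        Tree::Both(a, b) => sum(a) + sum(b),
    }
}

fn gen(n: u64, x: u64) -> Tree {
    if n == 0 {
        Tree::Leaf(x)
    } else {
        Tree::Both(Box::new(gen(n - 1, x * 2)), Box::new(gen(n - 1, x * 2 + 1)))
    }
}

fn main() {
    // Mirrors `Main = (% (Sum (Gen 22 0)) (<< 1 24))` from sum_ctr.hvm1.
    println!("{}", sum(&gen(22, 0)) % (1u64 << 24)); // prints 14680064
}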

The question is: when should we introduce some form of compiled routines to HVM2? For one, it would likely result in massive (2x to 100x?) speedups, depending on the workload. In fact, without compiled routines, HVM2 is kinda sabotaging itself, as it would be restricted to being a massively parallel interpreter. Yet, how do we deal with the huge complexity bump that such a thing would entail? Implementing HVM2 as an interpreter with just interaction nodes and numbers across all the target environments (Rust, CUDA, Metal, Vulkan...), in sync and correctly, is already a colossal task. As such, sadly, I do not think I or the current team have the capacity to also implement a compiled version of HVM2 across all these targets.

When we release HVM2, we want people to be able to use it to run their programs in a massively parallel fashion and get a feeling of what is to come. Yet, if it turns out HVM needs 10 cores to do what GHC does in one, that will not paint a good picture. So, what do we do? There are 3 options that make logical sense:

  1. We implement HVM2 as an interpreter on CPU+GPU, release it, and make it clear that the first version is interpreter-only and that, even though it is fast, massive speedups can be expected once we add compilation in the future. Without compilation, the numbers wouldn't be nearly as great as they could be, which greatly impacts the public perception and value of the project.

  2. We implement a compiler and restrict ourselves to a single implementation in Rust. That would keep the complexity manageable, but it would mean not shipping with GPU support, and GPUs are clearly vastly superior to CPUs for interaction net reduction. Without GPU support, the numbers wouldn't be nearly as great as they could be.

  3. We somehow raise significant funds (on the order of 100m or so?) and hire enough people to build a fully featured compiler and target GPUs. I don't think that's a likely option.

So, I'm leaving this thread as a testament to the current situation and to the reality that we must pick one of the two for the initial release: a compiled version or a GPU version; and as my own acceptance that, sadly, we won't have both for a while.

tl;dr: on identical programs, HVM2 is 2-3x faster than HVM1 on a single core, and should be up to 100x faster on GPUs, but the fact that it doesn't support constructors means we can't implement a compiler, which may leave us interpreter-only, and that may completely overshadow those gains in many cases. Sadly, we don't have the budget for both compilation and GPU support on HVM2's first release, so we must pick one.

By the way, the way I think compilation should be done is: 1. generate compiled routines for each supercombinator; 2. intercept REF-CTR calls and check whether the call has the right number of arguments; 3. if so, call the compiled routine; otherwise, proceed as normal. Then, identifying tail calls would allow us to perform loops. So, basically, very similar to the OP2 optimization I just pushed, but for general supercombinators. Only once that is implemented on GPUs will HVM2 start reaching its full potential...
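For reference, here is a rough, self-contained sketch of that dispatch over a toy term type. None of these names (Term, Def, unapply, interpret, Range, ...) are the actual hvm-64 API; it only illustrates the control flow described above: one compiled routine per supercombinator, an arity check when a reference is applied, an interpreted fallback, and tail calls running as a loop.

use std::collections::HashMap;

#[derive(Clone, Debug)]
enum Term {
    Num(u64),
    Ref(&'static str),         // reference to a top-level definition
    App(Box<Term>, Box<Term>), // application node
}

struct Def {
    arity: usize,
    // Step 1: a native routine per supercombinator. None means the arguments
    // are not in a shape the compiled code handles, so we fall back.
    compiled: fn(&[Term]) -> Option<Term>,
}

// Peel an application spine ((Ref f) a b ...) into a head and argument list.
fn unapply(mut term: Term) -> (Term, Vec<Term>) {
    let mut args = Vec::new();
    loop {
        match term {
            Term::App(f, a) => { args.push(*a); term = *f; }
            head => { args.reverse(); return (head, args); }
        }
    }
}

// Rebuild an application spine for terms that take the interpreted path.
fn reapply(head: Term, args: Vec<Term>) -> Term {
    args.into_iter().fold(head, |f, a| Term::App(Box::new(f), Box::new(a)))
}

// Interpreted fallback (stub): a real runtime would reduce interaction nets here.
fn interpret(term: Term) -> Term {
    term
}

// Steps 2-3: intercept calls to references. A call with the right number of
// arguments jumps into the compiled routine; anything else proceeds as normal.
// When the compiled routine returns another saturated call (a tail call), we
// keep looping, so the tail call runs as a loop.
fn reduce(mut term: Term, defs: &HashMap<&'static str, Def>) -> Term {
    loop {
        let (head, args) = unapply(term);
        if let Term::Ref(name) = &head {
            if let Some(def) = defs.get(*name) {
                if args.len() == def.arity {
                    if let Some(next) = (def.compiled)(&args) {
                        term = next;
                        continue;
                    }
                }
            }
        }
        return interpret(reapply(head, args));
    }
}

// Hand-written stand-in for a generated routine, for the hypothetical
// tail-recursive supercombinator:
//   (Range 0 acc) = acc
//   (Range n acc) = (Range (- n 1) (+ acc n))
fn range_compiled(args: &[Term]) -> Option<Term> {
    match (&args[0], &args[1]) {
        (Term::Num(0), acc) => Some(acc.clone()),
        (Term::Num(n), Term::Num(acc)) => Some(reapply(
            Term::Ref("Range"),
            vec![Term::Num(*n - 1), Term::Num(*acc + *n)],
        )),
        _ => None, // symbolic arguments: take the interpreted path
    }
}

fn main() {
    let mut defs = HashMap::new();
    defs.insert("Range", Def { arity: 2, compiled: range_compiled });
    // (Range 1000000 0): each tail call is one loop iteration, no net nodes.
    let call = reapply(Term::Ref("Range"), vec![Term::Num(1_000_000), Term::Num(0)]);
    println!("{:?}", reduce(call, &defs)); // Num(500000500000)
}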