typelead/eta

Performance: a standard JArray benchmark runs very slowly relative to Java/Kotlin, etc.

GordonBGood opened this issue · 5 comments

Description

One of the standard things I do when discovering a new language is to try an optimized bit-packed Sieve of Eratosthenes benchmark using only the CPU L1 cache to test for fastest loop speed. I expect any language running on the JVM to perform about as fast as Java/Kotlin/Scala, etc. Eta did not, running several times slower than these, and at speeds more comparable to Frege, the other Haskell-like language for the JVM.
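
To make the code below easier to follow: every version in this thread uses the same odds-only, bit-packed representation, where bit i of a 16384-byte (131072-bit) array stands for the odd number 2*i + 3 and a set bit marks a composite. A minimal sketch of that indexing convention in plain Haskell (illustration only, not part of the benchmark):

import Data.Bits ((.&.), shiftL, shiftR)
import Data.Word (Word8)

-- Bit i of the packed array represents the odd number 2*i + 3;
-- the byte index is i >> 3 and the bit within that byte is i & 7.
oddFor :: Int -> Int
oddFor i = i + i + 3

byteIndex, bitMask :: Int -> Int
byteIndex i = i `shiftR` 3
bitMask   i = 1 `shiftL` (i .&. 7)

-- A set bit means the corresponding odd number has been culled (composite).
isComposite :: (Int -> Word8) -> Int -> Bool
isComposite readByte i =
  readByte (byteIndex i) .&. fromIntegral (bitMask i) /= 0

main :: IO ()
main = print (oddFor 10, byteIndex 10, bitMask 10)  -- (23,1,4)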

Expected Behavior

I had hoped for better performance than Frege, since Eta's compiler is based on GHC, which strips out the "safety" overheads on optimized builds so that the native machine code produced via GHC's LLVM backend can run about as fast as C/C++ code.

Actual Behavior

The benchmark ran two to three times slower than I would expect for code running on the JVM; note that I do not expect the speed of native code output by GHC.

Steps to Reproduce

The following code is the benchmark:

{-# LANGUAGE FlexibleContexts #-}

import Java
import System.CPUTime
import Data.Bits

primesbench :: Int -> IO JByteArray
primesbench lps = java $ do
  cmpsts <- anew 16384 :: Java a JByteArray
  let
    loop x =
      if x <= 0 then return cmpsts
      else
        let
          loopi i =
            if i > 254 then loop (x - 1) else do
              v <- withObject cmpsts $ aget (i `shiftR` 3)
              if v .&. (1 `shiftL` (i .&. 7)) /= 0 then loopi (i + 1) else
                let
                  p = i + i + 3
                  s0 = 2 * i * (i + 3) + 3
                  lmt = min 131072 (s0 + 8 * p)
                  loops s =
                    if s >= lmt then loopi (i + 1) else
                      let
                        msk = 1 `shiftL` (s .&. 7)
                        loopc c =
                          if c >= 16384 then loops (s + p) else do
                            v <- withObject cmpsts $ aget c
                            withObject cmpsts $ aset c (v .|. msk)
                            loopc (c + p)
                      in loopc (s `shiftR` 3)
                in loops s0
        in loopi 0
  loop lps

countem :: IO JByteArray -> IO Int
countem arr = java $ do
  jarr <- io arr :: Java a JByteArray
  let
    go i c =
      if i >= 131072 then return c else do
        v <- withObject jarr $ aget (i `shiftR` 3)
        if v .&. (1 `shiftL` (i .&. 7)) /= 0 then go (i + 1) c
        else go (i + 1) (c + 1)
  go 0 1 -- start the count at 1 to account for the prime 2, which the bit array does not cover

main :: IO ()
main = do
  let jba = primesbench 100 -- warm up to trigger JIT
  cnt <- countem jba
  print cnt
  strt <- getCPUTime
  let fjba = primesbench 1000
  fcnt <- countem fjba
  stop <- getCPUTime
  print fcnt
  print $ (stop - strt) `div` 1000000000

When run with "etlas run -O2", it should have taken about two billion CPU clock cycles on my low-end Intel CPU, but it took on average more than twice that long to produce the correct answer of 23000 (the number of primes up to 262146). Note that the code runs the test routine once before timing it so that JIT compilation is triggered outside the timed run.

Context

This affects the usability of Eta for anything requiring high performance, although IIRC it is still much faster than the same benchmark run on Frege. The Kotlin version takes the expected two billion CPU clock cycles.

Your Environment

Windows 10 64-bit updated to everything except the October 2018 Update, and Eta installed and run from "etlas" version 1.5.0.0.
This is my first use of Eta.
Etlas version 1.5.0.0; Eta version 0.8.6b3

There was a recent improvement on master that cuts out a lot of space leaks that could potentially be one of the factors for the slowness. I'll cut a release today for v0.8.6b4 that includes the change.

Would you be willing to contribute this code as a benchmark in https://github.com/typelead/eta-benchmarks ?

Also, how did you run the benchmark? eta-benchmarks runs using JMH so that warmup time is accounted for. The warmup time for Eta will likely be more than for Java/Scala/Kotlin.

Btw, can you make your cross-language benchmark public? All the languages you're testing (including Eta) have a Gradle plugin, so you can make a multi-module Gradle project and run a comparative benchmark using JMH. I can help you with the Eta part of the setup if you set up the repo.

@rahulmutt, yes, even though the program accounts for JIT warm-up by running the benchmark before timing it (it runs 100 loops untimed before the timed 1000 loops), there were inconsistencies in the timing results indicating some kind of leak, so I'll look forward to trying your new version. It also shows the problem of using average times as the comparison criterion: there was quite a wide range of test times (perhaps due to the leaks you mention), and an average isn't very meaningful when the spread itself is what is somewhat distressing. Hopefully your new version will fix both the wide range of test times and the overall performance.

You are welcome to use the code as part of the benchmark suite. I would contribute it myself, but I don't use Gradle, as most of my JVM projects are small; when they get a little bigger, Kotlin has been my "go-to" language (so far, always looking for something better :) ) and I just develop in JetBrains IntelliJ IDEA if the project is big enough. Even that is rare, as the JVM is no longer my platform of choice.

As to how I ran the benchmark, as described in the OP I just run it once over 100 loops in the code before timing it, to eliminate the JIT time, which indeed seemed a bit long even for the JVM at about 0.75 seconds on my 1.92 GHz CPU.

As to the same benchmark in other languages, you are welcome to use those, too.

The Frege benchmark is why I had dropped the idea of "Haskell on the JVM" until I discovered your repo: even running with only ten loops instead of a thousand, Frege took many times longer to run the same primes benchmark as above, making it over a hundred times slower overall. The Frege source code is as follows:

-- test Frege Sieve of Eratosthenes primes benchmark...
-- compile to sub `build` directory with "java -Xss1m -jar fregec.jar -d build Primes.fr"
-- run with "java -Xss1m -cp build;fregec.jar benchmark.Primes" 
-- (on Windows -cp separator is `;` not `:` on nix systems)

module benchmark.Primes where

import Data.Bits

native currentTimeMillis
  java.lang.System.currentTimeMillis :: () -> IO Long

test :: () -> [Int]
test() = 2 : genList 0
  where
    rslt :: ST s (JArray Byte)
    rslt = do
      !buf <- PrimitiveArrayElement_Byte.newArray 16384
      let
        loop !x =
          if x <= 0 then readonly id buf else
          let
            loopi !i =
              if i > 254 then loop (x - 1) else do
                !v <- getElemAt buf (i `ushiftR` 3)
                if (fromIntegral v) .&. (1 `shiftL` (i .&. 7)) /= 0
                then loopi (i + 1)
                else
                  let
                    p = i + i + 3
                    s0 = (i * (i + 3) `shiftL` 1) + 3
                    slmt = min 131072 (s0 + (p `shiftL` 3))
                    loops !s =
                      if s >= slmt then loopi (i + 1) else
                      let
                        msk = 1 `shiftL` (s .&. 7)
                        loopc !c =
                          if c >= 16384 then loops (s + p) else do
                            !v <- getElemAt buf c
                            setElemAt buf c
                              (fromIntegral (fromIntegral v .|. msk))
                            loopc (c + p)
                      in loopc (s `ushiftR` 3)
                  in loops s0
          in loopi 0
      loop 10
    arr :: JArray Byte
    arr = ST.run rslt
    genList :: Int -> [Int]
    genList i =
      if i >= 131072 then [] else
        let !v = fromIntegral $ elemAt arr (i `ushiftR` 3) in
        if v .&. (1 `shiftL` (i .&. 7)) /= 0 then genList (i + 1)
        else case i + i + 3 of !p -> p `seq` p : genList (i + 1)

-- incredibly slow at about 4,900 milliseconds for 2,000,000 culls
-- or 4700 clock cycles per cull at 1.92 Gigahertz!!!!!
main = do
  let !warmup = length $ test()
  start <- currentTimeMillis()
  let !answer = length $ test()
  stop <- currentTimeMillis()
  println $ show answer
  println $ "It took " ++ show (stop - start) ++ " milliseconds." 

I no longer use Java at all and don't have an up-to-date benchmark for it, but IIRC it runs at about the same speed as Kotlin, whose source code is as follows:

    fun primesBench(): Iterable<Int> {
        val cmpsts = ByteArray(16384)
    
        tailrec fun loop(x: Int) {
            if (x > 0) {
                tailrec fun testloop(i: Int) {
                    if (i <= 254) {
                        if (cmpsts[i shr 3].toInt() and (1 shl (i and 7)) == 0) {
                            val p = i + i + 3
                            val s0 = 2 * i * (i + 3) + 3
                            val slmt = minOf(131072, s0 + (p shl 3))
                            tailrec fun cull(s: Int) {
                                if (s < slmt) {
                                    val msk = 1 shl (s and 7)
                                    tailrec fun cullp(j: Int) {
                                        if (j < 16384) {
                                            cmpsts[j] = (cmpsts[j].toInt() or msk).toByte()
                                            cullp(j + p)
                                        }
                                    }
                                    cullp(s shr 3); cull(s + p)
                                }
                            }
                            cull(s0)
                        }
                        testloop(i + 1)
                    }
                }
                testloop(0); loop(x - 1)
            }
        }
    
        loop(1000)
    
        tailrec fun test(i: Int): Int {
            return if (i < 131072 && cmpsts[i shr 3].toInt() and (1 shl (i and 7)) != 0) {
                test(i + 1)
            } else {
                i
            }
        }
    
        val iter = object : IntIterator() {
            var i = -1
            override fun nextInt(): Int {
                val oi = i
                i = test(i + 1)
                return if (oi < 0) 2 else oi + oi + 3
            }
            override fun hasNext() = i < 131072
        }
        return Iterable { -> iter }
    }

    fun main(args: Array<String>) {
        primesBench() // for warmup of JIT
        val strt = System.currentTimeMillis()
        val answr = primesBench()
        val elpsd = System.currentTimeMillis() - strt
        println(answr.count())
        println("The calculation took $elpsd milliseconds.")
    }

The Scala version is perhaps the fastest of the JVM languages on this benchmark; the code is as follows:

object Benchmark {
  def makeSoE_Primes(loops: Int) = {
    import scala.annotation.tailrec
    val cmpsts = new Array[Byte](16384)
 
    @inline def isPrm(ci: Int): Boolean =
      (cmpsts(ci >>> 3) & (1 << (ci & 7))) == 0
 
    @tailrec def loop(x: Int): Unit =
      if (x > 0) {
        @tailrec def forCndNdxsFrom(cndNdx: Int): Unit =
          if (cndNdx <= 254) {
            if (isPrm(cndNdx)) { // is prime
              val p = cndNdx + cndNdx + 3
              val s0 = (cndNdx * (cndNdx + 3) << 1) + 3
              val slmt = scala.math.min(131072, s0 + (p << 3))

              @tailrec def loopPatterns(s: Int): Unit =
                if (s < slmt) {
                  val msk = 1 << (s & 7)
 
                  @tailrec def cullPrmCmpstsFrom(cmpstNdx: Int): Unit =
                    if (cmpstNdx < 16384) {
                      cmpsts(cmpstNdx) = (cmpsts(cmpstNdx).toInt | msk).toByte
                      cullPrmCmpstsFrom(cmpstNdx + p) }
 
                  cullPrmCmpstsFrom(s >>> 3); loopPatterns(s + p) }
              loopPatterns(s0) }; forCndNdxsFrom(cndNdx + 1) }
        forCndNdxsFrom(0); loop(x - 1)
      }; loop(loops)
 
    @tailrec def getNxtPrmFrom(cndNdx: Int): Int =
      if ((cndNdx >= 131072) || isPrm(cndNdx)) cndNdx + cndNdx + 3
      else getNxtPrmFrom(cndNdx + 1)
 
    Iterator.single(2) ++
      Iterator.iterate(3)(p => getNxtPrmFrom(((p + 2) - 3) >>> 1))
        .takeWhile(_ <= 262146)
  }
}

object Main extends App {
  import Benchmark._
  val warmup = makeSoE_Primes(100).size
  val strt = System.nanoTime()
  val answr = makeSoE_Primes(1000)
  val end = System.nanoTime()
  val count = answr.size
  println(s"Successfully completed without errors. [total ${(end - strt) / 1000000} ms]")
  println(f"Found $count primes up to 262146.")
}

As I said, feel free to add these benchmark tests to your benchmark comparison repo. There is nothing special about them other than that they are written to have no memory-speed bottlenecks (the data fits entirely within the CPU L1 cache) and that they test minimum loop speed, since the inner loop, where almost all of the time is spent, is about as tight as a loop can be made without manual unrolling.

For comparison, machine code/C/C++/Nim/Swift/Rust/Haskell (languages that compile to efficient machine code) run this benchmark at about 2.5 CPU clock cycles per cull operation (inner-loop iteration); we don't expect that of JVM languages, with their mandatory array bounds checking and less efficient JIT-produced code, but they usually run at about 6 to 10 CPU clock cycles per cull (there are about 0.198 billion cull operations for the 1000 loops as tested, which is where the expected two billion clock cycles comes from).
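
To spell out that arithmetic, here is a back-of-the-envelope check using only the figures quoted in this thread (not new measurements):

main :: IO ()
main = do
  let culls         = 0.198e9 :: Double  -- cull operations over the 1000 timed loops
      cyclesPerCull = 10                 -- typical figure for a JVM language on this benchmark
      clockHz       = 1.92e9             -- the 1.92 GHz CPU mentioned above
      totalCycles   = culls * cyclesPerCull
  putStrLn $ "total cycles ~ " ++ show totalCycles                    -- ~2 billion
  putStrLn $ "seconds at 1.92 GHz ~ " ++ show (totalCycles / clockHz) -- ~1.03 s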

As things progress, there are a couple of other micro-benchmarks I could contribute: a functional incremental Sieve of Eratosthenes based on Co-Inductive Streams (it will be much slower than this one, but is interesting because it shows how well you manage many object allocations and deallocations), and a Hamming (regular) number generator based on lazy lists, which we might have to construct ourselves as I don't think your lists are lazy (?), and which shows how efficiently that can be done. But I'll wait to see how far we get with this benchmark first.
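
For reference, the lazy-list Hamming generator I have in mind is essentially the classic merge-based formulation; a minimal sketch in plain GHC Haskell (assuming ordinary lazy lists, so not yet tuned for Eta):

-- Every Hamming number is 1 or is 2, 3, or 5 times a smaller Hamming number,
-- so the lazy list can be defined corecursively in terms of itself.
hammings :: [Integer]
hammings = 1 : merge3 (map (2 *) hammings) (map (3 *) hammings) (map (5 *) hammings)
  where
    merge3 xs ys zs = merge xs (merge ys zs)
    merge xs@(x:xt) ys@(y:yt)
      | x < y     = x : merge xt ys
      | x > y     = y : merge xs yt
      | otherwise = x : merge xt yt  -- equal heads: emit once to drop duplicates
    merge xs []   = xs
    merge [] ys   = ys

main :: IO ()
main = print (hammings !! 1690)  -- 1691st Hamming number: 2125764000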

As far as Eta goes, given its early stage of development, being within a factor of two or three of the fastest JVM languages for this benchmark isn't all that bad. But you'll be fighting a bit of a disadvantage in building a "Haskell for the JVM" unless you figure out how to do as GHC does and strip the "monad" state-token fields before emitting the final p-code. It seems that Frege has not successfully done that; it also compiles much more slowly than Eta, which is quite a pleasure to use from the compilation-speed standpoint.
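
To illustrate what that "monad" token is: in GHC, IO is internally just a state-token-passing function, and the State# RealWorld token is a zero-width value that the code generator erases entirely, so the monadic plumbing costs nothing at runtime. A rough sketch of that representation (the real definition lives in GHC.Types):

{-# LANGUAGE MagicHash, UnboxedTuples #-}
import GHC.Exts (State#, RealWorld)

-- GHC's IO is (roughly) this: a function threading an abstract world token.
-- State# RealWorld has zero width at runtime, so the optimized machine code
-- contains no trace of the token; a JVM backend would want the same erasure.
newtype FakeIO a = FakeIO (State# RealWorld -> (# State# RealWorld, a #))

main :: IO ()
main = putStrLn "the state token exists only at the type level"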

I suppose I am not so enamored with Haskell that I would have chosen it as the language to implement for the JVM, but an F# emitting machine code and efficient JVM code would be quite attractive (there is already an F# for web development in Fable, and now that .NET Core is open source, F# itself runs on every platform). F# would probably have been a little easier to deal with, as mutability is a (discouraged) option and you would not have had to deal with monadic code and pervasive laziness as you must in order to be completely compatible with Haskell.

@rahulmutt, stop the press!!!

Following your advice in #931 that one can use Data.Time by adding time to the dependencies in the project's cabal file and then calling getCurrentTime, the timings immediately became both more consistent and faster, at about the speed of Kotlin!

I then followed your advice on changing the target method signature so that Java's currentTimeMillis works, and the timings got even more consistent and faster still, at about eight CPU clock cycles per cull operation!
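
For anyone following along, the two timing approaches referred to above look roughly like this (a sketch that assumes the time package has been added to the cabal file's build-depends; the currentTimeMillis foreign import is the same one used in the IArray/MArray version at the end of this thread):

import Data.Time.Clock (getCurrentTime, diffUTCTime)
import Data.Int (Int64)

-- Timing via Java's System.currentTimeMillis, imported directly:
foreign import java unsafe "@static java.lang.System.currentTimeMillis"
  currentTimeMillis :: IO Int64

-- Timing via the time package's wall clock:
timeIO :: IO a -> IO a
timeIO action = do
  start  <- getCurrentTime
  result <- action
  stop   <- getCurrentTime
  putStrLn $ "Elapsed: " ++ show (diffUTCTime stop start)
  return result

main :: IO ()
main = do
  t0 <- currentTimeMillis
  timeIO (print (sum [1 .. 1000000 :: Int]))
  t1 <- currentTimeMillis
  putStrLn $ "Elapsed (ms): " ++ show (t1 - t0)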

So the problem isn't so much with performance as with the implementation of System.CPUTime.getCPUTime, and that is mostly what needs to be investigated.

Congratulations on such good performance at such an early stage in the development!

As discussed in issue #934, and to be used in developing the better approach proposed in issue #940, here is a benchmark that uses a conventional Haskell IArray/MArray to do the same work as the other examples; it runs over 10 times slower than the others, for the possible reasons given by @rahulmutt in issue #934:

import Data.Int (Int64) -- needed for the Int64 result of currentTimeMillis
import Data.Word
import Data.Bits
import Data.Array.Base
import Control.Monad.ST
import Data.Array.ST (runSTUArray, STUArray(..))

foreign import java unsafe "@static java.lang.System.currentTimeMillis"
  currentTimeMillis :: IO Int64

primesbench :: Int -> [Int]
primesbench lps = makelist
  where
    composites = runSTUArray $ do
      cmpsts <- newArray (0,16383) 0 :: ST s (STUArray s Int Word8)
      let
        loop x =
          if x <= 0 then return cmpsts else
            let
              loopi i =
                if i > 254 then loop (x - 1) else do
                  v <- unsafeRead cmpsts (i `shiftR` 3)
                  if v .&. (1 `shiftL` (i .&. 7)) /= 0 then loopi (i + 1) else
                    let
                      p = i + i + 3
                      s0 = 2 * i * (i + 3) + 3
                      lmt = min 131072 (s0 + 8 * p)
                      loops s =
                        if s >= lmt then loopi (i + 1) else
                          let
                            msk = 1 `shiftL` (s .&. 7)
                            loopc c =
                              if c >= 16384 then loops (s + p) else do
                                v <- unsafeRead cmpsts c
                                unsafeWrite cmpsts c (v .|. msk)
                                loopc (c + p)
                          in loopc (s `shiftR` 3)
                    in loops s0
            in loopi 0
      loop lps
    makelist = 2 : go 0
      where
        go 131072 = []
        go i =
          let v = unsafeAt composites (i `shiftR` 3) in
          if v .&. (1 `shiftL` (i .&. 7)) /= 0 then go (i + 1)
          else i + i + 3 : go (i + 1)

main :: IO ()
main = do
  let wrmup = primesbench 100
  print $ length wrmup
  strt <- currentTimeMillis
  print $ length $ primesbench 1000
  stop <- currentTimeMillis
  print $ stop - strt