Eta version 0.8.6b4 change in behavior for un-evaluated thunks...
GordonBGood opened this issue · 7 comments
Description
For the primes benchmark that was successfully changed to be very performant in #932, the program compiles but crashes with a StackOverflowError in the recursive call to the worker function of `countem` at line 49. The behavior on overflow is quite strange as well: there seem to be many stacked exceptions rather than a single exception and program exit, so many screens full of exception errors are produced instead of just one.
When `BangPatterns` is enabled and the `c` accumulated-count parameter has a "bang" applied to it to force evaluation every loop, the program works; performance might be just a little slower than previously, but it may also be more consistent.
Expected Behavior
Haskell behavior, which is slow for un-evaluated thunks but doesn't crash. This may be because in Haskell thunks are built on the heap, not the stack...
Actual Behavior
Described above...
Possible Fix
This might be something to do with the Strictness Analyser, since it appears that a parameter which formerly didn't need to be forced is now evaluated lazily, with thunks built up on the stack...
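For illustration, here is a minimal sketch of the same failure mode in plain GHC Haskell (my own reduction, not the benchmark itself): without the bang, each recursive call wraps the accumulator in another thunk that is only forced at the end.

```haskell
{-# LANGUAGE BangPatterns #-}

-- Hypothetical reduction of the countem loop: count up to n.
-- Without the bang on c, every recursive call wraps c in another
-- (+ 1) thunk; the whole chain is only forced when the result is
-- demanded, which can overflow the stack for large n.
countTo :: Int -> Int
countTo n = go 0 0
  where
    go i !c
      | i >= n    = c
      | otherwise = go (i + 1) (c + 1)

main :: IO ()
main = print (countTo 1000000)
```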
Steps to Reproduce
Run the following program without the "bang" (`!`) before the `c` parameter of the `go` function inside the `countem` function; run with "etlas run -O2":
```haskell
{-# LANGUAGE FlexibleContexts, BangPatterns #-}

import Java
import Data.Bits

foreign import java unsafe "@static java.lang.System.currentTimeMillis"
  currentTimeMillis :: IO Int64

primesbench :: Int -> IO JByteArray
primesbench lps = java $ do
  cmpsts <- anew 16384 :: Java a JByteArray
  let
    loop x =
      if x <= 0 then return cmpsts
      else
        let
          loopi i =
            if i > 254 then loop (x - 1) else do
              v <- withObject cmpsts $ aget (i `shiftR` 3)
              if v .&. (1 `shiftL` (i .&. 7)) /= 0 then loopi (i + 1) else
                let
                  p = i + i + 3
                  s0 = 2 * i * (i + 3) + 3
                  lmt = min 131072 (s0 + 8 * p)
                  loops s =
                    if s >= lmt then loopi (i + 1) else
                      let
                        msk = 1 `shiftL` (s .&. 7)
                        loopc c =
                          if c >= 16384 then loops (s + p) else do
                            v <- withObject cmpsts $ aget c
                            withObject cmpsts $ aset c (v .|. msk)
                            loopc (c + p)
                      in loopc (s `shiftR` 3)
                in loops s0
        in loopi 0
  loop lps

countem :: IO JByteArray -> IO Int
countem arr = java $ do
  jarr <- io arr :: Java a JByteArray
  let
    go i !c = -- the "bang" goes here to fix the problem...
      if i >= 131072 then return c else do
        v <- withObject jarr $ aget (i `shiftR` 3)
        if v .&. (1 `shiftL` (i .&. 7)) /= 0 then go (i + 1) c
        else go (i + 1) (c + 1)
  go 0 1

main :: IO ()
main = do
  let jba = primesbench 100
  cnt <- countem jba
  print cnt
  strt <- currentTimeMillis
  let fjba = primesbench 1000
  fcnt <- countem fjba
  stop <- currentTimeMillis
  print fcnt
  print $ stop - strt
```
To fix it, put the "bang" back in ;)
Context
We are trying to prove that Eta produces efficient code using the benchmark. In Haskell, we would find this sort of bug by compiling with "-rtsopts", running with the "+RTS -s" argument to the runtime, and then looking at the amount of heap space used and the time spent in GC, but that doesn't seem to work in Eta. Is there an alternative to show heap and GC usage (although in this case it seems that the un-evaluated thunks are accumulated on the stack, not the heap, which is the problem)?
Your Environment
- Did you install an older version of Eta/Etlas before? Yes, Eta version 0.8.6b3, on which the "bang" wasn't necessary.
- Current Eta & Etlas version: Eta version 0.8.6b4; Etlas version 1.5.0.0
- Operating System and version: Windows 10 64-bit, updated to everything except the October 2018 Update
Since Eta is a JVM language, you can just use standard JVM profiling tools. You can go to your command line and type `jvisualvm` to load up VisualVM, which gives you nice diagnostics on memory usage, and you can also do sampling to figure out which objects are being allocated the most.
If the program runs too fast to connect to it, then use hprof. To translate it to your setup, if it says to do:

```
java -agentlib:hprof[=options] ToBeProfiledClass
```

then you do:

```
export ETA_JAVA_ARGS="-agentlib:hprof[=options]"
etlas run
```
Hope that helps!
Will need to investigate why the behavior changes between versions. SOEs are known to happen if you build up a long chain of thunks and evaluate them at the end. You may be able to get around that with the `trampoline` function depending on the evaluation tree, but oftentimes it means you should be more strict in your accumulator, as you observed. These classes of space leaks are easier to identify with Eta, and we could change the code generation so that it doesn't throw an SOE, but oftentimes it's a mistake to leave such a giant chain of thunks lying around like that.
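For illustration (a plain GHC sketch, not Eta-specific): the standard `foldl` vs `foldl'` pair shows exactly this trade-off, deferring the work as a thunk chain versus forcing the accumulator each step.

```haskell
import Data.List (foldl')

-- foldl builds (((0 + 1) + 2) + 3) ... lazily; the chain is only
-- collapsed when the final result is demanded, which is exactly the
-- long chain of thunks that can trigger an SOE.
lazySum :: [Int] -> Int
lazySum = foldl (+) 0

-- foldl' forces the accumulator on every step, so the evaluation
-- depth stays constant: the same effect as a bang on the accumulator.
strictSum :: [Int] -> Int
strictSum = foldl' (+) 0

main :: IO ()
main = print (strictSum [1 .. 1000000])
```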
@rahulmutt, yes, I realize what the problem is, and I should have forced evaluation from the start. However, it was lucky I didn't, in that we can now consider why the change between versions 0.8.6b3 and 0.8.6b4 produced the (more expected) long chain of thunks when the earlier version didn't.
I agree that, even with Haskell, these types of "leaks" due to chains of un-evaluated thunks are almost always a problem that needs to be eliminated for performant code; my raising the issue is just to prompt you to investigate why it didn't occur with the old version, to look at ways of being sure such leaks don't exist, and to document what to check for and how to eliminate them for new users of Eta.
Your advice about using "jvisualvm" and "hprof" is invaluable; I think I already said in another issue that I'm no JVM expert. It turns out that my JDK 11 didn't contain "jvisualvm", but "VisualVM" can be downloaded from Oracle and installed, which I have done and tried. As to using "hprof", I presume that your "etlas" or "eta" checks for the environment variable "ETA_JAVA_ARGS" and passes whatever value it has as an extra parameter to the runtime when the program is executed? Application of those two hints should cover all of my needs.
My comments in the performance issue of "Why Haskell on the JVM and not F#?" are related to this issue: not for me, as I already know Haskell well enough to adapt to Eta, but for other JVM programmers who use Scala, or especially Kotlin or Java, and who may find it difficult to come to prefer Eta given the problem of always having to consider and deal with laziness. F# would be somewhat easier for them, as laziness there is not the default and must be opted into, and (having to run on the CLI just as Eta runs on the JVM, both object-oriented platforms) it already has support for object-oriented programming baked in.
It appears that you have already solved the problem of supporting classes and objects in your FFI, so that likely isn't an issue, but this problem of always having to consider laziness is quite a big one. However, I see from your FAQ that you have plans to support the `Strict` and `StrictData` extensions, so when you do, these problems can easily go away for new converts to Eta just by putting "{-# LANGUAGE Strict #-}" at the top of all of their Eta source files. In the meantime, I think you perhaps need more discussion and examples, in the user-guide documentation, of how to deal with everything being potentially lazy. Many potential users don't find Haskell documentation easy to read and understand.
Due to how I saw one of the library functions was written (`arrayToList`, as I discuss below), I had thought that your lists might not be lazy, but I have confirmed that they are. One would write `arrayToList` as it currently exists if lists weren't lazy, as it builds the list in an accumulator, which either forces evaluation of every element of the list in advance or (if laziness is considered) builds a long chain of thunks with a potential SOE, as here:
```haskell
{-# INLINE arrayToList #-}
arrayToList :: JArray e c => Java c [e]
arrayToList = do
  len <- alength
  go (len - 1) []
  where go n xs
          | n >= 0 = do
              x <- aget n
              go (n - 1) (x:xs)
          | otherwise = return xs
```
whereas considering lazy lists, the more "Haskellish" way of writing this would be as follows:
```haskell
{-# INLINE arrayToList #-}
arrayToList :: JArray e c => Java c [e]
arrayToList = do
  len <- alength
  let go n
        | n >= len  = []
        | otherwise = do
            x <- aget n
            x : go (n + 1)  -- sketch only; doesn't type-check in the Java monad
  go 0
```
This does not build up the stack if lazy lists are implemented conventionally, as each list tail is just a thunk that is evaluated or not on demand, and the list is entirely lazy with no evaluation in advance. However, in the current Eta concept of JArray's this won't compile, because we are trying to produce a list in the context of the Java JArray, which is perhaps why you use the pre-evaluated list production.
In Haskell, this is easy because we generate non-monadic immutable arrays to/from lists, and such immutable arrays can easily be indexed, etc., because they don't need to be wrapped in monads; when mutability is required we convert to a monadic array (STArray, STUArray) by "thawing" it, manipulate the array in that context, then "freeze" it back into an immutable array. I see none of those facilities in Eta, so working with arrays is going to be very difficult without them.
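For reference, the GHC idiom just described looks like this (standard `array` package; `runSTUArray` performs the final `unsafeFreeze` internally):

```haskell
import Data.Array.ST (runSTUArray, newArray, writeArray)
import Data.Array.Unboxed (UArray, elems)

-- Mutate an unboxed array inside the ST monad, then hand back an
-- immutable UArray; runSTUArray can freeze safely because the
-- mutable array cannot escape the ST computation.
squares :: Int -> UArray Int Int
squares n = runSTUArray $ do
  arr <- newArray (0, n - 1) 0
  mapM_ (\i -> writeArray arr i (i * i)) [0 .. n - 1]
  return arr

main :: IO ()
main = print (elems (squares 5))
```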
The only way I can see around the problem ATM is to create a non-monadic immutable array type oneself to gain the ability to generate a list from an array without evaluating it beforehand.
As you said in issue #935, the Array UX isn't very good, but I don't think just providing a Direct FFI as proposed in issue #647 will entirely fix the Array UX problem, other than making it easier to define the immutable array UX I propose here oneself (which one shouldn't have to do). What is really needed is more of the Haskell immutable/ST-mutable interface, including freeze/thaw/unsafeFreeze/unsafeThaw and the immutable array library, and one may as well have runJavaArray as well, to avoid having to use one of the "unsafePerform"s directly.
Such things are show stoppers for potential Eta users that need to do intensive array manipulations, as the current UX doesn't seem to be adequate.
As for me, since Haskell is already one of my favorite languages, I already am becoming an Eta convert from my usual JVM language, Kotlin, and look forward to when you reach version 1.0 status, as I expect you to have these things fixed by then.
@rahulmutt, I have done some more thinking about this and also tested the speed with the minimal changes required to use conventional Haskell STUArray's after adding `array` to the project's `.cabal` file. The same benchmark using STUArray's was over ten times slower than using the JByteArray.
That is still quite a bit better than trying to do the same thing with Frege, but still not really "there".
I wonder if it would be possible to patch the `array` implementation so that unboxed STUArray's are actually backed by the equivalent JArray, boxed STArray's by JObjectArray's, and all the equivalent IArray's are just immutable versions of the same two kinds of JArray (primitive and object), only without any modification API and therefore with no need for the ST/Java monads (or IO, if one were to include the IOArray variations). I wonder whether the same optimizations that apply to the Java monad could also be applied to the ST/IO monads so that the "standard" version could run just as fast?
In this way, one wouldn't need the Java.Array module at all, and would be able to just use code as if written in Haskell, which would seem to me to be very advantageous.
Of course, the "unsafe" functions that are meant to bypass array bounds checking (`unsafeRead`/`unsafeWrite`/`unsafeAt`) would not actually do so, as array bounds checking is built into the JVM, but they would bypass some of the extra overhead of the "safe" versions.
One can then write the proper un-evaluated lazy conversions to any type, including lazy lists, which is kind of the subject here.
If you think this could be made to work while retaining the performance of the Java arrays, then another issue should be opened. Would the work required be major? Perhaps I could contribute if it weren't too big. Let me know if you would like to see the "standard" STUArray benchmark and I'll post it.
@GordonBGood I apologize - the FAQ is horribly outdated, I'll update it today. `Strict` and `StrictData` are available right now, and have been for a while. I agree that laziness is something we'll have to document a lot more thoroughly to make sure new users have a good mental model of how thunks build up. (#939)
Thank you for bringing up the idea of providing an immutable/ST-mutable interface for Java arrays - that sounds wonderful. (#940)
As for your observation that STUArray's are a lot slower: this may have to do with the fact that Haskell's `Ptr` type is implemented using off-heap memory. Off-heap memory does not get garbage collected and is guaranteed not to move, which matches the exact semantics of an STUArray as implemented in Haskell. We had to write our own off-heap memory allocator using the JVM's provision for non-GC'd memory. Perhaps we should do as you suggested and rewrite bits of the `array` library to use Java arrays directly. I think keeping an STUArray benchmark in `eta-benchmarks` would be useful, as improvements to the MemoryManager will be reflected there.
Just to clear up some misconceptions about array bounds checking on the JVM: the JVM does aggressive JIT optimizations that make null checks and array bounds checks seem almost non-existent if it can prove at runtime that your loop bounds respect the checks. Besides eliminating range checks, it can also cut off branches that are never taken, and deoptimize if they are later taken.
To get a feel for some of the advanced tricks the (HotSpot) JVM does, here are some links:
@rahulmutt, thanks for the update on the extra `Strict` extensions; I'll watch for your changes to the FAQ.
As to the use of `Ptr`, I understood that standard GHC doesn't use "pinned" storage for normal arrays, but perhaps you did use it for your emulation; if so, that would certainly explain the slow-down. These sorts of facilities are provided in VMs for marshalling to other languages, but aren't optimized, so as to discourage their use when not necessary - the .NET CLI/CLR is the same. If that is the case, remapping directly to native Java arrays should definitely make them faster.
Yes, I'll post the ST/immutable array benchmark on the Performance issue I raised so someone can add it to benchmarks as a measure of how well the improvements perform.
I knew that the JVM HotSpot compiler puts considerable effort into optimizations, but I have found it difficult to see what it does, as it is harder to view the actual native assembly code generated inside the JVM than it is from the CLI. For these kinds of benchmarks I usually find JVM code about 25% slower than CLI code, but don't know the reason; OTOH, benchmarks testing allocation speed for many small objects generally greatly favor the JVM. Eventually, I'll provide a few of those, too.