jepsen-io/knossos

Assertion errors from Java

itamararjuan opened this issue · 10 comments

Hi @aphyr - Thank you for the amazing framework you've supplied us with to test out our databases
I have a working repo with Jepsen testing out our Blockchain Go Implementation at Orbs
and have encountered something weird with Knossos verifying history and firing many Java Assertion Errors like the following attached here

The weird thing is, the history checks appears to be valid and is quite confusing for me to know whether the test has passed or not due to these errors.
jepsen.log.zip

Would love your feedback. I am attaching here a small fragment of what comes out from stdout:

WARN [2018-11-06 13:46:47,969] clojure-agent-send-pool-2 - jepsen.checker Error while checking history: java.util.concurrent.ExecutionException: java.lang.AssertionError: Assert failed: Process completed an operation without a prior invocation: {:type :ok, :f :read, :value nil, :process 9, :time 682238600, :index 10} i at java.util.concurrent.FutureTask.report(FutureTask.java:122) [na:1.8.0_181] at java.util.concurrent.FutureTask.get(FutureTask.java:192) [na:1.8.0_181] at clojure.core$deref_future.invokeStatic(core.clj:2208) ~[clojure-1.8.0.jar:na] at clojure.core$future_call$reify__6962.deref(core.clj:6688) ~[clojure-1.8.0.jar:na] at clojure.core$deref.invokeStatic(core.clj:2228) ~[clojure-1.8.0.jar:na] at clojure.core$deref.invoke(core.clj:2214) ~[clojure-1.8.0.jar:na] at clojure.core$map$fn__4785.invoke(core.clj:2646) ~[clojure-1.8.0.jar:na] at clojure.lang.LazySeq.sval(LazySeq.java:40) ~[clojure-1.8.0.jar:na] at clojure.lang.LazySeq.seq(LazySeq.java:56) ~[clojure-1.8.0.jar:na] at clojure.lang.RT.seq(RT.java:521) ~[clojure-1.8.0.jar:na] at clojure.core$seq__4357.invokeStatic(core.clj:137) ~[clojure-1.8.0.jar:na] at clojure.core.protocols$seq_reduce.invokeStatic(protocols.clj:24) ~[clojure-1.8.0.jar:na] at clojure.core.protocols$fn__6738.invokeStatic(protocols.clj:75) ~[clojure-1.8.0.jar:na] at clojure.core.protocols$fn__6738.invoke(protocols.clj:75) ~[clojure-1.8.0.jar:na] at clojure.core.protocols$fn__6684$G__6679__6697.invoke(protocols.clj:13) ~[clojure-1.8.0.jar:na] at clojure.core$reduce.invokeStatic(core.clj:6545) ~[clojure-1.8.0.jar:na] at clojure.core$into.invokeStatic(core.clj:6610) ~[clojure-1.8.0.jar:na] at clojure.core$into.invoke(core.clj:6604) ~[clojure-1.8.0.jar:na] at jepsen.checker$compose$reify__3952.check(checker.clj:88) ~[jepsen-0.1.9-SNAPSHOT.jar:na] at jepsen.checker$check_safe.invokeStatic(checker.clj:71) [jepsen-0.1.9-SNAPSHOT.jar:na] at jepsen.checker$check_safe.invoke(checker.clj:64) [jepsen-0.1.9-SNAPSHOT.jar:na] at jepsen.independent$checker$reify__2302$fn__2303.invoke(independent.clj:272) [na:na] at dom_top.core$bounded_pmap$launcher__218$fn__219.invoke(core.clj:129) [jepsen-0.1.9-SNAPSHOT.jar:na] at clojure.core$binding_conveyor_fn$fn__4676.invoke(core.clj:1938) [clojure-1.8.0.jar:na] at clojure.lang.AFn.call(AFn.java:18) [clojure-1.8.0.jar:na] at java.util.concurrent.FutureTask.run(FutureTask.java:266) [na:1.8.0_181] at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [na:1.8.0_181] at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [na:1.8.0_181] at java.lang.Thread.run(Thread.java:748) [na:1.8.0_181] Caused by: java.lang.AssertionError: Assert failed: Process completed an operation without a prior invocation: {:type :ok, :f :read, :value nil, :process 9, :time 682238600, :index 10} i at knossos.history$complete_fold_op.invokeStatic(history.clj:187) ~[knossos-0.3.3.jar:na] at knossos.history$complete_fold_op.invoke(history.clj:160) ~[knossos-0.3.3.jar:na] at clojure.lang.ArrayChunk.reduce(ArrayChunk.java:58) ~[clojure-1.8.0.jar:na] at clojure.core.protocols$fn__6750.invokeStatic(protocols.clj:136) ~[clojure-1.8.0.jar:na] at clojure.core.protocols$fn__6750.invoke(protocols.clj:124) ~[clojure-1.8.0.jar:na] at clojure.core.protocols$fn__6710$G__6705__6719.invoke(protocols.clj:19) ~[clojure-1.8.0.jar:na] at clojure.core.protocols$seq_reduce.invokeStatic(protocols.clj:31) ~[clojure-1.8.0.jar:na] at clojure.core.protocols$fn__6738.invokeStatic(protocols.clj:75) ~[clojure-1.8.0.jar:na] at clojure.core.protocols$fn__6738.invoke(protocols.clj:75) ~[clojure-1.8.0.jar:na] at clojure.core.protocols$fn__6684$G__6679__6697.invoke(protocols.clj:13) ~[clojure-1.8.0.jar:na] at clojure.core$reduce.invokeStatic(core.clj:6545) ~[clojure-1.8.0.jar:na] at clojure.core$reduce.invoke(core.clj:6527) ~[clojure-1.8.0.jar:na] at knossos.history$complete.invokeStatic(history.clj:241) ~[knossos-0.3.3.jar:na] at knossos.history$complete.invoke(history.clj:221) ~[knossos-0.3.3.jar:na] at jepsen.checker.timeline$html$reify__2386.check(timeline.clj:163) ~[na:na] at jepsen.checker$check_safe.invokeStatic(checker.clj:71) [jepsen-0.1.9-SNAPSHOT.jar:na] at jepsen.checker$check_safe.invoke(checker.clj:64) [jepsen-0.1.9-SNAPSHOT.jar:na] at jepsen.checker$compose$reify__3952$fn__3954.invoke(checker.clj:87) ~[jepsen-0.1.9-SNAPSHOT.jar:na] at clojure.core$pmap$fn__6970$fn__6971.invoke(core.clj:6736) ~[clojure-1.8.0.jar:na] ... 6 common frames omitted

I also am attaching the full log of the running

Thanks and hope to hear from you soon!
-- Itamar Arjuan

aphyr commented

Hi @aphyr , I recorded a smaller running, and staggered the requests a bit less dense
and wound up with 2 such history assertion fails. but here's the thing,
the process that knossos claimed was providing :ok op without an :invoke was perfect when following the log line by line.

Each :invoke had a corresponding :ok or :fail and still it said the same thing.
Also something here is quite bizarre in my eyes since the :invoke op is not something that my code adds into knossos , it's added by your code and is implicit to me.

The only part which is up to the code that is written into the test is the reify implementation of the client 's invoke! method which is responsible for the response part of the history record.

So I can't really think of a way that my code would of generated such a scenario where the implicit insertion to history of the :invoke part would be missing from the process.

Regardless of this: following the claimed process's history came up with it being perfectly symetric in terms of the ratio between invokes and ok(s)/fail(s)

Is there any reason in your eyes as to why knossoswould flag the history analysis as valid after these violations? which is also a weird part of this whole situation.

Thanks for feedback!
-- Itamar

For example, this is what the history.edn file includes at the complaint-about index:

{:type :invoke, :f :read, :value [1 nil], :process 1, :time 47689039400, :index 622} {:type :ok, :f :read, :value 0, :process 1, :time 47716834600, :index 623}

while the error that the checker throw is the following one:

INFO [2018-11-07 12:23:54,831] main - jepsen.core {:perf {:latency-graph {:valid? true}, :rate-graph {:valid? true}, :valid? true}, :indep {:valid? :unknown, :results {0 {:valid? :unknown, :error "java.util.concurrent.ExecutionException: java.lang.AssertionError: Assert failed: Process completed an operation without a prior invocation: {:type :ok, :f :read, :value 0, :process 1, :time 47716834600, :index 623}\ni\n at java.util.concurrent.FutureTask.report (FutureTask.java:122)\n java.util.concurrent.FutureTask.get (FutureTask.java:192)\n clojure.core$deref_future.invokeStatic (core.clj:2208)\n clojure.core$future_call$reify__6962.deref (core.clj:6688)\n clojure.core$deref.invokeStatic (core.clj:2228)\n clojure.core$deref.invoke (core.clj:2214)\n clojure.core$map$fn__4785.invoke (core.clj:2646)\n clojure.lang.LazySeq.sval (LazySeq.java:40)\n clojure.lang.LazySeq.seq (LazySeq.java:56)\n clojure.lang.RT.seq (RT.java:521)\n clojure.core$seq__4357.invokeStatic (core.clj:137)\n clojure.core.protocols$seq_reduce.invokeStatic (protocols.clj:24)\n clojure.core.protocols$fn__6738.invokeStatic (protocols.clj:75)\n clojure.core.protocols/fn (protocols.clj:75)\n clojure.core.protocols$fn__6684$G__6679__6697.invoke (protocols.clj:13)\n clojure.core$reduce.invokeStatic (core.clj:6545)\n clojure.core$into.invokeStatic (core.clj:6610)\n clojure.core$into.invoke (core.clj:6604)\n jepsen.checker$compose$reify__3952.check (checker.clj:88)\n jepsen.checker$check_safe.invokeStatic (checker.clj:71)\n jepsen.checker$check_safe.invoke (checker.clj:64)\n jepsen.independent$checker$reify__2302$fn__2303.invoke (independent.clj:272)\n dom_top.core$bounded_pmap$launcher__218$fn__219.invoke (core.clj:129)\n clojure.core$binding_conveyor_fn$fn__4676.invoke (core.clj:1938)\n clojure.lang.AFn.call (AFn.java:18)\n java.util.concurrent.FutureTask.run (FutureTask.java:266)\n java.util.concurrent.ThreadPoolExecutor.runWorker (ThreadPoolExecutor.java:1149)\n java.util.concurrent.ThreadPoolExecutor$Worker.run (ThreadPoolExecutor.java:624)\n java.lang.Thread.run (Thread.java:748)\nCaused by: java.lang.AssertionError: Assert failed: Process completed an operation without a prior invocation: {:type :ok, :f :read, :value 0, :process 1, :time 47716834600, :index 623}\ni\n at knossos.history$complete_fold_op.invokeStatic (history.clj:187)\n knossos.history$complete_fold_op.invoke (history.clj:160)\n clojure.lang.ArrayChunk.reduce (ArrayChunk.java:63)\n clojure.core.protocols$fn__6750.invokeStatic (protocols.clj:136)\n clojure.core.protocols/fn (protocols.clj:124)\n clojure.core.protocols$fn__6710$G__6705__6719.invoke (protocols.clj:19)\n clojure.core.protocols$seq_reduce.invokeStatic (protocols.clj:31)\n clojure.core.protocols$fn__6738.invokeStatic (protocols.clj:75)\n clojure.core.protocols/fn (protocols.clj:75)\n clojure.core.protocols$fn__6684$G__6679__6697.invoke (protocols.clj:13)\n clojure.core$reduce.invokeStatic (core.clj:6545)\n clojure.core$reduce.invoke (core.clj:6527)\n knossos.history$complete.invokeStatic (history.clj:241)\n knossos.history$complete.invoke (history.clj:221)\n jepsen.checker.timeline$html$reify__2386.check (timeline.clj:163)\n jepsen.checker$check_safe.invokeStatic (checker.clj:71)\n jepsen.checker$check_safe.invoke (checker.clj:64)\n jepsen.checker$compose$reify__3952$fn__3954.invoke (checker.clj:87)\n clojure.core$pmap$fn__6970$fn__6971.invoke (core.clj:6736)\n clojure.core$binding_conveyor_fn$fn__4676.invoke (core.clj:1938)\n clojure.lang.AFn.call (AFn.java:18)\n java.util.concurrent.FutureTask.run (FutureTask.java:266)\n java.util.concurrent.ThreadPoolExecutor.runWorker (ThreadPoolExecutor.java:1149)\n java.util.concurrent.ThreadPoolExecutor$Worker.run (ThreadPoolExecutor.java:624)\n java.lang.Thread.run (Thread.java:748)\n"},

it appears that the expected history value is 1 while 0 was returned and that is what bothers knossos rather than the :invoke missing... or am I missing something ? :-)

-- Itamar

aphyr commented
aphyr commented

Hmmm. This still feels like a bug in knossos, but now that I'm not on my phone and can read properly, I notice that your invoke and complete ops have different shapes--the value for the invoke is [1 nil], but the returned value is 0. That feels like you might have a client that isn't constructing appropriate responses--like maybe it's taking [key value] pairs and forgetting to construct [key value] completions, instead just returning values. I don't thiiink that should break this particular knossos codepath, though.

Indeed my client invoke! method returns the register's value as is after casting to int of course (I based my test on your etcd example from scratch)

I wasn't aware of the fact I need to return an array or anything else
So I guess I'll have a look at what your etcd Clojure client code returns in the original test to see I'm aligned with that.

My client code looks like so

`(defn client
"A client for a single compare-and-set register"
[conn]
(reify client/Client
(open! [_ test node]
(client node))

(invoke! [this test op]
  (case (:f op)
    :read (assoc op :type :ok, :value (gamma-cli-read-singular conn))
    :write (do (gamma-cli-write-singular-value (:value op) conn (get op :process))
               (assoc op :type, :ok))

    :cas (try+
          (let [[old new] (:value op)]
            (assoc op :type (if (gamma-cli-singular-cas old new conn (get op :process))
                              :ok
                              :fail)))
          (catch [:errorCode 100] ex
            (assoc op :type :fail, :error :not-found)))))

; If our connection were stateful, we'd close it here.
(close! [_ _])

(setup! [_ _])
(teardown! [_ _])))`

Hi @aphyr and @itamararjuan. I am having the exact same problem and was wondering if what I have helps. I get the following exception

java.util.concurrent.ExecutionException: java.lang.AssertionError: Assert failed: Process completed an operation without a prior invocation: #knossos.op.Op{:process 6, :type :ok, :f :read, :value nil, :index 2, :time 3420418100}

corresponding to this clipped history

:type :invoke, :f :read, :value [6 nil], :process 6, :time 3159622400, :index 0}
{:type :invoke, :f :read, :value [3 nil], :process 3, :time 3168123700, :index 1}
{:type :ok, :f :read, :value nil, :process 6, :time 3420418100, :index 2}
{:type :ok, :f :read, :value nil, :process 3, :time 3420418100, :index 3}
...

My read function is

(defn r [_ _] {:type :invoke, :f :read, :value nil})

and in my client if I try to read a key which doesn't exist I do a (assoc op :type :ok, :value nil) which seems to be the cause of the exception. If I turn them into (assoc op :type :fail, :error :nil) or something, then everything seems to be okay.

I felt that (assoc op :type :ok, :value nil) should be okay based on the Jepsen tutorial, but should I be turning these into fail type?

Thanks for any help!

Oh, by the way, I am using the following checker:

:checker    (checker/linearizable {:model     (model/cas-register)
                                   :algorithm :linear})
aphyr commented

Hi @siyopao. This error message means that process 6's completion didn't have a matching prior invocation. If you look at process 6's operations in that history, they are:

{:type :invoke, :f :read, :value [6 nil], :process 6, :time 3159622400, :index 0}
{:type :ok, :f :read, :value nil, :process 6, :time 3420418100, :index 2}

In this history, process 6 invoked a read with value [6 nil], and completed that read with value nil. Those values don't match, which is why Knossos is confused. You haven't included your generator or client here, so I can't say for sure, but I suspect you may be using an independent generator, which emits operations of the form [key value], but you haven't updated your client or checker to return and analyze [key value] tuples yet.

Hello @aphyr! Thank you for the super fast reply. Well now I feel silly. That is exactly what I missed doing in my client. Thank you for your help!