anishathalye/porcupine

verify 400 events is too slow for my bank case

siddontang opened this issue · 8 comments

Hi @anishathalye

I use Porcupine to verify my bank case but find it can't stop and run for a long time. I don't know why, but seem that my bank Model is right. see
https://github.com/siddontang/chaos/blob/master/db/tidb/bank.go#L222

You can check out the chaos repo and make verifier to build the chaos verifier, then run

./bin/chaos-verifier -names tidb_bank -history ./history.log

The history log is here

history.log

It is very appreicated that you can help me figure the problem out.

Checking linearizability is NP-complete, so for certain histories, Porcupine is probably going to take a long time.

We should understand if this is a history where the checker should finish quickly, or if this history is just "hard". I'll try to investigate this weekend.

any update? @anishathalye

Sorry about the delay, Siddon. I finally had a chance to take a look at this.

I read through the code in your bank.go, and I didn't see anything that looked wrong. I tried playing with your history.log file, passing it to BankVerifier.Verify(). I tried culling history events, and if you remove enough events from the suffix of the log, you can get to a point where Porcupine runs very fast. With this particular history.log, apparently, keeping only the first 251 lines of the log, it runs very fast (100ms), but with the first 252 lines, it seems to hang (I wasn't patient enough to run it for more than several minutes). I have seen similar behaviors in other domains, e.g. testing key-value stores. Certain histories are very fast, and others can be very slow (or seem to take forever).

I think this basically comes down to the issue of linearizability checking being NP-complete. The particular algorithm we're using in Porcupine uses some heuristics which often leads to a reasonably fast search on many histories, but in the worst case, it will take exponential time. I don't think there's a good solution for this. I'm kind of curious if Jepsen/Knossos handles this particular history well.

Looking more closely at this particular history itself, the read operations occur very infrequently. For example, there is a big gap between the read operation on line 104 and the next read operation returning on line 252. This really blows up the search space, especially when there are lots of possible orderings for the operations that occurred in between those points. For some more intuition on this, consider a scenario where N transfer operations overlap, and then there's a read operation that happens way later, and only one particular ordering of the N operations produces that particular read result; in this scenario, the linearizability checker would need to check all possible orderings of the N operations to find the right one that produces the read result. I haven't looked into how exactly you're generating the workload that produces the history, but you could try increasing the read:write ratio, and I think that'll help.

Also, by the way, I tried looking at a cpu profile and a heap profile, but I didn't find anything suspicious there. Porcupine just takes a long time to search the space, because it's really big. And while it's doing that, it requires a lot of cycles and memory.

@anishathalye I just wanted to say thanks for the great explanation of state space search explosion with WGL style algorithms. I will likely refer people and myself to this comment in the future. :)

Thanks @anishathalye

I will adjust the read:write ratio to introduce more reads in our test :-)

Closing this due to inactivity. If there are any new findings, please re-open and let us know :)

aphyr commented

Late to this party, but I agree with Anish: this problem feels inherent to the state space explosion involved in general-purpose linearizability checking. We hit this in Knossos tests frequently, and while we've made as many efforts to optimize constant factors as possible, checking is still O(c!) with respect to concurrency--and concurrency almost always rises in real-world histories. You may be able to work around the issue via careful partitioning of your workload in both time and across distinct objects, reducing the state space, and performing frequent reads to prune the search space often.