CozySynthesizer/cozy

Hard timeout abort

izgzhen opened this issue · 7 comments

Currently, Cozy will hang at the following state for a long time:

$ time cozy examples/select-flatmap.ds -t 10
Checking call legality...
Checking invariant preservation...
Done!
Adding query q...
STARTING IMPROVEMENT JOB q
updating with 1 new solutions
update order:
  --> q
considering update 1/1...
SOLUTION FOR q AT 0:00:00.037117 [size=74]
----------------------------------------
  _var131 : Bag<{ A : Int, B : String }> = Rs
  _var132 : Bag<{ B : String, C : Int }> = Ss
  _var133 : Bag<{ B : String, C : Int }> = Qs
  _var134 : Bag<{ B : String, C : Int }> = Ws
  return FlatMap {r -> FlatMap {s -> FlatMap {q -> Map {w -> ((r).A, (s).C, (q).B, (w).C)} (Filter {w -> (((q).B == (w).B) and ((r).A == 15))} (_var134))} (_var133)} (_var132)} (_var131)
----------------------------------------
Stopping jobs
requesting stop for ImproveQueryJob[q]
Waiting on 1 jobs...
  --> ImproveQueryJob[q] [pid=26128]
Waiting on 1 jobs...
  --> ImproveQueryJob[q] [pid=26128]
Waiting on 1 jobs...
  --> ImproveQueryJob[q] [pid=26128]
Waiting on 1 jobs...
  --> ImproveQueryJob[q] [pid=26128]
Waiting on 1 jobs...
  --> ImproveQueryJob[q] [pid=26128]
Waiting on 1 jobs...
  --> ImproveQueryJob[q] [pid=26128]
Waiting on 1 jobs...
  --> ImproveQueryJob[q] [pid=26128]
Waiting on 1 jobs...
  --> ImproveQueryJob[q] [pid=26128]
Waiting on 1 jobs...
  --> ImproveQueryJob[q] [pid=26128]
Waiting on 1 jobs...
  --> ImproveQueryJob[q] [pid=26128]
Waiting on 1 jobs...
  --> ImproveQueryJob[q] [pid=26128]
Waiting on 1 jobs...
  --> ImproveQueryJob[q] [pid=26128]
Waiting on 1 jobs...
  --> ImproveQueryJob[q] [pid=26128]
Waiting on 1 jobs...
  --> ImproveQueryJob[q] [pid=26128]
Waiting on 1 jobs...
  --> ImproveQueryJob[q] [pid=26128]
Waiting on 1 jobs...
  --> ImproveQueryJob[q] [pid=26128]
Waiting on 1 jobs...
  --> ImproveQueryJob[q] [pid=26128]
Waiting on 1 jobs...
  --> ImproveQueryJob[q] [pid=26128]
Waiting on 1 jobs...
  --> ImproveQueryJob[q] [pid=26128]
Waiting on 1 jobs...
  --> ImproveQueryJob[q] [pid=26128]
Waiting on 1 jobs...
  --> ImproveQueryJob[q] [pid=26128]
Waiting on 1 jobs...
  --> ImproveQueryJob[q] [pid=26128]
Waiting on 1 jobs...
  --> ImproveQueryJob[q] [pid=26128]
Waiting on 1 jobs...
  --> ImproveQueryJob[q] [pid=26128]
Waiting on 1 jobs...
  --> ImproveQueryJob[q] [pid=26128]
Waiting on 1 jobs...
  --> ImproveQueryJob[q] [pid=26128]
Waiting on 1 jobs...
  --> ImproveQueryJob[q] [pid=26128]
Waiting on 1 jobs...
  --> ImproveQueryJob[q] [pid=26128]
Waiting on 1 jobs...
  --> ImproveQueryJob[q] [pid=26128]
Waiting on 1 jobs...
  --> ImproveQueryJob[q] [pid=26128]
Waiting on 1 jobs...
  --> ImproveQueryJob[q] [pid=26128]
Waiting on 1 jobs...
  --> ImproveQueryJob[q] [pid=26128]
Waiting on 1 jobs...
  --> ImproveQueryJob[q] [pid=26128]
Waiting on 1 jobs...
  --> ImproveQueryJob[q] [pid=26128]
Waiting on 1 jobs...
  --> ImproveQueryJob[q] [pid=26128]
Waiting on 1 jobs...
  --> ImproveQueryJob[q] [pid=26128]
Waiting on 1 jobs...
  --> ImproveQueryJob[q] [pid=26128]
Waiting on 1 jobs...
  --> ImproveQueryJob[q] [pid=26128]
Waiting on 1 jobs...
  --> ImproveQueryJob[q] [pid=26128]
Waiting on 1 jobs...
  --> ImproveQueryJob[q] [pid=26128]
Waiting on 1 jobs...
  --> ImproveQueryJob[q] [pid=26128]
Stopping SafeQueue...
Done!
Generating IR...
Inlining calls...
Generating code for extension types...
Concretization functions:

_var79 : Bag<{ A : Int, B : String }> = Rs
_var80 : Bag<{ B : String, C : Int }> = Ss
_var81 : Bag<{ B : String, C : Int }> = Qs
_var82 : Bag<{ B : String, C : Int }> = Ws

SelectFlatmap:
  type R = { A : Int, B : String }
  type S = { B : String, C : Int }
  type Q = { B : String, C : Int }
  type W = { B : String, C : Int }
  state _var79 : Bag<{ A : Int, B : String }>
  state _var80 : Bag<{ B : String, C : Int }>
  state _var81 : Bag<{ B : String, C : Int }>
  state _var82 : Bag<{ B : String, C : Int }>

  query q():
    FlatMap {r -> FlatMap {s -> FlatMap {q -> Map {w -> ((r).A, (s).C, (q).B, (w).C)} (Filter {w -> (((q).B == (w).B) and ((r).A == 15))} (_var82))} (_var81)} (_var80)} (_var79)

Number of improvements done: 1
cozy examples/select-flatmap.ds -t 10  51.85s user 0.34s system 99% cpu 52.625 total

I guess Cozy is waiting for Z3 to return -- but this might far exceed the specified timeout (10s).

We should be able to let Cozy give up early, kill the Z3 job and output immediately after timeout.

NOTE: some hints on the semantic of timeout #52

Hey! Can I work on this? Can I know where file this is being produced from?

@punndcoder28 Sure, I've added more detailed and assigned it to you. Thanks!

Unfortunately, this issue may not be easy to solve. Cozy uses the Z3 library directly, rather than invoking Z3 as an external process. The Z3 library API offers timeouts, but as far as I know it does not offer a way to interrupt a running call. Since Cozy uses Python multiprocessing you might try to kill the multiprocessing jobs, but that also won't work: killing a multiprocess job can corrupt the communication channels between the main process and the multiprocessing job, resulting in unexpected crashes.

I believe that this issue is still worth working on, since it is a huge annoyance when it happens. I hope that I have missed some easy solution.

OK. If I run into any problems I will leave a comment here.
Thanks.

since it is a huge annoyance when it happens

My experience is that it will happen 50% of all time...and when it does happen and I can't wait, I simply send interrupt from Ctrl-C and kill it off. So, I think the most essential thing is to save the work on disk (as specified in --save) before killing everything off after receiving an interrupt signal from the users. In this vein, we can try to add an interrupt handler and let it access the best solution state so far. It doesn't matter if we need to use terminate API and corrupt the queue, as long as we don't corrupt the existing solution in memory.

To summarize: my recommendation is to save the work when interrupted by user/timer and brutally exit ASAP.

Modern versions of Z3 catch SIGINT:
https://github.com/Z3Prover/z3/blob/d75ce380169617379fb9dff40ed6ff62b1955dc0/src/util/scoped_ctrl_c.cpp#L47

A SIGINT signal causes Z3 to return "unknown", as it would do for a timeout.

Here's my proposal for resolving this issue:

  • When the parent process needs to stop a worker, it first sets the "stop" flag (Job.request_stop) and then sends a SIGINT to the worker process.
  • If a solver call ever returns "unknown", the worker should check the "stop" flag. If it is set, then the "unknown" result can be ignored and the worker can shut down gracefully.
  • Workers should install a SIGINT handler that initiates a graceful shutdown in case the signal is delivered when the worker is not in a Z3 call.

In the original example, it appears the job is actually busy converting the syntax tree into a Z3 query, so additionally:

  • The solver should check the stop flag periodically while converting ASTs to Z3 queries.

(And perhaps we should look into optimizing encoding bag equality a little...)