Stack overflows

Question


Closed this issue a year ago · 18 comments

This is a follow up from jonathan-laurent/KaTie#17

When running the big trace for rep_0_trace_event/trace.json (shared via email), I get the following stack overflow after a couple hundred thousand events are churned by KaTie:

Abbreviated output

[user]@[computer]:~/Wnt/rep_0_trace_event$ ~/KaTie/_build/install/default/bin/KaTie -q testing_query.katie -t trace.json --output-dir testing_query
Evaluating queries: 1 simple and 0 complex
Fatal error: exception Stack overflow
Raised by primitive operation at Stdlib__Hashtbl.find in file "hashtbl.ml", line 541, characters 9-23
Called from Kappa_terms__Instantiation.subst_map_concrete_agent in file "core/term/instantiation.ml", line 256, characters 9-13
Called from Kappa_terms__Instantiation.subst_map_site in file "core/term/instantiation.ml", line 260, characters 12-16
Called from Kappa_terms__Instantiation.subst_map2_agent_in_action in file "core/term/instantiation.ml", line 297, characters 16-38
Called from Stdlib__List.map in file "list.ml", line 92, characters 20-23
Called from Stdlib__List.map in file "list.ml", line 92, characters 32-39
[...]
Called from Stdlib__List.map in file "list.ml", line 92, characters 32-39
[user]@[computer]:~/Wnt/rep_0_trace_event$

This appears to be a truncated stack. I do not know how to produce a longer stack.

Compiling KaTie to bytecode and running it with OCAMLRUNPARAM and a larger word budget produced a different stack overflow immediately, with KaTie not even creating the output files.

Abbreviated output

[user]@[computer]:~/Wnt/rep_0_trace_event$ env OCAMLRUNPARAM=b,l=8000 ~/KaTie/_build/default/src/main.bc -q testing_query.katie -t trace.json --output-dir testing_query
Fatal error: exception Stack overflow
Raised by primitive operation at Kappa_mixtures__Navigation.port_of_yojson in file "core/siteGraphs/navigation.ml", line 47, characters 27-53
Called from Kappa_mixtures__Navigation.step_of_yojson in file "core/siteGraphs/navigation.ml", line 63, characters 5-21
Called from Stdlib__List.map in file "list.ml", line 92, characters 20-23
Called from Stdlib__List.map in file "list.ml", line 92, characters 32-39
Called from Stdlib__List.map in file "list.ml", line 92, characters 32-39
Called from Stdlib__List.map in file "list.ml", line 92, characters 32-39
Called from Stdlib__List.map in file "list.ml", line 92, characters 32-39
Called from Stdlib__List.map in file "list.ml", line 92, characters 32-39
Called from Stdlib__List.map in file "list.ml", line 92, characters 32-39
Called from Stdlib__List.map in file "list.ml", line 92, characters 32-39
Called from Stdlib__List.map in file "list.ml", line 92, characters 32-39
Called from Stdlib__List.map in file "list.ml", line 92, characters 32-39
Called from Kappa_terms__Pattern.Env.transition_of_yojson in file "core/term/pattern.ml", line 989, characters 48-70
Called from Stdlib__List.map in file "list.ml", line 92, characters 20-23
[...]
Called from Stdlib__List.map in file "list.ml", line 92, characters 32-39
Called from Kappa_terms__Pattern.Env.point_of_yojson in file "core/term/pattern.ml", line 1015, characters 27-58
Called from Kappa_generic_toolset__Tools.array_map_of_list.fill in file "core/dataStructures/tools.ml", line 134, characters 27-32
Called from Kappa_generic_toolset__Tools.array_map_of_list.(fun) in file "core/dataStructures/tools.ml", line 141, characters 15-29
Called from Kappa_terms__Pattern.Env.of_yojson in file "core/term/pattern.ml", line 1078, characters 18-70
Called from Kappa_terms__Model.of_yojson in file "core/term/model.ml", line 334, characters 19-64
Called from Kappa_runtime__Trace.get_headers_from_file in file "core/simulation/trace.ml", line 476, characters 12-102
Called from Dune__exe__Trace_header.load in file "src/trace_header.ml", line 4, characters 20-58
Called from Dune__exe__Main.main in file "src/main.ml", line 98, characters 17-58
Called from Dune__exe__Main in file "src/main.ml", line 122, characters 4-11
[user]@[computer]:~/Wnt/rep_0_trace_event$

These are produced in a fresh opam switch, pinned to ocaml 4.14.1, with Kappa Simulator: v4.1-97-gb9248b7, on the HMS cluster.

Answer 1 · 2023-05-17T22:06:53.000Z

The full output of each run exceeds the GitHub character limit, but can be shared on request.

Answer 2 · 2023-05-18T07:29:24.000Z

The new stack traces aren't useful because the stack is then so small that the program crashes very early at a completely different place. Anyway, I think I have an idea of what could be causing the stack overflow. Can you please run KaTie from the step-size-stats branch with the option --no-progress-bars and report stdout and stderr?

Answer 3 · 2023-05-18T22:42:17.000Z

$ ~/KaTie/_build/install/default/bin/KaTie --no-progress-bars -q testing_query.katie -t trace.json --output-dir testing_query 1> stdout.txt 2> stderr.txt

Standard output holds:

$ more stdout.txt
Evaluating queries: 1 simple and 0 complex
$

Standard error is a very large file:

$ wc -l stderr.txt
38216 stderr.txt

It begins with 37,192 identical lines with

INIT (0 actions, 0 tests, 0 side effects)

Then this block

Fatal error: exception Stack overflow
Raised by primitive operation at Stdlib__Hashtbl.find in file "hashtbl.ml", line 541, characters 9-23
Called from Dune__exe__Safe_replay.s2u in file "src/safe_replay.ml", line 64, characters 6-43
Called from Kappa_terms__Instantiation.subst_map_site in file "core/term/instantiation.ml", line 260, characters 12-16
Called from Kappa_terms__Instantiation.subst_map2_agent_in_action in file "core/term/instantiation.ml", line 297, characters 16-38
Called from Stdlib__List.map in file "list.ml", line 92, characters 20-23
Called from Stdlib__List.map in file "list.ml", line 92, characters 32-39

Followed by 1,018 repetitions of

Called from Stdlib__List.map in file "list.ml", line 92, characters 32-39

I'll zip and email you the file...

Answer 4 · 2023-05-19T06:40:42.000Z

Thanks! I updated the branch, can you do the same thing again?

Answer 5 · 2023-05-19T20:46:06.000Z

Rebuilt, re-ran, and

$ ~/KaTie/_build/install/default/bin/KaTie --no-progress-bars -q testing_query.katie -t trace.json --output-dir testing_query 1> stdout.txt 2> stderr.txt

$ more stdout.txt
Evaluating queries: 1 simple and 0 complex
$

stderr.txt is a very large file again, with:

~37,192 lines similar to:

INIT (14 actions, 0 tests, 0 side effects)
INIT (20 actions, 0 tests, 0 side effects)

Followed by:

INIT (393893 actions, 0 tests, 0 side effects)
Fatal error: exception Stack overflow
Raised by primitive operation at Stdlib__Hashtbl.find in file "hashtbl.ml", line 541, characters 9-23
Called from Dune__exe__Safe_replay.s2u in file "src/safe_replay.ml", line 64, characters 6-43
Called from Kappa_terms__Instantiation.subst_map_site in file "core/term/instantiation.ml", line 260, characters 12-16
Called from Kappa_terms__Instantiation.subst_map2_agent_in_action in file "core/term/instantiation.ml", line 297, characters 16-38
Called from Stdlib__List.map in file "list.ml", line 92, characters 20-23
Called from Stdlib__List.map in file "list.ml", line 92, characters 32-39

followed by 1,017 repetitions of the line

Called from Stdlib__List.map in file "list.ml", line 92, characters 32-39

I'll zip the full stderr and email it to you.

Answer 6 · 2023-05-20T06:36:18.000Z

This is what I thought. How is it possible for an init event to have 393893 actions?
Are you doing something like loading a snapshot?

Answer 7 · 2023-05-20T19:24:27.000Z

Yes, that's exactly where these stack-overflowing traces start.

When started from an "all monomeric" state, the system goes through a nucleation phase before demixing. For the current analysis, I have a run covering the whole init -> demixing -> signaling event -> second demixing sequence of transitions. From it I derive snapshots of the first and second demixing steady states (the time to demixing is stochastic, and that's a future paper), plus one at the start of the signaling event. I then run KaSim with trace output for short simulations from those snapshots, including the "idiom" for explicit binding of the substrate and kinase where appropriate. Those simulations produced the 75GB of traces I sent over email.

Answer 8 · 2023-05-20T19:38:40.000Z

How do you load a snapshot in KaSim? Do you literally concatenate its description to the *.ka model file?

I'll have to make a few calls tail recursive to allow such big events to be handled without stack overflows.

Update: I pushed a new version where I made some functions tail-recursive. Is your big trace running now?
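
To illustrate the kind of change described above: the backtraces point at list.ml line 92, which is List.map in the OCaml 4.14 standard library, and that version of List.map uses one stack frame per list element. A minimal sketch of a tail-recursive replacement follows. The name map_tail_rec is hypothetical and is not taken from KaTie or KaSim; it only shows the general technique of mapping in reverse with an accumulator and then reversing once.

```ocaml
(* Hypothetical helper for illustration; not the actual KaTie/KaSim code. *)

(* List.rev_map and List.rev are both tail-recursive, so the stack depth
   stays constant regardless of the list length. The cost is one extra
   traversal of the list. *)
let map_tail_rec f l = List.rev (List.rev_map f l)

let () =
  (* Same order of magnitude as the INIT event reported in this thread. *)
  let actions = List.init 393_893 (fun i -> i) in
  let mapped = map_tail_rec (fun x -> x + 1) actions in
  Printf.printf "mapped %d elements\n" (List.length mapped)
```

Like List.map, this version applies f to the elements from first to last, so side effects in f happen in the same order.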

Answer 9 · 2023-05-21T21:42:09.000Z

> Do you literally concatenate its description to the *.ka model file?

That used to be the case in the old days, but Pierre added a -mixture [file] command-line option, which tells KaSim to use the %init: directives from the mixture file and ignore the ones in any other files.

About the update: same behavior. I'll zip & email the outputs.

Answer 10 · 2023-05-21T21:56:59.000Z

I removed all calls to List.map on my side, so this means the stack overflow is happening on the KaSim side. I'll either track it down or reimplement some KaSim functions internally.

On your side, can you try and run it with a much bigger stack (say 24GB or 3G words)? You know the command for the bytecode version. For the native version, you can try ulimit.

Answer 11 · 2023-05-21T22:21:44.000Z

Also, ocamldebug may enable you to get the full stack trace, which would really help here. Would you mind trying this too? See instructions here or here.

Answer 12 · 2023-05-22T19:56:04.000Z

The shell running these currently has a ulimit-derived maximum stack of 8192 bytes; however the last stderr.out file I sent you is ~75Kb, ~13K words, so I'm not sure how that value is being interpreted.

Raising it to double the value, 16384, has yielded a sample run that hasn't crashed yet, and has been running for 12h, churning 5.53M events.
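
On the units question, a hedged note: in bash, ulimit -s reports the stack limit in units of 1024 bytes, so a value of 8192 most likely means 8 MiB rather than 8192 bytes. The OCAMLRUNPARAM l parameter, by contrast, is documented as a limit in words and only applies to the bytecode runtime. The sketch below converts between the two, assuming a 64-bit machine (8 bytes per word); the function names are illustrative only.

```ocaml
(* Assumption: 64-bit platform, i.e. 8 bytes per machine word. *)
let bytes_per_word = 8

(* ulimit -s values are assumed to be in KiB (bash convention). *)
let words_of_kib kib = kib * 1024 / bytes_per_word

(* OCAMLRUNPARAM l=... values are in words. *)
let bytes_of_words words = words * bytes_per_word

let gib = 1024 * 1024 * 1024

let () =
  Printf.printf "ulimit -s 8192  = %d words\n" (words_of_kib 8192);
  Printf.printf "ulimit -s 16384 = %d words\n" (words_of_kib 16384);
  Printf.printf "l=8000          = %d bytes\n" (bytes_of_words 8000);
  Printf.printf "3G words        = %d GiB\n" (bytes_of_words (3 * gib) / gib)
```

Under these assumptions, 8192 corresponds to about one million words, l=8000 to only 64,000 bytes, and the 3G words suggested earlier to 24 GiB.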

Assuming this "simple" query has similar runtime to the "complex" query we're after, and that the "take measures" second pass occurs at a rate comparable to this "populate event schedule" first pass, this simple query might finish in 19 days.

Answer 13 · 2023-05-22T20:01:27.000Z

> Assuming this "simple" query has similar runtime to the "complex" query we're after, and that the "take measures" second pass occurs at a rate comparable to this "populate event schedule" first pass, this simple query might finish in 19 days.

Are you running the code in native or in bytecode? Is this slower than the previous engine?
For very long queries, you may also want to try compiling KaTie with ocaml+flambda and using the --no-backtrace option for better performance.

Answer 14 · 2023-05-22T20:33:51.000Z

This is a native code build. Assuming the above estimate of 19 days is valid, it is faster than the previous engine, which churned through this trace in about a full month.

My current setups do not allow Flambda (ocamlopt -config says flambda: false); apparently it requires a new switch, so I'll rebuild and test that next.

Answer 15 · 2023-05-22T20:46:12.000Z

Yes, flambda cannot be enabled by a compiler flag, so you must create a whole new switch. Last time I tried flambda, the difference wasn't huge, but it is worth trying.

Answer 16 · 2023-05-23T08:30:40.000Z

Hmm... 12h later, the flambda-compiled version is slower, at 4/5ths the speed. I must be doing something wrong (right?)

Answer 17 · 2023-05-23T08:52:32.000Z

Not necessarily. Some of flambda's optimizations are double-edged swords and can result in worse performance. Also, flambda is pretty old at this point and the main compiler has been more than catching up in some areas. We could probably get better performance out of flambda by tuning the compilation parameters, but it is likely not worth the effort. Another thing we could try to tweak for better performance is the GC parameters, but I doubt it would be a game changer either.
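
For reference, tweaking the GC parameters could look roughly like the sketch below, which uses the standard Gc module. The specific values are illustrative assumptions, not tuned settings for KaTie, and nothing in this thread indicates which values (if any) would help.

```ocaml
(* Illustrative values only; not measured or recommended for KaTie. *)
let () =
  let before = Gc.get () in
  Gc.set
    { before with
      (* Size of the minor heap, in words. *)
      Gc.minor_heap_size = 8 * 1024 * 1024;
      (* Higher values trade memory for less major-GC work. *)
      Gc.space_overhead = 200 };
  let after = Gc.get () in
  Printf.printf "space_overhead: %d -> %d\n"
    before.Gc.space_overhead after.Gc.space_overhead
```

The same two parameters can also be set without recompiling through OCAMLRUNPARAM, for example s=8M,o=200, which makes it easier to compare a few settings on a shorter trace before committing to a multi-day run.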

Answer 18 · 2023-05-23T23:03:51.000Z

Ok, at this point, I believe these stack overflows are not an issue with the KaTools themselves, so I'll close this issue.