sisl/ngsim_env

Some tips to solve the segmentation fault error when using validate.py.

DarrenRuan opened this issue · 0 comments

Here I try to summarize the segmentation fault issue and some potential solutions (none of which is guaranteed to work).

My env: Julia 1.1

The Error Description

#27
signal (11): Segmentation fault
in expression starting at no file:0

#20
pid: 0 traj: 58 / 2062 (then stuck)
signal (11): Segmentation fault
in expression starting at no file:0
GetResult at /home/ilan/minonda/conda-bld/work/Python-3.5.2/Modules/_ctypes/callproc.c:911 [inlined]
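One generic debugging aid that might help locate the crash is Python's built-in faulthandler module, which dumps the Python traceback when the interpreter receives a fatal signal such as SIGSEGV. The sketch below is not taken from validate.py: it deliberately crashes a child interpreter with a NULL read through ctypes, only to show what the extra output looks like. Whether faulthandler prints anything in the Julia case is an assumption to verify, since Julia may install its own signal handlers.

```python
import subprocess
import sys

# Child script: enable faulthandler, then deliberately read from a NULL
# pointer through ctypes to provoke a real segmentation fault.
CRASHING_SCRIPT = """
import ctypes
import faulthandler
faulthandler.enable()
ctypes.string_at(0)
"""


def run_and_capture_crash():
    """Run the crashing child and return (returncode, stderr text)."""
    result = subprocess.run(
        [sys.executable, "-c", CRASHING_SCRIPT],
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        universal_newlines=True,
    )
    return result.returncode, result.stderr


if __name__ == "__main__":
    code, err = run_and_capture_crash()
    print("exit code:", code)
    print(err)
```

The same handler can be switched on without editing any code by starting the interpreter with `python -X faulthandler validate.py`, followed by whatever arguments you normally pass.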

Potential Solutions

(I would really appreciate it if you could share your insights here.)

  1. Use Julia v0.6 and run
    cd ~/.julia/lib/v0.6
    rm PyCall.jl
    Reference: https://github.com/sisl/ngsim_env/blob/master/docs/usingTrainedPolicy.md
    (But I do not know how to install Julia packages on v0.6, e.g. with Pkg.add(PackageSpec(url="https://github.com/sisl/Vec.jl")).)
  • How can I use PackageSpec in Julia v0.6?
  • How can I install 'LinearAlgebra' in Julia v0.6? I ask because:

LinearAlgebra is a standard library introduced in Julia v0.7 containing Base.LinAlg from Julia v0.6, so it is not available on Julia v0.6. A package that requires it will not work on Julia 0.6 either. (https://discourse.julialang.org/t/package-linearalgebra/16064)

  1. Following the authors' suggestion in the first solution above, we could also rm ~/julia/.julia/compiled/v1.1/PyCall/****.jl. (I still got the error even after doing this.)

  2. Use 'single_process_collect_trajectories' by setting --debug = True. (This failed for me.)

  3. Try different '--n_proc' values and sleep times. (It is really hard to find a combination that works.)
    E.g., if there are 4 vCPUs on your machine, it is unclear how to match what the authors describe below. (Try --n_proc = 3, 4, or 5.)

Running validate.py occasionally hangs with no error messages or anything like that. Previous experience suggests that this is somehow related to julia processes remaining unfinished and the python script moving on. Looking in validate.py, there is a sleep() call. In the past, we have had some limited success in overcoming the hanging problem by increasing the sleep duration. However, it is not guaranteed. We have been unable to produce a minimal reproducible example of this happening, but the thoughts are that it is related to the machine's load. A higher load means we need to wait longer.

Reference: https://github.com/sisl/ngsim_env/tree/master/scripts/imitation
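Since the right sleep duration depends on machine load, one alternative to tuning a fixed sleep() is to poll the worker processes until they have all exited, with an upper bound on the wait. The following is a minimal sketch under assumptions: `wait_for_workers` and `spawn_sleeper` are hypothetical names, the workers are modelled as plain child processes, and nothing here reflects how validate.py actually launches its Julia workers.

```python
import subprocess
import sys
import time


def wait_for_workers(procs, timeout_s, poll_s=0.05):
    """Poll until every process has exited or the deadline passes.

    Returns the list of processes still running at the deadline.
    """
    deadline = time.monotonic() + timeout_s
    pending = [p for p in procs if p.poll() is None]
    while pending and time.monotonic() < deadline:
        time.sleep(poll_s)
        pending = [p for p in pending if p.poll() is None]
    return pending


def spawn_sleeper(seconds):
    """Hypothetical stand-in for a worker: a child that just sleeps."""
    return subprocess.Popen(
        [sys.executable, "-c", "import time; time.sleep({})".format(seconds)]
    )


if __name__ == "__main__":
    workers = [spawn_sleeper(0.2) for _ in range(3)]
    unfinished = wait_for_workers(workers, timeout_s=10)
    print("unfinished workers:", len(unfinished))
```

With this pattern the script waits only as long as the workers need on a lightly loaded machine, and the timeout turns a silent hang into a list of processes that can be reported or killed.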

  1. My question: is it possible to output trajlist (in validate.py) even if one of the processes fails? Could you give me some hints about this? I would really appreciate any response.
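One way this could work in principle is to run each chunk of trajectories in its own OS process and keep the results of every process that exits cleanly, so that a segmentation fault in one worker cannot take the others down. The sketch below only illustrates that idea: `WORKER_CODE`, `collect_with_isolation`, and the doubling "computation" are hypothetical placeholders, the crash is simulated with `os.abort()`, and validate.py's real worker interface is not assumed.

```python
import json
import subprocess
import sys

# Hypothetical stand-in for one worker's job: it prints a JSON list of
# results for its chunk, or dies abruptly when the chunk contains -1.
WORKER_CODE = """
import json, os, sys
chunk = json.loads(sys.argv[1])
if -1 in chunk:
    os.abort()  # simulate a crash that Python cannot catch
print(json.dumps([x * 2 for x in chunk]))
"""


def collect_with_isolation(chunks):
    """Run each chunk in its own interpreter and keep the survivors' results.

    Returns (trajlist, failed), where failed holds (chunk index, exit code).
    """
    procs = [
        subprocess.Popen(
            [sys.executable, "-c", WORKER_CODE, json.dumps(chunk)],
            stdout=subprocess.PIPE,
            stderr=subprocess.DEVNULL,
        )
        for chunk in chunks
    ]
    trajlist, failed = [], []
    for idx, proc in enumerate(procs):
        out, _ = proc.communicate()
        if proc.returncode == 0:
            trajlist.extend(json.loads(out.decode()))
        else:
            failed.append((idx, proc.returncode))
    return trajlist, failed


if __name__ == "__main__":
    results, failures = collect_with_isolation([[1, 2], [-1], [3]])
    print("collected:", results)
    print("failed chunks:", failures)
```

The failed chunk indices could then be retried or simply reported, while the partial trajlist is still written out.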