FabianGabriel/Active_flow_control_past_cylinder_using_DRL

Training shuts down right after start

Closed this issue · 10 comments

Hey Fabian, I ran into a problem when executing python_job.sh to start a training:
After I execute the script, I can see the 12 trajectories being generated in the JOB screen, but they are closed right after starting, and the SLURM output files just say this:

/var/tmp/slurmd_spool/job1711768/slurm_script: line 13: ./Allrun.singularity: Permission denied

Do you know a fix for this?
Maybe I set up the repository incorrectly.

greetings
Erik

Hi Erik,

it looks like the Allrun.singularity file lacks execute permission. That is something I have encountered too.
You can change this with the "chmod" command. To do so, navigate to the base case in "DRL_py/env/base_case/agentRotatingWallVelocity" and change the permissions there. The permissions will then be copied into each individual trajectory.
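For reference, something along these lines should do it (just a sketch; adjust the path to your checkout):

cd DRL_py/env/base_case/agentRotatingWallVelocity   # base case that gets copied for every trajectory
chmod +x Allrun.singularity                         # grant execute permission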

Greetings
Fabian

I suspected something like this. That did the trick for this error, although I encountered a new one ^^

cp: cannot stat './env/base_case/baseline_data/Re_100/processor0/4.71025': No such file or directory

I checked the directory and the baseline_data folder is not there. Do I have to run some script first to create it, or is there something hardcoded that I have to change first?

Edit:
I just talked to Andre and he mentioned that you set the starting time of the training to roughly 4.5 s. As I understand it, I have to generate the first few seconds of the uncontrolled trajectories in order to get the training started. Do I have to run the Allrun.singularity script in ./env/base_case/agentRotatingWallVelocity/ ?

Yes, that is indeed correct. To accelerate the training process, the trajectories start at a later time from a snapshot of the simulation. The data needed for that was missing here.
Sadly, it isn't quite as easy as just copying the base_case and letting it run. You would need to make a number of changes to the setup to prevent a premature start of the control actions etc.
However, I can provide you with the necessary data via this link: baseline_data(400MB)
The link has also been added to the README file.
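Roughly, the snapshot data has to end up under env/base_case/baseline_data, which is where the script tries to copy it from (a sketch; run it from the folder you start the training in, and replace the download path with wherever you unpacked the archive):

# afterwards e.g. env/base_case/baseline_data/Re_100/processor0/4.71025 should exist
mkdir -p env/base_case/baseline_data
cp -r /path/to/download/Re_100 env/base_case/baseline_data/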

Okay, I placed the Re_100, Re_200 and Re_400 folders into the correct directories, but my simulations still shut down. The py.log looks like this:

waiting for traj_0 ...
waiting for traj_1 ...
waiting for traj_2 ...
waiting for traj_0 ...
waiting for traj_3 ...
waiting for traj_4 ...
waiting for traj_1 ...
waiting for traj_2 ...
waiting for traj_5 ...
waiting for traj_0 ...
waiting for traj_3 ...
waiting for traj_6 ...
waiting for traj_7 ...
waiting for traj_4 ...
waiting for traj_1 ...
waiting for traj_2 ...
waiting for traj_5 ...
waiting for traj_8 ...
waiting for traj_0 ...
waiting for traj_3 ...
waiting for traj_6 ...
waiting for traj_9 ...
waiting for traj_7 ...
waiting for traj_10 ...
waiting for traj_4 ...
waiting for traj_1 ...
waiting for traj_2 ...
waiting for traj_5 ...
waiting for traj_8 ...
waiting for traj_11 ...
waiting for traj_0 ...
waiting for traj_3 ...
waiting for traj_6 ...
waiting for traj_9 ...
waiting for traj_7 ...
waiting for traj_10 ...
waiting for traj_4 ...
waiting for traj_1 ...
waiting for traj_2 ...
waiting for traj_5 ...
waiting for traj_8 ...
waiting for traj_11 ...
waiting for traj_3 ...
waiting for traj_6 ...
waiting for traj_9 ...
waiting for traj_7 ...
waiting for traj_10 ...
waiting for traj_4 ...
waiting for traj_1 ...
waiting for traj_2 ...
waiting for traj_5 ...
waiting for traj_8 ...
waiting for traj_11 ...
waiting for traj_3 ...
waiting for traj_6 ...
waiting for traj_9 ...
waiting for traj_7 ...
waiting for traj_10 ...
waiting for traj_4 ...
waiting for traj_5 ...
waiting for traj_8 ...
waiting for traj_11 ...
waiting for traj_6 ...
waiting for traj_9 ...
waiting for traj_7 ...
waiting for traj_10 ...
waiting for traj_8 ...
waiting for traj_11 ...
waiting for traj_9 ...
waiting for traj_10 ...

 starting trajectory : 0 


 starting trajectory : 1 


 starting trajectory : 2 


 starting trajectory : 3 


 starting trajectory : 4 


 starting trajectory : 5 


 starting trajectory : 6 


 starting trajectory : 7 


 starting trajectory : 8 


 starting trajectory : 9 


 starting trajectory : 10 


 starting trajectory : 11 

job :  trajectory_0 finished with rc = 0
job :  trajectory_1 finished with rc = 0
job :  trajectory_2 finished with rc = 0
job :  trajectory_3 finished with rc = 0
job :  trajectory_4 finished with rc = 0
job :  trajectory_5 finished with rc = 0
job :  trajectory_6 finished with rc = 0
job :  trajectory_7 finished with rc = 0
job :  trajectory_8 finished with rc = 0
job :  trajectory_11 finished with rc = 0
job :  trajectory_9 finished with rc = 0
job :  trajectory_10 finished with rc = 0
Traceback (most recent call last):
  File "main.py", line 124, in <module>
    action_bounds)
  File "/home/y0079256/DRL_py_beta/ppo.py", line 77, in train_model
    states, actions, rewards, returns, logpas = fill_buffer(env, sample, n_sensor, gamma, r_1, r_2, r_3, r_4, action_bounds)
  File "/home/y0079256/DRL_py_beta/reply_buffer.py", line 55, in fill_buffer
    assert n_traj > 0
AssertionError

Since it is an AssertionError, raised if the number of active trajectories is <= 0, I suppose the simulations shut down before reaching this line of code. Any ideas on why this happens?

Hello Erik,

I suspect the assertion error occurred because none of the simulations completed. Please check whether the simulations finished correctly.
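For example, the solver log inside one of the trajectory folders should show whether pimpleFoam ran to the end (the log file name is assumed from the usual Allrun conventions):

tail -n 40 env/sample_0/trajectory_0/log.pimpleFoam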

Could you please show the slurm output of a trajectory?

Best Regards,
Darshan Thummar.

Hey Darshan,

the slurm outputs basically all contain the same thing:

Running blockMesh on /home/y0079256/DRL_py_beta/env/sample_0/trajectory_0 with image ../../../../of_v2012.sif
Running setExprBoundaryFields on /home/y0079256/DRL_py_beta/env/sample_0/trajectory_0 with image ../../../../of_v2012.sif
Running decomposePar on /home/y0079256/DRL_py_beta/env/sample_0/trajectory_0 with image ../../../../of_v2012.sif
Running renumberMesh (4 processes) on /home/y0079256/DRL_py_beta/env/sample_0/trajectory_0 with image ../../../../of_v2012.sif
-parallel
Running pimpleFoam (4 processes) on /home/y0079256/DRL_py_beta/env/sample_0/trajectory_0 with image ../../../../of_v2012.sif
-parallel

What I noticed is that there is no trajectory_0 folder in the /env/sample_0/ directory.

Kind regards
Erik Schulze

Hi Erik,
the slurm outputs look fine. You probably need to take a look at the log files in the trajectories.
If a trajectory fails, it is copied to a newly created "failed" directory in the DRL_py_beta folder, so you can find the logs there.
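A quick way to scan them is to grep the whole failed folder for the first errors or warnings:

grep -rinE "error|warning" failed/ | head -n 20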

Best Regards,
Fabian

Hey all,

I looked into the failed directory and found that all OpenFOAM log.* files contain this warning message:

--> FOAM Warning : 
    From void* Foam::dlLibraryTable::openLibrary(const Foam::fileName&, bool)
    in file db/dynamicLibrary/dlLibraryTable/dlLibraryTable.C at line 188
    Could not load "../../../libAgentRotatingWallVelocity.so"
../../../libAgentRotatingWallVelocity.so: cannot open shared object file: No such file or directory

I also looked into the directory and it seems that there is no libAgentRotatingWallVelocity.so file. Do I have to run the make script in the "agentRotatingWallVelocity" folder first in order to get the simulations running?

Kind regards
Erik

Hi Erik,
check out the README for the instructions to compile the boundary condition.
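In case it helps, the build usually boils down to something like this (only a sketch; the source-folder location and the bashrc path inside the image are assumptions, the README has the exact steps):

cd agentRotatingWallVelocity   # folder with the boundary-condition source and its Make/ directory
singularity exec /path/to/of_v2012.sif bash -c "source /usr/lib/openfoam/openfoam2012/etc/bashrc && wmake"
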
Best, Andre

Hey all,

it works now, thank you! Although I had to copy the libAgentRotatingWallVelocity.so file into the parent directory afterwards to make it work. I hope this was correct; I guess I could also have changed the path in the controlDict instead.

Kind regards
Erik