FabianGabriel/Active_flow_control_past_cylinder_using_DRL

Training shuts down right after start

Closed this issue · 10 comments

Hey Fabian, I ran into a problem when executing python_job.sh to start a training:
After I execute the script, I can see the 12 trajectories being generated in the JOB screen, but they are closed right after starting, and the SLURM output files just say this:

/var/tmp/slurmd_spool/job1711768/slurm_script: line 13: ./Allrun.singularity: Permission denied

Do you know a fix for this?
Maybe I set up the repository incorrectly.

greetings
Erik

Hi Erik,

it looks like the Allrun.singularity file lacks execute permission. That is something I have encountered too.
You can change this with the "chmod" command. To do so, navigate to the base case in "DRL_py/env/base_case/agentRotatingWallVelocity" and change the permissions there. The permissions will then be copied into each individual trajectory.
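For reference, something along these lines should do it (just a sketch; adjust the path to your checkout):

cd DRL_py/env/base_case/agentRotatingWallVelocity   # base case that gets copied for every trajectory
chmod +x Allrun.singularity                         # grant execute permission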

Greetings
Fabian

I suspected something like this. That did the trick for this error, although I encountered a new one ^^

cp: cannot stat './env/base_case/baseline_data/Re_100/processor0/4.71025': No such file or directory

I checked the directory and the baseline_data folder is not there. Do I have to run some script first to create it, or is there something hardcoded that I have to change first?

Edit:
I just talked to Andre and he mentioned that you set the starting time of the training to roughly 4.5 s. As I understand it, I have to generate the first few seconds of the uncontrolled trajectories in order to get the training started. Do I have to run the Allrun.singularity script in ./env/base_case/agentRotatingWallVelocity/ ?

Yes, that is indeed correct. To accelerate the training process, the trajectories start at a later time from a snapshot of the simulation. The data needed for that was missing here.
Sadly, it isn't quite as easy as just copying the base_case and letting it run. You would need to make a number of changes to the setup to prevent a premature start of the control actions etc.
However, I can provide you with the necessary data via this link: baseline_data(400MB)
The link has also been added to the README file.
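Roughly, the snapshot data has to end up under env/base_case/baseline_data, which is where the script tries to copy it from (a sketch; run it from the folder you start the training in, and replace the download path with wherever you unpacked the archive):

# afterwards e.g. env/base_case/baseline_data/Re_100/processor0/4.71025 should exist
mkdir -p env/base_case/baseline_data
cp -r /path/to/download/Re_100 env/base_case/baseline_data/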

Okay, I placed the Re_100, Re_200 and Re_400 folders into the correct directories, but my simulations still shut down. The py.log looks like this:

waiting for traj_0 ...
waiting for traj_1 ...
waiting for traj_2 ...
waiting for traj_0 ...
waiting for traj_3 ...
waiting for traj_4 ...
waiting for traj_1 ...
waiting for traj_2 ...
waiting for traj_5 ...
waiting for traj_0 ...
waiting for traj_3 ...
waiting for traj_6 ...
waiting for traj_7 ...
waiting for traj_4 ...
waiting for traj_1 ...
waiting for traj_2 ...
waiting for traj_5 ...
waiting for traj_8 ...
waiting for traj_0 ...
waiting for traj_3 ...
waiting for traj_6 ...
waiting for traj_9 ...
waiting for traj_7 ...
waiting for traj_10 ...
waiting for traj_4 ...
waiting for traj_1 ...
waiting for traj_2 ...
waiting for traj_5 ...
waiting for traj_8 ...
waiting for traj_11 ...
waiting for traj_0 ...
waiting for traj_3 ...
waiting for traj_6 ...
waiting for traj_9 ...
waiting for traj_7 ...
waiting for traj_10 ...
waiting for traj_4 ...
waiting for traj_1 ...
waiting for traj_2 ...
waiting for traj_5 ...
waiting for traj_8 ...
waiting for traj_11 ...
waiting for traj_3 ...
waiting for traj_6 ...
waiting for traj_9 ...
waiting for traj_7 ...
waiting for traj_10 ...
waiting for traj_4 ...
waiting for traj_1 ...
waiting for traj_2 ...
waiting for traj_5 ...
waiting for traj_8 ...
waiting for traj_11 ...
waiting for traj_3 ...
waiting for traj_6 ...
waiting for traj_9 ...
waiting for traj_7 ...
waiting for traj_10 ...
waiting for traj_4 ...
waiting for traj_5 ...
waiting for traj_8 ...
waiting for traj_11 ...
waiting for traj_6 ...
waiting for traj_9 ...
waiting for traj_7 ...
waiting for traj_10 ...
waiting for traj_8 ...
waiting for traj_11 ...
waiting for traj_9 ...
waiting for traj_10 ...

 starting trajectory : 0 


 starting trajectory : 1 


 starting trajectory : 2 


 starting trajectory : 3 


 starting trajectory : 4 


 starting trajectory : 5 


 starting trajectory : 6 


 starting trajectory : 7 


 starting trajectory : 8 


 starting trajectory : 9 


 starting trajectory : 10 


 starting trajectory : 11 

job :  trajectory_0 finished with rc = 0
job :  trajectory_1 finished with rc = 0
job :  trajectory_2 finished with rc = 0
job :  trajectory_3 finished with rc = 0
job :  trajectory_4 finished with rc = 0
job :  trajectory_5 finished with rc = 0
job :  trajectory_6 finished with rc = 0
job :  trajectory_7 finished with rc = 0
job :  trajectory_8 finished with rc = 0
job :  trajectory_11 finished with rc = 0
job :  trajectory_9 finished with rc = 0
job :  trajectory_10 finished with rc = 0
Traceback (most recent call last):
  File "main.py", line 124, in <module>
    action_bounds)
  File "/home/y0079256/DRL_py_beta/ppo.py", line 77, in train_model
    states, actions, rewards, returns, logpas = fill_buffer(env, sample, n_sensor, gamma, r_1, r_2, r_3, r_4, action_bounds)
  File "/home/y0079256/DRL_py_beta/reply_buffer.py", line 55, in fill_buffer
    assert n_traj > 0
AssertionError

Since it is an AssertionError, raised if the number of active trajectories is <= 0, I suppose the simulations shut down before reaching this line of code. Any ideas on why this happens?

Hello Erik,

I suspect the assertion error occurred because none of the simulations completed. Please check whether the simulations finished correctly.
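For example, the solver log inside one of the trajectory folders should show whether pimpleFoam ran to the end (the log file name is assumed from the usual Allrun conventions):

tail -n 40 env/sample_0/trajectory_0/log.pimpleFoam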

Could you please show the slurm output of a trajectory?

Best Regards,
Darshan Thummar.

Hey Darshan,

the slurm outputs basically all contain the same thing:

Running blockMesh on /home/y0079256/DRL_py_beta/env/sample_0/trajectory_0 with image ../../../../of_v2012.sif
Running setExprBoundaryFields on /home/y0079256/DRL_py_beta/env/sample_0/trajectory_0 with image ../../../../of_v2012.sif
Running decomposePar on /home/y0079256/DRL_py_beta/env/sample_0/trajectory_0 with image ../../../../of_v2012.sif
Running renumberMesh (4 processes) on /home/y0079256/DRL_py_beta/env/sample_0/trajectory_0 with image ../../../../of_v2012.sif
-parallel
Running pimpleFoam (4 processes) on /home/y0079256/DRL_py_beta/env/sample_0/trajectory_0 with image ../../../../of_v2012.sif
-parallel

What I noticed is that there is no trajectory_0 folder in the /env/sample_0/ directory.

Kind regards
Erik Schulze

Hi Erik,
the slurm outputs look fine. You probably need to take a look at the log files in the trajectories.
If a trajectory fails, it is copied to a newly created "failed" directory in the DRL_py_beta folder, so you can find the logs there.
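A quick way to scan them is to grep the whole failed folder for the first errors or warnings:

grep -rinE "error|warning" failed/ | head -n 20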

Best Regards,
Fabian

Hey all,

I looked into the failed directory and found that all OpenFOAM log.* files contain this warning message:

--> FOAM Warning : 
    From void* Foam::dlLibraryTable::openLibrary(const Foam::fileName&, bool)
    in file db/dynamicLibrary/dlLibraryTable/dlLibraryTable.C at line 188
    Could not load "../../../libAgentRotatingWallVelocity.so"
../../../libAgentRotatingWallVelocity.so: cannot open shared object file: No such file or directory

I also looked into the directory and it seems that there is no libAgentRotatingWallVelocity.so file. Do I have to run the make script in the "agentRotatingWallVelocity" folder first in order to get the simulations running?

Kind regards
Erik

Hi Erik,
check out the README for the instructions to compile the boundary condition.
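In case it helps, the build usually boils down to something like this (only a sketch; the source-folder location and the bashrc path inside the image are assumptions, the README has the exact steps):

cd agentRotatingWallVelocity   # folder with the boundary-condition source and its Make/ directory
singularity exec /path/to/of_v2012.sif bash -c "source /usr/lib/openfoam/openfoam2012/etc/bashrc && wmake"
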
Best, Andre

Hey all,

it works now, thank you! Although I had to copy the libAgentRotatingWallVelocity.so file into the parent directory afterwards to make it work. I hope this was correct; I guess I could also have changed the path in the controlDict instead.

Kind regards
Erik