DeNA/HandyRL

Training stops at epoch 690

yoyoyo-yo opened this issue · 6 comments

Thank you for the nice library!

I was training with this library, but the training stopped at epoch 690.
I don't know why this happened.

At that point, I killed the job.
Then I changed train_args:restart_epoch in config.yaml as below in order to continue training.

train_args:
    restart_epoch: 690

Is this the right method?

Thanks

Thank you for trying it out!

Yes, that is the right way to use restart_epoch.
When you restart training, the minimum_episodes parameter should probably be increased a bit for stable training (i.e. so that there is more variety among the episodes in the buffer when model updates restart). This depends on the game, though...
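
For example, a restart of this run could use something like the following in config.yaml (the raised minimum_episodes value is only illustrative, not a recommendation):

train_args:
    restart_epoch: 690      # resume from the last saved model
    minimum_episodes: 600   # raised from 400 so the buffer holds a wider variety of episodes after restart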

I was training with this library, but the training stopped at epoch 690.
I don't know why this happened.

Do you have any logs from when the error occurred?

Thanks!

@ikki407
Thank you!

No error occurred, but the training process stopped.
I tried the Kaggle Hungry Geese environment.
Below is my config.yaml:

env_args:
    #env: 'TicTacToe'
    #source: 'handyrl.envs.tictactoe'
    #env: 'Geister'
    #source: 'handyrl.envs.geister'
    env: 'HungryGeese'
    source: 'handyrl.envs.kaggle.hungry_geese'


train_args:
    turn_based_training: False
    observation: False
    gamma: 0.8
    forward_steps: 16
    compress_steps: 4
    entropy_regularization: 1.0e-1
    entropy_regularization_decay: 0.1
    update_episodes: 200
    batch_size: 256
    minimum_episodes: 400
    maximum_episodes: 100000
    num_batchers: 2
    eval_rate: 0.1
    worker:
        num_parallel: 6
    lambda: 0.7
    policy_target: 'TD' # 'UPGO' 'VTRACE' 'TD' 'MC'
    value_target: 'TD' # 'VTRACE' 'TD' 'MC'
    seed: 0
    restart_epoch: 0    # set to the last saved epoch (e.g. 690) to resume training


worker_args:
    server_address: ''
    num_parallel: 8

Thanks. Your config looks right.

I think this may be a connection problem between the learner and the workers. In practice, restarting the training is usually good enough, so could you use that approach unless the process stops frequently?

Tip: if you have time, please try server mode, i.e. --train-server and --worker. In this mode, you can reconnect to the server from a client again after the process stops.
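
For reference, here is a minimal sketch of the worker-side config in server mode, assuming the learner has been started with python main.py --train-server on a machine the workers can reach (the address below is a placeholder, not a value from this thread):

worker_args:
    server_address: '192.168.0.10'   # placeholder: address of the machine running --train-server
    num_parallel: 8

Each worker is then launched with python main.py --worker and can reconnect to the learner if its process stops.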

Thanks!

qent commented

I have the same issue: I started training 3 times, and each time it stopped saving models after epoch 693. The GPU actually remained in use even after several days, until I killed the process.

Hi, @yoyoyo-yo and @qent.
We have updated the model-sending procedure, which should avoid the PyTorch shared-memory error.
Please try again!

Closing this issue because #149 has been merged into master.