facebookresearch/ELF

Failed to train MiniRTS model

bching28 opened this issue · 8 comments

Hi,

I was following your tutorial on installing the necessary scripts and then proceeded to train a model in MiniRTS. I am confused as to why my training is always "killed" around the 142/5000 mark; I have tried training the model multiple times. I attached a screenshot:
[screenshot]

In the ELF folder, I run the train_minirts.sh script with the following command: sh ./train_minirts.sh. Note that I omit --gpu 0 because I don't have a GPU driver installed on my VM. I have allocated 4 CPU cores to the VM; I don't know whether that is insufficient and might be the cause of the problem.
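For reference, the exact invocation (run from the ELF root directory, CPU-only since no GPU driver is installed on the VM):

```sh
# CPU-only run; the --gpu 0 argument from the tutorial is omitted here.
sh ./train_minirts.sh
```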

Also, as a side note, omitting the --gpu argument should not cause any problems when running the scripts, correct?

Thank you for your help.

  • Bryan

Your script should be correct.
Can you try running with 8 cores? Also note that it may need 1 GB+ of memory, so make sure enough is available.
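E.g., you can check what the VM actually has with standard Linux tools:

```sh
# Check how many CPU cores and how much memory the VM actually sees.
nproc     # number of available CPU cores
free -h   # total / used / free memory in human-readable units
```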

@qucheng, I changed the number of cores to 8 and allocated 4 GB of memory to the VM, but I am still getting the same issue as above, at 143/5000. I noticed this warning after running the script; I don't know if it might tell you something.
[screenshot]

Are there any other data that might be useful for debugging this issue?

@bching28 ok.. since you are using only the CPU to train the model, it might be the case that the system sees 100% CPU usage and kills the program. Maybe you can restrict your CPU usage on the VM (e.g., prefix the command with taskset -c 0-3).

@yuandong-tian so I ran the command taskset -c 0-3 sh ./train_minirts.sh and it ran as usual. I don't see the process "killed" anymore, but I get a segmentation fault now. It still happens around 143/5000.
[screenshot]

I monitored the CPU usage at the same time, and it looks like it was maxing out. However, it was pretty much maxing out the entire time before it reached 143/5000. I also noticed that it is complaining about memory allocation, even though I have already allocated 4 GB of RAM.

I have the same problem. Have you solved it?

@Tangent-Wei @yuandong-tian Not yet. I tried bumping the amount of memory on my VM up to 8 GB, but around the 144th iteration of running the game it still crashes with the segmentation fault. I was monitoring the CPU and memory usage this time: the CPU percentage does max out at the 144th iteration, and the memory percentage slowly climbs until the program crashes at around 90% memory usage.
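I watched this with a simple loop along these lines in a second terminal (any equivalent monitoring tool would do):

```sh
# Log memory usage every 5 seconds while training runs in another terminal.
while true; do
    date
    free -m
    sleep 5
done | tee memory_usage.log
```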

Why exactly does this become a problem around the 140th iteration? This may just be a lack of understanding on my part, but what exactly is going on behind the scenes within these 5000 iterations?

I have also tried installing the Nvidia driver for my GPU (GeForce 960), but when I restart my VM I get stuck in an infinite login loop. The only way to fix it is to remove the Nvidia driver.

@bching28 I have solved this problem. I used the command free -h to watch memory usage, and the memory ran out at around the 143rd iteration. I turned down --batchsize in train_minirts.sh and num_minibatch in rlpytorch/runner/single_process.py (line 16), and it works now.
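Roughly like this; the exact default values in your checkout may differ, so treat the numbers below as placeholders:

```sh
# 1. In train_minirts.sh, pass a smaller batch size, e.g.:
#      --batchsize 64        # placeholder; lower than the default
# 2. In rlpytorch/runner/single_process.py (line 16), reduce num_minibatch
#    to a smaller value than the default.

# Then re-run (still pinning to a few cores, as suggested above):
taskset -c 0-3 sh ./train_minirts.sh
```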
@yuandong-tian It would be great if the code could take low-end hardware into account; I would appreciate it very much.

@bching28 ok this is a bit weird. Thanks for reporting this.