facebookresearch/ELF

Failed to train MiniRTS model

bching28 opened this issue · 8 comments

Hi,

I was following your tutorial on installing the necessary scripts and then proceeded to train a model in MiniRTS. I am confused as to why my training is always "killed" around the 142/5000 mark; I have tried training the model multiple times. I attached a screenshot:
[screenshot]

In the ELF folder, I run the train_minirts.sh script with the following command: sh ./train_minirts.sh. Note that I omit --gpu 0 because I don't have a GPU driver installed on my VM. I have allocated 4 CPU cores to the VM; I don't know whether that is insufficient and might be the cause of the problem.
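For reference, the exact invocation (run from the ELF root directory, CPU-only since no GPU driver is installed on the VM):

```sh
# CPU-only run; the --gpu 0 argument from the tutorial is omitted here.
sh ./train_minirts.sh
```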

Also, as a side note, omitting the --gpu argument should not cause any problems when running the scripts, correct?

Thank you for your help.

  • Bryan

Your script should be correct.
Can you try running with 8 cores? Also note that it may need 1 GB+ of memory, so make sure enough is available.
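E.g., you can check what the VM actually has with standard Linux tools:

```sh
# Check how many CPU cores and how much memory the VM actually sees.
nproc     # number of available CPU cores
free -h   # total / used / free memory in human-readable units
```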

@qucheng, I changed the number of cores to 8 and allocated 4 GB of memory to the VM, but I am still getting the same issue as above, at 143/5000. I noticed this warning after running the script; I don't know if it might tell you something.
[screenshot]

Are there any other data that might be useful for debugging this issue?

@bching28 ok.. since you are using only the CPU to train the model, it might be the case that the system sees 100% CPU usage and kills the program. Maybe you can restrict your CPU usage on the VM (e.g., prefix the command with taskset -c 0-3).

@yuandong-tian so I ran the command taskset -c 0-3 sh ./train_minirts.sh and it ran as usual. I don't see the process "killed" anymore, but I get a segmentation fault now. It still happens around 143/5000.
[screenshot]

I monitored the CPU usage at the same time, and it looks like it was maxing out. However, it was pretty much maxing out the entire time before it reached 143/5000. I also noticed that it is complaining about memory allocation, even though I have already allocated 4 GB of RAM.

I have the same problem. Have you solved it?

@Tangent-Wei @yuandong-tian Not yet. I tried bumping the amount of memory on my VM up to 8 GB, but around the 144th iteration of running the game it still crashes with the segmentation fault. I was monitoring the CPU and memory usage this time: the CPU percentage does max out at the 144th iteration, and the memory percentage slowly climbs until the program crashes at around 90% memory usage.
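I watched this with a simple loop along these lines in a second terminal (any equivalent monitoring tool would do):

```sh
# Log memory usage every 5 seconds while training runs in another terminal.
while true; do
    date
    free -m
    sleep 5
done | tee memory_usage.log
```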

Why exactly does this become a problem around the 140th iteration? This may just be a lack of understanding on my part, but what exactly is going on behind the scenes within these 5000 iterations?

I have also tried installing the Nvidia driver for my GPU (GeForce 960), but when I restart my VM I get stuck in an infinite login loop. The only way to fix it is to remove the Nvidia driver.

@bching28 I have solved this problem. I used the command free -h to watch memory usage, and the memory ran out at around the 143rd iteration. I turned down --batchsize in train_minirts.sh and num_minibatch in rlpytorch/runner/single_process.py (line 16), and it works now.
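Roughly like this; the exact default values in your checkout may differ, so treat the numbers below as placeholders:

```sh
# 1. In train_minirts.sh, pass a smaller batch size, e.g.:
#      --batchsize 64        # placeholder; lower than the default
# 2. In rlpytorch/runner/single_process.py (line 16), reduce num_minibatch
#    to a smaller value than the default.

# Then re-run (still pinning to a few cores, as suggested above):
taskset -c 0-3 sh ./train_minirts.sh
```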
@yuandong-tian It would be great if the code could take low-end hardware into account; I would appreciate it very much.

@bching28 ok this is a bit weird. Thanks for reporting this.