facebookresearch/ELF

Flipped tensor dimensions in reply when running train_minirts.sh

SimpleConjugate opened this issue · 3 comments

After fixing errors in my local version of ELF related to async and device_id (async -> non_blocking and device_id -> device), I am still encountering an error when running the train_minirts.sh script.

I got the following error simply by following the install instructions: https://github.com/facebookresearch/ELF/#install-scripts

The Main Error

sh ./train_minirts.sh --gpu 0
Warning: argument ValueMatcher/grad_clip_norm cannot be added. Skipped.
PID: 28041
========== Args ============
Loader: handicap_level=0,players="type=AI_NN,fs=50,args=backup/AI_SIMPLE|start/500|decay/0.99;type=AI_SIMPLE,fs=20",max_tick=30000,shuffle_player=False,num_frames_in_state=1,max_unit_cmd=1,seed=0,actor_only=False,model_no_spatial=False,save_replay_prefix=None,output_file=None,cmd_dumper_prefix=None,gpu=0,use_unit_action=False,disable_time_decay=False,use_prev_units=False,attach_complete_info=False,feature_type="ORIGINAL"
ContextArgs: num_games=1024,batchsize=128,game_multi=None,T=20,eval=False,wait_per_group=False,num_collectors=0,verbose_comm=False,verbose_collector=False,mcts_threads=0,mcts_rollout_per_thread=1,mcts_verbose=False,mcts_save_tree_filename="",mcts_verbose_time=False,mcts_use_prior=False,mcts_pseudo_games=0,mcts_pick_method="most_visited"
MoreLabels: additional_labels="id,last_terminal"
ActorCritic: 
PolicyGradient: entropy_ratio=0.01,grad_clip_norm=None,min_prob=1e-06,ratio_clamp=10,policy_action_nodes="pi,a"
DiscountedReward: discount=0.99
ValueMatcher: grad_clip_norm=None,value_node="V"
Sampler: sample_policy="epsilon-greedy",greedy=False,epsilon=0.0,sample_nodes="pi,a"
ModelLoader: load=None,onload=None,omit_keys=None,arch="ccpccp;-,64,64,64,-"
ModelInterface: opt_method="adam",lr=0.001,adam_eps=0.001
Trainer: freq_update=1
Evaluator: keys_in_reply="V"
Stats: trainer_stats="winrate"
ModelSaver: record_dir="./record",save_prefix="save",save_dir="./",latest_symlink="latest"
SingleProcessRun: num_minibatch=5000,num_episode=10000,tqdm=True
========== End of Args ============
Options:
Map: 20 by 20
Handicap: 0
Max tick: 30000
Max #Unit Cmd: 1
Seed: 0
Shuffled: False
[name=][fs=50][type=AI_NN][FoW=True][#frames_in_state=1][args=backup/AI_SIMPLE|start/500|decay/0.99]
[name=][fs=20][type=AI_SIMPLE][FoW=True][#frames_in_state=1]
Output_prompt_filename: ""
Cmd_dumper_prefix: ""
Save_replay_prefix: ""
ContextOptions:
#Game: 1024
#Max_thread: 0
#Collectors: 0
T: 20
Wait per group: False
Maximal #moves (0 = no constraint): 0
#Threads: 0
#Rollout per thread: 1
Verbose: False, Verbose_time: False
Use prior: False
Persistent tree: False
#Pseudo game: 0
Pick method: most_visited

Use time decay: True
Save prev seen units: False
Attach complete info: False

ORIGINAL
Version:  1f790173095cd910976d9f651b80beb872ec5d12_GIT_UNSTAGED
Num Actions:  9
Num unittype:  6
num planes:  22
#recv_thread = 4
Group 0: 
  Collector[0] Batchsize: 128 Info: [gid=0][T=1][name=""]
  Collector[1] Batchsize: 128 Info: [gid=1][T=1][name=""]
  Collector[2] Batchsize: 128 Info: [gid=2][T=1][name=""]
  Collector[3] Batchsize: 128 Info: [gid=3][T=1][name=""]
Group 1: 
  Collector[4] Batchsize: 128 Info: [gid=4][T=20][name=""]
  Collector[5] Batchsize: 128 Info: [gid=5][T=20][name=""]
  Collector[6] Batchsize: 128 Info: [gid=6][T=20][name=""]
  Collector[7] Batchsize: 128 Info: [gid=7][T=20][name=""]

  0%|                    | 0/5000 [00:00<?, ?it/s]./rts/game_MC/model.py:63: UserWarning: Implicit dimension choice for softmax has been deprecated. Change the call to include dim=X as an argument.
  policy = self.softmax(self.linear_policy(h))
  0%|                    | 0/5000 [00:01<?, ?it/s]
Traceback (most recent call last):
  File "train.py", line 34, in <module>
    runner.run()
  File "/workspace/gamebreaker/build/ELF/rlpytorch/runner/single_process.py", line 56, in run
    self.GC.Run()
  File "/workspace/gamebreaker/build/ELF/elf/utils_elf.py", line 378, in Run
    res = self._call(self.infos)
  File "/workspace/gamebreaker/build/ELF/elf/utils_elf.py", line 364, in _call
    sel_reply.copy_from(reply, batch_key=batch_key)
  File "/workspace/gamebreaker/build/ELF/elf/utils_elf.py", line 155, in copy_from
    bk[:] = v
RuntimeError: The expanded size of the tensor (1) must match the existing size (128) at non-singleton dimension 0.  Target sizes: [1, 128].  Tensor sizes: [128, 1]
Prepare to stop ...
^C
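
The last line of the traceback says that a source of shape [128, 1] is being written in place into a buffer of shape [1, 128], which broadcasting does not allow. Below is a minimal sketch of that mismatch and of one possible way around it (reshaping the source to the buffer's shape before the in-place write). NumPy arrays are used as a stand-in for PyTorch tensors so the snippet runs without a CUDA build; the names `bk` and `v` are taken from the traceback, but the arrays here are hypothetical and are not ELF's actual buffers.

```python
import numpy as np

batch_size = 128

# Hypothetical stand-ins: `bk` plays the preallocated reply buffer,
# `v` the value being copied into it, using the shapes from the traceback.
bk = np.zeros((1, batch_size), dtype=np.float32)
v = np.arange(batch_size, dtype=np.float32).reshape(batch_size, 1)

# The in-place write fails: a [128, 1] source cannot broadcast into [1, 128].
try:
    bk[:] = v
    mismatch_raised = False
except ValueError:
    mismatch_raised = True

# Reshaping the source to the buffer's shape keeps the write in place.
bk[:] = v.reshape(bk.shape)
```

With PyTorch tensors the corresponding call would be `v.reshape(bk.shape)`. Whether a reshape or a transpose is the right fix depends on which axis ELF intends as the batch axis, which the log alone does not settle.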

Configuration

Here is the description of my conda environment:

conda list
# packages in environment at $HOME/miniconda3/envs/elf:
#
# Name                    Version                   Build  Channel
_libgcc_mutex             0.1                        main  
_pytorch_select           0.2                       gpu_0  
blas                      1.0                         mkl  
ca-certificates           2020.1.1                      0  
certifi                   2020.4.5.1               py38_0  
cffi                      1.14.0           py38he30daa8_1  
cudatoolkit               10.1.243             h6bb024c_0  
cudnn                     7.6.5                cuda10.1_0  
intel-openmp              2020.1                      217  
ld_impl_linux-64          2.33.1               h53a641e_7  
libedit                   3.1.20181209         hc058e9b_0  
libffi                    3.3                  he6710b0_1  
libgcc-ng                 9.1.0                hdf63c60_0  
libgfortran-ng            7.3.0                hdf63c60_0  
libstdcxx-ng              9.1.0                hdf63c60_0  
mkl                       2020.1                      217  
mkl-service               2.3.0            py38he904b0f_0  
mkl_fft                   1.0.15           py38ha843d7b_0  
mkl_random                1.1.0            py38h962f231_0  
msgpack                   1.0.0                    pypi_0    pypi
msgpack-numpy             0.4.5                    pypi_0    pypi
ncurses                   6.2                  he6710b0_1  
ninja                     1.9.0            py38hfd86e86_0  
numpy                     1.18.1           py38h4f9e942_0  
numpy-base                1.18.1           py38hde5b4d6_1  
openssl                   1.1.1g               h7b6447c_0  
pip                       20.1                     pypi_0    pypi
pycparser                 2.20                       py_0  
python                    3.8.2               hcff3b4d_14  
pytorch                   1.4.0           cuda101py38h02f0884_0  
readline                  8.0                  h7b6447c_0  
setuptools                46.2.0                   py38_0  
six                       1.14.0                   py38_0  
sqlite                    3.31.1               h62c20be_1  
tk                        8.6.8                hbc83047_0  
tqdm                      4.46.0                     py_0  
wheel                     0.34.2                   py38_0  
xz                        5.2.5                h7b6447c_0  
zlib                      1.2.11               h7b6447c_3

My CUDA information is as follows:

| NVIDIA-SMI 435.21       Driver Version: 435.21       CUDA Version: 10.1     |

Hey friend, have you solved the problem?

Simply replacing bk[:] = v with bk = v solves this problem.
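
One caveat may be worth checking before relying on this change. In Python, `bk[:] = v` writes into the object that `bk` refers to, while `bk = v` only rebinds the name `bk`. The sketch below illustrates the difference with plain lists and two hypothetical helper functions; it assumes `bk` is a local name referring to a shared buffer, which the traceback suggests but does not confirm.

```python
def copy_in_place(buffer, value):
    # Slice assignment writes into the object the caller also holds.
    buffer[:] = value


def copy_by_rebinding(buffer, value):
    # Plain assignment only rebinds the local name `buffer`;
    # the caller's object is left untouched.
    buffer = value
    return buffer


shared = [0.0, 0.0, 0.0]

copy_by_rebinding(shared, [1.0, 2.0, 3.0])
after_rebinding = list(shared)   # still all zeros

copy_in_place(shared, [1.0, 2.0, 3.0])
after_in_place = list(shared)    # now holds the new values
```

If ELF reads the reply back from the original buffer, rebinding would silence the error without delivering the values, so it may be safer to keep the in-place write and fix the shape instead.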

I've got all fixes for ELF to run smoothly (sh ./train_minirts.sh --gpu 0) in Python 3.7 & PyTorch 1.0+ here.

Thank you for your answer~