Flipped tensor dimensions in reply when running train_minirts.sh
SimpleConjugate opened this issue · 3 comments
SimpleConjugate commented
After fixing errors in my local copy of ELF related to async and device_id (renaming async -> non_blocking and device_id -> device), I am still encountering an error when running the train_minirts.sh script.
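For reference, this is the kind of rename I applied; a minimal sketch of the PyTorch keyword-argument change, not the actual ELF diff:

```python
import torch

t = torch.zeros(4)

if torch.cuda.is_available():
    # Old (pre-1.0 style) call that PyTorch 1.4 rejects:
    #   t = t.cuda(device_id=0, async=True)
    # Updated keyword arguments:
    t = t.cuda(device=0, non_blocking=True)
```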
I obtained the error below simply by following the install instructions at https://github.com/facebookresearch/ELF/#install-scripts.
The Main Error
sh ./train_minirts.sh --gpu 0
Warning: argument ValueMatcher/grad_clip_norm cannot be added. Skipped.
PID: 28041
========== Args ============
Loader: handicap_level=0,players="type=AI_NN,fs=50,args=backup/AI_SIMPLE|start/500|decay/0.99;type=AI_SIMPLE,fs=20",max_tick=30000,shuffle_player=False,num_frames_in_state=1,max_unit_cmd=1,seed=0,actor_only=False,model_no_spatial=False,save_replay_prefix=None,output_file=None,cmd_dumper_prefix=None,gpu=0,use_unit_action=False,disable_time_decay=False,use_prev_units=False,attach_complete_info=False,feature_type="ORIGINAL"
ContextArgs: num_games=1024,batchsize=128,game_multi=None,T=20,eval=False,wait_per_group=False,num_collectors=0,verbose_comm=False,verbose_collector=False,mcts_threads=0,mcts_rollout_per_thread=1,mcts_verbose=False,mcts_save_tree_filename="",mcts_verbose_time=False,mcts_use_prior=False,mcts_pseudo_games=0,mcts_pick_method="most_visited"
MoreLabels: additional_labels="id,last_terminal"
ActorCritic:
PolicyGradient: entropy_ratio=0.01,grad_clip_norm=None,min_prob=1e-06,ratio_clamp=10,policy_action_nodes="pi,a"
DiscountedReward: discount=0.99
ValueMatcher: grad_clip_norm=None,value_node="V"
Sampler: sample_policy="epsilon-greedy",greedy=False,epsilon=0.0,sample_nodes="pi,a"
ModelLoader: load=None,onload=None,omit_keys=None,arch="ccpccp;-,64,64,64,-"
ModelInterface: opt_method="adam",lr=0.001,adam_eps=0.001
Trainer: freq_update=1
Evaluator: keys_in_reply="V"
Stats: trainer_stats="winrate"
ModelSaver: record_dir="./record",save_prefix="save",save_dir="./",latest_symlink="latest"
SingleProcessRun: num_minibatch=5000,num_episode=10000,tqdm=True
========== End of Args ============
Options:
Map: 20 by 20
Handicap: 0
Max tick: 30000
Max #Unit Cmd: 1
Seed: 0
Shuffled: False
[name=][fs=50][type=AI_NN][FoW=True][#frames_in_state=1][args=backup/AI_SIMPLE|start/500|decay/0.99]
[name=][fs=20][type=AI_SIMPLE][FoW=True][#frames_in_state=1]
Output_prompt_filename: ""
Cmd_dumper_prefix: ""
Save_replay_prefix: ""
ContextOptions:
#Game: 1024
#Max_thread: 0
#Collectors: 0
T: 20
Wait per group: False
Maximal #moves (0 = no constraint): 0
#Threads: 0
#Rollout per thread: 1
Verbose: False, Verbose_time: False
Use prior: False
Persistent tree: False
#Pseudo game: 0
Pick method: most_visited
Use time decay: True
Save prev seen units: False
Attach complete info: False
ORIGINAL
Version: 1f790173095cd910976d9f651b80beb872ec5d12_GIT_UNSTAGED
Num Actions: 9
Num unittype: 6
num planes: 22
#recv_thread = 4
Group 0:
Collector[0] Batchsize: 128 Info: [gid=0][T=1][name=""]
Collector[1] Batchsize: 128 Info: [gid=1][T=1][name=""]
Collector[2] Batchsize: 128 Info: [gid=2][T=1][name=""]
Collector[3] Batchsize: 128 Info: [gid=3][T=1][name=""]
Group 1:
Collector[4] Batchsize: 128 Info: [gid=4][T=20][name=""]
Collector[5] Batchsize: 128 Info: [gid=5][T=20][name=""]
Collector[6] Batchsize: 128 Info: [gid=6][T=20][name=""]
Collector[7] Batchsize: 128 Info: [gid=7][T=20][name=""]
0%| | 0/5000 [00:00<?, ?it/s]./rts/game_MC/model.py:63: UserWarning: Implicit dimension choice for softmax has been deprecated. Change the call to include dim=X as an argument.
policy = self.softmax(self.linear_policy(h))
0%| | 0/5000 [00:01<?, ?it/s]
Traceback (most recent call last):
File "train.py", line 34, in <module>
runner.run()
File "/workspace/gamebreaker/build/ELF/rlpytorch/runner/single_process.py", line 56, in run
self.GC.Run()
File "/workspace/gamebreaker/build/ELF/elf/utils_elf.py", line 378, in Run
res = self._call(self.infos)
File "/workspace/gamebreaker/build/ELF/elf/utils_elf.py", line 364, in _call
sel_reply.copy_from(reply, batch_key=batch_key)
File "/workspace/gamebreaker/build/ELF/elf/utils_elf.py", line 155, in copy_from
bk[:] = v
RuntimeError: The expanded size of the tensor (1) must match the existing size (128) at non-singleton dimension 0. Target sizes: [1, 128]. Tensor sizes: [128, 1]
Prepare to stop ...
^C
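As an aside, the UserWarning from ./rts/game_MC/model.py:63 is unrelated to the RuntimeError; it only asks for an explicit softmax dimension. A minimal sketch of the equivalent call, with shapes assumed from the batchsize=128 and "Num Actions: 9" reported above:

```python
import torch
import torch.nn.functional as F

logits = torch.randn(128, 9)        # [batchsize, num_actions], values assumed from the args above
policy = F.softmax(logits, dim=1)   # explicit dim=1 silences the deprecation warning
```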
Configuration
Here is my conda environment:
conda list
# packages in environment at $HOME/miniconda3/envs/elf:
#
# Name Version Build Channel
_libgcc_mutex 0.1 main
_pytorch_select 0.2 gpu_0
blas 1.0 mkl
ca-certificates 2020.1.1 0
certifi 2020.4.5.1 py38_0
cffi 1.14.0 py38he30daa8_1
cudatoolkit 10.1.243 h6bb024c_0
cudnn 7.6.5 cuda10.1_0
intel-openmp 2020.1 217
ld_impl_linux-64 2.33.1 h53a641e_7
libedit 3.1.20181209 hc058e9b_0
libffi 3.3 he6710b0_1
libgcc-ng 9.1.0 hdf63c60_0
libgfortran-ng 7.3.0 hdf63c60_0
libstdcxx-ng 9.1.0 hdf63c60_0
mkl 2020.1 217
mkl-service 2.3.0 py38he904b0f_0
mkl_fft 1.0.15 py38ha843d7b_0
mkl_random 1.1.0 py38h962f231_0
msgpack 1.0.0 pypi_0 pypi
msgpack-numpy 0.4.5 pypi_0 pypi
ncurses 6.2 he6710b0_1
ninja 1.9.0 py38hfd86e86_0
numpy 1.18.1 py38h4f9e942_0
numpy-base 1.18.1 py38hde5b4d6_1
openssl 1.1.1g h7b6447c_0
pip 20.1 pypi_0 pypi
pycparser 2.20 py_0
python 3.8.2 hcff3b4d_14
pytorch 1.4.0 cuda101py38h02f0884_0
readline 8.0 h7b6447c_0
setuptools 46.2.0 py38_0
six 1.14.0 py38_0
sqlite 3.31.1 h62c20be_1
tk 8.6.8 hbc83047_0
tqdm 4.46.0 py_0
wheel 0.34.2 py38_0
xz 5.2.5 h7b6447c_0
zlib 1.2.11 h7b6447c_3
My CUDA information is as follows:
| NVIDIA-SMI 435.21 Driver Version: 435.21 CUDA Version: 10.1 |
yh-gong commented
Hey friend, have you solved the problem?
HaoshengZou commented
Simply replacing bk[:] = v with bk = v (in copy_from, elf/utils_elf.py, line 155 of the traceback above) solves this problem.
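For context, here is a minimal standalone reproduction of the mismatch, with the shapes taken from the traceback; the transpose at the end is an alternative I'm assuming would also work, not something from ELF itself:

```python
import torch

bk = torch.zeros(1, 128)   # stand-in for the pre-allocated reply slot ([1, 128])
v = torch.zeros(128, 1)    # stand-in for the value the model returns ([128, 1])

try:
    bk[:] = v              # fails: dimension 0 of v (128) cannot broadcast into 1
except RuntimeError as e:
    print(e)

# Rebinding, as suggested above, sidesteps the in-place broadcast:
bk = v

# An alternative that keeps the in-place copy is to match the layout first:
bk = torch.zeros(1, 128)
bk[:] = v.t()              # or v.view_as(bk)
```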
I've got all the fixes needed for ELF to run smoothly (sh ./train_minirts.sh --gpu 0) on Python 3.7 and PyTorch 1.0+ here.
leonda commented
Thank you for your answer~