Training problem about local_rank
Closed this issue · 0 comments
wenyuzzz commented
Thank you for your excellent work.
I encountered some problems while running the code. Could you help answer them? Here are the training parameters.
```python
import os

root_data_dir = '../../'
dataset = 'dataset/HM'
behaviors = 'hm_50w_users.tsv'
images = 'hm_50w_items.tsv'
lmdb_data = 'hm_50w_items.lmdb'

logging_num = 2
testing_num = 1

CV_resize = 224
CV_model_load = 'swin_tiny'
freeze_paras_before = 0

mode = 'train'
item_tower = 'modal'

epoch = 150
load_ckpt_name = 'None'

l2_weight_list = [0.01]
drop_rate_list = [0.1]
batch_size_list = [16]
lr_list_ct = [(1e-4, 1e-4), (5e-5, 5e-5), (1e-4, 5e-5)]
embedding_dim_list = [512]

for l2_weight in l2_weight_list:
    for batch_size in batch_size_list:
        for drop_rate in drop_rate_list:
            for embedding_dim in embedding_dim_list:
                for lr_ct in lr_list_ct:
                    lr = lr_ct[0]
                    fine_tune_lr = lr_ct[1]
                    label_screen = '{}_bs{}_ed{}_lr{}_dp{}_L2{}_Flr{}'.format(
                        item_tower, batch_size, embedding_dim, lr,
                        drop_rate, l2_weight, fine_tune_lr)
                    run_py = "CUDA_VISIBLE_DEVICES='2,3' \
                        /home/zwy/anaconda3/envs/m/bin/python -m torch.distributed.launch --nproc_per_node 2 --master_port 1289 \
                        run.py --root_data_dir {} --dataset {} --behaviors {} --images {} --lmdb_data {} \
                        --mode {} --item_tower {} --load_ckpt_name {} --label_screen {} --logging_num {} --testing_num {} \
                        --l2_weight {} --drop_rate {} --batch_size {} --lr {} --embedding_dim {} \
                        --CV_resize {} --CV_model_load {} --epoch {} --freeze_paras_before {} --fine_tune_lr {}".format(
                            root_data_dir, dataset, behaviors, images, lmdb_data,
                            mode, item_tower, load_ckpt_name, label_screen, logging_num, testing_num,
                            l2_weight, drop_rate, batch_size, lr, embedding_dim,
                            CV_resize, CV_model_load, epoch, freeze_paras_before, fine_tune_lr)
                    os.system(run_py)
```
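As a side note, `torch.distributed.launch` is deprecated in recent PyTorch releases. A sketch of how the launch string could be built with `torchrun` instead (flags shortened for brevity; this only works if `run.py` reads its rank from the `LOCAL_RANK` environment variable rather than from a `--local_rank` argument):

```python
# Sketch only: a torchrun variant of the launch line above. torchrun replaces
# "python -m torch.distributed.launch" and sets LOCAL_RANK in each worker's
# environment instead of passing a --local-rank command-line argument.
root_data_dir = '../../'   # values as in the script above
dataset = 'dataset/HM'

run_py = ("CUDA_VISIBLE_DEVICES='2,3' "
          "torchrun --nproc_per_node 2 --master_port 1289 "
          "run.py --root_data_dir {} --dataset {}".format(root_data_dir, dataset))
# (remaining run.py flags omitted here for brevity)
```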
Here is the error that occurred.
```
/home/zwy/anaconda3/envs/m/lib/python3.8/site-packages/torch/distributed/launch.py:181: FutureWarning: The module torch.distributed.launch is deprecated
and will be removed in future. Use torchrun.
Note that --use-env is set by default in torchrun.
If your script expects `--local-rank` argument to be set, please
change it to read from `os.environ['LOCAL_RANK']` instead. See
https://pytorch.org/docs/stable/distributed.html#launch-utility for
further instructions
  warnings.warn(
[2023-10-14 21:32:25,576] torch.distributed.run: [WARNING]
[2023-10-14 21:32:25,576] torch.distributed.run: [WARNING] *****************************************
[2023-10-14 21:32:25,576] torch.distributed.run: [WARNING] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
[2023-10-14 21:32:25,576] torch.distributed.run: [WARNING] *****************************************
usage: run.py [-h] [--mode MODE] [--item_tower ITEM_TOWER]
              [--root_data_dir ROOT_DATA_DIR] [--dataset DATASET]
              [--behaviors BEHAVIORS] [--images IMAGES]
              [--lmdb_data LMDB_DATA] [--cold_seqs COLD_SEQS]
              [--new_seqs NEW_SEQS] [--new_items NEW_ITEMS]
              [--new_lmdb_data NEW_LMDB_DATA] [--batch_size BATCH_SIZE]
              [--epoch EPOCH] [--lr LR] [--fine_tune_lr FINE_TUNE_LR]
              [--l2_weight L2_WEIGHT]
              [--fine_tune_l2_weight FINE_TUNE_L2_WEIGHT]
              [--drop_rate DROP_RATE] [--CV_model_load CV_MODEL_LOAD]
              [--freeze_paras_before FREEZE_PARAS_BEFORE]
              [--CV_resize CV_RESIZE] [--embedding_dim EMBEDDING_DIM]
              [--num_attention_heads NUM_ATTENTION_HEADS]
              [--transformer_block TRANSFORMER_BLOCK]
              [--max_seq_len MAX_SEQ_LEN] [--min_seq_len MIN_SEQ_LEN]
              [--num_workers NUM_WORKERS] [--load_ckpt_name LOAD_CKPT_NAME]
              [--label_screen LABEL_SCREEN] [--logging_num LOGGING_NUM]
              [--testing_num TESTING_NUM] [--local_rank LOCAL_RANK]
run.py: error: unrecognized arguments: --local-rank=0
usage: run.py [-h] [--mode MODE] [--item_tower ITEM_TOWER]
              [--root_data_dir ROOT_DATA_DIR] [--dataset DATASET]
              [--behaviors BEHAVIORS] [--images IMAGES]
              [--lmdb_data LMDB_DATA] [--cold_seqs COLD_SEQS]
              [--new_seqs NEW_SEQS] [--new_items NEW_ITEMS]
              [--new_lmdb_data NEW_LMDB_DATA] [--batch_size BATCH_SIZE]
              [--epoch EPOCH] [--lr LR] [--fine_tune_lr FINE_TUNE_LR]
              [--l2_weight L2_WEIGHT]
              [--fine_tune_l2_weight FINE_TUNE_L2_WEIGHT]
              [--drop_rate DROP_RATE] [--CV_model_load CV_MODEL_LOAD]
              [--freeze_paras_before FREEZE_PARAS_BEFORE]
              [--CV_resize CV_RESIZE] [--embedding_dim EMBEDDING_DIM]
              [--num_attention_heads NUM_ATTENTION_HEADS]
              [--transformer_block TRANSFORMER_BLOCK]
              [--max_seq_len MAX_SEQ_LEN] [--min_seq_len MIN_SEQ_LEN]
              [--num_workers NUM_WORKERS] [--load_ckpt_name LOAD_CKPT_NAME]
              [--label_screen LABEL_SCREEN] [--logging_num LOGGING_NUM]
              [--testing_num TESTING_NUM] [--local_rank LOCAL_RANK]
run.py: error: unrecognized arguments: --local-rank=1
[2023-10-14 21:32:30,604] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 2) local_rank: 0 (pid: 3708157) of binary: /home/zwy/anaconda3/envs/m/bin/python
Traceback (most recent call last):
  File "/home/zwy/anaconda3/envs/m/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/zwy/anaconda3/envs/m/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/zwy/anaconda3/envs/m/lib/python3.8/site-packages/torch/distributed/launch.py", line 196, in <module>
    main()
  File "/home/zwy/anaconda3/envs/m/lib/python3.8/site-packages/torch/distributed/launch.py", line 192, in main
    launch(args)
  File "/home/zwy/anaconda3/envs/m/lib/python3.8/site-packages/torch/distributed/launch.py", line 177, in launch
    run(args)
  File "/home/zwy/anaconda3/envs/m/lib/python3.8/site-packages/torch/distributed/run.py", line 797, in run
    elastic_launch(
  File "/home/zwy/anaconda3/envs/m/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/zwy/anaconda3/envs/m/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
run.py FAILED
------------------------------------------------------------
Failures:
  [1]:
    time      : 2023-10-14_21:32:30
    host      : gpuserver
    rank      : 1 (local_rank: 1)
    exitcode  : 2 (pid: 3708158)
    error_file: <N/A>
    traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
------------------------------------------------------------
Root Cause (first observed failure):
  [0]:
    time      : 2023-10-14_21:32:30
    host      : gpuserver
    rank      : 0 (local_rank: 0)
    exitcode  : 2 (pid: 3708157)
    error_file: <N/A>
    traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
```
Looking forward to your reply, thank you.
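For anyone hitting the same error: the FutureWarning at the top of the log describes the cause. Newer PyTorch launchers pass `--local-rank` (with a hyphen) or set the `LOCAL_RANK` environment variable, while the usage message shows that run.py's parser only registers `--local_rank` (with an underscore), so argparse rejects the argument. A minimal sketch of a parser setup that accepts both spellings and falls back to the environment variable (run.py's other options are omitted; this is an illustration, not the repository's actual code):

```python
import argparse
import os

def build_parser():
    parser = argparse.ArgumentParser()
    # Register both spellings: the old launcher passes --local_rank, while
    # torchrun / newer torch.distributed.launch pass --local-rank, or only
    # set the LOCAL_RANK environment variable (used here as the default).
    parser.add_argument('--local_rank', '--local-rank', type=int,
                        default=int(os.environ.get('LOCAL_RANK', 0)))
    return parser

args = build_parser().parse_args(['--local-rank=1'])
print(args.local_rank)  # 1
```

With this, the same script works whether it is started via `torch.distributed.launch` or `torchrun`.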