microsoft/ProphetNet

Unable to run Genie_Finetune.py

Opened this issue · 1 comment

I ran the following bash script:

OUT_DIR="/Your/output/path"
DATA_PATH="/Your/data/path"
DATA_NAME="xsum_data"
PRETRAIN_CKPT_PATH="/Your/pretrain_ckpt/path"


python -u -m torch.distributed.launch --nproc_per_node=4 --master_port=9421 \
./GENIE_main/Genie_Finetune.py \
--checkpoint_path=$OUT_DIR \
--model_channels 128 --in_channel 128 --out_channel 128 --vocab_size 30522 \
--config_name="bert-base-uncased" --token_emb_type="random" --model_arch="s2s_CAT" \
--diffusion_steps 2000 --predict_xstart --noise_schedule="sqrt" --training_mode="s2s" \
--schedule_sampler="loss-second-moment" --tgt_max_len 64 --src_max_len 512 --data_name=$DATA_NAME \
--data_path=$DATA_PATH \
--lr_anneal_steps 120000 --batch_size 64 --lr 5e-05 --warmup_steps 7200 --train_type="S2S_Diffusion" \
--eval_interval 200 --log_interval 200 --save_interval 20000 \
--pretrain_model_path=$PRETRAIN_CKPT_PATH

The program then reported that mpi4py was missing, so I installed it:

Collecting mpi4py
  Using cached http://mirrors.aliyun.com/pypi/packages/2e/1a/1393e69df9cf7b04143a51776727dd048586781bca82543594ab439e2eb4/mpi4py-3.1.5.tar.gz (2.5 MB)
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Building wheels for collected packages: mpi4py
  Building wheel for mpi4py (pyproject.toml) ... done
  Created wheel for mpi4py: filename=mpi4py-3.1.5-cp38-cp38-linux_x86_64.whl size=6024408 sha256=64ef1c54d03ecb2c862c4e57da02d6dd8d9e33673ad3948afafca08d60edfd64
  Stored in directory: /root/.cache/pip/wheels/9d/2a/7e/c6575a1d595c7d8cce796177f1b9827975c5b48b31e28f25b9
Successfully built mpi4py
Installing collected packages: mpi4py
Successfully installed mpi4py-3.1.5

When I try to run the program again, it prints nothing beyond the deprecation warning below and seems to be stuck somewhere.

/root/miniconda3/lib/python3.8/site-packages/torch/distributed/launch.py:181: FutureWarning: The module torch.distributed.launch is deprecated
and will be removed in future. Use torchrun.
Note that --use-env is set by default in torchrun.
If your script expects `--local-rank` argument to be set, please
change it to read from `os.environ['LOCAL_RANK']` instead. See 
https://pytorch.org/docs/stable/distributed.html#launch-utility for 
further instructions

  warnings.warn(
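
To check whether the torch.distributed rendezvous itself works on this machine, independent of GENIE, I would try a minimal script along these lines (dist_check.py is a hypothetical file name and the gloo backend is my own choice, not anything from the repo):

# dist_check.py -- minimal sketch of a distributed smoke test, not part of GENIE.
import os

import torch
import torch.distributed as dist


def main():
    # torchrun (and torch.distributed.launch) export RANK, WORLD_SIZE,
    # MASTER_ADDR/MASTER_PORT and LOCAL_RANK into the environment.
    local_rank = int(os.environ.get("LOCAL_RANK", 0))

    # gloo keeps the check independent of NCCL/GPU setup; swap in "nccl" to test that path.
    dist.init_process_group(backend="gloo")
    rank = dist.get_rank()
    world_size = dist.get_world_size()

    t = torch.ones(1)
    dist.all_reduce(t)  # should sum to world_size if the rendezvous works
    print(f"rank {rank}/{world_size} (local_rank {local_rank}): all_reduce -> {t.item()}")

    dist.destroy_process_group()


if __name__ == "__main__":
    main()

Launched with torchrun --nproc_per_node=4 dist_check.py, it should print one all_reduce result per rank. If this also hangs, the problem is the distributed setup (master port, firewall) rather than GENIE itself; if it finishes, the hang is more likely inside GENIE's own initialization.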

Can anyone help solve this problem? I don't know whether MPI is causing it. My operating system is Ubuntu.
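
Since the hang started right after building mpi4py from source, a standalone mpi4py smoke test might show whether MPI initialization is what blocks (mpi_check.py is a hypothetical file name, not part of the repo):

# mpi_check.py -- minimal sketch of an mpi4py smoke test.
from mpi4py import MPI

comm = MPI.COMM_WORLD

# If this prints from every rank, basic MPI initialization and startup work.
print(f"rank {comm.Get_rank()} of {comm.Get_size()}, MPI vendor: {MPI.get_vendor()}")

comm.Barrier()
print(f"rank {comm.Get_rank()} passed the barrier")

If even a plain python mpi_check.py hangs at the import, the problem is the MPI runtime the pip-built wheel was linked against; mpirun -n 2 python mpi_check.py additionally exercises multi-process startup.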

I have the same problem. Any updates on this issue?