Cannot run `test_rap_llama3.sh`?
david101-hunter commented
I downloaded the Llama model as described here:
huggingface-cli download meta-llama/Llama-3.2-1B --include "original/*" --local-dir Llama-3.2-1B
The content of `test_rap_llama3.sh`:
export CUDA_VISIBLE_DEVICES=0
export llama_path="/media/manhdt4/sda1/llm-reasoners/test/Llama-3.2-1B/original"
export llama_size="1B"
python -m torch.distributed.run --nproc_per_node 1 examples/RAP/blocksworld/rap_inference.py --llama_path $llama_path --llama_size $llama_size --data_path 'examples/CoT/blocksworld/data/split_v1/split_v1_step_2_data.json' --depth_limit 2 --batch_size 1 --output_trace_in_each_iter --prompt_path examples/CoT/blocksworld/prompts/pool_prompt_v1.json --log_dir logs/v1_step2
I have tried changing `llama_path` to values like the ones below:
- ../Llama-3.2-1B/original
- ../Llama-3.2-1B/
- ../Llama-3.2-1B/original/consolidated.00.pth
I also renamed the folder from `Llama-3.2-1B` to `Llama-3-1B` and retried all of the cases above. In every case, I cannot run `test_rap_llama3.sh`.
Detailed log with `llama_path` set to `../Llama-3.2-1B/original/consolidated.00.pth`:
(llm-reasoneers) [ai_agent@gpu-dmp-10254137153 llm-reasoners]$ ./examples/RAP/blocksworld/test_rap_llama3.sh
/u01/vtpay/manhdt4/llm-reasoners/test/Llama-3-1B/consolidated.00.pth/llama-2-1b
/u01/vtpay/manhdt4/llm-reasoners/test/Llama-3-1B/consolidated.00.pth/tokenizer.model
> initializing model parallel with size 1
> initializing ddp with size 1
> initializing pipeline with size 1
[rank0]: Traceback (most recent call last):
[rank0]: File "/u01/vtpay/manhdt4/llm-reasoners/examples/RAP/blocksworld/rap_inference.py", line 227, in <module>
[rank0]: fire.Fire(llama2_main) # user will need to switch the model in the code
[rank0]: File "/u01/vtpay/miniconda3/envs/llm-reasoneers/lib/python3.10/site-packages/fire/core.py", line 135, in Fire
[rank0]: component_trace = _Fire(component, args, parsed_flag_args, context, name)
[rank0]: File "/u01/vtpay/miniconda3/envs/llm-reasoneers/lib/python3.10/site-packages/fire/core.py", line 468, in _Fire
[rank0]: component, remaining_args = _CallAndUpdateTrace(
[rank0]: File "/u01/vtpay/miniconda3/envs/llm-reasoneers/lib/python3.10/site-packages/fire/core.py", line 684, in _CallAndUpdateTrace
[rank0]: component = fn(*varargs, **kwargs)
[rank0]: File "/u01/vtpay/manhdt4/llm-reasoners/examples/RAP/blocksworld/rap_inference.py", line 191, in llama2_main
[rank0]: llama_model = Llama2Model(llama_path, llama_size, max_batch_size=1)
[rank0]: File "/u01/vtpay/manhdt4/llm-reasoners/reasoners/lm/llama_2_model.py", line 79, in __init__
[rank0]: self.model, self.tokenizer = self.build(os.path.join(path, f"llama-2-{size.lower()}"), os.path.join(path, "tokenizer.model"),
[rank0]: File "/u01/vtpay/manhdt4/llm-reasoners/reasoners/lm/llama_2_model.py", line 52, in build
[rank0]: assert len(checkpoints) > 0, f"no checkpoint files found in {ckpt_dir}"
[rank0]: AssertionError: no checkpoint files found in /u01/vtpay/manhdt4/llm-reasoners/test/Llama-3-1B/consolidated.00.pth/llama-2-1b
[rank0]:[W1127 14:33:12.053919630 ProcessGroupNCCL.cpp:1250] Warning: WARNING: process group has NOT been destroyed before we destruct ProcessGroupNCCL. On normal program exit, the application should call destroy_process_group to ensure that any pending NCCL operations have finished in this process. In rare cases this process can exit before this point and block the progress of another member of the process group. This constraint has always been present, but this warning has only been added since PyTorch 2.4 (function operator())
E1127 14:33:12.978000 52391 site-packages/torch/distributed/elastic/multiprocessing/api.py:869] failed (exitcode: 1) local_rank: 0 (pid: 52394) of binary: /u01/vtpay/miniconda3/envs/llm-reasoneers/bin/python
Traceback (most recent call last):
File "/u01/vtpay/miniconda3/envs/llm-reasoneers/lib/python3.10/runpy.py", line 196, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/u01/vtpay/miniconda3/envs/llm-reasoneers/lib/python3.10/runpy.py", line 86, in _run_code
exec(code, run_globals)
File "/u01/vtpay/miniconda3/envs/llm-reasoneers/lib/python3.10/site-packages/torch/distributed/run.py", line 923, in <module>
main()
File "/u01/vtpay/miniconda3/envs/llm-reasoneers/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 355, in wrapper
return f(*args, **kwargs)
File "/u01/vtpay/miniconda3/envs/llm-reasoneers/lib/python3.10/site-packages/torch/distributed/run.py", line 919, in main
run(args)
File "/u01/vtpay/miniconda3/envs/llm-reasoneers/lib/python3.10/site-packages/torch/distributed/run.py", line 910, in run
elastic_launch(
File "/u01/vtpay/miniconda3/envs/llm-reasoneers/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 138, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/u01/vtpay/miniconda3/envs/llm-reasoneers/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 269, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
examples/RAP/blocksworld/rap_inference.py FAILED
------------------------------------------------------------
Failures:
<NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2024-11-27_14:33:12
host : gpu-dmp-10254137153
rank : 0 (local_rank: 0)
exitcode : 1 (pid: 52394)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
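From the traceback, the failure looks like a path-layout mismatch rather than a bad download: `llama_2_model.py` appends `llama-2-{size.lower()}` to `llama_path` to find the checkpoints, and expects `tokenizer.model` directly under `llama_path`. Note also the comment on line 227 of `rap_inference.py` (`# user will need to switch the model in the code`), so the script is wired to the Llama 2 loader by default. Below is a minimal sketch of the layout the loader appears to expect, assuming the standard files (`consolidated.00.pth`, `params.json`, `tokenizer.model`) from the Hugging Face `original/` download:

```bash
# Sketch only: rearrange the downloaded files into the layout implied by
# os.path.join(path, f"llama-2-{size.lower()}") and
# os.path.join(path, "tokenizer.model") in llama_2_model.py.
mkdir -p Llama-3.2-1B/llama-2-1b
cp Llama-3.2-1B/original/consolidated.00.pth Llama-3.2-1B/llama-2-1b/
cp Llama-3.2-1B/original/params.json Llama-3.2-1B/llama-2-1b/
cp Llama-3.2-1B/original/tokenizer.model Llama-3.2-1B/
# Point the script at the parent directory, not at original/ or the .pth file:
export llama_path="$(pwd)/Llama-3.2-1B"
export llama_size="1B"
```

This should at least get past the `no checkpoint files found` assertion; whether the Llama 2 loader can actually consume a Llama 3.2 checkpoint and tokenizer is a separate question.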
One more question: where can I download `instance-41.pddl`?
............
[rank0]: stream = FileStream(filename, encoding='utf-8')
[rank0]: File "/u01/vtpay/miniconda3/envs/llm-reasoneers/lib/python3.10/site-packages/antlr4/FileStream.py", line 20, in __init__
[rank0]: super().__init__(self.readDataFrom(fileName, encoding, errors))
[rank0]: File "/u01/vtpay/miniconda3/envs/llm-reasoneers/lib/python3.10/site-packages/antlr4/FileStream.py", line 25, in readDataFrom
[rank0]: with open(fileName, 'rb') as file:
[rank0]: FileNotFoundError: [Errno 2] No such file or directory: 'LLMs-Planning/llm_planning_analysis/instances/blocksworld/generated_basic/instance-41.pddl'
[rank0]:[W1127 16:53:51.972999855 ProcessGroupNCCL.cpp:1250] Warning: WARNING: process group has NOT been destroyed before we destruct ProcessGroupNCCL. On normal program exit, the application should call destroy_process_group to ensure that any pending NCCL operations have finished in this process. In rare cases this process can exit before this point and block the progress of another member of the process group. This constraint has always been present, but this warning has only been added since PyTorch 2.4 (function operator())
.......................
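The relative path in the `FileNotFoundError` matches the layout of the LLMs-Planning repository (I assume https://github.com/karthikv792/LLMs-Planning; treat the URL as a guess), so cloning it into the llm-reasoners working directory might make the path resolve:

```bash
# Assumption: the blocksworld instances come from the LLMs-Planning repo;
# the relative path in the error matches its llm_planning_analysis layout.
git clone https://github.com/karthikv792/LLMs-Planning.git
# The file the script tries to open should then exist:
ls LLMs-Planning/llm_planning_analysis/instances/blocksworld/generated_basic/instance-41.pddl
```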