can not run `test_rap_llama3.sh`?

Question



I downloaded the Llama model as follows:

huggingface-cli download meta-llama/Llama-3.2-1B --include "original/*" --local-dir Llama-3.2-1B

result:

Contents of `test_rap_llama3.sh`:

export CUDA_VISIBLE_DEVICES=0
export llama_path="/media/manhdt4/sda1/llm-reasoners/test/Llama-3.2-1B/original"
export llama_size="1B"
python -m torch.distributed.run --nproc_per_node 1 examples/RAP/blocksworld/rap_inference.py --llama_path $llama_path --llama_size $llama_size --data_path 'examples/CoT/blocksworld/data/split_v1/split_v1_step_2_data.json' --depth_limit 2  --batch_size 1 --output_trace_in_each_iter --prompt_path examples/CoT/blocksworld/prompts/pool_prompt_v1.json --log_dir logs/v1_step2

I have tried setting `llama_path` to each of the following values:

../Llama-3.2-1B/original
../Llama-3.2-1B/
../Llama-3.2-1B/origina/consolidated.00.pth
I also renamed the folder from Llama-3.2-1B to Llama-3-1B and tried the same paths again.

In every case, `test_rap_llama3.sh` fails to run.

Full log with `llama_path` set to ../Llama-3.2-1B/origina/consolidated.00.pth:

(llm-reasoneers) [ai_agent@gpu-dmp-10254137153 llm-reasoners]$ ./examples/RAP/blocksworld/test_rap_llama3.sh
/u01/vtpay/manhdt4/llm-reasoners/test/Llama-3-1B/consolidated.00.pth/llama-2-1b
/u01/vtpay/manhdt4/llm-reasoners/test/Llama-3-1B/consolidated.00.pth/tokenizer.model
> initializing model parallel with size 1
> initializing ddp with size 1
> initializing pipeline with size 1
[rank0]: Traceback (most recent call last):
[rank0]:   File "/u01/vtpay/manhdt4/llm-reasoners/examples/RAP/blocksworld/rap_inference.py", line 227, in <module>
[rank0]:     fire.Fire(llama2_main) # user will need to switch the model in the code
[rank0]:   File "/u01/vtpay/miniconda3/envs/llm-reasoneers/lib/python3.10/site-packages/fire/core.py", line 135, in Fire
[rank0]:     component_trace = _Fire(component, args, parsed_flag_args, context, name)
[rank0]:   File "/u01/vtpay/miniconda3/envs/llm-reasoneers/lib/python3.10/site-packages/fire/core.py", line 468, in _Fire
[rank0]:     component, remaining_args = _CallAndUpdateTrace(
[rank0]:   File "/u01/vtpay/miniconda3/envs/llm-reasoneers/lib/python3.10/site-packages/fire/core.py", line 684, in _CallAndUpdateTrace
[rank0]:     component = fn(*varargs, **kwargs)
[rank0]:   File "/u01/vtpay/manhdt4/llm-reasoners/examples/RAP/blocksworld/rap_inference.py", line 191, in llama2_main
[rank0]:     llama_model = Llama2Model(llama_path, llama_size, max_batch_size=1)
[rank0]:   File "/u01/vtpay/manhdt4/llm-reasoners/reasoners/lm/llama_2_model.py", line 79, in __init__
[rank0]:     self.model, self.tokenizer = self.build(os.path.join(path, f"llama-2-{size.lower()}"), os.path.join(path, "tokenizer.model"),
[rank0]:   File "/u01/vtpay/manhdt4/llm-reasoners/reasoners/lm/llama_2_model.py", line 52, in build
[rank0]:     assert len(checkpoints) > 0, f"no checkpoint files found in {ckpt_dir}"
[rank0]: AssertionError: no checkpoint files found in /u01/vtpay/manhdt4/llm-reasoners/test/Llama-3-1B/consolidated.00.pth/llama-2-1b
[rank0]:[W1127 14:33:12.053919630 ProcessGroupNCCL.cpp:1250] Warning: WARNING: process group has NOT been destroyed before we destruct ProcessGroupNCCL. On normal program exit, the application should call destroy_process_group to ensure that any pending NCCL operations have finished in this process. In rare cases this process can exit before this point and block the progress of another member of the process group. This constraint has always been present,  but this warning has only been added since PyTorch 2.4 (function operator())
E1127 14:33:12.978000 52391 site-packages/torch/distributed/elastic/multiprocessing/api.py:869] failed (exitcode: 1) local_rank: 0 (pid: 52394) of binary: /u01/vtpay/miniconda3/envs/llm-reasoneers/bin/python
Traceback (most recent call last):
  File "/u01/vtpay/miniconda3/envs/llm-reasoneers/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/u01/vtpay/miniconda3/envs/llm-reasoneers/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/u01/vtpay/miniconda3/envs/llm-reasoneers/lib/python3.10/site-packages/torch/distributed/run.py", line 923, in <module>
    main()
  File "/u01/vtpay/miniconda3/envs/llm-reasoneers/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 355, in wrapper
    return f(*args, **kwargs)
  File "/u01/vtpay/miniconda3/envs/llm-reasoneers/lib/python3.10/site-packages/torch/distributed/run.py", line 919, in main
    run(args)
  File "/u01/vtpay/miniconda3/envs/llm-reasoneers/lib/python3.10/site-packages/torch/distributed/run.py", line 910, in run
    elastic_launch(
  File "/u01/vtpay/miniconda3/envs/llm-reasoneers/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 138, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/u01/vtpay/miniconda3/envs/llm-reasoneers/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 269, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
examples/RAP/blocksworld/rap_inference.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-11-27_14:33:12
  host      : gpu-dmp-10254137153
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 52394)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
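
Reading the traceback, the script calls `llama2_main`, and `Llama2Model.__init__` joins `llama_path` with `llama-2-{size}` and `tokenizer.model`. That would explain why the two printed paths end in `consolidated.00.pth/llama-2-1b` and `consolidated.00.pth/tokenizer.model`. The sketch below only reproduces that path construction so the expected layout can be checked before launching. The function names are hypothetical, and the `*.pth` glob is an assumption about how `build` collects checkpoints, not something confirmed by the log.

```python
import os
from pathlib import Path


def expected_llama2_paths(llama_path: str, llama_size: str) -> tuple[str, str]:
    """Rebuild the two paths shown in the traceback (llama_2_model.py, line 79)."""
    ckpt_dir = os.path.join(llama_path, f"llama-2-{llama_size.lower()}")
    tokenizer_path = os.path.join(llama_path, "tokenizer.model")
    return ckpt_dir, tokenizer_path


def find_checkpoints(ckpt_dir: str) -> list[Path]:
    """List checkpoint files in ckpt_dir.

    Assumption: the loader looks for '*.pth' files directly inside ckpt_dir.
    A missing directory simply yields an empty list.
    """
    return sorted(Path(ckpt_dir).glob("*.pth"))


if __name__ == "__main__":
    # Values taken from the failing run above.
    ckpt_dir, tokenizer_path = expected_llama2_paths(
        "/u01/vtpay/manhdt4/llm-reasoners/test/Llama-3-1B/consolidated.00.pth", "1B"
    )
    print("checkpoint dir :", ckpt_dir)
    print("tokenizer path :", tokenizer_path)
    print("checkpoints    :", find_checkpoints(ckpt_dir))
```

If this reading is right, pointing `llama_path` at the `.pth` file or at `original/` cannot satisfy the Llama 2 loader, because it always appends a `llama-2-1b` subdirectory. The comment on line 227 of `rap_inference.py` ("user will need to switch the model in the code") hints that the entry point itself may need to be changed for Llama 3.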

Another question:

Where can I download the file `instance-41.pddl`?

............
[rank0]:     stream = FileStream(filename, encoding='utf-8')
[rank0]:   File "/u01/vtpay/miniconda3/envs/llm-reasoneers/lib/python3.10/site-packages/antlr4/FileStream.py", line 20, in __init__
[rank0]:     super().__init__(self.readDataFrom(fileName, encoding, errors))
[rank0]:   File "/u01/vtpay/miniconda3/envs/llm-reasoneers/lib/python3.10/site-packages/antlr4/FileStream.py", line 25, in readDataFrom
[rank0]:     with open(fileName, 'rb') as file:
[rank0]: FileNotFoundError: [Errno 2] No such file or directory: 'LLMs-Planning/llm_planning_analysis/instances/blocksworld/generated_basic/instance-41.pddl'
[rank0]:[W1127 16:53:51.972999855 ProcessGroupNCCL.cpp:1250] Warning: WARNING: process group has NOT been destroyed before we destruct ProcessGroupNCCL. On normal program exit, the application should call destroy_process_group to ensure that any pending NCCL operations have finished in this process. In rare cases this process can exit before this point and block the progress of another member of the process group. This constraint has always been present,  but this warning has only been added since PyTorch 2.4 (function operator())
.......................