Score is always 0.0, and it takes so long to prepare the dataset
YJHMITWEB opened this issue · 1 comment
YJHMITWEB commented
Hi, I am following the instructions to run the synthetic benchmark.
I am using the LLaMA-2-chat-hf model, and I specify its path in run.sh:
GPUS="1" # number of GPUs for tensor parallelism.
ROOT_DIR="RULER/results" # the path that stores generated task samples and model predictions.
MODEL_DIR="RULER/models" # the path that contains individual model folders from Hugging Face.
ENGINE_DIR="" # the path that contains individual engine folders from TensorRT-LLM.
However, I found that prepare.py takes very long to run:
for TASK in "${TASKS[@]}"; do
    echo "Start prepare..."
    python data/prepare.py \
        --save_dir ${DATA_DIR} \
        --benchmark ${BENCHMARK} \
        --task ${TASK} \
        --tokenizer_path ${TOKENIZER_PATH} \
        --tokenizer_type ${TOKENIZER_TYPE} \
        --max_seq_length ${MAX_SEQ_LENGTH} \
        --model_template_type ${MODEL_TEMPLATE_TYPE} \
        --num_samples ${NUM_SAMPLES} \
        ${REMOVE_NEWLINE_TAB}
    echo "Start call api..."
    python pred/call_api.py \
        --data_dir ${DATA_DIR} \
        --save_dir ${PRED_DIR} \
        --benchmark ${BENCHMARK} \
        --task ${TASK} \
        --server_type ${MODEL_FRAMEWORK} \
        --model_name_or_path ${MODEL_PATH} \
        --temperature ${TEMPERATURE} \
        --top_k ${TOP_K} \
        --top_p ${TOP_P} \
        ${STOP_WORDS}
done
And the evaluation score is always 0.0. For example, with the hotpotQA task (qa_2), it keeps outputting:
Prepare qa_2 with lines: 500 to RULER/results/llama2-7b-chat/synthetic/131072/data/qa_2/validation.jsonl
Used time: 5.8 minutes
Start call api...
Predict qa_2
from RULER/results/llama2-7b-chat/synthetic/131072/data/qa_2/validation.jsonl
to RULER/results/llama2-7b-chat/synthetic/131072/pred/qa_2.jsonl
0it [00:00, ?it/s]
Used time: 0.0 minutes
Start evaluate...
Total tasks: ['qa_2']
Evaluate task qa_2...
100%|██████████| 500/500 [00:00<00:00, 517943.20it/s]
=============================================
0 1
0 Tasks qa_2
1 Score 0.0
2 Nulls 500/500
Saved eval results to RULER/results/llama2-7b-chat/synthetic/131072/pred/summary-qa_2.csv
Saved submission results to RULER/results/llama2-7b-chat/synthetic/131072/pred/submission.csv
I'd appreciate any help on this.
YJHMITWEB commented
Found out why: Llama-2 only supports a 4K context length, but I was preparing samples at 131072 tokens, so the model returned no valid predictions (Nulls 500/500).
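For anyone hitting the same symptom (Score 0.0 with Nulls 500/500), a quick sanity check before choosing MAX_SEQ_LENGTH is to read the context window the model was actually trained with. A minimal sketch using only the standard library; it assumes the standard Hugging Face checkpoint layout with config.json at the model root:

```python
# Sketch: read a model's trained context window from its Hugging Face
# config.json, so MAX_SEQ_LENGTH in run.sh can be chosen to fit it.
import json
import pathlib


def context_window(model_dir: str) -> int:
    """Return max_position_embeddings from a model directory's config.json."""
    cfg = json.loads((pathlib.Path(model_dir) / "config.json").read_text())
    return cfg["max_position_embeddings"]
```

For llama2-7b-chat this reports 4096, so any RULER sequence length above 4K overflows the context window and yields null predictions.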