Score is always 0.0, and it takes so long to prepare the dataset
YJHMITWEB opened this issue · 1 comment
YJHMITWEB commented
Hi, I am following the instructions to run the synthetic benchmark.
I am using the LLaMA-2-chat-hf model, and I specify its path in run.sh:
GPUS="1" # number of GPUs for tensor parallelism.
ROOT_DIR="RULER/results" # the path that stores generated task samples and model predictions.
MODEL_DIR="RULER/models" # the path that contains individual model folders from Hugging Face.
ENGINE_DIR="" # the path that contains individual engine folders from TensorRT-LLM.
However, I found that prepare.py takes very long to run:
for TASK in "${TASKS[@]}"; do
    echo "Start prepare..."
    python data/prepare.py \
        --save_dir ${DATA_DIR} \
        --benchmark ${BENCHMARK} \
        --task ${TASK} \
        --tokenizer_path ${TOKENIZER_PATH} \
        --tokenizer_type ${TOKENIZER_TYPE} \
        --max_seq_length ${MAX_SEQ_LENGTH} \
        --model_template_type ${MODEL_TEMPLATE_TYPE} \
        --num_samples ${NUM_SAMPLES} \
        ${REMOVE_NEWLINE_TAB}
    echo "Start call api..."
    python pred/call_api.py \
        --data_dir ${DATA_DIR} \
        --save_dir ${PRED_DIR} \
        --benchmark ${BENCHMARK} \
        --task ${TASK} \
        --server_type ${MODEL_FRAMEWORK} \
        --model_name_or_path ${MODEL_PATH} \
        --temperature ${TEMPERATURE} \
        --top_k ${TOP_K} \
        --top_p ${TOP_P} \
        ${STOP_WORDS}
done
And the evaluation score is always 0.0. For example, with the hotpotQA task (qa_2), it keeps outputting:
Prepare qa_2 with lines: 500 to RULER/results/llama2-7b-chat/synthetic/131072/data/qa_2/validation.jsonl
Used time: 5.8 minutes
Start call api...
Predict qa_2
from RULER/results/llama2-7b-chat/synthetic/131072/data/qa_2/validation.jsonl
to RULER/results/llama2-7b-chat/synthetic/131072/pred/qa_2.jsonl
0it [00:00, ?it/s]
Used time: 0.0 minutes
Start evaluate...
Total tasks: ['qa_2']
Evaluate task qa_2...
100%|██████████| 500/500 [00:00<00:00, 517943.20it/s]
=============================================
0 1
0 Tasks qa_2
1 Score 0.0
2 Nulls 500/500
Saved eval results to RULER/results/llama2-7b-chat/synthetic/131072/pred/summary-qa_2.csv
Saved submission results to RULER/results/llama2-7b-chat/synthetic/131072/pred/submission.csv
I'd appreciate any help on this.
YJHMITWEB commented
Found out why: Llama-2 only supports a 4K context length, but I was preparing samples at 131072 tokens, so the model returned no valid predictions (Nulls 500/500).
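For anyone hitting the same symptom (Score 0.0 with Nulls 500/500), a quick sanity check before choosing MAX_SEQ_LENGTH is to read the context window the model was actually trained with. A minimal sketch using only the standard library; it assumes the standard Hugging Face checkpoint layout with config.json at the model root:

```python
# Sketch: read a model's trained context window from its Hugging Face
# config.json, so MAX_SEQ_LENGTH in run.sh can be chosen to fit it.
import json
import pathlib


def context_window(model_dir: str) -> int:
    """Return max_position_embeddings from a model directory's config.json."""
    cfg = json.loads((pathlib.Path(model_dir) / "config.json").read_text())
    return cfg["max_position_embeddings"]
```

For llama2-7b-chat this reports 4096, so any RULER sequence length above 4K overflows the context window and yields null predictions.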