hsiehjackson/RULER

Score is always 0.0, and it takes so long to prepare the dataset

YJHMITWEB opened this issue · 1 comment

Hi, I am following the instructions to run the synthetic benchmark.

I use the LLaMA-2-7B-chat-hf model and specify its path in run.sh:

GPUS="1" # GPU size for tensor_parallel.
ROOT_DIR="RULER/results" # the path that stores generated task samples and model predictions. 
MODEL_DIR="RULER/models" # the path that contains individual model folders from Hugging Face.
ENGINE_DIR="" # the path that contains individual engine folders from TensorRT-LLM.
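For context, I launch the script with the model name and benchmark as arguments (invocation inferred from the output paths in the log below; the exact interface may differ in other versions of run.sh):

bash run.sh llama2-7b-chat synthetic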

However, I found that prepare.py takes a very long time to run:

for TASK in "${TASKS[@]}"; do
    echo "Start prepare..."
    python data/prepare.py \
        --save_dir ${DATA_DIR} \
        --benchmark ${BENCHMARK} \
        --task ${TASK} \
        --tokenizer_path ${TOKENIZER_PATH} \
        --tokenizer_type ${TOKENIZER_TYPE} \
        --max_seq_length ${MAX_SEQ_LENGTH} \
        --model_template_type ${MODEL_TEMPLATE_TYPE} \
        --num_samples ${NUM_SAMPLES} \
        ${REMOVE_NEWLINE_TAB}
    echo "Start call api..."
    python pred/call_api.py \
        --data_dir ${DATA_DIR} \
        --save_dir ${PRED_DIR} \
        --benchmark ${BENCHMARK} \
        --task ${TASK} \
        --server_type ${MODEL_FRAMEWORK} \
        --model_name_or_path ${MODEL_PATH} \
        --temperature ${TEMPERATURE} \
        --top_k ${TOP_K} \
        --top_p ${TOP_P} \
        ${STOP_WORDS}
done
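While debugging, a minimal way to shrink the workload so prepare.py finishes quickly is to override the sample count and context lengths before this loop runs (NUM_SAMPLES and SEQ_LENGTHS are variables set earlier in run.sh; their names here follow the upstream script and are an assumption):

# Quick-test overrides in run.sh, placed before the task loop above.
NUM_SAMPLES=10       # assumption: the default is 500 samples per task, as in the log below
SEQ_LENGTHS=(4096)   # assumption: prepare only the shortest context length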

And the evaluation score is always 0.0. For example, with qa_2 (the HotpotQA-based task), it keeps outputting:

Prepare qa_2 with lines: 500 to RULER/results/llama2-7b-chat/synthetic/131072/data/qa_2/validation.jsonl
Used time: 5.8 minutes
Start call api...
Predict qa_2 
from RULER/results/llama2-7b-chat/synthetic/131072/data/qa_2/validation.jsonl
to RULER/results/llama2-7b-chat/synthetic/131072/pred/qa_2.jsonl
0it [00:00, ?it/s]
Used time: 0.0 minutes
Start evaluate...
Total tasks: ['qa_2']
Evaluate task qa_2...
100%|██████████| 500/500 [00:00<00:00, 517943.20it/s]

=============================================

       0        1
0  Tasks     qa_2
1  Score      0.0
2  Nulls  500/500

Saved eval results to RULER/results/llama2-7b-chat/synthetic/131072/pred/summary-qa_2.csv

Saved submission results to RULER/results/llama2-7b-chat/synthetic/131072/pred/submission.csv

I'd appreciate any help on this.

Found out why: LLaMA-2 only supports a 4K context length. The samples here were generated at 131072 tokens, far beyond the model's context window, so no predictions were produced (0it in the log) and all 500 answers were null, hence the score of 0.0.
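In case it helps others: a minimal sketch of the fix, assuming run.sh defines a SEQ_LENGTHS array as in the upstream script (the variable name is taken from that script and may differ in other versions). Restrict the evaluated context lengths to the model's window:

# In run.sh: keep only lengths the model can actually handle.
# Original (assumption, per the upstream script):
# SEQ_LENGTHS=(131072 65536 32768 16384 8192 4096)
SEQ_LENGTHS=(4096)

This also explains the long prepare times: building 131072-token samples is far slower than building 4096-token ones.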