Evaluation, benchmarking, and scorecards, targeting performance (throughput and latency), accuracy on popular evaluation harnesses, safety, and hallucination.
- Install from PyPI
pip install -r requirements.txt
pip install opea-eval
Note: install requirements.txt first, because a package published on PyPI cannot declare a direct dependency on a specific commit.
- Build from Source
git clone https://github.com/opea-project/GenAIEval
cd GenAIEval
pip install -r requirements.txt
pip install -e .
To evaluate models on text-generation tasks, we follow lm-evaluation-harness and provide both command-line and function-call usage. Over 60 standard academic benchmarks for LLMs are available, with hundreds of subtasks and variants, such as ARC, HellaSwag, MMLU, TruthfulQA, Winogrande, GSM8K, and so on.
# pip install --upgrade-strategy eager optimum[habana]
cd evals/evaluation/lm_evaluation_harness/examples
python main.py \
--model gaudi-hf \
--model_args pretrained=EleutherAI/gpt-j-6B \
--tasks hellaswag \
--device hpu \
--batch_size 8
cd evals/evaluation/lm_evaluation_harness/examples
python main.py \
--model hf \
--model_args pretrained=EleutherAI/gpt-j-6B \
--tasks hellaswag \
--device cpu \
--batch_size 8
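The two commands above differ only in the --model and --device values (gaudi-hf with hpu versus hf with cpu). As a minimal sketch, the argument list could be assembled in Python so that one script covers both targets. Note that build_lm_eval_command is a hypothetical helper, not part of GenAIEval, and it uses only the flags shown above.

```python
import shlex


def build_lm_eval_command(pretrained, tasks, device="cpu", batch_size=8):
    """Return the argv list for examples/main.py (hypothetical helper)."""
    # Assumption taken from the commands above: hpu pairs with gaudi-hf, cpu with hf.
    model = "gaudi-hf" if device == "hpu" else "hf"
    return [
        "python", "main.py",
        "--model", model,
        "--model_args", f"pretrained={pretrained}",
        "--tasks", tasks,
        "--device", device,
        "--batch_size", str(batch_size),
    ]


cmd = build_lm_eval_command("EleutherAI/gpt-j-6B", "hellaswag", device="hpu")
# Print the command as a single shell-ready line for inspection.
print(shlex.join(cmd))
```

The resulting list can be passed to subprocess.run(cmd, check=True) when the working directory is evals/evaluation/lm_evaluation_harness/examples.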
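from evals.evaluation.lm_evaluation_harness import LMEvalParser, evaluate
# user_model and tokenizer are assumed to be loaded beforehand (for example with Hugging Face transformers)
args = LMEvalParser(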
model="hf",
user_model=user_model,
tokenizer=tokenizer,
tasks="hellaswag",
device="cpu",
batch_size=8,
)
results = evaluate(args)
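The structure of the object returned by evaluate is not described here. As a sketch, assuming it mirrors lm-evaluation-harness and is a dict whose "results" key maps each task name to a dict of metrics, the numeric scores could be flattened for printing. Both flatten_scores and the sample dict are illustrative only; the metric names and values are placeholders, not real measurements.

```python
def flatten_scores(results):
    """Flatten {"results": {task: {metric: value}}} into sorted (task, metric, value) rows."""
    rows = []
    for task, metrics in results.get("results", {}).items():
        for metric, value in metrics.items():
            # Keep numeric scores only; skip aliases and other non-numeric fields.
            if isinstance(value, (int, float)) and not isinstance(value, bool):
                rows.append((task, metric, float(value)))
    return sorted(rows)


# Placeholder structure for illustration (values are made up, not benchmark results).
sample = {"results": {"hellaswag": {"alias": "hellaswag", "acc": 0.5, "acc_stderr": 0.01}}}
for task, metric, value in flatten_scores(sample):
    print(f"{task:12s} {metric:12s} {value:.4f}")
```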
- Set up a separate server with GenAIComps
# build cpu docker
docker build -f Dockerfile.cpu -t opea/lm-eval:latest .
# start the server
docker run -p 9006:9006 --ipc=host -e MODEL="hf" -e MODEL_ARGS="pretrained=Intel/neural-chat-7b-v3-3" -e DEVICE="cpu" opea/lm-eval:latest
- Evaluate the model
- Set base_url and tokenizer in --model_args, and pass --model genai-hf
cd evals/evaluation/lm_evaluation_harness/examples
python main.py \
--model genai-hf \
--model_args "base_url=http://{your_ip}:9006,tokenizer=Intel/neural-chat-7b-v3-3" \
--tasks "lambada_openai" \
--batch_size 2
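The --model_args value above is a comma-separated list of key=value pairs. A small sketch of building that string in Python, so the server address is set in one place, might look like the following. Here format_model_args is a hypothetical helper, localhost stands in for {your_ip}, and the comma check assumes the harness splits the string on commas.

```python
def format_model_args(**kwargs):
    """Join keyword arguments into the key=value,key=value form used by --model_args."""
    for key, value in kwargs.items():
        # Assumption: commas separate pairs, so a comma inside a value would be misparsed.
        if "," in str(value):
            raise ValueError(f"value for {key!r} must not contain a comma")
    return ",".join(f"{key}={value}" for key, value in kwargs.items())


model_args = format_model_args(
    base_url="http://localhost:9006",  # assumption: the server from the docker run step, reachable locally
    tokenizer="Intel/neural-chat-7b-v3-3",
)
print(model_args)
```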
To evaluate models on coding tasks, or coding LLMs specifically, we follow bigcode-evaluation-harness and provide both command-line and function-call usage. HumanEval, HumanEval+, InstructHumanEval, APPS, MBPP, MBPP+, and DS-1000 are available in both completion (left-to-right) and insertion (FIM) modes.
cd evals/evaluation/bigcode_evaluation_harness/examples
python main.py \
--model "codeparrot/codeparrot-small" \
--tasks "humaneval" \
--n_samples 100 \
--batch_size 10 \
--allow_code_execution
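from evals.evaluation.bigcode_evaluation_harness import BigcodeEvalParser, evaluate
# user_model and tokenizer are assumed to be loaded beforehand (for example with Hugging Face transformers)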
args = BigcodeEvalParser(
user_model=user_model,
tokenizer=tokenizer,
tasks="humaneval",
n_samples=100,
batch_size=10,
allow_code_execution=True,
)
results = evaluate(args)
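With n_samples=100, each HumanEval problem receives 100 generated solutions, which is what makes pass@k estimates possible. As a sketch of the commonly used unbiased estimator, pass@k = 1 - C(n-c, k) / C(n, k) for n samples of which c pass the tests. This formula comes from the HumanEval literature rather than from this README, and the harness's own implementation may differ in detail; pass_at_k is an illustrative name.

```python
from math import comb


def pass_at_k(n, c, k):
    """Unbiased pass@k estimate for n samples with c correct, drawing k of them."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # If fewer than k samples are incorrect, every draw of k contains a correct one.
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


# Example with small made-up counts: 10 samples, 5 of them correct, k = 1.
print(pass_at_k(10, 5, 1))
```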