custom_eval_agent

A basic agent to evaluate LLM performance on randomly sampled MMLU questions


A Custom LLM agent to evaluate a test model on the MMLU dataset

We build a basic agent with meta/meta-llama-3-8b-instruct as the seed model, which is responsible for conducting the various steps of the experiment. The seed model operates within a scaffold, and together they form the evaluation agent. We tested three randomly picked models hosted on Replicate on 45 questions randomly sampled from a subset of the MMLU dataset and report their accuracy below.
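The notebook's exact code is not reproduced here, but a minimal sketch of such an evaluation loop could look like the following. It assumes the Replicate Python client (with REPLICATE_API_TOKEN set) and the Hugging Face datasets library; the cais/mmlu subset choice, the prompt wording, and the ask_model helper are illustrative assumptions rather than the repository's implementation.

```python
import random
import re

import replicate                     # Replicate Python client (needs REPLICATE_API_TOKEN)
from datasets import load_dataset    # Hugging Face datasets

N_QUESTIONS = 45
TEST_MODEL = "meta/llama-2-7b-chat"  # one of the three evaluated models

# Assumption: sample questions from the public MMLU test split.
mmlu = load_dataset("cais/mmlu", "all", split="test")
random.seed(42)
sample = mmlu.select(random.sample(range(len(mmlu)), N_QUESTIONS))

def ask_model(model_id: str, question: str, choices: list[str]) -> str:
    """Query a Replicate-hosted model and return its raw text reply."""
    options = "\n".join(f"{'ABCD'[i]}. {c}" for i, c in enumerate(choices))
    prompt = (
        "Answer the following multiple-choice question with a single letter "
        f"(A, B, C, or D).\n\n{question}\n{options}\nAnswer:"
    )
    # Generation parameters vary per model on Replicate, so only the prompt is passed here.
    return "".join(replicate.run(model_id, input={"prompt": prompt}))

correct = wrong = errors = 0
for row in sample:
    try:
        reply = ask_model(TEST_MODEL, row["question"], row["choices"])
        match = re.search(r"[ABCD]", reply.upper())
        predicted = "ABCD".index(match.group()) if match else -1
        if predicted == row["answer"]:  # `answer` is the index of the correct choice
            correct += 1
        else:
            wrong += 1
    except Exception:
        errors += 1

print(f"{TEST_MODEL}: {correct} correct, {wrong} wrong, {errors} errors, "
      f"accuracy {100 * correct / N_QUESTIONS:.2f}%")
```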

Results

| Model id | Total questions | Correct | Wrong | Errors | Accuracy |
|---|---|---|---|---|---|
| meta/meta-llama-3-8b-instruct | 45 | 44 | 1 | 0 | 97.77% |
| meta/llama-2-7b-chat | 45 | 38 | 7 | 0 | 84.44% |
| mistralai/mistral-7b-instruct-v0.2 | 45 | 41 | 4 | 0 | 91.11% |

Report and Logs

The report can be accessed here, and the logs can be accessed here.

Notes

  1. The agent is a very basic scaffold around Llama 3 and delegates most of the decision-making to the LLM, which makes the evaluation more robust (see the sketch after this list).
  2. The MMLU subset is intentionally small, chosen to keep the project's scope and runtime manageable.
  3. The model scaffold can be redesigned to include more fine-grained control over the agent's behaviour.
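One way to read note 1: instead of hard-coding answer parsing, the scaffold can hand the grading decision back to the seed model. The sketch below assumes the same Replicate client; the prompt wording and the judge_answer helper are hypothetical, not taken from the repository.

```python
import replicate

SEED_MODEL = "meta/meta-llama-3-8b-instruct"

def judge_answer(question: str, reference: str, reply: str) -> bool:
    """Hypothetical helper: ask the seed model whether the test model's reply is correct."""
    prompt = (
        "You are grading an answer to a multiple-choice question.\n"
        f"Question: {question}\n"
        f"Reference answer: {reference}\n"
        f"Model reply: {reply}\n"
        "Respond with exactly one word: CORRECT or WRONG."
    )
    verdict = "".join(replicate.run(SEED_MODEL, input={"prompt": prompt}))
    return verdict.strip().upper().startswith("CORRECT")
```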

LICENSE

MIT