We build a basic agent with meta/meta-llama-3-8b-instruct as the seed model, which conducts the various steps of the experiment. The seed model operates within a scaffold, and together they form the evaluation agent. We tested 3 randomly picked models from the Replicate repository on 45 randomly sampled questions from a subset of the MMLU dataset and report their accuracy below.
| Model ID | Total questions | Correct | Wrong | Errors | Accuracy |
|---|---|---|---|---|---|
| meta/meta-llama-3-8b-instruct | 45 | 44 | 1 | 0 | 97.78% |
| meta/llama-2-7b-chat | 45 | 38 | 7 | 0 | 84.44% |
| mistralai/mistral-7b-instruct-v0.2 | 45 | 41 | 4 | 0 | 91.11% |
The report can be accessed here, and the logs can be accessed here.
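As a rough illustration of this setup, the sketch below shows how such an evaluation loop could look, assuming the Replicate Python client and the Hugging Face `datasets` package. The prompt format, answer parsing, and choice of MMLU subset (`abstract_algebra`) are assumptions for illustration, not the project's actual code.

```python
# Hypothetical sketch of the evaluation loop. Model IDs come from the table
# above; everything else (prompts, parsing, subset) is assumed.
import random
import re

import replicate
from datasets import load_dataset

MODELS = [
    "meta/meta-llama-3-8b-instruct",
    "meta/llama-2-7b-chat",
    "mistralai/mistral-7b-instruct-v0.2",
]

def format_prompt(item):
    # MMLU items carry a question, four choices, and the index of the answer.
    choices = "\n".join(f"{letter}. {text}" for letter, text in zip("ABCD", item["choices"]))
    return (
        "Answer the multiple-choice question with a single letter (A-D).\n\n"
        f"{item['question']}\n{choices}\nAnswer:"
    )

def evaluate(model_id, questions):
    correct = wrong = errors = 0
    for item in questions:
        try:
            # replicate.run streams output tokens for these models;
            # join them into a single string before parsing.
            output = "".join(replicate.run(model_id, input={"prompt": format_prompt(item)}))
            match = re.search(r"\b([ABCD])\b", output)
            predicted = "ABCD".index(match.group(1)) if match else -1
            if predicted == item["answer"]:
                correct += 1
            else:
                wrong += 1
        except Exception:
            errors += 1
    return correct, wrong, errors, 100 * correct / len(questions)

# Sample 45 questions from an MMLU subset, as in the experiment.
subset = load_dataset("cais/mmlu", "abstract_algebra", split="test")
questions = random.sample(list(subset), 45)
for model_id in MODELS:
    print(model_id, evaluate(model_id, questions))
```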
- The agent is a very basic scaffold around Llama 3 and delegates most of the decision-making work to the LLM, which makes for a more robust evaluation agent (see the sketch after this list).
- The MMLU subset is deliberately small, chosen to fit the scope and time constraints of the project.
- The model scaffold can be redesigned to include more fine-grained control over the agent's behaviour.
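A minimal sketch of what such a delegating scaffold could look like is given below. The `run_agent` loop, tool names, and prompts are hypothetical and only illustrate the idea of letting the seed model pick each step; they are not the project's actual implementation.

```python
# Hypothetical scaffold sketch: the scaffold only loops and executes tools,
# while the seed model decides what to do next at every step.
import replicate

SEED_MODEL = "meta/meta-llama-3-8b-instruct"

def ask_seed(prompt):
    # Join the streamed tokens into one string.
    return "".join(replicate.run(SEED_MODEL, input={"prompt": prompt}))

def run_agent(task, tools):
    # `tools` maps a tool name to a zero-argument callable returning a string.
    history = [f"Task: {task}"]
    for _ in range(10):  # hard cap so a confused model cannot loop forever
        decision = ask_seed(
            "\n".join(history)
            + f"\nAvailable tools: {', '.join(tools)}"
            + "\nReply with the name of the tool to run next, or DONE."
        ).strip()
        if decision.startswith("DONE"):
            break
        result = tools.get(decision, lambda: f"unknown tool: {decision}")()
        history.append(f"Ran {decision}: {result}")
    return history
```

Because all control flow beyond the loop cap lives in the prompt, adding fine-grained control (as noted above) would mean moving some of these decisions from the LLM back into the scaffold code.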
License: MIT