AI Foundational Skill Evaluations

I am interested in whether LLM-based AIs can be reliable in the ways that we expect computers to be reliable.

This is a collection of evaluations designed to test AIs on "foundational skills" - basic tasks that we expect them to be good at, because a simple computer program can solve them, but on which LLMs in reality fall significantly short.

For many or most of these tasks, it's probably not reasonable to expect an LLM to be good at them, and the most likely eventual architecture for performing well on them is some sort of hybrid. For example, LLMs are currently very bad at arithmetic. It is, in fact, probably unreasonable to ask an LLM to do arithmetic - it's horrendously inefficient compared to doing it directly - but the problem is that current generation LLMs "believe" they can do this sort of task when they cannot. This results in interactions like the following:

[Screenshot: ChatGPT getting a sum confidently wrong]

(The correct answer here is 325657908, versus ChatGPT's very confident 325658908.)

This is, in a sense, a variant of the hallucination problem. The LLM isn't exactly hallucinating a fact about the world - it's just making a straightforward error - but it is hallucinating a capability it doesn't have, because it's trained on data generated by people who do have that capability. It needs to either gain that capability or lose the "belief" that it has it.

My goal with this project is to find many broad categories of problems like this, along with specific problems that LLMs struggle with, in the hope that these can be used to improve AI reliability.

General approach

These evaluations work by treating AIs like normal software that we can write tests for: ask the AI to perform a task, then assert properties of the answer. Theoretically a good AI should pass 100% of the time. In practice, in my initial results, pass rates fall well short of this even on state-of-the-art models.
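As a rough illustration of the "ask, then assert" idea, here is a minimal sketch in Python. Everything in it is an assumption made for the example rather than this project's actual code: `stub_model` is a hypothetical stand-in for a real LLM call, and `addition_passes` and `pass_rate` are illustrative names.

```python
import re
from typing import Callable, Iterable, Tuple

# A "model" here is any function from a prompt string to a reply string.
Model = Callable[[str], str]


def stub_model(prompt: str) -> str:
    """Hypothetical stand-in for a real LLM call: adds the two integers in the prompt."""
    a, b = (int(n) for n in re.findall(r"\d+", prompt))
    return str(a + b)


def addition_passes(model: Model, a: int, b: int) -> bool:
    """Ask the model for a sum and check the reply against the known answer."""
    reply = model(f"What is {a} + {b}? Reply with just the number.")
    return reply.strip() == str(a + b)


def pass_rate(model: Model, cases: Iterable[Tuple[int, int]]) -> float:
    """Fraction of (a, b) cases on which the model's reply is exactly right."""
    results = [addition_passes(model, a, b) for a, b in cases]
    return sum(results) / len(results)


cases = [(2, 2), (17, 25), (123456, 654321)]
print(pass_rate(stub_model, cases))  # the stub always answers correctly
```

Swapping `stub_model` for a function that calls a real model would turn this into a very small evaluation; the fixed list of cases is the part that property-based testing replaces.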

Importantly, these tests are written using property-based testing, drawing on my extensive work on the subject in developing Hypothesis. This means that each evaluation is run on a randomised set of problems, generated at the time of evaluation, which ensures good coverage across a wide distribution.
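To make this concrete, the following is a minimal sketch of what a property-based evaluation might look like with Hypothesis. It is not taken from this project: `stub_model` is a hypothetical placeholder for a real LLM call, and the operand range and number of examples are arbitrary choices for illustration.

```python
import re

from hypothesis import given, settings, strategies as st


def stub_model(prompt: str) -> str:
    """Hypothetical stand-in for a real LLM call: adds the two integers in the prompt."""
    a, b = (int(n) for n in re.findall(r"\d+", prompt))
    return str(a + b)


# The problem distribution: pairs of non-negative integers up to a billion.
operands = st.integers(min_value=0, max_value=10**9)


# database=None stops Hypothesis from saving examples to disk;
# deadline=None disables the per-example time limit, which a slow model call would hit.
@settings(max_examples=50, deadline=None, database=None)
@given(a=operands, b=operands)
def test_addition(a: int, b: int) -> None:
    """Property: the model's reply to 'a + b' is exactly the true sum."""
    reply = stub_model(f"What is {a} + {b}? Reply with just the number.")
    assert reply.strip() == str(a + b)
```

Running `test_addition()` directly, or collecting it with pytest, draws fresh operand pairs from the strategy on each run, so no fixed list of problems exists to be memorised.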

Property-based testing is one of the most effective techniques we have for writing tests in conventional software, but has two particular advantages for LLMs:

  1. It makes it very easy to create new, unusually high-quality evaluations without a significant amount of manual effort. This is similar to its benefit for conventional software testing, but unlike with conventional software, example-based testing is manifestly not good enough for LLMs because of their varied and non-deterministic behaviour.
  2. There doesn't need to be a test/train split, so there is no risk of training on the evaluation. Indeed, training on the evaluation is a perfectly reasonable thing to do. Getting a high score on the evaluation intrinsically means that the AI is good at the task (with certain caveats about distribution - it's possible that it is still bad at problems that are off-distribution for the evaluation, but if this turns out to be the case then the distribution is easy to change).
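The caveat about distribution in the second point can be sketched as follows, assuming an evaluation that takes its problem distribution as a parameter. The names `make_addition_eval` and `small_only_model` are hypothetical; the latter is a deliberately flawed stub that is only correct on small operands, standing in for a model that does well on one distribution and badly off it.

```python
import re

from hypothesis import given, settings, strategies as st


def small_only_model(prompt: str) -> str:
    """Hypothetical model stub: correct for operands below 1000, off by one otherwise."""
    a, b = (int(n) for n in re.findall(r"\d+", prompt))
    error = 0 if max(a, b) < 1000 else 1
    return str(a + b + error)


def make_addition_eval(model, operands):
    """Build an addition evaluation whose problems are drawn from `operands`."""

    @settings(max_examples=50, deadline=None, database=None)
    @given(a=operands, b=operands)
    def evaluation(a, b):
        reply = model(f"What is {a} + {b}? Reply with just the number.")
        assert reply.strip() == str(a + b)

    return evaluation


# Same evaluation, two distributions: only the strategy changes.
in_distribution = make_addition_eval(small_only_model, st.integers(0, 999))
shifted = make_addition_eval(small_only_model, st.integers(1000, 10**12))
```

Here `in_distribution()` passes while `shifted()` raises an `AssertionError`, which is the situation described above: a high score on one distribution does not rule out failures elsewhere, but checking elsewhere only requires passing a different strategy.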