UKGovernmentBEIS/inspect_ai

[Question] Scorer execution order enforcement

Closed this issue · 4 comments

Hi,
Is there a way to enforce the order in which scorers execute? For example, if I have two scorers (scorer-a and scorer-b) as part of the task, is there a way to start evaluating scorer-b only after the evaluation for scorer-a has completed?
I'm trying to find a simple way to introduce a custom scorer that evaluates the output of another scorer by storing/retrieving key/value pairs in the sample "store". For example, after model_graded_fact completes and stores some data into the "store", a subsequent scorer would read the "store" and apply additional logic.
For now I was considering two options:

  • write a custom scorer that replicates the built-in model_graded_fact and adds the additional functionality
  • introduce a "completion" flag into the "Store" that is set by a wrapper around model_graded_fact; the new scorer would wait for that flag to be populated, indicating that model_graded_fact has completed (a sketch of this wrapper is below)
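For illustration, a minimal sketch of the second option (the scorer name and store keys below are just placeholders I made up):

```python
from inspect_ai.scorer import Score, Target, accuracy, model_graded_fact, scorer, stderr
from inspect_ai.solver import TaskState


@scorer(metrics=[accuracy(), stderr()])
def graded_fact_with_store():
    # wrap the built-in model_graded_fact scorer
    base = model_graded_fact()

    async def score(state: TaskState, target: Target) -> Score:
        result = await base(state, target)
        # record the grade and a completion flag for a downstream scorer
        state.store.set("graded_fact_value", result.value)
        state.store.set("graded_fact_done", True)
        return result

    return score
```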

@dragonstyle can probably comment better on this, but I think for this use case you might want to create a single scorer that produces multiple metrics: https://inspect.ai-safety-institute.org.uk/scorers.html#scorer-with-multiple-values
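Roughly, the pattern from that link looks like this (a sketch; the scorer name and value keys are arbitrary):

```python
from inspect_ai.scorer import Score, Target, mean, scorer, stderr
from inspect_ai.solver import TaskState


@scorer(
    metrics={
        "a_count": [mean(), stderr()],
        "e_count": [mean(), stderr()],
    }
)
def letter_count():
    async def score(state: TaskState, target: Target) -> Score:
        answer = state.output.completion
        # a single Score whose dict value resolves into two named scores
        return Score(
            value={"a_count": answer.count("a"), "e_count": answer.count("e")},
            answer=answer,
        )

    return score
```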

Are the scorers executed sequentially (for a given Sample)? Is there a facility to exchange information between the scorers other than the "Store"?
I could not find scorer information in the TaskState class.

Scorers are run sequentially for each sample. The store is the only method of coordination between scorers, I believe.
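So, assuming the wrapper sketched above runs first (i.e. it is listed before this scorer in the task), a downstream scorer could read those hypothetical store keys like this:

```python
from inspect_ai.scorer import CORRECT, Score, Target, accuracy, scorer
from inspect_ai.solver import TaskState


@scorer(metrics=[accuracy()])
def store_reader():
    async def score(state: TaskState, target: Target) -> Score:
        # read the values written by the earlier wrapper scorer
        if not state.store.get("graded_fact_done", False):
            return Score(value=0, explanation="model_graded_fact has not run yet")
        graded_value = state.store.get("graded_fact_value")
        # ... additional logic based on graded_value goes here ...
        return Score(value=1 if graded_value == CORRECT else 0)

    return score
```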

I agree with @Jallaire here that if the scoring gets very complex, it will often make sense to just create your own standalone scorer that emits a dict[str, float] (or whatever). Emitting a dictionary as a score value is handled well by the rest of the system (you can map metrics onto each score in the dictionary, for example) and it will ultimately resolve into multiple named scores in the results.

This would also let you write a fairly readable piece of code to deal with the complexity (coordination between scorers through loose data structures and ordering will, I think, be inherently fragile and hard to reason about). I don't fully know your use case, so take my thoughts with appropriate grains of salt!

@dragonstyle , thank you! The use case is to use model_graded_fact as step one. In step two, if model_graded_fact evaluates the answer as incorrect, an additional LLM call produces a reason for the incorrect answer. Both the "Correct/Incorrect" result from model_graded_fact and the reason code should appear as separate scores in the inspect view. Since I wanted to use the built-in model_graded_fact, I wrote a wrapper that intercepts its response and writes the information needed for the next step into the "Store". I might combine both steps into a single scorer as you have suggested.
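For reference, the combined version I have in mind would look roughly like this (a rough sketch: the scorer name, value keys, and follow-up prompt are placeholders, and I'm assuming a dict value key without an entry in metrics simply gets no aggregate metric):

```python
from inspect_ai.model import get_model
from inspect_ai.scorer import CORRECT, Score, Target, accuracy, model_graded_fact, scorer, stderr
from inspect_ai.solver import TaskState


@scorer(metrics={"graded_fact": [accuracy(), stderr()]})
def graded_fact_with_reason():
    base = model_graded_fact()

    async def score(state: TaskState, target: Target) -> Score:
        # step 1: grade the answer with the built-in scorer
        graded = await base(state, target)

        # step 2: if incorrect, ask the active model why
        reason = ""
        if graded.value != CORRECT:
            result = await get_model().generate(
                "The following answer was judged incorrect.\n\n"
                f"Question: {state.input_text}\n"
                f"Answer: {state.output.completion}\n"
                f"Target: {target.text}\n\n"
                "Briefly explain why the answer is incorrect."
            )
            reason = result.completion

        # both values appear as separate named scores in inspect view;
        # the free-form reason has no aggregate metric attached
        return Score(
            value={"graded_fact": graded.value, "reason": reason},
            explanation=graded.explanation,
        )

    return score
```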