stanford-crfm/helm

RAFT evaluation

divyashan opened this issue · 2 comments

Hi! Thanks for all your work to make HELM available.

I had a few questions about the RAFT evaluations.

  1. What set of examples are the LLMs evaluated on? I noticed this comment on Line 105 of the RAFT scenario code here: "# Note: Only using public labeled instances now. Check if we can get the hidden test set labels." My understanding is that there are 50 public labeled examples available for each task included in RAFT, but the reported EM numbers imply that more than 50 examples were evaluated.
  2. How do you produce a distribution of predicted probabilities over classes with each model? I read the following in the original RAFT paper, but I'm not sure whether it matches the HELM methodology for producing class-specific predicted probabilities (the relevant passage is pasted below for convenience). Any pointers would be much appreciated!

"we retrieve GPT-3’s 100 most likely next tokens using the davinci engine. For each class, we assign the probability that its first token is generated. We then normalize the probabilities to sum to 1. For the B77 dataset, multiple labels share the same first token so we prepend a numerical prefix such as “1. ” to each class."

Let me know if I've misunderstood anything here!

Thanks for your questions @divyashan!

  1. We use the 50 publicly labeled examples and split them into 10 in-context learning examples and 40 evaluation examples. The results page shows more than 40 requests because we ran 3 different trials with different samples of in-context learning examples, and within each trial we also run additional requests for perturbations / data augmentations (e.g. dialect perturbations); see the rough sketch after the example prompt below.
  2. We don't use the probability-based method from the original RAFT paper (though we do use probability-based methods for other scenarios in HELM). Instead, we directly prompt the model to generate the complete label name as text output, using instructions and in-context learning examples. Example prompt:
Label the sentence based on whether it is related to an adverse drug effect (ADE). Details are described below:
Drugs: Names of drugs and chemicals that include brand names, trivial names, abbreviations and systematic names were annotated. Mentions of drugs or chemicals should strictly be in a therapeutic context. This category does not include the names of metabolites, reaction byproducts, or hospital chemicals (e.g. surgical equipment disinfectants).
Adverse effect: Mentions of adverse effects include signs, symptoms, diseases, disorders, acquired abnormalities, deficiencies, organ damage or death that strictly occur as a consequence of drug intake.
Possible labels:
1. ADE-related
2. not ADE-related

Sentence: A challenge with clozapine was feasible and showed no clinical symptoms of eosinophilia.
Label: not ADE-related

Sentence: CONCLUSIONS: These results suggest that clozapine may cause TD; however, the prevalence is low and the severity is relatively mild, with no or mild self-reported discomfort.
Label: ADE-related

Sentence: Best-corrected visual acuity measurements were performed at every visit.
Label: not ADE-related

Sentence: These cases were considered unusual in light of the short delay of their onset after initiation of immunosuppressive therapy and their fulminant course: 3 of these patients died of PCP occurring during the first month of treatment with prednisone.
Label: ADE-related

Sentence: The INR should be monitored more frequently when bosentan is initiated, adjusted, or discontinued in patients taking warfarin.
Label: not ADE-related

Sentence: Pulses have been given for periods up to three years without evident toxicity.
Label:
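
To make the counting in (1) concrete, here is a rough sketch of where the request numbers come from. This is illustration only, with a made-up function name; it is not the code path HELM actually uses:

import random

# Split the 50 public labeled examples into in-context and evaluation sets,
# resampling the in-context examples for each trial.
def make_trials(examples, num_trials=3, num_in_context=10, seed=0):
    rng = random.Random(seed)
    trials = []
    for _ in range(num_trials):
        shuffled = list(examples)
        rng.shuffle(shuffled)
        in_context, evaluation = shuffled[:num_in_context], shuffled[num_in_context:]
        trials.append((in_context, evaluation))
    return trials

# 50 public examples -> 10 in-context + 40 evaluation instances per trial;
# 3 trials x 40 evaluation instances = 120 base requests, before any
# perturbed (e.g. dialect-augmented) copies are counted on top.
trials = make_trials(range(50))
assert all(len(evaluation) == 40 for _, evaluation in trials)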

Thanks so much for the detailed answer!!