openai/weak-to-strong

Observing eval accuracy considerably lower than reported?

knagrecha opened this issue · 12 comments

Hi, thanks for open-sourcing this code. I'm noticing that my tests with GPT-2 variants show considerably lower eval accuracies than what's reported in the paper & charts. I'm using the command provided in the README. I do not think the eval code itself is incorrect --- testing it with LLaMA shows much higher eval accuracies (as I would expect). But I cannot replicate the GPT-2 results; any pointers on what the issue might be?

As an example, on sciq:

  • GPT-2-medium: 0.43 accuracy (0.50 after I lowered the learning rate)
  • LLaMA-7B (ground truth): 0.84
  • LLaMA-7B (transferred): 0.43 (0.81 after I lowered the learning rate)

0.43 is worse than random, so either something is wrong with the ML there or your eval set isn't big enough.

Yeah, I figured the eval set size was small, but I assumed the command in the README would work directly. Might test it out again later with a larger eval size.

10x'd the train/test sizes. New results on sciq with gpt2-med and llama-7b after a quick run:

GPT-2-Med ending acc: 0.661 +/- 0.006694460396477075

LLaMA-7B ending acc (gt): 0.866 +/- 0.015234434679370284

LLaMA-7B ending acc (transfer): 0.704 +/- 0.020414896521902825

Looks nice! Pretty closely aligned with the Qwen results, with slightly lower transfer efficacy. Hope others will add their OSS model eval results soon too.

Would you consider increasing the n_docs/n_test_docs values in the README command? The current values seem pretty low.

haha yeah, they are low! can update that

things to generally keep in mind:

  • things are somewhat noisy in general, even with a large dataset. results are cleaner when averaging across many seeds (see the sketch after this list). i'm not totally sure why, but i think they're noisier than our internal setup was
  • truncating the dataset to be smaller makes things even noisier
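
for example, here's a minimal sketch of what i mean by averaging across seeds; the per-seed accuracies below are just placeholder numbers standing in for repeated runs, not real results:

        import numpy as np

        # hypothetical ending accuracies from re-running the same config with different seeds
        accs = [0.655, 0.670, 0.661, 0.648, 0.672]

        mean = np.mean(accs)
        # standard error of the mean, reported in the same "acc +/- err" style as above
        sem = np.std(accs, ddof=1) / np.sqrt(len(accs))
        print(f"ending acc: {mean:.3f} +/- {sem:.3f}")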

Off-topic, but I am curious how you are thinking about labeling by a weak supervisor vs. criticism/scoring by a weak supervisor. I guess there are arguments in both directions about whether labeling or criticism is easier for a weak model.

Criticism may introduce even more noise due to hallucinations, but if alignment is framed as a "weaker human" supervising a strong model, criticism may intuitively be easier than labeling.

I am having the same noise issue on my side. Could it be because of the way the classification head is initialized? The paper says the head is initialized with the embedding weights of the tokens "0" and "1", whereas the code seems to initialize it differently.

I actually tried initializing using the unembeddings, and it didn't seem to help, but I didn't test very extensively. my hunch is it's not the issue.

by the way, there is substantial literature on the noisiness of fine-tuning, e.g. https://arxiv.org/pdf/2002.06305.pdf

Will look at this, thanks a lot. It would be nice to know how you initialized with the unembedding weights.

here's the code i used; i did it sort of hackily:

        # NOTE: this has to happen after the rest of the model is initialized.
        # GPT-2 ties its input embeddings and unembeddings, so the wte rows
        # double as unembedding vectors.
        unemb = self.transformer.wte.weight.data
        assert self.num_labels == 2
        inds = [
            11491,  # incorrect
            3376,   # correct
        ]
        # copy the two unembedding rows into the 2-way classification head
        new_data = unemb[inds, :]
        self.score.weight.data.copy_(new_data)
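
for comparison, a rough sketch of the "0"/"1" initialization the paper describes (this is not code from the repo; it assumes the same setup as the snippet above, i.e. a GPT-2 backbone with a 2-way `self.score` head, and looks the token ids up via the tokenizer rather than hardcoding them):

        from transformers import AutoTokenizer

        # same assumptions as above: self.transformer is a GPT-2 model and
        # self.score is a 2-way linear classification head
        tokenizer = AutoTokenizer.from_pretrained("gpt2")
        inds = [
            tokenizer.encode("0")[0],  # id of the "0" token (negative label)
            tokenizer.encode("1")[0],  # id of the "1" token (positive label)
        ]
        emb = self.transformer.wte.weight.data
        assert self.num_labels == 2
        self.score.weight.data.copy_(emb[inds, :])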

Thank you so much @WuTheFWasThat, will test it