evolutionaryscale/esm

Mismatch between the shapes of logits from ESM3 outputs and paper description

KatarinaYuan opened this issue · 2 comments

Hi team,

Thanks for the awesome work. I'd just like to confirm why the output dimensions of the head layers for each modality (https://github.com/evolutionaryscale/esm/blob/fe3052abc387dca58e76b444a4c9e25cb7d49ddf/esm/models/esm3.py#L159C8-L159C12) don't match the description in the paper.

        self.structure_head = RegressionHead(d_model, 4096)
        self.ss8_head = RegressionHead(d_model, 8 + 3)
        self.sasa_head = RegressionHead(d_model, 16 + 3)
        self.function_head = RegressionHead(d_model, 260 * 8)
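For context, each head projects the model's hidden states to that track's vocabulary size, so the last logits dimension equals the head's output dimension. A minimal sketch of that shape relationship (a stand-in linear projection, not the actual `RegressionHead` implementation; `d_model` and `seq_len` values are hypothetical):

```python
import numpy as np

def regression_head(d_model: int, output_dim: int):
    """Stand-in for a learned projection d_model -> output_dim.
    Sketch only: it just illustrates the output shapes, not the real module."""
    rng = np.random.default_rng(0)
    w = rng.standard_normal((d_model, output_dim))
    return lambda x: x @ w

d_model, seq_len = 1536, 10  # hypothetical sizes for illustration

# Same output dims as in esm3.py above.
heads = {
    "structure": regression_head(d_model, 4096),
    "ss8": regression_head(d_model, 8 + 3),
    "sasa": regression_head(d_model, 16 + 3),
    "function": regression_head(d_model, 260 * 8),
}

x = np.zeros((seq_len, d_model))  # dummy hidden states
shapes = {name: head(x).shape for name, head in heads.items()}
print(shapes)
```

So the per-track logits come out as `(seq_len, 4096)`, `(seq_len, 11)`, `(seq_len, 19)`, and `(seq_len, 2080)`, which is what the question is comparing against the paper's token counts.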

But the paper says:

- The sequence track has 29 tokens in total;
- Structure has 4096 tokens with 4 special tokens;
- Secondary structure has 10 tokens in total;
- SASA has 18 tokens in total;
- Function annotation has 255 tokens plus 3 special tokens.

It seems that the sequence tokens of ESM3 follow the design of ESM2, which actually has 33 tokens, but the vocabulary was expanded to 64 for speed improvements. https://huggingface.co/facebook/esm2_t12_35M_UR50D/blob/main/vocab.txt

Some tokens are unused. In the paper we never counted the "pad" token as a real token. Additionally, what @elttaes said is true: for structure tokens there are indeed 4051 tokens in the input, but we never decode any special structure tokens, which leads to 4096 in the output.
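Putting this together, the secondary-structure and SASA head sizes line up with the paper once the uncounted pad token is added back. A quick sanity check, assuming exactly one uncounted pad token per track (my reading of the comment above, not something stated in the code):

```python
# Paper counts (pad token excluded) vs. head output dims in esm3.py.
# Assumption: each track has exactly one pad token that the paper
# does not count as a real token.
PAD = 1

ss8_paper, ss8_head = 10, 8 + 3      # paper: 10 tokens; head: 11
sasa_paper, sasa_head = 18, 16 + 3   # paper: 18 tokens; head: 19

assert ss8_paper + PAD == ss8_head
assert sasa_paper + PAD == sasa_head
print("ss8 and sasa head sizes match paper counts + pad")
```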