Aarhus-Psychiatry-Research/psycop-common

fix: incredibly low loss after 4 epochs


Padding tokens are being used when computing the loss

You can check, but I believe the loss ignores the padding token ID (-1, I think). At the very least, it shouldn't be the case.
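For reference, a minimal standalone sketch of how ignore_index behaves (the shapes and values below are made up, not taken from the repo):

```python
import torch
from torch import nn

# Toy example: one sequence of 5 positions over a vocabulary of 10 tokens.
logits = torch.randn(5, 10)               # raw scores per position
labels = torch.tensor([3, 7, 2, -1, -1])  # last two positions are padding

loss_fn = nn.CrossEntropyLoss(ignore_index=-1)
loss = loss_fn(logits, labels)

# Same value when computed only over the non-padded positions,
# i.e. the padded positions contribute nothing to the loss.
manual = nn.CrossEntropyLoss()(logits[:3], labels[:3])
assert torch.isclose(loss, manual)
```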

Extremely homogeneous sequences – they are pretty homogeneous, but not entirely. Seems like a relatively likely explanation.

Def. a more likely explanation.
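If it would help to quantify that, here is a rough sketch (the function and argument names are hypothetical, not from the repo) that measures how concentrated the token distribution is across the non-padding positions:

```python
from collections import Counter

import torch


def top_token_share(token_ids: torch.Tensor, is_padding: torch.Tensor) -> float:
    """Fraction of non-padding tokens taken up by the single most common token.

    A value close to 1.0 means the sequences are close to trivially
    predictable, which on its own would drive the MLM loss down quickly.
    """
    real_tokens = token_ids[~is_padding.bool()].tolist()
    counts = Counter(real_tokens)
    return counts.most_common(1)[0][1] / len(real_tokens)
```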

> You can check, but I believe the loss ignores the padding token ID (-1, I think). At the very least, it shouldn't be the case.

We've tried looking at the test_train test. When running it and setting a breakpoint inside the PretrainerBEHRT.forward() scope, we see:

  • The input["is_padding"] tensor has values [0,0,0,1,1]

We then tried looking at the logits here:

logits = self.mlm_head(encoded_patients)

But none of them are -1. Does this imply that padding is not ignored when calculating the logits?

I'm way out of my depth here, so it might make sense for you to be hands-on?

You should look at labels (in self.loss(logits, labels)).
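The logits are just raw scores over the vocabulary for every position (padded or not), so you would never see -1 in them; it is the labels that have to carry the ignore value. Here is a sketch of a check you could run at that same breakpoint (the function name and the default ignore index are assumptions, and "labels" means whatever tensor ends up as the second argument to self.loss):

```python
import torch


def check_padding_is_ignored(
    labels: torch.Tensor, is_padding: torch.Tensor, ignore_index: int = -1
) -> None:
    """Verify that padded positions can never contribute to the loss."""
    padding_mask = is_padding.bool()
    # Every padded position must carry the ignore value...
    assert (labels[padding_mask] == ignore_index).all(), "padding leaks into the loss"
    # ...and at least some (masked, non-padded) positions must carry real token ids.
    assert (labels[~padding_mask] != ignore_index).any(), "nothing contributes to the loss"


# At the breakpoint, something like: check_padding_is_ignored(labels, input["is_padding"])
```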

It seems perfectly fine here:

masked_labels[~mask] = -1 # -1 will be ignored in loss function

self.loss = nn.CrossEntropyLoss(ignore_index=-1)

Might be worth setting the ignore index as a class attribute, to make sure that one is not changed without the other.
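Something along these lines, purely as an illustration of the pattern (the class below is a sketch, not the real PretrainerBEHRT):

```python
import torch
from torch import nn


class PretrainerSketch(nn.Module):
    # Single source of truth for the label value the loss skips, so the
    # masking code and the loss can't silently drift apart.
    ignore_index: int = -1

    def __init__(self, vocab_size: int, hidden_size: int) -> None:
        super().__init__()
        self.mlm_head = nn.Linear(hidden_size, vocab_size)
        self.loss = nn.CrossEntropyLoss(ignore_index=self.ignore_index)

    def mask_labels(self, labels: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
        masked_labels = labels.clone()
        masked_labels[~mask] = self.ignore_index  # ignored by self.loss
        return masked_labels
```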

Ah, great! I'll change it to an attribute. I assume it's the redundancy in the sequences that's making the pretraining very easy, then.

Lasse is using the GPU for a while, but after that I'll run a finetuning 👍