huggingface/open-muse

Attention mask?

pcuenca opened this issue · 1 comment

Like in Stable Diffusion, no attention mask appears to be used for input tokens:

input_ids = self.tokenizer(
    text,
    return_tensors="pt",
    padding="max_length",
    truncation=True,
    max_length=self.tokenizer.model_max_length,
).input_ids  # TODO: remove hardcode
input_ids = input_ids.to(self.device)
encoder_hidden_states = self.text_encoder(input_ids).last_hidden_state
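
For reference, if we did want to exclude PAD tokens, a minimal sketch might look like the following. This assumes the same pipeline attributes as above and that the text encoder accepts an attention_mask argument, as the transformers text models do:

# Hypothetical variant, not the current pipeline code: thread the tokenizer's
# attention mask through to the text encoder so PAD positions are excluded
# from self-attention.
text_inputs = self.tokenizer(
    text,
    return_tensors="pt",
    padding="max_length",
    truncation=True,
    max_length=self.tokenizer.model_max_length,
)
input_ids = text_inputs.input_ids.to(self.device)
attention_mask = text_inputs.attention_mask.to(self.device)
encoder_hidden_states = self.text_encoder(
    input_ids, attention_mask=attention_mask
).last_hidden_state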

But according to third-party analysis, this appears to have been a mistake all along. Do we have any insight into whether attention masks would help with prompt-image alignment?

These authors reckon it's better to train on unmasked text embeddings (even though that risks learning from PAD token embeddings):
huggingface/diffusers#1890 (comment)

As for inference: the user needs to be able to match whatever approach was used during training.
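
One way to gauge how much the two conventions diverge is to compare the encoder outputs with and without the mask. A rough, illustrative check, reusing the input_ids / attention_mask names from the sketch above:

# Illustrative check only: how different are the PAD-position embeddings when
# the mask is applied vs. not?
import torch

with torch.no_grad():
    unmasked = self.text_encoder(input_ids).last_hidden_state
    masked = self.text_encoder(
        input_ids, attention_mask=attention_mask
    ).last_hidden_state

pad_positions = attention_mask == 0  # True wherever the tokenizer padded
diff = (unmasked - masked)[pad_positions].abs().mean()
print(f"mean abs difference at PAD positions: {diff.item():.4f}")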

I thought Muse was a bit wackier though. It actually masks vision tokens:

https://github.com/lucidrains/muse-maskgit-pytorch/
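
For context, the masking there applies to the VQ image token ids rather than the text. A minimal sketch of that MaskGIT-style training mask (MASK_TOKEN_ID and the cosine schedule are my assumptions about the general approach, not code from either repo):

import math
import torch

MASK_TOKEN_ID = 8192  # hypothetical id reserved for the [MASK] vision token

def mask_image_tokens(image_token_ids: torch.Tensor):
    """Randomly replace a cosine-scheduled fraction of VQ token ids with [MASK]."""
    batch, seq_len = image_token_ids.shape
    # Sample a timestep r ~ U(0, 1) per example and convert it to a mask ratio
    # with the cosine schedule used in MaskGIT: ratio = cos(pi * r / 2).
    r = torch.rand(batch, device=image_token_ids.device)
    mask_ratio = torch.cos(r * math.pi / 2)
    # Mask each position independently with that per-example ratio.
    mask = torch.rand(batch, seq_len, device=image_token_ids.device) < mask_ratio[:, None]
    masked_ids = torch.where(
        mask, torch.full_like(image_token_ids, MASK_TOKEN_ID), image_token_ids
    )
    return masked_ids, mask  # mask marks the positions the model must predict

At inference time the standard MaskGIT procedure then unmasks these positions over several steps, re-masking the lowest-confidence predictions each round, which is a different game from masking PAD tokens in the text conditioning.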