huggingface/open-muse

Attention mask?

pcuenca opened this issue · 1 comment

Like in Stable Diffusion, no attention mask appears to be used for input tokens:

input_ids = self.tokenizer(
    text,
    return_tensors="pt",
    padding="max_length",
    truncation=True,
    max_length=self.tokenizer.model_max_length,
).input_ids  # TODO: remove hardcode
input_ids = input_ids.to(self.device)
encoder_hidden_states = self.text_encoder(input_ids).last_hidden_state
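
For reference, if we did want to exclude PAD tokens, a minimal sketch might look like the following. This assumes the same pipeline attributes as above and that the text encoder accepts an attention_mask argument, as the transformers text models do:

# Hypothetical variant, not the current pipeline code: thread the tokenizer's
# attention mask through to the text encoder so PAD positions are excluded
# from self-attention.
text_inputs = self.tokenizer(
    text,
    return_tensors="pt",
    padding="max_length",
    truncation=True,
    max_length=self.tokenizer.model_max_length,
)
input_ids = text_inputs.input_ids.to(self.device)
attention_mask = text_inputs.attention_mask.to(self.device)
encoder_hidden_states = self.text_encoder(
    input_ids, attention_mask=attention_mask
).last_hidden_state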

But according to third-party analysis, this appears to have been a mistake all along. Do we have any insight into whether attention masks would help with prompt-image alignment?

These authors reckon it's better to train on unmasked text embeddings (even though that risks learning from PAD token embeddings):
huggingface/diffusers#1890 (comment)

As for inference: the user needs to be able to match whatever approach was used during training.
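
One way to gauge how much the two conventions diverge is to compare the encoder outputs with and without the mask. A rough, illustrative check, reusing the input_ids / attention_mask names from the sketch above:

# Illustrative check only: how different are the PAD-position embeddings when
# the mask is applied vs. not?
import torch

with torch.no_grad():
    unmasked = self.text_encoder(input_ids).last_hidden_state
    masked = self.text_encoder(
        input_ids, attention_mask=attention_mask
    ).last_hidden_state

pad_positions = attention_mask == 0  # True wherever the tokenizer padded
diff = (unmasked - masked)[pad_positions].abs().mean()
print(f"mean abs difference at PAD positions: {diff.item():.4f}")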

I thought Muse was a bit wackier though. It actually masks vision tokens:

https://github.com/lucidrains/muse-maskgit-pytorch/
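
For context, the masking there applies to the VQ image token ids rather than the text. A minimal sketch of that MaskGIT-style training mask (MASK_TOKEN_ID and the cosine schedule are my assumptions about the general approach, not code from either repo):

import math
import torch

MASK_TOKEN_ID = 8192  # hypothetical id reserved for the [MASK] vision token

def mask_image_tokens(image_token_ids: torch.Tensor):
    """Randomly replace a cosine-scheduled fraction of VQ token ids with [MASK]."""
    batch, seq_len = image_token_ids.shape
    # Sample a timestep r ~ U(0, 1) per example and convert it to a mask ratio
    # with the cosine schedule used in MaskGIT: ratio = cos(pi * r / 2).
    r = torch.rand(batch, device=image_token_ids.device)
    mask_ratio = torch.cos(r * math.pi / 2)
    # Mask each position independently with that per-example ratio.
    mask = torch.rand(batch, seq_len, device=image_token_ids.device) < mask_ratio[:, None]
    masked_ids = torch.where(
        mask, torch.full_like(image_token_ids, MASK_TOKEN_ID), image_token_ids
    )
    return masked_ids, mask  # mask marks the positions the model must predict

At inference time the standard MaskGIT procedure then unmasks these positions over several steps, re-masking the lowest-confidence predictions each round, which is a different game from masking PAD tokens in the text conditioning.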