SerCeMan/fontogen

Accidentally used Encoder instead of Decoder?

DerEchteFeuerpfeil opened this issue · 2 comments

Hey there, great project!

There's just one thing I'm not sure whether I misunderstood or you did. You wanted to reproduce the architecture of IconShop, but to my understanding, IconShop is a decoder-only transformer (like a language model).
Excerpt from their paper:
[screenshot of the relevant excerpt from the IconShop paper]

I first thought this was just a minor mistake in your blog post, but there, too, you wrote encoder-only:

[screenshot of the blog post passage]

An encoder-only autoregressive transformer is self-contradictory, isn't it?

Also, here you used xFormerEncoderConfig instead of the (imo) correct xFormerDecoderConfig.

fontogen/model/model.py

Lines 201 to 227 in de1e50b

x_config = xFormerEncoderConfig(
    num_layers=num_layers,
    dim_model=d_model,
    feedforward_config={
        "name": "MLP",
        "dropout": 0.1,
        "activation": "gelu",
        "hidden_layer_multiplier": 4,
    },
    multi_head_config={
        "num_heads": nhead,
        "attention": BlockSparseAttention(
            num_heads=nhead,
            layout=BigBirdSparsityConfig(
                num_heads=nhead,
                block_size=block_size,
                num_global_blocks=1,
                num_random_blocks=2,
                num_sliding_window_blocks=int(3 * max_glyph_tokens / block_size)
            ).make_layout(seq_len=self.seq_len),
            block_size=block_size,
            dropout=0.1,
            causal=True,
        )
    },
)
self.transformer = xFormer(x_config)

I haven't thought through what the implications are. You correctly used causal attention, so there shouldn't be issues on that front. There might not even be an issue at all, as decoders (without cross-attention) are kinda just encoders with causal attention masking, but I don't know the xformers library and its inner workings. Just wanted to set the mental model straight.
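To make that point concrete, here is a tiny PyTorch sketch (not Fontogen code, and independent of xFormers) showing that the only thing separating "encoder-style" and "decoder-style" self-attention is the causal mask: the same nn.MultiheadAttention layer, once masked, can no longer look ahead. All shapes and dimensions below are made up for illustration.

import torch
import torch.nn as nn

seq_len, d_model, nhead = 8, 32, 4
x = torch.randn(1, seq_len, d_model)

# Strict upper-triangular boolean mask: position i may not attend to j > i.
causal_mask = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)

self_attn = nn.MultiheadAttention(d_model, nhead, batch_first=True)

# Bidirectional ("encoder-style") attention: every token sees every other token.
bidir_out, _ = self_attn(x, x, x)

# The same layer with a causal mask behaves autoregressively ("decoder-style").
causal_out, _ = self_attn(x, x, x, attn_mask=causal_mask)

# The last position may attend to everything in both variants, so its output
# matches; earlier positions differ because the masked variant cannot look ahead.
print(torch.allclose(bidir_out[:, -1], causal_out[:, -1], atol=1e-6))  # True
print(torch.allclose(bidir_out[:, 0], causal_out[:, 0], atol=1e-6))    # almost surely False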

Thanks for releasing your code and project!

Hey, @DerEchteFeuerpfeil! The article should indeed be saying decoder-only. Thank you for this, I'll fix it up!

Implementation-wise, if I understand correctly, there is no other way to implement a decoder-only transformer in xFormers, at least until facebookresearch/xformers#617 lands.

Alright, makes sense! It doesn't speak well for the library that a decoder-only architecture issue has been open for >1y 😅

When re-implementing parts of this for my project I used x_transformers by lucidrains instead.
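For reference, a rough sketch of what a decoder-only setup looks like in x_transformers (not code from this repo; vocabulary size, sequence length, and model dimensions are placeholders):

import torch
from x_transformers import TransformerWrapper, Decoder

model = TransformerWrapper(
    num_tokens=512,        # placeholder vocabulary size
    max_seq_len=1024,      # placeholder context length
    attn_layers=Decoder(   # Decoder = causal self-attention, no cross-attention
        dim=512,
        depth=8,
        heads=8,
    ),
)

tokens = torch.randint(0, 512, (1, 1024))
logits = model(tokens)     # shape (1, 1024, 512): next-token logits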

I'll close this issue for now, as it has served its purpose. Cheers!