lllyasviel/ControlNet

different setup of input_hint_block compared to paper?

liren-jin opened this issue · 1 comment

Hi, I noticed that the implementation of the tiny network that converts control images into feature space differs from the structure mentioned in the paper: "In particular, we use a tiny network E(·) of four convolution layers with 4 × 4 kernels and 2 × 2 strides (activated by ReLU, using 16, 32, 64, 128 channels, respectively)". The corresponding implementation should be here, right? (Correct me if I am wrong.)

ControlNet/cldm/cldm.py

Lines 147 to 163 in ed85cd1

self.input_hint_block = TimestepEmbedSequential(
    conv_nd(dims, hint_channels, 16, 3, padding=1),   # 3x3 kernels, not 4x4
    nn.SiLU(),                                        # SiLU, not ReLU
    conv_nd(dims, 16, 16, 3, padding=1),
    nn.SiLU(),
    conv_nd(dims, 16, 32, 3, padding=1, stride=2),    # downsample: H -> H/2
    nn.SiLU(),
    conv_nd(dims, 32, 32, 3, padding=1),
    nn.SiLU(),
    conv_nd(dims, 32, 96, 3, padding=1, stride=2),    # downsample: H/2 -> H/4
    nn.SiLU(),
    conv_nd(dims, 96, 96, 3, padding=1),
    nn.SiLU(),
    conv_nd(dims, 96, 256, 3, padding=1, stride=2),   # downsample: H/4 -> H/8
    nn.SiLU(),
    zero_module(conv_nd(dims, 256, model_channels, 3, padding=1))  # zero-initialized output conv
)
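
For comparison, a literal implementation of the description quoted from the paper might look like the sketch below. This is only my reading of that sentence, not code from the repository; the 3-channel RGB input and the padding of 1 are assumptions the paper does not spell out.

import torch.nn as nn

# Hypothetical encoder following the paper text literally: four convolution
# layers with 4x4 kernels and 2x2 strides, ReLU activations, and
# 16/32/64/128 channels. Input channels (3) and padding=1 are assumptions.
paper_style_hint_encoder = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=4, stride=2, padding=1),    # H -> H/2
    nn.ReLU(),
    nn.Conv2d(16, 32, kernel_size=4, stride=2, padding=1),   # H/2 -> H/4
    nn.ReLU(),
    nn.Conv2d(32, 64, kernel_size=4, stride=2, padding=1),   # H/4 -> H/8
    nn.ReLU(),
    nn.Conv2d(64, 128, kernel_size=4, stride=2, padding=1),  # H/8 -> H/16
    nn.ReLU(),
)

So the released code does differ from that description: eight 3×3 convolutions instead of four 4×4 ones, SiLU instead of ReLU, three stride-2 downsamples (8×) instead of four (16×), a 16/16/32/32/96/96/256 channel progression, and an extra zero-initialized output convolution.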

a tiny network E(·)

Hello, have you figured this out? Since I'm not very familiar with ControlNet, I haven't fully understood the code yet. This code corresponds to the tiny network E(·) in the paper, which converts the condition image into conditioning features, right? I hope you can help me clarify it. Thank you very much!
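
Yes, as far as I can tell that is the role of input_hint_block: it encodes the image-space hint down to the latent resolution so it can be added to the first feature map of the ControlNet branch. As a sanity check on the shapes, here is a self-contained re-creation of the block quoted above. Assumptions on my side: 2D convolutions (dims=2), a 3-channel hint (hint_channels=3), and model_channels=320 as in Stable Diffusion 1.x; zero_conv below imitates the repository's zero_module().

import torch
import torch.nn as nn

def zero_conv(in_ch, out_ch):
    # Mimics zero_module(conv_nd(...)): a conv whose parameters start at zero.
    conv = nn.Conv2d(in_ch, out_ch, 3, padding=1)
    nn.init.zeros_(conv.weight)
    nn.init.zeros_(conv.bias)
    return conv

hint_block = nn.Sequential(
    nn.Conv2d(3, 16, 3, padding=1), nn.SiLU(),
    nn.Conv2d(16, 16, 3, padding=1), nn.SiLU(),
    nn.Conv2d(16, 32, 3, padding=1, stride=2), nn.SiLU(),   # 512 -> 256
    nn.Conv2d(32, 32, 3, padding=1), nn.SiLU(),
    nn.Conv2d(32, 96, 3, padding=1, stride=2), nn.SiLU(),   # 256 -> 128
    nn.Conv2d(96, 96, 3, padding=1), nn.SiLU(),
    nn.Conv2d(96, 256, 3, padding=1, stride=2), nn.SiLU(),  # 128 -> 64
    zero_conv(256, 320),                                    # zeros at init
)

hint = torch.randn(1, 3, 512, 512)  # e.g. a Canny edge map as the hint
print(hint_block(hint).shape)       # torch.Size([1, 320, 64, 64])

The three stride-2 convolutions take a 512×512 hint to 64×64, which matches the latent resolution of a 512×512 image after the 8× VAE downsampling, so the output lines up spatially with the U-Net features it is added to.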