tangjiapeng/DiffuScene

What's the function of the learnable positional_embedding in the class DiffusionSceneLayout_DDPM?


While reading the training code, I found a learnable positional_embedding that is passed to the first block of Unet1D's downs, mid_blocks, and ups.

For example, in the code for Unet1D's downs:

    for block0, block1, attncross, block2, attn, downsample in self.downs:
        x = block0(x, context) 
        x = block1(x, t)
        h.append(x)

        x = attncross(x, context_cross) if self.text_condition else attncross(x)
        x = block2(x, t)
        x = attn(x)
        h.append(x)

        x = downsample(x)
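As a rough illustration of the `x = block0(x, context)` call above, here is a hypothetical stand-in for block0 that fuses the per-instance context with the noisy features by channel concatenation followed by a 1x1 convolution. The real DiffuScene block may condition differently (e.g. via cross-attention); this sketch only shows the data flow and the tensor shapes involved, and all sizes are illustrative assumptions:

```python
import torch
import torch.nn as nn

class ContextBlock(nn.Module):
    """Hypothetical conditioning block: concat context along channels, then project."""
    def __init__(self, dim, context_dim):
        super().__init__()
        self.proj = nn.Conv1d(dim + context_dim, dim, kernel_size=1)

    def forward(self, x, context):
        # x: (batch, dim, num_instances); context: (batch, num_instances, context_dim)
        context = context.transpose(1, 2)  # -> (batch, context_dim, num_instances)
        return self.proj(torch.cat([x, context], dim=1))

block0 = ContextBlock(dim=32, context_dim=64)
x = torch.randn(2, 32, 12)        # noisy per-instance features
context = torch.randn(2, 12, 64)  # per-instance positional condition
out = block0(x, context)
print(out.shape)  # torch.Size([2, 32, 12])
```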

The context is the instan_condition_f from the following code:

    instance_indices = torch.arange(self.sample_num_points).long().to(self.device)[None, :].repeat(batch_size, 1)
    instan_condition_f = self.positional_embedding[instance_indices, :]
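To make the shapes concrete, here is a minimal sketch of what that indexing produces, assuming positional_embedding is a learnable table with one vector per instance slot (the sizes below are illustrative, not the repo's actual hyperparameters):

```python
import torch
import torch.nn as nn

batch_size, sample_num_points, embed_dim = 2, 12, 64  # illustrative sizes

# Assumed declaration: one learnable embedding per instance slot.
positional_embedding = nn.Parameter(torch.randn(sample_num_points, embed_dim))

# Same indexing pattern as in the issue: every sample in the batch receives
# the full sequence of instance indices 0..sample_num_points-1.
instance_indices = torch.arange(sample_num_points).long()[None, :].repeat(batch_size, 1)
instan_condition_f = positional_embedding[instance_indices, :]

print(instan_condition_f.shape)  # torch.Size([2, 12, 64])
```

So each of the sample_num_points instance slots gets its own learned vector, broadcast across the batch.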

I wonder what the function of the positional_embedding is. Thanks for your help.

In the original implementation of Unet1D from 'https://github.com/lucidrains/denoising-diffusion-pytorch/blob/main/denoising_diffusion_pytorch/denoising_diffusion_pytorch_1d.py', there is neither block0 nor attncross. Is that one of the innovations of this paper?

The instance embedding encodes the positional information of each instance within a sequence.

It helps the denoiser differentiate between different object instances.
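A toy illustration of why this matters: two instances with identical noisy features look the same to a permutation-equivariant denoiser, but adding a per-instance embedding makes them distinguishable. (These are made-up tensors, not the actual DiffuScene ones.)

```python
import torch

x = torch.zeros(1, 2, 4)                   # (batch, instances, feature_dim): identical instances
pos = torch.arange(8.).reshape(2, 4)       # stand-in for two learned instance embeddings

assert torch.equal(x[0, 0], x[0, 1])       # indistinguishable before conditioning
x_cond = x + pos[None, :, :]               # add per-instance positional embedding
assert not torch.equal(x_cond[0, 0], x_cond[0, 1])  # now each slot carries its identity
```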

Thank you very much for your help!