Questions about implementation

Hi! Thanks for sharing training code!

While analyzing the implementation in detail, I came up with a few questions.

Pyramid-Flow/pyramid_dit/pyramid_dit_for_video_gen_pipeline.py

Line 451 in e4b02ef

clean_latent = latents_list[i_s][index::column_size] # [bs, c, t, h, w]

Why is the indexing [index::column_size] used here? Since latents_list[i_s] has shape [bs, c, t, h, w],
doesn't latents_list[i_s][index::column_size] just select a subset of the batch?
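
For reference, a strided slice along the first dimension returns every column_size-th sample starting at index, not a single sample. Below is a minimal sketch with a plain Python list standing in for the batch dimension. The sizes and the helper name select_column are made up for illustration and are not taken from the repository:

```python
# A plain list stands in for the batch dimension of latents_list[i_s].
# Both sizes below are hypothetical values chosen for illustration.
column_size = 3  # number of "columns" the batch is split into
batch_size = 2   # number of samples that land in each column

samples = [f"sample_{i}" for i in range(batch_size * column_size)]


def select_column(batch, index, column_size):
    """Return every column_size-th element of batch, starting at index."""
    return batch[index::column_size]


for index in range(column_size):
    # Each slice holds batch_size samples, one from each "row" of the batch.
    print(index, select_column(samples, index, column_size))
```

With these numbers each slice contains two samples, so the slices partition the batch into column_size groups of equal size.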

How does the video sync group work?

If I use 8 GPUs with the default parallel-group hyperparameters, sp_group_size and video_sync_group would both be 8.
Sequence parallelism already splits the long token sequence, so every GPU receives the same video input. Why is video_sync_group necessary then?

When video latents are extracted in advance, do all videos have the same fps? I ask because this line means that if "frames" is not specified in the annotation, the first 121 frames are extracted:

Pyramid-Flow/tools/extract_video_vae_latents.py

Line 118 in e4b02ef

frame_indexs = video_item['frames'] if 'frames' in video_item else list(range(self.num_frames))
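
The fallback in that line can be sketched in isolation. This is only an illustration: the function name select_frame_indices and the annotation entries are hypothetical, and the default of 121 is the value mentioned in the question, not one confirmed from the repository's configuration:

```python
def select_frame_indices(video_item, num_frames=121):
    """Use the annotation's 'frames' list if present, else the first num_frames indices."""
    if 'frames' in video_item:
        return video_item['frames']
    return list(range(num_frames))


# Hypothetical annotation entries; only the 'frames' key comes from the quoted line.
with_frames = {'video': 'clip_a.mp4', 'frames': [0, 8, 16, 24]}
without_frames = {'video': 'clip_b.mp4'}

print(select_frame_indices(with_frames))          # the explicitly listed indices
print(len(select_frame_indices(without_frames)))  # falls back to the first 121 indices
```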

Why multiply by 2 here? Is it to preserve the variance at each stage?

Pyramid-Flow/pyramid_dit/pyramid_dit_for_video_gen_pipeline.py

Line 424 in e4b02ef

cur_noise = F.interpolate(cur_noise, size=(height, width), mode='bilinear') * 2

Thanks!

Here are the answers to your questions:

latents_list[i_s][index::column_size] selects the batch of samples that belong to the same stage.
We do not use sequence parallelism in training. The sequence-parallel code is for multi-GPU inference. The video_sync_group parameter controls the group of processes that receive the same input sample.
We directly use 24 fps for training. The frames key lets you specify the frame indices you want to extract.
The multiplication keeps the variance of the noise equal to 1 after bilinear interpolation, so that it still follows a standard Gaussian.
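
The variance argument can be checked numerically. The sketch below rests on one assumption: for an exact factor-of-2 downsample, bilinear interpolation with PyTorch's default align_corners=False should reduce to averaging each non-overlapping 2x2 block. Averaging four independent unit-variance values gives variance 1/4 (standard deviation 1/2), so multiplying by 2 restores unit variance. The code uses only the standard library as a stand-in for F.interpolate, and the map size is arbitrary:

```python
import random
import statistics

random.seed(0)
H = W = 200  # hypothetical noise map size, chosen only for illustration

# Standard Gaussian noise on an H x W grid.
noise = [[random.gauss(0.0, 1.0) for _ in range(W)] for _ in range(H)]


def downsample_2x(x):
    """Average each non-overlapping 2x2 block (stand-in for 2x bilinear downsampling)."""
    return [
        [(x[2 * i][2 * j] + x[2 * i][2 * j + 1]
          + x[2 * i + 1][2 * j] + x[2 * i + 1][2 * j + 1]) / 4.0
         for j in range(len(x[0]) // 2)]
        for i in range(len(x) // 2)
    ]


def variance(x):
    """Population variance over all entries of a 2D list."""
    return statistics.pvariance([v for row in x for v in row])


down = downsample_2x(noise)
rescaled = [[2.0 * v for v in row] for row in down]

var_orig = variance(noise)         # close to 1
var_down = variance(down)          # close to 0.25
var_rescaled = variance(rescaled)  # close to 1 again
print(var_orig, var_down, var_rescaled)
```

Without the factor of 2, the noise at each lower-resolution stage would have half the standard deviation of the stage above it.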

Thanks!

Another question:

Is I2V training supported? The current implementation seems to support only T2V.

Theoretically, it naturally performs I2V training during autoregressive training (since the first frame is an image). However, we have not explicitly optimized for I2V, so the performance may be suboptimal. We are working on some improvements and will share them in due time.

Yes, that sounds right. Autoregressive training naturally performs I2V training.

Looking forward to the improvements! Thanks.

Another question:

Then why is the scale factor for the first frame different from the one for the remaining frames, given that the same VAE is used?
Pyramid-Flow/pyramid_dit/pyramid_dit_for_video_gen_pipeline.py

Line 582 in e4b02ef

# is video

Great observation! Please refer to #28 (comment).

Thanks for the quick answer!