/IMF

Implicit Motion Function - (unofficial) Microsoft recreation

Primary LanguagePython

Implicit-Motion-Function (unofficial)

see current training here https://wandb.ai/snoozie/IMF

Image

overfitting example - https://wandb.ai/snoozie/IMF/runs/xscj3hjo?nw=nwusersnoozie

Image

IMF (Implicit Motion Function) Training Outline

1. Network Architecture

A. Encoders

  • Dense Feature Encoder (EF): Multi-scale feature extraction
  • Latent Token Encoder (ET): Compact latent representation

B. Implicit Motion Function (IMF)

  • Latent Token Decoder (IMFD): Motion feature generation
  • Implicit Motion Alignment (IMFA): Feature alignment

C. Decoder

  • Frame Decoder (DF): Reconstructs the current frame

2. Input Processing

  • Current frame: x_current
  • Reference frame: x_reference

3. Training Loop

For each iteration:

A. Forward Pass

  1. Dense Feature Encoding:

    • Extract multi-scale features f_r from x_reference using EF
  2. Latent Token Encoding:

    • Generate t_r = ET(x_reference)
    • Generate t_c = ET(x_current)
  3. Latent Token Decoding:

    • Generate motion features m_r = IMFD(t_r)
    • Generate motion features m_c = IMFD(t_c)
  4. Implicit Motion Alignment:

    • For each scale:
      • Align features: aligned_feature = IMFA(m_c, m_r, f_r)
  5. Frame Decoding:

    • Reconstruct current frame: x_reconstructed = DF(aligned_features)

B. Loss Calculation

  1. Pixel-wise Loss:

    • L_p = ||x_reconstructed - x_current||_1
  2. Perceptual Loss:

    • L_v = VGG_loss(x_reconstructed, x_current)
  3. GAN Loss:

    • L_G = -E[D(x_reconstructed)]
    • L_D = E[max(0, 1 - D(x_current)) + max(0, 1 + D(x_reconstructed))]
  4. Total Loss:

    • L_total = λ_p * L_p + λ_v * L_v + λ_G * L_G

C. Optimization

  1. Update Generator (IMF model):

    • Backpropagate L_total
    • Update parameters using Adam optimizer
  2. Update Discriminator:

    • Backpropagate L_D
    • Update parameters using Adam optimizer

4. Key Techniques

  • Style modulation in Latent Token Decoder (similar to StyleGAN2)
  • Cross-attention and Transformer blocks in Implicit Motion Alignment
  • Multi-scale feature processing
  • Adaptive discriminator augmentation (ADA) for small datasets

5. Logging and Evaluation

  • Save model checkpoints regularly
  • Generate sample reconstructions using both current and reference frames
  • Compute and log reconstruction quality metrics (e.g., PSNR, SSIM)
  • Track and visualize individual loss components

6. Hardware Utilization

  • Leverage mixed-precision training (FP16)
  • Implement multi-GPU training using distributed data parallelism

This outline captures the essence of IMF training, incorporating the key components and processes described in the IMF paper. The actual implementation involves careful tuning of hyperparameters and optimization techniques to achieve the best results.

Intuition: The Implicit Motion Function (IMF) is like a sophisticated video compression and reconstruction system that works by understanding the essence of motion between video frames, rather than just tracking pixel movements.

Analogy: The Art Gallery Curator

Imagine you're a curator at an art gallery that specializes in "living paintings" - artworks that change over time. Your job is to efficiently store these paintings and recreate them on demand. Here's how the IMF works, explained through this analogy:

  1. Dense Feature Encoder (The Detailed Sketcher): This is like an artist who quickly sketches the key details of each painting. Instead of copying every brush stroke, they capture the essence - the shapes, colors, and textures that make the painting unique.

  2. Latent Token Encoder (The Essence Capturer): This is like a perfumer who can capture the "essence" of a scent in a tiny vial. For each painting, they distill its core characteristics into a small, concentrated form. This essence represents not just how the painting looks, but how it tends to change over time.

  3. Latent Token Decoder (The Essence Interpreter): This is like a magician who can take the vial of essence and use it to conjure up a ghostly image of the original painting. It's not perfect, but it captures the key features and potential changes.

  4. Implicit Motion Alignment (The Change Predictor): This is like a psychic who can look at two versions of a painting and understand exactly how one transformed into the other. They don't just see the surface changes, but understand the underlying patterns and rules of how the painting evolves.

  5. Frame Decoder (The Reconstruction Artist): This is the artist who takes all the information from the others - the detailed sketch, the essence, and the understanding of changes - and recreates the full painting in its new state.

Now, imagine you're storing and recreating a living painting:

  1. You start with two versions of the painting - the "reference" (how it looked before) and the "current" (how it looks now).

  2. The Detailed Sketcher quickly captures the key features of the reference painting.

  3. The Essence Capturer distills both the reference and current paintings into their essences.

  4. The Essence Interpreter takes these essences and creates ghostly versions of both paintings.

  5. The Change Predictor compares these ghostly versions and figures out exactly how the painting changed.

  6. Finally, the Reconstruction Artist uses the detailed sketch of the reference, combined with the understanding of changes, to recreate the current painting.

The magic of this system is that you don't need to store every detail of how the painting changes. Instead, you store the essence and learn the rules of how it evolves. This allows for efficient storage (compression) and accurate recreation (reconstruction) of the living paintings.

In the context of video:

  • The "living paintings" are video frames.
  • The "essence" is the compact latent representation.
  • The "changes" are the motions and transformations between frames.

By understanding the essence of each frame and the patterns of how frames change, the IMF can efficiently compress video information and accurately reconstruct frames, even handling complex motions and new content that traditional methods might struggle with.