/VASA-1-hack

Using Claude Sonnet 3.5 to forward (reverse) engineer code from VASA white paper - WIP - (this is for La Raza 🎷)

Primary LanguagePython

VASA-1-hack

This cut of code is using sonnet 3.5 to reverse engineer code from white paper (this is for La Raza) https://www.youtube.com/watch?v=bZ8AS300WH4

Image

MegaPortrait
https://github.com/johndpope/MegaPortrait-hack the models here seem to be ok - it does train / converge johndpope/MegaPortrait-hack#36

Status

this is fresh code spat out by sonnet 3.5 the earlier ai models were maybe light on the training data for diffusion - this latest code seems to nail it.

I update progress here - johndpope#20

✅ dataset is good

models seem good

stage 1 trainer I had working some time back - running into OOM problems... https://github.com/johndpope/MegaPortrait-hack/tree/5fb4398634c0e27c60d850d4ab997f9b1df2c3fe

https://wandb.ai/snoozie/vasa?nw=nwusersnoozie

python dataset_testing.py 




ulimit -n 65535



#!/bin/bash


# Debug memory
mprof run train_stage_1.py



# 1. STAGE 1 - Basic single GPU training
accelerate launch --mixed_precision fp16 train_stage1.py \
    --config configs/training/stage1-base.yaml

# 2. Multi-GPU training on a single machine
accelerate launch \
    --multi_gpu \
    --mixed_precision fp16 \
    --gradient_accumulation_steps 4 \
    train_stage1.py \
    --config configs/training/stage1-base.yaml

# 3. Distributed training across multiple machines
accelerate launch \
    --multi_gpu \
    --num_processes 8 \
    --num_machines 2 \
    --machine_rank 0 \
    --main_process_ip "master_node_ip" \
    --main_process_port 29500 \
    --mixed_precision fp16 \
    --gradient_accumulation_steps 4 \
    train_stage1.py \
    --config configs/training/stage1-base.yaml

# 4. Debug mode with smaller batch size and fewer iterations
accelerate launch \
    --mixed_precision fp16 \
    train_stage1.py \
    --config configs/training/stage1-base.yaml \
    training.batch_size=4 \
    training.base_epochs=2 \
    training.sample_interval=10

# 5. Resume training from checkpoint
accelerate launch \
    --mixed_precision fp16 \
    train_stage1.py \
    --config configs/training/stage1-base.yaml \
    training.resume_path=checkpoints/stage1/checkpoint_epoch_10.pt

# 6. Override specific config values
accelerate launch \
    --mixed_precision fp16 \
    train_stage1.py \
    --config configs/training/stage1-base.yaml \
    training.batch_size=16 \
    training.lr=2e-4 \
    training.weight_decay=1e-2 \
    model.feature_dim=512

# 7. Training with different loss weights
accelerate launch \
    --mixed_precision fp16 \
    train_stage1.py \
    --config configs/training/stage1-base.yaml \
    loss.perceptual_weight=1.0 \
    loss.gan_weight=0.5 \
    loss.identity_weight=1.0 \
    loss.motion_weight=0.5

# 8. Training with specific GPU devices
CUDA_VISIBLE_DEVICES=0,1 accelerate launch \
    --multi_gpu \
    --mixed_precision fp16 \
    train_stage1.py \
    --config configs/training/stage1-base.yaml

# 9. Training with CPU offload (for limited GPU memory)
accelerate launch \
    --mixed_precision fp16 \
    --cpu_offload \
    train_stage1.py \
    --config configs/training/stage1-base.yaml

# 10. Training with gradient checkpointing (for memory efficiency)
accelerate launch \
    --mixed_precision fp16 \
    --gradient_accumulation_steps 8 \
    train_stage1.py \
    --config configs/training/stage1-base.yaml \
    model.use_gradient_checkpointing=true



# STAGE 2 - Single GPU training
python train_stage2.py

# Multi-GPU training with Accelerate
accelerate launch \
    --multi_gpu \
    --mixed_precision fp16 \
    --gradient_accumulation_steps 4 \
    train_stage2.py

# Distributed training with specific GPU configuration
accelerate launch \
    --multi_gpu \
    --mixed_precision fp16 \
    --gradient_accumulation_steps 4 \
    --num_processes 4 \
    --num_machines 1 \
    --machine_rank 0 \
    --main_process_port 29500 \
    train_stage2.py

VASA Classes Documentation

Core Model Classes

PositionalEncoding

class PositionalEncoding(nn.Module):
  • Adds position information to transformer embeddings
  • Arguments:
    • d_model: Dimension of model
    • dropout: Dropout rate
    • max_len: Maximum sequence length
  • Used for sequence position information in transformers

AudioEncoder

class AudioEncoder(nn.Module):
  • Encodes audio features using Wav2Vec2 architecture
  • Components:
    • Multiple convolutional layers
    • Layer normalization
    • Projection layer
  • Processes audio input for the model

VASAFaceEncoder

class VASAFaceEncoder(nn.Module):
  • Enhanced face encoder with disentangled representations
  • Components:
    • 3D Appearance Volume Encoder
    • Identity Encoder
    • Head Pose Encoder
    • Facial Dynamics Encoder
  • Creates separate representations for different face aspects

VASADiffusionTransformer

class VASADiffusionTransformer(nn.Module):
  • Diffusion Transformer with conditioning and CFG
  • Features:
    • 8-layer transformer architecture
    • Multiple embedding layers
    • Classifier-free guidance support
  • Handles motion generation and conditioning

Training Infrastructure

VASATrainer

class VASATrainer:
  • Main training orchestrator
  • Responsibilities:
    • Model initialization
    • Training loop management
    • Optimization handling
    • Distributed training support

VASAConfig

class VASAConfig:
  • Configuration management
  • Sections:
    • Training settings
    • Model parameters
    • Diffusion settings
    • CFG settings
    • Data settings

TrainingLogger

class TrainingLogger:
  • Logging functionality
  • Features:
    • Weights & Biases integration
    • File logging
    • Metric tracking
    • Video logging

CheckpointManager

class CheckpointManager:
  • Manages model checkpoints
  • Features:
    • Checkpoint saving/loading
    • Best model tracking
    • Checkpoint rotation

Data Processing

VASADataset

class VASADataset(Dataset):
  • Dataset implementation
  • Features:
    • Video frame processing
    • Audio feature extraction
    • Face attribute extraction
    • Gaze and emotion processing

VideoGenerator

class VideoGenerator:
  • Video generation pipeline
  • Features:
    • Sliding window approach
    • Source image processing
    • Motion sequence generation
    • Frame synthesis

Evaluation

CAPPScore

class CAPPScore(nn.Module):
  • Contrastive Audio and Pose Pretraining score
  • Components:
    • Pose encoder (6-layer transformer)
    • Audio encoder (Wav2Vec2-based)
    • Temperature parameter
  • Evaluates audio-pose synchronization

Evaluator

class Evaluator:
  • Comprehensive evaluation metrics
  • Metrics:
    • SyncNet confidence
    • CAPP score
    • Pose variation intensity
    • FVD (if real video available)

Diffusion Process

VASADiffusion

class VASADiffusion:
  • Diffusion process handler
  • Features:
    • Forward diffusion sampling
    • Reverse diffusion sampling
    • Beta schedule management
    • Noise level control

Loss Functions

VASALoss

class VASALoss(nn.Module):
  • Combined loss implementation
  • Components:
    • DPE (Disentangled Pose Encoding) loss
    • Identity preservation loss
    • Reconstruction loss
    • Configuration for loss balancing

Usage

Each class is designed to work together in the VASA pipeline:

  1. Configuration is managed by VASAConfig
  2. Data is loaded through VASADataset
  3. Training is orchestrated by VASATrainer
  4. Models (VASAFaceEncoder, VASADiffusionTransformer, etc.) process the data
  5. VideoGenerator produces the final output
  6. Evaluator assesses the results

Notes

  • Most neural network classes inherit from nn.Module
  • The system uses PyTorch as its deep learning framework
  • Classes support distributed training where applicable
  • Extensive use of type hints for better code clarity
  • Modular design allows for easy component modification