OpenPipe/ART

Gradient reliability with sample-by-sample vs batch processing


I'm examining the training implementation in src/art/unsloth/service.py and have a question about the gradient computation approach.

Currently, the code processes samples individually:

for offset in range(0, packed_tensors["tokens"].shape[0]):
    # Process single sample: v[offset : offset + 1]
    # Each sample triggers a separate gradient computation and parameter update

This means:

  • Sample 1: θ₁ = θ₀ - lr * ∇L₁(θ₀)
  • Sample 2: θ₂ = θ₁ - lr * ∇L₂(θ₁) (based on updated θ₁)
  • Sample 3: θ₃ = θ₂ - lr * ∇L₃(θ₂) (based on updated θ₂)
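For concreteness, here is a minimal sketch of what that sequential update looks like; the names model, optimizer, loss_fn, tokens, and labels are placeholders for illustration, not the actual identifiers in service.py:

def train_sequential(model, optimizer, loss_fn, tokens, labels):
    # One optimizer step per sample: later samples see parameters
    # that earlier samples have already updated.
    for offset in range(tokens.shape[0]):
        optimizer.zero_grad()
        logits = model(tokens[offset : offset + 1])           # keep the batch dim
        loss = loss_fn(logits, labels[offset : offset + 1])
        loss.backward()   # gradient of this sample's loss only
        optimizer.step()  # parameters change before the next sample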

Versus standard batch processing:

  • All samples: θ = θ₀ - lr * (∇L₁(θ₀) + ∇L₂(θ₀) + ∇L₃(θ₀))/batch_size
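And a comparable sketch of the standard batched update, again with placeholder names, where every sample's gradient is evaluated at the same starting parameters θ₀ before a single step:

def train_batched(model, optimizer, loss_fn, tokens, labels):
    optimizer.zero_grad()
    logits = model(tokens)          # forward pass over the whole batch
    loss = loss_fn(logits, labels)  # mean loss ≈ (L₁ + L₂ + L₃) / batch_size
    loss.backward()                 # one combined gradient at θ₀
    optimizer.step()                # single parameter update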

Question:

What's the reasoning behind this sequential gradient approach? Does it provide better gradient reliability or learning dynamics for your specific use case?

I'm particularly curious whether this design choice stems from:

  • Improved convergence properties
  • Better handling of gradient variance
  • Specific requirements for your training methodology

The downstream training code in train.py appears to support full batch processing, so I'm wondering if there are important gradient-related considerations I'm missing.

Thanks for any insights!

In my experience, doing more gradient updates works better. Here's some recent work that finds the same thing.

@zfflxx does that help answer your question?