Gradient reliability with sample-by-sample vs batch processing
I'm examining the training implementation in src/art/unsloth/service.py and have a question about the gradient computation approach.
Currently, the code processes samples individually:
for offset in range(0, packed_tensors["tokens"].shape[0]):
# Process single sample: v[offset : offset + 1]
# Each sample triggers separate gradient computation and parameter update
This means:
- Sample 1: θ₁ = θ₀ - lr * ∇L₁(θ₀)
- Sample 2: θ₂ = θ₁ - lr * ∇L₂(θ₁) (based on updated θ₁)
- Sample 3: θ₃ = θ₂ - lr * ∇L₃(θ₂) (based on updated θ₂)
Versus standard batch processing:
- All samples: θ = θ₀ - lr * (∇L₁(θ₀) + ∇L₂(θ₀) + ∇L₃(θ₀))/batch_size
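To make the contrast concrete, here is a minimal PyTorch sketch of the two update schemes on a toy model. This is illustrative only, not the actual service.py code; the model, data, and learning rate are made up.

```python
import torch

# Toy setup: one linear layer and a tiny "batch" of 3 samples.
torch.manual_seed(0)
init_model = torch.nn.Linear(4, 1)
x = torch.randn(3, 4)
y = torch.randn(3, 1)
loss_fn = torch.nn.MSELoss()
lr = 0.1

# (a) Sample-by-sample: each sample gets its own backward pass and
#     optimizer step, so sample i+1 sees parameters already moved by sample i.
seq_model = torch.nn.Linear(4, 1)
seq_model.load_state_dict(init_model.state_dict())
opt = torch.optim.SGD(seq_model.parameters(), lr=lr)
for i in range(x.shape[0]):
    opt.zero_grad()
    loss = loss_fn(seq_model(x[i : i + 1]), y[i : i + 1])
    loss.backward()
    opt.step()

# (b) Standard batch processing: gradients of all samples are averaged
#     at the same starting parameters, then a single step is taken.
batch_model = torch.nn.Linear(4, 1)
batch_model.load_state_dict(init_model.state_dict())
opt = torch.optim.SGD(batch_model.parameters(), lr=lr)
opt.zero_grad()
loss = loss_fn(batch_model(x), y)  # MSELoss averages over the batch
loss.backward()
opt.step()

# The two generally end at different parameters: (a) takes 3 smaller,
# sequential steps, while (b) takes 1 step along the averaged gradient.
print(seq_model.weight - batch_model.weight)
```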
Question:
What's the reasoning behind this sequential gradient approach? Does it provide better gradient reliability or learning dynamics for your specific use case?
I'm particularly curious whether this design choice stems from:
- Improved convergence properties
- Better handling of gradient variance
- Specific requirements for your training methodology
The downstream training code in train.py appears to support full batch processing, so I'm wondering if there are important gradient-related considerations I'm missing.
Thanks for any insights!
In my experience, doing more gradient updates works better. Here's some recent work that finds the same thing.
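If it helps to see the trade-off as a single knob, here is a generic sketch (not the project's actual API; the function and parameter names are hypothetical) where `samples_per_step` controls how many samples are accumulated before each optimizer step. `samples_per_step=1` recovers the sample-by-sample behavior above, and `samples_per_step=len(xs)` recovers a single full-batch update.

```python
import torch

def train_on_batch(model, opt, loss_fn, xs, ys, samples_per_step=1):
    """Accumulate gradients over `samples_per_step` samples per optimizer step.

    samples_per_step=1        -> one update per sample (most updates)
    samples_per_step=len(xs)  -> one update for the whole batch (fewest updates)
    Assumes len(xs) is divisible by samples_per_step.
    """
    opt.zero_grad()
    for i in range(len(xs)):
        # Divide so the accumulated gradient is the mean over the micro-batch.
        loss = loss_fn(model(xs[i : i + 1]), ys[i : i + 1]) / samples_per_step
        loss.backward()  # gradients accumulate in .grad across samples
        if (i + 1) % samples_per_step == 0:
            opt.step()
            opt.zero_grad()
```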
@zfflxx does that help answer your question?