Paper Review · Computer Systems · Experimental Design

Fine-Tuning 100B+ Models Without Backpropagation: Zeroth-Order Optimization Goes Distributed

Standard LLM fine-tuning requires storing model weights, gradients, optimizer states, and activations, which often exceeds GPU memory for models above 70B parameters. DistZO2 eliminates backpropagation entirely, estimating gradients through forward-pass-only perturbation. Distributed across multiple GPUs, this enables fine-tuning of 100B+ models on hardware that cannot run standard training.

By Sean K.S. Shin
This blog summarizes research trends based on published paper abstracts. Specific numbers or findings may contain inaccuracies. For scholarly rigor, always consult the original papers cited in each post.

The memory cost of fine-tuning large language models is dominated not by the model weights themselves but by the backpropagation infrastructure: gradient tensors, optimizer states (Adam requires two momentum tensors per parameter), and activation checkpoints stored for the backward pass. For a 70B-parameter model in mixed precision, the weights alone occupy approximately 140GB, but the full training state exceeds 500GB, requiring multi-GPU setups with sophisticated memory management (DeepSpeed ZeRO, FSDP) just to begin training.
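As a rough back-of-envelope check (my own illustrative numbers, not figures from the paper; exact totals depend on which tensors are kept in fp32 and whether a master copy of the weights is held), the accounting looks like this:

```python
# Back-of-envelope memory estimate for standard fine-tuning of a 70B model.
# Assumptions (illustrative): fp16 weights and gradients, fp32 master weights
# and Adam moments; activations are excluded and come on top of this.
params = 70e9

weights_fp16 = params * 2   # ~140 GB
grads_fp16   = params * 2   # ~140 GB
master_fp32  = params * 4   # ~280 GB (common in mixed-precision recipes)
adam_m_fp32  = params * 4   # ~280 GB
adam_v_fp32  = params * 4   # ~280 GB

total = weights_fp16 + grads_fp16 + master_fp32 + adam_m_fp32 + adam_v_fp32
print(f"weights only: {weights_fp16 / 1e9:.0f} GB")
print(f"training state before activations: {total / 1e9:.0f} GB")
```

Activations add to this and scale with batch size and sequence length, which is why activation checkpoints are called out separately above.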

Wang et al.'s DistZO2 takes a radical approach: eliminate backpropagation entirely. Zeroth-order (ZO) optimization estimates gradients by evaluating the loss at two slightly different parameter configurations: a forward pass with parameters θ and a forward pass with parameters θ + εz (where z is a random perturbation vector). The difference in loss, divided by ε, provides a gradient estimate along direction z.

This eliminates the need to store activations (no backward pass), optimizer momentum tensors (ZO uses simpler update rules), and gradient tensors. The memory footprint drops to approximately the model weights plus a single perturbation vector, a reduction that enables fine-tuning models on hardware that cannot accommodate standard training.
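A minimal sketch of the one-sided two-point estimator described above, in the spirit of MeZO-style ZO-SGD. The function names are hypothetical, the seed-regeneration trick is a common device in ZO fine-tuning rather than something stated here, and this is not DistZO2's actual implementation:

```python
import torch

def zo_sgd_step(model, loss_fn, batch, eps=1e-3, lr=1e-6, seed=0):
    """One zeroth-order update: two forward passes, no backward pass.

    Illustrative sketch only. `loss_fn(model, batch)` is assumed to run a
    forward pass and return a scalar loss tensor. The perturbation z is
    regenerated from `seed` instead of being stored, so the extra memory
    beyond the weights is negligible.
    """
    params = [p for p in model.parameters() if p.requires_grad]
    device = params[0].device

    def perturb(scale):
        # Recreate the same random direction z from the seed each time.
        gen = torch.Generator(device=device).manual_seed(seed)
        for p in params:
            z = torch.randn(p.shape, generator=gen, device=device, dtype=p.dtype)
            p.add_(scale * z)

    with torch.no_grad():
        loss_base = loss_fn(model, batch)            # forward pass at theta
        perturb(+eps)
        loss_pert = loss_fn(model, batch)            # forward pass at theta + eps*z
        perturb(-eps)                                # restore theta

        g = (loss_pert - loss_base).item() / eps     # directional derivative along z

        # SGD-style update: theta <- theta - lr * g * z, with z regenerated again.
        gen = torch.Generator(device=device).manual_seed(seed)
        for p in params:
            z = torch.randn(p.shape, generator=gen, device=device, dtype=p.dtype)
            p.add_(-lr * g * z)

    return loss_base.item()
```

In a training loop this would be called once per step with a fresh seed; stability depends heavily on the choice of eps and lr, per the convergence discussion below.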

From Single-GPU to Distributed ZO

Single-GPU zeroth-order optimization has a well-known weakness: high gradient variance. Each perturbation direction provides a one-dimensional gradient estimate; recovering the full gradient requires many perturbation directions. For models with billions of parameters, the number of perturbations needed for a useful gradient estimate is impractically large on a single GPU.

DistZO2 solves this through distribution: each GPU in the cluster computes gradient estimates along different perturbation directions, and the results are aggregated. With N GPUs contributing independent estimates, the estimation error shrinks by a factor of √N (standard Monte Carlo convergence), making distributed ZO optimization both faster and more accurate than the single-GPU version.
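The √N claim is just Monte Carlo averaging. A quick synthetic check (my own toy example on a quadratic objective, not an experiment from the paper) shows the relative error of the averaged estimator shrinking roughly like √(d/N):

```python
import numpy as np

# Averaging N independent one-direction ZO estimates of the gradient of
# f(x) = 0.5 * ||x||^2 at a random point. Dimensions and scales are arbitrary.
rng = np.random.default_rng(0)
d, eps = 1_000, 1e-4
x = rng.normal(size=d)
true_grad = x                               # gradient of 0.5 * ||x||^2 is x

def zo_estimate(n_dirs):
    est = np.zeros(d)
    for _ in range(n_dirs):
        z = rng.normal(size=d)
        directional = (0.5 * np.sum((x + eps * z) ** 2) - 0.5 * np.sum(x ** 2)) / eps
        est += directional * z
    return est / n_dirs

for n in (1, 4, 16, 64, 256):
    err = np.linalg.norm(zo_estimate(n) - true_grad) / np.linalg.norm(true_grad)
    print(f"N = {n:3d}  relative error ~ {err:.2f}")
```

The same numbers also illustrate the single-GPU weakness from the previous paragraph: with d in the billions, even hundreds of directions leave a very noisy estimate.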

The distributed coordination is lightweight: each GPU independently samples a random perturbation direction, computes two forward passes, and broadcasts its scalar gradient estimate. The communication volume is negligible compared to the gradient all-reduce operations in standard distributed training, making DistZO2 communication-efficient in addition to memory-efficient.
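Here is a sketch of that per-step coordination pattern, assuming a PyTorch torch.distributed process group has already been initialized. The seed convention, function names, and update rule are my own illustration of the scheme described above, not DistZO2's actual code:

```python
import torch
import torch.distributed as dist

def distributed_zo_step(model, loss_fn, batch, eps=1e-3, lr=1e-6, step=0):
    """One distributed ZO step (illustrative sketch, not DistZO2's code).

    Each rank perturbs the full model along its own random direction,
    estimates a directional derivative with two forward passes, and shares
    only that scalar with the other ranks.
    """
    rank, world = dist.get_rank(), dist.get_world_size()
    params = [p for p in model.parameters() if p.requires_grad]
    device = params[0].device

    def apply_direction(seed, scale):
        # Regenerate the direction z from its seed and add scale * z in place.
        gen = torch.Generator(device=device).manual_seed(seed)
        for p in params:
            z = torch.randn(p.shape, generator=gen, device=device, dtype=p.dtype)
            p.add_(scale * z)

    my_seed = step * world + rank                   # deterministic per-rank seed
    with torch.no_grad():
        loss_base = loss_fn(model, batch)           # forward pass at theta
        apply_direction(my_seed, +eps)
        loss_pert = loss_fn(model, batch)           # forward pass at theta + eps*z
        apply_direction(my_seed, -eps)              # restore theta

        # One scalar per rank: negligible traffic vs. a gradient all-reduce.
        g_local = (loss_pert - loss_base).item() / eps
        local = torch.tensor([g_local], device=device)
        gathered = [torch.zeros_like(local) for _ in range(world)]
        dist.all_gather(gathered, local)

        # Every rank rebuilds every direction from its seed and applies the
        # averaged update: theta <- theta - lr * mean_i(g_i * z_i).
        for other_rank, g in enumerate(gathered):
            apply_direction(step * world + other_rank, -lr * g.item() / world)
```

Because every rank can regenerate any other rank's direction from its seed, only one scalar per rank crosses the network each step, which is what keeps the communication volume negligible next to a full gradient all-reduce.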

Convergence Characteristics

ZO optimization converges more slowly than first-order (gradient-based) optimization; this is the fundamental tradeoff. Each ZO gradient estimate is noisier than the true gradient, requiring more update steps to reach the same loss level. The convergence rate depends on the model dimensionality (larger models need more perturbations), the perturbation scale ε (smaller ε reduces finite-difference bias but makes the estimate more sensitive to numerical noise), and the learning rate schedule.

In practice, ZO optimization requires more forward passes than standard fine-tuning to reach comparable quality, since each gradient estimate is noisier than the true gradient. But because each step is cheaper (no backward pass, no activation storage), the total wall-clock time can be competitive, and the memory savings enable experiments that are simply impossible with standard training on the available hardware.

Claims and Evidence

| Claim | Evidence | Verdict |
| --- | --- | --- |
| ZO eliminates backpropagation memory overhead | Mathematical proof; no activations or gradient tensors needed | ✅ Proven |
| Distributed ZO improves gradient quality over single-GPU | √N improvement from aggregating N independent estimates | ✅ Supported (standard result) |
| DistZO2 enables fine-tuning of models too large for standard training | Demonstrated on 100B+ parameter models on limited GPU memory | ✅ Demonstrated |
| ZO fine-tuning matches standard fine-tuning quality | Quality gap exists; more iterations needed due to gradient noise | ⚠️ Approaches but does not match |
| ZO is practical for all fine-tuning scenarios | Most beneficial for memory-constrained settings; standard training is preferred when memory is available | ⚠️ Situational |

Open Questions

  • Task-specific quality gap: Does the ZO-standard quality gap vary across tasks? Fine-tuning for simple classification may tolerate ZO noise well; fine-tuning for complex reasoning may suffer more. Task-specific analysis is needed.
  • Combination with LoRA: Can ZO be combined with parameter-efficient fine-tuning (LoRA, QLoRA) for additional memory savings? The combination would further reduce the number of parameters being optimized, potentially improving ZO convergence.
  • Adaptive perturbation: Should the perturbation scale ε adapt during training? Larger ε early in training (for faster exploration) and smaller ε later (for finer optimization) might improve convergence.
  • Hybrid approaches: Can we use ZO for most parameters and first-order optimization for a small subset of critical parameters? This hybrid might combine ZO's memory efficiency with first-order's convergence speed.
What This Means for Your Research

For ML practitioners with limited GPU resources, DistZO2 opens the possibility of fine-tuning models that were previously out of reach. A research lab with 4 × A100 GPUs can potentially fine-tune a 100B model that would normally require 16+ GPUs with standard training.

For optimization researchers, distributed zeroth-order optimization in the LLM setting presents interesting convergence analysis challenges, particularly around the interaction between model dimensionality, perturbation strategies, and distributed aggregation.

References (1)

[1] Wang, L., Xie, H., Wang, D., et al. (2025). DistZO2: High-Throughput and Memory-Efficient Zeroth-Order Fine-tuning LLMs with Distributed Parallel Computing. arXiv:2507.03211.
