Paper Review · Computer Systems · Experimental Design

Fine-Tuning 100B+ Models Without Backpropagation: Zeroth-Order Optimization Goes Distributed

Standard LLM fine-tuning requires storing model weights, gradients, optimizer states, and activations, which often exceeds GPU memory for models above 70B parameters. DistZO2 eliminates backpropagation entirely, estimating gradients through forward-pass-only perturbation. Distributed across multiple GPUs, this enables fine-tuning of 100B+ models on hardware that cannot run standard training.

By Sean K.S. Shin
This blog summarizes research trends based on published paper abstracts. Specific numbers or findings may contain inaccuracies. For scholarly rigor, always consult the original papers cited in each post.

The memory cost of fine-tuning large language models is dominated not by the model weights themselves but by the backpropagation infrastructure: gradient tensors, optimizer states (Adam requires two momentum tensors per parameter), and activation checkpoints stored for the backward pass. For a 70B-parameter model in mixed precision, the weights alone occupy approximately 140GB, but the full training state exceeds 500GB, requiring multi-GPU setups with sophisticated memory management (DeepSpeed ZeRO, FSDP) just to begin training.
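As a rough back-of-envelope check (my own illustrative numbers, not figures from the paper; exact totals depend on which tensors are kept in fp32 and whether a master copy of the weights is held), the accounting looks like this:

```python
# Back-of-envelope memory estimate for standard fine-tuning of a 70B model.
# Assumptions (illustrative): fp16 weights and gradients, fp32 master weights
# and Adam moments; activations are excluded and come on top of this.
params = 70e9

weights_fp16 = params * 2   # ~140 GB
grads_fp16   = params * 2   # ~140 GB
master_fp32  = params * 4   # ~280 GB (common in mixed-precision recipes)
adam_m_fp32  = params * 4   # ~280 GB
adam_v_fp32  = params * 4   # ~280 GB

total = weights_fp16 + grads_fp16 + master_fp32 + adam_m_fp32 + adam_v_fp32
print(f"weights only: {weights_fp16 / 1e9:.0f} GB")
print(f"training state before activations: {total / 1e9:.0f} GB")
```

Activations add to this and scale with batch size and sequence length, which is why activation checkpoints are called out separately above.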

Wang et al.'s DistZO2 takes a radical approach: eliminate backpropagation entirely. Zeroth-order (ZO) optimization estimates gradients by evaluating the loss at two slightly different parameter configurations: a forward pass with parameters θ and a forward pass with parameters θ + εz (where z is a random perturbation vector). The difference in loss, divided by ε, provides a gradient estimate along direction z.

This eliminates the need to store activations (no backward pass), optimizer momentum tensors (ZO uses simpler update rules), and gradient tensors. The memory footprint drops to approximately the model weights plus a single perturbation vector, a reduction that enables fine-tuning models on hardware that cannot accommodate standard training.
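A minimal sketch of the one-sided two-point estimator described above, in the spirit of MeZO-style ZO-SGD. The function names are hypothetical, the seed-regeneration trick is a common device in ZO fine-tuning rather than something stated here, and this is not DistZO2's actual implementation:

```python
import torch

def zo_sgd_step(model, loss_fn, batch, eps=1e-3, lr=1e-6, seed=0):
    """One zeroth-order update: two forward passes, no backward pass.

    Illustrative sketch only. `loss_fn(model, batch)` is assumed to run a
    forward pass and return a scalar loss tensor. The perturbation z is
    regenerated from `seed` instead of being stored, so the extra memory
    beyond the weights is negligible.
    """
    params = [p for p in model.parameters() if p.requires_grad]
    device = params[0].device

    def perturb(scale):
        # Recreate the same random direction z from the seed each time.
        gen = torch.Generator(device=device).manual_seed(seed)
        for p in params:
            z = torch.randn(p.shape, generator=gen, device=device, dtype=p.dtype)
            p.add_(scale * z)

    with torch.no_grad():
        loss_base = loss_fn(model, batch)            # forward pass at theta
        perturb(+eps)
        loss_pert = loss_fn(model, batch)            # forward pass at theta + eps*z
        perturb(-eps)                                # restore theta

        g = (loss_pert - loss_base).item() / eps     # directional derivative along z

        # SGD-style update: theta <- theta - lr * g * z, with z regenerated again.
        gen = torch.Generator(device=device).manual_seed(seed)
        for p in params:
            z = torch.randn(p.shape, generator=gen, device=device, dtype=p.dtype)
            p.add_(-lr * g * z)

    return loss_base.item()
```

In a training loop this would be called once per step with a fresh seed; stability depends heavily on the choice of eps and lr, per the convergence discussion below.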

From Single-GPU to Distributed ZO

Single-GPU zeroth-order optimization has a well-known weakness: high gradient variance. Each perturbation direction provides a one-dimensional gradient estimate; recovering the full gradient requires many perturbation directions. For models with billions of parameters, the number of perturbations needed for a useful gradient estimate is impractically large on a single GPU.

DistZO2 solves this through distribution: each GPU in the cluster computes gradient estimates along different perturbation directions, and the results are aggregated. With N GPUs contributing independent estimates, the estimation error shrinks by a factor of √N (standard Monte Carlo convergence), making distributed ZO optimization both faster and more accurate than the single-GPU version.
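The √N claim is just Monte Carlo averaging. A quick synthetic check (my own toy example on a quadratic objective, not an experiment from the paper) shows the relative error of the averaged estimator shrinking roughly like √(d/N):

```python
import numpy as np

# Averaging N independent one-direction ZO estimates of the gradient of
# f(x) = 0.5 * ||x||^2 at a random point. Dimensions and scales are arbitrary.
rng = np.random.default_rng(0)
d, eps = 1_000, 1e-4
x = rng.normal(size=d)
true_grad = x                               # gradient of 0.5 * ||x||^2 is x

def zo_estimate(n_dirs):
    est = np.zeros(d)
    for _ in range(n_dirs):
        z = rng.normal(size=d)
        directional = (0.5 * np.sum((x + eps * z) ** 2) - 0.5 * np.sum(x ** 2)) / eps
        est += directional * z
    return est / n_dirs

for n in (1, 4, 16, 64, 256):
    err = np.linalg.norm(zo_estimate(n) - true_grad) / np.linalg.norm(true_grad)
    print(f"N = {n:3d}  relative error ~ {err:.2f}")
```

The same numbers also illustrate the single-GPU weakness from the previous paragraph: with d in the billions, even hundreds of directions leave a very noisy estimate.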

The distributed coordination is lightweight: each GPU independently samples a random perturbation direction, computes two forward passes, and broadcasts its scalar gradient estimate. The communication volume is negligible compared to the gradient all-reduce operations in standard distributed training, making DistZO2 communication-efficient in addition to memory-efficient.
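Here is a sketch of that per-step coordination pattern, assuming a PyTorch torch.distributed process group has already been initialized. The seed convention, function names, and update rule are my own illustration of the scheme described above, not DistZO2's actual code:

```python
import torch
import torch.distributed as dist

def distributed_zo_step(model, loss_fn, batch, eps=1e-3, lr=1e-6, step=0):
    """One distributed ZO step (illustrative sketch, not DistZO2's code).

    Each rank perturbs the full model along its own random direction,
    estimates a directional derivative with two forward passes, and shares
    only that scalar with the other ranks.
    """
    rank, world = dist.get_rank(), dist.get_world_size()
    params = [p for p in model.parameters() if p.requires_grad]
    device = params[0].device

    def apply_direction(seed, scale):
        # Regenerate the direction z from its seed and add scale * z in place.
        gen = torch.Generator(device=device).manual_seed(seed)
        for p in params:
            z = torch.randn(p.shape, generator=gen, device=device, dtype=p.dtype)
            p.add_(scale * z)

    my_seed = step * world + rank                   # deterministic per-rank seed
    with torch.no_grad():
        loss_base = loss_fn(model, batch)           # forward pass at theta
        apply_direction(my_seed, +eps)
        loss_pert = loss_fn(model, batch)           # forward pass at theta + eps*z
        apply_direction(my_seed, -eps)              # restore theta

        # One scalar per rank: negligible traffic vs. a gradient all-reduce.
        g_local = (loss_pert - loss_base).item() / eps
        local = torch.tensor([g_local], device=device)
        gathered = [torch.zeros_like(local) for _ in range(world)]
        dist.all_gather(gathered, local)

        # Every rank rebuilds every direction from its seed and applies the
        # averaged update: theta <- theta - lr * mean_i(g_i * z_i).
        for other_rank, g in enumerate(gathered):
            apply_direction(step * world + other_rank, -lr * g.item() / world)
```

Because every rank can regenerate any other rank's direction from its seed, only one scalar per rank crosses the network each step, which is what keeps the communication volume negligible next to a full gradient all-reduce.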

Convergence Characteristics

ZO optimization converges more slowly than first-order (gradient-based) optimization; this is the fundamental tradeoff. Each ZO gradient estimate is noisier than the true gradient, requiring more update steps to reach the same loss level. The convergence rate depends on the model dimensionality (larger models need more perturbations), the perturbation scale ε (smaller ε reduces finite-difference bias but makes the estimate more sensitive to numerical noise), and the learning rate schedule.

In practice, ZO optimization requires more forward passes than standard fine-tuning to reach comparable quality, since each gradient estimate is noisier than the true gradient. But because each step is cheaper (no backward pass, no activation storage), the total wall-clock time can be competitive, and the memory savings enable experiments that are simply impossible with standard training on the available hardware.

Claims and Evidence

| Claim | Evidence | Verdict |
| --- | --- | --- |
| ZO eliminates backpropagation memory overhead | Mathematical proof; no activations or gradient tensors needed | ✅ Proven |
| Distributed ZO improves gradient quality over single-GPU | √N improvement from aggregating N independent estimates | ✅ Supported (standard result) |
| DistZO2 enables fine-tuning of models too large for standard training | Demonstrated on 100B+ parameter models on limited GPU memory | ✅ Demonstrated |
| ZO fine-tuning matches standard fine-tuning quality | Quality gap exists; more iterations needed due to gradient noise | ⚠️ Approaches but does not match |
| ZO is practical for all fine-tuning scenarios | Most beneficial for memory-constrained settings; standard training is preferred when memory is available | ⚠️ Situational |

Open Questions

  • Task-specific quality gap: Does the ZO-standard quality gap vary across tasks? Fine-tuning for simple classification may tolerate ZO noise well; fine-tuning for complex reasoning may suffer more. Task-specific analysis is needed.
  • Combination with LoRA: Can ZO be combined with parameter-efficient fine-tuning (LoRA, QLoRA) for additional memory savings? The combination would further reduce the number of parameters being optimized, potentially improving ZO convergence.
  • Adaptive perturbation: Should the perturbation scale ε adapt during training? Larger ε early in training (for faster exploration) and smaller ε later (for finer optimization) might improve convergence.
  • Hybrid approaches: Can we use ZO for most parameters and first-order optimization for a small subset of critical parameters? This hybrid might combine ZO's memory efficiency with first-order's convergence speed.
What This Means for Your Research

For ML practitioners with limited GPU resources, DistZO2 opens the possibility of fine-tuning models that were previously out of reach. A research lab with 4 × A100 GPUs can potentially fine-tune a 100B model that would normally require 16+ GPUs with standard training.

For optimization researchers, distributed zeroth-order optimization in the LLM setting presents interesting convergence analysis challenges, particularly around the interaction between model dimensionality, perturbation strategies, and distributed aggregation.

References (1)

[1] Wang, L., Xie, H., Wang, D., et al. (2025). DistZO2: High-Throughput and Memory-Efficient Zeroth-Order Fine-tuning LLMs with Distributed Parallel Computing. arXiv:2507.03211.
