## Core Idea
Data parallelism is the simplest and most widely adopted form of distributed training. The strategy:
- Replicate the entire model on every device (GPU / TPU / accelerator).
- Partition the training dataset into disjoint mini-batches, one per replica.
- Each device computes a forward pass and backward pass on its local mini-batch.
- Synchronise gradients across all replicas (typically via AllReduce).
- Every replica applies the identical parameter update, keeping models in sync.
Because each device processes different data but shares the same model, data parallelism achieves near-linear scaling when communication overhead is well managed.
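The five steps above can be sketched in plain Python, with a toy one-parameter model and an average over replicas standing in for AllReduce (all names and numbers here are illustrative, not a real training loop):

```python
# Toy simulation of one data-parallel step: a 1-parameter model y = w*x
# trained with MSE, replicated across 4 "devices".

def local_grad(w, batch):
    # Gradient of mean squared error 0.5*(w*x - y)^2 over the local mini-batch.
    return sum((w * x - y) * x for x, y in batch) / len(batch)

def allreduce_mean(grads):
    # Stand-in for AllReduce: every replica ends up with the global average.
    g = sum(grads) / len(grads)
    return [g] * len(grads)

replicas = [1.0] * 4                     # identical model copy on each device
shards = [[(1.0, 2.0)], [(2.0, 4.0)],    # disjoint mini-batches of (x, y),
          [(3.0, 6.0)], [(4.0, 8.0)]]    # all consistent with w* = 2

grads = [local_grad(w, s) for w, s in zip(replicas, shards)]  # local backward
grads = allreduce_mean(grads)                                 # synchronise
replicas = [w - 0.1 * g for w, g in zip(replicas, grads)]     # identical update

assert len(set(replicas)) == 1           # replicas remain bit-identical
```

Because every replica applies the same averaged gradient, the copies never drift apart; this invariant is what lets DDP avoid synchronising parameters themselves.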
Why this matters for AGIACC: state replication and constant synchronisation make data-parallel systems efficient, but they also create system-wide trust dependencies that infrastructure security has to address.
## PyTorch Distributed Data Parallel (DDP)
PyTorch’s DistributedDataParallel (DDP) is the reference implementation of data-parallel training. Key design choices:
- Gradient bucketing — Small gradient tensors are grouped into communication buckets to amortise AllReduce launch overhead.
- Overlap — Gradient synchronisation begins while the backward pass is still running, hiding communication latency behind compute.
- Process groups — NCCL (NVIDIA), Gloo, or MPI backends handle the underlying collectives.
```python
# Minimal PyTorch DDP example (launch with torchrun, which sets LOCAL_RANK)
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

dist.init_process_group("nccl")
local_rank = int(os.environ["LOCAL_RANK"])  # one process per GPU
torch.cuda.set_device(local_rank)
model = DDP(MyModel().cuda(local_rank), device_ids=[local_rank])
```
Scaling efficiency: DDP typically achieves 90–95% scaling efficiency on 8 GPUs within a single node, dropping to 80–90% across multi-node clusters depending on interconnect bandwidth.
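The bucketing idea can be simulated with toy gradient sizes; the sketch below assumes a 25 MB cap (DDP's default `bucket_cap_mb`) and ignores the fact that real DDP fills buckets in roughly reverse parameter order as gradients become ready:

```python
# Sketch of gradient bucketing: pack per-tensor gradient sizes (in MB)
# into buckets of at most `cap` MB, so one AllReduce serves many tensors.

def bucketize(sizes_mb, cap=25.0):
    buckets, current, filled = [], [], 0.0
    for size in sizes_mb:
        if current and filled + size > cap:
            buckets.append(current)      # one AllReduce launched per bucket
            current, filled = [], 0.0
        current.append(size)
        filled += size
    if current:
        buckets.append(current)
    return buckets

# 8 gradient tensors -> 3 collectives instead of 8
grads = [10.0, 12.0, 2.0, 20.0, 1.0, 1.0, 1.0, 18.0]
print(bucketize(grads))  # [[10.0, 12.0, 2.0], [20.0, 1.0, 1.0, 1.0], [18.0]]
```

Fewer, larger collectives amortise the per-launch latency of AllReduce, which is why bucketing matters most for models with many small parameter tensors.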
## The Memory Problem: ZeRO
Classic DDP stores a full copy of model parameters, optimizer states, and gradients on every device. For a 10B-parameter model in fp16/fp32 mixed precision:
| Component | Per-device memory |
|---|---|
| Parameters (fp16) | 20 GB |
| Gradients (fp16) | 20 GB |
| Optimizer states (fp32 master + momentum + variance) | 120 GB |
| Total | 160 GB |
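These figures follow from the standard mixed-precision Adam accounting (2 bytes per fp16 parameter, 2 per fp16 gradient, and 12 per parameter for the fp32 master weight, momentum, and variance); a quick check:

```python
# Per-device memory for classic DDP with mixed-precision Adam,
# using the standard 16-bytes-per-parameter accounting.
P = 10e9                      # parameters
GB = 1e9

params_fp16 = 2 * P / GB      # fp16 parameters
grads_fp16  = 2 * P / GB      # fp16 gradients
optim_fp32  = 12 * P / GB     # 4 B master + 4 B momentum + 4 B variance
total = params_fp16 + grads_fp16 + optim_fp32

print(params_fp16, grads_fp16, optim_fp32, total)  # 20.0 20.0 120.0 160.0
```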
This exceeds the 80 GB capacity of an H100. ZeRO (Zero Redundancy Optimizer), introduced in Microsoft's DeepSpeed library, eliminates this redundancy by sharding state across devices:
### ZeRO Stages
| Stage | What is sharded | Memory saving |
|---|---|---|
| ZeRO-1 | Optimizer states only | ~4× reduction |
| ZeRO-2 | Optimizer states + gradients | ~8× reduction |
| ZeRO-3 | Optimizer states + gradients + parameters | Memory per device falls as 1/N |
At Stage 3, each device stores only 1/N of total state. Parameters are gathered on-demand via AllGather before each forward/backward computation, then released.
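A rough model of per-device memory under each stage, using the same 16-bytes-per-parameter mixed-precision Adam accounting (the function and the device count are illustrative):

```python
# Per-device memory for ZeRO stages 0-3: P parameters over N devices, with
# 2-byte fp16 params/grads and 12 bytes of fp32 Adam state per parameter.
def zero_memory_gb(P, N, stage):
    params, grads, optim = 2 * P, 2 * P, 12 * P
    if stage >= 1:
        optim /= N           # ZeRO-1: shard optimizer states
    if stage >= 2:
        grads /= N           # ZeRO-2: shard gradients too
    if stage >= 3:
        params /= N          # ZeRO-3: shard parameters as well
    return (params + grads + optim) / 1e9

P, N = 10e9, 64
for s in (0, 1, 2, 3):
    print(f"stage {s}: {zero_memory_gb(P, N, s):.2f} GB")
# stage 0: 160.00 GB, stage 1: 41.88 GB, stage 2: 22.19 GB, stage 3: 2.50 GB
```

Note that ZeRO-1 and ZeRO-2 savings plateau near 4× and 8× because the unsharded fp16 parameters and gradients dominate once optimizer state is distributed; only ZeRO-3 keeps shrinking with N.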
## Fully Sharded Data Parallel (FSDP)
PyTorch’s FSDP integrates ZeRO-3-style sharding natively into PyTorch:
- Each FSDP unit wraps a sub-module and shards its parameters across the data-parallel group.
- Before a forward pass, parameters are all-gathered; after backward, gradients are reduce-scattered and shards are released.
- Supports mixed precision, activation checkpointing, and CPU offloading out of the box.
FSDP has become the default data-parallelism strategy for models too large for vanilla DDP but not so large as to require tensor or pipeline splits.
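The gather/compute/release lifecycle can be sketched in plain Python, with list slices standing in for parameter shards and concatenation standing in for AllGather (no real communication or PyTorch involved; all names are illustrative):

```python
# Toy ZeRO-3 / FSDP parameter lifecycle: each device owns one shard;
# full parameters exist only transiently, around the compute that needs them.

class ShardedUnit:
    def __init__(self, full_params, rank, world):
        n = len(full_params)
        # Each rank keeps only its 1/N slice of the flat parameter list.
        self.shard = full_params[rank * n // world:(rank + 1) * n // world]

def all_gather(units):
    # Stand-in for AllGather: concatenate every rank's shard.
    return [p for u in units for p in u.shard]

world = 4
weights = list(range(8))                         # 8 toy parameters
units = [ShardedUnit(weights, r, world) for r in range(world)]

full = all_gather(units)                         # materialise before forward
assert full == weights                           # every rank sees full weights
del full                                         # release after use; only
per_device = len(units[0].shard)                 # 1/N of state stays resident
print(per_device)                                # -> 2
```

The key design point this illustrates: peak memory is set by the largest single FSDP unit that must be gathered at once, not by total model size, which is why wrapping granularity matters.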
## When Data Parallelism Is Not Enough
Data parallelism assumes the model fits on a single device (or, under ZeRO-3, that each individually gathered sub-module does). When a single layer is too large for one GPU, as in frontier models whose embedding tables alone hold billions of parameters or whose attention projections span thousands of dimensions, model parallelism is required.
Next: Model Parallelism →