Data Parallelism

Distributed Training Methods - This article is part of a series.
Part 2: This Article

Core Idea
#

Data parallelism is the simplest and most widely adopted form of distributed training. The strategy:

  1. Replicate the entire model on every device (GPU / TPU / accelerator).
  2. Partition the training dataset into disjoint mini-batches, one per replica.
  3. Each device computes a forward pass and backward pass on its local mini-batch.
  4. Synchronise gradients across all replicas (typically via AllReduce).
  5. Every replica applies the identical parameter update, keeping models in sync.

Because each device processes different data but shares the same model, data parallelism achieves near-linear scaling when communication overhead is well managed.
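The update rule in steps 3–5 can be sketched in a few lines of plain Python, with the AllReduce modelled as an element-wise mean across replicas (a toy single-process sketch; all numbers are hypothetical):

```python
# Toy sketch of the data-parallel update rule (steps 3-5), with the
# AllReduce modelled as a plain element-wise mean across replicas.
def allreduce_mean(per_replica_grads):
    """Average gradients element-wise -- the effect of AllReduce(mean)."""
    n = len(per_replica_grads)
    return [sum(g[i] for g in per_replica_grads) / n
            for i in range(len(per_replica_grads[0]))]

def data_parallel_step(replica_weights, per_replica_grads, lr=0.1):
    g = allreduce_mean(per_replica_grads)            # step 4: synchronise
    return [[w - lr * gi for w, gi in zip(ws, g)]    # step 5: identical update
            for ws in replica_weights]

weights = [[1.0, 1.0, 1.0], [1.0, 1.0, 1.0]]  # two replicas, initially in sync
grads = [[1.0, 2.0, 3.0], [3.0, 2.0, 1.0]]    # gradients from different mini-batches
weights = data_parallel_step(weights, grads)
assert weights[0] == weights[1]               # replicas stay bit-identical
```

Because every replica sees the same averaged gradient, the parameters never drift apart — this is the invariant the whole scheme rests on.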

Why this matters for AGIACC: state replication and constant synchronisation make data-parallel systems efficient, but they also create system-wide trust dependencies that infrastructure security has to address.


PyTorch Distributed Data Parallel (DDP)
#

PyTorch’s DistributedDataParallel (DDP) is the reference implementation of data-parallel training. Key design choices:

  • Gradient bucketing — Small gradient tensors are grouped into communication buckets to amortise AllReduce launch overhead.
  • Overlap — Gradient synchronisation begins while the backward pass is still running, hiding communication latency behind compute.
  • Process groups — NCCL (NVIDIA), Gloo, or MPI backends handle the underlying collectives.
# Minimal PyTorch DDP example (launched via torchrun, one process per GPU)
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

dist.init_process_group("nccl")
local_rank = int(os.environ["LOCAL_RANK"])   # set by torchrun
torch.cuda.set_device(local_rank)
model = DDP(MyModel().cuda(), device_ids=[local_rank])

Scaling efficiency: DDP typically achieves 90–95% scaling efficiency on 8 GPUs within a single node, dropping to 80–90% across multi-node clusters depending on interconnect bandwidth.
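The benefit of gradient bucketing can be illustrated with a toy cost model — every collective pays a fixed launch latency on top of the bandwidth cost, so fusing many small tensors into a few large buckets slashes the latency term while the bandwidth term stays constant. All constants below (launch latency, bandwidth, tensor and bucket sizes) are hypothetical, chosen only to show the shape of the trade-off:

```python
import math

# Toy cost model of one AllReduce: fixed launch latency plus a bandwidth
# term. LAUNCH_US and GBPS are hypothetical illustrative constants.
LAUNCH_US = 20.0   # per-collective launch cost, microseconds
GBPS = 100.0       # interconnect bandwidth, GB/s

def allreduce_us(n_calls, total_bytes):
    return n_calls * LAUNCH_US + total_bytes / (GBPS * 1e3)  # 1e5 bytes/us

total = 1000 * 40_000                       # 1000 small gradients, 40 KB each
unbucketed = allreduce_us(1000, total)      # one collective per tensor
bucketed = allreduce_us(math.ceil(total / 25_000_000), total)  # 25 MB buckets
# launch latency dominates the unbucketed case; bucketing removes it
assert bucketed < unbucketed / 10
```

Under these toy numbers the unbucketed schedule spends the vast majority of its time on launch overhead, which is exactly the cost DDP's default ~25 MB buckets amortise away.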


The Memory Problem: ZeRO
#

Classic DDP stores a full copy of model parameters, optimizer states, and gradients on every device. For a 10B-parameter model trained with Adam in fp16/fp32 mixed precision:

Component                                              Per-device memory
Parameters (fp16)                                      20 GB
Gradients (fp16)                                       20 GB
Optimizer states (fp32 master + momentum + variance)   120 GB
Total                                                  160 GB

This far exceeds the 80 GB capacity of a single H100. ZeRO (Zero Redundancy Optimizer), from Microsoft's DeepSpeed library, eliminates the redundancy by sharding this state across devices:

ZeRO Stages
#

Stage    What is sharded                              Memory saving
ZeRO-1   Optimizer states only                        ~4× reduction
ZeRO-2   Optimizer states + gradients                 ~8× reduction
ZeRO-3   Optimizer states + gradients + parameters    Full linear scaling

At Stage 3, each device stores only 1/N of total state. Parameters are gathered on-demand via AllGather before each forward/backward computation, then released.
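The stage-by-stage savings follow directly from a bytes-per-parameter accounting. A small sketch, assuming Adam-style mixed precision (2 bytes of fp16 weights, 2 bytes of fp16 gradients, and 12 bytes of fp32 optimizer state — master copy, momentum, variance — per parameter):

```python
def per_device_gb(n_params, n_devices, stage=0):
    """Approximate per-device memory (GB) for mixed-precision Adam.
    Bytes per parameter: 2 (fp16 weights) + 2 (fp16 grads) +
    12 (fp32 master copy + momentum + variance)."""
    p, g, o = 2 * n_params, 2 * n_params, 12 * n_params
    if stage >= 1:
        o /= n_devices          # ZeRO-1: shard optimizer states
    if stage >= 2:
        g /= n_devices          # ZeRO-2: also shard gradients
    if stage >= 3:
        p /= n_devices          # ZeRO-3: also shard parameters
    return (p + g + o) / 1e9

# 10B-parameter model on 8 devices
print(per_device_gb(10e9, 8, stage=0))  # 160.0 -> exceeds a single H100
print(per_device_gb(10e9, 8, stage=3))  # 20.0  -> 1/8 of the total state
```

At Stage 3 the footprint is exactly total state divided by device count, which is why ZeRO-3 scales linearly with the size of the data-parallel group.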


Fully Sharded Data Parallel (FSDP)
#

PyTorch’s FSDP integrates ZeRO-3-style sharding natively into PyTorch:

  • Each FSDP unit wraps a sub-module and shards its parameters across the data-parallel group.
  • Before a forward pass, parameters are all-gathered; after backward, gradients are reduce-scattered and shards are released.
  • Supports mixed precision, activation checkpointing, and CPU offloading out of the box.
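The two collectives this lifecycle is built on can be modelled in plain Python to show the invariant: AllGather reconstructs the full parameter tensor from the per-rank shards, while ReduceScatter sums gradients across ranks and leaves each rank holding only its own shard (a toy single-process sketch, not the NCCL implementations):

```python
# Toy models of the two collectives underlying FSDP's shard lifecycle.
def all_gather(shards):
    """Every rank ends up with the concatenation of all shards."""
    full = [x for s in shards for x in s]
    return [list(full) for _ in shards]

def reduce_scatter(per_rank_grads, world_size):
    """Element-wise sum across ranks, then split: rank i keeps shard i."""
    summed = [sum(g[i] for g in per_rank_grads)
              for i in range(len(per_rank_grads[0]))]
    shard = len(summed) // world_size
    return [summed[r * shard:(r + 1) * shard] for r in range(world_size)]

shards = [[1.0, 2.0], [3.0, 4.0]]             # 2 ranks, 4 parameters total
assert all_gather(shards)[0] == [1.0, 2.0, 3.0, 4.0]
grads = [[1.0, 1.0, 1.0, 1.0], [2.0, 2.0, 2.0, 2.0]]
assert reduce_scatter(grads, 2) == [[3.0, 3.0], [3.0, 3.0]]
```

After the reduce-scatter, each rank holds exactly the gradient shard matching the parameter shard it owns, so the optimizer step is purely local.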

FSDP has become the default data-parallelism strategy for models too large for vanilla DDP but not so large as to require tensor or pipeline splits.


When Data Parallelism Is Not Enough
#

Data parallelism assumes the model fits on a single device (at least at ZeRO-3 sharding granularity). When a single layer is too large for one GPU — common in frontier models with embedding tables of billions of parameters or attention heads spanning thousands of dimensions — model parallelism is required.


Next: Model Parallelism →
