## Why Hybrid?
No single parallelism strategy is sufficient for frontier-scale training. Each has strengths suited to a specific axis of the cluster:
| Strategy | Best For | Communication | Typical Scope |
|---|---|---|---|
| Tensor Parallelism | Large individual layers | AllReduce (high bandwidth) | Intra-node (NVLink) |
| Pipeline Parallelism | Many sequential stages | Point-to-point | Inter-node (InfiniBand) |
| Data Parallelism | Scaling throughput | AllReduce (gradient sync) | Across node groups |
3D Parallelism (or hybrid parallelism) composes all three, mapping each strategy to the interconnect tier where it performs best.
Why this matters for AGIACC: frontier training is infrastructure choreography. The companies that understand these boundaries are better positioned to secure how high-value models are actually built.
## Megatron-DeepSpeed
The Megatron-DeepSpeed system, jointly developed by NVIDIA and Microsoft, is the canonical implementation of 3D parallelism. It was used to train models like Megatron-Turing NLG (530B) and has influenced most subsequent large-scale training systems.
### Architecture
Consider a cluster of 64 DGX H100 nodes (512 GPUs total) training a 175B-parameter model:
Data Parallel Group     ┌─────────────────────────────┐
                        │ 8 replicas across cluster   │
                        └──────────────┬──────────────┘
                                       │
Pipeline Parallel Group ┌──────────────▼──────────────┐
                        │ 8 stages across 8 nodes     │
                        └──────────────┬──────────────┘
                                       │
Tensor Parallel Group   ┌──────────────▼──────────────┐
                        │ 8 GPUs within a single node │
                        └─────────────────────────────┘
- Tensor Parallel (TP) = 8: Each transformer layer is split across 8 GPUs within a node, using NVLink at 900 GB/s.
- Pipeline Parallel (PP) = 8: The model’s layers are split into 8 stages, one per node, communicating via InfiniBand at 400 Gb/s.
- Data Parallel (DP) = 8: 8 identical pipeline replicas train on different data, synchronising gradients via AllReduce across the cluster.
Total GPUs = TP × PP × DP = 8 × 8 × 8 = 512 GPUs.
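This product also fixes how a flat GPU rank maps onto the three groups. Below is a minimal sketch of that mapping, assuming the common Megatron-style ordering in which tensor-parallel ranks are innermost (contiguous, so one node's 8 GPUs form one TP group); `decompose_rank` is an illustrative helper, not an API from either framework.

```python
def decompose_rank(global_rank: int, tp: int, pp: int, dp: int):
    """Map a flat GPU rank to (dp, pp, tp) coordinates.

    Assumes tensor-parallel ranks are innermost (contiguous within a
    node), then pipeline, then data parallel -- the usual Megatron-style
    layout. This is a sketch, not the frameworks' actual rank logic.
    """
    assert 0 <= global_rank < tp * pp * dp
    tp_rank = global_rank % tp
    pp_rank = (global_rank // tp) % pp
    dp_rank = global_rank // (tp * pp)
    return dp_rank, pp_rank, tp_rank

# The 512-GPU example above: TP = 8, PP = 8, DP = 8.
# GPUs 0-7 share one TP group (one node); one replica spans GPUs 0-63.
print(decompose_rank(0, 8, 8, 8))    # (0, 0, 0)
print(decompose_rank(63, 8, 8, 8))   # (0, 7, 7): last GPU of the first replica
print(decompose_rank(64, 8, 8, 8))   # (1, 0, 0): first GPU of the second replica
```

Under this ordering, TP traffic stays on NVLink (neighbouring ranks share a node), while PP and DP traffic crosses InfiniBand, matching the tiering described above.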
## Configuration Space
Choosing the right TP/PP/DP configuration is a systems optimisation problem. Key trade-offs:
| Increasing… | Reduces | Increases | Constraint |
|---|---|---|---|
| TP degree | Per-GPU memory for one layer | AllReduce traffic (intra-node) | Limited by intra-node bandwidth |
| PP degree | Per-GPU memory (fewer layers per stage) | Pipeline bubbles, activation-transfer latency | Bubble fraction = (PP−1) / (micro-batches + PP−1) |
| DP degree | N/A (model unchanged) | Gradient AllReduce traffic | Requires a large enough global batch size |
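The pipeline-bubble constraint is easy to quantify. A minimal sketch using the GPipe-style bubble fraction, (PP−1)/(m + PP−1) for m micro-batches:

```python
def bubble_fraction(pp: int, micro_batches: int) -> float:
    """Fraction of pipeline time lost to fill/drain bubbles (GPipe schedule)."""
    return (pp - 1) / (micro_batches + pp - 1)

# PP = 8: more micro-batches amortise the bubble.
for m in (8, 32, 128):
    print(f"m={m:3d}  bubble={bubble_fraction(8, m):.1%}")
```

This is why deep pipelines demand many micro-batches in flight: at PP = 8, only around a hundred micro-batches push the bubble overhead into the single digits.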
Modern autotuners (e.g., Alpa, Galvatron) use cost models and dynamic programming to search this space automatically.
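A toy version of that search makes the idea concrete: enumerate every (TP, PP, DP) factorisation of the GPU count, then rank the candidates with a cost model. The cost weights below are purely illustrative stand-ins; real autotuners use measured communication and compute profiles.

```python
from itertools import product

def factorizations(n_gpus: int, max_tp: int = 8):
    """All (tp, pp, dp) triples with tp * pp * dp == n_gpus, tp capped at node size."""
    for tp, pp in product(range(1, n_gpus + 1), repeat=2):
        if tp <= max_tp and n_gpus % (tp * pp) == 0:
            yield tp, pp, n_gpus // (tp * pp)

def toy_cost(tp: int, pp: int, dp: int, micro_batches: int = 32) -> float:
    """Stand-in cost model: pipeline bubble plus crude communication penalties.

    The 0.02 / 0.01 weights are illustrative, not measured.
    """
    bubble = (pp - 1) / (micro_batches + pp - 1)
    tp_comm = 0.02 * (tp - 1)   # per-layer AllReduce traffic grows with TP
    dp_comm = 0.01 * (dp - 1)   # gradient AllReduce traffic grows with DP
    return bubble + tp_comm + dp_comm

# Search the 512-GPU space from the example above.
best = min(factorizations(512), key=lambda cfg: toy_cost(*cfg))
print(best)
```

Even this crude model captures the real tension: pushing any one degree to its limit is penalised, and the optimum balances all three.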
## ZeRO + 3D Parallelism
Combining ZeRO with 3D parallelism further reduces memory without changing the parallelism geometry:
- ZeRO Stage 1 shards optimizer states across the data-parallel group — compatible with any TP/PP configuration.
- ZeRO Stage 2 additionally shards gradients.
- ZeRO Stage 3 shards parameters as well, but can conflict with tensor parallelism unless carefully coordinated.
In practice, most 3D-parallel systems use ZeRO Stage 1 or 2, relying on tensor parallelism for per-layer memory reduction.
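The memory arithmetic behind these choices can be sketched directly. The byte counts follow the ZeRO paper's mixed-precision Adam accounting (2 B fp16 params, 2 B fp16 grads, 12 B optimizer state per parameter); the function name is illustrative, and activation memory is deliberately excluded.

```python
def per_gpu_model_memory_gb(params_b: float, dp: int, stage: int,
                            tp: int = 1, pp: int = 1) -> float:
    """Approximate model-state memory per GPU (GB), mixed-precision Adam.

    TP and PP first split the parameters themselves; ZeRO then shards the
    remaining states across the data-parallel group. Stage 0 = no ZeRO.
    Activation memory is not included.
    """
    p = params_b * 1e9 / (tp * pp)                # parameters held by this rank
    param_bytes = 2 * p / (dp if stage >= 3 else 1)
    grad_bytes = 2 * p / (dp if stage >= 2 else 1)
    opt_bytes = 12 * p / (dp if stage >= 1 else 1)
    return (param_bytes + grad_bytes + opt_bytes) / 1e9

# The 175B model on the 512-GPU layout above (TP=8, PP=8, DP=8).
for s in (0, 1, 2):
    gb = per_gpu_model_memory_gb(175, dp=8, stage=s, tp=8, pp=8)
    print(f"ZeRO stage {s}: ~{gb:.1f} GB of model state per GPU")
```

Stage 1 already removes the bulk of the footprint because optimizer states dominate; stage 2's additional gradient sharding gives a smaller further saving, which is why stage 1 or 2 is usually enough alongside TP and PP.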
## Sequence Parallelism + Context Parallelism
For long-context models (128K+ tokens), additional dimensions of parallelism are needed:
- Sequence Parallelism splits LayerNorm and Dropout computations along the sequence dimension within the tensor-parallel group.
- Context Parallelism partitions the input sequence across devices, with each device computing attention only for its assigned segment using ring attention or similar techniques.
These are sometimes called 4D or 5D parallelism when combined with the classic TP/PP/DP trio.
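The partitioning step of context parallelism is simple to sketch. Assuming even chunking (training setups typically pad the sequence so it divides evenly), each rank owns one contiguous token range; the KV exchange that ring attention layers on top is omitted here, and `cp_slice` is an illustrative helper, not a library API.

```python
def cp_slice(seq_len: int, cp_rank: int, cp_size: int):
    """Half-open token range [start, end) owned by one context-parallel rank.

    Assumes seq_len is divisible by cp_size. Each rank computes attention
    for its own queries; with ring attention, key/value blocks from the
    other ranks are streamed in rather than materialised locally.
    """
    assert seq_len % cp_size == 0, "pad the sequence to a multiple of cp_size"
    chunk = seq_len // cp_size
    return cp_rank * chunk, (cp_rank + 1) * chunk

# A 128K-token sequence split across 8 context-parallel ranks.
print(cp_slice(131072, 0, 8))  # (0, 16384)
print(cp_slice(131072, 7, 8))  # (114688, 131072)
```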
## Real-World Training Configurations
| Model | Params | GPUs | TP | PP | DP | Framework |
|---|---|---|---|---|---|---|
| GPT-3 | 175B | 1024 V100 | 8 | 8 | 16 | Megatron-LM |
| Megatron-Turing NLG | 530B | 2240 A100 | 8 | 35 | 8 | Megatron-DeepSpeed |
| LLaMA 3 (405B) | 405B | 16K H100 | 8 | 16 | 128 | Meta internal |
| Mixtral 8×22B (MoE) | 141B total (39B active) | 2048 H100 | 8 | 4 | 64 | Custom (EP+DP) |
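As a sanity check, every row's degrees multiply out to its GPU count; this assumes the Mixtral row's DP column folds in expert parallelism, since the table lists only the classic three dimensions.

```python
# (total GPUs, TP, PP, DP) per row of the table above.
configs = {
    "GPT-3": (1024, 8, 8, 16),
    "Megatron-Turing NLG": (2240, 8, 35, 8),
    "LLaMA 3 405B": (16384, 8, 16, 128),
    "Mixtral 8x22B": (2048, 8, 4, 64),
}
for name, (gpus, tp, pp, dp) in configs.items():
    assert tp * pp * dp == gpus, name
    print(f"{name}: {tp} x {pp} x {dp} = {gpus} GPUs")
```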