Why Distribute Training?#
The cost of training a single AI model is growing exponentially. Scaling laws — first formalized by Kaplan et al. (2020) and refined by Hoffmann et al. (2022) in the Chinchilla paper — show that model quality improves predictably with more parameters and more data, but the compute budget required grows super-linearly.
A single state-of-the-art GPU (e.g., an NVIDIA H100 with 80 GB of HBM3) can hold roughly 40 billion parameters in fp16, at 2 bytes per parameter. Training a 175B-parameter model at this precision requires about 350 GB for the weights alone, i.e. at least five GPUs, before accounting for optimizer states, activations, and gradient buffers.
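The arithmetic behind these GPU counts is worth making explicit. A minimal sketch, counting only raw fp16 weight storage (2 bytes per parameter) and ignoring optimizer states, activations, and gradients, as the text does:

```python
import math

BYTES_PER_PARAM_FP16 = 2        # fp16 stores each parameter in 2 bytes
H100_MEMORY_BYTES = 80e9        # 80 GB of HBM3 per H100

def min_gpus_for_weights(num_params: int) -> int:
    """Minimum number of GPUs needed just to hold the raw fp16 weights."""
    return math.ceil(num_params * BYTES_PER_PARAM_FP16 / H100_MEMORY_BYTES)

# 175B params * 2 bytes = 350 GB -> ceil(350 / 80) = 5 GPUs
print(min_gpus_for_weights(175_000_000_000))  # → 5
```

In practice the real multiplier is far larger: Adam-style optimizer states and fp32 master weights alone push per-parameter memory to roughly 16 bytes before activations are counted, which is exactly why the memory row of the table below needs its own parallelism strategies.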
Why this matters for AGIACC: distributed training is not just a scaling technique; it is a high-value infrastructure surface. The more complex the cluster, the more important trustworthy boundaries and operational assurance become.
The Three Bottlenecks#
| Bottleneck | Description | Solution |
|---|---|---|
| Memory | Model + optimizer + activations exceed device RAM | Model / tensor / pipeline parallelism |
| Compute | Single-device throughput too slow for practical schedules | Data parallelism, hardware scaling |
| Communication | Synchronising state across devices introduces overhead | Gradient compression, overlap, topology-aware placement |
Scaling Laws in Practice#
Scaling laws predict that a model with N parameters trained on D tokens follows a power-law relationship, with the compute budget well approximated by C ≈ 6·N·D FLOPs:
Loss ≈ E + α · N^{-a} + β · D^{-b}
where E is the irreducible loss of the data distribution and α, β, a, b are fitted constants.
This implies that there exists an optimal (N*, D*) pair for any given budget — Chinchilla-optimal training. Most deployed LLMs in 2025 are trained at or beyond Chinchilla optimality, meaning both parameter counts and dataset sizes have increased together, amplifying the need for distributed methods.
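The existence of that optimal pair can be sketched numerically. The snippet below assumes the common approximation C ≈ 6·N·D FLOPs and uses illustrative placeholder coefficients, not the fitted constants from the Chinchilla paper; only the shape of the trade-off is the point:

```python
# Illustrative coefficients for Loss ≈ E + alpha*N^-a + beta*D^-b.
# These values are placeholders (NOT the fitted Chinchilla constants);
# E is an assumed irreducible-loss floor.
E, alpha, beta = 1.7, 400.0, 400.0
a, b = 0.34, 0.28

def loss(N: float, D: float) -> float:
    """Power-law loss in parameters N and training tokens D."""
    return E + alpha * N**-a + beta * D**-b

def chinchilla_split(C: float, steps: int = 2000):
    """Grid-search N; D then follows from the C ~ 6*N*D approximation."""
    best_N, best_D, best_L = None, None, float("inf")
    for i in range(steps + 1):
        N = 10 ** (8 + 4 * i / steps)   # sweep N over 1e8 .. 1e12
        D = C / (6 * N)                 # spend the rest of the budget on data
        L = loss(N, D)
        if L < best_L:
            best_N, best_D, best_L = N, D, L
    return best_N, best_D
```

Running this for increasing budgets shows both N* and D* growing together, which is the behaviour the paragraph above describes: scaling compute optimally means scaling model size and dataset size at the same time.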
Communication Patterns#
Every distributed training strategy relies on one or more collective communication primitives. The most common:
AllReduce#
All workers contribute a local gradient vector and receive the globally averaged result. The dominant primitive for data-parallel training.
- Ring AllReduce — Each worker sends to and receives from exactly one neighbour. Per-worker communication cost is 2(N-1)/N × message_size, effectively bandwidth-optimal.
- Tree AllReduce — Hierarchical reduction for very large clusters with multi-hop topologies.
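The two-phase structure behind that cost figure — a reduce-scatter pass followed by an allgather pass, each moving (N-1)/N of the vector per worker — can be simulated in a single process. A sketch, with plain Python lists standing in for device buffers:

```python
def ring_allreduce(grads: list[list[float]]) -> list[list[float]]:
    """Simulate ring AllReduce: every worker ends with the global sum
    (divide by len(grads) afterwards for the average)."""
    n = len(grads)
    size = len(grads[0])
    assert size % n == 0, "vector length must divide evenly into n chunks"
    chunk = size // n
    bufs = [list(g) for g in grads]          # private copy per worker

    def sl(c: int) -> slice:                 # slice covering chunk index c
        c %= n
        return slice(c * chunk, (c + 1) * chunk)

    # Phase 1: reduce-scatter. After n-1 steps, worker r owns the fully
    # summed chunk (r + 1) % n, having sent (n-1)/n of the vector.
    for s in range(n - 1):
        for r in range(n):
            dst, c = (r + 1) % n, sl(r - s)  # send to the right neighbour
            bufs[dst][c] = [x + y for x, y in zip(bufs[dst][c], bufs[r][c])]

    # Phase 2: allgather. n-1 more steps circulate the finished chunks,
    # for total traffic of 2(n-1)/n * message_size per worker.
    for s in range(n - 1):
        for r in range(n):
            dst, c = (r + 1) % n, sl(r + 1 - s)
            bufs[dst][c] = bufs[r][c]        # overwrite with the final chunk
    return bufs
```

Each worker only ever talks to its two ring neighbours, which is why the algorithm's bandwidth use is independent of cluster size even though its latency (number of steps) grows linearly with N.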
AllGather#
Each worker broadcasts its shard; every worker gets the fully assembled tensor. Used in ZeRO Stage 3 and Fully Sharded Data Parallel (FSDP) to reconstruct parameters on demand.
Reduce-Scatter#
The inverse of AllGather — reduces global data and scatters shards back. Used to shard gradients after a backward pass.
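These two primitives compose into AllReduce: Reduce-Scatter produces the summed shards, and AllGather reassembles them, which is precisely the decomposition ZeRO and FSDP exploit to keep only shards resident between the two phases. A centralized sketch of the semantics (not the ring-based wire protocol):

```python
def reduce_scatter(vectors: list[list[float]]) -> list[list[float]]:
    """Each of n workers keeps only the summed shard it owns."""
    n = len(vectors)
    chunk = len(vectors[0]) // n
    summed = [sum(col) for col in zip(*vectors)]     # global reduction
    return [summed[r * chunk:(r + 1) * chunk] for r in range(n)]

def all_gather(shards: list[list[float]]) -> list[list[float]]:
    """Every worker receives the concatenation of all shards."""
    full = [x for shard in shards for x in shard]
    return [list(full) for _ in shards]

# reduce_scatter followed by all_gather reproduces AllReduce on every worker
```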
Point-to-Point (Send / Recv)#
Direct communication between specific GPU pairs. The primitive underlying pipeline parallelism, where activations flow from one stage to the next.
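A toy sketch of this pattern, with a bounded queue standing in for the send/recv channel between two pipeline stages; the stage computations here are hypothetical stand-ins for groups of model layers:

```python
import queue
import threading

def run_pipeline(inputs: list[float]) -> list[float]:
    """Two-stage pipeline: stage 0 'sends' activations to stage 1."""
    channel = queue.Queue(maxsize=2)       # bounded, like a comm buffer
    results: list[float] = []

    def stage0() -> None:                  # stand-in for the first layer group
        for x in inputs:
            channel.put(x * 2)             # forward activation downstream
        channel.put(None)                  # end-of-stream sentinel

    def stage1() -> None:                  # stand-in for the second layer group
        while (act := channel.get()) is not None:
            results.append(act + 1)

    t0 = threading.Thread(target=stage0)
    t1 = threading.Thread(target=stage1)
    t0.start(); t1.start()
    t0.join(); t1.join()
    return results
```

The bounded queue captures the essential property of pipeline parallelism: both stages run concurrently on different microbatches, and backpressure from a full channel stalls the upstream stage, which is the single-process analogue of a pipeline bubble.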
Taxonomy of Parallelism#
At the highest level, distributed training strategies split into four families — often composed together in production systems:
- Data Parallelism — replicate the full model, split the data
- Model (Tensor) Parallelism — split individual layers across devices
- Pipeline Parallelism — split sequential stages across devices
- Hybrid (3D) Parallelism — combine all three for maximum scale
The following pages explore each in detail.
Next: Data Parallelism →