Why Distribute Training?#
The cost of training a single AI model is growing exponentially. Scaling laws — first formalized by Kaplan et al. (2020) and refined by Hoffmann et al. (2022) in the Chinchilla paper — show that model quality improves predictably with more parameters and more data, but the compute budget required grows super-linearly.
A single state-of-the-art GPU (e.g., an NVIDIA H100 with 80 GB of HBM3) can hold roughly 40 billion parameters in fp16, at 2 bytes per parameter. Training a 175B-parameter model at this precision requires about 350 GB for the weights alone, i.e. at least five GPUs, before accounting for optimizer states, activations, and gradient buffers.
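The arithmetic behind these GPU counts is worth making explicit. A minimal sketch, counting only raw fp16 weight storage (2 bytes per parameter) and ignoring optimizer states, activations, and gradients, as the text does:

```python
import math

BYTES_PER_PARAM_FP16 = 2        # fp16 stores each parameter in 2 bytes
H100_MEMORY_BYTES = 80e9        # 80 GB of HBM3 per H100

def min_gpus_for_weights(num_params: int) -> int:
    """Minimum number of GPUs needed just to hold the raw fp16 weights."""
    return math.ceil(num_params * BYTES_PER_PARAM_FP16 / H100_MEMORY_BYTES)

# 175B params * 2 bytes = 350 GB -> ceil(350 / 80) = 5 GPUs
print(min_gpus_for_weights(175_000_000_000))  # → 5
```

In practice the real multiplier is far larger: Adam-style optimizer states and fp32 master weights alone push per-parameter memory to roughly 16 bytes before activations are counted, which is exactly why the memory row of the table below needs its own parallelism strategies.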
Why this matters for AGIACC: distributed training is not just a scaling technique; it is a high-value infrastructure surface. The more complex the cluster, the more important trustworthy boundaries and operational assurance become.
The Three Bottlenecks#
| Bottleneck | Description | Solution |
|---|---|---|
| Memory | Model + optimizer + activations exceed device RAM | Model / tensor / pipeline parallelism |
| Compute | Single-device throughput too slow for practical schedules | Data parallelism, hardware scaling |
| Communication | Synchronising state across devices introduces overhead | Gradient compression, overlap, topology-aware placement |
Scaling Laws in Practice#
Scaling laws predict that a model with N parameters trained on D tokens follows a power-law relationship, with the compute budget well approximated by C ≈ 6·N·D FLOPs:
Loss ≈ E + α · N^{-a} + β · D^{-b}
where E is the irreducible loss of the data distribution and α, β, a, b are fitted constants.
This implies that there exists an optimal (N*, D*) pair for any given budget — Chinchilla-optimal training. Most deployed LLMs in 2025 are trained at or beyond Chinchilla optimality, meaning both parameter counts and dataset sizes have increased together, amplifying the need for distributed methods.
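The existence of that optimal pair can be sketched numerically. The snippet below assumes the common approximation C ≈ 6·N·D FLOPs and uses illustrative placeholder coefficients, not the fitted constants from the Chinchilla paper; only the shape of the trade-off is the point:

```python
# Illustrative coefficients for Loss ≈ E + alpha*N^-a + beta*D^-b.
# These values are placeholders (NOT the fitted Chinchilla constants);
# E is an assumed irreducible-loss floor.
E, alpha, beta = 1.7, 400.0, 400.0
a, b = 0.34, 0.28

def loss(N: float, D: float) -> float:
    """Power-law loss in parameters N and training tokens D."""
    return E + alpha * N**-a + beta * D**-b

def chinchilla_split(C: float, steps: int = 2000):
    """Grid-search N; D then follows from the C ~ 6*N*D approximation."""
    best_N, best_D, best_L = None, None, float("inf")
    for i in range(steps + 1):
        N = 10 ** (8 + 4 * i / steps)   # sweep N over 1e8 .. 1e12
        D = C / (6 * N)                 # spend the rest of the budget on data
        L = loss(N, D)
        if L < best_L:
            best_N, best_D, best_L = N, D, L
    return best_N, best_D
```

Running this for increasing budgets shows both N* and D* growing together, which is the behaviour the paragraph above describes: scaling compute optimally means scaling model size and dataset size at the same time.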
Communication Patterns#
Every distributed training strategy relies on one or more collective communication primitives. The most common:
AllReduce#
All workers contribute a local gradient vector and receive the globally averaged result. The dominant primitive for data-parallel training.
- Ring AllReduce — Each worker sends to and receives from exactly one neighbour. Per-worker communication cost is 2(N-1)/N × message_size, effectively bandwidth-optimal.
- Tree AllReduce — Hierarchical reduction for very large clusters with multi-hop topologies.
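The two-phase structure behind that cost figure — a reduce-scatter pass followed by an allgather pass, each moving (N-1)/N of the vector per worker — can be simulated in a single process. A sketch, with plain Python lists standing in for device buffers:

```python
def ring_allreduce(grads: list[list[float]]) -> list[list[float]]:
    """Simulate ring AllReduce: every worker ends with the global sum
    (divide by len(grads) afterwards for the average)."""
    n = len(grads)
    size = len(grads[0])
    assert size % n == 0, "vector length must divide evenly into n chunks"
    chunk = size // n
    bufs = [list(g) for g in grads]          # private copy per worker

    def sl(c: int) -> slice:                 # slice covering chunk index c
        c %= n
        return slice(c * chunk, (c + 1) * chunk)

    # Phase 1: reduce-scatter. After n-1 steps, worker r owns the fully
    # summed chunk (r + 1) % n, having sent (n-1)/n of the vector.
    for s in range(n - 1):
        for r in range(n):
            dst, c = (r + 1) % n, sl(r - s)  # send to the right neighbour
            bufs[dst][c] = [x + y for x, y in zip(bufs[dst][c], bufs[r][c])]

    # Phase 2: allgather. n-1 more steps circulate the finished chunks,
    # for total traffic of 2(n-1)/n * message_size per worker.
    for s in range(n - 1):
        for r in range(n):
            dst, c = (r + 1) % n, sl(r + 1 - s)
            bufs[dst][c] = bufs[r][c]        # overwrite with the final chunk
    return bufs
```

Each worker only ever talks to its two ring neighbours, which is why the algorithm's bandwidth use is independent of cluster size even though its latency (number of steps) grows linearly with N.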
AllGather#
Each worker broadcasts its shard; every worker gets the fully assembled tensor. Used in ZeRO Stage 3 and Fully Sharded Data Parallel (FSDP) to reconstruct parameters on demand.
Reduce-Scatter#
The inverse of AllGather — reduces global data and scatters shards back. Used to shard gradients after a backward pass.
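These two primitives compose into AllReduce: Reduce-Scatter produces the summed shards, and AllGather reassembles them, which is precisely the decomposition ZeRO and FSDP exploit to keep only shards resident between the two phases. A centralized sketch of the semantics (not the ring-based wire protocol):

```python
def reduce_scatter(vectors: list[list[float]]) -> list[list[float]]:
    """Each of n workers keeps only the summed shard it owns."""
    n = len(vectors)
    chunk = len(vectors[0]) // n
    summed = [sum(col) for col in zip(*vectors)]     # global reduction
    return [summed[r * chunk:(r + 1) * chunk] for r in range(n)]

def all_gather(shards: list[list[float]]) -> list[list[float]]:
    """Every worker receives the concatenation of all shards."""
    full = [x for shard in shards for x in shard]
    return [list(full) for _ in shards]

# reduce_scatter followed by all_gather reproduces AllReduce on every worker
```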
Point-to-Point (Send / Recv)#
Direct communication between specific GPU pairs. The primitive underlying pipeline parallelism, where activations flow from one stage to the next.
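A toy sketch of this pattern, with a bounded queue standing in for the send/recv channel between two pipeline stages; the stage computations here are hypothetical stand-ins for groups of model layers:

```python
import queue
import threading

def run_pipeline(inputs: list[float]) -> list[float]:
    """Two-stage pipeline: stage 0 'sends' activations to stage 1."""
    channel = queue.Queue(maxsize=2)       # bounded, like a comm buffer
    results: list[float] = []

    def stage0() -> None:                  # stand-in for the first layer group
        for x in inputs:
            channel.put(x * 2)             # forward activation downstream
        channel.put(None)                  # end-of-stream sentinel

    def stage1() -> None:                  # stand-in for the second layer group
        while (act := channel.get()) is not None:
            results.append(act + 1)

    t0 = threading.Thread(target=stage0)
    t1 = threading.Thread(target=stage1)
    t0.start(); t1.start()
    t0.join(); t1.join()
    return results
```

The bounded queue captures the essential property of pipeline parallelism: both stages run concurrently on different microbatches, and backpressure from a full channel stalls the upstream stage, which is the single-process analogue of a pipeline bubble.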
Taxonomy of Parallelism#
At the highest level, distributed training strategies split into four families — often composed together in production systems:
- Data Parallelism — replicate the full model, split the data
- Model (Tensor) Parallelism — split individual layers across devices
- Pipeline Parallelism — split sequential stages across devices
- Hybrid (3D) Parallelism — combine all three for maximum scale
The following pages explore each in detail.
Next: Data Parallelism →