Fundamentals of Distributed Training
Why Distribute Training? #

The cost of training a single AI model is growing exponentially. Scaling laws — first formalised by Kaplan et al. (2020) and refined by Chinchilla (2022) — show that model quality improves predictably with more parameters and more data, but the compute budget required grows super-linearly.
A single state-of-the-art GPU (e.g., NVIDIA H100 with 80 GB of HBM3) can hold roughly 40 billion parameters in fp16 (2 bytes per parameter). A 175B-parameter model at this precision occupies about 350 GB, so at minimum 5 such GPUs are required just to fit the model weights — before accounting for optimizer states, activations, and gradient buffers.
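The back-of-envelope arithmetic above can be sketched as a small helper. This is an illustrative sketch, not code from the source; it counts only raw weight memory and ignores optimizer states, activations, and gradients, exactly as the text caveats.

```python
import math

def min_gpus_for_weights(n_params: float, bytes_per_param: int = 2,
                         gpu_mem_gb: int = 80) -> int:
    """Minimum GPUs needed just to hold the raw weights.

    Assumptions (illustrative, not from the source): weights only —
    no optimizer states, activations, or gradient buffers; decimal
    gigabytes; simple ceiling division across devices.
    """
    weight_bytes = n_params * bytes_per_param      # fp16 = 2 bytes/param
    gpu_bytes = gpu_mem_gb * 1e9                   # per-GPU capacity
    return math.ceil(weight_bytes / gpu_bytes)

# 175B params * 2 bytes = 350 GB -> 5 GPUs at 80 GB each
print(min_gpus_for_weights(175e9))  # -> 5
```

Note this is a floor, not an estimate of real requirements: with Adam-style optimizer states and mixed-precision master weights, the practical footprint per parameter is several times larger.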