Distributed training

Technology (general)

These articles explain how distributed training works — data, tensor, pipeline, and hybrid parallelism — and name the trust boundaries that matter when you secure training clusters. This is background reading that helps technical and investment audiences understand where AI infrastructure risk appears before deployment.

Frontier models exceed the memory of any single device, so training splits work across devices, networks, and schedulers. That scale is exactly why supply-chain compromise, node integrity, and lateral movement inside clusters show up in threat models — topics we connect to in Research and scenario writing in Showcase.


Start here: Fundamentals · Data Parallelism · Model Parallelism · Pipeline Parallelism · Hybrid Parallelism · Frameworks & Tools · Security Challenges

Fundamentals of Distributed Training

477 words·3 mins
Why Distribute Training? # The cost of training a single AI model is growing exponentially. Scaling laws — first formalised by Kaplan et al. (2020) and refined by Chinchilla (2022) — show that model quality improves predictably with more parameters and more data, but the compute budget required grows super-linearly. A single state-of-the-art GPU (e.g., an NVIDIA H100 with 80 GB of HBM3) can hold roughly 40 billion parameters in fp16. Training a 175B-parameter model at this precision requires at least 5 GPUs just to fit the model weights — before accounting for optimizer states, activations, and gradient buffers.
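The memory arithmetic above can be checked with a short back-of-envelope script. The helper below is illustrative, not from the article, and deliberately counts only raw weights; real training adds optimizer states, activations, and gradient buffers on top.

```python
import math

def min_gpus_for_weights(params_billions: float,
                         bytes_per_param: int = 2,   # fp16 = 2 bytes/param
                         gpu_memory_gb: int = 80):   # e.g. H100 HBM3
    """Minimum GPU count just to hold the raw model weights."""
    weight_gb = params_billions * bytes_per_param    # 1e9 params * bytes / 1e9
    return math.ceil(weight_gb / gpu_memory_gb)

print(min_gpus_for_weights(40))   # 80 GB of fp16 weights -> 1 GPU
print(min_gpus_for_weights(175))  # 350 GB of fp16 weights -> 5 GPUs
```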
Data Parallelism

506 words·3 mins
Core Idea # Data parallelism is the simplest and most widely adopted form of distributed training. The strategy:

1. Replicate the entire model on every device (GPU / TPU / accelerator).
2. Partition the training dataset into disjoint mini-batches, one per replica.
3. Each device computes a forward pass and backward pass on its local mini-batch.
4. Synchronise gradients across all replicas (typically via AllReduce).
5. Every replica applies the identical parameter update, keeping models in sync.

Because each device processes different data but shares the same model, data parallelism achieves near-linear scaling when communication overhead is well managed.
Model Parallelism

528 words·3 mins
Core Idea # When a model’s layers are too large to fit on a single accelerator, model parallelism partitions the model itself across multiple devices. Unlike data parallelism (which replicates the model), model parallelism places different parts of the computation graph on different GPUs. The most important variant is tensor parallelism — splitting the weight matrices of individual layers so that each device computes a slice of a single operation (e.g., a matrix multiplication) in parallel.
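The tensor-parallel slicing described above can be demonstrated with NumPy: split a weight matrix column-wise across "devices", let each compute its slice of the matmul, and concatenate the partial outputs. A minimal sketch, not tied to any particular framework.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 8))      # a batch of activations
W = rng.standard_normal((8, 6))      # the full weight matrix

# Column-parallel split: each "device" owns a slice of W's columns.
shards = np.split(W, 2, axis=1)

# Each device computes its partial output (in parallel in practice),
# then the results are gathered by concatenation.
partial = [x @ w for w in shards]
y = np.concatenate(partial, axis=1)

# The sharded computation matches the unsharded matmul exactly.
assert np.allclose(y, x @ W)
```

A row-wise split of `W` works symmetrically, but then each device produces a full-sized partial result and the gather becomes a sum (an AllReduce) instead of a concatenation.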
Pipeline Parallelism

571 words·3 mins
Core Idea # Pipeline parallelism divides a neural network into sequential stages, each assigned to a different device. Data flows through these stages like an assembly line — the output of Stage 1 becomes the input of Stage 2, and so on. This is a form of model parallelism, but operates at the inter-layer level (grouping consecutive layers together) rather than the intra-layer level (splitting individual layers, as in tensor parallelism).
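One consequence of the assembly-line structure is the pipeline "bubble": while the first micro-batches fill the pipeline and the last ones drain it, some stages sit idle. Under a GPipe-style schedule with S stages and M micro-batches, the idle fraction is (S − 1) / (M + S − 1), which is why inputs are split into many micro-batches. A quick illustrative calculation:

```python
def bubble_fraction(stages: int, micro_batches: int) -> float:
    """Idle fraction of a GPipe-style pipeline schedule."""
    return (stages - 1) / (micro_batches + stages - 1)

print(bubble_fraction(4, 1))   # no micro-batching: 75% of device-time idle
print(bubble_fraction(4, 16))  # 16 micro-batches: ~15.8% idle
```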
Hybrid & 3D Parallelism

553 words·3 mins
Why Hybrid? # No single parallelism strategy is sufficient for frontier-scale training. Each has strengths suited to a specific axis of the cluster:

| Strategy | Best For | Communication | Typical Scope |
| --- | --- | --- | --- |
| Tensor Parallelism | Large individual layers | AllReduce (high bandwidth) | Intra-node (NVLink) |
| Pipeline Parallelism | Many sequential stages | Point-to-point | Inter-node (InfiniBand) |
| Data Parallelism | Scaling throughput | AllReduce (gradient sync) | Across node groups |

3D Parallelism (or hybrid parallelism) composes all three, mapping each strategy to the interconnect tier where it performs best.
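The composition can be made concrete by mapping each global rank to a (data, pipeline, tensor) coordinate. The ordering below is an assumption for illustration (real launchers such as Megatron-LM choose their own layout); the key idea is that tensor-parallel peers get adjacent ranks, so they land inside the same node's high-bandwidth NVLink domain.

```python
def rank_to_coords(rank: int, tp: int = 2, pp: int = 2, dp: int = 2):
    """Map a global rank to (data, pipeline, tensor) coordinates.

    Tensor-parallel index varies fastest, so ranks 0..tp-1 share a node;
    this layout is an illustrative assumption, not a fixed standard.
    """
    t = rank % tp
    p = (rank // tp) % pp
    d = rank // (tp * pp)
    return d, p, t

for r in range(8):   # a 2 x 2 x 2 cluster of 8 devices
    print(r, rank_to_coords(r))
```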
Frameworks & Tools

503 words·3 mins
Overview # The distributed training ecosystem in 2025 centres on a small number of frameworks, each with distinct strengths. Most production systems use a combination of these tools.
Security Challenges in Distributed Training

569 words·3 mins
A Massively Expanded Attack Surface # Distributed training transforms a single-machine workload into a large-scale networked system. Every added dimension of parallelism introduces new attack vectors: