Core Idea#
When a model’s layers are too large to fit on a single accelerator, model parallelism partitions the model itself across multiple devices. Unlike data parallelism (which replicates the model), model parallelism places different parts of the computation graph on different GPUs.
The most important variant is tensor parallelism — splitting the weight matrices of individual layers so that each device computes a slice of a single operation (e.g., a matrix multiplication) in parallel.
Why this matters for AGIACC: as models get wider, training depends on tightly coupled accelerators, privileged runtimes, and trusted interconnects. That concentration of value also concentrates risk.
Tensor Parallelism (Intra-Layer)#
Tensor parallelism was formalised by the Megatron-LM project at NVIDIA. The key insight is that large matrix multiplications — the computational core of transformer attention and feed-forward layers — can be decomposed and distributed with minimal communication.
Column-Parallel Linear#
A linear layer Y = XA can be split by partitioning A column-wise across devices:
A = [A₁ | A₂ | ... | Aₙ]
Yᵢ = X · Aᵢ (computed independently on device i)
Y = [Y₁ | Y₂ | ... | Yₙ] (concatenated via AllGather)
Each device computes a thin slice of the output. The slices only need to be gathered when the full output must be materialized; a following row-parallel layer can instead consume each slice in place with no communication.
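The column split can be simulated with plain NumPy, treating each array shard as one device's local slice (all shapes and names here are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
n_devices = 4
X = rng.standard_normal((2, 8))    # input: (batch, hidden)
A = rng.standard_normal((8, 16))   # weight: columns will be sharded

# Column-parallel: each "device" holds one column slice of A.
shards = np.split(A, n_devices, axis=1)      # A = [A1 | A2 | A3 | A4]
Y_parts = [X @ A_i for A_i in shards]        # independent local matmuls
Y = np.concatenate(Y_parts, axis=1)          # the AllGather: concatenate slices

assert np.allclose(Y, X @ A)                 # matches the unsharded matmul
```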
Row-Parallel Linear#
For the subsequent layer, A is split row-wise, and the input X is split column-wise — matching the sharded output of the preceding column-parallel layer:
A = [A₁; A₂; ...; Aₙ] (stacked vertically)
X = [X₁ | X₂ | ... | Xₙ]
Yᵢ = Xᵢ · Aᵢ (each device uses its own input shard)
Y = Σ Yᵢ (summed via AllReduce)
By pairing a column-parallel layer with a row-parallel layer, no communication is needed between the two: each device's output slice from the first matmul feeds directly into its row shard of the second. A full transformer layer (attention plus MLP) then requires only four AllReduce operations — two in the forward pass and two in the backward pass.
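A minimal NumPy sketch of the column-then-row pairing, again simulating devices as array shards. The only collective is the final sum, which plays the role of the AllReduce:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 4
X = rng.standard_normal((2, 8))
A = rng.standard_normal((8, 16))   # first linear: column-parallel
B = rng.standard_normal((16, 8))   # second linear: row-parallel

A_cols = np.split(A, n, axis=1)    # A = [A1 | A2 | ... | An]
B_rows = np.split(B, n, axis=0)    # B = [B1; B2; ...; Bn]

# On each device, the local column-parallel output feeds directly into
# the local row-parallel matmul -- no communication in between.
partials = [(X @ A_i) @ B_i for A_i, B_i in zip(A_cols, B_rows)]

# One AllReduce (sum over devices) reconstructs the full output.
Y = np.sum(partials, axis=0)
assert np.allclose(Y, X @ A @ B)
```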
Self-Attention Parallelism#
Multi-head attention naturally lends itself to tensor parallelism. Since attention heads are independent:
- Each device computes a subset of attention heads.
- Key, Query, and Value projections are column-parallel.
- The output projection is row-parallel.
- Communication is limited to AllReduce on the output.
For a model with 96 attention heads across 8 GPUs, each GPU computes 12 heads independently.
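Because heads are fully independent, head-sharded attention reproduces the single-device result exactly. A small NumPy sketch (head counts and shapes are illustrative, not taken from any particular model):

```python
import numpy as np

def attention(Q, K, V):
    # Scaled dot-product attention for one head: (seq, d_head) tensors.
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ V

rng = np.random.default_rng(2)
n_heads, n_devices, seq, d_head = 8, 4, 5, 16
Q, K, V = (rng.standard_normal((n_heads, seq, d_head)) for _ in range(3))

# Baseline: all heads computed on one device.
full = np.concatenate([attention(Q[h], K[h], V[h]) for h in range(n_heads)],
                      axis=-1)

# Tensor-parallel: each device owns a contiguous block of heads.
per_dev = n_heads // n_devices
parts = []
for dev in range(n_devices):
    heads = range(dev * per_dev, (dev + 1) * per_dev)
    parts.append(np.concatenate([attention(Q[h], K[h], V[h]) for h in heads],
                                axis=-1))
out = np.concatenate(parts, axis=-1)   # heads are independent: results match

assert np.allclose(out, full)
```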
Communication Cost#
Within a single node connected by NVLink / NVSwitch (900 GB/s bandwidth on DGX H100), tensor parallelism communication overhead is manageable:
| Tensor Parallel Degree | Communication Volume (per layer, forward) | Typical Efficiency |
|---|---|---|
| 2 | 2 AllReduce of batch × seq × hidden activations | ~97% |
| 4 | 2 AllReduce of batch × seq × hidden activations | ~94% |
| 8 | 2 AllReduce of batch × seq × hidden activations | ~90% |
Critical constraint: Tensor parallelism is generally limited to intra-node use (GPUs within a single machine), because it requires very high-bandwidth, low-latency interconnects. Cross-node communication typically uses pipeline or data parallelism instead.
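A back-of-envelope estimate of this traffic, assuming ring AllReduce (which moves 2(p−1)/p of the message per GPU per operation) and illustrative tensor shapes chosen here, not taken from any specific model:

```python
# Illustrative shapes: hidden=12288, seq=2048, micro-batch=1, bf16 activations.
hidden, seq, batch = 12288, 2048, 1
bytes_per_el = 2                              # bf16
msg = batch * seq * hidden * bytes_per_el     # one activation tensor

p = 8                                         # tensor-parallel degree
# Ring AllReduce moves 2*(p-1)/p of the message per GPU, per operation.
per_allreduce = 2 * (p - 1) / p * msg
# Four AllReduces per transformer layer in the Megatron scheme
# (two forward + two backward):
total = 4 * per_allreduce
print(f"{total / 1e9:.2f} GB moved per GPU per layer")   # -> 0.35 GB
```

At NVLink-class bandwidth this is cheap; over a typical cross-node network it would dominate, which is the constraint described above.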
Sequence Parallelism#
Recent work extends tensor parallelism to the LayerNorm and Dropout operations that sit between tensor-parallel regions. Sequence parallelism (Megatron v3) partitions these regions along the sequence dimension, reducing their activation memory by the tensor-parallel degree at no extra communication cost: each AllReduce is replaced by a ReduceScatter plus an AllGather of equal combined volume.
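Sequence-dimension sharding works for LayerNorm because it normalizes each token independently over the hidden dimension, so every sequence shard can be normalized locally. A minimal sketch (shapes illustrative, affine scale/bias omitted):

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalise over the hidden dimension, independently per token.
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

rng = np.random.default_rng(3)
n_devices, seq, hidden = 4, 8, 16
X = rng.standard_normal((seq, hidden))

# Sequence parallelism: each "device" holds only its slice of the
# sequence, so activation memory in this region shrinks by n_devices.
shards = np.split(X, n_devices, axis=0)
Y = np.concatenate([layer_norm(s) for s in shards], axis=0)

assert np.allclose(Y, layer_norm(X))   # identical to the unsharded result
```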
Expert Parallelism (MoE)#
Mixture-of-Experts (MoE) models introduce a different kind of model parallelism: instead of splitting layers, they activate only a subset of “expert” sub-networks for each input token. Each expert resides on a different device, and an All-to-All communication operation routes tokens to their assigned experts.
MoE decouples total parameter count from per-token compute: Switch Transformer scales to 1.6 trillion parameters, and Mixtral 8×7B activates roughly 13B of its 47B parameters per token.
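A toy top-1 router illustrates the idea, simulating the All-to-All by grouping tokens by their assigned expert. The gating scheme and shapes are illustrative; real MoE layers add capacity limits, load-balancing losses, and often top-2 routing:

```python
import numpy as np

rng = np.random.default_rng(4)
n_experts, n_tokens, hidden = 4, 12, 8
tokens = rng.standard_normal((n_tokens, hidden))
gate_W = rng.standard_normal((hidden, n_experts))            # router weights
experts = [rng.standard_normal((hidden, hidden)) for _ in range(n_experts)]

# Top-1 gating: each token goes to the expert with the highest router score.
assignment = (tokens @ gate_W).argmax(axis=-1)

# "All-to-All": gather each expert's tokens, apply the expert (which would
# live on its own device), and scatter the results back into token order.
out = np.empty_like(tokens)
for e in range(n_experts):
    idx = np.where(assignment == e)[0]
    if idx.size:
        out[idx] = tokens[idx] @ experts[e]   # expert e's local computation
```

Only one expert's weights touch each token, which is why per-token FLOPs stay close to a much smaller dense model.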
Next: Pipeline Parallelism →