Pipeline Parallelism

Distributed Training Methods - This article is part of a series.
Part 4: This Article

Core Idea

Pipeline parallelism divides a neural network into sequential stages, each assigned to a different device. Data flows through these stages like an assembly line — the output of Stage 1 becomes the input of Stage 2, and so on.

This is a form of model parallelism, but it operates at the inter-layer level (grouping consecutive layers together) rather than the intra-layer level (splitting individual layers, as tensor parallelism does).
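As a concrete sketch, inter-layer partitioning just groups consecutive layers into contiguous stages. The function below is illustrative (the name `partition_layers` is not from any framework) and splits a layer list as evenly as possible:

```python
def partition_layers(layers, num_stages):
    """Split a list of layers into num_stages contiguous groups (stages).

    Earlier stages take one extra layer when the split is uneven.
    """
    base, extra = divmod(len(layers), num_stages)
    stages, start = [], 0
    for s in range(num_stages):
        size = base + (1 if s < extra else 0)
        stages.append(layers[start:start + size])
        start += size
    return stages

# 8 layers across 4 devices -> 2 consecutive layers per stage
print(partition_layers(list(range(8)), 4))  # [[0, 1], [2, 3], [4, 5], [6, 7]]
```

In practice each group would be placed on its own device, and activations would be sent point-to-point between neighbouring stages.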

Why this matters for AGIACC: pipelines create explicit stage boundaries, hand-off points, and scheduling dependencies. Those are performance choices, but they are also natural places to think about containment and assurance.


The Bubble Problem

Naive pipeline parallelism suffers from severe underutilisation. Consider a 4-stage pipeline processing a single mini-batch:

Time →
GPU 0: [Fwd Stage 1] [         idle         ] [Bwd Stage 1]
GPU 1:       [Fwd Stage 2] [    idle    ] [Bwd Stage 2]
GPU 2:             [Fwd Stage 3] [idle ] [Bwd Stage 3]
GPU 3:                   [Fwd Stage 4] [Bwd Stage 4]

At any given time, only one GPU is active — the rest are idle. This idle time is called the pipeline bubble, and it grows linearly with the number of stages.


GPipe: Micro-Batch Pipelining

GPipe (Huang et al., 2019) was the first practical solution to the bubble problem. The idea: split each mini-batch into M micro-batches and pipeline them through the stages:

Time → (4 stages, 4 micro-batches)
GPU 0: [F₁] [F₂] [F₃] [F₄] [B₄] [B₃] [B₂] [B₁]
GPU 1:      [F₁] [F₂] [F₃] [F₄] [B₄] [B₃] [B₂] [B₁]
GPU 2:           [F₁] [F₂] [F₃] [F₄] [B₄] [B₃] [B₂] [B₁]
GPU 3:                [F₁] [F₂] [F₃] [F₄] [B₄] [B₃] [B₂] [B₁]

GPipe executes all forward micro-batches first, then all backward micro-batches. The bubble fraction reduces to approximately:

Bubble fraction ≈ (P - 1) / M

where P is the number of pipeline stages and M is the number of micro-batches. With M ≫ P, the bubble becomes negligible.
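Plugging numbers into this estimate makes the effect of M concrete (a sketch of the approximation only; real bubbles also depend on stage balance and communication overlap). Note that M = 1 recovers the naive single-batch case: bubble/compute = P − 1, i.e. each device idles (P − 1) times as long as it works, matching the 1/P utilisation above.

```python
def bubble_fraction(num_stages: int, num_microbatches: int) -> float:
    # (P - 1) / M: pipeline bubble time relative to useful compute per device
    return (num_stages - 1) / num_microbatches

for m in (1, 4, 16, 64):
    print(f"P=4, M={m:2d} -> bubble ~= {bubble_fraction(4, m):.3f}")
```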

Trade-off: GPipe must store activations for all M micro-batches during the forward phase, increasing peak memory. Activation checkpointing (recomputing activations during backward) mitigates this at the cost of additional compute.
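The splitting itself is trivial. A minimal sketch, with plain lists standing in for tensors (a real framework would slice along the batch dimension, e.g. with something like `torch.chunk`):

```python
def split_microbatches(batch, num_micro):
    """Split a mini-batch (a list of samples) into num_micro micro-batches."""
    assert len(batch) % num_micro == 0, "batch size must divide evenly"
    size = len(batch) // num_micro
    return [batch[i * size:(i + 1) * size] for i in range(num_micro)]

micro = split_microbatches(list(range(16)), 4)
print(len(micro), len(micro[0]))  # 4 micro-batches of 4 samples each
```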


1F1B: Interleaved Scheduling

1F1B (One-Forward-One-Backward) scheduling, introduced in PipeDream, interleaves forward and backward passes more aggressively:

  • After a warm-up phase, each device alternates between one forward micro-batch and one backward micro-batch.
  • This limits the peak number of in-flight micro-batches to the number of pipeline stages, dramatically reducing activation memory.

Time → (4 stages, 6 micro-batches; first two GPUs shown)
         Warm-up       Steady state             Cool-down
GPU 0: [F₁][F₂][F₃] [F₄][B₁] [F₅][B₂] [F₆][B₃] [B₄][B₅][B₆]
GPU 1:     [F₁][F₂] [F₃][B₁] [F₄][B₂] [F₅][B₃] [F₆][B₄] [B₅][B₆]

1F1B achieves similar bubble reduction to GPipe but with constant activation memory (proportional to P, not M).
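The warm-up / steady-state / cool-down structure can be sketched as a per-device schedule generator (illustrative only; a real scheduler also interleaves the point-to-point communication):

```python
def one_f_one_b(device, num_stages, num_micro):
    """Per-device op sequence ('F', i) / ('B', i) under a 1F1B schedule."""
    warmup = min(num_stages - 1 - device, num_micro)
    ops = [("F", i) for i in range(warmup)]
    f, b = warmup, 0
    while f < num_micro:              # steady state: alternate 1F, 1B
        ops.append(("F", f)); f += 1
        ops.append(("B", b)); b += 1
    while b < num_micro:              # cool-down: drain remaining backwards
        ops.append(("B", b)); b += 1
    return ops

def peak_in_flight(ops):
    """Max number of micro-batches whose activations are held at once."""
    cur = peak = 0
    for kind, _ in ops:
        cur += 1 if kind == "F" else -1
        peak = max(peak, cur)
    return peak

sched = one_f_one_b(device=0, num_stages=4, num_micro=6)
print(peak_in_flight(sched))  # 4 on the first stage, vs 6 (= M) under GPipe
```

The peak in-flight count is bounded by the number of stages regardless of M, which is exactly the memory advantage over GPipe's all-forward-then-all-backward schedule.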


Interleaved Pipeline Stages

Megatron-LM introduces virtual pipeline stages, where each device hosts multiple non-consecutive stages. For example, with 8 layers and 2 GPUs:

Device | Classic Pipeline | Interleaved Pipeline
GPU 0  | Layers 1–4       | Layers 1–2, 5–6
GPU 1  | Layers 5–8       | Layers 3–4, 7–8

This reduces the bubble by a factor equal to the number of virtual stages per device, at the cost of additional point-to-point communication.
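This assignment can be generated mechanically. A sketch (layers are 0-indexed here, while the table uses 1-indexing; the function name is illustrative):

```python
def interleaved_assignment(num_layers, num_devices, virtual_per_device):
    """Assign layers round-robin in chunks: each device receives
    virtual_per_device non-consecutive chunks of consecutive layers."""
    chunk = num_layers // (num_devices * virtual_per_device)
    mapping = {d: [] for d in range(num_devices)}
    layer = 0
    for _ in range(virtual_per_device):
        for d in range(num_devices):
            mapping[d].extend(range(layer, layer + chunk))
            layer += chunk
    return mapping

print(interleaved_assignment(8, 2, 2))
# {0: [0, 1, 4, 5], 1: [2, 3, 6, 7]}  (i.e. layers 1-2, 5-6 and 3-4, 7-8)
```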


When to Use Pipeline Parallelism

Pipeline parallelism is most effective when:

  • The model is too deep to fit on a single device (many sequential layers)
  • Cross-node bandwidth is limited (pipeline communication is point-to-point, not collective)
  • Combined with tensor parallelism within nodes and data parallelism across node groups

In production systems training frontier LLMs, pipeline parallelism typically operates across nodes, while tensor parallelism operates within nodes.


Next: Hybrid Parallelism →
