Hybrid & 3D Parallelism


Distributed Training Methods - This article is part of a series.
Part 5: This Article

Why Hybrid?

No single parallelism strategy is sufficient for frontier-scale training. Each has strengths suited to a specific axis of the cluster:

| Strategy | Best For | Communication | Typical Scope |
|---|---|---|---|
| Tensor Parallelism | Large individual layers | AllReduce (high bandwidth) | Intra-node (NVLink) |
| Pipeline Parallelism | Many sequential stages | Point-to-point | Inter-node (InfiniBand) |
| Data Parallelism | Scaling throughput | AllReduce (gradient sync) | Across node groups |

3D Parallelism (or hybrid parallelism) composes all three, mapping each strategy to the interconnect tier where it performs best.

Why this matters for AGIACC: frontier training is infrastructure choreography. The companies that understand these boundaries are better positioned to secure how high-value models are actually built.


Megatron-DeepSpeed

The Megatron-DeepSpeed system, jointly developed by NVIDIA and Microsoft, is the canonical implementation of 3D parallelism. It was used to train models like Megatron-Turing NLG (530B) and has influenced most subsequent large-scale training systems.

Architecture

Consider a cluster of 64 DGX H100 nodes (512 GPUs total) training a 175B-parameter model:

                              ┌─────────────────────────────┐
         Data Parallel Group  │  8 replicas across cluster   │
                              └──────────┬──────────────────┘
                                         │
                              ┌──────────▼──────────────────┐
       Pipeline Parallel Group│  8 stages across 8 nodes     │
                              └──────────┬──────────────────┘
                                         │
                              ┌──────────▼──────────────────┐
        Tensor Parallel Group │  8 GPUs within a single node │
                              └─────────────────────────────┘
  • Tensor Parallel (TP) = 8: Each transformer layer is split across 8 GPUs within a node, using NVLink at 900 GB/s.
  • Pipeline Parallel (PP) = 8: The model’s layers are split into 8 stages, one per node, communicating via InfiniBand at 400 Gb/s.
  • Data Parallel (DP) = 8: 8 identical pipeline replicas train on different data, synchronising gradients via AllReduce across the cluster.

Total GPUs = TP × PP × DP = 8 × 8 × 8 = 512 GPUs.
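The grouping above can be sketched as a mapping from a global GPU rank to its (DP, PP, TP) coordinates. The ordering below (TP fastest-varying, so tensor-parallel peers land on the same node) is one common convention, not the only one; real frameworks make the order configurable.

```python
# Map a global rank in the 8 x 8 x 8 example to (dp, pp, tp) coordinates.
# TP varies fastest so that ranks 0-7 share one node's NVLink domain.
TP, PP, DP = 8, 8, 8

def rank_to_coords(rank: int) -> tuple[int, int, int]:
    tp = rank % TP              # GPU index within the node
    pp = (rank // TP) % PP      # pipeline stage (node) within a replica
    dp = rank // (TP * PP)      # which pipeline replica
    return dp, pp, tp

assert TP * PP * DP == 512
# Ranks 0 and 7 are tensor-parallel peers (same replica, same stage):
assert rank_to_coords(0)[:2] == rank_to_coords(7)[:2]
# Rank 64 is the first GPU of the second pipeline replica:
assert rank_to_coords(64) == (1, 0, 0)
```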


Configuration Space

Choosing the right TP/PP/DP configuration is a systems optimisation problem. Key trade-offs:

| Increasing… | Reduces | Increases | Constraint |
|---|---|---|---|
| TP degree | Per-GPU memory for one layer | AllReduce traffic (intra-node) | Limited by intra-node bandwidth |
| PP degree | Per-GPU memory (fewer stages) | Pipeline bubbles, activation latency | Bubble fraction ≈ (PP-1) / micro-batches |
| DP degree | N/A (model unchanged) | Gradient AllReduce traffic | Requires enough global batch size |

Modern autotuners (e.g., Alpa, Galvatron) use cost models and dynamic programming to search this space automatically.


ZeRO + 3D Parallelism

Combining ZeRO with 3D parallelism further reduces memory without changing the parallelism geometry:

  • ZeRO Stage 1 shards optimizer states across the data-parallel group — compatible with any TP/PP configuration.
  • ZeRO Stage 2 additionally shards gradients.
  • ZeRO Stage 3 shards parameters as well, but can conflict with tensor parallelism unless carefully coordinated.

In practice, most 3D-parallel systems use ZeRO Stage 1 or 2, relying on tensor parallelism for per-layer memory reduction.
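The memory effect is easy to estimate. The sketch below uses the 2 + 2 + 12 bytes-per-parameter accounting (fp16 weights, fp16 gradients, fp32 master weights plus Adam moments) from the ZeRO paper; the 175B example and DP = 8 match the cluster above, and `params` is the per-GPU share after TP and PP have already split the model.

```python
# Per-GPU memory (GB) for model states under ZeRO, mixed-precision Adam.
def model_state_gb(params: float, dp: int, zero_stage: int) -> float:
    p, g, o = 2.0, 2.0, 12.0        # bytes per parameter
    if zero_stage >= 1:
        o /= dp                      # Stage 1: shard optimizer states
    if zero_stage >= 2:
        g /= dp                      # Stage 2: also shard gradients
    if zero_stage >= 3:
        p /= dp                      # Stage 3: shard parameters too
    return params * (p + g + o) / 1e9

# 175B params split over TP=8 x PP=8 leaves ~2.73B params per GPU:
per_gpu_params = 175e9 / (8 * 8)
print(model_state_gb(per_gpu_params, dp=8, zero_stage=0))  # 43.75
print(model_state_gb(per_gpu_params, dp=8, zero_stage=1))  # ~15.04
```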


Sequence Parallelism + Context Parallelism

For long-context models (128K+ tokens), additional dimensions of parallelism are needed:

  • Sequence Parallelism splits LayerNorm and Dropout computations along the sequence dimension within the tensor-parallel group.
  • Context Parallelism partitions the input sequence across devices, with each device computing attention only for its assigned segment using ring attention or similar techniques.

These are sometimes called 4D or 5D parallelism when combined with the classic TP/PP/DP trio.
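The partitioning step of context parallelism can be sketched as follows: each rank owns one contiguous chunk of the sequence and exchanges K/V blocks with its peers (e.g. via ring attention), which this sketch does not show. The sequence length and CP degree are illustrative.

```python
# Split a token sequence evenly across context-parallel ranks.
# Assumes seq_len is divisible by cp_size (padding handles the rest).
def partition_sequence(seq_len: int, cp_size: int) -> list[range]:
    chunk = seq_len // cp_size
    return [range(r * chunk, (r + 1) * chunk) for r in range(cp_size)]

# A 128K-token context over 8 devices gives 16K tokens per rank:
parts = partition_sequence(seq_len=131_072, cp_size=8)
assert len(parts) == 8 and len(parts[0]) == 16_384
```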


Real-World Training Configurations

| Model | Params | GPUs | TP | PP | DP | Framework |
|---|---|---|---|---|---|---|
| GPT-3 | 175B | 1024 V100 | 8 | 8 | 16 | Megatron-LM |
| Megatron-Turing NLG | 530B | 2240 A100 | 8 | 35 | 8 | Megatron-DeepSpeed |
| LLaMA 3 (405B) | 405B | 16K H100 | 8 | 16 | 128 | Meta internal |
| Mixtral 8×22B (MoE) | 141B active | 2048 H100 | 8 | 4 | 64 | Custom (EP+DP) |
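A quick sanity check on the table: for every configuration, TP × PP × DP must equal the GPU count (reading 16K as 16,384).

```python
# Verify TP x PP x DP == total GPUs for each configuration in the table.
configs = {
    "GPT-3":               (1024,  8,  8,  16),
    "Megatron-Turing NLG": (2240,  8, 35,   8),
    "LLaMA 3 (405B)":      (16384, 8, 16, 128),   # 16K = 16,384
    "Mixtral 8x22B (MoE)": (2048,  8,  4,  64),
}
for name, (gpus, tp, pp, dp) in configs.items():
    assert tp * pp * dp == gpus, name
```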

Next: Frameworks & Tools →
