Hybrid & 3D Parallelism


Distributed Training Methods - This article is part of a series.
Part 5: This Article

Why Hybrid?

No single parallelism strategy is sufficient for frontier-scale training. Each has strengths suited to a specific axis of the cluster:

| Strategy | Best For | Communication | Typical Scope |
|---|---|---|---|
| Tensor Parallelism | Large individual layers | AllReduce (high bandwidth) | Intra-node (NVLink) |
| Pipeline Parallelism | Many sequential stages | Point-to-point | Inter-node (InfiniBand) |
| Data Parallelism | Scaling throughput | AllReduce (gradient sync) | Across node groups |

3D Parallelism (or hybrid parallelism) composes all three, mapping each strategy to the interconnect tier where it performs best.

Why this matters for AGIACC: frontier training is infrastructure choreography. The companies that understand these boundaries are better positioned to secure how high-value models are actually built.


Megatron-DeepSpeed

The Megatron-DeepSpeed system, jointly developed by NVIDIA and Microsoft, is the canonical implementation of 3D parallelism. It was used to train models like Megatron-Turing NLG (530B) and has influenced most subsequent large-scale training systems.

Architecture

Consider a cluster of 64 DGX H100 nodes (512 GPUs total) training a 175B-parameter model:

                              ┌─────────────────────────────┐
         Data Parallel Group  │  8 replicas across cluster   │
                              └──────────┬──────────────────┘
                                         │
                              ┌──────────▼──────────────────┐
       Pipeline Parallel Group│  8 stages across 8 nodes     │
                              └──────────┬──────────────────┘
                                         │
                              ┌──────────▼──────────────────┐
        Tensor Parallel Group │  8 GPUs within a single node │
                              └─────────────────────────────┘
  • Tensor Parallel (TP) = 8: Each transformer layer is split across 8 GPUs within a node, using NVLink at 900 GB/s.
  • Pipeline Parallel (PP) = 8: The model’s layers are split into 8 stages, one per node, communicating via InfiniBand at 400 Gb/s.
  • Data Parallel (DP) = 8: 8 identical pipeline replicas train on different data, synchronising gradients via AllReduce across the cluster.

Total GPUs = TP × PP × DP = 8 × 8 × 8 = 512 GPUs.
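The grouping above can be sketched as a mapping from a global GPU rank to its (DP, PP, TP) coordinates. The ordering below (TP fastest-varying, so tensor-parallel peers land on the same node) is one common convention, not the only one; real frameworks make the order configurable.

```python
# Map a global rank in the 8 x 8 x 8 example to (dp, pp, tp) coordinates.
# TP varies fastest so that ranks 0-7 share one node's NVLink domain.
TP, PP, DP = 8, 8, 8

def rank_to_coords(rank: int) -> tuple[int, int, int]:
    tp = rank % TP              # GPU index within the node
    pp = (rank // TP) % PP      # pipeline stage (node) within a replica
    dp = rank // (TP * PP)      # which pipeline replica
    return dp, pp, tp

assert TP * PP * DP == 512
# Ranks 0 and 7 are tensor-parallel peers (same replica, same stage):
assert rank_to_coords(0)[:2] == rank_to_coords(7)[:2]
# Rank 64 is the first GPU of the second pipeline replica:
assert rank_to_coords(64) == (1, 0, 0)
```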


Configuration Space

Choosing the right TP/PP/DP configuration is a systems optimisation problem. Key trade-offs:

| Increasing… | Reduces | Increases | Constraint |
|---|---|---|---|
| TP degree | Per-GPU memory for one layer | AllReduce traffic (intra-node) | Limited by intra-node bandwidth |
| PP degree | Per-GPU memory (fewer stages) | Pipeline bubbles, activation latency | Bubble fraction ≈ (PP-1) / micro-batches |
| DP degree | N/A (model unchanged) | Gradient AllReduce traffic | Requires enough global batch size |

Modern autotuners (e.g., Alpa, Galvatron) use cost models and dynamic programming to search this space automatically.


ZeRO + 3D Parallelism

Combining ZeRO with 3D parallelism further reduces memory without changing the parallelism geometry:

  • ZeRO Stage 1 shards optimizer states across the data-parallel group — compatible with any TP/PP configuration.
  • ZeRO Stage 2 additionally shards gradients.
  • ZeRO Stage 3 shards parameters as well, but can conflict with tensor parallelism unless carefully coordinated.

In practice, most 3D-parallel systems use ZeRO Stage 1 or 2, relying on tensor parallelism for per-layer memory reduction.
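The memory effect is easy to estimate. The sketch below uses the 2 + 2 + 12 bytes-per-parameter accounting (fp16 weights, fp16 gradients, fp32 master weights plus Adam moments) from the ZeRO paper; the 175B example and DP = 8 match the cluster above, and `params` is the per-GPU share after TP and PP have already split the model.

```python
# Per-GPU memory (GB) for model states under ZeRO, mixed-precision Adam.
def model_state_gb(params: float, dp: int, zero_stage: int) -> float:
    p, g, o = 2.0, 2.0, 12.0        # bytes per parameter
    if zero_stage >= 1:
        o /= dp                      # Stage 1: shard optimizer states
    if zero_stage >= 2:
        g /= dp                      # Stage 2: also shard gradients
    if zero_stage >= 3:
        p /= dp                      # Stage 3: shard parameters too
    return params * (p + g + o) / 1e9

# 175B params split over TP=8 x PP=8 leaves ~2.73B params per GPU:
per_gpu_params = 175e9 / (8 * 8)
print(model_state_gb(per_gpu_params, dp=8, zero_stage=0))  # 43.75
print(model_state_gb(per_gpu_params, dp=8, zero_stage=1))  # ~15.04
```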


Sequence Parallelism + Context Parallelism

For long-context models (128K+ tokens), additional dimensions of parallelism are needed:

  • Sequence Parallelism splits LayerNorm and Dropout computations along the sequence dimension within the tensor-parallel group.
  • Context Parallelism partitions the input sequence across devices, with each device computing attention only for its assigned segment using ring attention or similar techniques.

These are sometimes called 4D or 5D parallelism when combined with the classic TP/PP/DP trio.
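The partitioning step of context parallelism can be sketched as follows: each rank owns one contiguous chunk of the sequence and exchanges K/V blocks with its peers (e.g. via ring attention), which this sketch does not show. The sequence length and CP degree are illustrative.

```python
# Split a token sequence evenly across context-parallel ranks.
# Assumes seq_len is divisible by cp_size (padding handles the rest).
def partition_sequence(seq_len: int, cp_size: int) -> list[range]:
    chunk = seq_len // cp_size
    return [range(r * chunk, (r + 1) * chunk) for r in range(cp_size)]

# A 128K-token context over 8 devices gives 16K tokens per rank:
parts = partition_sequence(seq_len=131_072, cp_size=8)
assert len(parts) == 8 and len(parts[0]) == 16_384
```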


Real-World Training Configurations

| Model | Params | GPUs | TP | PP | DP | Framework |
|---|---|---|---|---|---|---|
| GPT-3 | 175B | 1024 V100 | 8 | 8 | 16 | Megatron-LM |
| Megatron-Turing NLG | 530B | 2240 A100 | 8 | 35 | 8 | Megatron-DeepSpeed |
| LLaMA 3 (405B) | 405B | 16K H100 | 8 | 16 | 128 | Meta internal |
| Mixtral 8×22B (MoE) | 141B active | 2048 H100 | 8 | 4 | 64 | Custom (EP+DP) |
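A quick sanity check on the table: for every configuration, TP × PP × DP must equal the GPU count (reading 16K as 16,384).

```python
# Verify TP x PP x DP == total GPUs for each configuration in the table.
configs = {
    "GPT-3":               (1024,  8,  8,  16),
    "Megatron-Turing NLG": (2240,  8, 35,   8),
    "LLaMA 3 (405B)":      (16384, 8, 16, 128),   # 16K = 16,384
    "Mixtral 8x22B (MoE)": (2048,  8,  4,  64),
}
for name, (gpus, tp, pp, dp) in configs.items():
    assert tp * pp * dp == gpus, name
```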

Next: Frameworks & Tools →
