Security Challenges in Distributed Training
Distributed Training Methods - This article is part of a series.
Part 7: This Article

A Massively Expanded Attack Surface

Distributed training transforms a single-machine workload into a large-scale networked system, and every added dimension of parallelism introduces new attack vectors. The threat model below walks through the main ones.

Why this matters for AGIACC: the security surface of AI grows with scale. That is exactly why trusted infrastructure becomes more strategic as model development becomes more distributed and more capital-intensive.


Threat Model

1. Gradient and Parameter Poisoning

In data-parallel training, all workers contribute gradients to a shared AllReduce operation. A compromised worker can inject malicious gradients that subtly alter the final model behaviour — a technique known as Byzantine poisoning.

  • In federated or multi-tenant training, not all participants may be trusted.
  • A single malicious gradient contribution can bias the model across all replicas.
  • Software-only defenses (e.g., gradient clipping, median aggregation) add overhead and are bypassable by sophisticated adversaries.
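The median-aggregation defense mentioned above is easy to illustrate. A minimal, framework-free sketch of coordinate-wise median aggregation, with worker gradients as plain Python lists (all names here are hypothetical, not from any training framework):

```python
from statistics import median

def median_aggregate(worker_grads):
    """Coordinate-wise median of per-worker gradient vectors.

    Unlike a plain mean, the median bounds the influence of any single
    malicious worker -- assuming an honest majority of contributors.
    """
    return [median(coord) for coord in zip(*worker_grads)]

# Three honest workers plus one poisoned contribution.
grads = [
    [0.10, -0.20, 0.05],
    [0.12, -0.18, 0.04],
    [0.11, -0.22, 0.06],
    [9.00,  9.00, 9.00],  # poisoned gradient from a compromised worker
]

robust = median_aggregate(grads)                 # outlier is suppressed
naive = [sum(c) / len(c) for c in zip(*grads)]   # mean is dragged toward 9.0
```

Note the cost the bullet points out: this is an extra pass over every gradient, and a coordinated minority of attackers can still shift the median.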

2. Man-in-the-Middle on Collective Communication

AllReduce, AllGather, and point-to-point operations typically use NCCL over InfiniBand or RoCE. In many deployments:

  • Communication is unencrypted for performance reasons.
  • There is no mutual authentication between GPU workers.
  • An attacker with access to the training network fabric can intercept, modify, or replay gradient tensors.
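NCCL provides no built-in message authentication, but the missing property is easy to illustrate at the application layer. A hedged sketch, assuming a pre-shared key and serialized gradient buffers (the key, helpers, and tensor stand-in are all hypothetical):

```python
import hashlib
import hmac
import pickle

# Hypothetical pre-shared key; a real deployment would provision
# per-link keys via a key-exchange protocol, not a constant.
SHARED_KEY = b"example-preshared-key"
TAG_LEN = 32  # bytes of HMAC-SHA256 output

def sign_tensor(tensor_bytes: bytes) -> bytes:
    """Append an HMAC-SHA256 tag so the receiver can detect tampering."""
    tag = hmac.new(SHARED_KEY, tensor_bytes, hashlib.sha256).digest()
    return tensor_bytes + tag

def verify_tensor(message: bytes) -> bytes:
    """Check the tag in constant time; raise if the payload was modified."""
    payload, tag = message[:-TAG_LEN], message[-TAG_LEN:]
    expected = hmac.new(SHARED_KEY, payload, hashlib.sha256).digest()
    if not hmac.compare_digest(tag, expected):
        raise ValueError("gradient tensor failed authentication")
    return payload

grad = pickle.dumps([0.1, -0.2, 0.05])  # stand-in for a serialized gradient
wire = sign_tensor(grad)
# An in-flight bit flip by an attacker on the fabric:
tampered = wire[:4] + bytes([wire[4] ^ 0xFF]) + wire[5:]
```

This detects modification but not replay (that needs sequence numbers or nonces), and it adds per-message hashing cost on exactly the hot path that makes operators disable encryption in the first place.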

3. Model and Data Exfiltration

Distributed training systems move terabytes of model parameters, activations, and training data across the network. Without hardware-enforced access control:

  • A compromised node can read the full model weights during AllGather operations.
  • Training data flowing through pipeline stages can be intercepted.
  • Checkpoints stored on shared filesystems are targets for model theft.

4. Supply Chain Attacks on Frameworks

The training software stack — PyTorch, DeepSpeed, NCCL, CUDA drivers, container images — is deep and complex. Vulnerabilities at any layer can be exploited:

  • Malicious custom operators or extensions can execute arbitrary code on GPU nodes.
  • Container image poisoning can inject backdoors into the training environment.
  • Dependency confusion attacks on Python packages can compromise the training pipeline before it begins.
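Hash-pinning the dependency graph is the standard counter to dependency confusion: the resolver refuses any artifact whose digest differs from the one recorded at audit time. A sketch of pip's hash-checking mode (package versions are examples and the digests are placeholders, not real hashes):

```text
# requirements.txt -- install with:
#   pip install --require-hashes -r requirements.txt
# Digests below are placeholders, not real hashes.
torch==2.3.0 \
    --hash=sha256:<digest-of-the-audited-wheel>
deepspeed==0.14.0 \
    --hash=sha256:<digest-of-the-audited-sdist>
```

This only helps for the Python layer; container images, CUDA drivers, and custom operators need their own provenance checks.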

5. Side-Channel Attacks on Shared Infrastructure

In cloud and multi-tenant GPU clusters:

  • GPU memory residuals — previous tenants’ data may remain in GPU memory if not properly cleared.
  • Timing side channels — communication patterns can leak information about model architecture and hyperparameters.
  • Co-location attacks — attackers sharing the same physical node can exploit shared memory or cache hierarchies.
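The residual-memory point has a simple host-side analogue: scrub sensitive buffers before releasing them. A toy sketch, with a Python bytearray standing in for a GPU allocation (which would need an explicit device-side clear, e.g. a cudaMemset before freeing, since drivers are not obliged to zero memory between tenants):

```python
def scrub(buf: bytearray) -> None:
    """Zero a sensitive buffer in place before it is freed or reused.

    Illustrative only: Python gives no guarantee that other copies of
    the data do not exist elsewhere in memory.
    """
    buf[:] = bytes(len(buf))  # overwrite every byte with zero

weights_chunk = bytearray(b"residual model weights")
scrub(weights_chunk)  # next holder of this memory sees only zeros
```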

Why Software-Only Defenses Are Insufficient

| Defense | Limitation |
| --- | --- |
| Gradient encryption (SSL/TLS) | 10–30% throughput penalty; most NCCL deployments skip it |
| Byzantine-robust aggregation | Assumes an honest majority; doubles compute cost |
| TEE-based training (SGX) | Severe performance limitations; not available for GPUs |
| Differential privacy | Reduces model quality; noise injection has limits |
| Container isolation | Kernel-level only; bypassable by memory-unsafe exploits |

Where capability hardware fits (AGIACC view)

Training clusters are real distributed systems; we do not promise a turnkey AGIACC training appliance in this primer. What we do argue is consistent with CHERI ecosystem goals:

  • Capability-bounded memory (where CPUs support it) bounds what a compromised worker process can touch even when higher-level software is untrusted — the same property that motivates CHERI in other domains.
  • Compartmentalisation is a design target for splitting orchestration, dataloaders, and sensitive parameter paths — maturity depends on framework and silicon.
  • Unforgeable pointers address classic memory-corruption classes that still show up in native training stacks.
  • Attestation between nodes is an integration problem (TPMs, firmware roots, and policy) we treat as adjacent to capability work, not a single-vendor add-on.

Distributed training needs the same discipline as other high-stakes distributed systems: provable boundaries, not only network segmentation.


← Back to: Distributed Training Overview
