Accelerating Diffusion Sampling with Picard Iterations

How fixed-point refinement can cut serial denoising steps and unlock better parallel GPU utilization.

Why diffusion sampling is slow

Diffusion models generate data by repeatedly denoising a latent variable over many timesteps. Even with improved samplers, this process is often sequential: each state \(x_t\) depends on the previous state \(x_{t+1}\). That serial dependency limits throughput and makes scaling across multiple GPUs harder than in fully parallel training workloads.

Reframing each step as a fixed-point problem

A broad class of samplers can be written as an implicit update where the next sample must satisfy an equation of the form \(x = F(x)\). Instead of taking one explicit update and moving on, we can solve this fixed-point equation approximately using Picard iterations:

Initialize with a predictor state \(x^{(0)}\).
Iterate \(x^{(k+1)} = F(x^{(k)})\).
Stop after a small number of refinement rounds.

Each refinement improves local consistency of the denoising step, which can permit fewer global timesteps for comparable sample quality.

Where parallel GPUs come in

The key idea is to execute multiple Picard refinement evaluations concurrently. Depending on your architecture and communication budget, practical parallelization patterns include:

Data-parallel refinements: replicate model weights and evaluate multiple candidate refinements for different batch shards on separate GPUs.
Pipeline overlap: while one refinement round is reduced/synchronized, prefetch and stage activations for the next round.
Time-chunk scheduling: assign blocks of timesteps to devices and use Picard corrections to reduce cross-device synchronization frequency.

In all three cases, the refinement operator becomes the unit of parallel work, rather than the full sequential trajectory.

Algorithm sketch

Choose a coarse timestep schedule (for example, 10-20 steps instead of 50-100).
At each timestep, produce predictor \(x^{(0)}\) from the current sampler update.
Run \(K\) Picard refinements with distributed inference kernels.
Aggregate/refine to obtain \(x_t\), then advance to the next timestep.
Tune \(K\), guidance scale, and step schedule jointly for quality-latency trade-offs.

Practical constraints

Contraction behavior: Picard refinement works best when the local operator is sufficiently contractive; aggressive guidance can break this.
Communication overhead: all-reduce and synchronization can erase gains if refinements are too fine-grained.
Numerical stability: mixed precision and large batch sizes need careful scaling to avoid divergence during refinement.
System balance: speedups depend on kernel efficiency, interconnect bandwidth, and how well compute and transfer are overlapped.

When this approach pays off

Picard-accelerated sampling is most useful when your bottleneck is sequential denoising latency and you already have multiple GPUs available at inference time. If the method lets you reduce global denoising steps while keeping quality stable, wall-clock generation can drop significantly.

Takeaway: Treating diffusion updates as fixed-point solves creates a new axis for acceleration. Picard iterations trade a little local refinement work for fewer global sequential steps, making multi-GPU parallelism much more effective during sampling.