Optimal transport (OT)—the mathematical theory of moving one probability distribution to another at minimum cost—has deep roots in mathematics and is finding increasingly diverse applications in machine learning. The Wasserstein distance, derived from optimal transport, provides a geometrically meaningful way to compare probability distributions—one that captures the structure of the underlying space rather than just pointwise differences.
For machine learning, this matters because many problems reduce to comparing or aligning distributions: domain adaptation (aligning source and target data distributions), generative modeling (matching generated and real data distributions), and graph learning (comparing structural distributions across networks).
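To make the geometric intuition concrete, here is a minimal illustration (not drawn from any of the papers below) using SciPy's 1-D Wasserstein distance: the distance keeps growing as one distribution is shifted further away, even after the supports stop overlapping, which pointwise divergences fail to capture.

```python
import numpy as np
from scipy.stats import wasserstein_distance

rng = np.random.default_rng(0)
base = rng.normal(loc=0.0, scale=1.0, size=5000)

for shift in [0.5, 2.0, 8.0]:
    # For a pure translation, the 1-D Wasserstein distance equals the shift,
    # so the reported value tracks the geometry of the move itself.
    print(shift, wasserstein_distance(base, base + shift))
```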
The Research Landscape
Domain Adaptation
Koç and Chiang (2025), with 2 citations, apply optimal transport to the domain adaptation problem: training a model on source data and deploying it on differently distributed target data. Standard approaches assume the distribution shift is simple (covariate shift), but real-world shifts often involve more complex structural changes.
The authors introduce "entanglement"—a concept that captures how features in the source and target domains are coupled in complex, non-linear ways. Optimal transport provides a natural framework for modeling this coupling: the transport plan between source and target distributions reveals which source examples correspond to which target examples, even when the correspondence is non-obvious.
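A rough sketch of the generic OT alignment step such methods build on (this is not Koç and Chiang's entanglement estimator itself) computes an entropy-regularized transport plan between source and target features with the POT library; row i of the plan gives a soft correspondence from source example i to the target examples.

```python
# Illustrative OT alignment between source and target features.
# Requires the POT library (pip install pot); not the paper's method.
import numpy as np
import ot  # Python Optimal Transport

rng = np.random.default_rng(0)
Xs = rng.normal(size=(50, 3))         # source features
Xt = rng.normal(size=(60, 3)) + 1.0   # shifted target features

a = np.full(50, 1 / 50)               # uniform source weights
b = np.full(60, 1 / 60)               # uniform target weights
M = ot.dist(Xs, Xt)                   # pairwise squared Euclidean costs

# Entropy-regularized plan: row i is a soft correspondence from source
# example i to the target examples, even when no pairing is obvious.
plan = ot.sinkhorn(a, b, M, reg=0.1)
Xs_mapped = plan @ Xt / plan.sum(axis=1, keepdims=True)  # barycentric map
```

The final line shows one common use of the plan: a barycentric projection that transports source samples into the target domain before retraining or evaluation.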
Scaling Multimarginal OT
Tsur and Greenewald (2025), with 2 citations, address a computational bottleneck: extending OT from two distributions (source and target) to many distributions simultaneously (multimarginal OT). This is needed for problems like multi-domain alignment, multi-source transfer learning, and barycenter computation.
The computational cost of exact multimarginal OT grows exponentially with the number of marginals. Their neural estimation approach approximates the solution using neural networks, achieving polynomial scaling and making problems with 10+ distributions tractable where exact methods would be infeasible.
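The blowup is easy to quantify: with k marginals on n support points each, the exact cost tensor has n^k entries. The back-of-the-envelope sketch below is purely illustrative; a neural estimator never materializes this tensor.

```python
# The exact multimarginal cost tensor over k marginals with n support
# points each has n**k entries; a neural estimator never materializes it.
n = 100
for k in [2, 3, 5, 10]:
    entries = n ** k
    print(f"k={k:>2}: {entries:.2e} entries, ~{entries * 8 / 1e9:.2e} GB as float64")
```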
Wasserstein Hypergraph Networks
Duta and Liò (2025) introduce Wasserstein distances into hypergraph neural networks. Standard graph neural networks pass messages along edges connecting pairs of nodes. Hypergraph networks generalize this to hyperedges connecting sets of nodes—capturing higher-order relationships. The Wasserstein distance provides a principled way to aggregate information across hyperedges, treating each hyperedge as a distribution over its member nodes.
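A toy version of the core idea, assuming uniform weights over a hyperedge's member nodes (an illustrative sketch, not Duta and Liò's actual layer): treat each hyperedge as an empirical distribution over its members' features, then compare hyperedges by their exact OT cost using the POT library.

```python
# Toy hyperedge comparison: each hyperedge is an empirical distribution
# over its member nodes' features; compare with the exact OT cost.
# Requires the POT library; node indices and features are made up.
import numpy as np
import ot

node_feats = np.random.default_rng(0).normal(size=(8, 4))  # 8 nodes, 4-d features
edge_a = [0, 1, 2]          # member nodes of hyperedge A
edge_b = [2, 5, 6, 7]       # member nodes of hyperedge B

Xa, Xb = node_feats[edge_a], node_feats[edge_b]
wa = np.full(len(edge_a), 1 / len(edge_a))  # uniform weights over members
wb = np.full(len(edge_b), 1 / len(edge_b))

M = ot.dist(Xa, Xb)          # pairwise squared Euclidean costs
cost = ot.emd2(wa, wb, M)    # exact OT cost between the two hyperedges
```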
Geometric Training Optimization
Ferrara (2026) proposes integrating optimal transport with Riemannian gradient methods for neural network training itself. Standard gradient descent operates in Euclidean parameter space, but the parameter space of neural networks has a non-Euclidean geometry: the loss landscape is curved, and different parameters have very different sensitivities. Riemannian methods respect this geometry, and OT provides a way to measure distances between the probability distributions that neural networks represent.
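One concrete instance of a distance between represented distributions is the closed-form 2-Wasserstein distance between Gaussians, which for diagonal covariances reduces to a simple expression. The snippet below shows this textbook result for illustration; it is not code from Ferrara's paper.

```python
# Standard closed-form 2-Wasserstein distance between diagonal Gaussians:
#   W2^2 = ||m1 - m2||^2 + sum_i (sqrt(v1_i) - sqrt(v2_i))^2
# Shown for illustration; this is a textbook result, not Ferrara's code.
import numpy as np

def w2_diag_gaussians(m1, v1, m2, v2):
    """W2 between N(m1, diag(v1)) and N(m2, diag(v2))."""
    mean_term = np.sum((np.asarray(m1) - np.asarray(m2)) ** 2)
    cov_term = np.sum((np.sqrt(v1) - np.sqrt(v2)) ** 2)
    return np.sqrt(mean_term + cov_term)

print(w2_diag_gaussians([0, 0], [1, 1], [1, 0], [4, 1]))  # sqrt(2) ≈ 1.414
```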
Critical Analysis: Claims and Evidence
| Claim | Evidence | Verdict |
|---|---|---|
| OT provides better domain adaptation than standard methods for complex shifts | Koç and Chiang's entanglement analysis | ✅ Supported — on benchmark datasets |
| Neural estimation makes multimarginal OT tractable for 10+ distributions | Tsur and Greenewald's scaling experiments | ✅ Supported |
| Wasserstein distances improve hypergraph message passing | Duta & Liò's experiments | ⚠️ Uncertain — early results; comparison baselines limited |
| Riemannian OT methods improve neural network training | Ferrara's theoretical analysis | ⚠️ Uncertain — theoretical framework; empirical validation pending |
What This Means for Your Research
For ML researchers, optimal transport is becoming an essential tool—not just for generative models (where it is well-established) but for domain adaptation, graph learning, and training optimization. For mathematicians, the ML applications are driving new theoretical questions about computational OT at scale.
Explore related work through ORAA ResearchBrain.