Optimal Transport for ASR: A New Era of Precise Sequence Alignment

By Yacouba Kaloga, Shashi Kumar, Petr Motlicek, Ina Kodrasi


Published on November 24, 2025 | Vol. 1, Issue No. 1

Summary

This paper introduces a novel differentiable alignment framework for sequence-to-sequence (seq2seq) models, specifically targeting Automatic Speech Recognition (ASR). Addressing the alignment inaccuracies and "peaky behavior" of current end-to-end ASR systems like Connectionist Temporal Classification (CTC), the researchers propose a solution based on one-dimensional optimal transport. They define a new pseudo-metric, Sequence Optimal Transport Distance (SOTD), and an associated loss function, Optimal Temporal Transport Classification (OTTC) loss. While experimental results on datasets like TIMIT, AMI, and LibriSpeech demonstrate a considerable improvement in alignment performance compared to CTC and Consistency-Regularized CTC, a trade-off in overall ASR performance is noted. This work is positioned to open new avenues for seq2seq alignment research, with the code publicly available.
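The core idea is that aligning input frames to output tokens can be cast as one-dimensional optimal transport, where the monotone (order-preserving) coupling is known to be optimal for any convex cost. As a minimal sketch of that building block only, not the paper's actual SOTD or OTTC implementation, the following computes the monotone transport plan between a frame-weight vector and a token-weight vector; the function name `monotone_coupling` and the toy weights are illustrative assumptions.

```python
import numpy as np

def monotone_coupling(a, b):
    """Monotone (north-west corner) transport plan between two 1-D
    probability vectors: a over T frames, b over U tokens.

    In 1-D optimal transport this monotone plan is optimal for any
    convex cost, which makes order-preserving frame-to-token
    alignment a natural fit. Assumes sum(a) == sum(b)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    P = np.zeros((len(a), len(b)))
    i = j = 0
    ra, rb = a[0], b[0]            # remaining mass in current frame / token bins
    while True:
        m = min(ra, rb)            # move as much mass as both bins allow
        P[i, j] = m
        ra -= m
        rb -= m
        if ra == 0:                # frame bin exhausted: advance to next frame
            i += 1
            if i == len(a):
                break
            ra = a[i]
        if rb == 0:                # token bin exhausted: advance to next token
            j += 1
            if j == len(b):
                break
            rb = b[j]
    return P

# Two frames aligned to two tokens with unequal weights: the plan's rows
# sum to the frame weights and its columns to the token weights, and its
# support never moves "backwards" in time.
P = monotone_coupling([0.5, 0.5], [0.25, 0.75])
```

Because the plan is a dense matrix of soft assignments rather than a single hard path, it lends itself to the differentiable alignment losses the paper develops.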

Why It Matters

This research addresses a fundamental challenge in sequence-to-sequence modeling: achieving accurate, smooth temporal alignment. While the reported trade-off in ASR performance might initially seem like a drawback, the significantly improved alignment accuracy has real implications for AI professionals. For applications where precise temporal correspondence between input and output sequences is critical (medical speech analysis for disease detection, nuanced language-learning feedback systems, or detailed phoneme-level analysis), this work offers a superior mechanism. Imagine an AI system that must detect minute vocal tremors or specific pronunciation errors: there, accurate alignment is often more valuable than a marginal improvement in overall word error rate.

The OTTC loss and SOTD pseudo-metric provide a powerful, theoretically grounded tool that can be integrated into future architectures. This signifies a move beyond the inherent limitations of established methods like CTC; rather than an incremental tweak, it is a re-thinking of the alignment process itself, exploring how optimal transport theory can map complex sequences in a more robust and interpretable way.

The open-sourced code further democratizes this advancement, inviting collaboration and refinement from the broader research community and potentially paving the way for hybrid models that combine the best of both worlds: superior alignment with competitive overall performance. This underscores a broader shift toward principled, mathematically informed approaches in deep learning, aiming not just for performance but for greater control over and understanding of model behavior.
