MMS-300M Forced Aligner
MahmoudAshraf/mms-300m-1130-forced-aligner
published May 2024 · updated Apr 2026
MMS-300M Forced Aligner is a forced alignment model that aligns text to audio using CTC emissions from a pretrained MMS-300M model.
specs
| Task | Forced Alignment |
| Architecture | MMS-300M (wav2vec2-based) |
| Parameters | 300M |
| License | CC-BY-NC-4.0 |
about this model
MahmoudAshraf/mms-300m-1130-forced-aligner is a forced alignment model for synchronizing text transcripts with audio, built on the MMS-300M checkpoint and converted from torchaudio to Hugging Face Transformers. It performs CTC-based alignment, enabling precise word-level timestamp generation for automatic speech recognition (ASR) pipelines.
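As a rough illustration of what CTC-based alignment involves, the sketch below runs a Viterbi search over a toy emission matrix in plain Python. It is not the model's or any library's actual implementation: the function name `ctc_forced_align`, the three-symbol vocabulary, and the toy probabilities are all hypothetical, chosen only to show how each audio frame is assigned either to a transcript token or to the CTC blank.

```python
import math

def ctc_forced_align(log_probs, tokens, blank=0):
    """Viterbi alignment of `tokens` to frames of `log_probs`.

    log_probs: list of per-frame lists of log-probabilities over the vocabulary.
    tokens: list of vocabulary indices for the transcript.
    Returns one entry per frame: the position in `tokens`, or None for blank.
    """
    num_frames = len(log_probs)
    # Interleave blanks around the transcript: [blank, t1, blank, t2, ..., blank]
    ext = [blank]
    for tok in tokens:
        ext += [tok, blank]
    num_states = len(ext)

    neg_inf = -math.inf
    score = [[neg_inf] * num_states for _ in range(num_frames)]
    back = [[0] * num_states for _ in range(num_frames)]

    # A path may start in the leading blank or in the first token.
    score[0][0] = log_probs[0][blank]
    if num_states > 1:
        score[0][1] = log_probs[0][ext[1]]

    for t in range(1, num_frames):
        for s in range(num_states):
            # Allowed moves: stay, advance one state, or skip the blank
            # between two different tokens.
            cands = [(score[t - 1][s], s)]
            if s >= 1:
                cands.append((score[t - 1][s - 1], s - 1))
            if s >= 2 and ext[s] != blank and ext[s] != ext[s - 2]:
                cands.append((score[t - 1][s - 2], s - 2))
            best, prev = max(cands)
            score[t][s] = best + log_probs[t][ext[s]]
            back[t][s] = prev

    # A valid path ends in the last token or the trailing blank.
    s = num_states - 1
    if num_states > 1 and score[-1][num_states - 2] > score[-1][s]:
        s = num_states - 2
    if score[-1][s] == neg_inf:
        raise ValueError("transcript cannot be aligned to this many frames")

    # Walk the back-pointers from the last frame to the first.
    path = [0] * num_frames
    for t in range(num_frames - 1, -1, -1):
        path[t] = s
        s = back[t][s]
    # Odd states are tokens (state 2k+1 is tokens[k]); even states are blanks.
    return [p // 2 if p % 2 == 1 else None for p in path]

# Toy vocabulary (hypothetical): 0 = blank, 1 = "a", 2 = "b"
probs = [
    [0.8, 0.1, 0.1],
    [0.1, 0.8, 0.1],
    [0.1, 0.8, 0.1],
    [0.8, 0.1, 0.1],
    [0.1, 0.1, 0.8],
]
toy_log_probs = [[math.log(p) for p in frame] for frame in probs]
alignment = ctc_forced_align(toy_log_probs, tokens=[1, 2])
print(alignment)  # [None, 0, 0, None, 1]
```

In this toy case frames 1 and 2 are assigned to the first token and frame 4 to the second, which is the per-frame information that timestamps are later derived from.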
Key Strengths
- Optimized for low memory usage: according to the model card, it uses much less memory than the TorchAudio forced alignment API, which makes it suitable for large-scale or resource-constrained deployments.
- Supports forced alignment across multiple languages via ISO-639-3 language codes and romanization, leveraging the pretrained MMS-300M multilingual model.
- Provides per-word timestamps, confidence scores, and segment-level alignment output.
Usage Context
This model is designed for developers who need accurate alignment of spoken audio with existing transcripts, such as in subtitle generation, speech corpus creation, or phoneme-level analysis. It processes audio waveforms with batched emission generation and produces structured span-level results.
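To make the idea of span-level results more concrete, here is a minimal sketch of how per-frame token assignments could be grouped into word timestamps with a confidence score. Everything in it is an assumption for illustration: the `WordSpan` type and `words_from_frames` helper are hypothetical, the 20 ms frame duration is the usual stride of wav2vec2-style models on 16 kHz audio rather than a value taken from this page, and the score is simply the mean frame probability, which may differ from how the model's tooling computes confidence.

```python
from dataclasses import dataclass

# Assumption: one emission frame per 20 ms of audio (wav2vec2-style stride).
FRAME_MS = 20

@dataclass
class WordSpan:
    text: str
    start: float  # seconds
    end: float    # seconds
    score: float  # mean probability of the frames assigned to the word

def words_from_frames(frame_to_token, frame_probs, token_to_word, words):
    """Group per-frame token assignments into one span per word.

    frame_to_token: per-frame token position, or None for blank frames.
    frame_probs: per-frame probability of the assigned symbol.
    token_to_word: for each token position, the index of its word.
    words: the transcript words, in order.
    """
    frames_by_word = {}
    for frame, tok in enumerate(frame_to_token):
        if tok is None:
            continue  # blank frames belong to no word
        frames_by_word.setdefault(token_to_word[tok], []).append(frame)

    spans = []
    for word_idx, text in enumerate(words):
        frames = frames_by_word.get(word_idx)
        if not frames:
            continue  # word received no frames in this alignment
        mean_prob = sum(frame_probs[f] for f in frames) / len(frames)
        spans.append(
            WordSpan(
                text=text,
                start=frames[0] * FRAME_MS / 1000,
                end=(frames[-1] + 1) * FRAME_MS / 1000,
                score=mean_prob,
            )
        )
    return spans

# Hypothetical alignment for the transcript "hi yo", tokenized per character.
words = ["hi", "yo"]
token_to_word = [0, 0, 1, 1]
frame_to_token = [None, 0, 1, 1, None, 2, 3, None]
frame_probs = [0.9, 0.8, 0.6, 0.7, 0.9, 0.5, 0.9, 0.9]

spans = words_from_frames(frame_to_token, frame_probs, token_to_word, words)
for span in spans:
    print(f"{span.text}: {span.start:.2f}s to {span.end:.2f}s, score {span.score:.2f}")
```

The end time uses the frame after the last assigned one, so a word covering a single frame still has a nonzero duration.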
Additional Details
- License: CC-BY-NC-4.0 (non-commercial use).
- Model version created on 2024-05-02, with over 69 million total downloads on Hugging Face and 92 community likes.
- The checkpoint is a direct conversion from the torchaudio MMS-300M forced alignment checkpoint, ensuring compatibility with the Hugging Face ecosystem.
best for
- Generating word-level timestamps for speech recordings
- Aligning transcriptions to audio for ASR training data preparation
- Forced alignment in multilingual speech applications
FAQ
The model is best suited to forced alignment: aligning a given text to an audio file to produce word-level timestamps.
According to the model card, it uses much less memory than the TorchAudio forced alignment API.
It supports 1130 languages through the multilingual MMS-300M checkpoint.
Input: an audio waveform and its text, with a language code. Output: word timestamps and confidence scores.
A hosted, OpenAI-compatible gigarouter API for MMS-300M Forced Aligner is still being benchmarked and onboarded. Once it is available, requests authenticated with an API key will send audio and text inputs and receive aligned timestamps.