MMS-300M Forced Aligner

MahmoudAshraf/mms-300m-1130-forced-aligner

published May 2024 · updated Apr 2026

MMS-300M Forced Aligner is a forced alignment model that aligns text to audio using CTC emissions from a pretrained MMS-300M model.

est. price

~$0.0034

· estimated, set at launch

downloads / mo

3.2M

license

cc-by-nc-4.0

specs

Task	Forced Alignment
Architecture	MMS-300M (wav2vec2-based)
Parameters	300M
License	CC-BY-NC-4.0

about this model

MahmoudAshraf/mms-300m-1130-forced-aligner is a forced alignment model for synchronizing text transcripts with audio, built on the MMS-300M checkpoint and converted from torchaudio to Hugging Face Transformers. It performs CTC-based alignment, enabling precise word-level timestamp generation for automatic speech recognition (ASR) pipelines.
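The CTC-based alignment described above can be illustrated with a minimal Viterbi sketch in plain Python. This is not the model's or any library's actual implementation. The function names, the toy three-symbol vocabulary, and the blank index 0 are assumptions made for illustration. The sketch interleaves blanks with the transcript tokens and finds the most likely frame-by-frame path through them.

```python
import math

def ctc_forced_align(log_probs, tokens, blank=0):
    """Return, for each frame, the index of the transcript token it is
    aligned to, or None where the best path emits a blank.
    log_probs: list of frames, each a list of per-symbol log-probabilities.
    tokens: transcript as vocabulary ids (no blanks)."""
    T = len(log_probs)
    ext = [blank]                      # blank, t1, blank, t2, ..., blank
    for tok in tokens:
        ext += [tok, blank]
    S = len(ext)
    NEG = -math.inf
    score = [[NEG] * S for _ in range(T)]
    back = [[0] * S for _ in range(T)]
    score[0][0] = log_probs[0][ext[0]]
    if S > 1:
        score[0][1] = log_probs[0][ext[1]]
    for t in range(1, T):
        for s in range(S):
            # Allowed predecessors: stay, advance by one state, or skip the
            # blank between two different tokens.
            cands = [s]
            if s >= 1:
                cands.append(s - 1)
            if s >= 2 and ext[s] != blank and ext[s] != ext[s - 2]:
                cands.append(s - 2)
            best = max(cands, key=lambda p: score[t - 1][p])
            if score[t - 1][best] == NEG:
                continue               # state not reachable at this frame
            score[t][s] = score[t - 1][best] + log_probs[t][ext[s]]
            back[t][s] = best
    # A valid path ends in the last token or the trailing blank.
    s = S - 1 if S == 1 or score[T - 1][S - 1] >= score[T - 1][S - 2] else S - 2
    if score[T - 1][s] == NEG:
        raise ValueError("not enough frames to align the transcript")
    path = [0] * T
    for t in range(T - 1, -1, -1):
        path[t] = s
        s = back[t][s]
    return [(p - 1) // 2 if p % 2 == 1 else None for p in path]

def token_spans(frame_tokens):
    """Collapse per-frame labels into (start_frame, end_frame) per token."""
    spans = {}
    for t, idx in enumerate(frame_tokens):
        if idx is None:
            continue
        start, _ = spans.get(idx, (t, t))
        spans[idx] = (start, t + 1)    # end is exclusive
    return [spans[i] for i in sorted(spans)]

# Toy example (assumed vocabulary: 0 = blank, 1 = "a", 2 = "b").
probs = [
    [0.8, 0.1, 0.1],
    [0.1, 0.8, 0.1],
    [0.1, 0.8, 0.1],
    [0.8, 0.1, 0.1],
    [0.1, 0.1, 0.8],
]
log_probs = [[math.log(p) for p in row] for row in probs]
frames = ctc_forced_align(log_probs, [1, 2])
print(frames)
print(token_spans(frames))
```

In a real pipeline the emissions would come from the MMS-300M model rather than a hand-written matrix, and the spans would then be converted from frames to seconds.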

Key Strengths

·Optimized for low memory usage: according to the model card, it uses significantly less memory than the TorchAudio forced alignment API, which makes it suitable for large-scale or resource-constrained deployments.
·Supports forced alignment across multiple languages via ISO-639-3 language codes and romanization, leveraging the pretrained MMS-300M multilingual model.
·Provides per-word timestamps, confidence scores, and segment-level alignment output.
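As a rough illustration of how per-word timestamps and confidence scores can be derived from character-level alignment, the sketch below groups character spans into words. The `WordTiming` type, the function name, and the 20 ms frame duration are assumptions (20 ms matches a wav2vec2-style encoder on 16 kHz audio, but the value should be checked against the actual pipeline). Averaging character scores is one simple choice for a word-level score, not necessarily the one the model's tooling uses.

```python
from dataclasses import dataclass

FRAME_SECONDS = 0.02  # assumption: one emission frame per 20 ms of audio

@dataclass
class WordTiming:  # hypothetical output record
    text: str
    start: float   # seconds
    end: float     # seconds
    score: float   # mean per-character confidence

def words_from_char_spans(words, char_spans, char_scores,
                          frame_seconds=FRAME_SECONDS):
    """Group per-character (start_frame, end_frame) spans into word timings.
    char_spans and char_scores hold one entry per character of the words,
    in order, with spaces excluded."""
    timings = []
    i = 0
    for word in words:
        n = len(word)
        spans = char_spans[i:i + n]
        scores = char_scores[i:i + n]
        i += n
        timings.append(WordTiming(
            text=word,
            start=round(spans[0][0] * frame_seconds, 3),
            end=round(spans[-1][1] * frame_seconds, 3),
            score=sum(scores) / n,
        ))
    return timings

# Illustrative values only, not real model output.
timings = words_from_char_spans(
    ["hi", "you"],
    [(5, 8), (8, 12), (20, 24), (24, 27), (27, 33)],
    [1.0, 0.5, 0.5, 0.5, 0.5],
)
for w in timings:
    print(w)
```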

Usage Context

This model is designed for developers who need accurate alignment of spoken audio with existing transcripts, such as in subtitle generation, speech corpus creation, or word-level timing analysis. It generates CTC emissions from the audio waveform in batches and returns the alignment as structured time spans.
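One common way to batch emission generation for long recordings is to cut the waveform into fixed-length windows, pad each window with some surrounding context, run the windows through the model in batches, and keep only the frames from each window's core region. The sketch below plans those windows. The function names, the 30-second window, and the 2-second context are illustrative assumptions, not confirmed settings of this model's tooling.

```python
def plan_windows(num_samples, sample_rate=16000, window_s=30.0, context_s=2.0):
    """Return (read_start, read_end, core_start, core_end) sample indices.
    The model would see samples [read_start, read_end), and only emissions
    covering [core_start, core_end) would be kept."""
    window = int(window_s * sample_rate)
    context = int(context_s * sample_rate)
    plans = []
    for start in range(0, num_samples, window):
        end = min(start + window, num_samples)
        plans.append((
            max(0, start - context),        # extra left context
            min(num_samples, end + context),  # extra right context
            start,
            end,
        ))
    return plans

def make_batches(items, batch_size):
    """Split a list of windows into consecutive batches."""
    return [items[i:i + batch_size] for i in range(0, len(items), batch_size)]

# A 70-second recording at 16 kHz.
plans = plan_windows(70 * 16000)
batches = make_batches(plans, batch_size=2)
for batch in batches:
    print(batch)
```

Larger batches raise throughput at the cost of memory, so the batch size is the main knob to adjust on constrained hardware.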

Additional Details

License: CC-BY-NC-4.0 (non-commercial use).
Model version created on 2024-05-02, with over 69 million total downloads on Hugging Face and 92 community likes.
The checkpoint is a direct conversion from the torchaudio MMS-300M forced alignment checkpoint, ensuring compatibility with the Hugging Face ecosystem.

best for

·Generating word-level timestamps for speech recordings
·Aligning transcriptions to audio for ASR training data preparation
·Forced alignment in multilingual speech applications

FAQ

What is this model best for?

It is best for forced alignment: accurately aligning a given text to an audio file to produce word-level timestamps.

How does it compare to other forced alignment tools in terms of memory usage?

It uses much less memory than the TorchAudio forced alignment API, as stated in the model card.

What languages does the model support?

It supports 1130 languages, inherited from the multilingual MMS-300M checkpoint. The language is specified with an ISO-639-3 code.

What are the input and output formats?

Input: audio waveform and text (with language code). Output: word timestamps and confidence scores.
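To show how word timestamps of this kind can feed subtitle generation, here is a small sketch that groups words into SRT cues. The dictionary keys `text`, `start`, and `end` are an assumed shape for the word-level output, not a documented schema, and the seven-word cue length is an arbitrary choice.

```python
def srt_time(seconds):
    """Format seconds as an SRT timestamp, HH:MM:SS,mmm."""
    ms = round(seconds * 1000)
    h, ms = divmod(ms, 3_600_000)
    m, ms = divmod(ms, 60_000)
    s, ms = divmod(ms, 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

def to_srt(words, max_words=7):
    """Build SRT text from word dicts with 'text', 'start', 'end' (seconds).
    Each cue spans from its first word's start to its last word's end."""
    cues = []
    for n, i in enumerate(range(0, len(words), max_words), start=1):
        group = words[i:i + max_words]
        text = " ".join(w["text"] for w in group)
        cues.append(
            f"{n}\n{srt_time(group[0]['start'])} --> "
            f"{srt_time(group[-1]['end'])}\n{text}\n"
        )
    return "\n".join(cues)

# Illustrative values only, not real model output.
words = [
    {"text": "hello", "start": 0.5, "end": 0.9},
    {"text": "world", "start": 1.0, "end": 1.5},
]
srt = to_srt(words)
print(srt)
```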

How can I use this model via the gigarouter API?

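The hosted endpoint is not yet live. Once it is available, the model is expected to be reachable through the gigarouter OpenAI-compatible endpoint with an API key: send audio and text inputs to receive aligned timestamps.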

not yet live

We're benchmarking and onboarding MMS-300M Forced Aligner as a hosted, OpenAI-compatible API.

related speech-to-text models

speaker-diarization-3.1

wav2vec2-large-xlsr-53-japanese

6.1M dl/mo

wav2vec2-large-xlsr-53-polish

4.7M dl/mo

wav2vec2-large-xlsr-53-dutch

4.1M dl/mo