LTX-2.3
Lightricks/LTX-2.3
published Mar 2026 · updated Apr 2026
LTX-2.3 is a DiT-based audio-video foundation model that generates synchronized video and audio from text and image inputs.
specs
| Field | Value |
| --- | --- |
| Task | Image-to-Video, Text-to-Video, Audio-Video Generation |
| Architecture | Asymmetric dual-stream DiT with bidirectional audio-video cross-attention |
| Parameters | 22B total (14B video stream, 5B audio stream) |
| License | Open-source (see LICENSE file) |
about this model
LTX-2.3 generates synchronized video and audio from a text prompt, optionally conditioned on an input image. Built on a diffusion-based architecture, it is designed for high-fidelity audiovisual content creation and is being onboarded as a managed, OpenAI-compatible API on gigarouter.
The model uses an asymmetric dual-stream transformer with a 14-billion-parameter video stream and a 5-billion-parameter audio stream. Bidirectional cross-attention layers with temporal positional embeddings and cross-modality AdaLN enable tight audiovisual alignment. A multilingual text encoder broadens prompt understanding, and modality-aware classifier-free guidance (modality-CFG) improves controllability of both video and audio outputs. The audio stream can generate speech, background sounds, and foley consistent with the scene.
According to the LTX-2 paper (arXiv:2601.03233), the model achieves state-of-the-art audiovisual quality and prompt adherence among open-source systems, while delivering results comparable to proprietary models at a fraction of the computational cost and inference time. A distilled variant (ltx-2.3-22b-distilled) runs in 8 sampling steps at CFG=1 for fastest inference.
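To make the "CFG=1" setting concrete, here is a minimal sketch of the standard single-condition classifier-free guidance formula. It is not LTX-2.3's modality-CFG, whose exact form is not given here; the function name `cfg_combine` and the toy vectors are illustrative assumptions.

```python
def cfg_combine(cond, uncond, scale):
    """Standard classifier-free guidance, applied per element:
    guided = uncond + scale * (cond - uncond).
    """
    return [u + scale * (c - u) for c, u in zip(cond, uncond)]


# Toy "model predictions" standing in for real denoiser outputs (assumed values).
cond = [0.5, -1.0, 2.0]    # prediction with the prompt
uncond = [0.0, 1.0, 1.0]   # prediction without the prompt

# At scale 1 the unconditional term cancels and the result is the
# conditional prediction itself.
print(cfg_combine(cond, uncond, 1.0))

# A larger scale pushes the result further away from the unconditional prediction.
print(cfg_combine(cond, uncond, 3.0))
```

Under this standard formula, a scale of 1 makes the unconditional forward pass unnecessary, which is one plausible reason a distilled model at CFG=1 can be cheaper per step as well as needing fewer steps.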
For production use, the recommended pipeline is the two-stage text/image-to-video pipeline (TI2VidTwoStagesPipeline), with a higher-quality variant using a different sampler. Additional supported modes include video-to-video, keyframe interpolation, retake, IC-LoRA, lipdub, and text-to-audio.

best for
- Generating synchronized video and audio from a single text prompt or image
- Creating short video clips with natural background audio, foley, and speech
- Fast iterative content creation using the distilled 8-step pipeline
FAQ
LTX-2.3 is a significant update with improved audio and visual quality and enhanced prompt adherence.
Input is a text prompt and an image; output is a synchronized video and audio clip. Width and height must each be divisible by 32, and the frame count must be a multiple of 8 plus 1 (for example 97 or 121).
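The size constraints above are easy to get wrong by one frame, so a small pre-flight check can help. The sketch below is not part of any LTX or gigarouter library; `is_valid_shape` and `snap_shape` are hypothetical helper names, and the lower bounds used when snapping (32 pixels, 1 frame) are assumptions rather than documented limits.

```python
def is_valid_shape(width: int, height: int, num_frames: int) -> bool:
    """Return True if width and height are multiples of 32 and
    num_frames has the form 8*k + 1."""
    return (
        width > 0 and height > 0 and num_frames > 0
        and width % 32 == 0
        and height % 32 == 0
        and num_frames % 8 == 1
    )


def snap_shape(width: int, height: int, num_frames: int):
    """Round each value down to the nearest valid one.

    The minimums (32 px, 1 frame) are assumptions for illustration only.
    """
    w = max(32, (width // 32) * 32)
    h = max(32, (height // 32) * 32)
    f = max(1, ((num_frames - 1) // 8) * 8 + 1)
    return w, h, f


print(is_valid_shape(768, 512, 121))  # valid: 768 and 512 are multiples of 32, 121 = 8*15 + 1
print(is_valid_shape(768, 512, 120))  # invalid: 120 is not of the form 8*k + 1
print(snap_shape(1000, 600, 100))     # snaps down to (992, 576, 97)
```

Rounding down rather than up keeps the request within the size the caller asked for; a caller who prefers the nearest valid value could round instead.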
The full model has 22B total parameters, split into a 14B video stream and a 5B audio stream.
The model is released under an open-source license; the full, distilled, and upscaler models and their derivatives may be used for any purpose the license permits.
Use the gigarouter OpenAI-compatible endpoint with your API key, specifying the model name and your input prompt/image.
We're benchmarking and onboarding LTX-2.3 as a hosted, OpenAI-compatible API. Sign in for free credit and be ready when it lands, or tell us you want it and we'll prioritize it.