LTX-2.3

Lightricks/LTX-2.3

published Mar 2026 · updated Apr 2026

LTX-2.3 is a DiT-based audio-video foundation model that generates synchronized video and audio from text and image inputs.

status

coming soon

API providers

downloads / mo

1.8M

license

other

specs

Task	Image-to-Video, Text-to-Video, Audio-Video Generation
Architecture	Asymmetric dual-stream DiT with bidirectional audio-video cross-attention
Parameters	22B total (14B video stream, 5B audio stream)
License	Open-source (see LICENSE file)

about this model

LTX-2.3 is an audio-video generation model that produces synchronized video and audio from a text prompt, optionally conditioned on an input image. Built on a diffusion-based architecture, it is designed for high-fidelity audiovisual content creation and is being onboarded as a managed, OpenAI-compatible API on gigarouter.

The model uses an asymmetric dual-stream transformer with a 14-billion-parameter video stream and a 5-billion-parameter audio stream. Bidirectional cross-attention layers with temporal positional embeddings and cross-modality AdaLN enable tight audiovisual alignment. A multilingual text encoder broadens prompt understanding, and modality-aware classifier-free guidance (modality-CFG) improves controllability of both video and audio outputs. The audio stream can generate speech, background sounds, and foley consistent with the scene.

According to the LTX-2 paper (arXiv:2601.03233), the model achieves state-of-the-art audiovisual quality and prompt adherence among open-source systems, while delivering results comparable to proprietary models at a fraction of the computational cost and inference time. A distilled variant (ltx-2.3-22b-distilled) runs in 8 sampling steps at CFG=1 for fastest inference.
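Why CFG=1 helps the distilled variant can be seen from the standard classifier-free guidance formula. The sketch below shows plain CFG only, not the modality-aware variant described above, and the helper names (`cfg_combine`, `denoiser_calls`) are illustrative rather than part of the LTX-2.3 codebase. It also assumes a sampler that skips the unconditional pass when the scale is 1, which is a common optimization but is not confirmed by this page.

```python
def cfg_combine(uncond, cond, scale):
    """Standard classifier-free guidance: uncond + scale * (cond - uncond)."""
    return [u + scale * (c - u) for u, c in zip(uncond, cond)]


def denoiser_calls(steps, scale):
    """Forward passes per sample, ASSUMING the unconditional pass is skipped at scale 1."""
    passes_per_step = 1 if scale == 1.0 else 2
    return steps * passes_per_step


# Toy two-element "predictions" standing in for model outputs.
uncond = [0.0, 1.0]
cond = [1.0, 3.0]

# At scale 1 the guided prediction is exactly the conditional one.
print(cfg_combine(uncond, cond, 1.0))
# A larger scale pushes the result further from the unconditional prediction.
print(cfg_combine(uncond, cond, 3.0))
# Under the stated assumption, 8 steps at scale 1 need 8 denoiser calls.
print(denoiser_calls(8, 1.0))
```

With the scale at 1 the unconditional term cancels, so each of the 8 sampling steps can get by with a single conditional forward pass.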

For production use, the recommended pipeline is the two-stage text/image-to-video pipeline (TI2VidTwoStagesPipeline), with a higher-quality variant using a different sampler. Additional supported modes include video-to-video, keyframe interpolation, retake, IC-LoRA, lipdub, and text-to-audio.

best for

·Generating synchronized video and audio from a single text prompt or image
·Creating short video clips with natural background audio, foley, and speech
·Fast iterative content creation using the distilled 8-step pipeline

FAQ

What is the main difference between LTX-2.3 and LTX-2?

LTX-2.3 is an update to LTX-2 that improves audio quality, visual quality, and prompt adherence.

What are the input and output formats for the image-to-video pipeline?

Input is a text prompt and an image; output is a clip with synchronized video and audio. Width and height must each be divisible by 32, and the frame count must be a multiple of 8 plus 1 (that is, of the form 8k + 1, such as 97 or 121).
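The size constraints are easy to get wrong (720, for instance, is not divisible by 32), so a small pre-flight check can help. This is a minimal sketch: `validate_dims` and `snap_dims` are hypothetical helper names, not functions from the LTX-2.3 pipeline, and rounding down is just one possible policy.

```python
def validate_dims(width, height, num_frames):
    """Return a list of problems; an empty list means the request is valid."""
    problems = []
    if width <= 0 or width % 32 != 0:
        problems.append(f"width {width} is not a positive multiple of 32")
    if height <= 0 or height % 32 != 0:
        problems.append(f"height {height} is not a positive multiple of 32")
    if num_frames < 1 or (num_frames - 1) % 8 != 0:
        problems.append(f"num_frames {num_frames} is not of the form 8k + 1")
    return problems


def snap_dims(width, height, num_frames):
    """Round each value DOWN to the nearest valid one (an illustrative policy)."""
    snapped_width = max(32, width // 32 * 32)
    snapped_height = max(32, height // 32 * 32)
    snapped_frames = max(1, (num_frames - 1) // 8 * 8 + 1)
    return snapped_width, snapped_height, snapped_frames


print(validate_dims(1280, 720, 120))  # reports the height and the frame count
print(snap_dims(1280, 720, 120))      # (1280, 704, 113)
```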

How many parameters does the model have?

The full model has 22B total parameters, including a 14B video stream and a 5B audio stream.

What is the license for LTX-2.3?

The model is released under an open-source license; the full, distilled, and upscaler models and their derivatives may be used for the purposes that license permits. See the LICENSE file for the terms.

How can I call this model via the gigarouter API?

Once the model is live, call the gigarouter OpenAI-compatible endpoint with your API key, specifying the model name, your text prompt, and optionally an input image.
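Because the hosted endpoint is not yet published, the following is only a sketch of what such a request might look like. The base URL, the route, and every payload field other than the model name are assumptions made for illustration; check the gigarouter documentation for the real schema once the model is available. The code builds the request but deliberately does not send it.

```python
import json
import urllib.request

# HYPOTHETICAL values: the real base URL and route are not given on this page.
BASE_URL = "https://api.gigarouter.example/v1"
VIDEO_PATH = "/videos/generations"


def build_request(api_key, prompt, image_url=None,
                  width=1280, height=704, num_frames=121):
    """Build (but do not send) a POST request with an assumed payload shape."""
    payload = {
        "model": "Lightricks/LTX-2.3",  # model name as shown on this page
        "prompt": prompt,
        "width": width,            # assumed field name; must be divisible by 32
        "height": height,          # assumed field name; must be divisible by 32
        "num_frames": num_frames,  # assumed field name; must be of the form 8k + 1
    }
    if image_url is not None:
        payload["image_url"] = image_url  # assumed field name
    return urllib.request.Request(
        BASE_URL + VIDEO_PATH,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


req = build_request("YOUR_API_KEY", "A lighthouse at dusk, waves crashing")
print(req.get_method(), req.full_url)
```

Sending the request would be a matter of passing `req` to `urllib.request.urlopen`, once the real endpoint and payload schema are known.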

not yet live

We're benchmarking and onboarding LTX-2.3 as a hosted, OpenAI-compatible API.