IndexTTS-2 API, Deployment, Pricing
IndexTeam/IndexTTS-2
IndexTTS2 is a breakthrough auto-regressive zero-shot Text-to-Speech (TTS) model designed to address the challenge of precise duration control in large-scale TTS systems, which is a significant limitation in applications like video dubbing. It introduces a novel, general method for speech duration control, supporting two modes: one that explicitly specifies the number of generated tokens for precise duration, and another that generates speech freely in an auto-regressive manner. Furthermore, IndexTTS2 achieves disentanglement between emotional expression and speaker identity, enabling independent control over timbre and emotion via separate prompts. To enhance speech clarity in highly emotional expressions, the model incorporates GPT latent representations and utilizes a novel three-stage training paradigm. To lower the barrier for emotional control, it also features a soft instruction mechanism based on text descriptions, developed by fine-tuning Qwen3, to effectively guide the generation of speech with the desired emotional tone. Experimental results show that IndexTTS2 outperforms state-of-the-art zero-shot TTS models in word error rate, speaker similarity, and emotional fidelity across multiple datasets
Details
Model Provider
IndexTeam
Type
audio
Sub Type
text-to-speech
Publish Time
Sep 10, 2025
Price
$
undefined
/ M UTF-8 bytes
Tags
MoE,235B,128K
Model FAQs: Usage, Deployment
Learn how to use, fine-tune, and deploy this model with ease.