State-of-the-Art

AI Model Library

One API to run inference on 200+ cutting-edge AI models, and deploy in seconds

State-of-the-Art

AI Model Library

One API to run inference on 200+ cutting-edge AI models, and deploy in seconds

State-of-the-Art

AI Model Library

One API to run inference on 200+ cutting-edge AI models, and deploy in seconds

All

Featured

LLM

Vision

Image

Video

Audio

Serverless

IndexTeam

Text-to-Speech

IndexTTS-2

Release on: Sep 10, 2025

IndexTTS2 is a breakthrough auto-regressive zero-shot Text-to-Speech (TTS) model designed to address the challenge of precise duration control in large-scale TTS systems, which is a significant limitation in applications like video dubbing. It introduces a novel, general method for speech duration control, supporting two modes: one that explicitly specifies the number of generated tokens for precise duration, and another that generates speech freely in an auto-regressive manner. Furthermore, IndexTTS2 achieves disentanglement between emotional expression and speaker identity, enabling independent control over timbre and emotion via separate prompts. To enhance speech clarity in highly emotional expressions, the model incorporates GPT latent representations and utilizes a novel three-stage training paradigm. To lower the barrier for emotional control, it also features a soft instruction mechanism based on text descriptions, developed by fine-tuning Qwen3, to effectively guide the generation of speech with the desired emotional tone. Experimental results show that IndexTTS2 outperforms state-of-the-art zero-shot TTS models in word error rate, speaker similarity, and emotional fidelity across multiple datasets...

7.15

/ M UTF-8 bytes

Cached Input:

text

/ M UTF-8 bytes

Ready to accelerate your AI development?