
State-of-the-Art AI Models
Access a comprehensive library of cutting-edge AI models for LLMs, image, video, and audio generation, all through our high-performance inference API.
IndexTeam
text-to-speech
IndexTTS-2
Released on: Sep 10, 2025
IndexTTS2 is a breakthrough auto-regressive zero-shot text-to-speech (TTS) model designed to address precise duration control, a significant limitation of large-scale TTS systems in applications such as video dubbing. It introduces a novel, general method for speech duration control with two modes: one that explicitly specifies the number of generated tokens for precise duration, and one that generates speech freely in an auto-regressive manner. IndexTTS2 also disentangles emotional expression from speaker identity, enabling independent control of timbre and emotion via separate prompts. To preserve speech clarity in highly emotional utterances, the model incorporates GPT latent representations and a novel three-stage training paradigm. To lower the barrier to emotional control, it adds a soft instruction mechanism based on text descriptions, built by fine-tuning Qwen3, that guides generation toward the desired emotional tone. Experimental results show that IndexTTS2 outperforms state-of-the-art zero-shot TTS models in word error rate, speaker similarity, and emotional fidelity across multiple datasets. A hedged API usage sketch follows the pricing below.
Input: $7.15 / M UTF-8 bytes
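
A minimal sketch of how IndexTTS2's separate timbre/emotion prompts and explicit-token duration mode might be driven over an HTTP inference API. The endpoint URL and the voice_prompt, emotion_prompt, and duration_tokens field names are illustrative assumptions, not a documented contract:

import requests

API_URL = "https://api.example.com/v1/audio/speech"  # assumed endpoint, not documented here
HEADERS = {"Authorization": "Bearer YOUR_API_KEY"}

payload = {
    "model": "IndexTeam/IndexTTS-2",
    "input": "The quick brown fox jumps over the lazy dog.",
    # Timbre and emotion are disentangled, so each gets its own prompt.
    "voice_prompt": "speaker_reference.wav",  # hypothetical field: timbre reference
    "emotion_prompt": "calm and reassuring",  # hypothetical field: soft text instruction
    # Duration mode 1: pin the output length via an explicit token count.
    # Omit this field for mode 2 (free auto-regressive generation).
    "duration_tokens": 512,                   # hypothetical field
}

resp = requests.post(API_URL, json=payload, headers=HEADERS, timeout=120)
resp.raise_for_status()
with open("output.wav", "wb") as f:
    f.write(resp.content)  # assumes the API returns raw audio bytes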
Fish Audio
text-to-speech
Fish-Speech-1.5
Released on: Nov 29, 2024
Fish Speech V1.5 is a leading open-source text-to-speech (TTS) model. It employs an innovative DualAR architecture with a dual autoregressive transformer design. The model supports multiple languages, with over 300,000 hours of training data for English and Chinese and over 100,000 hours for Japanese. In independent TTS Arena evaluations, the model performed exceptionally well, with an ELO score of 1339. It achieved a word error rate (WER) of 3.5% and a character error rate (CER) of 1.2% for English, and a CER of 1.3% for Chinese characters. A cost-estimation sketch follows the pricing below.
Input: $15.0 / M UTF-8 bytes
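
Because Fish-Speech-1.5 input is billed per million UTF-8 bytes, the text cost of a request can be estimated directly from its encoded byte length. A minimal sketch; only the $15.0 / M UTF-8 bytes rate comes from the listing, and the helper itself is hypothetical:

PRICE_PER_M_BYTES = 15.0  # USD per million UTF-8 input bytes, from the listing above

def estimate_input_cost(text: str) -> float:
    n_bytes = len(text.encode("utf-8"))  # note: most CJK characters encode to 3 bytes
    return n_bytes / 1_000_000 * PRICE_PER_M_BYTES

print(f"${estimate_input_cost('a' * 1000):.3f}")  # 1,000 ASCII chars -> $0.015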

FunAudioLLM
text-to-speech
FunAudioLLM/CosyVoice2-0.5B
Released on: Dec 16, 2024
CosyVoice 2 is a streaming speech synthesis model based on a large language model, employing a unified streaming/non-streaming framework. The model improves utilization of the speech token codebook through finite scalar quantization (FSQ), simplifies the text-to-speech language model architecture, and develops a chunk-aware causal streaming matching model that supports different synthesis scenarios. In streaming mode, the model achieves ultra-low latency of 150 ms while maintaining synthesis quality nearly identical to non-streaming mode. Compared to version 1.0, the pronunciation error rate is reduced by 30%-50%, the MOS score improves from 5.4 to 5.53, and fine-grained control over emotion and dialect is supported. The model supports Chinese (including the Cantonese, Sichuan, Shanghainese, and Tianjin dialects), English, Japanese, and Korean, as well as cross-lingual and mixed-language scenarios. A streaming-consumption sketch follows the pricing below.
Input: $7.15 / M UTF-8 bytes
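
CosyVoice 2's 150 ms streaming latency only helps if the client also consumes audio incrementally. A minimal sketch of chunked consumption over HTTP, assuming a hypothetical endpoint and stream flag:

import requests

API_URL = "https://api.example.com/v1/audio/speech"  # assumed endpoint, not documented here
HEADERS = {"Authorization": "Bearer YOUR_API_KEY"}
payload = {
    "model": "FunAudioLLM/CosyVoice2-0.5B",
    "input": "Streaming synthesis can start playing before the full sentence is generated.",
    "stream": True,  # hypothetical flag selecting the streaming mode
}

# stream=True makes requests yield body chunks as they arrive, so playback
# can begin after the first chunk instead of waiting for the whole response.
with requests.post(API_URL, json=payload, headers=HEADERS, stream=True, timeout=120) as resp:
    resp.raise_for_status()
    with open("output.pcm", "wb") as f:
        for chunk in resp.iter_content(chunk_size=4096):
            f.write(chunk)  # a real client would feed each chunk to the audio device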