FunAudioLLM/CosyVoice2-0.5B API, Deployment, Pricing

FunAudioLLM/CosyVoice2-0.5B

CosyVoice 2 is a streaming speech synthesis model based on a large language model, employing a unified streaming/non-streaming framework design. The model enhances the utilization of the speech token codebook through finite scalar quantization (FSQ), simplifies the text-to-speech language model architecture, and develops a chunk-aware causal streaming matching model that supports different synthesis scenarios. In streaming mode, the model achieves ultra-low latency of 150ms while maintaining synthesis quality almost identical to that of non-streaming mode. Compared to version 1.0, the pronunciation error rate has been reduced by 30%-50%, the MOS score has improved from 5.4 to 5.53, and fine-grained control over emotions and dialects is supported. The model supports Chinese (including dialects: Cantonese, Sichuan dialect, Shanghainese, Tianjin dialect, etc.), English, Japanese, Korean, and supports cross-lingual and mixed-language scenarios.

API Usage

curl --request POST \
--url https://api.siliconflow.com/v1/audio/speech \
--header 'Authorization: Bearer <token>' \
--header 'Content-Type: application/json' \
--data '{
"input": "Can you say it with a happy emotion? <|endofprompt|>I'\''m so happy, Spring Festival is coming!",
"response_format": "mp3",
"stream": true,
"speed": 1,
"gain": 0,
"model": "FunAudioLLM/CosyVoice2-0.5B"
}'

Details

Model Provider

FunAudioLLM

Type

audio

Sub Type

text-to-speech

Publish Time

Dec 16, 2024

Price

$

undefined

/ M UTF-8 bytes

Tags

MoE,235B,128K

Model FAQs: Usage, Deployment

Learn how to use, fine-tune, and deploy this model with ease.

Ready to accelerate your AI development?

Ready to accelerate your AI development?

© 2025 SiliconFlow Technology PTE. LTD.

© 2025 SiliconFlow Technology PTE. LTD.

© 2025 SiliconFlow Technology PTE. LTD.