FunAudioLLM/CosyVoice2-0.5B

CosyVoice 2 is a streaming speech synthesis model built on a large language model, with a unified framework for streaming and non-streaming synthesis. It improves utilization of the speech token codebook through finite scalar quantization (FSQ), simplifies the text-to-speech language model architecture, and introduces a chunk-aware causal flow matching model that supports different synthesis scenarios. In streaming mode, the model reaches a latency as low as 150 ms while keeping synthesis quality almost identical to that of non-streaming mode. Compared with version 1.0, the pronunciation error rate is reduced by 30%-50%, the MOS improves from 5.4 to 5.53, and fine-grained control over emotion and dialect is supported. Supported languages are Chinese (including dialects such as Cantonese, Sichuanese, Shanghainese, and Tianjin dialect), English, Japanese, and Korean, along with cross-lingual and mixed-language synthesis.

API Usage

curl --request POST \
  --url https://api.ap.siliconflow.com/v1/audio/speech \
  --header 'Authorization: Bearer <token>' \
  --header 'Content-Type: application/json' \
  --data '{
  "input": "Can you say it with a happy emotion? <|endofprompt|>I'\''m so happy, Spring Festival is coming!",
  "response_format": "mp3",
  "stream": true,
  "speed": 1,
  "gain": 0,
  "model": "FunAudioLLM/CosyVoice2-0.5B"
}'
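
The same request can be sketched in Python using only the standard library. This is a minimal illustration, not an official client: the helper names (`build_payload`, `build_request`, `synthesize_to_file`) are hypothetical, the field values are copied from the curl example above, and treating the text before `<|endofprompt|>` as an optional instruction is an assumption drawn from that example.

```python
import json
import urllib.request

# Endpoint and model name are taken from the curl example above.
API_URL = "https://api.ap.siliconflow.com/v1/audio/speech"
MODEL = "FunAudioLLM/CosyVoice2-0.5B"
END_OF_PROMPT = "<|endofprompt|>"


def build_payload(text, instruction=None, response_format="mp3",
                  stream=True, speed=1, gain=0):
    """Assemble the JSON body with the same fields as the curl example."""
    # Assumption: an instruction such as "Can you say it with a happy
    # emotion?" goes before the marker, and the text to speak goes after it.
    spoken = f"{instruction} {END_OF_PROMPT}{text}" if instruction else text
    return {
        "input": spoken,
        "response_format": response_format,
        "stream": stream,
        "speed": speed,
        "gain": gain,
        "model": MODEL,
    }


def build_request(payload, token):
    """Wrap the payload in a POST request carrying the bearer token."""
    return urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def synthesize_to_file(payload, token, out_path, chunk_size=4096):
    """Send the request and write audio bytes to out_path as they arrive."""
    request = build_request(payload, token)
    with urllib.request.urlopen(request) as resp, open(out_path, "wb") as out:
        while True:
            chunk = resp.read(chunk_size)
            if not chunk:
                break
            out.write(chunk)
```

With a valid API token in place of `<token>`, calling `synthesize_to_file(build_payload("I'm so happy, Spring Festival is coming!", instruction="Can you say it with a happy emotion?"), "<token>", "speech.mp3")` would send the same body as the curl command and save the returned audio.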

Details

Model Provider

FunAudioLLM

Type

audio

Sub Type

text-to-speech

Publish Time

Dec 16, 2024

Price

$7.15 / M UTF-8 bytes

Tags

Multilingual, 0.5B
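
Because the price is quoted per million UTF-8 bytes rather than per character, the cost of a request depends on the script of the input text. The sketch below estimates it under two assumptions that the page does not spell out: that "M" means 1,000,000 bytes, and that the billed quantity is the UTF-8 length of the input string. The function names are hypothetical.

```python
# Price from the Details section: USD per million UTF-8 bytes.
PRICE_USD_PER_MILLION_BYTES = 7.15


def utf8_byte_count(text):
    """Number of bytes the text occupies when encoded as UTF-8."""
    return len(text.encode("utf-8"))


def estimate_cost_usd(text):
    """Rough cost estimate, assuming billing on the input's UTF-8 size."""
    return utf8_byte_count(text) / 1_000_000 * PRICE_USD_PER_MILLION_BYTES


# ASCII letters take 1 byte each; common Chinese characters take 3 bytes each.
print(utf8_byte_count("hello"))  # 5
print(utf8_byte_count("春节"))    # 6
```

Under these assumptions, the same number of characters costs roughly three times as much in Chinese as in plain ASCII English.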

© 2025 SiliconFlow Technology PTE. LTD.