State-of-the-Art

AI Models

Access a comprehensive library of cutting-edge AI models for LLMs, image, video, and audio generation, all through our high-performance inference API.
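
As a sketch of what calling such an inference API typically looks like: the endpoint URL, model name, and header layout below are illustrative assumptions, not documented values; check the provider's API reference for the real ones.

```python
import json

API_URL = "https://api.example.com/v1/chat/completions"  # hypothetical endpoint
API_KEY = "sk-..."  # your API key

def build_request(prompt: str, model: str = "example/llm-model") -> dict:
    """Assemble headers and JSON body for a single chat-completion call."""
    return {
        "url": API_URL,
        "headers": {
            "Authorization": f"Bearer {API_KEY}",
            "Content-Type": "application/json",
        },
        "json": {
            "model": model,
            "messages": [{"role": "user", "content": prompt}],
            "max_tokens": 256,
        },
    }

req = build_request("Summarize rectified flow in one sentence.")
print(json.dumps(req["json"], indent=2))
```

A real client would pass these pieces to an HTTP library such as `requests.post(req["url"], headers=req["headers"], json=req["json"])`.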

FEATURED

Image

Image Generation

Generate high-quality images from text prompts with our state-of-the-art image models

text-to-image

FLUX.1-dev

FLUX.1 [dev] is a 12-billion-parameter rectified flow transformer that generates images from text descriptions. It offers cutting-edge output quality, second only to Black Forest Labs' flagship model, FLUX.1 [pro], and competitive prompt following that matches closed-source alternatives. Because it was trained with guidance distillation, FLUX.1 [dev] is more efficient at inference. Open weights are provided to drive new scientific research and to empower artists to develop innovative workflows.

12B
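
A typical text-to-image request body for a FLUX.1-dev-style endpoint might look like the sketch below. The parameter names (`image_size`, `num_inference_steps`, `guidance_scale`) are common diffusion-API conventions, assumed here for illustration rather than taken from a documented schema.

```python
def build_t2i_payload(prompt: str) -> dict:
    """Assemble an illustrative text-to-image request body."""
    return {
        "model": "black-forest-labs/FLUX.1-dev",
        "prompt": prompt,
        "image_size": "1024x1024",
        "num_inference_steps": 28,  # guidance-distilled models work with few steps
        "guidance_scale": 3.5,
    }

payload = build_t2i_payload("a lighthouse at dusk, oil painting")
print(sorted(payload.keys()))
```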

image-to-image

FLUX.1-Kontext-dev

FLUX.1 Kontext [dev] is a 12-billion-parameter image editing model developed by Black Forest Labs. Built on advanced flow matching technology, it is a diffusion transformer capable of precise, instruction-driven image editing. Its core strength is contextual understanding: it processes text and image inputs together and maintains strong consistency of characters, styles, and objects across multiple successive edits with minimal visual drift. As an open-weight model, FLUX.1 Kontext [dev] aims to drive new scientific research and to empower developers and artists with innovative workflows. It can be used for tasks such as style transfer, object modification, background swapping, and even text editing.

12B
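
For an editing model like this, the instruction travels as text and the source image typically travels inline, for example as a base64 data URL. The field names below are illustrative assumptions, not the provider's documented schema.

```python
import base64

def build_edit_payload(instruction: str, image_bytes: bytes) -> dict:
    """Assemble an illustrative image-editing request body."""
    b64 = base64.b64encode(image_bytes).decode("ascii")
    return {
        "model": "black-forest-labs/FLUX.1-Kontext-dev",
        "prompt": instruction,  # e.g. "swap the background to a beach"
        "image": f"data:image/png;base64,{b64}",
    }

payload = build_edit_payload("make the sky stormy", b"\x89PNG...")
print(payload["image"][:30])
```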

Video

Video Generation

Create dynamic videos from text descriptions with our cutting-edge video generation models

image-to-video

Wan2.1-I2V-14B-720P

Wan2.1-I2V-14B-720P is an advanced open-source image-to-video generation model and part of the Wan2.1 video foundation model suite. This 14B model generates 720P high-definition video and, across thousands of rounds of human evaluation, reaches state-of-the-art performance. It uses a diffusion transformer architecture and enhances generation quality through an innovative spatiotemporal variational autoencoder (VAE), scalable training strategies, and large-scale data construction. The model also understands and processes both Chinese and English text, providing strong support for text-guided video generation.

14B, Img2Video
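
Video generation is usually asynchronous: a client submits a job, then polls until the result is ready. The helper below is a generic sketch of that pattern; the status values and field names (`status`, `Succeed`, `video_url`) are assumptions, not a documented schema, and the backend here is simulated.

```python
import time

def poll_until_done(fetch_status, interval_s=0.0, max_polls=10):
    """fetch_status() returns a dict like {"status": ..., "video_url": ...}."""
    for _ in range(max_polls):
        job = fetch_status()
        if job.get("status") == "Succeed":
            return job["video_url"]
        if job.get("status") == "Failed":
            raise RuntimeError("video generation failed")
        time.sleep(interval_s)  # back off between polls
    raise TimeoutError("job did not finish in time")

# Simulated backend: pending twice, then done.
states = iter([
    {"status": "InProgress"},
    {"status": "InProgress"},
    {"status": "Succeed", "video_url": "https://example.com/out.mp4"},
])
print(poll_until_done(lambda: next(states)))  # -> https://example.com/out.mp4
```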

image-to-video

Wan2.1-I2V-14B-720P (Turbo)

Wan2.1-I2V-14B-720P-Turbo is the TeaCache-accelerated version of Wan2.1-I2V-14B-720P, reducing single-video generation time by 30%. It shares the base model's architecture and capabilities, including 720P high-definition output, the diffusion transformer design with its spatiotemporal VAE, and understanding of both Chinese and English text.

14B, Img2Video

text-to-video

Wan2.1-T2V-14B

Wan2.1-T2V-14B is an advanced open-source text-to-video generation model. This 14B model has set state-of-the-art performance benchmarks among both open-source and closed-source models, generating high-quality visual content with pronounced motion. It is the only video model able to render text in both Chinese and English within generated videos, and it supports output at 480P and 720P resolutions. The model adopts a diffusion transformer architecture and enhances its generative capabilities through an innovative spatiotemporal variational autoencoder (VAE), scalable training strategies, and large-scale data construction.

14B

text-to-video

Wan2.1-T2V-14B (Turbo)

Wan2.1-T2V-14B (Turbo) is the TeaCache-accelerated version of the Wan2.1-T2V-14B model, reducing single-video generation time by 30%. It shares the base model's capabilities: state-of-the-art benchmark performance among open-source and closed-source models, text rendering in both Chinese and English within generated videos, 480P and 720P output, and the diffusion transformer architecture with its innovative spatiotemporal variational autoencoder (VAE).

14B

Chat

LLMs

Powerful language models for conversational AI, content generation, and more
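
Conversational endpoints commonly stream tokens back as server-sent events: lines of the form `data: {json chunk}` terminated by `data: [DONE]`. The parser below is a generic sketch assuming an OpenAI-style chunk layout; the exact chunk schema of any given provider may differ.

```python
import json

def collect_stream(lines):
    """Concatenate the content deltas from an SSE token stream."""
    out = []
    for line in lines:
        if not line.startswith("data: "):
            continue  # skip keep-alives and blank lines
        body = line[len("data: "):]
        if body == "[DONE]":
            break
        chunk = json.loads(body)
        delta = chunk["choices"][0]["delta"].get("content", "")
        out.append(delta)
    return "".join(out)

sample = [
    'data: {"choices":[{"delta":{"content":"Hel"}}]}',
    'data: {"choices":[{"delta":{"content":"lo"}}]}',
    "data: [DONE]",
]
print(collect_stream(sample))  # -> Hello
```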

Audio

Audio Models

Our most popular and powerful models, ready for your applications

text-to-speech

Fish-Speech-1.5

Fish Speech V1.5 is a leading open-source text-to-speech (TTS) model. It employs an innovative DualAR architecture with a dual autoregressive transformer design. The model supports multiple languages, with over 300,000 hours of training data each for English and Chinese and over 100,000 hours for Japanese. In independent TTS Arena evaluations it performed exceptionally well, with an Elo score of 1339. The model achieved a word error rate (WER) of 3.5% and a character error rate (CER) of 1.2% for English, and a CER of 1.3% for Chinese characters.

Multilingual
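
The WER/CER figures above are edit-distance metrics: CER is the Levenshtein distance between the reference and hypothesis character sequences, divided by the reference length. A minimal implementation:

```python
def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: Levenshtein distance / reference length."""
    m, n = len(reference), len(hypothesis)
    # d[i][j] = edit distance between reference[:i] and hypothesis[:j]
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if reference[i - 1] == hypothesis[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,      # deletion
                          d[i][j - 1] + 1,      # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[m][n] / m

print(cer("hello world", "hallo world"))  # one substitution over 11 chars
```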

text-to-speech

FunAudioLLM/CosyVoice2-0.5B

CosyVoice 2 is a streaming speech synthesis model built on a large language model, with a unified streaming/non-streaming framework design. The model improves utilization of the speech-token codebook through finite scalar quantization (FSQ), simplifies the text-to-speech language-model architecture, and introduces a chunk-aware causal streaming matching model that supports different synthesis scenarios. In streaming mode it achieves ultra-low latency of 150 ms while maintaining synthesis quality nearly identical to non-streaming mode. Compared with version 1.0, the pronunciation error rate is reduced by 30%-50%, the MOS score improves from 5.4 to 5.53, and fine-grained control over emotion and dialect is supported. The model handles Chinese (including the Cantonese, Sichuanese, Shanghainese, and Tianjin dialects, among others), English, Japanese, and Korean, as well as cross-lingual and mixed-language scenarios.

Multilingual, 0.5B
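
A streaming TTS request for a model like this typically toggles chunked delivery with a flag so playback can start as soon as the first audio chunk arrives. The field names below (`voice`, `response_format`, `stream`) are illustrative assumptions; consult the actual API reference for the real schema.

```python
def build_tts_payload(text: str, stream: bool = True) -> dict:
    """Assemble an illustrative streaming text-to-speech request body."""
    return {
        "model": "FunAudioLLM/CosyVoice2-0.5B",
        "input": text,
        "voice": "default",        # assumed parameter name
        "response_format": "mp3",
        "stream": stream,          # chunked audio enables ~150 ms time-to-first-audio
    }

payload = build_tts_payload("你好，世界")
print(payload["stream"])
```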

Ready to accelerate your AI development?

© 2025 SiliconFlow Technology PTE. LTD.
