State-of-the-Art AI Models

Access a comprehensive library of cutting-edge AI models for LLMs, image, video, and audio generation, all through our high-performance inference API.

FEATURED

Image

Image Generation

Generate high-quality images from text prompts with our state-of-the-art image models

black-forest-labs-flux-1-1-pro

text-to-image

FLUX 1.1 [pro]

FLUX 1.1 Pro is an enhanced text-to-image model built on the FLUX.1 architecture, offering improved composition, detail, and rendering speed. With better visual consistency and artistic fidelity, it is well suited to illustration, creative content generation, and e-commerce visual assets, delivering diverse styles with strong prompt alignment.
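
As an illustration, a text-to-image call against an inference API like this one typically looks like the minimal sketch below. The base URL, model identifier, `image_size` parameter, and response shape are assumptions modeled on common OpenAI-style image endpoints; check the platform's API reference for the exact values.

```python
# Minimal text-to-image request sketch. The endpoint path, model id,
# and response fields are assumptions; verify against the API docs.
import os
import requests

API_BASE = "https://api.siliconflow.com/v1"  # assumed base URL

resp = requests.post(
    f"{API_BASE}/images/generations",
    headers={"Authorization": f"Bearer {os.environ['SILICONFLOW_API_KEY']}"},
    json={
        "model": "black-forest-labs/FLUX.1.1-pro",  # assumed model id
        "prompt": "a product photo of a ceramic mug on a marble table",
        "image_size": "1024x1024",                  # assumed parameter name
    },
    timeout=120,
)
resp.raise_for_status()
# Assumed response shape: {"images": [{"url": "..."}]}
print(resp.json()["images"][0]["url"])
```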

black-forest-labs/FLUX-Pro-1.1-ultra

text-to-image

FLUX 1.1 [pro] Ultra

FLUX 1.1 Pro Ultra is the high-resolution version of FLUX 1.1 Pro, capable of generating images up to 4 megapixels (2K resolution). It improves photorealism and prompt controllability for advanced use cases. Ultra mode is optimized for composition and precision, while Raw mode prioritizes natural textures and realism, making it ideal for commercial visual production, art direction, and realistic concept rendering.

text-to-image

FLUX.1 Kontext [max]

FLUX.1 Kontext Max is the most powerful and feature-rich model in the Kontext series, designed for high-resolution, high-precision visual editing and generation. It offers superior prompt adherence, detailed rendering, and advanced typographic control. Ideal for enterprise design systems, marketing visuals, and automated creative pipelines that require robust scene transformations and layout control.

12B

text-to-image

FLUX.1 Kontext [pro]

FLUX.1 Kontext Pro is an advanced image generation and editing model that supports both natural language prompts and reference images. It delivers high semantic understanding, precise local control, and consistent outputs, making it ideal for brand design, product visualization, and narrative illustration. It enables fine-grained edits and context-aware transformations with high fidelity.

12B
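
Because the Kontext models accept a reference image alongside the text prompt, an edit request might look like the sketch below. The model id and the `image` field (here a base64 data URL for the reference image) are hypothetical placeholders; consult the API reference for the actual parameter names.

```python
# Sketch of a reference-image edit with a Kontext-style model.
# The model id and the `image` parameter name are hypothetical.
import base64
import os
import requests

API_BASE = "https://api.siliconflow.com/v1"  # assumed base URL

with open("product.png", "rb") as f:          # any local reference image
    ref = base64.b64encode(f.read()).decode()

resp = requests.post(
    f"{API_BASE}/images/generations",
    headers={"Authorization": f"Bearer {os.environ['SILICONFLOW_API_KEY']}"},
    json={
        "model": "black-forest-labs/FLUX.1-Kontext-pro",  # hypothetical id
        "prompt": "replace the background with a sunlit studio backdrop",
        "image": f"data:image/png;base64,{ref}",          # hypothetical field
    },
    timeout=120,
)
resp.raise_for_status()
print(resp.json()["images"][0]["url"])  # assumed response shape
```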

Video

Video Generation

Create dynamic videos from text descriptions with our cutting-edge video generation models

image-to-video

Wan2.1-I2V-14B-720P

Wan2.1-I2V-14B-720P is an advanced open-source image-to-video generation model and part of the Wan2.1 video foundation model suite. This 14B model generates 720P high-definition video and, across thousands of rounds of human evaluation, reaches state-of-the-art performance. It uses a diffusion transformer architecture and enhances generation quality through an innovative spatiotemporal variational autoencoder (VAE), scalable training strategies, and large-scale data construction. The model also understands and processes both Chinese and English text, providing strong support for video generation tasks.

14B, Img2Video

image-to-video

Wan2.1-I2V-14B-720P (Turbo)

Wan2.1-I2V-14B-720P-Turbo is the TeaCache-accelerated version of Wan2.1-I2V-14B-720P, reducing single-video generation time by 30% while sharing the base model's architecture and capabilities described above.

14B, Img2Video

text-to-video

Wan2.1-T2V-14B

Wan2.1-T2V-14B is an advanced open-source text-to-video generation model. This 14B model sets state-of-the-art performance benchmarks among both open-source and closed-source models, generating high-quality visual content with strong motion dynamics. It is the only video model that can render text in both Chinese and English within generated videos, and it supports generation at 480P and 720P resolutions. The model adopts a diffusion transformer architecture and enhances its generative capabilities through an innovative spatiotemporal variational autoencoder (VAE), scalable training strategies, and large-scale data construction.

14B

text-to-video

Wan2.1-T2V-14B (Turbo)

Wan2.1-T2V-14B (Turbo) is the TeaCache-accelerated version of Wan2.1-T2V-14B, reducing single-video generation time by 30% while retaining the base model's capabilities described above.

14B
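
Video generation on inference platforms is typically asynchronous: submit a job, then poll for the result. The sketch below assumes a submit/status endpoint pair and field names (`requestId`, `status`, `results`) modeled on common async inference APIs; the exact paths and fields may differ, so verify against the platform's API reference.

```python
# Async text-to-video sketch: submit a job, then poll until it finishes.
# Endpoint paths ("/video/submit", "/video/status"), status values, and
# field names ("requestId", "results") are assumptions.
import os
import time
import requests

API_BASE = "https://api.siliconflow.com/v1"  # assumed base URL
HEADERS = {"Authorization": f"Bearer {os.environ['SILICONFLOW_API_KEY']}"}

submit = requests.post(
    f"{API_BASE}/video/submit",
    headers=HEADERS,
    json={
        "model": "Wan-AI/Wan2.1-T2V-14B",  # assumed model id
        "prompt": "a paper boat drifting down a rain-soaked street, cinematic",
    },
    timeout=30,
)
submit.raise_for_status()
request_id = submit.json()["requestId"]

while True:  # poll until the job succeeds or fails
    status = requests.post(
        f"{API_BASE}/video/status",
        headers=HEADERS,
        json={"requestId": request_id},
        timeout=30,
    ).json()
    if status["status"] == "Succeed":
        print(status["results"]["videos"][0]["url"])
        break
    if status["status"] == "Failed":
        raise RuntimeError(status.get("reason", "generation failed"))
    time.sleep(10)
```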

Chat

LLMs

Powerful language models for conversational AI, content generation, and more
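
LLM endpoints on inference platforms like this are usually OpenAI-compatible, so the official `openai` client pointed at the platform's base URL is the shortest path. The base URL and model id below are assumptions for illustration.

```python
# Chat-completion sketch using the OpenAI client against an assumed
# OpenAI-compatible endpoint; base URL and model id are placeholders.
import os
from openai import OpenAI

client = OpenAI(
    base_url="https://api.siliconflow.com/v1",  # assumed base URL
    api_key=os.environ["SILICONFLOW_API_KEY"],
)

resp = client.chat.completions.create(
    model="deepseek-ai/DeepSeek-V3",  # assumed model id; pick any hosted LLM
    messages=[{"role": "user", "content": "Summarize RLHF in two sentences."}],
)
print(resp.choices[0].message.content)
```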

Audio

Audio Models

Our most popular and powerful models, ready for your applications

text-to-speech

Fish-Speech-1.5

Fish Speech V1.5 is a leading open-source text-to-speech (TTS) model. It employs an innovative DualAR architecture with a dual autoregressive transformer design. The model supports multiple languages, with over 300,000 hours of training data for both English and Chinese and over 100,000 hours for Japanese. In independent TTS Arena evaluations it performed exceptionally well, with an Elo score of 1339. The model achieves a word error rate (WER) of 3.5% and a character error rate (CER) of 1.2% for English, and a CER of 1.3% for Chinese characters.

Multilingual

text-to-speech

FunAudioLLM/CosyVoice2-0.5B

CosyVoice 2 is a streaming speech synthesis model based on a large language model, employing a unified streaming/non-streaming framework design. The model improves utilization of the speech token codebook through finite scalar quantization (FSQ), simplifies the text-to-speech language model architecture, and develops a chunk-aware causal streaming matching model that supports different synthesis scenarios. In streaming mode it achieves ultra-low latency of 150 ms while maintaining synthesis quality nearly identical to non-streaming mode. Compared with version 1.0, the pronunciation error rate is reduced by 30-50%, the MOS score improves from 5.4 to 5.53, and fine-grained control over emotion and dialect is supported. The model supports Chinese (including the Cantonese, Sichuan, Shanghainese, and Tianjin dialects, among others), English, Japanese, and Korean, and handles cross-lingual and mixed-language scenarios.

Multilingual, 0.5B
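
Either TTS model can be exercised with an OpenAI-style speech request that returns raw audio bytes, as in the sketch below. The endpoint path, model id, and `voice` value are assumptions; check the model page for the voices actually available.

```python
# Text-to-speech sketch against an assumed OpenAI-style speech endpoint.
# Model id and `voice` value are placeholders; verify in the API docs.
import os
import requests

API_BASE = "https://api.siliconflow.com/v1"  # assumed base URL

resp = requests.post(
    f"{API_BASE}/audio/speech",
    headers={"Authorization": f"Bearer {os.environ['SILICONFLOW_API_KEY']}"},
    json={
        "model": "fishaudio/fish-speech-1.5",       # assumed model id
        "input": "Hello! Your order has shipped and will arrive on Friday.",
        "voice": "fishaudio/fish-speech-1.5:alex",  # assumed voice name
        "response_format": "mp3",
    },
    timeout=60,
)
resp.raise_for_status()
with open("speech.mp3", "wb") as f:
    f.write(resp.content)  # response body is the raw audio
```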

Ready to accelerate your AI development?

© 2025 SiliconFlow Technology PTE. LTD.
