Featured
Qwen/Qwen-Image-Edit
Image
Image Generation
Generate high-quality images from text prompts with our state-of-the-art image models

image-to-image
Qwen-Image-Edit
Qwen-Image-Edit is the image editing version of Qwen-Image, released by Alibaba's Qwen team. Built upon the 20B Qwen-Image model, it has been further trained to extend its unique text rendering capabilities to image editing tasks, enabling precise text editing within images. Furthermore, Qwen-Image-Edit utilizes an innovative architecture that feeds the input image into both Qwen2.5-VL (for visual semantic control) and a VAE Encoder (for visual appearance control), giving it both semantic and appearance editing capabilities. This allows it to support not only low-level visual appearance edits like adding, removing, or modifying elements, but also high-level visual semantic editing such as IP creation and style transfer, which require maintaining semantic consistency. The model has achieved state-of-the-art (SOTA) performance on multiple public benchmarks, establishing it as a powerful foundation model for image editing.
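A minimal sketch of what an edit request could look like over a generic REST image-editing endpoint. The URL, field names, and response shape below are placeholders for illustration, not the platform's documented API; consult the API reference for the actual contract.

```python
# Hypothetical image-edit request for Qwen-Image-Edit (all endpoint/field names are assumptions).
import base64
import requests

API_URL = "https://api.example.com/v1/images/edits"   # placeholder endpoint
API_KEY = "YOUR_API_KEY"

with open("storefront.png", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode("utf-8")

payload = {
    "model": "Qwen/Qwen-Image-Edit",
    "prompt": "Replace the sign text with 'GRAND OPENING' and keep the original font style",
    "image": f"data:image/png;base64,{image_b64}",   # input image drives both appearance and semantic control
}

resp = requests.post(
    API_URL,
    headers={"Authorization": f"Bearer {API_KEY}"},
    json=payload,
    timeout=120,
)
resp.raise_for_status()
print(resp.json())   # typically a URL or base64 string for the edited image
```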

text-to-image
Qwen-Image
Qwen-Image is an image generation foundation model released by the Alibaba Qwen team, featuring 20 billion parameters. The model has achieved significant advances in complex text rendering and precise image editing, excelling particularly at generating images with high-fidelity Chinese and English text. Qwen-Image can handle multi-line layouts and paragraph-level text while maintaining layout coherence and contextual harmony in the generated images. Beyond its superior text-rendering capabilities, the model supports a wide range of artistic styles, from photorealistic scenes to anime aesthetics, adapting fluidly to various creative prompts. It also possesses powerful image editing and understanding abilities, supporting advanced operations such as style transfer, object insertion or removal, detail enhancement, text editing, and even human pose manipulation, aiming to be a comprehensive foundation model for intelligent visual creation and manipulation where language, layout, and imagery converge.
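The sketch below shows a text-to-image request that exercises Qwen-Image's bilingual text rendering. The endpoint path and parameter names ("image_size", "num_inference_steps", the "images" response field) are illustrative assumptions rather than documented values.

```python
# Hypothetical text-to-image request for Qwen-Image.
import requests

API_URL = "https://api.example.com/v1/images/generations"  # placeholder endpoint
API_KEY = "YOUR_API_KEY"

payload = {
    "model": "Qwen/Qwen-Image",
    "prompt": "A neon cafe sign that reads '欢迎光临 Welcome', rainy street at night, photorealistic",
    "image_size": "1024x1024",   # assumed parameter name
    "num_inference_steps": 30,   # assumed parameter name
}

resp = requests.post(
    API_URL,
    headers={"Authorization": f"Bearer {API_KEY}"},
    json=payload,
    timeout=120,
)
resp.raise_for_status()
for item in resp.json().get("images", []):   # assumed response field
    print(item.get("url"))
```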

text-to-image
FLUX.1 Kontext [pro]
FLUX.1 Kontext Pro is an advanced image generation and editing model that supports both natural language prompts and reference images. It delivers high semantic understanding, precise local control, and consistent outputs, making it ideal for brand design, product visualization, and narrative illustration. It enables fine-grained edits and context-aware transformations with high fidelity.
12B
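Because Kontext Pro accepts both a prompt and a reference image, a request pairs the two. The model identifier, endpoint, and "reference_image" field below are assumptions for illustration only.

```python
# Hypothetical context-aware edit request for FLUX.1 Kontext [pro].
import base64
import requests

API_URL = "https://api.example.com/v1/images/generations"  # placeholder endpoint
API_KEY = "YOUR_API_KEY"

with open("product_shot.jpg", "rb") as f:
    reference_b64 = base64.b64encode(f.read()).decode("utf-8")

payload = {
    "model": "black-forest-labs/FLUX.1-Kontext-pro",   # assumed model identifier
    "prompt": "Place the product on a marble countertop with soft morning light, keep the label unchanged",
    "reference_image": f"data:image/jpeg;base64,{reference_b64}",  # assumed field name
}

resp = requests.post(API_URL, headers={"Authorization": f"Bearer {API_KEY}"}, json=payload, timeout=120)
resp.raise_for_status()
print(resp.json())
```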

text-to-image
FLUX.1 Kontext [max]
FLUX.1 Kontext Max is the most powerful and feature-rich model in the Kontext series, designed for high-resolution, high-precision visual editing and generation. It offers superior prompt adherence, detailed rendering, and advanced typographic control. Ideal for enterprise design systems, marketing visuals, and automated creative pipelines that require robust scene transformations and layout control.
12B

Video
Video Generation
Create dynamic videos from text descriptions with our cutting-edge video generation models

image-to-video
Wan2.2-I2V-A14B
Wan2.2-I2V-A14B is one of the industry's first open-source image-to-video generation models featuring a Mixture-of-Experts (MoE) architecture, released by Alibaba's AI initiative, Wan-AI. The model specializes in transforming a static image into a smooth, natural video sequence based on a text prompt. Its key innovation is the MoE architecture, which employs a high-noise expert for the initial video layout and a low-noise expert to refine details in later stages, enhancing model performance without increasing inference costs. Compared to its predecessors, Wan2.2 was trained on a significantly larger dataset, which notably improves its ability to handle complex motion, aesthetics, and semantics, resulting in more stable videos with fewer unrealistic camera movements.
MoE, 27B
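Video generation is typically asynchronous: a job is submitted, then polled until the result is ready. The sketch below assumes a submit/status endpoint pair; the paths, field names, and status values are placeholders, not a documented interface.

```python
# Hypothetical submit-and-poll flow for Wan2.2-I2V-A14B image-to-video generation.
import base64
import time
import requests

BASE_URL = "https://api.example.com/v1/videos"   # placeholder base path
API_KEY = "YOUR_API_KEY"
HEADERS = {"Authorization": f"Bearer {API_KEY}"}

with open("first_frame.jpg", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode("utf-8")

submit = requests.post(
    f"{BASE_URL}/submit",
    headers=HEADERS,
    json={
        "model": "Wan-AI/Wan2.2-I2V-A14B",
        "prompt": "The camera slowly pushes in as leaves drift across the courtyard",
        "image": f"data:image/jpeg;base64,{image_b64}",   # the static frame to animate
    },
    timeout=60,
)
submit.raise_for_status()
request_id = submit.json()["requestId"]   # assumed response field

# Poll until the job finishes (status values are assumptions).
while True:
    status = requests.post(f"{BASE_URL}/status", headers=HEADERS,
                           json={"requestId": request_id}, timeout=60).json()
    if status.get("status") in ("Succeed", "Failed"):
        break
    time.sleep(10)

print(status)   # on success, typically contains a downloadable video URL
```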

text-to-video
Wan2.2-T2V-A14B
Wan2.2-T2V-A14B is the industry's first open-source video generation model with a Mixture-of-Experts (MoE) architecture, released by Alibaba. This model focuses on text-to-video (T2V) generation, capable of producing 5-second videos at both 480P and 720P resolutions. By introducing an MoE architecture, it expands the total model capacity while keeping inference costs nearly unchanged; it features a high-noise expert for the early stages to handle the overall layout and a low-noise expert for later stages to refine video details. Furthermore, Wan2.2 incorporates meticulously curated aesthetic data with detailed labels for lighting, composition, and color, allowing for more precise and controllable generation of cinematic styles. Compared to its predecessor, the model was trained on significantly larger datasets, which notably enhances its generalization across motion, semantics, and aesthetics, enabling better handling of complex dynamic effects.
MoE, 27B
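For text-to-video, the request is the same submit-and-poll pattern sketched above for the I2V model, only without an input image and with an optional output resolution. The payload below is again illustrative; "image_size" is an assumed parameter name.

```python
# Hypothetical text-to-video payload for Wan2.2-T2V-A14B (used with the submit/poll flow above).
payload = {
    "model": "Wan-AI/Wan2.2-T2V-A14B",
    "prompt": "A paper boat drifting down a rain-soaked alley, cinematic lighting, slow pan",
    "image_size": "1280x720",   # assumed parameter; 480P and 720P outputs are supported
}
```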

image-to-video
Wan2.1-I2V-14B-720P
Wan2.1-I2V-14B-720P is an open-source advanced image-to-video generation model, part of the Wan2.1 video foundation model suite. This 14B model generates 720P high-definition videos and, after thousands of rounds of human evaluation, reaches state-of-the-art performance levels. It utilizes a diffusion transformer architecture and enhances generation capabilities through innovative spatiotemporal variational autoencoders (VAE), scalable training strategies, and large-scale data construction. The model also understands and processes both Chinese and English text, providing powerful support for video generation tasks.
14B, Img2Video

image-to-video
Wan2.1-I2V-14B-720P (Turbo)
Wan2.1-I2V-14B-720P-Turbo is the TeaCache-accelerated version of the Wan2.1-I2V-14B-720P model, reducing single-video generation time by 30%. Wan2.1-I2V-14B-720P is an open-source advanced image-to-video generation model, part of the Wan2.1 video foundation model suite. This 14B model generates 720P high-definition videos and, after thousands of rounds of human evaluation, reaches state-of-the-art performance levels. It utilizes a diffusion transformer architecture and enhances generation capabilities through innovative spatiotemporal variational autoencoders (VAE), scalable training strategies, and large-scale data construction. The model also understands and processes both Chinese and English text, providing powerful support for video generation tasks.
14B, Img2Video

text-to-video
Wan2.1-T2V-14B
Wan2.1-T2V-14B is an open-source advanced text-to-video generation model. This 14B model has established state-of-the-art performance benchmarks among both open-source and closed-source models, capable of generating high-quality visual content with significant dynamic effects. It is the only video model that can simultaneously generate text in both Chinese and English, and supports video generation at 480P and 720P resolutions. The model adopts a diffusion transformer architecture and enhances its generative capabilities through an innovative spatiotemporal variational autoencoder (VAE), scalable training strategies, and large-scale data construction.
14B

text-to-video
Wan2.1-T2V-14B (Turbo)
Wan2.1-T2V-14B-Turbo is the TeaCache-accelerated version of the Wan2.1-T2V-14B model, reducing single-video generation time by 30%. The Wan2.1-T2V-14B model has established state-of-the-art performance benchmarks among both open-source and closed-source models, capable of generating high-quality visual content with significant dynamic effects. It is the only video model that can simultaneously generate text in both Chinese and English, and supports video generation at 480P and 720P resolutions. The model adopts a diffusion transformer architecture and enhances its generative capabilities through an innovative spatiotemporal variational autoencoder (VAE), scalable training strategies, and large-scale data construction.
14B

Chat
LLMs
Powerful language models for conversational AI, content generation, and more
Audio
Audio Models
Our most popular and powerful models, ready for your applications
text-to-speech
IndexTTS-2
IndexTTS2 is a breakthrough auto-regressive zero-shot Text-to-Speech (TTS) model designed to address the challenge of precise duration control in large-scale TTS systems, which is a significant limitation in applications like video dubbing. It introduces a novel, general method for speech duration control, supporting two modes: one that explicitly specifies the number of generated tokens for precise duration, and another that generates speech freely in an auto-regressive manner. Furthermore, IndexTTS2 achieves disentanglement between emotional expression and speaker identity, enabling independent control over timbre and emotion via separate prompts. To enhance speech clarity in highly emotional expressions, the model incorporates GPT latent representations and utilizes a novel three-stage training paradigm. To lower the barrier for emotional control, it also features a soft instruction mechanism based on text descriptions, developed by fine-tuning Qwen3, to effectively guide the generation of speech with the desired emotional tone. Experimental results show that IndexTTS2 outperforms state-of-the-art zero-shot TTS models in word error rate, speaker similarity, and emotional fidelity across multiple datasets.
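A minimal sketch of a synthesis request that exercises IndexTTS-2's two control handles described above: a reference-voice (timbre) prompt and a separate text description of the desired emotion. The endpoint, model identifier, and field names ("voice", "emotion_text") are placeholders, not a documented interface.

```python
# Hypothetical zero-shot TTS request for IndexTTS-2 with separate timbre and emotion prompts.
import base64
import requests

API_URL = "https://api.example.com/v1/audio/speech"   # placeholder endpoint
API_KEY = "YOUR_API_KEY"

with open("reference_voice.wav", "rb") as f:
    voice_b64 = base64.b64encode(f.read()).decode("utf-8")

payload = {
    "model": "IndexTeam/IndexTTS-2",                     # assumed model identifier
    "input": "The storm has passed, and the harbor is quiet again.",
    "voice": f"data:audio/wav;base64,{voice_b64}",       # zero-shot timbre prompt
    "emotion_text": "calm, relieved, slightly tired",    # assumed soft-instruction field
    "response_format": "wav",
}

resp = requests.post(API_URL, headers={"Authorization": f"Bearer {API_KEY}"}, json=payload, timeout=120)
resp.raise_for_status()
with open("output.wav", "wb") as f:
    f.write(resp.content)   # binary audio returned in the response body
```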

text-to-speech
FunAudioLLM/CosyVoice2-0.5B
CosyVoice 2 is a streaming speech synthesis model based on a large language model, employing a unified streaming/non-streaming framework design. The model enhances the utilization of the speech token codebook through finite scalar quantization (FSQ), simplifies the text-to-speech language model architecture, and develops a chunk-aware causal flow matching model that supports different synthesis scenarios. In streaming mode, the model achieves ultra-low latency of 150ms while maintaining synthesis quality almost identical to that of non-streaming mode. Compared to version 1.0, the pronunciation error rate has been reduced by 30%-50%, the MOS score has improved from 5.4 to 5.53, and fine-grained control over emotions and dialects is supported. The model supports Chinese (including the Cantonese, Sichuan, Shanghainese, and Tianjin dialects, among others), English, Japanese, and Korean, as well as cross-lingual and mixed-language scenarios.
Multilingual, 0.5B
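To take advantage of the low-latency streaming mode, audio chunks can be consumed as they arrive instead of waiting for the complete file. The endpoint, voice-naming scheme, and "stream" flag below are illustrative assumptions.

```python
# Hypothetical streaming synthesis request for CosyVoice2-0.5B.
import requests

API_URL = "https://api.example.com/v1/audio/speech"   # placeholder endpoint
API_KEY = "YOUR_API_KEY"

payload = {
    "model": "FunAudioLLM/CosyVoice2-0.5B",
    "input": "各位旅客请注意，开往杭州的列车即将发车。",  # Chinese text; dialects are also supported
    "voice": "FunAudioLLM/CosyVoice2-0.5B:alex",          # assumed preset-voice naming scheme
    "stream": True,                                        # assumed flag for streaming mode
    "response_format": "pcm",
}

with requests.post(API_URL, headers={"Authorization": f"Bearer {API_KEY}"},
                   json=payload, stream=True, timeout=120) as resp:
    resp.raise_for_status()
    with open("announcement.pcm", "wb") as out:
        for chunk in resp.iter_content(chunk_size=4096):   # write (or play) chunks as they stream in
            out.write(chunk)
```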

text-to-speech
Fish-Speech-1.5
Fish Speech V1.5 is a leading open-source text-to-speech (TTS) model. The model employs an innovative DualAR architecture, featuring a dual autoregressive transformer design. It supports multiple languages, with over 300,000 hours of training data for both English and Chinese, and over 100,000 hours for Japanese. In independent evaluations by TTS Arena, the model performed exceptionally well, with an Elo score of 1339. The model achieved a word error rate (WER) of 3.5% and a character error rate (CER) of 1.2% for English, and a CER of 1.3% for Chinese characters.
Multilingual
Ready to accelerate your AI development?


