
Qwen
Text Generation
Qwen3-VL-32B-Instruct
Released on: Oct 21, 2025
Qwen3-VL is the vision-language model in the Qwen3 series, achieving state-of-the-art (SOTA) performance on various vision-language (VL) benchmarks. The model supports high-resolution image inputs up to the megapixel level and possesses strong capabilities in general visual understanding, multilingual OCR, fine-grained visual grounding, and visual dialogue. As part of the Qwen3 series, it inherits a powerful language foundation, enabling it to understand and execute complex instructions....
Total Context: 262K
Max output: 262K
Input: $0.2 / M Tokens
Output: $0.6 / M Tokens
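
All prices in this catalog are quoted per million (M) tokens, so the cost of a call is tokens × rate ÷ 1,000,000. A minimal sketch in Python, using the rates listed above for Qwen3-VL-32B-Instruct (the token counts are illustrative, not from the catalog):

```python
# Estimate the USD cost of one request at this catalog's listed rates.
INPUT_PRICE_PER_M = 0.2   # $ per 1M input tokens (Qwen3-VL-32B-Instruct)
OUTPUT_PRICE_PER_M = 0.6  # $ per 1M output tokens

def request_cost(input_tokens: int, output_tokens: int) -> float:
    """Return the cost in USD of a single request."""
    return (input_tokens * INPUT_PRICE_PER_M
            + output_tokens * OUTPUT_PRICE_PER_M) / 1_000_000

# Example: a 2,000-token prompt (image tokens included), an 800-token reply.
print(f"${request_cost(2_000, 800):.6f}")  # -> $0.000880
```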

Qwen
Text Generation
Qwen3-VL-32B-Thinking
Released on: Oct 21, 2025
Qwen3-VL-Thinking is a version of the Qwen3-VL series specially optimized for complex visual reasoning tasks. It incorporates a "Thinking Mode", enabling it to generate detailed intermediate reasoning steps (Chain-of-Thought) before providing a final answer. This design significantly enhances the model's performance on visual question answering (VQA) and other vision-language tasks that require multi-step logic, planning, and in-depth analysis....
Total Context: 262K
Max output: 262K
Input: $0.2 / M Tokens
Output: $1.5 / M Tokens
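
A minimal sketch of calling a Thinking model with an image over an OpenAI-compatible chat completions endpoint. The base URL, API key variable, and model ID are placeholders rather than values from this catalog, and the `<think>…</think>` delimiters reflect a convention Qwen thinking models commonly use; a given provider may instead return the reasoning in a separate response field.

```python
import os
import re
from openai import OpenAI  # pip install openai

# Hypothetical endpoint and credentials -- substitute your provider's values.
client = OpenAI(base_url="https://api.example.com/v1",
                api_key=os.environ["API_KEY"])

resp = client.chat.completions.create(
    model="Qwen3-VL-32B-Thinking",  # assumed model ID
    messages=[{
        "role": "user",
        "content": [
            {"type": "image_url",
             "image_url": {"url": "https://example.com/chart.png"}},
            {"type": "text",
             "text": "What trend does this chart show? Explain step by step."},
        ],
    }],
)

text = resp.choices[0].message.content
# Strip the chain-of-thought block, keeping only the final answer
# (assumes <think>...</think> delimiters -- check your provider's format).
answer = re.sub(r"<think>.*?</think>", "", text, flags=re.DOTALL).strip()
print(answer)
```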

Qwen
Text Generation
Qwen3-VL-8B-Instruct
Released on: Oct 15, 2025
Qwen3-VL-8B-Instruct is a vision-language model in the Qwen3 series that demonstrates strong capabilities in general visual understanding, visual-centric dialogue, and multilingual text recognition in images. ...
Total Context: 262K
Max output: 262K
Input: $0.18 / M Tokens
Output: $0.68 / M Tokens

Qwen
Text Generation
Qwen3-VL-8B-Thinking
Released on: Oct 15, 2025
Qwen3-VL-8B-Thinking is a vision-language model from the Qwen3 series, optimized for scenarios requiring complex reasoning. In Thinking mode, the model reasons step by step before providing its final answer....
Total Context: 262K
Max output: 262K
Input: $0.18 / M Tokens
Output: $2.0 / M Tokens

Qwen
Text Generation
Qwen3-VL-235B-A22B-Instruct
Released on: Oct 4, 2025
Qwen3-VL-235B-A22B-Instruct is a 235B-parameter Mixture-of-Experts (MoE) vision-language model with 22B activated parameters. It is an instruction-tuned version of Qwen3-VL-235B-A22B, aligned for chat applications. ...
Total Context: 262K
Max output: 262K
Input: $0.3 / M Tokens
Output: $1.5 / M Tokens

Qwen
Text Generation
Qwen3-VL-235B-A22B-Thinking
Released on: Oct 4, 2025
Qwen3-VL-235B-A22B-Thinking is the reasoning-enhanced Thinking edition of Qwen3-VL-235B-A22B. It achieves state-of-the-art (SOTA) results across many multimodal reasoning benchmarks, excelling at STEM and math reasoning, causal analysis, and logical, evidence-based answers. It features a Mixture-of-Experts (MoE) architecture with 235B total parameters and 22B active parameters. ...
Total Context: 262K
Max output: 262K
Input: $0.45 / M Tokens
Output: $3.5 / M Tokens

Qwen
Text Generation
Qwen3-VL-30B-A3B-Instruct
Released on: Oct 5, 2025
Qwen3-VL-30B-A3B-Instruct belongs to the Qwen3-VL series, which delivers superior text understanding & generation, deeper visual perception & reasoning, extended context length, enhanced comprehension of spatial relationships and video dynamics, and stronger agent interaction capabilities. The series is available in Dense and MoE architectures that scale from edge to cloud, with Instruct and reasoning-enhanced Thinking editions....
Total Context: 262K
Max output: 262K
Input: $0.29 / M Tokens
Output: $1.0 / M Tokens

Qwen
Text Generation
Qwen3-VL-30B-A3B-Thinking
Released on: Oct 11, 2025
Qwen3-VL-30B-A3B-Thinking belongs to the Qwen3-VL series, which delivers superior text understanding & generation, deeper visual perception & reasoning, extended context length, enhanced comprehension of spatial relationships and video dynamics, and stronger agent interaction capabilities. The series is available in Dense and MoE architectures that scale from edge to cloud, with Instruct and reasoning-enhanced Thinking editions....
Total Context: 262K
Max output: 262K
Input: $0.29 / M Tokens
Output: $1.0 / M Tokens

Z.ai
Text Generation
GLM-4.5V
Released on: Aug 13, 2025
As part of the GLM-V family of models, GLM-4.5V is built on Zhipu AI's foundation model GLM-4.5-Air, achieving SOTA performance on tasks such as image, video, and document understanding, as well as GUI agent operations....
Total Context: 66K
Max output: 66K
Input: $0.14 / M Tokens
Output: $0.86 / M Tokens

Qwen
Text Generation
Qwen3-Omni-30B-A3B-Captioner
Released on: Oct 4, 2025
Qwen3-Omni-30B-A3B-Captioner is a captioning model from Alibaba's Qwen team, part of the Qwen3-Omni series. Fine-tuned from Qwen3-Omni-30B-A3B-Instruct, it is specifically designed to generate detailed, low-hallucination captions for arbitrary audio inputs. Based on a Mixture-of-Experts (MoE) architecture with 30B total parameters, the model can deeply understand audio content and translate it into rich, natural-language text...
Total Context: 66K
Max output: 66K
Input: $0.1 / M Tokens
Output: $0.4 / M Tokens

Qwen
Text Generation
Qwen3-Omni-30B-A3B-Instruct
Released on: Oct 4, 2025
Qwen3-Omni-30B-A3B-Instruct is a member of the Qwen3-Omni series from Alibaba's Qwen team. It is a Mixture-of-Experts (MoE) model with 30 billion total parameters and 3 billion active parameters, which effectively reduces inference costs while maintaining powerful performance. The model was trained on high-quality, multi-source, multilingual data and demonstrates excellent performance in basic capabilities such as multilingual dialogue, as well as in code, math...
Total Context: 66K
Max output: 66K
Input: $0.1 / M Tokens
Output: $0.4 / M Tokens

Qwen
Text Generation
Qwen3-Omni-30B-A3B-Thinking
Released on: Oct 4, 2025
Qwen3-Omni-30B-A3B-Thinking is the core "Thinker" component within the Qwen3-Omni omni-modal model's "Thinker-Talker" architecture. It is specifically designed to process multimodal inputs, including text, audio, images, and video, and to execute complex chain-of-thought reasoning. As the reasoning brain of the system, this model unifies all inputs into a common representational space for understanding and analysis, but its output is text-only. This design allows it to excel at solving complex problems that require deep thought and cross-modal understanding, such as mathematical problems presented in images, making it key to the powerful cognitive abilities of the entire Qwen3-Omni architecture...
Total Context: 66K
Max output: 66K
Input: $0.1 / M Tokens
Output: $0.4 / M Tokens

StepFun
Text Generation
step3
Released on: Aug 6, 2025
Step3 is a cutting-edge multimodal reasoning model from StepFun. It is built on a Mixture-of-Experts (MoE) architecture with 321B total parameters and 38B active parameters. The model is designed end-to-end to minimize decoding costs while delivering top-tier performance in vision-language reasoning. Through the co-design of Multi-Matrix Factorization Attention (MFA) and Attention-FFN Disaggregation (AFD), Step3 maintains exceptional efficiency across both flagship and low-end accelerators. During pretraining, Step3 processed over 20T text tokens and 4T image-text mixed tokens, spanning more than ten languages. The model has achieved state-of-the-art performance for open-source models on various benchmarks, including math, code, and multimodality...
Total Context: 66K
Max output: 66K
Input: $0.57 / M Tokens
Output: $1.42 / M Tokens

Z.ai
Text Generation
GLM-4.1V-9B-Thinking
Released on: Jul 4, 2025
GLM-4.1V-9B-Thinking is an open-source Vision-Language Model (VLM) jointly released by Zhipu AI and Tsinghua University's KEG lab, designed to advance general-purpose multimodal reasoning. Built upon the GLM-4-9B-0414 foundation model, it introduces a "thinking paradigm" and leverages Reinforcement Learning with Curriculum Sampling (RLCS) to significantly enhance its capabilities on complex tasks. As a 9B-parameter model, it achieves state-of-the-art performance among models of similar size and matches or even surpasses the much larger 72B-parameter Qwen2.5-VL-72B on 18 different benchmarks. The model excels at a diverse range of tasks, including STEM problem-solving, video understanding, and long-document understanding, and it can handle images with resolutions up to 4K and arbitrary aspect ratios...
Total Context: 66K
Max output: 66K
Input: $0.035 / M Tokens
Output: $0.14 / M Tokens

Qwen
Text Generation
Qwen2.5-VL-32B-Instruct
Released on: Mar 24, 2025
Qwen2.5-VL-32B-Instruct is a multimodal large language model released by the Qwen team, part of the Qwen2.5-VL series. The model is not only proficient at recognizing common objects but is also highly capable of analyzing text, charts, icons, graphics, and layouts within images. It acts as a visual agent that can reason and dynamically direct tools, enabling computer and phone use. Additionally, the model can accurately localize objects in images and generate structured outputs for data such as invoices and tables. Compared to its predecessor Qwen2-VL, this version has enhanced mathematical and problem-solving abilities through reinforcement learning, with response styles adjusted to better align with human preferences...
Total Context: 131K
Max output: 131K
Input: $0.27 / M Tokens
Output: $0.27 / M Tokens
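
Since the card above highlights object localization and structured outputs, here is a minimal sketch of prompting for JSON bounding boxes over the same assumed OpenAI-compatible endpoint as the earlier example. The model ID, prompt wording, and the `bbox_2d` key are assumptions for illustration; consult Qwen2.5-VL's documentation for the grounding format the model actually emits.

```python
import json
import os
from openai import OpenAI  # same hypothetical endpoint as the earlier sketch

client = OpenAI(base_url="https://api.example.com/v1",
                api_key=os.environ["API_KEY"])

resp = client.chat.completions.create(
    model="Qwen2.5-VL-32B-Instruct",  # assumed model ID
    messages=[{
        "role": "user",
        "content": [
            {"type": "image_url",
             "image_url": {"url": "https://example.com/invoice.png"}},
            {"type": "text",
             "text": ('Locate every line item on this invoice. Reply with '
                      'JSON only: a list of {"label": str, '
                      '"bbox_2d": [x1, y1, x2, y2]} objects.')},
        ],
    }],
)

# Parse the reply; production code should validate the schema and handle
# replies that are not valid JSON.
for item in json.loads(resp.choices[0].message.content):
    print(item["label"], item["bbox_2d"])
```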

Qwen
Text Generation
Qwen2.5-VL-72B-Instruct
Released on: Jan 28, 2025
Qwen2.5-VL is a vision-language model in the Qwen2.5 series that shows significant enhancements in several aspects: it has strong visual understanding capabilities, recognizing common objects while analyzing texts, charts, and layouts in images; it functions as a visual agent capable of reasoning and dynamically directing tools; it can comprehend videos over 1 hour long and capture key events; it accurately localizes objects in images by generating bounding boxes or points; and it supports structured outputs for scanned data like invoices and forms. The model demonstrates excellent performance across various benchmarks including image, video, and agent tasks...
Total Context: 131K
Max output: 4K
Input: $0.59 / M Tokens
Output: $0.59 / M Tokens

Qwen
Text Generation
Qwen2.5-VL-7B-Instruct
Released on: Jan 28, 2025
Qwen2.5-VL is a new member of the Qwen series with powerful visual comprehension capabilities. It can analyze text, charts, and layouts within images, understand long videos, and capture key events. It is capable of reasoning, manipulating tools, supporting multi-format object localization, and generating structured outputs. The model has been optimized for dynamic resolution and frame-rate training in video understanding, and the efficiency of its visual encoder has been improved....
Total Context: 33K
Max output: 4K
Input: $0.05 / M Tokens
Output: $0.05 / M Tokens

DeepSeek
Text Generation
deepseek-vl2
Released on: Dec 13, 2024
DeepSeek-VL2 is a Mixture-of-Experts (MoE) vision-language model built on DeepSeekMoE-27B, employing a sparsely activated MoE architecture to achieve superior performance with only 4.5B active parameters. The model excels at various tasks including visual question answering, optical character recognition, document/table/chart understanding, and visual grounding. Compared to existing open-source dense and MoE-based models, it demonstrates competitive or state-of-the-art performance with the same or fewer active parameters....
Total Context: 4K
Max output: 4K
Input: $0.15 / M Tokens
Output: $0.15 / M Tokens

