
GLM-4.1V-9B-Thinking API, Deployment, Pricing
THUDM/GLM-4.1V-9B-Thinking
GLM-4.1V-9B-Thinking is an open-source vision-language model (VLM) jointly released by Zhipu AI and Tsinghua University's KEG lab, designed to advance general-purpose multimodal reasoning. Built upon the GLM-4-9B-0414 foundation model, it introduces a 'thinking paradigm' and leverages Reinforcement Learning with Curriculum Sampling (RLCS) to significantly strengthen its performance on complex tasks. At 9B parameters it achieves state-of-the-art results among models of similar size, and it matches or even surpasses the much larger 72B-parameter Qwen2.5-VL-72B on 18 benchmarks. The model excels at a diverse range of tasks, including STEM problem solving, video understanding, and long-document understanding, and it can handle images at resolutions up to 4K with arbitrary aspect ratios.
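For reference, the sketch below shows one way to query the model through an OpenAI-compatible chat completions endpoint with an image plus a text prompt. The base URL, API-key environment variable, model identifier, and image URL are placeholders, not values confirmed by this page; replace them with the values from your provider's documentation.

import os
from openai import OpenAI

# Hypothetical sketch: call GLM-4.1V-9B-Thinking through an OpenAI-compatible
# chat completions endpoint. The base_url, API-key variable, model id, and
# image URL below are placeholders.
client = OpenAI(
    base_url="https://api.example.com/v1",   # placeholder endpoint
    api_key=os.environ["EXAMPLE_API_KEY"],   # placeholder credential
)

response = client.chat.completions.create(
    model="THUDM/GLM-4.1V-9B-Thinking",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "image_url",
                 "image_url": {"url": "https://example.com/physics-diagram.png"}},
                {"type": "text",
                 "text": "Explain the problem shown in this diagram step by step."},
            ],
        }
    ],
    max_tokens=1024,
)

print(response.choices[0].message.content)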
Details
Model Provider
Z.ai
Type
text
Sub Type
chat
Size
9B
Publish Time
Jul 4, 2025
Input Price
$0.035 / M Tokens
Output Price
$0.14 / M Tokens
Context length
66K
Tags
VLM, 9B, 66K
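Using the prices listed above ($0.035 per million input tokens, $0.14 per million output tokens), a per-request cost estimate is simple arithmetic; the token counts in the sketch below are illustrative only.

# Cost estimate from the listed prices for GLM-4.1V-9B-Thinking:
# $0.035 per million input tokens, $0.14 per million output tokens.
INPUT_PRICE_PER_M = 0.035
OUTPUT_PRICE_PER_M = 0.14

def estimate_cost(input_tokens: int, output_tokens: int) -> float:
    """Estimated USD cost of a single request."""
    return (input_tokens * INPUT_PRICE_PER_M
            + output_tokens * OUTPUT_PRICE_PER_M) / 1_000_000

# Illustrative workload: ~3,000 prompt tokens (image + text) and a
# ~1,500-token reasoning trace -> about $0.0003 per request.
print(f"${estimate_cost(3_000, 1_500):.6f}")   # $0.000315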
Compare with Other Models
See how this model stacks up against others.
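A short cost-comparison sketch using the prices listed in these cards follows the last card below.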

Qwen
chat
Qwen3-VL-235B-A22B-Instruct
Released on: Oct 4, 2025
Qwen3-VL-235B-A22B-Instruct is a 235B-parameter Mixture-of-Experts (MoE) vision-language model with 22B activated parameters. It is the instruction-tuned version of Qwen3-VL-235B-A22B, aligned for chat applications. Qwen3-VL is a series of multimodal models that accept both text and image inputs and are trained on large-scale multimodal data. It demonstrates advanced capabilities in understanding and reasoning over text and images...
Total Context: 262K
Max output: 262K
Input: $0.3 / M Tokens
Output: $1.5 / M Tokens

Qwen
chat
Qwen3-VL-235B-A22B-Thinking
Released on: Oct 4, 2025
Qwen3-VL is the most powerful vision-language model in the Qwen series to date, delivering comprehensive upgrades across text understanding and generation, visual perception and reasoning, context length, spatial and video dynamics comprehension, and agent interaction capabilities. Qwen3-VL-235B-A22B-Thinking is one of the series' flagship models, a reasoning-enhanced "Thinking" edition that achieves state-of-the-art (SOTA) results across many multimodal reasoning benchmarks, excelling in STEM, math, causal analysis, and logical, evidence-based answers. It features a Mixture-of-Experts (MoE) architecture with 235B total parameters and 22B active parameters. The model natively supports a 262K context length, expandable to 1 million, allowing it to process entire textbooks or hours-long videos. Furthermore, it possesses strong visual agent capabilities, enabling it to operate PC/mobile GUIs, convert sketches into code, and perform 3D grounding, laying the foundation for complex spatial reasoning and embodied AI applications...
Total Context: 262K
Max output: 262K
Input: $0.45 / M Tokens
Output: $3.5 / M Tokens

Qwen
chat
Qwen3-VL-30B-A3B-Instruct
Released on: Oct 5, 2025
Qwen3-VL is the most powerful vision-language model in the Qwen series to date. This generation delivers comprehensive upgrades, including superior text understanding and generation, deeper visual perception and reasoning, extended context length, and stronger agent interaction capabilities. As the instruction-tuned (Instruct) version built on a Mixture-of-Experts (MoE) architecture, it is designed for flexible, on-demand deployment and features powerful capabilities like a visual agent, visual coding, and video understanding, with native support for a 262K context length...
Total Context: 262K
Max output: 262K
Input: $0.29 / M Tokens
Output: $1.0 / M Tokens

Qwen
chat
Qwen3-VL-30B-A3B-Thinking
Released on: Oct 11, 2025
Qwen3-VL is the most powerful vision-language model in the Qwen series to date. This generation delivers comprehensive upgrades across the board, including superior text understanding and generation, deeper visual perception and reasoning, extended context length, enhanced spatial and video dynamics comprehension, and stronger agent interaction capabilities. This reasoning-enhanced Thinking edition is built on a Mixture-of-Experts (MoE) architecture, excelling in tasks like operating PC/mobile GUIs, generating code from images, and advanced multimodal reasoning in STEM fields. It supports a native 262K context length and has expanded OCR capabilities for 32 languages...
Total Context: 262K
Max output: 262K
Input: $0.29 / M Tokens
Output: $1.0 / M Tokens

inclusionAI
chat
Ling-1T
Released on: Oct 11, 2025
Ling-1T is the first flagship non-thinking model in the Ling 2.0 series, featuring 1 trillion total parameters with ≈ 50 billion active parameters per token. Built on the Ling 2.0 architecture, Ling-1T is designed to push the limits of efficient reasoning and scalable cognition. Pre-trained on 20 trillion+ high-quality, reasoning-dense tokens, Ling-1T-base supports up to 131K context length and adopts an evolutionary chain-of-thought (Evo-CoT) process across mid-training and post-training. This curriculum greatly enhances the model’s efficiency and reasoning depth, allowing Ling-1T to achieve state-of-the-art performance on multiple complex reasoning benchmarks—balancing accuracy and efficiency...
Total Context: 131K
Max output: 131K
Input: $0.57 / M Tokens
Output: $2.28 / M Tokens

DeepSeek
chat
DeepSeek-V3.2-Exp
Released on: Oct 10, 2025
DeepSeek-V3.2-Exp is an experimental version of the model, serving as an intermediate step toward a next-generation architecture. It builds upon V3.1-Terminus by introducing DeepSeek Sparse Attention (DSA), a sparse attention mechanism designed to explore and validate optimizations for training and inference efficiency in long-context scenarios. This release represents ongoing research into more efficient transformer architectures, focusing on improving computational efficiency when processing extended text sequences. DSA achieves fine-grained sparse attention for the first time, delivering substantial improvements in long-context training and inference efficiency while maintaining virtually identical model output quality...
Total Context: 164K
Max output: 164K
Input: $0.27 / M Tokens
Output: $0.41 / M Tokens

Qwen
chat
Qwen3-Omni-30B-A3B-Captioner
Released on: Oct 4, 2025
Qwen3-Omni-30B-A3B-Captioner is a Vision-Language Model (VLM) from Alibaba's Qwen team, part of the Qwen3 series. It is specifically designed for generating high-quality, detailed, and accurate image captions. Based on a 30B total parameter Mixture of Experts (MoE) architecture, the model can deeply understand image content and translate it into rich, natural language text...
Total Context: 66K
Max output: 66K
Input: $0.1 / M Tokens
Output: $0.4 / M Tokens

Qwen
chat
Qwen3-Omni-30B-A3B-Instruct
Released on: Oct 4, 2025
Qwen3-Omni-30B-A3B-Instruct is a member of the latest Qwen3 series from Alibaba's Qwen team. It is a Mixture of Experts (MoE) model with 30 billion total parameters and 3 billion active parameters, which effectively reduces inference costs while maintaining powerful performance. The model was trained on high-quality, multi-source, and multilingual data, demonstrating excellent performance in basic capabilities such as multilingual dialogue, as well as in code, math...
Total Context: 66K
Max output: 66K
Input: $0.1 / M Tokens
Output: $0.4 / M Tokens

Qwen
chat
Qwen3-Omni-30B-A3B-Thinking
Released on: Oct 4, 2025
Qwen3-Omni-30B-A3B-Thinking is the core "Thinker" component within the Qwen3-Omni omni-modal model's "Thinker-Talker" architecture. It is specifically designed to process multimodal inputs, including text, audio, images, and video, and to execute complex chain-of-thought reasoning. As the reasoning brain of the system, this model unifies all inputs into a common representational space for understanding and analysis, but its output is text-only. This design allows it to excel at solving complex problems that require deep thought and cross-modal understanding, such as mathematical problems presented in images, making it key to the powerful cognitive abilities of the entire Qwen3-Omni architecture...
Total Context: 66K
Max output: 66K
Input: $0.1 / M Tokens
Output: $0.4 / M Tokens

Z.ai
chat
GLM-4.6
Released on: Oct 4, 2025
Compared with GLM-4.5, GLM-4.6 brings several key improvements. Its context window is expanded from 128K to 200K tokens, enabling the model to handle more complex agentic tasks. The model achieves higher scores on code benchmarks and demonstrates better real-world performance in applications such as Claude Code, Cline, Roo Code, and Kilo Code, including improvements in generating visually polished front-end pages. GLM-4.6 shows a clear improvement in reasoning performance and supports tool use during inference, leading to stronger overall capability. It also performs better in tool-use and search-based agent settings and integrates more effectively within agent frameworks. For writing, it better aligns with human preferences in style and readability, and performs more naturally in role-playing scenarios...
Total Context: 205K
Max output: 205K
Input: $0.5 / M Tokens
Output: $1.9 / M Tokens

inclusionAI
chat
Ring-flash-2.0
Released on: Sep 29, 2025
Ring-flash-2.0 is a high-performance thinking model, deeply optimized based on Ling-flash-2.0-base. It is a Mixture-of-Experts (MoE) model with a total of 100B parameters, but only 6.1B are activated per inference. The model leverages the independently developed 'icepop' algorithm to address the training instability challenges in reinforcement learning (RL) for MoE LLMs, enabling continuous improvement of its complex reasoning capabilities throughout extended RL training cycles. Ring-flash-2.0 demonstrates significant breakthroughs across challenging benchmarks, including math competitions, code generation, and logical reasoning. Its performance surpasses that of SOTA dense models under 40B parameters and rivals larger open-weight MoE models and closed-source high-performance thinking model APIs. More surprisingly, although Ring-flash-2.0 is primarily designed for complex reasoning, it also shows strong capabilities in creative writing. Thanks to its efficient architecture, it achieves high-speed inference, significantly reducing inference costs for thinking models in high-concurrency scenarios...
Total Context: 131K
Max output: 131K
Input: $0.14 / M Tokens
Output: $0.57 / M Tokens

Qwen
chat
Qwen3-Next-80B-A3B-Thinking
Released on: Sep 25, 2025
Qwen3-Next-80B-A3B-Thinking is a next-generation foundation model from Alibaba's Qwen team, specifically designed for complex reasoning tasks. It is built on the innovative Qwen3-Next architecture, which combines a Hybrid Attention mechanism (Gated DeltaNet and Gated Attention) with a High-Sparsity Mixture-of-Experts (MoE) structure to achieve ultimate training and inference efficiency. As an 80-billion-parameter sparse model, it activates only about 3 billion parameters during inference, significantly reducing computational costs and delivering over 10 times higher throughput than the Qwen3-32B model on long-context tasks exceeding 32K tokens. This 'Thinking' version is optimized for demanding multi-step problems like mathematical proofs, code synthesis, logical analysis, and agentic planning, and it outputs structured 'thinking' traces by default. In terms of performance, it surpasses more costly models like Qwen3-32B-Thinking and has outperformed Gemini-2.5-Flash-Thinking on multiple benchmarks...
Total Context: 262K
Max output: 262K
Input: $0.14 / M Tokens
Output: $0.57 / M Tokens
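To make the listed prices easier to compare, the sketch below computes an illustrative per-request cost for a few of the models above. The prices are copied from the cards on this page; the 10K-input / 2K-output workload is an arbitrary example, not a benchmark.

# Illustrative per-request cost for a subset of the models listed above,
# using the (input, output) prices from this page in USD per million tokens.
PRICES = {
    "GLM-4.1V-9B-Thinking":        (0.035, 0.14),
    "Qwen3-VL-235B-A22B-Instruct": (0.30, 1.50),
    "Qwen3-VL-235B-A22B-Thinking": (0.45, 3.50),
    "Qwen3-VL-30B-A3B-Instruct":   (0.29, 1.00),
    "DeepSeek-V3.2-Exp":           (0.27, 0.41),
    "GLM-4.6":                     (0.50, 1.90),
}

def request_cost(input_price: float, output_price: float,
                 input_tokens: int = 10_000, output_tokens: int = 2_000) -> float:
    """Estimated USD cost of one request at the given per-million-token prices."""
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000

for name, (in_price, out_price) in PRICES.items():
    print(f"{name:32s} ${request_cost(in_price, out_price):.6f}")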
Model FAQs: Usage, Deployment
Learn how to use, fine-tune, and deploy this model with ease.
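For self-hosted deployment of the open weights, one option is vLLM's offline API. The sketch below is a minimal example under the assumption that your installed vLLM build supports the GLM-4.1V architecture; the context-length setting and image URL are placeholders to adapt to your hardware and data.

# Hypothetical local-deployment sketch with vLLM's offline API, assuming the
# installed vLLM build supports the GLM-4.1V architecture. max_model_len and
# the image URL are placeholders.
from vllm import LLM, SamplingParams

llm = LLM(
    model="THUDM/GLM-4.1V-9B-Thinking",
    max_model_len=65536,   # roughly the 66K context listed above
)

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image_url",
             "image_url": {"url": "https://example.com/chart.png"}},
            {"type": "text", "text": "Summarize the trend shown in this chart."},
        ],
    }
]

outputs = llm.chat(messages, SamplingParams(temperature=0.6, max_tokens=512))
print(outputs[0].outputs[0].text)

For production serving, the same weights can typically be exposed behind an OpenAI-compatible HTTP server, in which case the API example near the top of this page applies with only the base URL changed.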