GLM-4.5V: The World's Leading Open-Source Vision Reasoning Model Now on SiliconFlow
Aug 15, 2025
Today, we are excited to announce that GLM-4.5V — the world's best-performing open-source 100B-scale vision reasoning model — is now available on SiliconFlow. Built on Z.ai's flagship text foundation model GLM-4.5-Air, GLM-4.5V is designed for complex problem solving, long-context understanding, and multimodal agents. Following the technical approach of GLM-4.1V-Thinking, it emphasizes multimodal reasoning and practical real-world applications.
Whether it's accurately interpreting images and videos, extracting insights from complex documents, or autonomously interacting with graphical user interfaces through intelligent agents, GLM-4.5V delivers robust performance.
With SiliconFlow's GLM-4.5V API, you can expect:
Cost-Effective Pricing: $0.14/M tokens (input) and $0.86/M tokens (output).
Context Length: 66K-token multimodal context window.
Native support: Tool Use and Image Input.
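For a quick sense of what this pricing means in practice, here is a back-of-the-envelope cost estimate in Python (the token counts are illustrative):

```python
# Rough cost estimate for GLM-4.5V on SiliconFlow, using the listed prices.
INPUT_PRICE_PER_M = 0.14   # USD per million input tokens
OUTPUT_PRICE_PER_M = 0.86  # USD per million output tokens

def estimate_cost(input_tokens: int, output_tokens: int) -> float:
    """Return the approximate request cost in USD."""
    return (input_tokens * INPUT_PRICE_PER_M
            + output_tokens * OUTPUT_PRICE_PER_M) / 1_000_000

# Example: a 4,000-token multimodal prompt with a 1,000-token response
print(f"${estimate_cost(4_000, 1_000):.6f}")  # -> $0.001420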
Key Capabilities & Benchmark Performance
Through efficient hybrid training, GLM-4.5V can handle diverse types of visual content, enabling comprehensive vision reasoning, including:
Image Reasoning: Scene understanding, complex multi-image analysis, spatial recognition.
Video Understanding: Long video segmentation and event recognition.
GUI Tasks: Screen reading, icon recognition, desktop operation assistance.
Complex Chart & Long Document Parsing: Research report analysis, information extraction.
Grounding: Precise visual element localization.
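For grounding in particular, the model returns bounding-box coordinates in its reply. The sketch below is a hypothetical parser: it assumes boxes wrapped in special markers with coordinates normalized to a 0–1000 range, so verify the exact output format against the model documentation before relying on it:

```python
import re

# Hypothetical output format: grounding replies wrap each bounding box in
# special markers, with coordinates normalized to a 0-1000 range. The exact
# format may vary by model version; treat this as a sketch, not a spec.
BOX_PATTERN = re.compile(
    r"<\|begin_of_box\|>\s*\[{1,2}(\d+),\s*(\d+),\s*(\d+),\s*(\d+)\]{1,2}\s*<\|end_of_box\|>"
)

def parse_boxes(reply: str, width: int, height: int) -> list[tuple[int, int, int, int]]:
    """Convert normalized [x1, y1, x2, y2] boxes back to pixel coordinates."""
    boxes = []
    for match in BOX_PATTERN.finditer(reply):
        x1, y1, x2, y2 = (int(v) for v in match.groups())
        boxes.append((x1 * width // 1000, y1 * height // 1000,
                      x2 * width // 1000, y2 * height // 1000))
    return boxes

print(parse_boxes("<|begin_of_box|>[100, 200, 500, 800]<|end_of_box|>", 1920, 1080))
# -> [(192, 216, 960, 864)]
```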
The model also introduces a Thinking Mode switch, allowing users to balance between quick responses and deep reasoning.
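In practice, latency-sensitive calls can switch deep reasoning off per request. The snippet below is a sketch against SiliconFlow's OpenAI-compatible endpoint; the `enable_thinking` flag and the model ID are assumptions, so confirm the exact names in the SiliconFlow API documentation:

```python
from openai import OpenAI

client = OpenAI(
    base_url="https://api.siliconflow.com/v1",  # SiliconFlow's OpenAI-compatible endpoint
    api_key="YOUR_API_KEY",
)

response = client.chat.completions.create(
    model="zai-org/GLM-4.5V",  # assumed model ID; check the playground for the exact name
    messages=[{"role": "user", "content": "Give a one-sentence answer: why is the sky blue?"}],
    # Hypothetical flag: disable deep reasoning for faster, cheaper replies.
    extra_body={"enable_thinking": False},
)
print(response.choices[0].message.content)
```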
GLM-4.5V achieves state-of-the-art (SOTA) performance among models of the same scale across 42 public vision-language benchmarks.

Technical Highlights
This model features advanced multimodal long-context processing capabilities with multiple technical innovations to enhance image and video processing performance:
66K multimodal long-context processing: Supports both image and video inputs and leverages 3D convolution to enhance video processing efficiency.
Bicubic interpolation mechanism: Improves robustness and capability in handling high-resolution and extreme aspect ratio images.
3D Rotary Position Embedding (3D-RoPE): Strengthens the model's perception and reasoning of three-dimensional spatial relationships in multimodal information.
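To make the 3D-RoPE idea concrete, here is a minimal toy sketch (not the model's actual implementation): the channel dimension of each attention head is split into three chunks, and a standard 1-D rotary embedding is applied per chunk using each patch's temporal, height, and width indices:

```python
import numpy as np

def rope_1d(x: np.ndarray, pos: np.ndarray, base: float = 10000.0) -> np.ndarray:
    """Standard 1-D rotary embedding over the last dimension of x.

    x:   (seq, dim) with dim even
    pos: (seq,) integer positions
    """
    dim = x.shape[-1]
    freqs = 1.0 / (base ** (np.arange(0, dim, 2) / dim))  # (dim/2,)
    angles = pos[:, None] * freqs[None, :]                # (seq, dim/2)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = np.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

def rope_3d(x: np.ndarray, t: np.ndarray, h: np.ndarray, w: np.ndarray) -> np.ndarray:
    """Toy 3D-RoPE: rotate one third of the channels per temporal/spatial axis."""
    d = x.shape[-1] // 3
    return np.concatenate([
        rope_1d(x[..., :d], t),        # temporal positions
        rope_1d(x[..., d:2 * d], h),   # height positions
        rope_1d(x[..., 2 * d:], w),    # width positions
    ], axis=-1)

# Example: 8 video patches in a 2x2x2 (t, h, w) grid, 48-dim heads
x = np.random.randn(8, 48)
t = np.array([0, 0, 0, 0, 1, 1, 1, 1])
h = np.array([0, 0, 1, 1, 0, 0, 1, 1])
w = np.array([0, 1, 0, 1, 0, 1, 0, 1])
print(rope_3d(x, t, h, w).shape)  # (8, 48)
```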
GLM-4.5V also follows a three-stage training strategy of pre-training, supervised fine-tuning (SFT), and reinforcement learning (RL):
Pre-training Stage: Large-scale interleaved multimodal corpora and long-context data are used to enhance the model's ability to process complex image–text and video content.
SFT Stage: Explicit chain-of-thought formatted training samples are introduced to improve GLM-4.5V's causal reasoning and multimodal understanding capabilities.
RL Stage: Curriculum-based multimodal reinforcement learning is applied across domains via a multi-domain reward system that combines reinforcement learning with verifiable rewards (RLVR) and reinforcement learning from human feedback (RLHF), enabling comprehensive optimization on STEM problems, multimodal localization, and agentic tasks.

Real-world Performance on SiliconFlow
When provided with an e-commerce page displaying multiple products, GLM-4.5V can identify both discounted and original prices in the image, then accurately calculate discount rates.
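A request reproducing this kind of workflow might look like the following; the image URL is a placeholder and the model ID is assumed, so adjust both to your setup:

```python
from openai import OpenAI

client = OpenAI(base_url="https://api.siliconflow.com/v1", api_key="YOUR_API_KEY")

# Placeholder image URL: point this at a real product-page screenshot.
response = client.chat.completions.create(
    model="zai-org/GLM-4.5V",
    messages=[{
        "role": "user",
        "content": [
            {"type": "image_url",
             "image_url": {"url": "https://example.com/product-page.png"}},
            {"type": "text",
             "text": "For each product, list the original and discounted price, "
                     "then compute the discount rate."},
        ],
    }],
)
print(response.choices[0].message.content)
```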

Feedback from developers in our community has been very positive.
Join the community to explore more use cases, share your results, and get first-hand support!
Get Started Immediately
Explore: Try GLM-4.5V in the SiliconFlow playground.
Integrate: Use our OpenAI-compatible API. Explore the full API specifications in the SiliconFlow API documentation.
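As a minimal starting point, here is a text-only streaming call through the OpenAI SDK (endpoint and model ID as assumed in the earlier snippets; see the API documentation for authoritative values):

```python
from openai import OpenAI

client = OpenAI(base_url="https://api.siliconflow.com/v1", api_key="YOUR_API_KEY")

# Stream tokens as they arrive; swap in multimodal content as shown earlier.
stream = client.chat.completions.create(
    model="zai-org/GLM-4.5V",
    messages=[{"role": "user", "content": "Hello, GLM-4.5V!"}],
    stream=True,
)
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
print()
```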