Qwen2.5-VL-32B-Instruct
About Qwen2.5-VL-32B-Instruct
Qwen2.5-VL-32B-Instruct is a multimodal large language model released by the Qwen team as part of the Qwen2.5-VL series. Beyond recognizing common objects, it can analyze text, charts, icons, graphics, and layouts within images. The model acts as a visual agent that can reason and dynamically direct tools, including computer and phone use. It can also accurately localize objects in images and generate structured outputs for data such as invoices and tables. Compared with its predecessor, Qwen2-VL, this version has stronger mathematical and problem-solving abilities thanks to reinforcement learning, and its response style has been adjusted to better align with human preferences.
Explore how Qwen2.5-VL-32B-Instruct's multimodal intelligence and agentic capabilities solve complex visual and analytical challenges.
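The model is typically served behind an OpenAI-compatible chat-completions API. As a minimal sketch, assuming that interface (the model identifier and the base64 data-URL image encoding are assumptions; consult your provider's documentation for the exact values it expects), a single-image request body can be built like this:

```python
import base64

def build_vision_request(image_bytes: bytes, prompt: str,
                         model: str = "Qwen/Qwen2.5-VL-32B-Instruct") -> dict:
    """Build an OpenAI-style chat-completions request body with one image.

    The model identifier and the data-URL encoding are assumptions; check
    your provider's documentation for the exact names it accepts.
    """
    b64 = base64.b64encode(image_bytes).decode("ascii")
    return {
        "model": model,
        "messages": [{
            "role": "user",
            "content": [
                # Image part first, followed by the text instruction.
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{b64}"}},
                {"type": "text", "text": prompt},
            ],
        }],
        "max_tokens": 1024,
    }

payload = build_vision_request(b"\x89PNG\r\n", "Describe the chart in this image.")
```

The returned dict can be POSTed as JSON to the provider's chat-completions endpoint.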
Document Data Extraction
Automate data extraction from invoices, forms, and reports, structuring information for efficient processing.
Use Case Example:
"Extracted vendor, item, and total amounts from thousands of scanned invoices, populating a database and cutting manual entry time by 80%."
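A common pattern for extraction pipelines like this is to ask the model for JSON only and then parse the reply defensively, since models sometimes wrap JSON in prose or code fences. A minimal sketch (the prompt and field names are illustrative, not a fixed model format):

```python
import json
import re

# Illustrative extraction prompt; the exact schema is up to you.
INVOICE_PROMPT = (
    "Extract the vendor, line items, and total from this invoice. "
    'Reply with JSON only, in the form {"vendor": "...", '
    '"items": [{"name": "...", "amount": 0.0}], "total": 0.0}.'
)

def parse_invoice_reply(reply: str) -> dict:
    """Pull the first JSON object out of a model reply.

    Grabs the outermost {...} span before parsing, so replies wrapped in
    prose or code fences still parse cleanly.
    """
    match = re.search(r"\{.*\}", reply, re.DOTALL)
    if match is None:
        raise ValueError("no JSON object found in reply")
    return json.loads(match.group(0))

# A fenced reply still parses:
reply = '```json\n{"vendor": "Acme Co", "items": [{"name": "Widget", "amount": 12.5}], "total": 12.5}\n```'
data = parse_invoice_reply(reply)
```

Each parsed record can then be inserted into the target database.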
Visual UI Automation
Automate complex interactions on web or mobile apps by visually understanding layouts and directing actions.
Use Case Example:
"An AI agent navigated an e-commerce site, added items, and completed checkout, adapting to UI changes for robust automation."
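An agent loop built on this capability usually asks the model to emit one structured action per step and validates it before executing. The action schema below is purely hypothetical, for illustration; Qwen's actual agent/tool format may differ:

```python
import json

# Hypothetical action vocabulary for a visual-agent loop.
ALLOWED_ACTIONS = {"click", "type", "scroll", "done"}

def parse_agent_action(reply: str) -> dict:
    """Validate one UI action emitted by the model before executing it."""
    action = json.loads(reply)
    kind = action.get("action")
    if kind not in ALLOWED_ACTIONS:
        raise ValueError(f"unknown action: {kind!r}")
    if kind == "click":
        x, y = action["coordinate"]  # screen pixels
        if x < 0 or y < 0:
            raise ValueError("coordinates must be non-negative")
    if kind == "type" and not isinstance(action.get("text"), str):
        raise ValueError("'type' action requires a text field")
    return action

step = parse_agent_action('{"action": "click", "coordinate": [512, 300]}')
```

Validating before execution keeps a malformed model reply from clicking somewhere unintended.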
Video Event Detection
Analyze long video streams to detect specific events, objects, or activities with precise timestamps and summaries.
Use Case Example:
"Monitored security footage, pinpointing unauthorized access instances and generating alerts with relevant video clips."
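Long videos are usually downsampled to a bounded number of frames before being sent to the model, so the request stays within the 131K-token context. A small sketch of even timestamp sampling (frame decoding itself, e.g. with ffmpeg or OpenCV, is left to the caller):

```python
def sample_timestamps(duration_s: float, fps: float = 1.0,
                      max_frames: int = 64) -> list:
    """Pick evenly spaced frame timestamps (in seconds) for a long video.

    Capping the frame count bounds the request size; the cap of 64 here is
    an illustrative choice, not a model requirement.
    """
    n = min(int(duration_s * fps), max_frames)
    if n <= 0:
        return []
    step = duration_s / n
    # Sample the midpoint of each interval for uniform coverage.
    return [round(i * step + step / 2, 3) for i in range(n)]

# A 10-minute clip at 1 fps would yield 600 frames; capped to 64 here.
timestamps = sample_timestamps(600.0, fps=1.0, max_frames=64)
```

The sampled timestamps map each detected event back to a position in the original footage.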
Interactive STEM Learning
Provide step-by-step solutions for problems in textbooks, diagrams, or handwritten notes, enhancing STEM education.
Use Case Example:
"Solved a challenging physics problem by analyzing a diagram and equations, providing a detailed, step-by-step derivation."
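When checking step-by-step solutions programmatically, a common trick is to pull the final numeric answer out of the model's derivation. This regex extractor is a heuristic for grading, not part of any model API:

```python
import re

def extract_final_answer(solution: str):
    """Return the last number in a step-by-step solution, or None.

    A heuristic: it can be fooled if the solution ends with a unit or
    footnote that itself contains digits.
    """
    numbers = re.findall(r"-?\d+(?:\.\d+)?", solution)
    return float(numbers[-1]) if numbers else None

answer = extract_final_answer("F = ma = 2 kg x 9.8 m/s^2 = 19.6 N")
```

The extracted value can then be compared against an answer key with a numeric tolerance.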
Metadata

State: Deprecated
Architecture: Multimodal Transformer
Calibrated: Yes
Mixture of Experts: No
Total Parameters: 32B
Activated Parameters: 32B
Reasoning: No
Precision: FP8
Context Length: 131K
Max Tokens: 131K
Compare with Other Models
See how this model stacks up against others.

All of the models below are Qwen chat models.

Model                    Released       Context   Max Output   Input ($/M tokens)   Output ($/M tokens)
Qwen3.6-35B-A3B          Apr 17, 2026   262K      262K         $0.2                 $1.6
Qwen3.6-27B              Apr 23, 2026   262K      262K         $0.3                 $3.2
Qwen3.5-397B-A17B        Apr 24, 2026   262K      262K         $0.39                $2.34
Qwen3.5-122B-A10B        Apr 24, 2026   262K      262K         $0.26                $2.08
Qwen3.5-35B-A3B          Feb 25, 2026   262K      262K         $0.24                $1.8
Qwen3.5-27B              Apr 24, 2026   262K      262K         $0.25                $2.0
Qwen3.5-9B               Apr 24, 2026   262K      262K         $0.1                 $0.15
Qwen3-VL-32B-Instruct    Oct 21, 2025   262K      262K         $0.2                 $0.6
Qwen3-VL-32B-Thinking    Oct 21, 2025   262K      262K         $0.2                 $1.5
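Given per-million-token prices like those listed above, the cost of a single request is simple arithmetic:

```python
def request_cost_usd(input_tokens: int, output_tokens: int,
                     input_price_per_m: float,
                     output_price_per_m: float) -> float:
    """Cost of one request from per-million-token prices."""
    return (input_tokens * input_price_per_m
            + output_tokens * output_price_per_m) / 1_000_000

# Example using Qwen3-VL-32B-Instruct's listed prices ($0.2 in, $0.6 out):
cost = request_cost_usd(10_000, 1_000, 0.2, 0.6)
```

At those rates, a request with 10K input tokens and 1K output tokens costs $0.0026.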