Qwen2.5-VL-72B-Instruct
About Qwen2.5-VL-72B-Instruct
Qwen2.5-VL is a vision-language model in the Qwen2.5 series that shows significant enhancements in several aspects: it has strong visual understanding capabilities, recognizing common objects while analyzing texts, charts, and layouts in images; it functions as a visual agent capable of reasoning and dynamically directing tools; it can comprehend videos over 1 hour long and capture key events; it accurately localizes objects in images by generating bounding boxes or points; and it supports structured outputs for scanned data like invoices and forms. The model demonstrates excellent performance across various benchmarks including image, video, and agent tasks
Explore how Qwen2.5-VL-72B-Instruct's advanced vision-language capabilities solve complex, real-world problems.
Smart Document Data Extraction
Automate data extraction from diverse visual documents like invoices, forms, and charts, converting unstructured visual data into structured, actionable insights.
Use Case Example:
"Processed thousands of scanned healthcare intake forms, accurately extracting patient demographics and medical history, reducing manual data entry by 80%."
Long Video Content Analysis
Comprehend and analyze extended video content (over 1 hour), identifying key events, objects, and actions, pinpointing relevant segments for rapid review.
Use Case Example:
"Monitored 8-hour manufacturing line footage, automatically flagging anomalies like misaligned products or safety violations with precise timestamps for review."
Visual UI Automation
Act as a visual agent to interact with digital interfaces (web, mobile), performing complex tasks and automating workflows based on visual cues.
Use Case Example:
"Automated customer support tasks on a web portal by visually navigating the UI to process returns and update order statuses, eliminating manual API calls."
Real-time Object Localization
Accurately detect and localize objects within images and video streams, generating bounding boxes or points for precise tracking and inventory management.
Use Case Example:
"Implemented a retail warehouse system to monitor shelf stock, identifying low-stock items and their exact locations, improving inventory accuracy."
Metadata
Specification
State
Deprecated
Architecture
Vision-Language Transformer
Calibrated
No
Mixture of Experts
No
Total Parameters
72B
Activated Parameters
72B
Reasoning
No
Precision
FP8
Context length
131K
Max Tokens
4K
Compare with Other Models
See how this model stacks up against others.

Qwen
chat
Qwen3.6-35B-A3B
Release on: Apr 17, 2026
Total Context:
262K
Max output:
262K
Input:
$
0.2
/ M Tokens
Output:
$
1.6
/ M Tokens

Qwen
chat
Qwen3.6-27B
Release on: Apr 23, 2026
Total Context:
262K
Max output:
262K
Input:
$
0.3
/ M Tokens
Output:
$
3.2
/ M Tokens

Qwen
chat
Qwen3.5-397B-A17B
Release on: Apr 24, 2026
Total Context:
262K
Max output:
262K
Input:
$
0.39
/ M Tokens
Output:
$
2.34
/ M Tokens

Qwen
chat
Qwen3.5-122B-A10B
Release on: Apr 24, 2026
Total Context:
262K
Max output:
262K
Input:
$
0.26
/ M Tokens
Output:
$
2.08
/ M Tokens

Qwen
chat
Qwen3.5-35B-A3B
Release on: Feb 25, 2026
Total Context:
262K
Max output:
262K
Input:
$
0.24
/ M Tokens
Output:
$
1.8
/ M Tokens

Qwen
chat
Qwen3.5-27B
Release on: Apr 24, 2026
Total Context:
262K
Max output:
262K
Input:
$
0.25
/ M Tokens
Output:
$
2.0
/ M Tokens

Qwen
chat
Qwen3.5-9B
Release on: Apr 24, 2026
Total Context:
262K
Max output:
262K
Input:
$
0.1
/ M Tokens
Output:
$
0.15
/ M Tokens

Qwen
chat
Qwen3-VL-32B-Instruct
Release on: Oct 21, 2025
Total Context:
262K
Max output:
262K
Input:
$
0.2
/ M Tokens
Output:
$
0.6
/ M Tokens

Qwen
chat
Qwen3-VL-32B-Thinking
Release on: Oct 21, 2025
Total Context:
262K
Max output:
262K
Input:
$
0.2
/ M Tokens
Output:
$
1.5
/ M Tokens
