
Moonshot AI
Text Generation
Kimi-K2.6
Kimi K2.6 is an open-source, native multimodal agentic model by Moonshot AI, achieving open-source state-of-the-art on benchmarks including HLE with tools, SWE-Bench Pro, and BrowseComp. Built on a MoE architecture with 1T total parameters and 32B activated, the model supports a 256K-token context window and multimodal inputs (image and video) via its MoonViT vision encoder. K2.6 is optimized for agentic workloads: it sustains 4,000+ tool calls over 12+ hours of continuous execution, scales to 300 parallel sub-agents × 4,000 steps per run to produce 100+ files from a single prompt, and supports both Thinking and Instant inference modes with function calling and multi-turn Preserve Thinking...
Total Context: 262K
Max output: 262K
Input: $0.95 / M Tokens
Cached Input: n/a
Output: $4.0 / M Tokens
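The per-million-token rates above make request cost a simple weighted sum. A minimal sketch, using Kimi K2.6's listed rates ($0.95 input, $4.0 output); the token counts in the example are made up for illustration:

```python
INPUT_RATE = 0.95   # USD per 1M input tokens (listed rate)
OUTPUT_RATE = 4.0   # USD per 1M output tokens (listed rate)

def request_cost(input_tokens: int, output_tokens: int) -> float:
    """Return the USD cost of one request at the listed rates."""
    return (input_tokens / 1_000_000) * INPUT_RATE \
         + (output_tokens / 1_000_000) * OUTPUT_RATE

# Example: a 120K-token prompt with an 8K-token completion.
cost = request_cost(120_000, 8_000)
print(f"${cost:.3f}")  # $0.146
```

Note that cached-input pricing, where listed, applies a lower rate to the prefix tokens served from cache, so the same formula would split the input term in two.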

Z.ai
Text Generation
GLM-5V-Turbo
GLM-5V-Turbo is Zhipu’s latest flagship multimodal foundation model, optimized for multimodal coding and agent capabilities. It supports up to 200K tokens of image, video, and text context, and, when integrated with frameworks such as Claude Code and OpenClaw, can handle complex long-horizon programming and assistant tasks....
Total Context: 205K
Max output: 131K
Input: $1.2 / M Tokens
Cached Input: n/a
Output: $4.0 / M Tokens

Moonshot AI
Text Generation
Kimi-K2.5
Kimi K2.5 is an open-source, native multimodal agentic model built through continual pretraining on approximately 15 trillion mixed visual and text tokens atop Kimi-K2-Base. With a 1T-parameter MoE architecture (32B active) and 256K context length, it seamlessly integrates vision and language understanding with advanced agentic capabilities, supporting both instant and thinking modes, as well as conversational and agentic paradigms...
Total Context: 262K
Max output: 262K
Input: $0.23 / M Tokens
Cached Input: n/a
Output: $3.0 / M Tokens

Z.ai
Text Generation
GLM-4.6V
GLM-4.6V achieves state-of-the-art (SOTA) accuracy in visual understanding among models of the same parameter scale. For the first time, it natively integrates function-calling capabilities into the visual model architecture, bridging the gap between "visual perception" and "executable action" and providing a unified technical foundation for multimodal agents in real-world business scenarios. Additionally, the visual context window has been expanded to 128K, supporting long video-stream processing and high-resolution multi-image analysis....
Total Context: 131K
Max output: 131K
Input: $0.3 / M Tokens
Cached Input: n/a
Output: $0.9 / M Tokens
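Function calling on a visual model means a single request can carry an image and a tool definition, with the model deciding whether to invoke the tool. A minimal sketch of such a request payload, following the common OpenAI-style chat-completions schema; the model identifier, image URL, and `set_appliance_power` tool are assumptions for illustration, not confirmed details of Z.ai's API:

```python
# Illustrative payload only: schema follows the widely used OpenAI-style
# chat-completions format. Model name, image URL, and the tool definition
# below are hypothetical, not confirmed details of Z.ai's API.
payload = {
    "model": "glm-4.6v",  # assumed model identifier
    "messages": [
        {
            "role": "user",
            "content": [
                {"type": "text",
                 "text": "What appliance is shown? Turn it off."},
                {"type": "image_url",
                 "image_url": {"url": "https://example.com/kitchen.jpg"}},
            ],
        }
    ],
    "tools": [
        {
            "type": "function",
            "function": {
                "name": "set_appliance_power",  # hypothetical tool
                "description": "Switch a named appliance on or off.",
                "parameters": {
                    "type": "object",
                    "properties": {
                        "appliance": {"type": "string"},
                        "on": {"type": "boolean"},
                    },
                    "required": ["appliance", "on"],
                },
            },
        }
    ],
}
```

The point of native integration is that the perception step (identifying the appliance in the image) and the action step (emitting a structured tool call) happen in one model pass rather than in a pipeline of separate models.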

Qwen
Text Generation
Qwen3-VL-32B-Instruct
Qwen3-VL is the vision-language model in the Qwen3 series, achieving state-of-the-art (SOTA) performance on various vision-language (VL) benchmarks. The model supports high-resolution image inputs up to the megapixel level and possesses strong capabilities in general visual understanding, multilingual OCR, fine-grained visual grounding, and visual dialogue. As part of the Qwen3 series, it inherits a powerful language foundation, enabling it to understand and execute complex instructions....
Total Context: 262K
Max output: 262K
Input: $0.2 / M Tokens
Cached Input: n/a
Output: $0.6 / M Tokens

Qwen
Text Generation
Qwen3-VL-32B-Thinking
Qwen3-VL-Thinking is a version of the Qwen3-VL series specially optimized for complex visual reasoning tasks. It incorporates a "Thinking Mode", enabling it to generate detailed intermediate reasoning steps (Chain-of-Thought) before providing a final answer. This design significantly enhances the model's performance on visual question answering (VQA) and other vision-language tasks that require multi-step logic, planning, and in-depth analysis....
Total Context: 262K
Max output: 262K
Input: $0.2 / M Tokens
Cached Input: n/a
Output: $1.5 / M Tokens

Qwen
Text Generation
Qwen3-VL-8B-Instruct
Qwen3-VL-8B-Instruct is a vision-language model in the Qwen3 series that demonstrates strong capabilities in general visual understanding, visual-centric dialogue, and multilingual text recognition in images. ...
Total Context: 262K
Max output: 262K
Input: $0.18 / M Tokens
Cached Input: n/a
Output: $0.68 / M Tokens

Qwen
Text Generation
Qwen3-VL-235B-A22B-Instruct
Qwen3-VL-235B-A22B-Instruct is a 235B-parameter Mixture-of-Experts (MoE) vision-language model with 22B activated parameters. It is an instruction-tuned version of Qwen3-VL-235B-A22B and is aligned for chat applications. ...
Total Context: 262K
Max output: 262K
Input: $0.3 / M Tokens
Cached Input: n/a
Output: $1.5 / M Tokens

Qwen
Text Generation
Qwen3-VL-235B-A22B-Thinking
Qwen3-VL-235B-A22B-Thinking is the reasoning-enhanced Thinking edition of the Qwen3-VL series. It achieves state-of-the-art (SOTA) results across many multimodal reasoning benchmarks, excelling at STEM and math problems, causal analysis, and logical, evidence-based answers. It features a Mixture-of-Experts (MoE) architecture with 235B total parameters and 22B active parameters. ...
Total Context: 262K
Max output: 262K
Input: $0.45 / M Tokens
Cached Input: n/a
Output: $3.5 / M Tokens

Qwen
Text Generation
Qwen3-VL-30B-A3B-Instruct
Qwen3-VL series delivers superior text understanding & generation, deeper visual perception & reasoning, extended context length, enhanced spatial and video dynamics comprehension, and stronger agent interaction capabilities. Available in Dense and MoE architectures that scale from edge to cloud, with Instruct and reasoning‑enhanced Thinking editions....
Total Context: 262K
Max output: 262K
Input: $0.29 / M Tokens
Cached Input: n/a
Output: $1 / M Tokens

Qwen
Text Generation
Qwen3-VL-30B-A3B-Thinking
Qwen3-VL series delivers superior text understanding & generation, deeper visual perception & reasoning, extended context length, enhanced spatial and video dynamics comprehension, and stronger agent interaction capabilities. Available in Dense and MoE architectures that scale from edge to cloud, with Instruct and reasoning‑enhanced Thinking editions....
Total Context: 262K
Max output: 262K
Input: $0.29 / M Tokens
Cached Input: n/a
Output: $1 / M Tokens

Qwen
Text Generation
Qwen2.5-VL-32B-Instruct
Qwen2.5-VL-32B-Instruct is a multimodal large language model released by the Qwen team, part of the Qwen2.5-VL series. This model is not only proficient in recognizing common objects but is highly capable of analyzing texts, charts, icons, graphics, and layouts within images. It acts as a visual agent that can reason and dynamically direct tools, capable of computer and phone use. Additionally, the model can accurately localize objects in images, and generate structured outputs for data like invoices and tables. Compared to its predecessor Qwen2-VL, this version has enhanced mathematical and problem-solving abilities through reinforcement learning, with response styles adjusted to better align with human preferences...
Total Context: 131K
Max output: 131K
Input: $0.27 / M Tokens
Cached Input: n/a
Output: $0.27 / M Tokens

Qwen
Text Generation
Qwen2.5-VL-72B-Instruct
Qwen2.5-VL is a vision-language model in the Qwen2.5 series that shows significant enhancements in several aspects: it has strong visual understanding capabilities, recognizing common objects while analyzing texts, charts, and layouts in images; it functions as a visual agent capable of reasoning and dynamically directing tools; it can comprehend videos over 1 hour long and capture key events; it accurately localizes objects in images by generating bounding boxes or points; and it supports structured outputs for scanned data like invoices and forms. The model demonstrates excellent performance across various benchmarks including image, video, and agent tasks...
Total Context: 131K
Max output: 4K
Input: $0.59 / M Tokens
Cached Input: n/a
Output: $0.59 / M Tokens
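The object-localization capability described above (bounding boxes or points) is typically consumed by parsing the model's structured output. A minimal sketch; the JSON shape used here (`bbox_2d` pixel coordinates plus a `label`) is an assumed grounding format for illustration, not a guaranteed contract of the model:

```python
import json

# Hypothetical grounding response: a JSON list of labeled pixel-space
# boxes, as a localization-capable VL model might emit. The field names
# "bbox_2d" and "label" are assumptions for this sketch.
raw = '[{"bbox_2d": [74, 130, 526, 410], "label": "invoice table"}]'

boxes = json.loads(raw)
for box in boxes:
    x1, y1, x2, y2 = box["bbox_2d"]       # corner coordinates in pixels
    w, h = x2 - x1, y2 - y1               # width and height of the region
    print(f'{box["label"]}: {w}x{h} px at ({x1}, {y1})')
```

Downstream code would crop these regions for OCR or feed them to a layout parser, which is how the structured-output use case (invoices, forms) is usually wired up.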

