Step3 Now Live on SiliconFlow: The Leading Open-source Multimodal Reasoning Model

Aug 11, 2025

Step3, Stepfun's latest cutting-edge multimodal reasoning model, is now available on SiliconFlow. Built on a large-scale MoE architecture with 321B total parameters and 38B active parameters, the model delivers exceptional performance in vision-language reasoning. It offers optimized decoding efficiency for enterprise and developer needs, enabling grounded multimodal reasoning with accurate visual interpretation and reduced hallucination.

With SiliconFlow's Step3 API, you can expect:

Cost-Effective Pricing: Step3 costs $0.57/M tokens (input) and $1.42/M tokens (output).
Context Length: Supports 64K context length.
Native support for Tool Use / Function Calling.
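
As a rough illustration of the tool-use support listed above, the sketch below builds a request body that declares one function in the OpenAI-style `tools` format. The `get_weather` function and its parameters are hypothetical, and the exact shape SiliconFlow accepts for the `tools` field is an assumption based on the API being OpenAI-compatible; check the API documentation before relying on it.

```python
import json


def build_tool_call_payload(question: str) -> dict:
    """Build a chat-completions body that offers the model one callable tool."""
    # Hypothetical tool definition, written in the OpenAI-style JSON Schema format.
    weather_tool = {
        "type": "function",
        "function": {
            "name": "get_weather",  # hypothetical function name
            "description": "Look up the current weather for a city.",
            "parameters": {
                "type": "object",
                "properties": {
                    "city": {"type": "string", "description": "City name"},
                },
                "required": ["city"],
            },
        },
    }
    return {
        "model": "stepfun-ai/step3",
        "messages": [{"role": "user", "content": question}],
        "tools": [weather_tool],  # assumed field name, following the OpenAI convention
    }


payload = build_tool_call_payload("What is the weather in Paris?")
print(json.dumps(payload, indent=2))
```

If the model decides to call the tool, an OpenAI-compatible response would typically carry the function name and JSON arguments in the assistant message rather than plain text.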

Key Capabilities & Benchmark Performance

Step3 features powerful visual perception and advanced reasoning capabilities, enabling accurate cross-domain understanding, multimodal mathematical reasoning and real-world grounded visual understanding tasks.

These capabilities are demonstrated through strong performance across industry-standard benchmarks, highlighting its effectiveness in tasks requiring both visual understanding and reasoning:

VLM Benchmark Performance: Step3 achieves the highest MMMU score (74.2) among open-source VLMs, surpassing proprietary VLMs such as Gemini 2.5 Flash (73.2). It also scores 64.2 on HallusionBench, outperforming leading proprietary models including Claude Opus 4 (59.9), Claude Sonnet 4 (57.0) and o3 (60.1). These results demonstrate Step3's superior performance in complex visual reasoning, factuality and cross-domain comprehension.
LLM Benchmark Performance: Step3 maintains competitive results with 82.9 on AIME25, 73.0 on GPQA-Diamond and 67.1 on LiveCodeBench, showcasing strong capabilities in mathematical reasoning, top graduate-level reasoning and code generation.

In addition to its top-tier performance, Step3 also comes at a lower cost, making it a budget-friendly choice for your workload.
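
To make the cost concrete, here is a small sketch that estimates spend from the listed prices ($0.57 per million input tokens and $1.42 per million output tokens). The workload size is a made-up example, and the helper name `estimate_cost` is ours, not part of any SiliconFlow SDK.

```python
INPUT_PRICE_PER_M = 0.57   # USD per million input tokens (listed price)
OUTPUT_PRICE_PER_M = 1.42  # USD per million output tokens (listed price)


def estimate_cost(input_tokens: int, output_tokens: int) -> float:
    """Return the estimated cost in USD for a given token volume."""
    total = input_tokens * INPUT_PRICE_PER_M + output_tokens * OUTPUT_PRICE_PER_M
    return total / 1_000_000


# Hypothetical workload: 2M input tokens and 0.5M output tokens.
cost = estimate_cost(2_000_000, 500_000)
print(f"${cost:.2f}")  # 2 * 0.57 + 0.5 * 1.42 = 1.85
```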

Technical Highlights

Step3 addresses key challenges in multimodal alignment, decoding costs and inference efficiency through full-stack optimizations across model architecture design, training pipeline and deployment:

Pretrain Model Architecture: Step3 employs a novel Multi-Matrix Factorization Attention (MFA) mechanism that reduces KV cache overhead and computational costs while maintaining model capabilities and inference efficiency.
Multimodal Capabilities:
- Step3 uses a 5B Vision Encoder with dual-layer 2D convolution downsampling, reducing visual tokens to 1/16 of original size for improved efficiency;
- Training adopts a two-stage approach: first enhancing encoder perception, then freezing the vision encoder to optimize backbone and connector layers.
AFD System Architecture: Step3 implements Attention-FFN Disaggregation (AFD) that decouples computational tasks into specialized subsystems with multi-stage pipeline scheduling, effectively improving overall throughput efficiency.
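
The 1/16 reduction in visual tokens is what you would get if each of the two convolution layers halves both spatial dimensions of the patch grid. The sketch below checks that arithmetic with the standard convolution output-size formula. The kernel size, stride, padding and the 64x64 input grid are assumptions chosen for illustration; the source states only that the downsampling is dual-layer 2D convolution with a 1/16 result.

```python
def conv_out(size: int, kernel: int = 3, stride: int = 2, padding: int = 1) -> int:
    """Output length of one convolution dimension (standard formula)."""
    return (size + 2 * padding - kernel) // stride + 1


def downsampled_tokens(height: int, width: int, layers: int = 2) -> int:
    """Token count after applying `layers` stride-2 convolutions to a patch grid."""
    for _ in range(layers):
        height, width = conv_out(height), conv_out(width)
    return height * width


# Assumed 64x64 grid of visual tokens coming out of the encoder.
grid_h, grid_w = 64, 64
before = grid_h * grid_w
after = downsampled_tokens(grid_h, grid_w)
print(before, after, before // after)  # 4096 256 16
```

Each stride-2 layer cuts the token count by a factor of 4, so two layers give the factor of 16.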

Real-world Performance on SiliconFlow

Upload a restaurant receipt to Step3 on SiliconFlow to calculate the meal's calories. It accurately identifies food items, parses complex descriptions, categorizes dishes, matches them with calorie values and estimates total calories (e.g., 900-1330 kcal).

This process forms a complete closed loop, from raw data to concept recognition, calculation, and final explanation, with clear and consistent logic at every stage.
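
A request like the receipt example could be assembled roughly as follows. This is a sketch that assumes the endpoint accepts OpenAI-style multimodal messages, where `content` is a list of `image_url` and `text` parts and the image is passed as a base64 data URL. The helper name `build_receipt_payload`, the prompt wording and the placeholder bytes are illustrative.

```python
import base64


def build_receipt_payload(image_bytes: bytes, mime_type: str = "image/jpeg") -> dict:
    """Build a chat-completions body containing one image and one text prompt."""
    # Embed the image inline as a base64 data URL.
    encoded = base64.b64encode(image_bytes).decode("ascii")
    data_url = f"data:{mime_type};base64,{encoded}"
    return {
        "model": "stepfun-ai/step3",
        "messages": [
            {
                "role": "user",
                "content": [
                    # Assumed OpenAI-style content parts.
                    {"type": "image_url", "image_url": {"url": data_url}},
                    {
                        "type": "text",
                        "text": "Estimate the total calories of the meal on this receipt.",
                    },
                ],
            }
        ],
    }


# Placeholder bytes stand in for a real receipt photo read from disk.
payload = build_receipt_payload(b"fake-image-bytes")
image_url = payload["messages"][0]["content"][0]["image_url"]["url"]
print(image_url[:23])
```

The resulting dictionary can be sent as the JSON body of a POST to the chat completions endpoint, in the same way as a text-only request.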

Get Started Immediately

Explore: Try Step3 in the SiliconFlow playground.
Integrate: Use our OpenAI-compatible API. Explore the full API specifications in the SiliconFlow API documentation.

import requests

url = "https://api.siliconflow.com/v1/chat/completions"

payload = {
    "model": "stepfun-ai/step3",
    "max_tokens": 65536,
    "min_p": 0.05,
    "temperature": 0.7,
    "top_p": 0.7,
    "top_k": 50,
    "messages": [
        {
            "role": "user",
            "content": "tell me a story"
        }
    ]
}
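headers = {
    # Replace <token> with your SiliconFlow API key.
    "Authorization": "Bearer <token>",
    "Content-Type": "application/json"
}

# Send the request; the timeout keeps the call from hanging indefinitely.
response = requests.post(url, json=payload, headers=headers, timeout=120)

# Print the raw JSON response body.
print(response.text)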
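
The script above prints the raw response body. To pull out only the generated text, something like the following may help. It assumes the response follows the OpenAI chat-completions schema (`choices[0].message.content`), which is an inference from the API being described as OpenAI-compatible; the sample body here is hand-written, not real model output.

```python
import json


def extract_reply(response_text: str) -> str:
    """Return the assistant's message text from a chat-completions response body."""
    body = json.loads(response_text)
    choices = body.get("choices") or []
    if not choices:
        # Error responses usually carry no choices; surface the body for debugging.
        raise ValueError(f"no choices in response: {body}")
    return choices[0]["message"]["content"]


# Hand-written sample in the assumed response shape.
sample = json.dumps(
    {"choices": [{"message": {"role": "assistant", "content": "Once upon a time..."}}]}
)
print(extract_reply(sample))
```

With a live call, `extract_reply(response.text)` would take the place of printing the whole body.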