Gemma 4 Now on SiliconFlow: Frontier Multimodal Intelligence

May 12, 2026

TL;DR: Gemma-4-31B, Gemma-4-26B-A4B, and Gemma-4-12B are now available on SiliconFlow. The Gemma 4 family of multimodal models from Google DeepMind delivers frontier-level performance across text, image, video, and audio inputs, with sizes optimized for different hardware requirements. With Arena AI scores of 1452 (31B) and 1441 (26B A4B), Gemma 4 excels in reasoning, coding, and agentic workflows. Start building with SiliconFlow's API today to power your AI product.

Overview: Maximize Intelligence-per-parameter

Gemma 4, Google DeepMind's latest family of open multimodal models purpose-built for advanced reasoning and agentic workflows, is now officially live on SiliconFlow.

This release brings comprehensive multimodal capabilities and strong performance across diverse benchmarks, so developers can integrate vision, audio, and text understanding with greater efficiency and reliability. Built from Gemini 3, Gemma 4 is Google's most intelligent family of open models and is released under the Apache 2.0 license.

Through SiliconFlow's Gemma 4 API, You Can Expect

Cost-effective Pricing: Gemma-4-31B: $0.13/M tokens (Input), $0.40/M tokens (Output); Gemma-4-26B-A4B: $0.12/M tokens (Input), $0.40/M tokens (Output); Gemma-4-12B: $0.10/M tokens (Input), $0.30/M tokens (Output)
Up to 262K Context Window: Perfect for long documents, complex reasoning, and extended agentic tasks
Advanced Reasoning & Agentic Workflows: Native function calling for autonomous agents like Hermes Agent, OpenClaw, and Claude Code
Developer-Ready Integration: Works with your existing stack. Deploy via SiliconFlow's OpenAI/Anthropic-compatible API through Claude Code, Gen-CLI, and Cline; ready to use in Dify, ChatHub, Chatbox, and Sider; also available through OpenRouter
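
To make the pricing above concrete, here is a minimal sketch that estimates the cost of a single request from its token counts. The per-million-token prices are copied from the list above; the `estimate_cost` helper and the `PRICES_PER_MILLION` table are hypothetical names introduced for illustration, and actual billing may differ (for example, in how thinking tokens are counted).

```python
from decimal import Decimal

# USD per one million tokens as (input, output), copied from the pricing list.
PRICES_PER_MILLION = {
    "Gemma-4-31B": (Decimal("0.13"), Decimal("0.40")),
    "Gemma-4-26B-A4B": (Decimal("0.12"), Decimal("0.40")),
    "Gemma-4-12B": (Decimal("0.10"), Decimal("0.30")),
}

MILLION = Decimal(1_000_000)


def estimate_cost(model: str, input_tokens: int, output_tokens: int) -> Decimal:
    """Estimate the USD cost of one request (hypothetical helper)."""
    input_price, output_price = PRICES_PER_MILLION[model]
    # Prices are quoted per million tokens, so scale the token counts down.
    return (input_tokens * input_price + output_tokens * output_price) / MILLION


if __name__ == "__main__":
    # A 200,000-token document summarized into 2,000 output tokens.
    for name in PRICES_PER_MILLION:
        print(name, estimate_cost(name, 200_000, 2_000))
```

`Decimal` is used instead of floats so that small per-request amounts add up without rounding drift when summed over many calls.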

Core Capabilities & Benchmark Performance

Choosing a model size and precision usually means trading capability against deployment cost. Gemma 4 narrows that trade-off through architectural innovation and comprehensive training, achieving both frontier-level performance and deployment versatility.

Reasoning: All models in the family are designed as highly capable reasoners, with configurable thinking modes
Diverse & Efficient Architectures: Offers a powerful 31B-parameter dense model and a highly efficient 26B MoE model
Frontier Multimodal Performance: Achieves Arena AI text score of 1452 (31B) and 1441 (26B A4B with just 4B active parameters), placing it among the world's top-tier models
Comprehensive Modality Support: Natively processes text, images, video, and audio inputs (E2B/E4B support all modalities; larger models support text+vision)
Enhanced Reasoning & Coding: 89.2% on AIME 2026, 80.0% on LiveCodeBench v6, and 86.4% on τ²-Bench agentic tool use (31B model)

Gemma 4 consistently outperforms previous-generation models including Gemma 3 27B across reasoning, coding, vision, and agentic benchmarks, while maintaining deployment flexibility that frontier closed models cannot match. The 26B A4B mixture-of-experts model delivers near-31B performance with only 4B active parameters, making frontier capabilities accessible on consumer hardware.

Gemma-4-12B: Google DeepMind's First Mid-sized Model to Feature Native Audio Inputs

Gemma-4-12B is a dense multimodal model whose unified, encoder-free architecture integrates audio and vision input directly. On standard benchmarks, it delivers performance close to that of the larger Gemma-4-26B MoE model.

Highlights:

Encoder-free Architecture: Eliminates traditional multimodal encoders; vision and audio inputs flow directly into the LLM backbone
Advanced Reasoning: Benchmark performance approaches that of the Gemma-4-26B model, enabling sophisticated multi-step reasoning and agentic workflows
Latency Optimized: Equipped with Multi-Token Prediction (MTP) drafters for reduced inference latency

Real-World Applications

From reasoning and coding to vision and long-context tasks, Gemma 4 empowers developers to push the boundaries of AI across every dimension.

Vision-Language Coding Assistants: Developers can provide screenshots of UI elements, design mockups, or application interfaces, and receive accurate HTML/CSS code generation, GUI element detection with bounding boxes, or object detection—enabling rapid prototyping and automated design-to-code workflows. The models natively respond in structured JSON format without requiring specific instructions
Multimodal Agentic Systems: Build intelligent agents that combine vision and text understanding for complex tool use scenarios. Gemma 4 excels at multimodal function calling, allowing agents to analyze images and invoke appropriate tools based on visual context
Possible Use Cases: Creative text generation, text summarization, chatbots and conversational AI, image data extraction, research and education…
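
As an illustration of the multimodal function calling described above, the sketch below builds a request payload that pairs one image with one callable tool. It assumes the endpoint accepts OpenAI-style `image_url` content parts and a `tools` array for this model, which should be confirmed in the SiliconFlow API documentation. The `lookup_product` tool and the `build_vision_tool_request` helper are hypothetical.

```python
import base64
import json


def build_vision_tool_request(image_bytes: bytes, question: str) -> dict:
    """Build a chat-completions payload with one image and one tool (sketch)."""
    # Inline the image as a base64 data URL (assumed OpenAI-style content part).
    encoded = base64.b64encode(image_bytes).decode("ascii")
    data_url = f"data:image/png;base64,{encoded}"

    # Hypothetical tool; its arguments are described with JSON Schema.
    lookup_tool = {
        "type": "function",
        "function": {
            "name": "lookup_product",
            "description": "Look up a product by the name visible in an image.",
            "parameters": {
                "type": "object",
                "properties": {"product_name": {"type": "string"}},
                "required": ["product_name"],
            },
        },
    }

    return {
        "model": "google/gemma-4-26B-A4B-it",
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "image_url", "image_url": {"url": data_url}},
                    {"type": "text", "text": question},
                ],
            }
        ],
        "tools": [lookup_tool],
    }


if __name__ == "__main__":
    # Placeholder bytes stand in for a real PNG file read from disk.
    payload = build_vision_tool_request(b"placeholder", "Which product is shown?")
    print(json.dumps(payload, indent=2))
```

If the model decides to call the tool, an OpenAI-compatible response would typically carry the call under `tool_calls` on the assistant message; the agent then runs the tool and sends the result back in a follow-up message.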

Get Started Immediately

Explore: Try Gemma-4-31B, Gemma-4-26B-A4B or Gemma-4-12B in the SiliconFlow playground.
Integrate: Use our OpenAI-compatible API. Explore the full API specifications in the SiliconFlow API documentation.
Python Example for Gemma-4-26B-A4B API Usage:

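import requests

url = "https://api.siliconflow.com/v1/chat/completions"

payload = {
    "model": "google/gemma-4-26B-A4B-it",
    "stream": True,  # ask the server to send the answer incrementally
    "max_tokens": 4096,
    "messages": [
        {
            "role": "user",
            "content": "How many r's are in the word 'strawberry'?"
        }
    ],
    "enable_thinking": True,
    "temperature": 0.7
}
headers = {
    "Authorization": "Bearer <token>",  # replace <token> with your API key
    "Content-Type": "application/json"
}

# stream=True lets requests hand over the body as it arrives instead of buffering it
with requests.post(url, json=payload, headers=headers, stream=True) as response:
    response.raise_for_status()  # fail early on HTTP errors such as 401
    for raw_line in response.iter_lines():
        if raw_line:  # skip the blank separator lines between events
            print(raw_line.decode("utf-8"))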
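
Because the request above sets `"stream": True`, the response body arrives as a sequence of event lines rather than a single JSON object. The sketch below shows one way to reassemble the answer text. It assumes the OpenAI-compatible streaming format (`data: {...}` lines carrying `choices[0].delta.content`, terminated by `data: [DONE]`); `collect_stream_text` is a hypothetical helper, and thinking output may arrive in a separate field that this sketch ignores.

```python
import json
from typing import Iterable


def collect_stream_text(lines: Iterable[str]) -> str:
    """Join the content deltas from OpenAI-style streaming lines (sketch)."""
    parts = []
    for line in lines:
        if not line.startswith("data:"):
            continue  # skip blank separators and non-data lines
        data = line[len("data:"):].strip()
        if data == "[DONE]":
            break  # assumed end-of-stream sentinel
        chunk = json.loads(data)
        choices = chunk.get("choices") or []
        if not choices:
            continue  # some chunks may carry only usage metadata
        delta = choices[0].get("delta") or {}
        content = delta.get("content")
        if content:
            parts.append(content)
    return "".join(parts)


if __name__ == "__main__":
    # Hand-written sample lines standing in for a live response.
    sample = [
        'data: {"choices": [{"delta": {"content": "Hel"}}]}',
        "",
        'data: {"choices": [{"delta": {"content": "lo"}}]}',
        "data: [DONE]",
    ]
    print(collect_stream_text(sample))
```

With `requests`, the input lines could come from `response.iter_lines()` on a call made with `stream=True`, decoding each line as UTF-8 before passing it in.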
