Ling-mini-2.0 Now on SiliconFlow: MoE Model with SOTA Performance & High Efficiency

Sep 11, 2025


TL;DR: Ling-mini-2.0, Ant Group inclusionAI's MoE model combining SOTA performance with unprecedented efficiency, is now available on SiliconFlow. With only 1.4B activated parameters, it delivers 7-8B dense-level performance, 300+ tokens/s generation speed, and competitive coding and math capabilities. Now you can get enterprise-grade quality at budget-friendly pricing through our API services!


SiliconFlow is excited to introduce Ling-mini-2.0 — a breakthrough MoE-based language model that redefines how efficient AI models can be. With 16B total parameters but only 1.4B activated per token, this model achieves performance that matches or surpasses much larger models, reaching top-tier performance among sub-10B dense LLMs while delivering high speed and cost-effectiveness for your workflows.


With SiliconFlow's Ling-mini-2.0 API, you can expect:


  • Cost-Efficient Pricing: $0.07/M input tokens and $0.29/M output tokens (a quick cost estimate follows this list).

  • Extended Context Window: A 131K-token context window lets you tackle long documents and complex multi-step tasks.

  • Exceptional Capabilities: Leading performance in coding and mathematical reasoning tasks.
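
For a feel for what those rates mean in practice, here is a minimal cost sketch; the request and response token counts are hypothetical workloads chosen for illustration, not measured values:

```python
# Rough per-request cost at SiliconFlow's Ling-mini-2.0 rates.
INPUT_PER_M = 0.07   # USD per million input tokens
OUTPUT_PER_M = 0.29  # USD per million output tokens

def request_cost(input_tokens: int, output_tokens: int) -> float:
    """Estimate the USD cost of one API call."""
    return (input_tokens * INPUT_PER_M + output_tokens * OUTPUT_PER_M) / 1_000_000

# Hypothetical chat request: 2,000-token prompt, 800-token answer.
print(f"${request_cost(2_000, 800):.6f} per request")
# -> $0.000372, i.e. roughly 2,700 such requests per dollar
```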


Whether you're building complex coding assistants, mathematical reasoning applications, or general-purpose AI features, SiliconFlow's Ling-mini-2.0 API delivers the performance you need at a fraction of the expected cost and latency.


Why Ling-mini-2.0 Matters


Most large language models face a fundamental trade-off: powerful reasoning typically requires massive parameter counts, which drives up latency and cost. Developers often have to choose between smaller, faster models that lack advanced reasoning capabilities and larger models that deliver quality but drain budgets and slow applications to a crawl.


Ling-mini-2.0 breaks this trade-off:


  • 7× Equivalent Dense Performance Leverage

Guided by the Ling Scaling Laws, Ling-mini-2.0's 1/32-activation-ratio MoE design routes each token to only a small set of relevant experts. This lets a small-activation MoE model achieve more than 7× the equivalent dense performance: Ling-mini-2.0, with only 1.4B activated parameters (789M non-embedding), delivers performance comparable to a 7-8B dense model.


  • High-Speed Generation at 300+ tokens/s

The highly sparse architecture enables 300+ tokens/s generation in simple QA scenarios, more than 2× faster than comparable 8B dense models. As output length grows, the relative speedup can exceed 7×, making it ideal for real-time applications (see the back-of-envelope sketch after this list).


  • Strong General and Professional Reasoning

Trained on over 20T high-quality tokens and enhanced through multi-stage supervised fine-tuning and reinforcement learning, Ling-mini-2.0 excels in complex reasoning tasks including coding (LiveCodeBench, CodeForces), mathematics (AIME 2025, HMMT 2025), and knowledge-intensive reasoning (MMLU-Pro, Humanity's Last Exam).
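
For a concrete feel for these numbers, here is the back-of-envelope sketch referenced above. All figures come from this post except the 1,500-token answer length, which is a hypothetical workload:

```python
# Back-of-envelope arithmetic for the efficiency claims above.
total_params_b = 16.0    # total parameters, billions
activated_b = 1.4        # parameters activated per token, billions
moe_tps = 300            # Ling-mini-2.0 throughput in simple QA (tokens/s)
dense_tps = moe_tps / 2  # "over 2x faster" implies a ~150 tokens/s 8B baseline

answer_tokens = 1_500    # hypothetical response length for illustration

print(f"fraction of weights used per token: {activated_b / total_params_b:.1%}")
print(f"MoE response time:   {answer_tokens / moe_tps:.1f} s")
print(f"dense response time: {answer_tokens / dense_tps:.1f} s")
# -> 8.8% of the weights per token; 5.0 s vs 10.0 s for a 1,500-token answer
```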


Compared with sub-10B dense models (e.g., Qwen3-4B-instruct-2507, Qwen3-8B-NoThinking-2504) and larger-scale MoE models (Ernie-4.5-21B-A3B-PT, GPT-OSS-20B/low), Ling-mini-2.0 demonstrates outstanding overall reasoning capabilities:


| Benchmark | Ling-mini-2.0 | Qwen3-4B-instruct-2507 | Qwen3-8B-NoThinking-2504 | Ernie-4.5-21B-A3B-PT | GPT-OSS-20B/low |
|---|---|---|---|---|---|
| LiveCodeBench | 34.8 | 31.9 | 26.1 | 26.1 | 46.6 |
| CodeForces | 59.5 | 55.4 | 28.2 | 21.7 | 67.0 |
| AIME 2025 | 47.0 | 48.1 | 23.4 | 16.1 | 38.2 |
| HMMT 2025 | 🥇 35.8 | 29.8 | 11.5 | 6.9 | 21.7 |
| MMLU-Pro | 65.1 | 62.4 | 52.5 | 65.6 | 65.6 |
| Humanity's Last Exam | 🥇 6.0 | 4.6 | 4.0 | 5.1 | 4.7 |

(🥇 marks rows where Ling-mini-2.0 posts the best score.)



Real-World Application Scenarios


As demonstrated in our SiliconFlow playground below, Ling-mini-2.0's generation speed isn't just a technical benchmark — it transforms user experience in real-world applications.


Prompt: Create a complete Snake game in Python using pygame.
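
You can reproduce this kind of demo programmatically and watch tokens arrive in real time. Below is a minimal streaming sketch against the OpenAI-compatible endpoint; it assumes the same base URL and model ID as the quick-start example at the end of this post, and `stream` is the standard Chat Completions streaming flag:

```python
import json
import requests

url = "https://api.siliconflow.com/v1/chat/completions"
payload = {
    "model": "inclusionAI/Ling-mini-2.0",
    "stream": True,  # server-sent events, one chunk per token group
    "messages": [
        {"role": "user",
         "content": "Create a complete Snake game in Python using pygame."}
    ],
}
headers = {"Authorization": "Bearer <token>", "Content-Type": "application/json"}

with requests.post(url, json=payload, headers=headers, stream=True) as resp:
    for line in resp.iter_lines():
        # OpenAI-style SSE: each event line starts with "data: "
        if not line or not line.startswith(b"data: "):
            continue
        chunk = line[len(b"data: "):]
        if chunk == b"[DONE]":  # SSE terminator used by OpenAI-style APIs
            break
        delta = json.loads(chunk)["choices"][0]["delta"]
        print(delta.get("content") or "", end="", flush=True)
```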


With lightning-fast responses, strong coding capabilities, and advanced mathematical reasoning, Ling-mini-2.0 unlocks new possibilities across industries where speed and intelligence matter most:


  • Real-Time Coding Assistants

    • Live code completion during development.

    • Instant debugging suggestions without workflow interruption.

    • Interactive code review with immediate feedback.

    • Perfect for: IDEs, code editors, pair programming tools.


  • Interactive Educational Platforms

    • Step-by-step math tutoring with instant explanations.

    • Real-time Q&A for programming bootcamps.

    • Interactive problem-solving without frustrating delays.

    • Perfect for: EdTech platforms, online courses, learning apps.


  • Customer Support & Chatbots

    • Instant responses that feel naturally conversational.

    • Complex query handling without compromising speed.

    • Multi-turn conversations that maintain context efficiently (see the sketch after this list).

    • Perfect for: Customer service, technical support, enterprise chatbots.
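
A minimal sketch of that multi-turn pattern with the same OpenAI-compatible API: keep a running `messages` list and append each assistant reply before the next user turn. The `chat_turn` helper is illustrative, not part of the API:

```python
import requests

URL = "https://api.siliconflow.com/v1/chat/completions"
HEADERS = {"Authorization": "Bearer <token>", "Content-Type": "application/json"}

def chat_turn(messages: list) -> str:
    """Send the running conversation and return the assistant's reply
    (response shape is the standard Chat Completions one)."""
    payload = {"model": "inclusionAI/Ling-mini-2.0", "messages": messages}
    resp = requests.post(URL, json=payload, headers=HEADERS)
    resp.raise_for_status()
    return resp.json()["choices"][0]["message"]["content"]

messages = [{"role": "user", "content": "My order #123 hasn't arrived."}]
reply = chat_turn(messages)

# Append the assistant turn so the next request carries the full context.
messages.append({"role": "assistant", "content": reply})
messages.append({"role": "user", "content": "Can you expedite it?"})
print(chat_turn(messages))
```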


Get Started Immediately


  1. Explore: Try Ling-mini-2.0 in the SiliconFlow playground.

  2. Integrate: Use our OpenAI-compatible API. Explore the full API specifications in the SiliconFlow API documentation.


```python
import requests

url = "https://api.siliconflow.com/v1/chat/completions"

payload = {
    "model": "inclusionAI/Ling-mini-2.0",
    "thinking_budget": 4096,  # cap on the model's reasoning tokens
    "top_p": 0.7,
    "messages": [
        {
            "role": "user",
            "content": "Tell me a story"
        }
    ]
}
headers = {
    "Authorization": "Bearer <token>",  # replace with your SiliconFlow API key
    "Content-Type": "application/json"
}

response = requests.post(url, json=payload, headers=headers)
print(response.json())
```
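
The response follows the standard OpenAI Chat Completions shape, so the assistant's text can be pulled out directly (continuing from the snippet above):

```python
# The reply text lives at choices[0].message.content in the standard shape.
print(response.json()["choices"][0]["message"]["content"])
```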


Ready to experience the speed and intelligence of Ling-mini-2.0?

Start building with our API today and see the difference efficient AI can make.


Business or Sales Inquiries →

Join our Discord community now →

Follow us on X for the latest updates →

Explore all available models on SiliconFlow →

Ready to accelerate your AI development?

© 2025 SiliconFlow Technology PTE. LTD.