LLM Leaderboard 2025

Comprehensive AI Model Benchmarks & Performance Metrics

Updated: Nov 13, 2025

Performance Analysis Charts

The charts in this section cover:

- Top 10 Models by GPQA Diamond (Reasoning)
- Coding Performance (SWE Bench): top models
- Price vs Performance Analysis: input cost per 1M tokens vs GPQA performance score (a rough sketch of this comparison follows the list)
- Speed Comparison: tokens per second
- Context Window Sizes: maximum tokens each model can process
- Math Capabilities (MATH 500) vs Tool Use (BFCL): scatter plot showing the correlation between mathematical and tool-use capabilities
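To make the price-vs-performance comparison concrete, here is a minimal Python sketch that ranks a few models by "GPQA points per dollar of input cost", using scores and prices quoted in the tables further down this page. The ratio is just one illustrative value metric, not the methodology behind the chart.

```python
# Minimal price-vs-performance sketch.
# GPQA Diamond scores and input prices (USD per 1M tokens) are copied from
# the benchmark and pricing tables on this page.
models = {
    "GPT-5":          (87.3, 1.25),
    "Gemini 2.5 Pro": (86.4, 1.25),
    "OpenAI o3":      (83.3, 10.00),
    "OpenAI o3-mini": (79.7, 1.10),
    "DeepSeek-R1":    (71.5, 0.55),
}

# Rank by "GPQA points per dollar of input cost" (higher = better value).
for name, (gpqa, price) in sorted(
    models.items(), key=lambda kv: kv[1][0] / kv[1][1], reverse=True
):
    print(f"{name:<15} GPQA {gpqa:5.1f}%  ${price:>5.2f}/1M input  "
          f"{gpqa / price:6.1f} pts/$")
```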

Top Performing Models by Category

Best in Reasoning (GPQA Diamond)

#1 GPT 5.1: 88.1%
#2 Grok 4: 87.5%
#3 GPT-5: 87.3%
#4 Gemini 2.5 Pro: 86.4%
#5 Grok 3 [Beta]: 84.6%

Best in High School Math (AIME 2025)

#1 GPT-5: 100%
#2 Kimi K2 Thinking: 99.1%
#3 GPT oss 20b: 98.7%
#4 OpenAI o3: 98.4%
#5 GPT oss 120b: 97.9%

Best in Agentic Coding (SWE Bench)

#1 GPT 5.1: 76.3%
#2 Grok 4: 75%
#3 GPT-5: 74.9%
#4 Claude Opus 4.1: 74.5%
#5 Claude Haiku 4.5: 73.3%

Best in Tool Use (BFCL)

#1 Llama 3.1 405b: 81.1%
#2 Llama 3.3 70b: 77.3%
#3 GPT-4o: 72.08%
#4 GPT-4.5: 69.94%
#5 Nova Pro: 68.4%

Best in Adaptive Reasoning (GRIND)

#1 Gemini 2.5 Pro: 82.1%
#2 Claude 4 Sonnet: 75%
#3 Claude 4 Opus: 67.9%
#4 Claude 3.7 Sonnet [R]: 60.7%
#5 Nemotron Ultra 253B: 57.1%

Best Overall (Humanity's Last Exam)

#1 Kimi K2 Thinking: 44.9
#2 GPT-5: 35.2
#3 Grok 4: 25.4
#4 Gemini 2.5 Pro: 21.6
#5 OpenAI o3: 20.32

Speed & Cost Leaders

Fastest Models (Tokens/second)

#1 Llama 4 Scout: 2600
#2 Llama 3.3 70b: 2500
#3 Llama 3.1 70b: 2100
#4 Llama 3.1 8b: 1800
#5 Llama 3.1 405b: 969

Lowest Latency (Time to First Token)

#1 Nova Micro: 0.3s
#2 Llama 3.1 8b: 0.32s
#3 Llama 4 Scout: 0.33s
#4 Gemini 2.0 Flash: 0.34s
#5 GPT-4o mini: 0.35s

Most Affordable (Input Cost per 1M tokens)

#1 Nova Micro: $0.04
#2 Gemma 3 27b: $0.07
#3 Gemini 1.5 Flash: $0.075
#4 GPT oss 20b: $0.08
#5 Gemini 2.0 Flash: $0.1
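The speed and latency figures above combine into a rough end-to-end estimate: total response time is approximately time-to-first-token plus output tokens divided by throughput. A minimal sketch, using the Llama 4 Scout and OpenAI o3 numbers from this page and ignoring network and queuing overhead:

```python
# Rough first-order response-time estimate:
# time-to-first-token + output_tokens / throughput.
def response_time(ttft_s: float, tokens_per_s: float, output_tokens: int) -> float:
    return ttft_s + output_tokens / tokens_per_s

# Figures from the speed/latency data on this page, for a 1,000-token reply.
print(f"Llama 4 Scout: {response_time(0.33, 2600, 1000):.2f} s")  # ~0.71 s
print(f"OpenAI o3:     {response_time(8.0, 942, 1000):.2f} s")    # ~9.06 s
```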

Comprehensive Model Benchmarks

Benchmark Metrics:
- GRIND: Adaptive Reasoning
- AIME: High School Math (competition)
- GPQA: Graduate-Level Reasoning
- SWE Bench: Agentic Coding Tasks
- MATH 500: Mathematical Problem Solving
- BFCL: Tool Use (function calling)
- Aider Polyglot: Multi-Language Coding
| Model | GRIND (%) | AIME (%) | GPQA (%) | SWE Bench (%) | MATH 500 (%) | BFCL (%) | Aider Polyglot (%) |
|---|---|---|---|---|---|---|---|
| Kimi K2 Thinking | — | — | 84.5 | 71.3 | — | — | — |
| GPT 5.1 | — | — | 88.1 | 76.3 | — | — | — |
| Claude Haiku 4.5 | — | — | 73 | 73.3 | — | — | — |
| GPT-5 | — | — | 87.3 | 74.9 | — | — | 88 |
| Claude Opus 4.1 | — | — | 80.9 | 74.5 | — | — | — |
| Grok 4 | — | 94 | 87.5 | 75 | — | — | 79.6 |
| Claude 4 Opus | 67.9 | — | 79.6 | 72.5 | — | — | — |
| Claude 4 Sonnet | 75 | — | 75.4 | 72.7 | — | — | — |
| Gemini 2.5 Flash | — | 88 | 78.3 | 51.1 | — | — | — |
| OpenAI o3 | — | 91.6 | 83.3 | 69.1 | — | — | 81.3 |
| Gemini 2.5 Pro | 82.1 | 92 | 86.4 | 59.6 | — | — | 82.2 |
| Grok 3 [Beta] | — | 93.3 | 84.6 | — | — | — | — |
| DeepSeek-R1 | 53.6 | 79.8 | 71.5 | 49.2 | 97.3 | 57.53 | 64 |
| OpenAI o3-mini | 50 | 87.3 | 79.7 | 61 | 97.9 | 65.12 | 60.4 |
| Claude 3.7 Sonnet [R] | 60.7 | 61.3 | 78.2 | 70.3 | 96.2 | 58.3 | 64.9 |
| OpenAI o1 | 57.1 | 79.2 | 75.7 | 48.9 | 96.4 | 67.87 | 61.7 |
| Llama 3.3 70b | — | — | 50.5 | — | 77 | 77.3 | 51.43 |
| Llama 3.1 405b | — | 23.3 | 49 | — | 73.8 | 81.1 | — |
| GPT-4o | — | 13.4 | 56.1 | 31 | 60.3 | 72.08 | 27.1 |
| Claude 3.5 Sonnet | — | 16 | 65 | 49 | 78 | 56.46 | 51.6 |

Values marked as "—" indicate data not available or not applicable for that benchmark.

Understanding the Metrics:
- Context Window: maximum tokens the model can process
- Input Cost: price per 1M input tokens (USD)
- Output Cost: price per 1M output tokens (USD)
- Speed: tokens generated per second
- Latency: time to first token (seconds)
| Model | Context Window | Input Cost / 1M | Output Cost / 1M | Speed (tok/s) | Latency (s) |
|---|---|---|---|---|---|
| Llama 4 Scout | 10,000,000 | $0.11 | $0.34 | 2600 | 0.33 |
| Gemini 2.0 Flash | 1,000,000 | $0.1 | $0.4 | 2570 | 0.34 |
| Gemini 2.5 Flash | 1,000,000 | $0.15 | $0.6 | 2000 | 0.35 |
| Llama 4 Maverick | 10,000,000 | $0.2 | $0.6 | 1260 | 0.45 |
| Gemma 3 27b | 128,000 | $0.07 | $0.07 | 590 | 0.72 |
| Kimi K2 Thinking | 256,000 | $0.6 | $2.5 | 792 | 5.3 |
| GPT 5.1 | 200,000 | $1.25 | $10 | — | — |
| GPT-5 | 400,000 | $1.25 | $10 | — | — |
| Claude 4 Sonnet | 200,000 | $3 | $15 | — | 1.9 |
| Claude 4 Opus | 200,000 | $15 | $75 | — | 1.95 |
| Claude Opus 4.1 | 200,000 | $15 | $75 | — | — |
| Gemini 2.5 Pro | 1,000,000 | $1.25 | $10 | 1913 | — |
| OpenAI o3 | 200,000 | $10 | $40 | 942 | 8 |
| OpenAI o3-mini | 200,000 | $1.1 | $4.4 | 2141 | 4 |
| DeepSeek-R1 | 128,000 | $0.55 | $2.19 | 924 | 4 |
| Llama 3.3 70b | 128,000 | $0.59 | $0.72 | 5000 | 0.52 |
| GPT-4o | 128,000 | $2.5 | $10 | 1430 | 0.51 |
| GPT-4o mini | 128,000 | $0.15 | $0.6 | 650 | 0.35 |

Costs are in USD per million tokens. Speed measured in tokens per second. Lower latency is better.
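To show how the pricing columns translate into per-request cost, the sketch below estimates the cost of a single call from its input and output token counts. The prices are the GPT-4o and Gemini 2.0 Flash rows from the table above; the token counts are arbitrary example values.

```python
# Cost of one request, given per-1M-token prices.
def request_cost(input_tokens: int, output_tokens: int,
                 input_price_per_m: float, output_price_per_m: float) -> float:
    return (input_tokens * input_price_per_m
            + output_tokens * output_price_per_m) / 1_000_000

# Example: 20,000 input tokens and 1,000 output tokens.
print(f"GPT-4o:           ${request_cost(20_000, 1_000, 2.5, 10.0):.4f}")  # $0.0600
print(f"Gemini 2.0 Flash: ${request_cost(20_000, 1_000, 0.1, 0.4):.4f}")   # $0.0024
```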

Multi-Dimensional Model Comparison

Compare models across multiple benchmark categories by reading their scores side by side in the table above; plotting a few selected models gives a quick view of each one's performance profile, as in the sketch below.
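One common way to draw such a performance profile is a radar (spider) chart. The sketch below plots two of the fully populated rows from the benchmark table (DeepSeek-R1 and OpenAI o3-mini) with matplotlib; the library and styling are illustrative choices, not the implementation behind this page's comparison widget.

```python
import numpy as np
import matplotlib.pyplot as plt

# Benchmark scores copied from the table above (two fully populated rows).
labels = ["GRIND", "AIME", "GPQA", "SWE Bench", "MATH 500", "BFCL", "Aider Polyglot"]
models = {
    "DeepSeek-R1":    [53.6, 79.8, 71.5, 49.2, 97.3, 57.53, 64.0],
    "OpenAI o3-mini": [50.0, 87.3, 79.7, 61.0, 97.9, 65.12, 60.4],
}

# One angle per benchmark; repeat the first point to close each polygon.
angles = np.linspace(0, 2 * np.pi, len(labels), endpoint=False).tolist()
angles += angles[:1]

fig, ax = plt.subplots(subplot_kw={"projection": "polar"})
for name, scores in models.items():
    values = scores + scores[:1]
    ax.plot(angles, values, label=name)
    ax.fill(angles, values, alpha=0.1)

ax.set_xticks(angles[:-1])
ax.set_xticklabels(labels)
ax.set_ylim(0, 100)
ax.legend(loc="lower right")
plt.show()
```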

Key Insights

Performance Leaders

GPT-5 achieves a perfect 100% score on AIME 2025 (high school math), demonstrating exceptional mathematical reasoning. The GPT 5.x family and Grok 4 lead the reasoning (GPQA Diamond) and agentic coding (SWE Bench) rankings, while Claude models take three of the top five spots in adaptive reasoning (GRIND) and remain strong in agentic coding tasks.

Best Value Propositions

Llama 4 Scout offers exceptional value with blazing-fast speeds (2600 tokens/s), ultra-low latency (0.33s), and competitive pricing ($0.11 input / $0.34 output per 1M tokens), all while supporting a massive 10M-token context window. Gemini 2.0 Flash provides an excellent performance-to-cost ratio for production deployments.

Specialized Excellence

Llama models dominate the tool-use benchmark (BFCL), with Llama 3.1 405b achieving 81.1% and Llama 3.3 70b 77.3%. Gemini 2.5 Pro leads in adaptive reasoning (82.1% on GRIND), while Kimi K2 Thinking tops the challenging "Humanity's Last Exam" benchmark with a score of 44.9, showcasing advanced general reasoning capabilities.