Comprehensive AI Model Benchmarks & Performance Metrics
Charts: input cost per 1M tokens vs. GPQA score; maximum context window per model; scatter plot of mathematical vs. tool-use benchmark performance.
| Model | GRIND (%) | AIME (%) | GPQA (%) | SWE Bench (%) | MATH 500 (%) | BFCL (%) | Aider Polyglot (%) |
|---|---|---|---|---|---|---|---|
| Kimi K2 Thinking | — | — | 84.5 | 71.3 | — | — | — |
| GPT 5.1 | — | — | 88.1 | 76.3 | — | — | — |
| Claude Haiku 4.5 | — | — | 73 | 73.3 | — | — | — |
| GPT-5 | — | — | 87.3 | 74.9 | — | — | 88 |
| Claude Opus 4.1 | — | — | 80.9 | 74.5 | — | — | — |
| Grok 4 | — | 94 | 87.5 | 75 | — | — | 79.6 |
| Claude 4 Opus | 67.9 | — | 79.6 | 72.5 | — | — | — |
| Claude 4 Sonnet | 75 | — | 75.4 | 72.7 | — | — | — |
| Gemini 2.5 Flash | — | 88 | 78.3 | — | — | — | 51.1 |
| OpenAI o3 | — | 91.6 | 83.3 | 69.1 | — | — | 81.3 |
| Gemini 2.5 Pro | 82.1 | 92 | 86.4 | 59.6 | — | — | 82.2 |
| Grok 3 [Beta] | — | 93.3 | 84.6 | — | — | — | — |
| DeepSeek-R1 | 53.6 | 79.8 | 71.5 | 49.2 | 97.3 | 57.53 | 64 |
| OpenAI o3-mini | 50 | 87.3 | 79.7 | 61 | 97.9 | 65.12 | 60.4 |
| Claude 3.7 Sonnet [R] | 60.7 | 61.3 | 78.2 | 70.3 | 96.2 | 58.3 | 64.9 |
| OpenAI o1 | 57.1 | 79.2 | 75.7 | 48.9 | 96.4 | 67.87 | 61.7 |
| Llama 3.3 70b | — | — | 50.5 | — | 77 | 77.3 | 51.43 |
| Llama 3.1 405b | — | 23.3 | 49 | — | 73.8 | 81.1 | — |
| GPT-4o | — | 13.4 | 56.1 | 31 | 60.3 | 72.08 | 27.1 |
| Claude 3.5 Sonnet | — | 16 | 65 | 49 | 78 | 56.46 | 51.6 |
Values marked as "—" indicate data not available or not applicable for that benchmark.
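A minimal sketch of how one might rank models on a single benchmark column while skipping the "—" entries is shown below. The SWE Bench values are copied from the table above; the `rank` helper itself is purely illustrative and not part of any benchmark harness.

```python
# Minimal sketch: rank models on one benchmark column, treating "—" as missing.
# SWE Bench values are copied from the table above; the helper is illustrative only.

SWE_BENCH = {
    "GPT 5.1": 76.3,
    "Grok 4": 75.0,
    "GPT-5": 74.9,
    "Claude Opus 4.1": 74.5,
    "Kimi K2 Thinking": 71.3,
    "Gemini 2.5 Pro": 59.6,
    "Gemini 2.5 Flash": None,   # "—" in the table: no published score
    "GPT-4o": 31.0,
}

def rank(scores: dict[str, float | None]) -> list[tuple[str, float]]:
    """Return (model, score) pairs sorted best-first, dropping missing entries."""
    known = [(model, s) for model, s in scores.items() if s is not None]
    return sorted(known, key=lambda pair: pair[1], reverse=True)

for model, score in rank(SWE_BENCH):
    print(f"{model:18s} {score:5.1f}")
```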
| Model | Context Window | Input Cost / 1M | Output Cost / 1M | Speed (tok/s) | Latency (s) |
|---|---|---|---|---|---|
| Llama 4 Scout | 10,000,000 | $0.11 | $0.34 | 2600 | 0.33 |
| Gemini 2.0 Flash | 1,000,000 | $0.1 | $0.4 | 2570 | 0.34 |
| Gemini 2.5 Flash | 1,000,000 | $0.15 | $0.6 | 2000 | 0.35 |
| Llama 4 Maverick | 10,000,000 | $0.2 | $0.6 | 1260 | 0.45 |
| Gemma 3 27b | 128,000 | $0.07 | $0.07 | 590 | 0.72 |
| Kimi K2 Thinking | 256,000 | $0.6 | $2.5 | 792 | 5.3 |
| GPT 5.1 | 200,000 | $1.25 | $10 | — | — |
| GPT-5 | 400,000 | $1.25 | $10 | — | — |
| Claude 4 Sonnet | 200,000 | $3 | $15 | — | 1.9 |
| Claude 4 Opus | 200,000 | $15 | $75 | — | 1.95 |
| Claude Opus 4.1 | 200,000 | $15 | $75 | — | — |
| Gemini 2.5 Pro | 1,000,000 | $1.25 | $10 | 1913 | — |
| OpenAI o3 | 200,000 | $10 | $40 | 942 | 8 |
| OpenAI o3-mini | 200,000 | $1.1 | $4.4 | 2141 | 4 |
| DeepSeek-R1 | 128,000 | $0.55 | $2.19 | 924 | 4 |
| Llama 3.3 70b | 128,000 | $0.59 | $0.72 | 5000 | 0.52 |
| GPT-4o | 128,000 | $2.5 | $10 | 1430 | 0.51 |
| GPT-4o mini | 128,000 | $0.15 | $0.6 | 650 | 0.35 |
Costs are in USD per million tokens. Speed is measured in tokens per second. Lower latency is better.
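Because prices are quoted per million tokens, the cost of a single request is a linear combination of its input and output token counts: cost = input_tokens / 1,000,000 × input_price + output_tokens / 1,000,000 × output_price. The sketch below applies that arithmetic; prices are copied from the table, while the 20k-input / 2k-output workload is an arbitrary example, not a recommendation.

```python
# Minimal sketch of the cost arithmetic implied by the pricing table.
# Prices are USD per 1M tokens, copied from the table above; the workload
# (20k input / 2k output tokens) is an arbitrary example value.

PRICES = {  # model: (input $/1M, output $/1M)
    "Gemini 2.0 Flash": (0.10, 0.40),
    "GPT-5": (1.25, 10.00),
    "Claude 4 Sonnet": (3.00, 15.00),
    "DeepSeek-R1": (0.55, 2.19),
}

def request_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Cost in USD for one request under per-1M-token pricing."""
    in_price, out_price = PRICES[model]
    return input_tokens / 1_000_000 * in_price + output_tokens / 1_000_000 * out_price

for model in PRICES:
    cost = request_cost(model, input_tokens=20_000, output_tokens=2_000)
    print(f"{model:18s} ${cost:.4f} per request")
```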
GPT-5 has been reported to score 100% on AIME 2025 (high school competition math) when tool use is allowed, demonstrating exceptional mathematical reasoning. The GPT 5.x family and Grok 4 lead most of the reasoning and coding benchmarks, while Claude models post strong GRIND (adaptive reasoning) scores and remain highly competitive in agentic coding (SWE Bench).
Llama 4 Scout offers exceptional value: very high throughput (2,600 tokens/s), low latency (0.33 s), competitive pricing ($0.11 input / $0.34 output per 1M tokens), and a massive 10M-token context window. Gemini 2.0 Flash likewise provides an excellent performance-to-cost ratio for production deployments.
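One rough way to quantify performance-to-cost is to divide a quality score by a blended price. The sketch below computes GPQA points per blended dollar for a few models that appear in both tables; the 50/50 input/output blend is an arbitrary assumption for illustration, not a standard metric.

```python
# Minimal sketch of a naive performance-per-dollar comparison, assuming a blended
# price of (input + output) / 2 per 1M tokens. GPQA scores and prices are copied
# from the two tables above; the blending rule is an arbitrary choice.

MODELS = {  # model: (GPQA %, input $/1M, output $/1M)
    "Gemini 2.5 Flash": (78.3, 0.15, 0.60),
    "Gemini 2.5 Pro": (86.4, 1.25, 10.00),
    "GPT-5": (87.3, 1.25, 10.00),
    "DeepSeek-R1": (71.5, 0.55, 2.19),
    "Claude 4 Opus": (79.6, 15.00, 75.00),
}

def gpqa_per_dollar(gpqa: float, in_price: float, out_price: float) -> float:
    """GPQA points per blended dollar per 1M tokens (higher is better)."""
    blended = (in_price + out_price) / 2
    return gpqa / blended

ranked = sorted(MODELS.items(), key=lambda kv: gpqa_per_dollar(*kv[1]), reverse=True)
for model, (gpqa, inp, outp) in ranked:
    print(f"{model:18s} {gpqa_per_dollar(gpqa, inp, outp):8.1f} GPQA pts per blended $")
```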
Llama models lead the BFCL tool-use benchmark, with Llama 3.1 405b reaching 81.1%. Gemini 2.5 Pro leads adaptive reasoning (82.1% on GRIND), while Kimi K2 Thinking reports a score of 44.9% on the challenging Humanity's Last Exam benchmark, showcasing advanced general reasoning capabilities.