Comprehensive AI Model Benchmarks & Performance Metrics
Charts: input cost per 1M tokens vs. GPQA score; maximum context window per model; scatter plot of mathematical vs. tool-use benchmark performance.
| Model | GRIND (%) | AIME (%) | GPQA (%) | SWE Bench (%) | MATH 500 (%) | BFCL (%) | Aider Polyglot (%) |
|---|---|---|---|---|---|---|---|
| Kimi K2 Thinking | — | — | 84.5 | 71.3 | — | — | — |
| GPT 5.1 | — | — | 88.1 | 76.3 | — | — | — |
| Claude Haiku 4.5 | — | — | 73 | 73.3 | — | — | — |
| GPT-5 | — | — | 87.3 | 74.9 | — | — | 88 |
| Claude Opus 4.1 | — | — | 80.9 | 74.5 | — | — | — |
| Grok 4 | — | 94 | 87.5 | 75 | — | — | 79.6 |
| Claude 4 Opus | 67.9 | — | 79.6 | 72.5 | — | — | — |
| Claude 4 Sonnet | 75 | — | 75.4 | 72.7 | — | — | — |
| Gemini 2.5 Flash | — | 88 | 78.3 | — | — | — | 51.1 |
| OpenAI o3 | — | 91.6 | 83.3 | 69.1 | — | — | 81.3 |
| Gemini 2.5 Pro | 82.1 | 92 | 86.4 | 59.6 | — | — | 82.2 |
| Grok 3 [Beta] | — | 93.3 | 84.6 | — | — | — | — |
| DeepSeek-R1 | 53.6 | 79.8 | 71.5 | 49.2 | 97.3 | 57.53 | 64 |
| OpenAI o3-mini | 50 | 87.3 | 79.7 | 61 | 97.9 | 65.12 | 60.4 |
| Claude 3.7 Sonnet [R] | 60.7 | 61.3 | 78.2 | 70.3 | 96.2 | 58.3 | 64.9 |
| OpenAI o1 | 57.1 | 79.2 | 75.7 | 48.9 | 96.4 | 67.87 | 61.7 |
| Llama 3.3 70b | — | — | 50.5 | — | 77 | 77.3 | 51.43 |
| Llama 3.1 405b | — | 23.3 | 49 | — | 73.8 | 81.1 | — |
| GPT-4o | — | 13.4 | 56.1 | 31 | 60.3 | 72.08 | 27.1 |
| Claude 3.5 Sonnet | — | 16 | 65 | 49 | 78 | 56.46 | 51.6 |
Values marked as "—" indicate data not available or not applicable for that benchmark.
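A minimal sketch of how one might rank models on a single benchmark column while skipping the "—" entries is shown below. The SWE Bench values are copied from the table above; the `rank` helper itself is purely illustrative and not part of any benchmark harness.

```python
# Minimal sketch: rank models on one benchmark column, treating "—" as missing.
# SWE Bench values are copied from the table above; the helper is illustrative only.

SWE_BENCH = {
    "GPT 5.1": 76.3,
    "Grok 4": 75.0,
    "GPT-5": 74.9,
    "Claude Opus 4.1": 74.5,
    "Kimi K2 Thinking": 71.3,
    "Gemini 2.5 Pro": 59.6,
    "Gemini 2.5 Flash": None,   # "—" in the table: no published score
    "GPT-4o": 31.0,
}

def rank(scores: dict[str, float | None]) -> list[tuple[str, float]]:
    """Return (model, score) pairs sorted best-first, dropping missing entries."""
    known = [(model, s) for model, s in scores.items() if s is not None]
    return sorted(known, key=lambda pair: pair[1], reverse=True)

for model, score in rank(SWE_BENCH):
    print(f"{model:18s} {score:5.1f}")
```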
| Model | Context Window | Input Cost / 1M | Output Cost / 1M | Speed (tok/s) | Latency (s) |
|---|---|---|---|---|---|
| Llama 4 Scout | 10,000,000 | $0.11 | $0.34 | 2600 | 0.33 |
| Gemini 2.0 Flash | 1,000,000 | $0.1 | $0.4 | 2570 | 0.34 |
| Gemini 2.5 Flash | 1,000,000 | $0.15 | $0.6 | 2000 | 0.35 |
| Llama 4 Maverick | 10,000,000 | $0.2 | $0.6 | 1260 | 0.45 |
| Gemma 3 27b | 128,000 | $0.07 | $0.07 | 590 | 0.72 |
| Kimi K2 Thinking | 256,000 | $0.6 | $2.5 | 792 | 5.3 |
| GPT 5.1 | 200,000 | $1.25 | $10 | — | — |
| GPT-5 | 400,000 | $1.25 | $10 | — | — |
| Claude 4 Sonnet | 200,000 | $3 | $15 | — | 1.9 |
| Claude 4 Opus | 200,000 | $15 | $75 | — | 1.95 |
| Claude Opus 4.1 | 200,000 | $15 | $75 | — | — |
| Gemini 2.5 Pro | 1,000,000 | $1.25 | $10 | 1913 | — |
| OpenAI o3 | 200,000 | $10 | $40 | 942 | 8 |
| OpenAI o3-mini | 200,000 | $1.1 | $4.4 | 2141 | 4 |
| DeepSeek-R1 | 128,000 | $0.55 | $2.19 | 924 | 4 |
| Llama 3.3 70b | 128,000 | $0.59 | $0.72 | 5000 | 0.52 |
| GPT-4o | 128,000 | $2.5 | $10 | 1430 | 0.51 |
| GPT-4o mini | 128,000 | $0.15 | $0.6 | 650 | 0.35 |
Costs are in USD per million tokens. Speed is measured in tokens per second. Lower latency is better.
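Because prices are quoted per million tokens, the cost of a single request is a linear combination of its input and output token counts: cost = input_tokens / 1,000,000 × input_price + output_tokens / 1,000,000 × output_price. The sketch below applies that arithmetic; prices are copied from the table, while the 20k-input / 2k-output workload is an arbitrary example, not a recommendation.

```python
# Minimal sketch of the cost arithmetic implied by the pricing table.
# Prices are USD per 1M tokens, copied from the table above; the workload
# (20k input / 2k output tokens) is an arbitrary example value.

PRICES = {  # model: (input $/1M, output $/1M)
    "Gemini 2.0 Flash": (0.10, 0.40),
    "GPT-5": (1.25, 10.00),
    "Claude 4 Sonnet": (3.00, 15.00),
    "DeepSeek-R1": (0.55, 2.19),
}

def request_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Cost in USD for one request under per-1M-token pricing."""
    in_price, out_price = PRICES[model]
    return input_tokens / 1_000_000 * in_price + output_tokens / 1_000_000 * out_price

for model in PRICES:
    cost = request_cost(model, input_tokens=20_000, output_tokens=2_000)
    print(f"{model:18s} ${cost:.4f} per request")
```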
GPT-5 has been reported to score 100% on AIME 2025 (high school competition math) when tool use is allowed, demonstrating exceptional mathematical reasoning. The GPT 5.x family and Grok 4 lead most of the reasoning and coding benchmarks, while Claude models post strong GRIND (adaptive reasoning) scores and remain highly competitive in agentic coding (SWE Bench).
Llama 4 Scout offers exceptional value: very high throughput (2,600 tokens/s), low latency (0.33 s), competitive pricing ($0.11 input / $0.34 output per 1M tokens), and a massive 10M-token context window. Gemini 2.0 Flash likewise provides an excellent performance-to-cost ratio for production deployments.
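One rough way to quantify performance-to-cost is to divide a quality score by a blended price. The sketch below computes GPQA points per blended dollar for a few models that appear in both tables; the 50/50 input/output blend is an arbitrary assumption for illustration, not a standard metric.

```python
# Minimal sketch of a naive performance-per-dollar comparison, assuming a blended
# price of (input + output) / 2 per 1M tokens. GPQA scores and prices are copied
# from the two tables above; the blending rule is an arbitrary choice.

MODELS = {  # model: (GPQA %, input $/1M, output $/1M)
    "Gemini 2.5 Flash": (78.3, 0.15, 0.60),
    "Gemini 2.5 Pro": (86.4, 1.25, 10.00),
    "GPT-5": (87.3, 1.25, 10.00),
    "DeepSeek-R1": (71.5, 0.55, 2.19),
    "Claude 4 Opus": (79.6, 15.00, 75.00),
}

def gpqa_per_dollar(gpqa: float, in_price: float, out_price: float) -> float:
    """GPQA points per blended dollar per 1M tokens (higher is better)."""
    blended = (in_price + out_price) / 2
    return gpqa / blended

ranked = sorted(MODELS.items(), key=lambda kv: gpqa_per_dollar(*kv[1]), reverse=True)
for model, (gpqa, inp, outp) in ranked:
    print(f"{model:18s} {gpqa_per_dollar(gpqa, inp, outp):8.1f} GPQA pts per blended $")
```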
Llama models lead the BFCL tool-use benchmark, with Llama 3.1 405b reaching 81.1%. Gemini 2.5 Pro leads adaptive reasoning (82.1% on GRIND), while Kimi K2 Thinking reports a score of 44.9% on the challenging Humanity's Last Exam benchmark, showcasing advanced general reasoning capabilities.