AI Model Benchmarks

Updated 2026-06-07 22:32 UTC

Quality Leaderboard
#   Model                      Overall  Factual  Reasoning  Format  Hardware          When
1   llama3.2:3b (ollama)       93%      93%      90%        100%    Apple M4 · 16 GB  2026-06-07 22:05
2   gemma2:9b (ollama)         91%      87%      90%        100%    Apple M4 · 16 GB  2026-06-07 21:36
3   qwen2.5-coder:7b (ollama)  91%      87%      90%        100%    Apple M4 · 16 GB  2026-06-07 19:35
4   qwen2.5:7b (ollama)        91%      87%      90%        100%    Apple M4 · 16 GB  2026-06-07 21:40
5   mistral:7b (ollama)        91%      87%      90%        100%    Apple M4 · 16 GB  2026-06-07 21:45
6   phi3.5:3.8b (ollama)       87%      87%      90%        80%     Apple M4 · 16 GB  2026-06-07 21:55
7   phi4:14b (ollama)          85%      73%      90%        100%    Apple M4 · 16 GB  2026-06-07 18:50
8   codellama:13b (ollama)     80%      80%      70%        100%    Apple M4 · 16 GB  2026-06-07 20:32
9   llama3.1:8b (ollama)       79%      87%      60%        100%    Apple M4 · 16 GB  2026-06-07 21:50
10  gemma4:latest (ollama)     56%      60%      30%        100%    Apple M4 · 16 GB  2026-06-07 22:08
11  deepseek-r1:7b (ollama)    17%      13%      0%         60%     Apple M4 · 16 GB  2026-06-07 19:28
12  deepseek-r1:14b (ollama)   13%      13%      0%         40%     Apple M4 · 16 GB  2026-06-07 19:05
13  qwen3.6:latest (ollama)    0%       0%       0%         0%                        2026-06-07 18:04
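
The page does not say how Overall is derived from the three sub-scores. The sketch below checks one hypothesis against the leaderboard rows: a weighted mean with weights 0.4 (Factual), 0.4 (Reasoning), and 0.2 (Format), rounded to a whole percent. These weights are an assumption inferred from the published numbers, not a documented formula, and `overall_score` is a hypothetical helper name.

```python
# Assumed weights in tenths (Factual, Reasoning, Format) = (0.4, 0.4, 0.2).
# Inferred from the leaderboard numbers; the page does not document them.
W_FACTUAL, W_REASONING, W_FORMAT = 4, 4, 2


def overall_score(factual: int, reasoning: int, fmt: int) -> int:
    """Weighted mean of three percentage sub-scores, rounded to a whole percent."""
    weighted_sum = W_FACTUAL * factual + W_REASONING * reasoning + W_FORMAT * fmt
    return round(weighted_sum / 10)


# (model, overall, factual, reasoning, format) copied from the leaderboard.
LEADERBOARD = [
    ("llama3.2:3b", 93, 93, 90, 100),
    ("gemma2:9b", 91, 87, 90, 100),
    ("qwen2.5-coder:7b", 91, 87, 90, 100),
    ("qwen2.5:7b", 91, 87, 90, 100),
    ("mistral:7b", 91, 87, 90, 100),
    ("phi3.5:3.8b", 87, 87, 90, 80),
    ("phi4:14b", 85, 73, 90, 100),
    ("codellama:13b", 80, 80, 70, 100),
    ("llama3.1:8b", 79, 87, 60, 100),
    ("gemma4:latest", 56, 60, 30, 100),
    ("deepseek-r1:7b", 17, 13, 0, 60),
    ("deepseek-r1:14b", 13, 13, 0, 40),
    ("qwen3.6:latest", 0, 0, 0, 0),
]

# Collect any rows where the hypothesis disagrees with the published Overall.
mismatches = [
    (model, published, overall_score(factual, reasoning, fmt))
    for model, published, factual, reasoning, fmt in LEADERBOARD
    if overall_score(factual, reasoning, fmt) != published
]

if __name__ == "__main__":
    print("rows that disagree with the assumed weighting:", mismatches)
```

Agreement on these rows does not confirm the weighting, since other weightings or a per-prompt tally could round to the same figures.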
Sub-scores — Latest per Model
Overall Score — Run History
All Quality Runs
When              Model                      Overall  Factual  Reasoning  Format
2026-06-07 17:55  gemma4:latest (ollama)     61%      73%      40%        80%
2026-06-07 18:04  qwen3.6:latest (ollama)    0%       0%       0%         0%
2026-06-07 18:50  phi4:14b (ollama)          85%      73%      90%        100%
2026-06-07 19:05  deepseek-r1:14b (ollama)   13%      13%      0%         40%
2026-06-07 19:28  deepseek-r1:7b (ollama)    17%      13%      0%         60%
2026-06-07 19:35  qwen2.5-coder:7b (ollama)  91%      87%      90%        100%
2026-06-07 20:32  codellama:13b (ollama)     80%      80%      70%        100%
2026-06-07 21:36  gemma2:9b (ollama)         91%      87%      90%        100%
2026-06-07 21:40  qwen2.5:7b (ollama)        91%      87%      90%        100%
2026-06-07 21:45  mistral:7b (ollama)        91%      87%      90%        100%
2026-06-07 21:50  llama3.1:8b (ollama)       79%      87%      60%        100%
2026-06-07 21:55  phi3.5:3.8b (ollama)       87%      87%      90%        80%
2026-06-07 22:05  llama3.2:3b (ollama)       93%      93%      90%        100%
2026-06-07 22:08  gemma4:latest (ollama)     56%      60%      30%        100%
Per-prompt Pass/Fail

No per-prompt data.

Speed — Tokens per Second
Speed — Time to First Token (ms)
Speed Summary
Model                      Short TPS  Medium TPS  Long TPS  Short TTFT
llama3.2:3b (ollama)       43.9       38.8        37.8      540.3 ms
phi3.5:3.8b (ollama)       31.5       24.5        22.7      1121.3 ms
gemma4:latest (ollama)     26.7       26.2        26.3      12526.3 ms
qwen2.5-coder:7b (ollama)  23.3       21.9        22.1      445.7 ms
qwen2.5:7b (ollama)        22.8       22.4        22.3      443.7 ms
mistral:7b (ollama)        22.7       21.8        21.9      332.0 ms
deepseek-r1:7b (ollama)    20.0       19.6        19.5      4944.3 ms
gemma2:9b (ollama)         18.9       17.5        17.3      444.7 ms
llama3.1:8b (ollama)       18.7       16.8        16.5      806.7 ms
phi4:14b (ollama)          9.9        9.3         9.1       968.0 ms
deepseek-r1:14b (ollama)   8.6        8.2         6.0       12322.7 ms
codellama:13b (ollama)     0.3        0.0         0.0       4630.7 ms
Quality vs Speed
Domain Suites

No domain results. Run bench-domain to generate data.