Modality: llm_chat · full deep dive — every ranked model, test result, and artifact.
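Ranked by confidence-adjusted score (Adj. Score): non-curated scores backed by one or zero samples are floored, and curated empirical scores are trusted as-is. Raw is the unadjusted score; Evidence gives the sample count n, or marks the row as curated or provisional.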
| # | Model | Provider | Adj. Score | Raw | Evidence |
|---|---|---|---|---|---|
| 1 | deepseek-chat | deepseek | 0.642 | 0.998 | n=9 |
| 2 | gpt-4o | openai | 0.456 | 0.456 | curated |
| 3 | gemini-2.5-flash | google_gemini | 0.347 | 0.407 | n=29 |
| 4 | claude-sonnet-4-5-20250929 | anthropic | 0.332 | 0.332 | n=11550 |
| 5 | claude-sonnet-4-6 | anthropic | 0.292 | 0.292 | curated |
| 6 | claude-haiku-4-5-20251001 | anthropic | 0.277 | 0.315 | n=37 |
| 7 | claude-opus-4-6 | anthropic | 0.239 | 0.239 | curated |
| 8 | gemini-2.5-pro | google_gemini | 0.017 | 0.017 | n=8006 |
| 9 | o3 | openai | 0.000 | 0.000 | provisional (n≤1) |
| 10 | o3-mini | openai | 0.000 | 0.000 | provisional (n≤1) |
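
The report does not state how the adjusted score is derived from the raw score. As a minimal sketch, the sampled rows above are numerically consistent with a shrinkage of the form `raw * n / (n + 5)`. The constant 5, the function name `adjusted_score`, and the `floor` value for provisional rows are all assumptions inferred from the table, not a documented formula.

```python
from typing import Optional

# Assumption: inferred by fitting the sampled rows in the table, not stated in the report.
PRIOR_WEIGHT = 5


def adjusted_score(raw: float, n: Optional[int], curated: bool = False,
                   floor: float = 0.0) -> float:
    """Hypothetical confidence adjustment consistent with the rows above."""
    if curated:
        # Curated empirical scores are trusted as-is (stated in the report).
        return raw
    if n is None or n <= 1:
        # Provisional rows are floored; the floor value itself is an assumption.
        return floor
    # Shrink small-sample scores toward zero; large n leaves raw nearly unchanged.
    return raw * n / (n + PRIOR_WEIGHT)


# (model, raw, n, adjusted score as listed in the table)
rows = [
    ("deepseek-chat", 0.998, 9, 0.642),
    ("gemini-2.5-flash", 0.407, 29, 0.347),
    ("claude-sonnet-4-5-20250929", 0.332, 11550, 0.332),
    ("claude-haiku-4-5-20251001", 0.315, 37, 0.277),
    ("gemini-2.5-pro", 0.017, 8006, 0.017),
]

for name, raw, n, listed in rows:
    print(f"{name}: computed {adjusted_score(raw, n):.3f}, listed {listed:.3f}")
```

Under this assumed formula every sampled row matches its listed value to within 0.001, which is the slack expected given that the Raw column is itself rounded to three decimals. It also explains why deepseek-chat (n=9) drops sharply from 0.998 while claude-sonnet-4-5-20250929 (n=11550) is effectively unchanged.
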
No benchmark outputs recorded for this niche yet.