Modality: llm_chat · full deep dive covering every ranked model, test result, and artifact.
Ranked by confidence-adjusted score: non-curated scores backed by a single sample or no samples are floored, while curated empirical scores are trusted as-is.
| # | Model | Provider | Adj. Score | Raw Score | Evidence (sample count) |
|---|---|---|---|---|---|
| 1 | gpt-4o | openai | 0.995 | 1.000 | n=977 |
| 2 | deepseek-chat | deepseek | 0.556 | 0.556 | n=17551 |
| 3 | claude-haiku-4-5-20251001 | anthropic | 0.426 | 0.426 | n=13591 |
| 4 | gemini-2.5-flash | google_gemini | 0.063 | 0.063 | n=8321 |
| 5 | gemini-2.5-pro | google_gemini | 0.017 | 0.017 | n=8006 |
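
The ranking rule stated above the table can be sketched roughly as follows. This is a minimal illustration, not the actual scoring code: the names `Entry`, `adjusted_score`, `rank`, `FLOOR`, and `MIN_SAMPLES` are hypothetical, and the floor value and sample threshold are assumptions, since the report does not state them. The sketch also does not model any adjustment applied to well-sampled scores, so it would not reproduce the 1.000 to 0.995 change shown for gpt-4o.

```python
from dataclasses import dataclass

# Assumed values: the report gives neither the floor nor the exact threshold.
FLOOR = 0.0        # score assigned to low-evidence, non-curated entries
MIN_SAMPLES = 2    # "single/zero-sample" is read here as n < 2


@dataclass
class Entry:
    model: str
    raw: float
    n: int
    curated: bool = False


def adjusted_score(entry: Entry) -> float:
    """Floor non-curated scores with fewer than MIN_SAMPLES samples; keep the rest as-is."""
    if not entry.curated and entry.n < MIN_SAMPLES:
        return FLOOR
    return entry.raw


def rank(entries: list[Entry]) -> list[Entry]:
    """Order entries from highest to lowest adjusted score."""
    return sorted(entries, key=adjusted_score, reverse=True)


# Two rows taken from the table, plus one made-up single-sample entry.
entries = [
    Entry("gemini-2.5-pro", raw=0.017, n=8006),
    Entry("hypothetical-new-model", raw=0.900, n=1),  # illustrative only
    Entry("deepseek-chat", raw=0.556, n=17551),
]

for position, entry in enumerate(rank(entries), start=1):
    print(position, entry.model, adjusted_score(entry), f"n={entry.n}")
```

Under these assumptions, the made-up single-sample entry drops below both well-sampled models despite its higher raw score, which is the intended effect of flooring low-evidence results.
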
No benchmark outputs recorded for this niche yet.