Modality: llm_chat, modality_arm · full deep dive — every ranked model, test result, and artifact.
Ranked by confidence-adjusted score (single/zero-sample, non-curated scores floored; curated empirical scores trusted as-is).
| # | Model | Provider | Adj. Score | Raw | Evidence |
|---|---|---|---|---|---|
| 1 | deepseek-chat | deepseek | 0.976 | 1.000 | n=205 |
| 2 | claude-haiku-4-5 | anthropic | 0.850 | 0.850 | curated |
| 3 | gemini-2.5-flash | 0.820 | 0.820 | curated | |
| 4 | gpt-4o-mini | openai | 0.800 | 0.800 | curated |
| 5 | deberta-v3-large | huggingface | 0.750 | 0.750 | curated |
| 6 | distilbert-multilingual | huggingface | 0.700 | 0.700 | curated |
No benchmark outputs recorded for this niche yet.