Modality: llm_chat · full deep dive: every ranked model, test result, and artifact.
Models are ranked by confidence-adjusted score: non-curated scores backed by one or zero samples are floored, while curated empirical scores are trusted as-is.

| # | Model | Provider | Adj. Score | Raw | Evidence |
|---|---|---|---|---|---|
| 1 | deepseek-chat | deepseek | 0.720 | 0.900 | n=20 |
| 2 | gpt-4o-mini | openai | 0.720 | 0.900 | n=20 |

No benchmark outputs have been recorded for this niche yet.
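
The ranking rule stated above the table can be sketched roughly as follows. This is only an illustration under loud assumptions: the report does not give the floor value or the adjustment applied to multi-sample scores, so `FLOOR_SCORE`, `MULTI_SAMPLE_FACTOR`, the `Entry` type, and the tie-break by model name are all hypothetical. The factor 0.8 is chosen solely because 0.900 × 0.8 reproduces the 0.720 shown for both n=20 rows, and "floored" is read here as "replaced by a fixed low value".

```python
from dataclasses import dataclass

# Hypothetical constants: the report does not state either value.
FLOOR_SCORE = 0.5          # assumed score given to single/zero-sample, non-curated entries
MULTI_SAMPLE_FACTOR = 0.8  # assumed; picked only because 0.900 * 0.8 matches the table


@dataclass(frozen=True)
class Entry:
    """One leaderboard row (hypothetical structure)."""
    model: str
    provider: str
    raw: float
    n: int                 # number of samples behind the raw score
    curated: bool = False  # curated empirical scores are trusted as-is


def adjusted_score(entry: Entry) -> float:
    """Return a confidence-adjusted score under the assumptions above."""
    if entry.curated:
        return entry.raw
    if entry.n <= 1:
        # Assumed reading of "floored": too little evidence, use the floor.
        return FLOOR_SCORE
    return round(entry.raw * MULTI_SAMPLE_FACTOR, 3)


def rank(entries: list[Entry]) -> list[Entry]:
    """Sort by adjusted score, highest first; ties broken by model name (assumed)."""
    return sorted(entries, key=lambda e: (-adjusted_score(e), e.model))


rows = [
    Entry("gpt-4o-mini", "openai", 0.900, 20),
    Entry("deepseek-chat", "deepseek", 0.900, 20),
]
ranked = rank(rows)
```

With the two rows from the table, both entries receive the same adjusted score, so their relative order depends entirely on the tie-break, which the report does not specify.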