Modality: llm_chat · full deep dive — every ranked model, test result, and artifact.
Ranked by confidence-adjusted score (single/zero-sample, non-curated scores floored; curated empirical scores trusted as-is).
| # | Model | Provider | Adj. Score | Raw | Evidence |
|---|---|---|---|---|---|
| 1 | claude-sonnet-4-6 | anthropic | 0.920 | 0.920 | curated |
| 2 | claude-opus-4-6 | anthropic | 0.840 | 0.840 | curated |
| 3 | deepseek-reasoner | deepseek | 0.700 | 0.700 | curated |
No benchmark outputs recorded for this niche yet.
Requests semantically routed into this niche.
| Requested | Method | Confidence | When |
|---|---|---|---|
code_cleanup | semantic | 0.623 | 2026-05-20 17:27 |
code_refactor | semantic | 0.576 | 2026-05-20 17:27 |
code_refactor | semantic | 0.576 | 2026-05-20 17:26 |
code_refactor | semantic | 0.576 | 2026-05-20 17:26 |
code_cleanup | lexical | 0.775 | 2026-05-20 17:23 |
code_writing | lexical | 0.775 | 2026-05-20 17:23 |
code_refactor | lexical | 0.775 | 2026-05-20 17:23 |