feature_request

Modality: llm_chat · full deep dive — every ranked model, test result, and artifact.

Ranked Models

Models are ranked by confidence-adjusted score: non-curated scores backed by one or zero samples are floored, while curated empirical scores are trusted as-is.

#	Model	Provider	Adj. Score	Raw Score	Evidence (samples)
1	gpt-4o	openai	0.995	1.000	n=977
2	deepseek-chat	deepseek	0.556	0.556	n=17551
3	gpt-5.5::think=fast	codex	0.525	0.963	n=6
4	gpt-5.5::think=deep	codex	0.525	0.963	n=6
5	claude-haiku-4-5-20251001	anthropic	0.426	0.426	n=13591
6	gpt-5.4::think=fast	codex	0.334	0.612	n=6
7	gpt-5.4::think=deep	codex	0.334	0.612	n=6
8	codex-auto-review	codex	0.286	0.525	n=6
9	gemini-2.5-flash	google_gemini	0.063	0.063	n=8321
10	gpt-5.4	codex	0.000	0.000	n=50
11	gpt-5.5	codex	0.000	0.000	n=50
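
The page does not state how the adjusted score is derived from the raw score. As a rough sketch only, the listed values happen to be consistent with a simple shrinkage of the form raw × n / (n + 5), where n is the sample count. The constant 5, the function name `adjusted_score`, and the formula itself are assumptions inferred from the table rows, not a documented rule, and the sketch does not model the flooring or curated-score exemption described above.

```python
# Hypothetical reconstruction: the source does not give the adjustment formula.
# Assumption: adj = raw * n / (n + K), with K = 5 inferred from the table rows.
K = 5  # assumed pseudo-count, not stated anywhere in the source


def adjusted_score(raw: float, n: int, k: int = K) -> float:
    """Shrink a raw score toward zero when few samples back it."""
    if n <= 0:
        return 0.0
    return raw * n / (n + k)


# (model, raw score, sample count, adjusted score as listed in the table)
ROWS = [
    ("gpt-4o", 1.000, 977, 0.995),
    ("deepseek-chat", 0.556, 17551, 0.556),
    ("gpt-5.5::think=fast", 0.963, 6, 0.525),
    ("claude-haiku-4-5-20251001", 0.426, 13591, 0.426),
    ("gpt-5.4::think=fast", 0.612, 6, 0.334),
    ("codex-auto-review", 0.525, 6, 0.286),
    ("gemini-2.5-flash", 0.063, 8321, 0.063),
    ("gpt-5.4", 0.000, 50, 0.000),
]

for model, raw, n, listed in ROWS:
    computed = adjusted_score(raw, n)
    print(f"{model:28s} computed={computed:.3f} listed={listed:.3f}")
```

Under this assumed formula, a model with only six samples keeps 6/11 of its raw score, which would explain why the n=6 entries rank below deepseek-chat despite higher raw scores.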

No benchmark outputs recorded for this niche yet.