Modality: vision_qa · full deep dive — every ranked model, test result, and artifact.
Ranked by confidence-adjusted score. Non-curated scores backed by zero or one sample are floored; curated empirical scores are trusted as-is.

| # | Model | Provider | Adj. Score | Raw Score | Evidence |
|---|---|---|---|---|---|
| 1 | openai/gpt-4o-mini | openai | 0.150 | 1.000 | provisional (n≤1) |
| 2 | openai/gpt-4o | openai | 0.150 | 1.000 | provisional (n≤1) |
| 3 | openai/gpt-4.1 | openai | 0.150 | 1.000 | provisional (n≤1) |
| 4 | openai/gpt-5.1 | openai | 0.150 | 1.000 | provisional (n≤1) |
| 5 | openai/gpt-5-nano | openai | 0.150 | 1.000 | provisional (n≤1) |
| 6 | google/gemini-2.5-flash | google_gemini | 0.150 | 1.000 | provisional (n≤1) |
| 7 | xai/grok-4 | xai | 0.150 | 1.000 | provisional (n≤1) |
| 8 | anthropic/claude-haiku-4-5-20251001 | anthropic | 0.000 | 0.000 | provisional (n≤1) |
| 9 | anthropic/claude-sonnet-4-6 | anthropic | 0.000 | 0.000 | provisional (n≤1) |
| 10 | anthropic/claude-opus-4-7 | anthropic | 0.000 | 0.000 | provisional (n≤1) |
| 11 | google/gemini-2.5-pro | google_gemini | 0.000 | 0.000 | provisional (n≤1) |

No benchmark outputs recorded for this niche yet.
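
The confidence adjustment behind the ranking can be sketched as follows. This is a minimal illustration and not the report's actual implementation: the `ModelScore` type, the `adjusted_score` and `rank` helpers, and the `PROVISIONAL_CAP` value of 0.15 are all assumptions, inferred only from the table (raw 1.000 becomes 0.150 and raw 0.000 stays 0.000 when n≤1). Leaving multi-sample, non-curated scores unadjusted is likewise an assumption.

```python
from dataclasses import dataclass

# Assumed cap for provisional scores; inferred from the table, not documented.
PROVISIONAL_CAP = 0.15


@dataclass(frozen=True)
class ModelScore:
    """Hypothetical record for one leaderboard row."""
    model: str
    raw: float          # raw score in [0, 1]
    n_samples: int      # number of recorded test samples
    curated: bool = False


def adjusted_score(entry: ModelScore, cap: float = PROVISIONAL_CAP) -> float:
    """Return the confidence-adjusted score for one entry.

    Curated scores are trusted as-is. Non-curated scores backed by at most
    one sample are limited to `cap` (assumed behaviour).
    """
    if entry.curated or entry.n_samples > 1:
        return entry.raw
    return min(entry.raw, cap)


def rank(entries: list[ModelScore]) -> list[ModelScore]:
    """Sort by adjusted score, highest first; ties keep their input order."""
    return sorted(entries, key=adjusted_score, reverse=True)


if __name__ == "__main__":
    rows = [
        ModelScore("openai/gpt-4o-mini", raw=1.0, n_samples=1),
        ModelScore("google/gemini-2.5-pro", raw=0.0, n_samples=1),
    ]
    for position, row in enumerate(rank(rows), start=1):
        print(position, row.model, f"{adjusted_score(row):.3f}")
```

A multiplicative penalty (raw × 0.15) would reproduce the same table values, so the data shown here cannot distinguish the two rules. Because Python's `sorted` is stable, entries with identical adjusted scores keep their input order, which may be why seven models with the same 0.150 score still receive distinct rank numbers.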