TAAS Niche Catalog

Every capability niche TAAS routes and scores (text, code, image, video, audio, music, embedding, and vision), with models deduplicated and ranked by confidence-adjusted score, each linking to a full deep dive.
Snapshot 2026-05-21 18:03 UTC · auto-refreshed · single/zero-sample non-curated scores are floored; curated empirical scores trusted.
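The adjusted scores in this snapshot can be reproduced from the raw score and the evidence column under a simple shrinkage rule. The sketch below is inferred from the displayed numbers only and is not a documented TAAS formula: the constants `PRIOR_WEIGHT = 5` and `PROVISIONAL_FACTOR = 0.15`, the `Entry` type, and the keep-the-best-duplicate policy in `rank` are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Optional

# Assumption: both constants are inferred from the numbers shown on this page,
# not taken from any TAAS documentation.
PRIOR_WEIGHT = 5           # e.g. raw 1.000 with n=35 is displayed as 0.875 = 35 / (35 + 5)
PROVISIONAL_FACTOR = 0.15  # e.g. raw 0.800 with "prov. n≤1" is displayed as 0.120


@dataclass
class Entry:
    """Hypothetical row type: one model's evidence within a niche."""
    model: str
    raw: float
    n: Optional[int] = None  # sample count; None when the score is curated
    curated: bool = False


def adjusted_score(entry: Entry) -> float:
    """Return a confidence-adjusted score for one row."""
    if entry.curated:
        return entry.raw                        # curated scores pass through unchanged
    n = entry.n or 0
    if n <= 1:
        return PROVISIONAL_FACTOR * entry.raw   # single/zero-sample scores are floored
    return entry.raw * n / (n + PRIOR_WEIGHT)   # shrink toward zero when n is small


def rank(entries):
    """Deduplicate by model name (assumed policy: keep the best score) and sort."""
    best = {}
    for entry in entries:
        score = adjusted_score(entry)
        if score > best.get(entry.model, -1.0):
            best[entry.model] = score
    pairs = ((model, round(score, 3)) for model, score in best.items())
    return sorted(pairs, key=lambda pair: pair[1], reverse=True)


rows = [
    Entry("gpt-4o", 0.456, curated=True),
    Entry("o3", 0.000, n=1),
    Entry("deepseek-chat", 0.998, n=9),
    Entry("gemini-2.5-flash", 0.407, n=29),
]
print(rank(rows))
# [('deepseek-chat', 0.642), ('gpt-4o', 0.456), ('gemini-2.5-flash', 0.347), ('o3', 0.0)]
```

With these assumed constants the sketch matches the analysis rows listed in this catalog. Rows with very large sample counts can differ in the last digit because the displayed raw scores are already rounded.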
Text / Chat / Code (LLM) · 39 niches

analysis · 10 models · top: deepseek-chat (deepseek) 0.642 n=9

| # | Model | Provider | Adj. Score | Raw | Evidence |
|---|---|---|---|---|---|
| 1 | deepseek-chat | deepseek | 0.642 | 0.998 | n=9 |
| 2 | gpt-4o | openai | 0.456 | 0.456 | curated |
| 3 | gemini-2.5-flash | google_gemini | 0.347 | 0.407 | n=29 |
| 4 | claude-sonnet-4-5-20250929 | anthropic | 0.332 | 0.332 | n=11550 |
| 5 | claude-sonnet-4-6 | anthropic | 0.292 | 0.292 | curated |
| 6 | claude-haiku-4-5-20251001 | anthropic | 0.277 | 0.315 | n=37 |
| 7 | claude-opus-4-6 | anthropic | 0.239 | 0.239 | curated |
| 8 | gemini-2.5-pro | google_gemini | 0.017 | 0.017 | n=8006 |
| 9 | o3 | openai | 0.000 | 0.000 | prov. n≤1 |
| 10 | o3-mini | openai | 0.000 | 0.000 | prov. n≤1 |
Full deep dive: all 10 models, every test result & artifact →

bug_fix · 30 models · top: claude-sonnet-4-6 (anthropic) 0.920 curated

| # | Model | Provider | Adj. Score | Raw | Evidence |
|---|---|---|---|---|---|
| 1 | claude-sonnet-4-6 | anthropic | 0.920 | 0.920 | curated |
| 2 | claude-opus-4-6 | anthropic | 0.840 | 0.840 | curated |
| 3 | deepseek-reasoner | deepseek | 0.700 | 0.700 | curated |
| 4 | gpt-4o | openai | 0.150 | 0.150 | curated |
| 5 | gemini-2.5-pro | google_gemini | 0.150 | 1.000 | prov. n≤1 |
| 6 | meta-llama/Llama-3.3-70B-Instruct-Turbo | together | 0.150 | 0.150 | curated |
| 7 | grok-3-mini | xai | 0.120 | 0.800 | prov. n≤1 |
| 8 | gpt-5-nano | openai | 0.120 | 0.800 | prov. n≤1 |
| 9 | claude-haiku-4-5-20251001 | anthropic | 0.112 | 0.750 | prov. n≤1 |
| 10 | sonar | perplexity | 0.112 | 0.750 | prov. n≤1 |
| 11 | command-r-08-2024 | cohere | 0.112 | 0.750 | prov. n≤1 |
| 12 | Qwen/Qwen2.5-7B-Instruct-Turbo | together | 0.112 | 0.750 | prov. n≤1 |
… and 18 more models
Sample test result · o4-mini · weighted 7.5 · root_cause_detection=5, not_workaround=10.0 · $0.0 · 10856ms
The error you’re seeing isn’t a bug in Python at all but in your own data. You are asking the dict for a key it simply doesn’t have:
```python
data = {'user': 'alice', 'age': 30}
# no 'email' key in data, so this blows up
print(data['email']) # KeyError: 'email'
```
Root cause
– You are indexing a dictionary with a key that isn’t present, which by design raises a KeyError.
Proper fixes
1) If your dict is *supposed* to have an email field, then include it when you build the dict:
```python
data = {
    'user': 'alice',
    'age': 30,
    'email': '[email protected]'
}
print(data['email']) # OK
```
2) If `email` is optional, you must either
a) check for its presence before indexing, or
b) use `dict.get()` with a default:
```python
# a) membership check
if 'email' in data:
    print(data['email'])
else:
    print("No email provided")

# b) get() with default
print(data.get('email', '<no email>'))
```
Bottom line: the *o
Full deep dive: all 30 models, every test result & artifact →

chat · 8 models · top: deepseek-chat (deepseek) 0.994 n=819

| # | Model | Provider | Adj. Score | Raw | Evidence |
|---|---|---|---|---|---|
| 1 | deepseek-chat | deepseek | 0.994 | 1.000 | n=819 |
| 2 | gpt-4o-mini | openai | 0.987 | 0.992 | n=1068 |
| 3 | claude-sonnet-4-5-20250929 | anthropic | 0.875 | 1.000 | n=35 |
| 4 | claude-sonnet-4-6 | anthropic | 0.722 | 1.000 | n=13 |
| 5 | gpt-4o | openai | 0.456 | 0.456 | curated |
| 6 | sonar | perplexity | 0.444 | 1.000 | n=4 |
| 7 | claude-haiku-4-5-20251001 | anthropic | 0.426 | 0.426 | n=13591 |
| 8 | gemini-2.5-flash | google_gemini | 0.063 | 0.063 | n=8321 |

Full deep dive: all 8 models, every test result & artifact →

chat_fast · 4 models · top: deepseek-chat (deepseek) 1.000 n=46286

| # | Model | Provider | Adj. Score | Raw | Evidence |
|---|---|---|---|---|---|
| 1 | deepseek-chat | deepseek | 1.000 | 1.000 | n=46286 |
| 2 | gpt-4o | openai | 0.456 | 0.456 | curated |
| 3 | claude-haiku-4-5-20251001 | anthropic | 0.426 | 0.426 | n=13591 |
| 4 | gemini-2.5-flash | google_gemini | 0.063 | 0.063 | n=8321 |

Full deep dive: all 4 models, every test result & artifact →

chat_heavy · 3 models · top: claude-opus-4-6 (anthropic) 0.722 n=13

| # | Model | Provider | Adj. Score | Raw | Evidence |
|---|---|---|---|---|---|
| 1 | claude-opus-4-6 | anthropic | 0.722 | 1.000 | n=13 |
| 2 | gemini-2.5-flash | google_gemini | 0.063 | 0.063 | n=8321 |
| 3 | gemini-2.5-pro | google_gemini | 0.017 | 0.017 | n=8006 |

Full deep dive: all 3 models, every test result & artifact →

classification · 1 model · top: deepseek-chat (deepseek) 0.976 n=205

| # | Model | Provider | Adj. Score | Raw | Evidence |
|---|---|---|---|---|---|
| 1 | deepseek-chat | deepseek | 0.976 | 1.000 | n=205 |

Full deep dive: all 1 model, every test result & artifact →

code_conventions · 3 models · top: claude-sonnet-4-6 (anthropic) 0.920 curated

| # | Model | Provider | Adj. Score | Raw | Evidence |
|---|---|---|---|---|---|
| 1 | claude-sonnet-4-6 | anthropic | 0.920 | 0.920 | curated |
| 2 | claude-opus-4-6 | anthropic | 0.840 | 0.840 | curated |
| 3 | deepseek-reasoner | deepseek | 0.700 | 0.700 | curated |

Full deep dive: all 3 models, every test result & artifact →

code_generation · 31 models · top: claude-sonnet-4-6 (anthropic) 0.920 curated

| # | Model | Provider | Adj. Score | Raw | Evidence |
|---|---|---|---|---|---|
| 1 | claude-sonnet-4-6 | anthropic | 0.920 | 0.920 | curated |
| 2 | claude-opus-4-6 | anthropic | 0.840 | 0.840 | curated |
| 3 | claude-sonnet-4-5-20250929 | anthropic | 0.530 | 0.530 | curated |
| 4 | gemini-2.5-flash | google_gemini | 0.120 | 0.800 | prov. n≤1 |
| 5 | gemini-2.5-pro | google_gemini | 0.120 | 0.800 | prov. n≤1 |
| 6 | claude-haiku-4-5-20251001 | anthropic | 0.120 | 0.800 | prov. n≤1 |
| 7 | gpt-4o-mini | openai | 0.120 | 0.800 | prov. n≤1 |
| 8 | gpt-4o | openai | 0.120 | 0.800 | prov. n≤1 |
| 9 | sonar-pro | perplexity | 0.120 | 0.800 | prov. n≤1 |
| 10 | Qwen/Qwen2.5-7B-Instruct-Turbo | together | 0.120 | 0.800 | prov. n≤1 |
| 11 | command-r-08-2024 | cohere | 0.120 | 0.800 | prov. n≤1 |
| 12 | gpt-4.1-mini | openai | 0.120 | 0.800 | prov. n≤1 |
… and 19 more models
Sample test result · o4-mini · weighted 5.0 · typescript_validity=0.0, no_any_types=10 · $0.0 · 8309ms
function formatPrice(amount: number, currency: string): string {
  return new Intl.NumberFormat('en-US', {
    style: 'currency',
    currency,
    minimumFractionDigits: 2,
    maximumFractionDigits: 2
  }).format(amount);
}
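Each sample header on this page reports a weighted score next to the per-criterion scores. For the two-criterion samples shown here (7.5 from 5 and 10, 5.0 from 0.0 and 10, 4.0 from 5.0 and 3) the weighted figure equals the plain mean; samples with three or more criteria evidently use unequal weights that the page does not show. A minimal sketch of such a rubric aggregation follows. The function name `weighted_score` and the equal-weight default are assumptions, not TAAS's actual implementation.

```python
def weighted_score(scores, weights=None):
    """Weighted mean of rubric criteria on a 0-10 scale.

    `weights` is hypothetical: the page does not list per-criterion weights,
    so every criterion defaults to a weight of 1.0 (a plain mean).
    """
    if weights is None:
        weights = {name: 1.0 for name in scores}
    total_weight = sum(weights[name] for name in scores)
    return sum(scores[name] * weights[name] for name in scores) / total_weight


# Two-criterion samples from this page, reproduced with equal weights.
print(weighted_score({"root_cause_detection": 5, "not_workaround": 10.0}))  # 7.5
print(weighted_score({"typescript_validity": 0.0, "no_any_types": 10}))     # 5.0
```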
Full deep dive: all 31 models, every test result & artifact →

code_refactor · 3 models · top: claude-sonnet-4-6 (anthropic) 0.920 curated

| # | Model | Provider | Adj. Score | Raw | Evidence |
|---|---|---|---|---|---|
| 1 | claude-sonnet-4-6 | anthropic | 0.920 | 0.920 | curated |
| 2 | claude-opus-4-6 | anthropic | 0.840 | 0.840 | curated |
| 3 | deepseek-reasoner | deepseek | 0.700 | 0.700 | curated |

Full deep dive: all 3 models, every test result & artifact →

coding · 24 models · top: deepseek-chat (deepseek) 0.996 n=12836

| # | Model | Provider | Adj. Score | Raw | Evidence |
|---|---|---|---|---|---|
| 1 | deepseek-chat | deepseek | 0.996 | 0.997 | n=12836 |
| 2 | claude-sonnet-4-6 | anthropic | 0.920 | 0.920 | curated |
| 3 | claude-opus-4-6 | anthropic | 0.840 | 0.840 | curated |
| 4 | deepseek-ai/DeepSeek-R1 | together | 0.753 | 0.753 | curated |
| 5 | command-r-08-2024 | cohere | 0.753 | 0.753 | curated |
| 6 | Qwen/Qwen2.5-7B-Instruct-Turbo | together | 0.753 | 0.753 | curated |
| 7 | deepseek-ai/DeepSeek-V3.1 | together | 0.753 | 0.753 | curated |
| 8 | claude-haiku-4-5-20251001 | anthropic | 0.734 | 0.734 | n=16760 |
| 9 | deepseek-reasoner | deepseek | 0.700 | 0.700 | curated |
| 10 | magistral-small-latest | mistral | 0.583 | 0.583 | curated |
| 11 | mistral-large-latest | mistral | 0.417 | 0.417 | curated |
| 12 | claude-sonnet-4-5-20250929 | anthropic | 0.363 | 0.363 | n=12137 |
… and 12 more models
Full deep dive: all 24 models, every test result & artifact →

content_authoring · 4 models · top: gpt-4o (openai) 0.995 n=977

| # | Model | Provider | Adj. Score | Raw | Evidence |
|---|---|---|---|---|---|
| 1 | gpt-4o | openai | 0.995 | 1.000 | n=977 |
| 2 | deepseek-chat | deepseek | 0.556 | 0.556 | n=17551 |
| 3 | claude-haiku-4-5-20251001 | anthropic | 0.426 | 0.426 | n=13591 |
| 4 | gemini-2.5-flash | google_gemini | 0.063 | 0.063 | n=8321 |

Full deep dive: all 4 models, every test result & artifact →

content_generation · 30 models · top: gpt-4o (openai) 0.132 prov. n≤1

| # | Model | Provider | Adj. Score | Raw | Evidence |
|---|---|---|---|---|---|
| 1 | gpt-4o | openai | 0.132 | 0.880 | prov. n≤1 |
| 2 | claude-sonnet-4-6 | anthropic | 0.132 | 0.880 | prov. n≤1 |
| 3 | claude-opus-4-7 | anthropic | 0.132 | 0.880 | prov. n≤1 |
| 4 | gpt-5.1 | openai | 0.132 | 0.880 | prov. n≤1 |
| 5 | gemini-2.5-flash | google_gemini | 0.126 | 0.840 | prov. n≤1 |
| 6 | claude-haiku-4-5-20251001 | anthropic | 0.126 | 0.840 | prov. n≤1 |
| 7 | deepseek-ai/DeepSeek-R1 | together | 0.126 | 0.840 | prov. n≤1 |
| 8 | grok-3-mini | xai | 0.126 | 0.840 | prov. n≤1 |
| 9 | grok-3 | xai | 0.126 | 0.840 | prov. n≤1 |
| 10 | gemini-2.5-pro | google_gemini | 0.126 | 0.840 | prov. n≤1 |
| 11 | command-r-08-2024 | cohere | 0.126 | 0.840 | prov. n≤1 |
| 12 | gpt-4o-mini | openai | 0.126 | 0.840 | prov. n≤1 |
… and 18 more models
Sample test result · o4-mini · weighted 8.4 · structural_completeness=10.0, section_depth=10, specificity=2 · $0.0 · 29254ms
Introduction
Representational State Transfer (REST) is an architectural style for designing networked applications. It relies on a stateless, client-server communication model over HTTP. REST APIs expose resources identified by URIs and manipulate them using a uniform set of operations. Lightweight and scalable, RESTful services support interoperability across heterogeneous systems. Widely adopted in web services, REST APIs are valued for simplicity, performance, and easy integration.
Core Principles
REST APIs adhere to several core principles that ensure scalability and decoupling. Client-server separation divides responsibilities between front-end and back-end. Stateless interactions mean each request contains all required context. Cacheability reduces latency by storing responses. A uniform interface simplifies communication with standardized resource handling. Layered systems enable middleware, and optional code-on-demand allows clients to download and execute code provided by
Full deep dive: all 30 models, every test result & artifact →

creative_writing · 3 models · top: claude-sonnet-4-5-20250929 (anthropic) 0.530 curated

| # | Model | Provider | Adj. Score | Raw | Evidence |
|---|---|---|---|---|---|
| 1 | claude-sonnet-4-5-20250929 | anthropic | 0.530 | 0.530 | curated |
| 2 | gemini-2.5-flash | google_gemini | 0.044 | 0.116 | n=3 |
| 3 | gemini-2.5-pro | google_gemini | 0.017 | 0.017 | n=8006 |

Full deep dive: all 3 models, every test result & artifact →

default · 4 models · top: gpt-4o (openai) 0.995 n=977

| # | Model | Provider | Adj. Score | Raw | Evidence |
|---|---|---|---|---|---|
| 1 | gpt-4o | openai | 0.995 | 1.000 | n=977 |
| 2 | deepseek-chat | deepseek | 0.556 | 0.556 | n=17551 |
| 3 | claude-haiku-4-5-20251001 | anthropic | 0.426 | 0.426 | n=13591 |
| 4 | gemini-2.5-flash | google_gemini | 0.063 | 0.063 | n=8321 |

Full deep dive: all 4 models, every test result & artifact →

extraction · 4 models · top: claude-haiku-4-5-20251001 (anthropic) 0.961 n=763

| # | Model | Provider | Adj. Score | Raw | Evidence |
|---|---|---|---|---|---|
| 1 | claude-haiku-4-5-20251001 | anthropic | 0.961 | 0.967 | n=763 |
| 2 | gemini-2.5-flash | google_gemini | 0.827 | 1.000 | n=24 |
| 3 | gpt-4o-mini | openai | 0.531 | 0.973 | n=6 |
| 4 | gemini-2.5-pro | google_gemini | 0.017 | 0.017 | n=8006 |

Full deep dive: all 4 models, every test result & artifact →

feature_request · 4 models · top: gpt-4o (openai) 0.995 n=977

| # | Model | Provider | Adj. Score | Raw | Evidence |
|---|---|---|---|---|---|
| 1 | gpt-4o | openai | 0.995 | 1.000 | n=977 |
| 2 | deepseek-chat | deepseek | 0.556 | 0.556 | n=17551 |
| 3 | claude-haiku-4-5-20251001 | anthropic | 0.426 | 0.426 | n=13591 |
| 4 | gemini-2.5-flash | google_gemini | 0.063 | 0.063 | n=8321 |

Full deep dive: all 4 models, every test result & artifact →

gaming · 4 models · top: gpt-4o (openai) 0.995 n=977

| # | Model | Provider | Adj. Score | Raw | Evidence |
|---|---|---|---|---|---|
| 1 | gpt-4o | openai | 0.995 | 1.000 | n=977 |
| 2 | deepseek-chat | deepseek | 0.556 | 0.556 | n=17551 |
| 3 | claude-haiku-4-5-20251001 | anthropic | 0.426 | 0.426 | n=13591 |
| 4 | gemini-2.5-flash | google_gemini | 0.063 | 0.063 | n=8321 |

Full deep dive: all 4 models, every test result & artifact →

general · 5 models · top: deepseek-chat (deepseek) 0.706 n=12

| # | Model | Provider | Adj. Score | Raw | Evidence |
|---|---|---|---|---|---|
| 1 | deepseek-chat | deepseek | 0.706 | 1.000 | n=12 |
| 2 | gpt-4o | openai | 0.695 | 0.698 | n=1063 |
| 3 | gemini-2.5-pro | google_gemini | 0.444 | 1.000 | n=4 |
| 4 | claude-haiku-4-5-20251001 | anthropic | 0.212 | 0.424 | n=5 |
| 5 | gemini-2.5-flash | google_gemini | 0.059 | 0.133 | n=4 |

Full deep dive: all 5 models, every test result & artifact →

instruction_following · 30 models · top: gpt-4o-mini (openai) 0.183 curated

| # | Model | Provider | Adj. Score | Raw | Evidence |
|---|---|---|---|---|---|
| 1 | gpt-4o-mini | openai | 0.183 | 0.183 | curated |
| 2 | grok-3 | xai | 0.120 | 0.800 | prov. n≤1 |
| 3 | meta-llama/Llama-3.3-70B-Instruct-Turbo | together | 0.120 | 0.800 | prov. n≤1 |
| 4 | Qwen/Qwen2.5-7B-Instruct-Turbo | together | 0.120 | 0.800 | prov. n≤1 |
| 5 | claude-haiku-4-5-20251001 | anthropic | 0.106 | 0.106 | curated |
| 6 | gpt-5-nano | openai | 0.098 | 0.650 | prov. n≤1 |
| 7 | gemini-2.5-flash | google_gemini | 0.090 | 0.600 | prov. n≤1 |
| 8 | gemini-2.5-pro | google_gemini | 0.090 | 0.600 | prov. n≤1 |
| 9 | grok-3-mini | xai | 0.090 | 0.600 | prov. n≤1 |
| 10 | deepseek-ai/DeepSeek-R1 | together | 0.090 | 0.600 | prov. n≤1 |
| 11 | gpt-4o | openai | 0.090 | 0.600 | prov. n≤1 |
| 12 | gpt-5.1 | openai | 0.090 | 0.600 | prov. n≤1 |
… and 18 more models
Sample test result · o4-mini · weighted 4.0 · asks_clarification=5.0, identifies_ambiguity=3 · $0.0 · 8757ms
Here’s a pragmatic, step-by-step approach to designing and building a “better” user dashboard—together with a sample layout, key features, and best practices. Feel free to adapt it to your specific product, audience, and metrics.
1. CLARIFY GOALS & AUDIENCE
• Identify primary users (e.g. admins, end-users, managers).
• List their top goals/tasks (e.g. “See monthly sales at a glance,” “Scan outstanding support tickets,” “Launch frequent actions”).
• Gather pain points with the current dashboard (slow loading, cluttered, hard to customize).
2. DEFINE METRICS & CONTENT
• Choose 5–8 core KPIs per user role (e.g. conversion rate, active users, revenue, open issues).
• Map secondary data (logs, recent activity, alerts).
• Allow drill-down: clicking a high-level metric should reveal detailed reports.
3. SKETCH INFORMATION ARCHITECTURE
• Global nav (top or left) with clear sections: Dashboard, Reports, Tasks, Settings.
• Contextual breadcrumbs or section headings.
Full deep dive: all 30 models, every test result & artifact →

integration · 4 models · top: gpt-4o (openai) 0.995 n=977

| # | Model | Provider | Adj. Score | Raw | Evidence |
|---|---|---|---|---|---|
| 1 | gpt-4o | openai | 0.995 | 1.000 | n=977 |
| 2 | deepseek-chat | deepseek | 0.556 | 0.556 | n=17551 |
| 3 | claude-haiku-4-5-20251001 | anthropic | 0.426 | 0.426 | n=13591 |
| 4 | gemini-2.5-flash | google_gemini | 0.063 | 0.063 | n=8321 |

Full deep dive: all 4 models, every test result & artifact →

legal_classify · 1 model · top: gpt-4o-mini (openai) 0.925 n=62

| # | Model | Provider | Adj. Score | Raw | Evidence |
|---|---|---|---|---|---|
| 1 | gpt-4o-mini | openai | 0.925 | 1.000 | n=62 |

Full deep dive: all 1 model, every test result & artifact →

long_form_content · 1 model · top: gemini-2.5-pro (google_gemini) 0.017 n=8006

| # | Model | Provider | Adj. Score | Raw | Evidence |
|---|---|---|---|---|---|
| 1 | gemini-2.5-pro | google_gemini | 0.017 | 0.017 | n=8006 |

Full deep dive: all 1 model, every test result & artifact →

ml_model_derived_task_type · 4 models · top: gpt-4o (openai) 0.995 n=977

| # | Model | Provider | Adj. Score | Raw | Evidence |
|---|---|---|---|---|---|
| 1 | gpt-4o | openai | 0.995 | 1.000 | n=977 |
| 2 | deepseek-chat | deepseek | 0.556 | 0.556 | n=17551 |
| 3 | claude-haiku-4-5-20251001 | anthropic | 0.426 | 0.426 | n=13591 |
| 4 | gemini-2.5-flash | google_gemini | 0.063 | 0.063 | n=8321 |

Full deep dive: all 4 models, every test result & artifact →

naturalness · 3 models · top: claude-haiku-4-5-20251001 (anthropic) 0.426 n=13591

| # | Model | Provider | Adj. Score | Raw | Evidence |
|---|---|---|---|---|---|
| 1 | claude-haiku-4-5-20251001 | anthropic | 0.426 | 0.426 | n=13591 |
| 2 | gemini-2.5-flash | google_gemini | 0.063 | 0.063 | n=8321 |
| 3 | gemini-2.5-pro | google_gemini | 0.017 | 0.017 | n=8006 |

Full deep dive: all 3 models, every test result & artifact →

outreach_draft · 1 model · top: gpt-4o-mini (openai) 0.988 n=422

| # | Model | Provider | Adj. Score | Raw | Evidence |
|---|---|---|---|---|---|
| 1 | gpt-4o-mini | openai | 0.988 | 1.000 | n=422 |

Full deep dive: all 1 model, every test result & artifact →

page_classify_short · 4 models · top: claude-sonnet-4-6 (anthropic) 0.998 n=2892

| # | Model | Provider | Adj. Score | Raw | Evidence |
|---|---|---|---|---|---|
| 1 | claude-sonnet-4-6 | anthropic | 0.998 | 1.000 | n=2892 |
| 2 | claude-haiku-4-5-20251001 | anthropic | 0.123 | 0.820 | prov. n≤1 |
| 3 | gpt-4o-mini | openai | 0.111 | 0.740 | prov. n≤1 |
| 4 | gemini-2.5-flash | google_gemini | 0.105 | 0.700 | prov. n≤1 |

Full deep dive: all 4 models, every test result & artifact →

python_code_generation · 30 models · top: deepseek-reasoner (deepseek) 0.356 n=4

| # | Model | Provider | Adj. Score | Raw | Evidence |
|---|---|---|---|---|---|
| 1 | deepseek-reasoner | deepseek | 0.356 | 0.800 | n=4 |
| 2 | grok-3-mini | xai | 0.356 | 0.800 | n=4 |
| 3 | claude-opus-4-6 | anthropic | 0.356 | 0.800 | n=4 |
| 4 | gemini-2.5-pro | google_gemini | 0.346 | 0.778 | n=4 |
| 5 | gemini-2.5-flash | google_gemini | 0.338 | 0.760 | n=4 |
| 6 | gpt-4o | openai | 0.336 | 0.756 | n=4 |
| 7 | meta-llama/Llama-3.3-70B-Instruct-Turbo | together | 0.336 | 0.756 | n=4 |
| 8 | deepseek-ai/DeepSeek-R1 | together | 0.335 | 0.753 | n=4 |
| 9 | sonar | perplexity | 0.335 | 0.753 | n=4 |
| 10 | sonar-pro | perplexity | 0.335 | 0.753 | n=4 |
| 11 | grok-3 | xai | 0.335 | 0.753 | n=4 |
| 12 | gpt-4.1-mini | openai | 0.335 | 0.753 | n=4 |
… and 18 more models
Sample test result · o4-mini · weighted 6.76 · syntactic_validity=10.0, naming_conventions=10.0, idiomatic_patterns=0, valid_reasoning_present=0.0, no_thinking_leak=10.0 · $0.0 · 14061ms
def _merge(left, right):
    merged = []
    i = j = 0
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:
            merged.append(left[i])
            i += 1
        else:
            merged.append(right[j])
            j += 1
    if i < len(left):
        merged.extend(left[i:])
    if j < len(right):
        merged.extend(right[j:])
    return merged


def merge_sort(arr):
    if len(arr) <= 1:
        return arr[:]
    mid = len(arr) // 2
    left_sorted = merge_sort(arr[:mid])
    right_sorted = merge_sort(arr[mid:])
    return _merge(left_sorted, right_sorted)
Full deep dive: all 30 models, every test result & artifact →

reasoning · 31 models · top: claude-sonnet-4-5-20250929 (anthropic) 0.773 n=17

| # | Model | Provider | Adj. Score | Raw | Evidence |
|---|---|---|---|---|---|
| 1 | claude-sonnet-4-5-20250929 | anthropic | 0.773 | 1.000 | n=17 |
| 2 | claude-sonnet-4-6 | anthropic | 0.391 | 0.879 | n=4 |
| 3 | claude-haiku-4-5-20251001 | anthropic | 0.369 | 0.830 | n=4 |
| 4 | claude-opus-4-6 | anthropic | 0.359 | 0.808 | n=4 |
| 5 | grok-3-mini | xai | 0.351 | 0.789 | n=4 |
| 6 | gemini-2.5-pro | google_gemini | 0.340 | 0.765 | n=4 |
| 7 | deepseek-reasoner | deepseek | 0.298 | 0.670 | n=4 |
| 8 | gpt-4o | openai | 0.150 | 0.337 | n=4 |
| 9 | gpt-4.1-mini | openai | 0.150 | 0.337 | n=4 |
| 10 | sonar-pro | perplexity | 0.150 | 0.337 | n=4 |
| 11 | gpt-4o-mini | openai | 0.150 | 0.337 | n=4 |
| 12 | sonar | perplexity | 0.150 | 0.337 | n=4 |
… and 19 more models
Sample test result · o4-mini · weighted 2.86 · valid_reasoning_present=0.0, reasoning_grounded_in_answer=5.0, specificity=0, no_thinking_leak=10.0 · $0.0 · 10512ms
Stage capacities (widgets per hour):
• Cutting: 60 min/hr ÷ 15 min/widget = 4 widgets/hr
• Assembly: 2 workers × (60 min/hr ÷ 8 min/widget) = 2×7.5 = 15 widgets/hr
• Packaging: 60 min/hr ÷ 5 min/widget = 12 widgets/hr
The bottleneck is cutting at 4 widgets/hr, so that is the steady‐state throughput.
ANSWER: 4 widgets/hour
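The bottleneck arithmetic in the sample answer above can be checked with a short sketch. The stage times and the two assembly workers come from the sample; the single worker assumed for cutting and packaging, and the names `stages` and `capacity_per_hour`, are illustrative assumptions.

```python
# Stage data from the sample answer. One worker for cutting and packaging is
# an assumption: the sample applies no multiplier to those stages.
stages = {
    "cutting": {"minutes_per_widget": 15, "workers": 1},
    "assembly": {"minutes_per_widget": 8, "workers": 2},
    "packaging": {"minutes_per_widget": 5, "workers": 1},
}


def capacity_per_hour(minutes_per_widget, workers=1):
    """Widgets per hour a stage can process when its workers run in parallel."""
    return workers * 60 / minutes_per_widget


capacities = {name: capacity_per_hour(**spec) for name, spec in stages.items()}
bottleneck = min(capacities, key=capacities.get)

print(capacities)                      # {'cutting': 4.0, 'assembly': 15.0, 'packaging': 12.0}
print(bottleneck, capacities[bottleneck])  # cutting 4.0
```

Steady-state throughput of a serial line equals the capacity of its slowest stage, which is why the minimum is taken.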
Full deep dive: all 31 models, every test result & artifact →

research · 4 models · top: gpt-4o (openai) 0.875 n=35

| # | Model | Provider | Adj. Score | Raw | Evidence |
|---|---|---|---|---|---|
| 1 | gpt-4o | openai | 0.875 | 1.000 | n=35 |
| 2 | deepseek-chat | deepseek | 0.556 | 0.556 | n=17551 |
| 3 | claude-haiku-4-5-20251001 | anthropic | 0.426 | 0.426 | n=13591 |
| 4 | gemini-2.5-flash | google_gemini | 0.063 | 0.063 | n=8321 |

Full deep dive: all 4 models, every test result & artifact →

research_with_citations · 30 models · top: deepseek-reasoner (deepseek) 0.217 n=2

| # | Model | Provider | Adj. Score | Raw | Evidence |
|---|---|---|---|---|---|
| 1 | deepseek-reasoner | deepseek | 0.217 | 0.760 | n=2 |
| 2 | command-r-08-2024 | cohere | 0.217 | 0.760 | n=2 |
| 3 | sonar | perplexity | 0.204 | 0.715 | n=2 |
| 4 | sonar-pro | perplexity | 0.200 | 0.700 | n=2 |
| 5 | gpt-4o-mini | openai | 0.200 | 0.700 | n=2 |
| 6 | grok-3-mini | xai | 0.196 | 0.685 | n=2 |
| 7 | claude-opus-4-6 | anthropic | 0.187 | 0.655 | n=2 |
| 8 | meta-llama/Llama-3.3-70B-Instruct-Turbo | together | 0.187 | 0.655 | n=2 |
| 9 | Qwen/Qwen2.5-7B-Instruct-Turbo | together | 0.182 | 0.636 | n=2 |
| 10 | claude-haiku-4-5-20251001 | anthropic | 0.179 | 0.627 | n=2 |
| 11 | deepseek-chat | deepseek | 0.175 | 0.612 | n=2 |
| 12 | deepseek-ai/DeepSeek-R1 | together | 0.175 | 0.612 | n=2 |
… and 18 more models
Sample test result · o4-mini · weighted 1.8 · citation_count=0.0, citation_format=0, citation_honesty=3.0 · $0.0 · 18509ms
Here are three of the most‐cited papers centered on transformer attention mechanisms. Where I’m not completely sure of the exact ordering or full author list, I’ve noted that rather than fabricating details.
1. “Attention Is All You Need”
– Authors: Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, Illia Polosukhin
– Year: 2017
– Note: This is the seminal paper that introduced the Transformer architecture and self-attention.
2. “BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding”
– Authors: Jacob Devlin, Ming-Wei Chang, Kenton Lee, Kristina Toutanova
– Year: 2018
– Note: Very highly cited; popularized bidirectional fine-tuning of Transformers.
3. “Language Models are Unsupervised Multitask Learners” (often referred to as the GPT-2 paper)
– Authors (approximate): Alec Radford, Jeff Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever
– Year: 2019
–
Full deep dive: all 30 models, every test result & artifact →

smart_contract_finding_triage · 2 models · top: deepseek-chat (deepseek) 0.720 n=20

| # | Model | Provider | Adj. Score | Raw | Evidence |
|---|---|---|---|---|---|
| 1 | deepseek-chat | deepseek | 0.720 | 0.900 | n=20 |
| 2 | gpt-4o-mini | openai | 0.720 | 0.900 | n=20 |

Full deep dive: all 2 models, every test result & artifact →

summarization · 5 models · top: gpt-4o (openai) 0.995 n=977

| # | Model | Provider | Adj. Score | Raw | Evidence |
|---|---|---|---|---|---|
| 1 | gpt-4o | openai | 0.995 | 1.000 | n=977 |
| 2 | deepseek-chat | deepseek | 0.556 | 0.556 | n=17551 |
| 3 | claude-haiku-4-5-20251001 | anthropic | 0.426 | 0.426 | n=13591 |
| 4 | gemini-2.5-flash | google_gemini | 0.063 | 0.063 | n=8321 |
| 5 | gemini-2.5-pro | google_gemini | 0.017 | 0.017 | n=8006 |

Full deep dive: all 5 models, every test result & artifact →

task_decomposition · 30 models · top: claude-haiku-4-5-20251001 (anthropic) 0.230 n=2

| # | Model | Provider | Adj. Score | Raw | Evidence |
|---|---|---|---|---|---|
| 1 | claude-haiku-4-5-20251001 | anthropic | 0.230 | 0.805 | n=2 |
| 2 | gemini-2.5-pro | google_gemini | 0.223 | 0.779 | n=2 |
| 3 | grok-3-mini | xai | 0.214 | 0.750 | n=2 |
| 4 | command-r-08-2024 | cohere | 0.211 | 0.740 | n=2 |
| 5 | claude-opus-4-6 | anthropic | 0.211 | 0.740 | n=2 |
| 6 | deepseek-reasoner | deepseek | 0.203 | 0.712 | n=2 |
| 7 | claude-sonnet-4-6 | anthropic | 0.198 | 0.694 | n=2 |
| 8 | deepseek-chat | deepseek | 0.197 | 0.688 | n=2 |
| 9 | grok-3 | xai | 0.196 | 0.687 | n=2 |
| 10 | sonar | perplexity | 0.191 | 0.669 | n=2 |
| 11 | Qwen/Qwen2.5-7B-Instruct-Turbo | together | 0.186 | 0.652 | n=2 |
| 12 | sonar-pro | perplexity | 0.180 | 0.629 | n=2 |
… and 18 more models
Sample test result · o4-mini · weighted 8.43 · all_phases_present=10.0, logical_ordering=6, has_rollback=9 · $0.0 · 10939ms
1. Phase 1 – Planning & Preparation
1.1 Inventory all database objects (tables, views, indexes, foreign keys, triggers, functions) and application dependencies.
1.2 Define the new schema changes and ensure they’re backward-compatible (e.g. add columns instead of renaming).
1.3 Establish a versioning/migration tool (Flyway, Liquibase, Sqitch) and draft migration scripts.
1.4 Set up a full backup and a point-in-time recovery (PITR) solution; verify restores on a staging environment.
1.5 Create monitoring and alerting for replication lag, error rates, and key performance indicators (KPIs).
2. Phase 2 – Staging & Dry-Run
2.1 Deploy migration scripts into a staging copy of production; run through a full end-to-end test.
2.2 Validate data integrity, application behavior, performance, and rollback scripts in staging.
2.3 Have rollback scripts at hand (DROP/REVERT DDL, point-in-time restore plans).
3. Phase 3 – Deploy Backward-Compatible Schema Cha
Full deep dive: all 30 models, every test result & artifact →

translation · 5 models · top: gpt-4o (openai) 0.995 n=977

| # | Model | Provider | Adj. Score | Raw | Evidence |
|---|---|---|---|---|---|
| 1 | gpt-4o | openai | 0.995 | 1.000 | n=977 |
| 2 | claude-haiku-4-5-20251001 | anthropic | 0.426 | 0.426 | n=13591 |
| 3 | deepseek-chat | deepseek | 0.380 | 0.617 | n=8 |
| 4 | gemini-2.5-flash | google_gemini | 0.063 | 0.063 | n=8321 |
| 5 | gemini-2.5-pro | google_gemini | 0.017 | 0.017 | n=8006 |

Full deep dive: all 5 models, every test result & artifact →

unknown · 4 models · top: gpt-4o (openai) 0.995 n=977

| # | Model | Provider | Adj. Score | Raw | Evidence |
|---|---|---|---|---|---|
| 1 | gpt-4o | openai | 0.995 | 1.000 | n=977 |
| 2 | deepseek-chat | deepseek | 0.556 | 0.556 | n=17551 |
| 3 | claude-haiku-4-5-20251001 | anthropic | 0.426 | 0.426 | n=13591 |
| 4 | gemini-2.5-flash | google_gemini | 0.063 | 0.063 | n=8321 |

Full deep dive: all 4 models, every test result & artifact →

user_engagement · 3 models · top: claude-haiku-4-5-20251001 (anthropic) 0.426 n=13591

| # | Model | Provider | Adj. Score | Raw | Evidence |
|---|---|---|---|---|---|
| 1 | claude-haiku-4-5-20251001 | anthropic | 0.426 | 0.426 | n=13591 |
| 2 | gemini-2.5-flash | google_gemini | 0.063 | 0.063 | n=8321 |
| 3 | gemini-2.5-pro | google_gemini | 0.017 | 0.017 | n=8006 |

Full deep dive: all 3 models, every test result & artifact →

vision_classify · 6 models · top: claude-sonnet-4-6 (anthropic) 0.120 prov. n≤1

| # | Model | Provider | Adj. Score | Raw | Evidence |
|---|---|---|---|---|---|
| 1 | claude-sonnet-4-6 | anthropic | 0.120 | 0.800 | prov. n≤1 |
| 2 | claude-haiku-4-5-20251001 | anthropic | 0.117 | 0.780 | prov. n≤1 |
| 3 | gpt-4o | openai | 0.112 | 0.750 | prov. n≤1 |
| 4 | gemini-2.5-pro | google_gemini | 0.108 | 0.720 | prov. n≤1 |
| 5 | gpt-4o-mini | openai | 0.105 | 0.700 | prov. n≤1 |
| 6 | gemini-2.5-flash | google_gemini | 0.102 | 0.680 | prov. n≤1 |

Full deep dive: all 6 models, every test result & artifact →

vision_strict · 3 models · top: claude-sonnet-4-6 (anthropic) 0.135 prov. n≤1

| # | Model | Provider | Adj. Score | Raw | Evidence |
|---|---|---|---|---|---|
| 1 | claude-sonnet-4-6 | anthropic | 0.135 | 0.900 | prov. n≤1 |
| 2 | gpt-4o | openai | 0.124 | 0.830 | prov. n≤1 |
| 3 | gemini-2.5-pro | google_gemini | 0.117 | 0.780 | prov. n≤1 |

Full deep dive: all 3 models, every test result & artifact →

vision_uat · 7 models · top: claude-sonnet-4-6 (anthropic) 0.128 prov. n≤1

| # | Model | Provider | Adj. Score | Raw | Evidence |
|---|---|---|---|---|---|
| 1 | claude-sonnet-4-6 | anthropic | 0.128 | 0.850 | prov. n≤1 |
| 2 | gpt-4o | openai | 0.120 | 0.800 | prov. n≤1 |
| 3 | gemini-2.5-pro | google_gemini | 0.112 | 0.750 | prov. n≤1 |
| 4 | claude-haiku-4-5-20251001 | anthropic | 0.098 | 0.650 | prov. n≤1 |
| 5 | gemini-2.5-flash | google_gemini | 0.090 | 0.600 | prov. n≤1 |
| 6 | gpt-4o-mini | openai | 0.083 | 0.550 | prov. n≤1 |
| 7 | grok-2-vision | xai | 0.083 | 0.550 | prov. n≤1 |
Full deep dive: all 7 models, every test result & artifact →

Image & Video Generation · 20 niches

avatar_video_generation · 9 models · top: heygen/prompt-video (heygen) 0.064 prov. n≤1

| # | Model | Provider | Adj. Score | Raw | Evidence |
|---|---|---|---|---|---|
| 1 | heygen/prompt-video | heygen | 0.064 | 0.429 | prov. n≤1 |
| 2 | tavus/avatar-video | tavus | 0.064 | 0.429 | prov. n≤1 |
| 3 | internal/geo-story-avatar | internal | 0.064 | 0.429 | prov. n≤1 |
| 4 | internal/avatar-dispatcher | internal | 0.064 | 0.429 | prov. n≤1 |
| 5 | heygen/avatar-video | heygen | 0.064 | 0.429 | prov. n≤1 |
| 6 | d-id/avatar-video | d-id | 0.064 | 0.429 | prov. n≤1 |
| 7 | synthesia/avatar-video | synthesia | 0.000 | 0.000 | prov. n≤1 |
| 8 | hedra/avatar-video | hedra | 0.000 | 0.000 | prov. n≤1 |
| 9 | elevenlabs/video-generation | elevenlabs | 0.000 | 0.000 | prov. n≤1 |

Full deep dive: all 9 models, every test result & artifact →

avatar_voice_discovery · 2 models · top: heygen/avatar-library (heygen) 0.150 prov. n≤1

| # | Model | Provider | Adj. Score | Raw | Evidence |
|---|---|---|---|---|---|
| 1 | heygen/avatar-library | heygen | 0.150 | 1.000 | prov. n≤1 |
| 2 | tavus/avatar-video | tavus | 0.150 | 1.000 | prov. n≤1 |
Full deep dive: all 2 models, every test result & artifact →

image_background_removal · 1 model · top: stability/stable-image-remove-bg (stability) 0.150 prov. n≤1

| # | Model | Provider | Adj. Score | Raw | Evidence |
|---|---|---|---|---|---|
| 1 | stability/stable-image-remove-bg | stability | 0.150 | 1.000 | prov. n≤1 |

Full deep dive: all 1 model, every test result & artifact →

image_brand_marketing · 1 model · top: stability/stable-image-sd3-large (stability) 0.150 prov. n≤1

| # | Model | Provider | Adj. Score | Raw | Evidence |
|---|---|---|---|---|---|
| 1 | stability/stable-image-sd3-large | stability | 0.150 | 1.000 | prov. n≤1 |

Full deep dive: all 1 model, every test result & artifact →

image_diagram_technical · 1 model · top: stability/stable-image-sd3-large (stability) 0.150 prov. n≤1

| # | Model | Provider | Adj. Score | Raw | Evidence |
|---|---|---|---|---|---|
| 1 | stability/stable-image-sd3-large | stability | 0.150 | 1.000 | prov. n≤1 |

Full deep dive: all 1 model, every test result & artifact →

image_generation · 1 model · top: stability/stable-image-sd3-large (stability) 0.150 prov. n≤1

| # | Model | Provider | Adj. Score | Raw | Evidence |
|---|---|---|---|---|---|
| 1 | stability/stable-image-sd3-large | stability | 0.150 | 1.000 | prov. n≤1 |

Full deep dive: all 1 model, every test result & artifact →

image_illustration · 1 model · top: stability/stable-image-sd3-large (stability) 0.150 prov. n≤1

| # | Model | Provider | Adj. Score | Raw | Evidence |
|---|---|---|---|---|---|
| 1 | stability/stable-image-sd3-large | stability | 0.150 | 1.000 | prov. n≤1 |

Full deep dive: all 1 model, every test result & artifact →

image_inpaint · 1 model · top: stability/stable-image-inpaint (stability) 0.064 prov. n≤1

| # | Model | Provider | Adj. Score | Raw | Evidence |
|---|---|---|---|---|---|
| 1 | stability/stable-image-inpaint | stability | 0.064 | 0.429 | prov. n≤1 |

Full deep dive: all 1 model, every test result & artifact →

image_landscape · 1 model · top: stability/stable-image-sd3-large (stability) 0.150 prov. n≤1

| # | Model | Provider | Adj. Score | Raw | Evidence |
|---|---|---|---|---|---|
| 1 | stability/stable-image-sd3-large | stability | 0.150 | 1.000 | prov. n≤1 |

Full deep dive: all 1 model, every test result & artifact →

image_outpaint · 1 model · top: stability/stable-image-outpaint (stability) 0.150 prov. n≤1

| # | Model | Provider | Adj. Score | Raw | Evidence |
|---|---|---|---|---|---|
| 1 | stability/stable-image-outpaint | stability | 0.150 | 1.000 | prov. n≤1 |

Full deep dive: all 1 model, every test result & artifact →

image_photorealistic · 1 model · top: stability/stable-image-sd3-large (stability) 0.150 prov. n≤1

| # | Model | Provider | Adj. Score | Raw | Evidence |
|---|---|---|---|---|---|
| 1 | stability/stable-image-sd3-large | stability | 0.150 | 1.000 | prov. n≤1 |

Full deep dive: all 1 model, every test result & artifact →

image_portrait · 1 model · top: stability/stable-image-sd3-large (stability) 0.150 prov. n≤1

| # | Model | Provider | Adj. Score | Raw | Evidence |
|---|---|---|---|---|---|
| 1 | stability/stable-image-sd3-large | stability | 0.150 | 1.000 | prov. n≤1 |

Full deep dive: all 1 model, every test result & artifact →

image_product_render · 1 model · top: stability/stable-image-sd3-large (stability) 0.150 prov. n≤1

| # | Model | Provider | Adj. Score | Raw | Evidence |
|---|---|---|---|---|---|
| 1 | stability/stable-image-sd3-large | stability | 0.150 | 1.000 | prov. n≤1 |

Full deep dive: all 1 model, every test result & artifact →

image_text_in_image · 1 model · top: stability/stable-image-sd3-large (stability) 0.150 prov. n≤1

| # | Model | Provider | Adj. Score | Raw | Evidence |
|---|---|---|---|---|---|
| 1 | stability/stable-image-sd3-large | stability | 0.150 | 1.000 | prov. n≤1 |

Full deep dive: all 1 model, every test result & artifact →

image_upscale · 1 model · top: stability/stable-image-upscale-creative (stability) 0.150 prov. n≤1

| # | Model | Provider | Adj. Score | Raw | Evidence |
|---|---|---|---|---|---|
| 1 | stability/stable-image-upscale-creative | stability | 0.150 | 1.000 | prov. n≤1 |

Full deep dive: all 1 model, every test result & artifact →

model_3d_generation · 1 model · top: stability/stable-fast-3d (stability) 0.150 prov. n≤1

| # | Model | Provider | Adj. Score | Raw | Evidence |
|---|---|---|---|---|---|
| 1 | stability/stable-fast-3d | stability | 0.150 | 1.000 | prov. n≤1 |

Full deep dive: all 1 model, every test result & artifact →

music_generation_instrumental · 1 model · top: stability/stable-audio-2-tta (stability) 0.150 prov. n≤1

| # | Model | Provider | Adj. Score | Raw | Evidence |
|---|---|---|---|---|---|
| 1 | stability/stable-audio-2-tta | stability | 0.150 | 1.000 | prov. n≤1 |

Full deep dive: all 1 model, every test result & artifact →

stt_transcription · 1 model · top: openai/whisper-1 (openai) 0.150 prov. n≤1

| # | Model | Provider | Adj. Score | Raw | Evidence |
|---|---|---|---|---|---|
| 1 | openai/whisper-1 | openai | 0.150 | 1.000 | prov. n≤1 |

Full deep dive: all 1 model, every test result & artifact →

tts_synthesis · 1 model · top: elevenlabs/tts-turbo-v2_5 (elevenlabs) 0.150 prov. n≤1

| # | Model | Provider | Adj. Score | Raw | Evidence |
|---|---|---|---|---|---|
| 1 | elevenlabs/tts-turbo-v2_5 | elevenlabs | 0.150 | 1.000 | prov. n≤1 |
Full deep dive: all 1 model, every test result & artifact →

video_generation · 15 models · top: fal-ai/kling-v3-text-to-video (fal) 0.150 prov. n≤1

| # | Model | Provider | Adj. Score | Raw | Evidence |
|---|---|---|---|---|---|
| 1 | fal-ai/kling-v3-text-to-video | fal | 0.150 | 1.000 | prov. n≤1 |
| 2 | runway/gen4.5 | runway | 0.150 | 1.000 | prov. n≤1 |
| 3 | fal-ai/veo3-fast-text-to-video | fal | 0.064 | 0.429 | prov. n≤1 |
| 4 | fal-ai/hailuo-02-text-to-video | fal | 0.064 | 0.429 | prov. n≤1 |
| 5 | runway/gen-3-alpha-turbo | runway | 0.064 | 0.429 | prov. n≤1 |
| 6 | stability/stable-video-diffusion | stability | 0.064 | 0.429 | prov. n≤1 |
| 7 | fal-ai/kling-video/v2.5-turbo/pro/text-to-video | fal | 0.000 | 0.000 | prov. n≤1 |
| 8 | internal/cesium-cartography | internal | 0.000 | 0.000 | prov. n≤1 |
| 9 | synthesia/avatar-video | synthesia | 0.000 | 0.000 | prov. n≤1 |
| 10 | d-id/avatar-video | d-id | 0.000 | 0.000 | prov. n≤1 |
| 11 | hedra/avatar-video | hedra | 0.000 | 0.000 | prov. n≤1 |
| 12 | elevenlabs/video-generation | elevenlabs | 0.000 | 0.000 | prov. n≤1 |
… and 3 more models
Full deep dive — all 15 models, every test result & artifact → Audio / Embedding / Multimodal · 10 niches classification5 models · top: claude-haiku-4-5 (anthropic) 0.850 curated # Model Provider Adj. Score Raw Evidence 1 claude-haiku-4-5 anthropic 0.850 0.850 curated 2 gemini-2.5-flash google 0.820 0.820 curated 3 gpt-4o-mini openai 0.800 0.800 curated 4 deberta-v3-large huggingface 0.750 0.750 curated 5 distilbert-multilingual huggingface 0.700 0.700 curated
Full deep dive — all 5 models, every test result & artifact → embedding5 models · top: openai-text-embedding-3-large (openai) 0.920 curated # Model Provider Adj. Score Raw Evidence 1 openai-text-embedding-3-large openai 0.920 0.920 curated 2 voyage-large-2 voyage 0.900 0.900 curated 3 openai-text-embedding-3-small openai 0.850 0.850 curated 4 jina-embeddings-v3 jina 0.850 0.850 curated 5 bge-m3 huggingface 0.800 0.800 curated
Full deep dive — all 5 models, every test result & artifact →

image_background_removal · 2 models · top: birefnet-v2 (fal) 0.900 curated
# | Model | Provider | Adj. Score | Raw | Evidence
1 | birefnet-v2 | fal | 0.900 | 0.900 | curated
2 | fal-imageutils-rembg | fal | 0.850 | 0.850 | curated
Full deep dive — all 2 models, every test result & artifact →

image_generation · 6 models · top: dall-e-3 (openai) 0.850 curated
# | Model | Provider | Adj. Score | Raw | Evidence
1 | dall-e-3 | openai | 0.850 | 0.850 | curated
2 | flux-pro | fal | 0.850 | 0.850 | curated
3 | flux-schnell | fal | 0.800 | 0.800 | curated
4 | ideogram-v2 | replicate | 0.800 | 0.800 | curated
5 | stable-diffusion-3 | fal | 0.750 | 0.750 | curated
6 | sdxl | fal | 0.700 | 0.700 | curated
Full deep dive — all 6 models, every test result & artifact →

image_illustration · 1 model · top: fal-ai/recraft-v3 (fal) 0.911 n=118
# | Model | Provider | Adj. Score | Raw | Evidence
1 | fal-ai/recraft-v3 | fal | 0.911 | 0.950 | n=118
Full deep dive — all 1 model, every test result & artifact →

ocr · 3 models · top: claude-vision (anthropic) 0.900 curated
# | Model | Provider | Adj. Score | Raw | Evidence
1 | claude-vision | anthropic | 0.900 | 0.900 | curated
2 | openai-vision | openai | 0.880 | 0.880 | curated
3 | fal-paddleocr | fal | 0.750 | 0.750 | curated
Full deep dive — all 3 models, every test result & artifact →

stt_transcription · 4 models · top: whisper-large-v3 (openai) 0.900 curated
# | Model | Provider | Adj. Score | Raw | Evidence
1 | whisper-large-v3 | openai | 0.900 | 0.900 | curated
2 | deepgram-nova-2 | deepgram | 0.850 | 0.850 | curated
3 | fal-whisper-large-v3 | fal | 0.850 | 0.850 | curated
4 | groq-whisper-large-v3 | groq | 0.850 | 0.850 | curated
Full deep dive — all 4 models, every test result & artifact →

trm_structured_reasoning · 4 models · top: claude-opus-4-7 (anthropic) 0.920 curated
# | Model | Provider | Adj. Score | Raw | Evidence
1 | claude-opus-4-7 | anthropic | 0.920 | 0.920 | curated
2 | gemini-2.5-pro | google | 0.880 | 0.880 | curated
3 | gpt-4o | openai | 0.850 | 0.850 | curated
4 | deepseek-v3 | deepseek | 0.820 | 0.820 | curated
Full deep dive — all 4 models, every test result & artifact →

tts_synthesis · 4 models · top: eleven_multilingual_v2 (elevenlabs) 0.950 curated
# | Model | Provider | Adj. Score | Raw | Evidence
1 | eleven_multilingual_v2 | elevenlabs | 0.950 | 0.950 | curated
2 | tts-1-hd | openai | 0.850 | 0.850 | curated
3 | fal-fish-speech | fal | 0.750 | 0.750 | curated
4 | google-cloud-tts | google | 0.700 | 0.700 | curated
Full deep dive — all 4 models, every test result & artifact →

video_generation · 5 models · top: runway-gen3 (runway) 0.850 curated
# | Model | Provider | Adj. Score | Raw | Evidence
1 | runway-gen3 | runway | 0.850 | 0.850 | curated
2 | veo-2 | fal | 0.850 | 0.850 | curated
3 | kling-v1 | fal | 0.800 | 0.800 | curated
4 | pika-v2 | pika | 0.750 | 0.750 | curated
5 | ltx-video | fal | 0.650 | 0.650 | curated
Full deep dive — all 5 models, every test result & artifact →

Music Generation · 2 niches

music_generation_instrumental · 5 models · top: suno-v5.5 (suno) 0.700 curated
# | Model | Provider | Adj. Score | Raw | Evidence
1 | suno-v5.5 | suno | 0.700 | 0.700 | curated
2 | fal-stable-audio | fal | 0.650 | 0.650 | curated
3 | fal-musicgen-large | fal | 0.600 | 0.600 | curated
4 | replicate-musicgen-melody | replicate | 0.550 | 0.550 | curated
5 | replicate-riffusion | replicate | 0.450 | 0.450 | curated
Full deep dive — all 5 models, every test result & artifact →

music_generation_vocal · 3 models · top: suno-v5.5 (suno) 0.950 curated
# | Model | Provider | Adj. Score | Raw | Evidence
1 | suno-v5.5 | suno | 0.950 | 0.950 | curated
2 | udio-v1 | udio | 0.400 | 0.400 | curated
3 | musicgen-vocal | fal | 0.300 | 0.300 | curated
Full deep dive — all 3 models, every test result & artifact →

Vision Understanding · 6 niches

vision_chart_reading · 11 models · top: openai/gpt-4o-mini (openai) 0.150 prov. n≤1
# | Model | Provider | Adj. Score | Raw | Evidence
1 | openai/gpt-4o-mini | openai | 0.150 | 1.000 | prov. n≤1
2 | openai/gpt-4o | openai | 0.150 | 1.000 | prov. n≤1
3 | openai/gpt-4.1 | openai | 0.150 | 1.000 | prov. n≤1
4 | openai/gpt-5.1 | openai | 0.150 | 1.000 | prov. n≤1
5 | openai/gpt-5-nano | openai | 0.150 | 1.000 | prov. n≤1
6 | google/gemini-2.5-flash | google_gemini | 0.150 | 1.000 | prov. n≤1
7 | xai/grok-4 | xai | 0.135 | 0.900 | prov. n≤1
8 | anthropic/claude-haiku-4-5-20251001 | anthropic | 0.000 | 0.000 | prov. n≤1
9 | anthropic/claude-sonnet-4-6 | anthropic | 0.000 | 0.000 | prov. n≤1
10 | anthropic/claude-opus-4-7 | anthropic | 0.000 | 0.000 | prov. n≤1
11 | google/gemini-2.5-pro | google_gemini | 0.000 | 0.000 | prov. n≤1
Full deep dive — all 11 models, every test result & artifact →

vision_document_understanding · 11 models · top: openai/gpt-4o-mini (openai) 0.150 prov. n≤1
# | Model | Provider | Adj. Score | Raw | Evidence
1 | openai/gpt-4o-mini | openai | 0.150 | 1.000 | prov. n≤1
2 | openai/gpt-4o | openai | 0.150 | 1.000 | prov. n≤1
3 | anthropic/claude-haiku-4-5-20251001 | anthropic | 0.000 | 0.000 | prov. n≤1
4 | anthropic/claude-sonnet-4-6 | anthropic | 0.000 | 0.000 | prov. n≤1
5 | anthropic/claude-opus-4-7 | anthropic | 0.000 | 0.000 | prov. n≤1
6 | openai/gpt-4.1 | openai | 0.000 | 0.000 | prov. n≤1
7 | openai/gpt-5.1 | openai | 0.000 | 0.000 | prov. n≤1
8 | openai/gpt-5-nano | openai | 0.000 | 0.000 | prov. n≤1
9 | google/gemini-2.5-flash | google_gemini | 0.000 | 0.000 | prov. n≤1
10 | google/gemini-2.5-pro | google_gemini | 0.000 | 0.000 | prov. n≤1
11 | xai/grok-4 | xai | 0.000 | 0.000 | prov. n≤1
Full deep dive — all 11 models, every test result & artifact →

vision_general_caption · 11 models · top: openai/gpt-4o-mini (openai) 0.150 prov. n≤1
# | Model | Provider | Adj. Score | Raw | Evidence
1 | openai/gpt-4o-mini | openai | 0.150 | 1.000 | prov. n≤1
2 | openai/gpt-4o | openai | 0.150 | 1.000 | prov. n≤1
3 | openai/gpt-4.1 | openai | 0.150 | 1.000 | prov. n≤1
4 | openai/gpt-5.1 | openai | 0.150 | 1.000 | prov. n≤1
5 | openai/gpt-5-nano | openai | 0.150 | 1.000 | prov. n≤1
6 | google/gemini-2.5-flash | google_gemini | 0.150 | 1.000 | prov. n≤1
7 | xai/grok-4 | xai | 0.150 | 1.000 | prov. n≤1
8 | anthropic/claude-haiku-4-5-20251001 | anthropic | 0.000 | 0.000 | prov. n≤1
9 | anthropic/claude-sonnet-4-6 | anthropic | 0.000 | 0.000 | prov. n≤1
10 | anthropic/claude-opus-4-7 | anthropic | 0.000 | 0.000 | prov. n≤1
11 | google/gemini-2.5-pro | google_gemini | 0.000 | 0.000 | prov. n≤1
Full deep dive — all 11 models, every test result & artifact →

vision_object_count · 11 models · top: anthropic/claude-haiku-4-5-20251001 (anthropic) 0.000 prov. n≤1
# | Model | Provider | Adj. Score | Raw | Evidence
1 | anthropic/claude-haiku-4-5-20251001 | anthropic | 0.000 | 0.000 | prov. n≤1
2 | anthropic/claude-sonnet-4-6 | anthropic | 0.000 | 0.000 | prov. n≤1
3 | anthropic/claude-opus-4-7 | anthropic | 0.000 | 0.000 | prov. n≤1
4 | openai/gpt-4o-mini | openai | 0.000 | 0.000 | prov. n≤1
5 | openai/gpt-4o | openai | 0.000 | 0.000 | prov. n≤1
6 | openai/gpt-4.1 | openai | 0.000 | 0.000 | prov. n≤1
7 | openai/gpt-5.1 | openai | 0.000 | 0.000 | prov. n≤1
8 | openai/gpt-5-nano | openai | 0.000 | 0.000 | prov. n≤1
9 | google/gemini-2.5-flash | google_gemini | 0.000 | 0.000 | prov. n≤1
10 | google/gemini-2.5-pro | google_gemini | 0.000 | 0.000 | prov. n≤1
11 | xai/grok-4 | xai | 0.000 | 0.000 | prov. n≤1
Full deep dive — all 11 models, every test result & artifact →

vision_ocr · 11 models · top: openai/gpt-4.1 (openai) 0.150 prov. n≤1
# | Model | Provider | Adj. Score | Raw | Evidence
1 | openai/gpt-4.1 | openai | 0.150 | 1.000 | prov. n≤1
2 | openai/gpt-5.1 | openai | 0.150 | 1.000 | prov. n≤1
3 | google/gemini-2.5-flash | google_gemini | 0.150 | 1.000 | prov. n≤1
4 | xai/grok-4 | xai | 0.150 | 1.000 | prov. n≤1
5 | anthropic/claude-haiku-4-5-20251001 | anthropic | 0.000 | 0.000 | prov. n≤1
6 | anthropic/claude-sonnet-4-6 | anthropic | 0.000 | 0.000 | prov. n≤1
7 | anthropic/claude-opus-4-7 | anthropic | 0.000 | 0.000 | prov. n≤1
8 | openai/gpt-4o-mini | openai | 0.000 | 0.000 | prov. n≤1
9 | openai/gpt-4o | openai | 0.000 | 0.000 | prov. n≤1
10 | openai/gpt-5-nano | openai | 0.000 | 0.000 | prov. n≤1
11 | google/gemini-2.5-pro | google_gemini | 0.000 | 0.000 | prov. n≤1
Full deep dive — all 11 models, every test result & artifact →

vision_spatial_reasoning · 11 models · top: openai/gpt-4o (openai) 0.150 prov. n≤1
# | Model | Provider | Adj. Score | Raw | Evidence
1 | openai/gpt-4o | openai | 0.150 | 1.000 | prov. n≤1
2 | openai/gpt-4.1 | openai | 0.150 | 1.000 | prov. n≤1
3 | openai/gpt-5.1 | openai | 0.150 | 1.000 | prov. n≤1
4 | google/gemini-2.5-flash | google_gemini | 0.150 | 1.000 | prov. n≤1
5 | xai/grok-4 | xai | 0.150 | 1.000 | prov. n≤1
6 | openai/gpt-4o-mini | openai | 0.075 | 0.500 | prov. n≤1
7 | anthropic/claude-haiku-4-5-20251001 | anthropic | 0.000 | 0.000 | prov. n≤1
8 | anthropic/claude-sonnet-4-6 | anthropic | 0.000 | 0.000 | prov. n≤1
9 | anthropic/claude-opus-4-7 | anthropic | 0.000 | 0.000 | prov. n≤1
10 | openai/gpt-5-nano | openai | 0.000 | 0.000 | prov. n≤1
11 | google/gemini-2.5-pro | google_gemini | 0.000 | 0.000 | prov. n≤1
Full deep dive — all 11 models, every test result & artifact →

How routing works

Unknown or newly phrased niches resolve to the nearest scored niche (by semantic and lexical similarity), or are flagged as genuinely new and queued for benchmarking. See Routing & Audit.
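
As a rough illustration of the routing rule, the sketch below blends a lexical similarity with a stand-in "semantic" similarity and returns the nearest scored niche, or `None` when nothing is close enough. Everything beyond the niche names is an assumption: the function names, the 50/50 weighting, the 0.6 threshold, and the token-overlap placeholder used in place of a real semantic comparison (which would presumably use embeddings) are all hypothetical, not the TAAS implementation.

```python
from difflib import SequenceMatcher
from typing import Callable, Iterable, Optional

# A few niche names taken from the catalog; the real list is longer.
SCORED_NICHES = ["ocr", "embedding", "image_generation",
                 "video_generation", "tts_synthesis", "stt_transcription"]


def lexical_sim(a: str, b: str) -> float:
    """Character-level similarity in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def toy_semantic_sim(a: str, b: str) -> float:
    """Placeholder for a semantic score: Jaccard overlap of word tokens."""
    ta = set(a.lower().replace("_", " ").split())
    tb = set(b.lower().replace("_", " ").split())
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def resolve_niche(query: str,
                  niches: Iterable[str],
                  semantic_sim: Callable[[str, str], float] = toy_semantic_sim,
                  weight: float = 0.5,        # hypothetical blend weight
                  threshold: float = 0.6,     # hypothetical cut-off
                  ) -> Optional[str]:
    """Return the nearest scored niche, or None if the query looks new."""
    best_name, best_score = None, 0.0
    for name in niches:
        score = (weight * semantic_sim(query, name)
                 + (1 - weight) * lexical_sim(query, name))
        if score > best_score:
            best_name, best_score = name, score
    # Below the threshold the niche would be flagged as new and queued.
    return best_name if best_score >= threshold else None


print(resolve_niche("video generation", SCORED_NICHES))  # video_generation
print(resolve_niche("protein folding", SCORED_NICHES))   # None
```

Passing the semantic scorer in as a parameter keeps the sketch runnable without any model dependency while leaving room to swap in an embedding-based comparison.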