Inspectable evaluation evidence
Failure explorer
Plain-English reasons: wrong answer, missing evidence, unsafe access, or production latency.
Failed scenarios: 6
E03 · multi-document
Production deployment requirements
Test contract
“What approvals and checks are required before a production deployment?”
- Role: engineer
- Expected behavior: answer
- Expected evidence: DOC-DEPLOY-GUIDE, DOC-CHANGE-POLICY
Actual answer
“Pass automated checks and get service-owner approval.”
1761 ms · 1073 input tokens · 145 output tokens
Retrieved evidence
Documents visible to the generation step
Production Deployment Guide
DOC-DEPLOY-GUIDE
Before deployment: pass automated checks, attach a rollback plan, confirm monitoring, and obtain service-owner approval.
engineering · relevance 0.94
Grader findings
MISSING_EVIDENCE
The answer omitted mandatory change-policy approvals.
- correctness: 58%
- retrieval recall: 50%
- groundedness: 70%
- citation accuracy: 68%
- abstention: 100%
- access control: 100%
Policy graders + semantic quality judge
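The 50% retrieval recall is consistent with one of the two expected documents reaching the generation step. The sketch below shows one common way to compute such a score, assuming recall is defined as the fraction of expected evidence IDs present in the retrieved set. The function name `retrieval_recall` and the empty-set convention are illustrative assumptions, not the grader's actual implementation.

```python
def retrieval_recall(expected: set[str], retrieved: set[str]) -> float:
    """Fraction of expected evidence documents that were retrieved."""
    if not expected:
        # Assumption: a scenario with no required evidence counts as full recall.
        return 1.0
    return len(expected & retrieved) / len(expected)


# Document IDs taken from the E03 test contract and retrieved evidence.
expected = {"DOC-DEPLOY-GUIDE", "DOC-CHANGE-POLICY"}
retrieved = {"DOC-DEPLOY-GUIDE"}

recall = retrieval_recall(expected, retrieved)
print(f"{recall:.0%}")  # prints 50%
```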
Root cause & fix
topK was reduced from 3 to 1, so only one required document was retrieved.
Restore multi-document retrieval with topK ≥ 3.
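The effect of the topK change can be illustrated with a minimal ranking sketch. Only the 0.94 relevance for DOC-DEPLOY-GUIDE comes from this report; the other scores, the third document ID, and the `retrieve` helper are hypothetical and stand in for whatever the real retriever does.

```python
def retrieve(scored_docs: dict[str, float], top_k: int) -> list[str]:
    """Return the IDs of the top_k highest-scoring documents."""
    ranked = sorted(scored_docs, key=scored_docs.get, reverse=True)
    return ranked[:top_k]


scores = {
    "DOC-DEPLOY-GUIDE": 0.94,         # relevance shown in the report
    "DOC-CHANGE-POLICY": 0.81,        # hypothetical score
    "DOC-HYPOTHETICAL-OTHER": 0.40,   # hypothetical document and score
}
required = {"DOC-DEPLOY-GUIDE", "DOC-CHANGE-POLICY"}

# With top_k=1 the change policy never reaches the generation step.
missing_at_1 = required - set(retrieve(scores, top_k=1))
# With top_k=3 both required documents are retrieved.
missing_at_3 = required - set(retrieve(scores, top_k=3))

print(missing_at_1)  # {'DOC-CHANGE-POLICY'}
print(missing_at_3)  # set()
```

A check like this could run as a regression test so that a future reduction of topK below the number of required documents fails before it reaches production.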