Kenshiki

L1–L4 — Evaluation

Claim Ledger

Prove what the system is willing to say.

The Ledger breaks each answer into claims, checks those claims against the evidence, and records what is supported, what is unsupported, and what evidence is missing. Unsupported claims do not get through. Every response carries a per-claim audit trail: not just a score, but the evidence each claim was checked against and why it passed or failed.
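As a rough sketch of what one entry in that audit trail might look like, here is a minimal record plus a gate that blocks any response containing a non-supported claim. The names (`ClaimAudit`, `Verdict`, `gate`) are illustrative, not the product's actual API:

```python
from dataclasses import dataclass
from enum import Enum

class Verdict(Enum):
    SUPPORTED = "supported"
    UNSUPPORTED = "unsupported"
    EVIDENCE_MISSING = "evidence_missing"

@dataclass
class ClaimAudit:
    claim_id: str
    claim_text: str
    evidence_ids: list[str]  # evidence the claim was checked against
    verdict: Verdict
    rationale: str           # why the claim passed or failed

def gate(audits: list[ClaimAudit]) -> bool:
    """Unsupported claims do not get through: pass the response
    only if every claim is supported."""
    return all(a.verdict is Verdict.SUPPORTED for a in audits)
```

The point of the record shape is that a score alone is not auditable; the evidence IDs and rationale are what let an auditor replay the decision.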

Without this: you are scoring responses as monoliths. One backed claim hides three fabricated ones. You cannot tell auditors which specific assertion failed or why.

How evaluation works

The Ledger decomposes every model response into atomic claims, then runs each claim through a multi-layer evaluation pipeline. The core differentiator is contrastive causal attribution — measuring whether governed evidence actually caused a specific claim, not just whether it appeared nearby.
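Contrastive attribution of this kind is commonly implemented by ablation: score the claim with the full evidence set, then with each passage removed, and attribute to a passage the drop it causes. A minimal sketch, where `score_claim` stands in for whatever support scorer the pipeline uses (a hypothetical callable, not a named API):

```python
def causal_attribution(claim, passages, score_claim):
    """Contrastive causal attribution by ablation: a passage counts as
    causal for a claim only if removing it lowers the claim's support
    score, not merely because it appeared near the claim."""
    full = score_claim(claim, passages)
    return {
        i: full - score_claim(claim, passages[:i] + passages[i + 1:])
        for i in range(len(passages))
    }
```

A passage that co-occurs with the claim but contributes nothing gets an attribution of zero, which is exactly the "nearby but not causal" case the paragraph above distinguishes.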

  • Claim extraction: splits output into atomic factual assertions with claim IDs and types (all tiers)
  • L1 calibrated confidence: converts token logprobs into calibrated correctness probabilities (all tiers)
  • L2 source entailment: scores each claim against evidence using embedding similarity + NLI (all tiers)
  • L3 stability: multi-draw regeneration and semantic clustering to identify reproducible vs stochastic claims (all tiers where deterministic sampling is available)
  • L4 representation uncertainty: hidden-state probes to detect internal volatility not visible in token confidence (Refinery and Clean Room only — requires self-hosted inference)
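Two of the layers above can be sketched in a few lines. The weights, threshold, and helper names here are illustrative placeholders, not the product's tuned values:

```python
def entailment_score(similarity, nli_entail, nli_contra, alpha=0.4):
    """L2-style blend: embedding similarity plus NLI entailment
    probability, penalized by contradiction probability.
    alpha is an assumed mixing weight."""
    return max(0.0, alpha * similarity + (1 - alpha) * nli_entail - nli_contra)

def is_stable(claim, draws, equivalent, min_fraction=0.7):
    """L3-style stability: a claim is reproducible if a semantically
    equivalent claim appears in at least min_fraction of regenerated
    draws. `equivalent` is a pluggable semantic-match predicate."""
    hits = sum(1 for draw in draws if any(equivalent(claim, c) for c in draw))
    return hits / len(draws) >= min_fraction
```

In this shape, a claim with high cosine similarity but an NLI contradiction is pushed toward zero, and a claim that only appears in one of several regenerated drafts is flagged as stochastic rather than reproducible.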

Who this is for

Governance runtime

Executes automatically on every governed request: decomposes, scores, and gates without manual review.

Compliance officers and auditors

Consume per-claim audit trails. Each claim links to the evidence it was checked against, the score it received, and the gate decision it triggered.