Inference-time observability
AI Neurosurgery
The model is not a black box. It never was.
In 2024, a hospital AI claimed a “<0.001% hallucination rate” while generating patient health summaries used for treatment decisions. The Texas AG found those accuracy claims were unsubstantiated: the vendor had no validated metrics to back them. The outputs were never verified. The claims about the outputs were never verified either.
Read the full case study →
That’s what no observability looks like. Here’s what observability actually is: token probabilities, entailment scores, stability under regeneration, hidden-state probes. These are discoverable signals at inference time. We read them, score them per claim, and use them to make deterministic gate decisions. We call it AI Neurosurgery because that’s what it is: opening the patient up and looking.
The black-box myth
The AI industry repeats one claim more than any other: nobody knows what the model is doing inside. This is a useful narrative for companies that don’t want accountability. It is not a fact. It is a business decision.
Modern language models expose multiple observable signals at inference time. Token log-probabilities reveal the model’s own confidence distribution. Natural language inference scores measure whether output claims are entailed by source material. Multi-draw regeneration exposes which claims are stable across sampling runs and which are stochastic artifacts. And for self-hosted models, hidden-state probes detect internal volatility invisible in token-level output.
None of these are exotic research techniques. They’re standard inference-time observables. The only question is whether you bother to look. Most vendors don’t. We do.
Four layers of observability
Calibrated confidence
L1: Token log-probabilities
The model assigns a probability distribution over its vocabulary at every token position. We extract logprobs for critical tokens in each claim and calibrate them into correctness probabilities. If the model is unsure, we know. If it’s confident and wrong, we know that too.
Open models
Standard logprob API (OpenAI and OpenAI-compatible endpoints, open-weight models via vLLM/TGI)
Goober
Native logprob access with per-token attribution to source chunks
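The logprob-to-calibrated-confidence step can be sketched roughly as follows. The Platt-style coefficients and the weakest-token aggregation rule are illustrative assumptions, not Goober's actual calibration method:

```python
import math

def token_probs(logprobs):
    """Raw log-probabilities -> probabilities for the critical tokens."""
    return [math.exp(lp) for lp in logprobs]

def calibrate(p, a=1.2, b=-0.1):
    """Platt-style calibration: map a raw token probability to an
    estimated correctness probability. a and b are toy values; a real
    system would fit them on labeled claim data."""
    p = min(max(p, 1e-6), 1 - 1e-6)  # clamp to avoid log(0)
    z = a * math.log(p / (1 - p)) + b
    return 1 / (1 + math.exp(-z))

def claim_confidence(critical_logprobs):
    """Score a claim by its weakest critical token, so one hedged or
    fabricated token drags the whole claim down."""
    return min(calibrate(p) for p in token_probs(critical_logprobs))
```

Under this rule, a claim whose obligation verb carries a logprob around −1.7 (probability ≈ 0.18) scores low no matter how confident the surrounding tokens are.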
Source entailment
L2: Embedding similarity + NLI
Each extracted claim is scored against oracle chunks using a two-stage pipeline: semantic embedding similarity for retrieval, then natural language inference for entailment classification. The NLI score is probabilistic. The gate decision isn’t.
Open models
External NLI model (cross-encoder) applied post-generation
Goober
Integrated entailment scoring with SIRE-aware authority weighting
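A minimal sketch of the two-stage pipeline, with `nli_fn` standing in for a real cross-encoder entailment model and the embeddings reduced to toy vectors. The threshold value is an assumption; the point is that the score is continuous while the gate is a fixed cut:

```python
def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = lambda x: sum(t * t for t in x) ** 0.5
    return dot / (norm(u) * norm(v))

def retrieve(claim_vec, chunks, k=2):
    """Stage 1: rank oracle chunks by embedding similarity."""
    return sorted(chunks, key=lambda c: cosine(claim_vec, c["vec"]),
                  reverse=True)[:k]

def entailment_gate(claim_vec, chunks, nli_fn, threshold=0.8):
    """Stage 2: run NLI on the top chunks. The NLI score is
    probabilistic; the gate decision is a deterministic threshold."""
    best = max(nli_fn(c) for c in retrieve(claim_vec, chunks))
    return {"entailment": round(best, 2),
            "verdict": "pass" if best >= threshold else "block"}
```

The same 0.72 entailment score that blocks a claim under a 0.8 threshold would pass under a looser one; the gate, not the model, decides.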
Stability analysis
L3: Multi-draw regeneration
The same prompt is run multiple times with controlled temperature. Claims that appear consistently are stable; claims that vary are stochastic artifacts. The model made it up, forgot it made it up, and made up something different the second time. We catch that.
Open models
Multiple API calls with temperature variation, external clustering
Goober
Controlled multi-draw with internal semantic clustering and drift detection
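The stability test can be sketched like this. A real system clusters claims semantically; this sketch uses normalized exact match, and the support threshold is an assumption:

```python
from collections import Counter

def _norm(claim):
    """Normalize whitespace and case before comparing claims."""
    return " ".join(claim.lower().split())

def stable_claims(draws, min_support=0.8):
    """draws: one list of extracted claims per regeneration run.
    A claim is stable if it appears in at least min_support of the
    draws; everything else is treated as a stochastic artifact."""
    n = len(draws)
    counts = Counter(c for draw in draws for c in {_norm(x) for x in draw})
    return {c for c, k in counts.items() if k / n >= min_support}
```

Five draws where four agree on “encryption of PHI in transit” and one extends it with “and at rest” keep the stable core and drop the drifted variant.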
Representation uncertainty
L4: Hidden-state probes
Lightweight classifiers trained on intermediate layer activations detect internal model uncertainty not visible in token probabilities. A claim can have high token confidence but volatile internal representation — the model is confidently uncertain. This layer catches that.
Open models
Requires self-hosted model with activation access (vLLM, TGI, custom serving)
Goober
Native hidden-state access with pre-trained probes, plus ablation observability
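In its simplest form, a hidden-state probe is a linear classifier over an activation vector. The weights here are toy values; real probes are trained offline on labeled activations:

```python
import math

def volatility_probe(activation, weights, bias):
    """Linear probe over an intermediate-layer activation vector.
    Returns the probability that the internal representation is
    volatile, even when token-level confidence looks high. Weights
    and bias are illustrative, not trained values."""
    z = sum(w * a for w, a in zip(weights, activation)) + bias
    return 1 / (1 + math.exp(-z))
```

A claim clears this layer only if the probe's volatility estimate stays under a fixed threshold, regardless of its token-level confidence.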
What this looks like in practice
Synthetic examples based on real Goober output patterns. Every governed response produces these signals per claim.
Claim: SOC 2 requires annual penetration testing of all in-scope systems.
“requires” and “annual” carry low confidence (0.41, 0.18). SOC 2 recommends but does not mandate annual pen testing. The model hedged on the obligation verb and fabricated the frequency. Gate blocked the claim before emission.
Claim: ISO 27001 mandates encryption of data at rest for all classified information.
“A policy on the use of cryptographic controls for protection of information shall be developed and implemented.”
“Procedures for handling assets shall be developed and implemented in accordance with the information classification scheme.”
Best source entails the general concept of cryptographic controls (0.72) but does not entail “mandates encryption of data at rest” specifically. The standard says “shall develop a policy,” not “shall encrypt.” Claim rewritten to reflect the actual obligation before emission.
Claim: HIPAA requires encryption of PHI in transit.
4/5 draws agree on the core claim. Run 3 added “and at rest” — an unsupported extension. The stable core (“encryption of PHI in transit”) was emitted. The drifted addition was stripped.
Claim: Organizations must implement role-based access control to satisfy CC6.1.
Baseline output: Organizations must implement role-based access control to satisfy CC6.1.
CC6.1 chunk masked: Organizations should implement appropriate access controls based on their risk assessment.
Full oracle masked: Best practice recommends role-based access control for enterprise environments.
Masking the CC6.1 chunk caused 73% output drift. The specific obligation language (“must implement”) and framework reference (“CC6.1”) both disappeared. Full oracle mask produced generic advice with no framework attribution. The claim is causally dependent on the source material — not regurgitated from training data.
Causal attribution
Not correlation. Proof.
Other vendors can tell you what the model said and which sources were retrieved. They cannot tell you whether the model actually used those sources, or whether it would have said the same thing without them.
Because Goober is our model, we run ablation studies at inference time. We selectively mask source chunks, disable retrieval pathways, or suppress specific attention heads and observe how the output changes. When we mask SOC 2 CC6.1 and the access-control claim disappears, that’s not correlation — that’s a causal link between evidence and output.
This is the difference between “we retrieved relevant documents” and “we can prove this claim came from this evidence.” Every other approach in the market stops at retrieval. We go through the model.
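The ablation test reduces to a drift measurement. Token-set Jaccard distance below is a crude stand-in for whatever semantic drift metric a production system would use, and the 0.5 threshold is an assumption:

```python
def output_drift(baseline, ablated):
    """Drift between the baseline output and the output with a source
    chunk masked: 1 minus the Jaccard overlap of token sets. A crude
    stand-in for a real semantic drift metric."""
    a = set(baseline.lower().split())
    b = set(ablated.lower().split())
    return 1 - len(a & b) / len(a | b)

def causally_dependent(baseline, ablated, threshold=0.5):
    """If masking the chunk pushes drift past the threshold, the claim
    is treated as causally dependent on that chunk."""
    return output_drift(baseline, ablated) >= threshold
```

Masking a chunk and seeing the obligation verb and framework reference vanish, as in the CC6.1 example above, is exactly the high-drift case this test flags.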
What each environment prevents
The Studio
L1–L2
Prevents: Invisible uncertainty
Without L1–L2, a confidently wrong claim looks identical to a well-grounded one. The Studio surfaces logprob confidence and entailment scores so a human reviewer can tell the difference. When the model says “requires” but the logprobs say 0.41, the label says “unverified.”
The Refinery
L1–L3
Prevents: Stochastic hallucination at scale
Advisory review doesn’t scale to production volume. The Refinery adds stability analysis: if a claim changes across regeneration draws, it’s a stochastic artifact and the gate blocks it. You can’t manually review 10,000 responses a day. The gate can.
The Clean Room
L1–L4
Prevents: Undiscoverable failure modes
Some failure modes are invisible in token output. A claim can have high logprob confidence and pass entailment but have volatile internal representations — the model is confidently uncertain. Hidden-state probes catch what tokens can’t. Hash-chained audit trail on every signal. For regulated industries where “the model seemed confident” is not a legal defense.
See it in action
Every governed response through Goober produces a full provenance envelope with per-claim scores across all active layers. The envelope is the proof. The rest is just talking.
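As a rough sketch of what such an envelope might contain — the field names and thresholds below are illustrative, not the actual Goober schema:

```python
def provenance_envelope(claim, signals):
    """Assemble a per-claim provenance record from the layer signals.
    Field names and thresholds are illustrative, not the real envelope
    schema. The verdict is deterministic: every active layer must
    clear its threshold for the claim to be emitted."""
    passed = all(s["score"] >= s["threshold"] for s in signals.values())
    return {"claim": claim, "signals": signals,
            "verdict": "emit" if passed else "block"}
```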