Inference-time observability
AI Neurosurgery
The model is not a black box. It never was.
In 2024, a hospital AI claimed a “<0.001% hallucination rate” while generating patient health summaries used for treatment decisions. The Texas AG found those accuracy claims were unsubstantiated: the vendor had no validated metrics to back them. The outputs were never verified. The claims about the outputs were never verified either.
Read the full case study →
That’s what no observability looks like. Here’s what observability actually is: token probabilities, entailment scores, stability under regeneration, hidden-state probes. These are discoverable signals at inference time. We read them, score them per claim, and use them to make deterministic gate decisions. We call it AI Neurosurgery because that’s what it is: opening the patient up and looking.
The black-box myth
The AI industry repeats one claim more than any other: nobody knows what the model is doing inside. This is a useful narrative for companies that don’t want accountability. It is not a fact. It is a business decision.
Modern language models expose multiple observable signals at inference time. Token log-probabilities reveal the model’s own confidence distribution. Natural language inference scores measure whether output claims are entailed by source material. Multi-draw regeneration exposes which claims are stable across sampling runs and which are stochastic artifacts. And for self-hosted models, hidden-state probes detect internal volatility invisible in token-level output.
None of these are exotic research techniques. They’re standard inference-time observables. The only question is whether you bother to look. Most vendors don’t. We do.
Four layers of observability
Calibrated confidence
L1: Token log-probabilities
The model assigns a probability distribution over its vocabulary at every token position. We extract logprobs for critical tokens in each claim and calibrate them into correctness probabilities. If the model is unsure, we know. If it’s confident and wrong, we know that too.
Open models
Standard logprob API (OpenAI and OpenAI-compatible endpoints, open-weight models via vLLM/TGI)
Goober
Native logprob access with per-token attribution to source chunks
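The logprob-to-calibrated-confidence step can be sketched roughly as follows. The Platt-style coefficients and the weakest-token aggregation rule are illustrative assumptions, not Goober's actual calibration method:

```python
import math

def token_probs(logprobs):
    """Raw log-probabilities -> probabilities for the critical tokens."""
    return [math.exp(lp) for lp in logprobs]

def calibrate(p, a=1.2, b=-0.1):
    """Platt-style calibration: map a raw token probability to an
    estimated correctness probability. a and b are toy values; a real
    system would fit them on labeled claim data."""
    p = min(max(p, 1e-6), 1 - 1e-6)  # clamp to avoid log(0)
    z = a * math.log(p / (1 - p)) + b
    return 1 / (1 + math.exp(-z))

def claim_confidence(critical_logprobs):
    """Score a claim by its weakest critical token, so one hedged or
    fabricated token drags the whole claim down."""
    return min(calibrate(p) for p in token_probs(critical_logprobs))
```

Under this rule, a claim whose obligation verb carries a logprob around −1.7 (probability ≈ 0.18) scores low no matter how confident the surrounding tokens are.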
Source entailment
L2: Embedding similarity + NLI
Each extracted claim is scored against oracle chunks using a two-stage pipeline: semantic embedding similarity for retrieval, then natural language inference for entailment classification. The NLI score is probabilistic. The gate decision isn’t.
Open models
External NLI model (cross-encoder) applied post-generation
Goober
Integrated entailment scoring with SIRE-aware authority weighting
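A minimal sketch of the two-stage pipeline, with `nli_fn` standing in for a real cross-encoder entailment model and the embeddings reduced to toy vectors. The threshold value is an assumption; the point is that the score is continuous while the gate is a fixed cut:

```python
def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = lambda x: sum(t * t for t in x) ** 0.5
    return dot / (norm(u) * norm(v))

def retrieve(claim_vec, chunks, k=2):
    """Stage 1: rank oracle chunks by embedding similarity."""
    return sorted(chunks, key=lambda c: cosine(claim_vec, c["vec"]),
                  reverse=True)[:k]

def entailment_gate(claim_vec, chunks, nli_fn, threshold=0.8):
    """Stage 2: run NLI on the top chunks. The NLI score is
    probabilistic; the gate decision is a deterministic threshold."""
    best = max(nli_fn(c) for c in retrieve(claim_vec, chunks))
    return {"entailment": round(best, 2),
            "verdict": "pass" if best >= threshold else "block"}
```

The same 0.72 entailment score that blocks a claim under a 0.8 threshold would pass under a looser one; the gate, not the model, decides.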
Stability analysis
L3: Multi-draw regeneration
The same prompt is run multiple times with controlled temperature. Claims that appear consistently are stable; claims that vary are stochastic artifacts. The model made it up, forgot it made it up, and made up something different the second time. We catch that.
Open models
Multiple API calls with temperature variation, external clustering
Goober
Controlled multi-draw with internal semantic clustering and drift detection
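The stability test can be sketched like this. A real system clusters claims semantically; this sketch uses normalized exact match, and the support threshold is an assumption:

```python
from collections import Counter

def _norm(claim):
    """Normalize whitespace and case before comparing claims."""
    return " ".join(claim.lower().split())

def stable_claims(draws, min_support=0.8):
    """draws: one list of extracted claims per regeneration run.
    A claim is stable if it appears in at least min_support of the
    draws; everything else is treated as a stochastic artifact."""
    n = len(draws)
    counts = Counter(c for draw in draws for c in {_norm(x) for x in draw})
    return {c for c, k in counts.items() if k / n >= min_support}
```

Five draws where four agree on “encryption of PHI in transit” and one extends it with “and at rest” keep the stable core and drop the drifted variant.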
Representation uncertainty
L4: Hidden-state probes
Lightweight classifiers trained on intermediate layer activations detect internal model uncertainty not visible in token probabilities. A claim can have high token confidence but volatile internal representation — the model is confidently uncertain. This layer catches that.
Open models
Requires self-hosted model with activation access (vLLM, TGI, custom serving)
Goober
Native hidden-state access with pre-trained probes, plus ablation observability
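In its simplest form, a hidden-state probe is a linear classifier over an activation vector. The weights here are toy values; real probes are trained offline on labeled activations:

```python
import math

def volatility_probe(activation, weights, bias):
    """Linear probe over an intermediate-layer activation vector.
    Returns the probability that the internal representation is
    volatile, even when token-level confidence looks high. Weights
    and bias are illustrative, not trained values."""
    z = sum(w * a for w, a in zip(weights, activation)) + bias
    return 1 / (1 + math.exp(-z))
```

A claim clears this layer only if the probe's volatility estimate stays under a fixed threshold, regardless of its token-level confidence.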
What this looks like in practice
Synthetic examples based on real Goober output patterns. Every governed response produces these signals per claim.
Claim: SOC 2 requires annual penetration testing of all in-scope systems.
“requires” and “annual” carry low confidence (0.41, 0.18). SOC 2 recommends but does not mandate annual pen testing. The model hedged on the obligation verb and fabricated the frequency. Gate blocked the claim before emission.
Claim: ISO 27001 mandates encryption of data at rest for all classified information.
“A policy on the use of cryptographic controls for protection of information shall be developed and implemented.”
“Procedures for handling assets shall be developed and implemented in accordance with the information classification scheme.”
Best source entails the general concept of cryptographic controls (0.72) but does not entail “mandates encryption of data at rest” specifically. The standard says “shall develop a policy,” not “shall encrypt.” Claim rewritten to reflect the actual obligation before emission.
Claim: HIPAA requires encryption of PHI in transit.
4/5 draws agree on the core claim. Run 3 added “and at rest” — an unsupported extension. The stable core (“encryption of PHI in transit”) was emitted. The drifted addition was stripped.
Claim: Organizations must implement role-based access control to satisfy CC6.1.
Baseline output: Organizations must implement role-based access control to satisfy CC6.1.
CC6.1 chunk masked: Organizations should implement appropriate access controls based on their risk assessment.
Full oracle masked: Best practice recommends role-based access control for enterprise environments.
Masking the CC6.1 chunk caused 73% output drift. The specific obligation language (“must implement”) and framework reference (“CC6.1”) both disappeared. Full oracle mask produced generic advice with no framework attribution. The claim is causally dependent on the source material — not regurgitated from training data.
Causal attribution
Not correlation. Proof.
Other vendors can tell you what the model said and which sources were retrieved. They cannot tell you whether the model actually used those sources, or whether it would have said the same thing without them.
Because Goober is our model, we run ablation studies at inference time. We selectively mask source chunks, disable retrieval pathways, or suppress specific attention heads and observe how the output changes. When we mask SOC 2 CC6.1 and the access-control claim disappears, that’s not correlation — that’s a causal link between evidence and output.
This is the difference between “we retrieved relevant documents” and “we can prove this claim came from this evidence.” Every other approach in the market stops at retrieval. We go through the model.
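The ablation test reduces to a drift measurement. Token-set Jaccard distance below is a crude stand-in for whatever semantic drift metric a production system would use, and the 0.5 threshold is an assumption:

```python
def output_drift(baseline, ablated):
    """Drift between the baseline output and the output with a source
    chunk masked: 1 minus the Jaccard overlap of token sets. A crude
    stand-in for a real semantic drift metric."""
    a = set(baseline.lower().split())
    b = set(ablated.lower().split())
    return 1 - len(a & b) / len(a | b)

def causally_dependent(baseline, ablated, threshold=0.5):
    """If masking the chunk pushes drift past the threshold, the claim
    is treated as causally dependent on that chunk."""
    return output_drift(baseline, ablated) >= threshold
```

Masking a chunk and seeing the obligation verb and framework reference vanish, as in the CC6.1 example above, is exactly the high-drift case this test flags.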
What each environment prevents
The Studio
L1–L2
Prevents: Invisible uncertainty
Without L1–L2, a confidently wrong claim looks identical to a well-grounded one. The Studio surfaces logprob confidence and entailment scores so a human reviewer can tell the difference. When the model says “requires” but the logprobs say 0.41, the label says “unverified.”
The Refinery
L1–L3
Prevents: Stochastic hallucination at scale
Advisory review doesn’t scale to production volume. The Refinery adds stability analysis: if a claim changes across regeneration draws, it’s a stochastic artifact and the gate blocks it. You can’t manually review 10,000 responses a day. The gate can.
The Clean Room
L1–L4
Prevents: Undiscoverable failure modes
Some failure modes are invisible in token output. A claim can have high logprob confidence and pass entailment but have volatile internal representations — the model is confidently uncertain. Hidden-state probes catch what tokens can’t. Hash-chained audit trail on every signal. For regulated industries where “the model seemed confident” is not a legal defense.
See it in action
Every governed response through Goober produces a full provenance envelope with per-claim scores across all active layers. The envelope is the proof. The rest is just talking.
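As a rough sketch of what such an envelope might contain — the field names and thresholds below are illustrative, not the actual Goober schema:

```python
def provenance_envelope(claim, signals):
    """Assemble a per-claim provenance record from the layer signals.
    Field names and thresholds are illustrative, not the real envelope
    schema. The verdict is deterministic: every active layer must
    clear its threshold for the claim to be emitted."""
    passed = all(s["score"] >= s["threshold"] for s in signals.values())
    return {"claim": claim, "signals": signals,
            "verdict": "emit" if passed else "block"}
```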