Kura Index

Store what counts as real.

Kura is the evidence store. You POST source material into Kura, and the system preserves provenance, structure, and retrieval boundaries so every downstream answer can be traced back to something real. Documents are parsed by Docling (GPU-accelerated layout analysis, table extraction, OCR), enriched with clause IDs, normative language markers, SIRE identity tags, and cross-references, then chunked and embedded into the governed evidence store. Kura includes the Crosswalk — the authority map that scopes retrieval by caller identity and source boundaries.

Without Kura, every downstream governance decision is an assertion without evidence. The Prompt Compiler cannot scope what the model sees. The Claim Ledger has nothing to check claims against. The Boundary Gate has no basis for its decision. No evidence in Kura, no grounded answer from Kadai.

Why Kura Exists

Standard RAG systems retrieve whatever is nearest in embedding space and hand it to the model. There is no authority boundary, no source provenance, no access control, and no way to prove what the model was allowed to see. Kura exists because governed inference requires a governed evidence boundary — not just a vector database, but a system that knows what each source is, who can access it, and what it is allowed to support.

  • RAG without authority boundaries is retrieval, not governance
  • The model must not see evidence the caller is not authorized to use
  • Post-generation scoring cannot fix what was never in scope
  • Every claim in the Ledger must trace back to a specific chunk with provenance

What Kura Does

Transforms authoritative source documents into a queryable, tamper-evident knowledge base. Every chunk carries provenance from upload through embedding.

  • SHA-256 source hash, idempotent upsert, version-aware change detection (sketched after this list)
  • Section-aware chunking on heading boundaries with merge for undersized chunks
  • HMAC-SHA-256 watermarks per chunk — self-contained verification without database access
  • Embedding via text-embedding-3-large, Matryoshka-truncated to 512 dimensions
  • Tenant provenance on every row, enforced by database CHECK constraints
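
A minimal sketch of that identity check, using an in-memory dict as a stand-in for the source table (the helper names are illustrative, not Kura's actual API):

    import hashlib

    def source_hash(raw: bytes) -> str:
        # SHA-256 over the raw source bytes: the identity key for upserts
        return hashlib.sha256(raw).hexdigest()

    def upsert_source(store: dict, source_id: str, raw: bytes) -> str:
        digest = source_hash(raw)
        existing = store.get(source_id)
        if existing is None:
            store[source_id] = {"hash": digest, "version": 1}
            return "created"
        if existing["hash"] == digest:
            return "unchanged"   # idempotent: re-POSTing identical bytes is a no-op
        existing.update(hash=digest, version=existing["version"] + 1)
        return "updated"         # version bump triggers re-chunk and re-embed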

How Documents Are Parsed

Docling runs GPU-accelerated layout analysis (DocLayNet), table structure extraction (TableFormer), and OCR (EasyOCR) on every uploaded document. The output is enriched before chunking — not after, and not by the model. A sketch of the enrichment pass follows the list.

  • Two-stage pipeline: GPU parse → CPU enrichment
  • Clause ID extraction for regulatory citations (e.g., "DFARS 252.204-7012" as a single entity)
  • Normative language detection (SHALL/MUST/REQUIRED flags)
  • Cross-reference resolution between sections and documents
  • Quality gate rejects OCR garbage, TOC entries, and low-density chunks
  • SIRE identity tags stamped on every chunk during enrichment
  • If a document fails parsing after 3 retries, it is quarantined, not dropped, and the Claim Ledger receives a DEGRADED_BOUNDARY annotation
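
A minimal sketch of the clause ID and normative-language pass; the regexes are illustrative assumptions, not Docling's or Kura's actual patterns:

    import re

    CLAUSE_ID = re.compile(r"\b(?:DFARS|FAR)\s+\d{3}\.\d{3}-\d{4}\b")
    NORMATIVE = re.compile(r"\b(?:SHALL(?:\s+NOT)?|MUST(?:\s+NOT)?|REQUIRED)\b")

    def enrich(chunk_text: str) -> dict:
        # Stamp clause IDs and normative flags on a chunk before embedding
        return {
            "text": chunk_text,
            "clause_ids": CLAUSE_ID.findall(chunk_text),      # kept as single entities
            "normative": bool(NORMATIVE.search(chunk_text)),  # SHALL/MUST/REQUIRED flag
        }

    print(enrich("Contractors SHALL comply with DFARS 252.204-7012."))
    # {'text': ..., 'clause_ids': ['DFARS 252.204-7012'], 'normative': True}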

SIRE Identity System

SIRE (Subject, Included, Relevant, Excluded) is deterministic identity metadata embedded in source frontmatter during ingestion. It defines what each source is about, what it covers, what it relates to, and what it must never be used to answer. Only Excluded enforces — the other three inform discovery.

  • Subject: anchors the source to a specific domain (e.g., soc_2_trust_services_criteria, eu_ai_act, hipaa_privacy_security)
  • Included: enriches search with the terminology the source covers (e.g., 'conformity assessment', 'data protection officer', 'cardholder data')
  • Relevant: maps cross-source topology (e.g., ISO 27001 is relevant to SOC 2; NIST AI RMF is relevant to EU AI Act)
  • Excluded: hard boundary enforcement (e.g., SOC 2 source excludes 'sox', 'gaap', 'hipaa'; GDPR source excludes 'ccpa only', 'hipaa only')
  • At retrieval time, the exclusion gate purges any chunk whose content matches an excluded term — case-insensitive, word-boundary match (sketched after this list)
  • SIRE proposals are generated by keyword frequency scan, then manually curated and approved before being applied
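
A minimal sketch of that gate; the term list is illustrative:

    import re

    def build_exclusion_gate(excluded_terms):
        # Compile each excluded term as a case-insensitive, word-boundary pattern
        patterns = [re.compile(rf"\b{re.escape(t)}\b", re.IGNORECASE)
                    for t in excluded_terms]
        def allows(chunk_text: str) -> bool:
            # A chunk matching any excluded term is purged before reaching the model
            return not any(p.search(chunk_text) for p in patterns)
        return allows

    soc2_gate = build_exclusion_gate(["sox", "gaap", "hipaa"])
    chunks = ["CC6.1 covers logical access provisioning.",
              "SOX 404 requires an internal control assessment."]
    print([c for c in chunks if soc2_gate(c)])   # the SOX chunk is purged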

Retrieval and Access Control

At retrieval time, Kura's Crosswalk scopes evidence by the caller's access boundary via OpenFGA/ReBAC. The model only sees evidence the caller is authorized to use for this specific question.

  • Hybrid retrieval (pgvector + tsvector) ranked by semantic + lexical similarity (see the sketch after this list)
  • Chunks grouped by SIRE subject, subjects ranked by mean relevance score
  • SIRE exclusion gate purges unauthorized or out-of-scope chunks before they reach the model
  • Per-caller evidence scoping via OpenFGA relationship-based access control
  • Tenant-scoped row-level security on every evidence table
  • The model never sees evidence outside the caller's authorization boundary
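
A sketch of the hybrid query in Python with psycopg; the table and column names and the 0.6/0.4 weighting are assumptions, not Kura's actual schema:

    import psycopg  # psycopg 3; assumes the pgvector extension is installed

    HYBRID_QUERY = """
    SELECT chunk_id, sire_subject,
           0.6 * (1 - (embedding <=> %(qvec)s::vector))               -- semantic (cosine)
         + 0.4 * ts_rank(tsv, plainto_tsquery('english', %(qtext)s))  -- lexical
             AS score
      FROM evidence_chunks
     WHERE tenant_id = %(tenant)s            -- tenant-scoped row filter
     ORDER BY score DESC
     LIMIT 40
    """

    def retrieve(conn: psycopg.Connection, query_embedding, query_text, tenant_id):
        qvec = "[" + ",".join(map(str, query_embedding)) + "]"   # pgvector literal
        with conn.cursor() as cur:
            cur.execute(HYBRID_QUERY, {"qvec": qvec, "qtext": query_text,
                                       "tenant": tenant_id})
            # The OpenFGA authorization check and the SIRE exclusion gate run
            # on these rows before anything reaches the model.
            return cur.fetchall()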

Relationship Discovery

The Crosswalk processes the full evidence library to map coverage, overlaps, conflicts, and routing paths so multi-source governance can run deterministically.

  • Declared relationships: matches Excluded-to-Included across all sources, validates Relevant references, detects Subject overlap
  • Discovered relationships: cross-source embedding similarity (cosine threshold 0.80) finds coverage that SIRE tags missed (see the sketch after this list)
  • Registry merge: confirmed, declared-only, discovered-only, and conflict relationships in one authority map with O(1) concept lookup
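
A sketch of the discovered-relationship scan; the 0.80 threshold comes from the list above, while pooling each source into a single vector is an assumption:

    import numpy as np

    def discovered_relationships(source_vecs: dict[str, np.ndarray],
                                 threshold: float = 0.80):
        # Normalize each source's pooled embedding, then flag cross-source
        # pairs whose cosine similarity clears the threshold.
        unit = {name: v / np.linalg.norm(v) for name, v in source_vecs.items()}
        names = list(unit)
        pairs = []
        for i, a in enumerate(names):
            for b in names[i + 1:]:
                cos = float(unit[a] @ unit[b])
                if cos >= threshold:
                    pairs.append((a, b, cos))   # coverage the SIRE tags missed
        return pairs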

What Kura Proves

Every chunk in Kura carries an immutable attribution chain: embedding authority, egress policy, and pipeline run attestation. Unattributed chunks are structurally impossible — enforced by database constraints, not application logic.

  • Chunk watermarks verify without database access (HMAC-SHA-256; see the sketch after this list)
  • Immutable event log for every embedding operation
  • Every embedding vector carries sovereignty attribution enforced by CHECK constraint
  • Designed for air-gap and VPC deployment where data must never leave the customer's boundary
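
A minimal sketch of watermark verification; key distribution and the exact bytes under the MAC are assumptions:

    import hashlib
    import hmac

    def watermark(chunk_id: str, text: str, key: bytes) -> str:
        # MAC over the chunk identity and content, stored alongside the chunk
        msg = chunk_id.encode() + b"\x00" + text.encode()
        return hmac.new(key, msg, hashlib.sha256).hexdigest()

    def verify(chunk_id: str, text: str, tag: str, key: bytes) -> bool:
        # Anyone holding the key can verify a chunk offline, no database access
        return hmac.compare_digest(watermark(chunk_id, text, key), tag)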

Tier Variations

Kura runs in every deployment tier. What changes is where the evidence store lives and who controls it.

  • Workshop: Kura runs in the shared Kenshiki environment. Evidence is managed by Kenshiki.
  • Refinery: Kura runs inside the customer's private deployment (VPC, GovCloud, or connected on-prem). Evidence stays inside the customer's boundary.
  • Clean Room: Kura runs on air-gapped, verified hardware. Evidence never leaves the customer's physical premises.

Dependency on the Prompt Compiler

The metadata Kura stamps on every chunk — clause IDs, normative markers, SIRE tags, source tier — is what the Prompt Compiler uses to decide which CFPO zone each piece of evidence belongs in. Without this metadata, the Compiler cannot make informed zone decisions, and the prompt contract is ungoverned. The routing rule is sketched after the list.

  • Normative mandates (SHALL/MUST) → Policy zone
  • Structural definitions and schemas → Format zone
  • Advisory narrative and context → Content zone
  • If a chunk arrives without SIRE tags or normative markers, it defaults to the Content zone with reduced authority weight
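
A sketch of that routing as a function; the zone names follow the list above, while the reduced-weight value is illustrative:

    def cfpo_zone(chunk: dict) -> tuple[str, float]:
        # Chunks missing SIRE tags or normative markers default to the
        # Content zone at reduced authority weight (0.5 is illustrative).
        if not chunk.get("sire_tags") or "normative" not in chunk:
            return ("content", 0.5)
        if chunk["normative"]:                       # SHALL/MUST mandates
            return ("policy", 1.0)
        if chunk.get("kind") in {"schema", "definition"}:
            return ("format", 1.0)
        return ("content", 1.0)                      # advisory narrative, context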

Who This Is For

Corpus engineers

data stewards who curate, version, and maintain authoritative source collections inside the evidence boundary. Responsible for ingestion, SIRE tagging, and evidence quality.

Every downstream system

the Prompt Compiler draws evidence from it for zone mapping. The Claim Ledger checks claims against it. The Boundary Gate relies on it for emission decisions. Kadai returns answers bounded by what Kura contains.