Structural reproducibility under Canon gating vs probabilistic completion
This page documents a controlled test of an upstream boundary. The same inputs are run through a Canon-constrained transcription path and an unconstrained language generation path. Canon is designed to be reproducible. Unconstrained generation is designed to complete.
This page demonstrates reproducibility and evidence-support behaviour under a specific test setup. It does not assess truth, correctness, quality, compliance, or adequacy of the underlying events. “PASS” here means determinism passed (same input → same output), not “correctness”.
Canon runs before interpretation. It records only what is explicitly stated and reports what is not stated. This creates structural reproducibility: the same audit run on different days returns the same output.
Canon determinism passed across all cases. Unconstrained outputs drifted frequently, including at temperature 0, and in half of cases asserted at least one field without verbatim support in the source.
| Run | Purpose | What it showed |
|---|---|---|
| v1 | Simple 3-run comparison | Canon side produced identical hashes across runs. Unconstrained side produced different hashes across runs. This established basic reproducibility versus variability. |
| v2 | Extended trial with drift metrics | The first v2 extractor was too brittle and could collapse outputs to "not stated" for all fields, which can mask variability. This prompted a stronger interface design. |
| v3 | Strict JSON plus evidence quotes (30 cases) | Drift and evidence support were tested explicitly. The unconstrained model had to output a fixed schema and include verbatim evidence quotes for each field. |
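The v1 comparison reduces to hashing each run's structured output and checking whether the hashes collide. A minimal sketch of that check is shown below; `canon_transcribe` and `llm_extract` are hypothetical placeholders for the two paths, not the demo's actual functions.

```python
import hashlib
import json

def output_hash(output):
    """Serialise with stable key ordering and hash, so identical structure
    always yields an identical digest."""
    canonical = json.dumps(output, sort_keys=True, ensure_ascii=False)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

def is_reproducible(run_outputs):
    """v1-style check: every run on the same input must hash to the same value."""
    return len({output_hash(o) for o in run_outputs}) == 1

# Hypothetical usage over three repeated runs of the same case:
# canon_runs = [canon_transcribe(case) for _ in range(3)]
# freeform_runs = [llm_extract(case) for _ in range(3)]
# print(is_reproducible(canon_runs), is_reproducible(freeform_runs))
```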
The numbers on this page are results from this test setup only (case set, prompt rules, model endpoint, and run conditions). They are not presented as universal rates across all models or all tasks.
| Component | Definition used in this demo |
|---|---|
| Canon output | Deterministic transcription of explicit fields (Actor, Action, Time, Outcome), with absence reported as "not stated". Canon hashing is performed over canonical fields only. |
| Unconstrained output | The model produces strict JSON with keys who, what, when, outcome, plus evidence_quotes (arrays of verbatim quotes for each field). |
| Drift | A case is marked as drift if the hashed JSON output differs between runs. Drift is measured separately for a "hot" setting (temperature 1.0) and a "cold" setting (temperature 0.0). |
| Inferred without evidence | A case is flagged if at least one field is asserted (not "not stated") but its evidence quote list is empty. This indicates completion without explicit support. |
| Bad quotes | Evidence quotes are validated by checking they appear verbatim in the source text. A bad quote would be a quote that is not actually present in the source. |
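The metric definitions above can be made concrete with a short sketch. The field names and the evidence_quotes key follow the schema in the table; the exact stored shape (here, evidence_quotes as a dict keyed by field) and the function names are assumptions for illustration, not the demo's actual data model.

```python
import hashlib
import json

FIELDS = ("who", "what", "when", "outcome")
NOT_STATED = "not stated"

def canonical_hash(output, fields=FIELDS):
    """Hash only the schema fields, with stable key ordering, so the digest
    reflects structure rather than incidental formatting."""
    canonical = json.dumps({f: output.get(f) for f in fields}, sort_keys=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

def has_drift(run_outputs):
    """A case is marked as drift if repeated runs at the same temperature
    setting hash differently."""
    return len({canonical_hash(o) for o in run_outputs}) > 1

def inferred_without_evidence(output):
    """Flag the case if any field is asserted (not 'not stated') while its
    evidence quote list is empty."""
    quotes = output.get("evidence_quotes", {})
    return any(
        output.get(f) not in (None, NOT_STATED) and not quotes.get(f)
        for f in FIELDS
    )

def has_bad_quotes(output, source_text):
    """Flag the case if any supplied quote is not present verbatim in the source."""
    quotes = output.get("evidence_quotes", {})
    return any(
        q not in source_text
        for field_quotes in quotes.values()
        for q in field_quotes
    )

# Per-setting rates are then plain means over the case set, e.g.
# hot_drift_rate = sum(has_drift(runs) for runs in hot_runs_by_case) / len(hot_runs_by_case)
```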
Why this design matters: hashes alone can look like a black box to non-technical readers. v3 makes determinism human-auditable by requiring explicit structure, and makes completion visible by requiring evidence.
The objective of this test is not to maximise failure, but to demonstrate that absence is not preserved under generation without a deterministic boundary.
The percentages shown below are not performance scores and should not be interpreted as accuracy or error rates. They represent how often a specific structural behaviour appeared under controlled test conditions.
The purpose of this trial is not to show that models are “wrong”, but to show that structural completion and evidential grounding are independent properties. A system can appear stable while still asserting unsupported structure.
| Metric | Result | Interpretation |
|---|---|---|
| Canon determinism | PASS | Same input produced identical Canon output across all cases and runs. |
| Hot drift rate | 86.67% | In most cases, unconstrained JSON structure changed across runs at temperature 1.0. |
| Cold drift rate | 70.00% | Drift remained common even at temperature 0 in this setup. |
| Hot inferred-without-evidence | 50.00% | In half of cases, at least one field was asserted without verbatim support. |
| Cold inferred-without-evidence | 50.00% | Evidence gaps persisted even when the model was run "cold". |
| Bad-quotes rate | 0.00% | Every quote the model supplied was present verbatim in the source. The failure mode was unsupported completion, not fabricated quotation. |
Interpretation note: a “70% drift rate” does not mean the model failed 70% of the time. It means that in 70% of test cases, the structured output differed across repeated runs.
“50% inferred-without-evidence” means that in half of cases, at least one field was asserted without any supporting quote. In audit-grade workflows, a single unsupported actor or timeline element is sufficient to compromise traceability.
In this setup, drift remained common at temperature 0. Reproducibility cannot be assumed without a deterministic boundary.
Even when outputs match across runs, a field can still be asserted without evidence. Stability is not the same as admissibility.
Two cases are adversarial (designed to invite completion) and one is a control case (explicitly stated). The point is not to “catch the model out”, but to show what happens when a system is asked to provide structure that is not present.
| Field | Observed behaviour |
|---|---|
| who | Often asserted without explicit naming. |
| when | Often implied via narrative sequencing. |
| outcome | Commonly completed as a plausible consequence. |
Canon records the missing structure as "not stated".
| Field | Observed behaviour |
|---|---|
| what | Can be restated as an action even when the source is descriptive. |
| outcome | Often asserted as a result without verbatim support. |
Canon outputs remain stable and bounded to explicit commitments.
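As a purely illustrative contrast (an invented sentence, not a case from the 30-case set), the patterns in the tables above look roughly like this:

```python
# Invented example for illustration only; not drawn from the trial data.
source = "An incident report was circulated internally."

# Canon-style transcription preserves absence (canonical fields only):
canon_output = {
    "Actor": "not stated",
    "Action": "incident report was circulated internally",
    "Time": "not stated",
    "Outcome": "not stated",
}

# A typical unconstrained completion supplies plausible but unsupported structure:
freeform_output = {
    "who": "the compliance team",            # asserted without explicit naming
    "what": "circulated the incident report",
    "when": "shortly after the incident",    # implied via narrative sequencing
    "outcome": "the issue was escalated",    # plausible consequence, no support
    "evidence_quotes": {
        "who": [], "what": ["An incident report was circulated internally."],
        "when": [], "outcome": [],
    },
}
```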
In control cases with explicit structure, unconstrained outputs and Canon outputs typically align more closely. This supports the core claim: the risk is not “models are bad”, it is “models are asked to complete what is absent”.
Treat Canon outputs as regression fixtures and as a gating artefact. If a field is not stated, keep it null. Allow continuation only via an explicit reconstruction layer, not via silent completion.
This creates an evidence/inference boundary. Where reconstruction is required, it becomes a named step with its own ownership, justification, and audit record.
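A minimal sketch of that gating pattern, assuming a flat Canon record and an illustrative `_reconstructions` audit list (neither is part of Canon's actual interface):

```python
NOT_STATED = "not stated"

def gate_for_downstream(canon_record):
    """Pass Canon fields through unchanged; fields reported as 'not stated'
    become null rather than being silently completed."""
    return {
        field: (None if value == NOT_STATED else value)
        for field, value in canon_record.items()
    }

def reconstruct(record, field, value, owner, justification):
    """Reconstruction is an explicit, named step with its own audit entry,
    never a silent fill-in of an absent field."""
    if record.get(field) is not None:
        raise ValueError(f"{field} is already stated; reconstruction not permitted")
    audited = dict(record)
    history = list(record.get("_reconstructions", []))
    history.append({
        "field": field,
        "value": value,
        "owner": owner,
        "justification": justification,
    })
    audited[field] = value
    audited["_reconstructions"] = history
    return audited
```

The design point is that the only path from null to a value is a call that names an owner and a justification, which keeps the evidence/inference boundary visible in the record itself.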
| Risk | Why it matters |
|---|---|
| Unsupported completion | Once a field is asserted without evidence, downstream users treat it as stated unless the system forces an explicit boundary. |
| Drift under repetition | Repeated runs can produce different outputs from the same input, undermining audit repeatability. |
| Late audit trails | Audit often begins after the epistemic boundary has already been crossed. |
This demo is not a general claim about all models, all prompts, or all tasks. It is a concrete demonstration that deterministic gating preserves absence, while unconstrained completion tends to supply structure.
Canon sits upstream. It is a precondition check for interpretation, scoring, analytics, legal judgement, or generation. It does not replace those stages. It makes their starting conditions explicit and auditable.
(In)Canon identifies structure and reports stated vs not stated. It does not assess meaning, correctness, quality, compliance, or adequacy.