Categorizing Validation Errors in Regulatory Document Pipelines

A production guide to turning raw validation failures into a deterministic taxonomy: run jsonschema and pydantic v2 over ingested clinical documents, classify each error by type and severity, then route every finding to dead-letter, human review, or a safe auto-fix with PHI-free structured logging and metrics.

When a site-activation packet, an IND amendment, or a consent form fails validation, the question is never simply “is it valid?” It is “which of the forty fields broke, how badly, who needs to look at it, and can the pipeline recover on its own?” A flat boolean answer forces a human to re-open every rejected document. A well-categorized answer lets the pipeline auto-fix a stray whitespace, quarantine a malformed payload to a dead-letter queue, and page a regulatory reviewer only for the genuinely ambiguous cases. This guide is the deep build behind Schema Validation & Error Categorization, and it sits within the wider Automated Document Ingestion & Validation Workflows architecture. It focuses narrowly on one thing: a concrete, runnable error-categorization system. It consumes the parse output produced upstream by async batch ingestion and the OCR & Metadata Extraction Pipelines.

Why naive approaches fail

Most validation code fails a regulated pipeline in one of four predictable ways, and each one shows up under inspection rather than in development:

The flat boolean. if not is_valid(doc): reject() throws away everything that makes triage cheap. A document with one over-long comment and a document with a missing required field are treated identically, so a human re-opens both. At multi-site scale that is thousands of needless manual re-reviews per activation wave.
Conflating type with policy. Teams hardcode “a pattern failure is always fatal,” then discover that the same pattern rule guards both protocol_version (genuinely blocking) and a free-text reviewer_comment (cosmetic). The type is a structural fact; whether it blocks is a business decision. Baking them together makes every regulatory-policy change a code change.
Leaking PHI into the error path. pydantic’s errors() carries an input key holding the offending value; a verbose jsonschema message echoes the instance. A consent form’s failing value might be a subject identifier or a date of birth. Logging the error object verbatim writes protected health information into your log store and breaks the ALCOA+ data-integrity chain the moment an inspector reads it.
Unbounded or fabricating auto-fix. “Just coerce it” is fatal in regulated data. Reformatting an ambiguous date, inventing a missing value, or looping a fixer until it stops erroring destroys attributability. Auto-fix must be a single, bounded, non-fabricating pass followed by a re-validation.

The rest of this guide is a design that pre-empts all four: it separates the type of an error from the policy applied to it, routes every finding explicitly, logs only field paths and constraint names, and allows exactly one conservative auto-fix pass.

What “categorization” actually means

A validation error has three independent axes, and conflating them is the most common design mistake:

Type — the machine-readable reason the value failed. From pydantic v2 this is the error type string (missing, extra_forbidden, value_error, string_pattern_mismatch, enum, and so on). From jsonschema this is the validator keyword that failed (required, type, pattern, enum, additionalProperties).
Severity — the operational impact: blocking (the document cannot proceed to submission), warning (it proceeds but is flagged), or info (cosmetic). Severity is a business decision layered on top of type; the same pattern failure may block on protocol_version but only warn on a free-text comment.
Disposition — what the pipeline does next: send to dead-letter, escalate to human review, or apply a bounded auto-fix and re-validate.

Because the type is structural and the severity/disposition are policy, we keep them in separate layers. The validators emit types; a policy table maps (field, type) to severity and disposition. That separation is what makes the system maintainable as regulatory requirements change.

Architecture overview

The pipeline runs both validators, normalizes every error into one shape, looks up policy, attempts a single bounded auto-fix, re-validates, and routes on the worst disposition present — emitting a PHI-free audit record at the end.
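The control flow described above can be sketched as one function with the validator and fixer injected as callables. This is an illustrative skeleton only: `route`, `worst`, `Finding`, and the toy validator and fixer are hypothetical stand-ins, not the implementations developed later in this guide.

```python
from __future__ import annotations

from typing import Callable

# Hypothetical stand-in: a finding is just (field_path, disposition).
Finding = tuple[str, str]

# Worst-first precedence used to pick the binding disposition.
PRECEDENCE = ("dead_letter", "human_review", "auto_fix")


def worst(findings: list[Finding]) -> str:
    """Return the most serious disposition present, or 'proceed' if none."""
    present = {disposition for _, disposition in findings}
    for candidate in PRECEDENCE:
        if candidate in present:
            return candidate
    return "proceed"


def route(
    payload: dict[str, object],
    validate: Callable[[dict[str, object]], list[Finding]],
    fix: Callable[[dict[str, object]], dict[str, object]],
) -> str:
    """Validate, attempt at most one fix pass, re-validate, then route."""
    disposition = worst(validate(payload))
    if disposition != "auto_fix":
        return disposition
    residual = worst(validate(fix(payload)))  # single bounded pass, no loop
    # Anything still marked fixable after the one pass is escalated, not retried.
    return "human_review" if residual == "auto_fix" else residual


# Toy validator and fixer, for illustration only.
def toy_validate(payload: dict[str, object]) -> list[Finding]:
    findings: list[Finding] = []
    if "attachments" not in payload:
        findings.append(("attachments", "dead_letter"))
    if "legacy_notes" in payload:
        findings.append(("legacy_notes", "auto_fix"))
    return findings


def toy_fix(payload: dict[str, object]) -> dict[str, object]:
    # Returns a new dict without the unknown key; the input is not mutated.
    return {k: v for k, v in payload.items() if k != "legacy_notes"}
```

Injecting the callables keeps the routing rule testable without either validator library installed.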

Setup and configuration

Install the two validators plus a JSON log formatter. All three are maintained; none of the deprecated PDF libraries are involved at this stage because categorization operates on already-parsed payloads.

pip install "pydantic>=2.6" "jsonschema>=4.21" "python-json-logger>=2.0"

Every tunable is read from the environment — no thresholds, queue URLs, or namespaces are hardcoded, and a missing required value fails fast at start-up rather than mid-batch. The dead-letter destination is a credential-bearing URL, so it is never baked into an image.

"""Runtime configuration read entirely from the environment."""
from __future__ import annotations

import logging
import os
from dataclasses import dataclass

from pythonjsonlogger import jsonlogger


@dataclass(frozen=True)
class ValidatorConfig:
    """Immutable, env-sourced configuration. No secrets in source."""

    log_level: str
    dead_letter_url: str
    metrics_namespace: str

    @classmethod
    def from_env(cls) -> "ValidatorConfig":
        try:
            dead_letter_url = os.environ["DEAD_LETTER_QUEUE_URL"]
        except KeyError as exc:  # fail fast — never start half-configured
            raise RuntimeError("DEAD_LETTER_QUEUE_URL must be set") from exc
        return cls(
            log_level=os.environ.get("VALIDATION_LOG_LEVEL", "INFO"),
            dead_letter_url=dead_letter_url,
            metrics_namespace=os.environ.get(
                "METRICS_NAMESPACE", "regulatory.validation"
            ),
        )


def init_logging(config: ValidatorConfig) -> logging.Logger:
    """JSON logs so field paths and type codes are queryable but values are not.

    The formatter never receives the raw payload; only the structured `extra`
    fields assembled in `_log` are serialized.
    """
    handler = logging.StreamHandler()
    handler.setFormatter(
        jsonlogger.JsonFormatter("%(asctime)s %(levelname)s %(name)s %(message)s")
    )
    logger = logging.getLogger("regulatory.validation")
    logger.setLevel(config.log_level)
    if not logger.handlers:  # avoid duplicate log lines if init_logging runs twice
        logger.addHandler(handler)
    return logger

The schemas: jsonschema and pydantic v2 side by side

We validate twice on purpose. JSON Schema is the contract the sponsor and the EDC agree on — it travels with the data and is easy to version. Pydantic v2 is the in-process model that gives us typed objects, cross-field validators, and rich error metadata. Running both catches different classes of defect: jsonschema reports additionalProperties violations and exact JSON paths cleanly, while pydantic’s @model_validator expresses business rules such as “consent date cannot precede IRB approval” that are awkward in pure JSON Schema.

"""Schemas for a clinical site-activation document."""
from __future__ import annotations

from datetime import date

from pydantic import BaseModel, ConfigDict, Field, field_validator, model_validator

# JSON Schema (Draft 2020-12) — the portable contract shared with the sponsor/EDC.
DOCUMENT_JSON_SCHEMA: dict[str, object] = {
    "$schema": "https://json-schema.org/draft/2020-12/schema",
    "type": "object",
    "additionalProperties": False,
    "required": [
        "document_id",
        "protocol_version",
        "site_id",
        "irb_approval_date",
        "site_classification",
        "attachments",
    ],
    "properties": {
        "document_id": {"type": "string", "minLength": 8, "maxLength": 32},
        "protocol_version": {"type": "string", "pattern": r"^v\d+\.\d+\.\d+$"},
        "site_id": {"type": "string", "pattern": r"^SITE-\d{4}$"},
        "irb_approval_date": {"type": "string", "format": "date"},
        "consent_date": {"type": "string", "format": "date"},
        "site_classification": {
            "type": "string",
            "enum": ["Phase I", "Phase II", "Phase III", "Phase IV"],
        },
        "attachments": {
            "type": "array",
            "minItems": 1,
            "items": {"type": "string"},
        },
        "reviewer_comment": {"type": "string", "maxLength": 500},
    },
}


class RegulatoryDocument(BaseModel):
    """In-process model. `extra='forbid'` makes unexpected keys raise
    `extra_forbidden`, mirroring the JSON Schema `additionalProperties: false`."""

    model_config = ConfigDict(extra="forbid", str_strip_whitespace=False)

    document_id: str = Field(min_length=8, max_length=32)
    protocol_version: str = Field(pattern=r"^v\d+\.\d+\.\d+$")
    site_id: str = Field(pattern=r"^SITE-\d{4}$")
    irb_approval_date: date
    consent_date: date | None = None
    site_classification: str
    attachments: list[str] = Field(min_length=1)
    reviewer_comment: str | None = Field(default=None, max_length=500)

    @field_validator("site_classification")
    @classmethod
    def _known_phase(cls, value: str) -> str:
        allowed = {"Phase I", "Phase II", "Phase III", "Phase IV"}
        if value not in allowed:
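            # Surfaces as a pydantic error with type 'value_error'. The text is
            # copied into err["msg"], so it names the constraint and never
            # echoes the offending value.
            raise ValueError("site_classification is not a recognized phase")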
        return value

    @model_validator(mode="after")
    def _consent_after_irb(self) -> "RegulatoryDocument":
        if self.consent_date is not None and self.consent_date < self.irb_approval_date:
            raise ValueError("consent_date precedes irb_approval_date")
        return self

The error type strings used below (missing, extra_forbidden, value_error, string_pattern_mismatch, enum, too_short) are the actual stable identifiers pydantic v2 emits in ValidationError.errors()[i]["type"]. The jsonschema attributes (error.validator, error.json_path, error.absolute_path) are the real attributes on jsonschema.exceptions.ValidationError.

Normalizing both validators into one error record

The core idea: collapse every jsonschema and pydantic error into a single ValidationFinding dataclass keyed by (field_path, type_code). Downstream policy only ever sees that uniform shape.

"""Normalize jsonschema + pydantic errors into a uniform finding."""
from __future__ import annotations

import enum
from dataclasses import dataclass

from jsonschema import Draft202012Validator
from jsonschema.exceptions import ValidationError as JsonSchemaError
from pydantic import ValidationError as PydanticError


class Severity(enum.IntEnum):
    """Ordered so max() yields the most serious severity present."""

    INFO = 10
    WARNING = 20
    BLOCKING = 30


class Disposition(str, enum.Enum):
    PROCEED = "proceed"
    AUTO_FIX = "auto_fix"
    HUMAN_REVIEW = "human_review"
    DEAD_LETTER = "dead_letter"


@dataclass(frozen=True)
class ValidationFinding:
    """One normalized validation problem, validator-agnostic."""

    source: str          # "jsonschema" or "pydantic"
    field_path: str      # dotted path, e.g. "attachments.0" or "protocol_version"
    type_code: str       # "missing", "pattern", "extra_forbidden", ...
    message: str         # short, PHI-free description
    severity: Severity = Severity.BLOCKING
    disposition: Disposition = Disposition.HUMAN_REVIEW


def _json_path_to_dotted(error: JsonSchemaError) -> str:
    """Build a dotted field path from a jsonschema error.

    `absolute_path` is a deque of property names / array indices. It is empty
    when the error is reported on the document root, as with a top-level
    `required` or `additionalProperties` failure; the offending property name
    is then only available in the error message, so the path is "<root>".
    """
    parts = [str(p) for p in error.absolute_path]
    return ".".join(parts) if parts else "<root>"


def collect_jsonschema_findings(payload: dict[str, object]) -> list[ValidationFinding]:
    """Run Draft 2020-12 validation and emit one finding per error."""
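    # Without a format_checker, jsonschema treats "format": "date" as an
    # annotation only; malformed dates are caught by the pydantic model instead.
    validator = Draft202012Validator(DOCUMENT_JSON_SCHEMA)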
    findings: list[ValidationFinding] = []
    # iter_errors yields *all* violations, not just the first.
    for err in sorted(validator.iter_errors(payload), key=lambda e: list(e.absolute_path)):
        findings.append(
            ValidationFinding(
                source="jsonschema",
                field_path=_json_path_to_dotted(err),
                # `validator` is the failing keyword: required/type/pattern/enum/...
                type_code=str(err.validator),
                message=_safe_jsonschema_message(err),
            )
        )
    return findings


def _safe_jsonschema_message(err: JsonSchemaError) -> str:
    """Return a message that names the constraint, never the instance value
    (which could be PHI)."""
    if err.validator == "required":
        return f"missing required property: {err.message.split()[0]}"
    if err.validator == "additionalProperties":
        return "unexpected property present"
    return f"failed constraint '{err.validator}'"


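# Map pydantic v2 error type strings to our normalized type codes.
_PYDANTIC_TYPE_MAP: dict[str, str] = {
    "missing": "missing",
    "extra_forbidden": "extra_forbidden",
    "string_pattern_mismatch": "pattern",
    "string_too_short": "min_length",
    "too_short": "min_length",
    # Without these two, an over-long reviewer_comment would fall through to the
    # global default and never match its ("reviewer_comment", "max_length") rule.
    "string_too_long": "max_length",
    "too_long": "max_length",
    "enum": "enum",
    "value_error": "value_error",
    "date_from_datetime_parsing": "type",
    "date_parsing": "type",
}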


def collect_pydantic_findings(payload: dict[str, object]) -> list[ValidationFinding]:
    """Validate with pydantic v2 and normalize each error."""
    try:
        RegulatoryDocument.model_validate(payload)
    except PydanticError as exc:
        return [_finding_from_pydantic(e) for e in exc.errors()]
    return []


def _finding_from_pydantic(err: dict[str, object]) -> ValidationFinding:
    raw_type = str(err["type"])
    loc = err.get("loc", ())
    field_path = ".".join(str(p) for p in loc) if loc else "<root>"
    return ValidationFinding(
        source="pydantic",
        field_path=field_path,
        type_code=_PYDANTIC_TYPE_MAP.get(raw_type, raw_type),
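        # pydantic's built-in messages describe the constraint rather than the
        # input, but a custom validator's ValueError text is copied into
        # err["msg"] verbatim, so validators must not embed values in it.
        # err["input"] is never read here because it may contain PHI.
        message=str(err.get("msg", raw_type)),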
    )

Note the deliberate PHI hygiene: pydantic’s errors() includes an input key holding the offending value, and a verbose jsonschema message echoes the instance. A consent form’s offending value could be a subject identifier or date of birth, so we never copy err["input"] or err.instance into a finding. We log the field path and the constraint name only.
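One way to make that hygiene hard to bypass is an allow-list applied to each raw error before anything else touches it. The sketch below is a hypothetical helper (`scrub_error` and `SAFE_KEYS` are not part of the code above), and the sample dict is modelled on the shape of one entry from pydantic v2's `errors()` output rather than captured from a real run.

```python
from __future__ import annotations

# Keys considered safe to keep: the machine-readable type and the location.
SAFE_KEYS = ("type", "loc")


def scrub_error(err: dict[str, object]) -> dict[str, object]:
    """Keep only allow-listed keys from a pydantic-style error dict.

    An allow-list is used instead of deleting "input", so any key added by a
    future library version is dropped by default.
    """
    return {key: err[key] for key in SAFE_KEYS if key in err}


# Illustrative shape only; the values here are made up.
raw_error: dict[str, object] = {
    "type": "string_pattern_mismatch",
    "loc": ("site_id",),
    "msg": "String should match pattern '^SITE-\\d{4}$'",
    "input": "SUBJ-0001-DOB-1980",  # the offending value: never log this
    "ctx": {"pattern": "^SITE-\\d{4}$"},
}

safe = scrub_error(raw_error)
```

pydantic v2's `errors()` also accepts `include_input=False`, which drops the value at the source. The allow-list is a second line of defense in case a call site forgets the flag.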

The policy table: type and field to severity and disposition

This is the only place business rules live. It is intentionally data, not code, so a regulatory analyst can review it.

"""Severity + disposition policy. Keyed by (field_path, type_code) with
sensible per-type fallbacks. This is the single source of truth for routing."""
from __future__ import annotations

# (field_path, type_code) -> (Severity, Disposition)
_FIELD_POLICY: dict[tuple[str, str], tuple[Severity, Disposition]] = {
    ("protocol_version", "pattern"): (Severity.BLOCKING, Disposition.HUMAN_REVIEW),
    ("site_id", "pattern"): (Severity.BLOCKING, Disposition.HUMAN_REVIEW),
    ("irb_approval_date", "type"): (Severity.BLOCKING, Disposition.HUMAN_REVIEW),
    ("attachments", "min_length"): (Severity.BLOCKING, Disposition.DEAD_LETTER),
    ("attachments", "minItems"): (Severity.BLOCKING, Disposition.DEAD_LETTER),
    # A stray reviewer comment over the limit is cosmetic and auto-trimmable.
    ("reviewer_comment", "max_length"): (Severity.WARNING, Disposition.AUTO_FIX),
    ("reviewer_comment", "maxLength"): (Severity.WARNING, Disposition.AUTO_FIX),
    ("<root>", "additionalProperties"): (Severity.WARNING, Disposition.AUTO_FIX),
    ("<root>", "extra_forbidden"): (Severity.WARNING, Disposition.AUTO_FIX),
}

# Fallback by type_code when no specific (field, type) rule matches.
_TYPE_DEFAULT: dict[str, tuple[Severity, Disposition]] = {
    "missing": (Severity.BLOCKING, Disposition.DEAD_LETTER),
    "required": (Severity.BLOCKING, Disposition.DEAD_LETTER),
    "type": (Severity.BLOCKING, Disposition.HUMAN_REVIEW),
    "pattern": (Severity.BLOCKING, Disposition.HUMAN_REVIEW),
    "enum": (Severity.BLOCKING, Disposition.HUMAN_REVIEW),
    "value_error": (Severity.BLOCKING, Disposition.HUMAN_REVIEW),
    "extra_forbidden": (Severity.WARNING, Disposition.AUTO_FIX),
    "additionalProperties": (Severity.WARNING, Disposition.AUTO_FIX),
    "min_length": (Severity.BLOCKING, Disposition.HUMAN_REVIEW),
}

_GLOBAL_DEFAULT: tuple[Severity, Disposition] = (
    Severity.BLOCKING,
    Disposition.HUMAN_REVIEW,
)


def classify(finding: ValidationFinding) -> ValidationFinding:
    """Return a copy of the finding enriched with severity + disposition."""
    severity, disposition = _FIELD_POLICY.get(
        (finding.field_path, finding.type_code),
        _TYPE_DEFAULT.get(finding.type_code, _GLOBAL_DEFAULT),
    )
    # Dataclass is frozen; produce a new instance.
    return ValidationFinding(
        source=finding.source,
        field_path=finding.field_path,
        type_code=finding.type_code,
        message=finding.message,
        severity=severity,
        disposition=disposition,
    )

The rationale behind a few choices:

A missing required field is structurally broken and almost always indicates an upstream extraction or mapping fault, so it dead-letters rather than wasting a reviewer’s time.
A pattern failure on protocol_version is usually a real but human-resolvable transcription issue, so it goes to review.
Extra properties (extra_forbidden / additionalProperties) are warnings: the safest auto-fix in regulated data is to drop unknown keys (never to invent values), which is non-mutating with respect to required content.
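Because the policy is meant to be reviewable data, one option is to keep it in a JSON file that an analyst can read and sign off without opening Python. The following is a minimal sketch under that assumption: the JSON row format, `load_policy`, and the string-valued severities are hypothetical and would need mapping onto the `Severity` and `Disposition` enums used above.

```python
from __future__ import annotations

import json

ALLOWED_SEVERITIES = {"INFO", "WARNING", "BLOCKING"}
ALLOWED_DISPOSITIONS = {"proceed", "auto_fix", "human_review", "dead_letter"}


def load_policy(text: str) -> dict[tuple[str, str], tuple[str, str]]:
    """Parse a JSON policy document into a (field, type) -> (severity, disposition) map.

    Unknown severities or dispositions raise ValueError, so a typo in the
    reviewed file fails at load time instead of silently changing routing.
    """
    policy: dict[tuple[str, str], tuple[str, str]] = {}
    for row in json.loads(text):
        severity, disposition = row["severity"], row["disposition"]
        if severity not in ALLOWED_SEVERITIES:
            raise ValueError(f"unknown severity in policy row: {severity}")
        if disposition not in ALLOWED_DISPOSITIONS:
            raise ValueError(f"unknown disposition in policy row: {disposition}")
        policy[(row["field"], row["type"])] = (severity, disposition)
    return policy


POLICY_JSON = """
[
  {"field": "protocol_version", "type": "pattern",
   "severity": "BLOCKING", "disposition": "human_review"},
  {"field": "reviewer_comment", "type": "max_length",
   "severity": "WARNING", "disposition": "auto_fix"}
]
"""

policy = load_policy(POLICY_JSON)
```

Validating at load time keeps the fail-fast property of the configuration layer: a bad policy file stops the worker at start-up rather than misrouting a document mid-batch.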

Auto-fix: bounded, non-mutating, and re-validated

Auto-fix in a regulated pipeline must be conservative. We only ever remove unknown keys or truncate an over-long free-text comment — operations that cannot fabricate or alter regulatory content. Any fix is logged, and the document is re-validated; if it still fails, it escalates.

"""Bounded auto-fix. Only safe, non-fabricating transformations are allowed."""
from __future__ import annotations

import copy


def apply_auto_fix(
    payload: dict[str, object],
    findings: list[ValidationFinding],
) -> tuple[dict[str, object], list[str]]:
    """Return a fixed copy of the payload plus a list of applied fix notes.

    Never mutates the input. Only handles findings whose disposition is
    AUTO_FIX; everything else is left untouched for the caller to route.
    """
    fixed = copy.deepcopy(payload)
    notes: list[str] = []
    allowed_keys = set(DOCUMENT_JSON_SCHEMA["properties"])  # type: ignore[arg-type]

    for finding in findings:
        if finding.disposition is not Disposition.AUTO_FIX:
            continue
        if finding.type_code in {"extra_forbidden", "additionalProperties"}:
            removed = [k for k in list(fixed) if k not in allowed_keys]
            for key in removed:
                del fixed[key]
            if removed:
                notes.append(f"dropped unknown keys: {sorted(removed)}")
        elif finding.type_code in {"max_length", "maxLength"}:
            comment = fixed.get("reviewer_comment")
            if isinstance(comment, str) and len(comment) > 500:
                fixed["reviewer_comment"] = comment[:500]
                notes.append("truncated reviewer_comment to 500 chars")
    return fixed, notes
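Since every fix has to be logged without values, it can help to derive the change record mechanically from the before and after payloads. This is a sketch, not part of the pipeline above: `describe_fix` is a hypothetical helper that reports key names and string lengths only.

```python
from __future__ import annotations


def describe_fix(
    before: dict[str, object],
    after: dict[str, object],
) -> list[dict[str, object]]:
    """Describe what an auto-fix changed using key names and lengths only."""
    changes: list[dict[str, object]] = []
    # Keys present before the fix but absent afterwards were dropped.
    for key in sorted(before.keys() - after.keys()):
        changes.append({"field": key, "action": "dropped"})
    # A string that became a strict prefix of its old value was truncated.
    for key in sorted(before.keys() & after.keys()):
        old, new = before[key], after[key]
        if (
            isinstance(old, str)
            and isinstance(new, str)
            and len(new) < len(old)
            and old.startswith(new)
        ):
            changes.append(
                {
                    "field": key,
                    "action": "truncated",
                    "old_length": len(old),
                    "new_length": len(new),
                }
            )
    return changes


before = {"reviewer_comment": "x" * 600, "legacy_notes": "secret", "site_id": "SITE-0042"}
after = {"reviewer_comment": "x" * 500, "site_id": "SITE-0042"}
changes = describe_fix(before, after)
```

The record states what was done and to which field, which supports attributability, while the dropped or truncated text itself never reaches the log.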

Tying it together: orchestration, logging, and metrics

The orchestrator runs both validators, classifies, attempts one auto-fix pass, re-validates, and emits a PHI-free structured log line plus counters. The disposition is the worst one present, because a document with one dead-letter finding cannot proceed even if its other findings are auto-fixable.

"""End-to-end orchestration with structured logging and metrics."""
from __future__ import annotations

import logging
from collections import Counter

# `python-json-logger` or stdlib `logging` with a JSON formatter both work;
# here we emit structured `extra` fields and rely on the handler to serialize.
logger = logging.getLogger("regulatory.validation")

# Module-level counters; in production back these with Prometheus/StatsD.
METRICS: Counter[str] = Counter()


def _collect_all(payload: dict[str, object]) -> list[ValidationFinding]:
    raw = collect_jsonschema_findings(payload) + collect_pydantic_findings(payload)
    return [classify(f) for f in raw]


def _worst_disposition(findings: list[ValidationFinding]) -> Disposition:
    """Resolve the binding disposition for the whole document."""
    if any(f.disposition is Disposition.DEAD_LETTER for f in findings):
        return Disposition.DEAD_LETTER
    if any(f.disposition is Disposition.HUMAN_REVIEW for f in findings):
        return Disposition.HUMAN_REVIEW
    if any(f.disposition is Disposition.AUTO_FIX for f in findings):
        return Disposition.AUTO_FIX
    return Disposition.PROCEED


def validate_document(
    payload: dict[str, object],
    document_id: str,
) -> dict[str, object]:
    """Validate, categorize, and route a single document.

    `document_id` is a non-PHI surrogate key safe to log. The raw `payload`
    and individual field values are never logged.
    """
    findings = _collect_all(payload)
    disposition = _worst_disposition(findings)

    if disposition is Disposition.AUTO_FIX:
        fixed, notes = apply_auto_fix(payload, findings)
        residual = _collect_all(fixed)
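        if not residual:
            # Log before clearing, so the record names the error types that were fixed.
            _log("auto_fix_succeeded", document_id, findings, extra={"notes": notes})
            payload, findings, disposition = fixed, [], Disposition.PROCEED
        else:
            # Auto-fix did not fully clean it; escalate the residual. There is no
            # second fix pass, so a leftover AUTO_FIX is promoted to human review.
            findings = residual
            disposition = _worst_disposition(residual)
            if disposition is Disposition.AUTO_FIX:
                disposition = Disposition.HUMAN_REVIEW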

    severity = max((f.severity for f in findings), default=Severity.INFO)

    METRICS[f"disposition.{disposition.value}"] += 1
    for finding in findings:
        METRICS[f"error_type.{finding.type_code}"] += 1

    _log("validation_complete", document_id, findings,
         extra={"disposition": disposition.value, "severity": severity.name})

    return {
        "document_id": document_id,
        "valid": disposition is Disposition.PROCEED,
        "disposition": disposition.value,
        "severity": severity.name,
        "findings": [
            {
                "source": f.source,
                "field": f.field_path,
                "type": f.type_code,
                "severity": f.severity.name,
                "disposition": f.disposition.value,
            }
            for f in findings
        ],
    }


def _log(
    event: str,
    document_id: str,
    findings: list[ValidationFinding],
    extra: dict[str, object] | None = None,
) -> None:
    """Emit a PHI-free structured log line.

    We log field paths and type codes only — never values, never `err.input`.
    """
    payload: dict[str, object] = {
        "event": event,
        "document_id": document_id,
        "finding_count": len(findings),
        "error_types": sorted({f.type_code for f in findings}),
    }
    if extra:
        payload.update(extra)
    logger.info(event, extra={"validation": payload})

Validation and edge-case handling

A worked example shows the routing in action. Given a payload missing attachments, carrying an unknown legacy_notes key, and a protocol_version of "1.0" (no leading v):

jsonschema emits required (for attachments), additionalProperties (for legacy_notes), and pattern (for protocol_version).
pydantic emits missing, extra_forbidden, and string_pattern_mismatch for the same problems.
Classification yields three classes of finding, each reported once per validator: DEAD_LETTER (missing attachments), AUTO_FIX (drop legacy_notes), and HUMAN_REVIEW (protocol pattern).
_worst_disposition resolves to dead-letter: the document cannot proceed regardless of the fixable noise, and a reviewer is not paged for a document that is structurally incomplete.

The edge cases that most often bite this design, and how the code above already handles them:

Root-level required failures carry an empty absolute_path. _json_path_to_dotted maps them to <root> so they never crash the dotted-path join, and the policy table has explicit <root> rows for the extra-property case.
The same defect appears twice (once per validator). That is deliberate: the counters double-count so you can see which layer caught what. The routing decision is unaffected, because _worst_disposition and the max() severity each collapse all findings into a single value, so a document can never be both dead-lettered and allowed to proceed.
Auto-fix that only partially cleans. If the residual re-validation still returns findings, the document escalates on the residual’s worst disposition instead of silently proceeding. Auto-fix runs exactly once — there is no loop to oscillate.
Unknown pydantic type codes. _PYDANTIC_TYPE_MAP.get(raw_type, raw_type) falls through to the raw code, and classify falls through to _GLOBAL_DEFAULT (blocking → human review), so a validator upgrade that introduces a new type string fails safe rather than silently proceeding.
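When the doubled findings are shown to a reviewer rather than counted, they can be grouped so each defect appears once with the validators that reported it. This is a rough sketch: `ALIASES` and `group_for_review` are hypothetical, and the input dicts follow the shape of the `findings` list returned by `validate_document`. Root-level jsonschema findings stay separate from their pydantic counterparts, because jsonschema reports them at `<root>` while pydantic names the field.

```python
from __future__ import annotations

from collections import defaultdict

# Assumed aliases from jsonschema keywords to the normalized codes used above.
ALIASES = {
    "required": "missing",
    "additionalProperties": "extra_forbidden",
    "minItems": "min_length",
    "maxLength": "max_length",
}


def group_for_review(
    findings: list[dict[str, str]],
) -> dict[tuple[str, str], list[str]]:
    """Group findings by (field, canonical type) and list which validators saw each."""
    grouped: dict[tuple[str, str], list[str]] = defaultdict(list)
    for finding in findings:
        canonical = ALIASES.get(finding["type"], finding["type"])
        grouped[(finding["field"], canonical)].append(finding["source"])
    return dict(grouped)


example = [
    {"source": "jsonschema", "field": "protocol_version", "type": "pattern"},
    {"source": "pydantic", "field": "protocol_version", "type": "pattern"},
    {"source": "jsonschema", "field": "<root>", "type": "required"},
    {"source": "pydantic", "field": "attachments", "type": "missing"},
]
grouped = group_for_review(example)
```

The metrics counters are left untouched by this grouping, so the per-layer visibility described above is preserved.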

Testing and verification

The taxonomy is only trustworthy if each error class provably routes where policy says it should. These pytest checks pin the four canonical dispositions and assert the PHI-hygiene guarantee against a known fixture.

"""pytest suite: confirm each error class routes to the expected disposition."""
from __future__ import annotations

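import json

# Assumes validate_document is importable from the orchestration module shown
# earlier; this guide does not name that module, so adjust the import to your layout.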

VALID_DOC: dict[str, object] = {
    "document_id": "SITE-ACT-0001",
    "protocol_version": "v1.0.0",
    "site_id": "SITE-0042",
    "irb_approval_date": "2026-01-10",
    "consent_date": "2026-02-01",
    "site_classification": "Phase II",
    "attachments": ["protocol.pdf"],
}


def test_valid_document_proceeds() -> None:
    result = validate_document(VALID_DOC, document_id="DOC-OK")
    assert result["valid"] is True
    assert result["disposition"] == "proceed"


def test_missing_required_field_dead_letters() -> None:
    doc = {k: v for k, v in VALID_DOC.items() if k != "attachments"}
    result = validate_document(doc, document_id="DOC-MISS")
    assert result["valid"] is False
    assert result["disposition"] == "dead_letter"


def test_unknown_key_auto_fixes_then_proceeds() -> None:
    doc = dict(VALID_DOC, legacy_notes="carried over from old EDC")
    result = validate_document(doc, document_id="DOC-EXTRA")
    # The unknown key is dropped and the residual re-validation is clean.
    assert result["valid"] is True
    assert result["disposition"] == "proceed"


def test_protocol_pattern_goes_to_human_review() -> None:
    doc = dict(VALID_DOC, protocol_version="1.0")  # missing leading 'v'
    result = validate_document(doc, document_id="DOC-PAT")
    assert result["disposition"] == "human_review"
    assert any(f["type"] == "pattern" for f in result["findings"])


def test_worst_disposition_wins_over_fixable_noise() -> None:
    doc = dict(VALID_DOC, legacy_notes="junk")
    doc.pop("attachments")  # dead-letter class present alongside auto-fixable
    result = validate_document(doc, document_id="DOC-MIX")
    assert result["disposition"] == "dead_letter"


def test_findings_never_echo_field_values() -> None:
    # An unexpected key holds a PHI-shaped value; it must not surface anywhere.
    doc = dict(VALID_DOC, ssn="123-45-6789")
    result = validate_document(doc, document_id="DOC-PHI")
    assert "123-45-6789" not in json.dumps(result)

Run them with pytest -q. The suite doubles as living documentation of the policy table: change a row in _FIELD_POLICY and the corresponding assertion tells you exactly which disposition moved.

Operational concerns

Metrics that matter. Track disposition.dead_letter rate, disposition.human_review rate, and per-error_type counts. A spike in error_type.pattern on protocol_version usually signals an upstream template change, not a data-entry problem — categorization makes that visible.
Idempotency. Run auto-fix exactly once per document, then re-validate. Looping auto-fix invites oscillation and obscures provenance.
Audit alignment. Every disposition is an event worth recording in your append-only audit log for 21 CFR Part 11 with the surrogate document_id, the timestamp, and the resolved disposition — but, per ALCOA+ and PHI-minimization, never the field values themselves.
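The rates mentioned under "Metrics that matter" can be derived directly from counters keyed the way the orchestrator keys them (`disposition.<name>`). A minimal sketch, assuming an in-memory `Counter`; `disposition_rates` is a hypothetical helper, and a real deployment would more likely compute this in the metrics backend.

```python
from __future__ import annotations

from collections import Counter


def disposition_rates(metrics: Counter[str]) -> dict[str, float]:
    """Return each disposition's share of all routed documents."""
    prefix = "disposition."
    counts = {
        key[len(prefix):]: value
        for key, value in metrics.items()
        if key.startswith(prefix)
    }
    total = sum(counts.values())
    if total == 0:
        return {}  # nothing routed yet; avoid dividing by zero
    return {name: count / total for name, count in counts.items()}


sample = Counter(
    {
        "disposition.proceed": 6,
        "disposition.human_review": 3,
        "disposition.dead_letter": 1,
        "error_type.pattern": 5,  # ignored: not a disposition counter
    }
)
rates = disposition_rates(sample)
```

Alerting on the dead-letter share rather than the raw count keeps the signal stable across activation waves of different sizes.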

FAQ

Why validate with both jsonschema and pydantic instead of one?

JSON Schema is the portable, versionable contract you share with sponsors and EDC vendors; pydantic v2 gives you typed objects and expressive cross-field validators (@model_validator) for rules like “consent date must not precede IRB approval.” Running both surfaces a wider class of defects and lets the JSON Schema travel with the data while the model stays in your process.

How do I avoid leaking PHI into validation logs?

Never log the offending value. Pydantic’s errors() includes an input key and jsonschema errors expose .instance; both can contain subject identifiers or dates. Log only the field_path, the normalized type_code, and a constraint-only message — exactly what the _log helper and _safe_jsonschema_message above enforce.

When is auto-fix safe in a regulated pipeline?

Only for non-fabricating transformations that leave regulatory content untouched: dropping unknown keys or truncating an over-length free-text comment. Auto-fix must never invent a missing value, reformat a date ambiguously, or alter regulatory content. Always re-validate after a fix and escalate anything still failing.

What’s the difference between dead-letter and human-review routing?

Dead-letter is for structurally broken or unrecoverable payloads (missing required fields, decode failures) that indicate an upstream extraction or mapping fault — a human fixes the source, not the document. Human-review is for valid-but-ambiguous findings (a malformed protocol_version) where a reviewer can make a judgment call on the document itself.

Related

Schema Validation & Error Categorization: the parent guide that frames the validation stage this taxonomy implements.
Handling async batch processing for multi-site document ingestion: the upstream worker that produces the parsed payloads validated here.
OCR & Metadata Extraction Pipelines: where scanned, text-less packets are recognized before they reach this validator.
Automating checklist synchronization between EDC and CTMS: a sibling automation that consumes the same validated fields.
Automated Document Ingestion & Validation Workflows: how validation fits the wider ingest-to-archive architecture.
