Schema Validation & Error Categorization

Schema validation turns messy clinical site-activation documents into trustworthy, submission-ready data, and error categorization decides what happens when a check fails. This guide maps how Python pipelines validate ingested payloads with jsonschema and Pydantic v2, then triage every failure by severity so regulatory blockers never hide behind cosmetic warnings.

When site packets, FDA Form 1572s, investigator CVs, financial disclosures, and delegation logs flow into a submission pipeline, the difference between a clean filing and a 30-day deficiency letter is often a single unvalidated field. Ad-hoc if-checks scattered across scripts cannot scale to dozens of sites and hundreds of documents, and they leave no audit trail. A disciplined validation layer replaces manual checklist review with version-controlled schemas, deterministic error classification, and severity-based routing. This is the gate inside Automated Document Ingestion & Validation Workflows that every normalized document must pass before it reaches a submission queue.

Where Validation Sits in the Pipeline

Validation is not the first step. Raw files are parsed and normalized into structured payloads upstream — see PDF/DOCX Parsing for Clinical Docs and OCR & Metadata Extraction Pipelines for how heterogeneous documents become machine-readable dictionaries while preserving original file hashes for 21 CFR Part 11 chain-of-custody.
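The file hash mentioned above is what later fills the raw_sha256 field of each payload. A minimal sketch of how an upstream step might compute it, assuming the original file is available on disk; the function name sha256_of_file and the chunk size are illustrative, not part of any upstream module:

```python
"""Compute the SHA-256 digest of an original file before any parsing."""
from __future__ import annotations

import hashlib
from pathlib import Path


def sha256_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Stream the file in chunks so large scans are never fully loaded in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        # iter() with a sentinel keeps reading until read() returns b"".
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    # hexdigest() yields 64 lowercase hex characters.
    return digest.hexdigest()
```

Hashing the bytes before parsing or normalization means the digest always refers to the file as received, whatever the parser later does with it.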

By the time a payload reaches the validation gate, it is a plain Python dict. Validation answers two questions in order:

Is the structure correct? Required keys present, types correct, formats and ranges within bounds.
Are the regulatory and business rules satisfied? Cross-field dependencies, conditional requirements, and protocol-specific constraints.

Only after both pass does the document continue downstream. Failures are not simply rejected — they are categorized and routed, the subject of the deep-dive on categorizing validation errors in regulatory document pipelines.

Two Complementary Validation Engines

Production clinical pipelines often use both jsonschema and Pydantic, each where it is strongest.

Engine	Best for	Strengths	Trade-offs
jsonschema (Draft 2020-12)	Declarative, language-neutral contracts shared with sponsors/CROs	Schema is data, easy to version and exchange, `iter_errors` enumerates every failure	No native cross-field business logic, weaker Python typing
Pydantic v2	In-process typed models with rich validators	Static typing, `field_validator`/`model_validator`, fast `pydantic-core` engine, structured `.errors()`	Schema lives in code, less portable across stacks

A common pattern: a JSON Schema artifact is the externally shared contract (mirroring the work in FDA/EMA Submission Schema Design), while a Pydantic model enforces the same shape plus Python-side regulatory logic at runtime.
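Keeping two definitions of the same shape invites drift between them. One possible guard, sketched here under the assumption that comparing property names and required fields is enough for a first check, uses Pydantic v2's model_json_schema() to generate a schema from the model and diff it against the shared contract. SHARED_CONTRACT, ContractModel, and contract_drift are hypothetical names used only for illustration:

```python
"""Compare a shared JSON Schema contract with a Pydantic model's generated schema."""
from datetime import date
from typing import Any, Dict, Optional, Set, Type

from pydantic import BaseModel, ConfigDict

# Hypothetical, trimmed-down contract standing in for the shared artifact.
SHARED_CONTRACT: Dict[str, Any] = {
    "type": "object",
    "required": ["document_id", "country"],
    "properties": {
        "document_id": {"type": "string"},
        "country": {"type": "string"},
        "signature_date": {"type": "string", "format": "date"},
    },
}


class ContractModel(BaseModel):
    """Hypothetical model meant to mirror SHARED_CONTRACT."""

    model_config = ConfigDict(extra="forbid")

    document_id: str
    country: str
    signature_date: Optional[date] = None


def contract_drift(contract: Dict[str, Any], model: Type[BaseModel]) -> Dict[str, Set[str]]:
    """Report property and required-field differences between the two definitions."""
    generated = model.model_json_schema()
    contract_props = set(contract.get("properties", {}))
    model_props = set(generated.get("properties", {}))
    return {
        "missing_in_model": contract_props - model_props,
        "missing_in_contract": model_props - contract_props,
        # Symmetric difference: fields required on one side only.
        "required_mismatch": set(contract.get("required", []))
        ^ set(generated.get("required", [])),
    }
```

A check like this fits naturally in a unit test, so a change to either side fails the build before it reaches a pipeline. It does not compare patterns, enums, or formats; those would need a deeper comparison.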

Deprecated API warning. Pin pydantic>=2 and use the v2 surface throughout. The v1 decorators @validator and @root_validator, and the methods .dict() and .parse_obj(), are deprecated in v2 and emit deprecation warnings. Models built on the pydantic.v1 compatibility namespace go further: they keep v1 error-type strings (value_error.missing rather than missing) and v1 coercion behavior, which silently breaks the deterministic classification below. Likewise, use the explicit Draft202012Validator rather than the jsonschema.validate() helper, which picks a validator class from the schema's $schema keyword and raises only a single error.

Validating with jsonschema (Draft 2020-12)

JSON Schema is the right tool when the contract must be portable and inspectable as plain data. Construct Draft202012Validator explicitly so the draft is fixed in code rather than inferred from $schema, and call iter_errors to collect every failure in one pass; validate() raises only one.

"""Structural validation of an ingested document payload against JSON Schema."""
from __future__ import annotations

from typing import Any

from jsonschema import Draft202012Validator
from jsonschema.exceptions import ValidationError as JsonSchemaError

# Draft 2020-12 contract. In production this is loaded from a versioned file,
# not inlined, so it can be shared with sponsors and tracked in change control.
DOCUMENT_SCHEMA: dict[str, Any] = {
    "$schema": "https://json-schema.org/draft/2020-12/schema",
    "type": "object",
    "additionalProperties": False,
    "required": ["document_id", "document_type", "country", "raw_sha256"],
    "properties": {
        "document_id": {"type": "string", "pattern": r"^[A-Z0-9\-]{10,50}$"},
        "document_type": {
            "type": "string",
            "enum": ["FDA_1572", "INVESTIGATOR_CV", "FIN_DISCLOSURE", "DELEGATION_LOG"],
        },
        "country": {"type": "string", "pattern": r"^[A-Z]{2}$"},
        "signature_date": {"type": "string", "format": "date"},
        "license_expiry": {"type": "string", "format": "date"},
        # 64 lowercase hex chars = a SHA-256 digest of the original file.
        "raw_sha256": {"type": "string", "pattern": r"^[a-f0-9]{64}$"},
    },
}

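# Build the validator once and reuse it across payloads.
# In jsonschema, "format" is annotation-only unless a format checker is
# supplied; passing one makes the "date" formats above real assertions.
# The class-level FORMAT_CHECKER attribute assumes a recent jsonschema 4.x.
_VALIDATOR = Draft202012Validator(
    DOCUMENT_SCHEMA,
    format_checker=Draft202012Validator.FORMAT_CHECKER,
)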


def structural_errors(payload: dict[str, Any]) -> list[JsonSchemaError]:
    """Return every structural violation, sorted by location for stable output.

    An empty list means the payload satisfies the structural contract.
    """
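    # Path elements mix str keys and int indexes; stringify them so the
    # sort never compares an int with a str.
    return sorted(
        _VALIDATOR.iter_errors(payload),
        key=lambda e: [str(part) for part in e.absolute_path],
    )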

Each JsonSchemaError exposes machine-usable attributes that make categorization deterministic: error.validator (the keyword that failed, e.g. "required", "enum", "pattern"), error.json_path (e.g. "$.country"), error.absolute_path, and error.message. We use error.validator as the primary signal for severity in the next section.

Validating with Pydantic v2

Where business rules and cross-field dependencies live, Pydantic v2 is the cleaner choice. Note the v2-specific surface: ConfigDict for model config, @field_validator with mode="before" for pre-coercion cleanup, and @model_validator(mode="after") for whole-object regulatory rules. The .errors() method on a ValidationError returns one dict per failure, each with a stable type, a loc tuple, a msg, and the offending input.

"""Typed model that enforces structure plus regulatory cross-field rules."""
from __future__ import annotations

from datetime import date
from enum import Enum
from typing import Any

from pydantic import (
    BaseModel,
    ConfigDict,
    Field,
    ValidationError,
    field_validator,
    model_validator,
)


class DocumentType(str, Enum):
    FDA_1572 = "FDA_1572"
    INVESTIGATOR_CV = "INVESTIGATOR_CV"
    FIN_DISCLOSURE = "FIN_DISCLOSURE"
    DELEGATION_LOG = "DELEGATION_LOG"


class RegulatoryDocument(BaseModel):
    # extra="forbid" rejects unexpected keys; this is critical for catching
    # silent upstream parser drift that would otherwise pass unnoticed.
    model_config = ConfigDict(extra="forbid", str_strip_whitespace=True)

    document_id: str = Field(min_length=10, max_length=50, pattern=r"^[A-Z0-9\-]+$")
    document_type: DocumentType
    country: str = Field(pattern=r"^[A-Z]{2}$")
    raw_sha256: str = Field(pattern=r"^[a-f0-9]{64}$")
    signature_date: date | None = None
    license_expiry: date | None = None

    @field_validator("signature_date", "license_expiry", mode="before")
    @classmethod
    def empty_to_none(cls, value: Any) -> Any:
        """Normalize empty or whitespace-only strings to None before date coercion.

        Future-date rejection happens in the model validator below.
        """
        if value is None or (isinstance(value, str) and not value.strip()):
            return None
        return value

    @model_validator(mode="after")
    def enforce_regulatory_rules(self) -> "RegulatoryDocument":
        """Cross-field rules that JSON Schema cannot express cleanly."""
        if self.document_type is DocumentType.FDA_1572:
            # 21 CFR 312.53 requires a signed, dated 1572 for US investigators.
            if self.country == "US" and self.signature_date is None:
                raise ValueError("US FDA 1572 requires signature_date per 21 CFR 312.53")
        if self.signature_date and self.signature_date > date.today():
            raise ValueError("signature_date cannot be in the future")
        if self.license_expiry and self.license_expiry < date.today():
            raise ValueError("medical license has expired")
        return self

This single model rejects unknown fields, coerces ISO dates, and encodes the conditional FDA 1572 rule that purely structural validation cannot. The same conditional-requirement pattern shows up when reconciling fields across systems in Checklist Sync & Gap Analysis.
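The shape of .errors() is easiest to see on a small case. The sketch below uses MiniDocument, a hypothetical trimmed-down model with only the fields needed for the demonstration, and error_summary, an illustrative helper that reduces each failure to its type and location:

```python
"""Show what Pydantic v2 reports for field-level and model-level failures."""
from datetime import date
from typing import List, Optional, Tuple

from pydantic import BaseModel, ConfigDict, ValidationError, model_validator


class MiniDocument(BaseModel):
    """Hypothetical trimmed model; not the full RegulatoryDocument."""

    model_config = ConfigDict(extra="forbid")

    document_type: str
    country: str
    signature_date: Optional[date] = None

    @model_validator(mode="after")
    def require_signature(self) -> "MiniDocument":
        # Runs only when every field has already validated.
        if (
            self.document_type == "FDA_1572"
            and self.country == "US"
            and self.signature_date is None
        ):
            raise ValueError("US FDA 1572 requires signature_date")
        return self


def error_summary(payload: dict) -> List[Tuple[str, tuple]]:
    """Return (type, loc) for every failure, or an empty list on success."""
    try:
        MiniDocument.model_validate(payload)
    except ValidationError as exc:
        return [(err["type"], err["loc"]) for err in exc.errors()]
    return []


# Model-level rule: the error has type "value_error" and an empty loc.
print(error_summary({"document_type": "FDA_1572", "country": "US"}))

# Field-level failures are aggregated: one missing field, one unparseable date.
print(error_summary({"country": "US", "signature_date": "03/15/2024"}))
```

Two behaviors matter for classification. A ValueError raised in a model validator carries no field location, and the "after" validator does not run at all when a field has already failed, so cross-field rules are only reported once the structure is clean.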

A Deterministic Error Taxonomy

Categorization is the control plane that stops validation noise from masking critical failures. A DPI warning and a missing principal-investigator signature are both “errors,” but conflating them is how submissions slip. We map every failure to one of four severity tiers, each with a fixed routing action.

Severity tier	Regulatory impact	Routing action	Example
`CRITICAL_BLOCKER`	Submission rejection or non-compliance	Halt pipeline, quarantine document, notify regulatory affairs	Missing PI signature on FDA 1572, expired medical license
`STRUCTURAL_WARNING`	Recoverable shape/format mismatch	Auto-retry with fallback normalization, flag for ops review	Date as `MM/DD/YYYY`, malformed but present optional field
`SEMANTIC_GAP`	Cross-field or business-rule inconsistency	Route to gap analysis for protocol alignment	Site budget mismatch vs. IRB-approved version
`OPERATIONAL_INFO`	Non-blocking audit note	Log to compliance ledger, continue	Scan at 150 DPI, non-standard filename

The mapping must be deterministic: the same failure always yields the same tier. Below, the failed keyword (jsonschema) or error type (Pydantic) drives the classification — no human judgment in the hot path.

"""Map raw validation failures to a severity tier for routing."""
from __future__ import annotations

from enum import Enum


class Severity(str, Enum):
    CRITICAL_BLOCKER = "CRITICAL_BLOCKER"
    STRUCTURAL_WARNING = "STRUCTURAL_WARNING"
    SEMANTIC_GAP = "SEMANTIC_GAP"
    OPERATIONAL_INFO = "OPERATIONAL_INFO"


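# Missing mandatory data is a blocker; bad shape/format is recoverable.
# Rules raised as a plain ValueError all surface as "value_error", so the
# missing-signature and expired-license checks land in SEMANTIC_GAP here,
# not in the CRITICAL_BLOCKER tier the table assigns them. Promoting them
# requires a distinct error type per rule.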
_PYDANTIC_TIER: dict[str, Severity] = {
    "missing": Severity.CRITICAL_BLOCKER,
    "extra_forbidden": Severity.STRUCTURAL_WARNING,
    "value_error": Severity.SEMANTIC_GAP,
}


def classify_pydantic(error_type: str) -> Severity:
    """Classify a Pydantic v2 error by its stable ``type`` string."""
    if error_type in _PYDANTIC_TIER:
        return _PYDANTIC_TIER[error_type]
    # string_type, enum, string_pattern_mismatch, int_parsing, etc.
    return Severity.STRUCTURAL_WARNING


# jsonschema severity keys off the failing keyword (error.validator).
_JSONSCHEMA_TIER: dict[str, Severity] = {
    "required": Severity.CRITICAL_BLOCKER,
    "enum": Severity.STRUCTURAL_WARNING,
    "pattern": Severity.STRUCTURAL_WARNING,
    "type": Severity.STRUCTURAL_WARNING,
    "format": Severity.STRUCTURAL_WARNING,
    "additionalProperties": Severity.STRUCTURAL_WARNING,
}


def classify_jsonschema(validator_keyword: str) -> Severity:
    """Classify a jsonschema failure by its failing keyword."""
    return _JSONSCHEMA_TIER.get(validator_keyword, Severity.STRUCTURAL_WARNING)
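Because every plain ValueError raised in a validator surfaces with the same "value_error" type, a type-keyed mapping cannot tell a missing signature from any other rule. One way to give a rule its own stable type string is pydantic_core's PydanticCustomError. The sketch below is an illustration under that assumption; SignedForm, the "missing_pi_signature" type string, CUSTOM_TIER, and tiers_for are all hypothetical names:

```python
"""Give a regulatory rule its own error type so it can map to its own tier."""
from datetime import date
from typing import Dict, List, Optional

from pydantic import BaseModel, ValidationError, model_validator
from pydantic_core import PydanticCustomError


class SignedForm(BaseModel):
    """Hypothetical trimmed model with a single cross-field rule."""

    country: str
    signature_date: Optional[date] = None

    @model_validator(mode="after")
    def require_signature(self) -> "SignedForm":
        if self.country == "US" and self.signature_date is None:
            # The first argument becomes err["type"] in .errors().
            raise PydanticCustomError(
                "missing_pi_signature",
                "US FDA 1572 requires signature_date",
            )
        return self


# Hypothetical extension of the type-to-tier mapping.
CUSTOM_TIER: Dict[str, str] = {"missing_pi_signature": "CRITICAL_BLOCKER"}


def tiers_for(payload: dict) -> List[str]:
    """Return the tier of each failure, defaulting to STRUCTURAL_WARNING."""
    try:
        SignedForm.model_validate(payload)
    except ValidationError as exc:
        return [
            CUSTOM_TIER.get(err["type"], "STRUCTURAL_WARNING")
            for err in exc.errors()
        ]
    return []
```

With one type string per rule, the classification stays a pure dictionary lookup, and adding a rule means adding one entry rather than parsing error messages.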

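Validation to Categorization Flow

A payload moves through four steps in order: the structural check with jsonschema, the typed and cross-field check with Pydantic, classification of each failure into a severity tier, and routing by tier.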

Putting It Together

A single entry point runs both engines, classifies every failure, and returns a deterministic, audit-friendly result. Each error record carries the field path, the failing rule, the severity tier, and a regulatory reference so downstream routing and audit review need no re-derivation.

"""End-to-end validation gate: structure, rules, and categorized errors."""
from __future__ import annotations

import logging
from typing import Any, TypedDict

from pydantic import ValidationError

logger = logging.getLogger("clinical.validation")


class ErrorRecord(TypedDict):
    severity: str
    field_path: str
    rule: str
    message: str
    regulatory_ref: str


def validate_document(payload: dict[str, Any]) -> dict[str, Any]:
    """Validate one ingested payload and return a categorized result.

    Returns a dict with ``status`` of ``PASS`` or ``FAIL``. On failure the
    ``errors`` list is sorted with critical blockers first for triage.
    """
    records: list[ErrorRecord] = []

    # 1) Structural contract first.
    for err in structural_errors(payload):
        records.append(
            ErrorRecord(
                severity=classify_jsonschema(err.validator).value,
                field_path=err.json_path,
                rule=str(err.validator),
                message=err.message,
                regulatory_ref="ICH-GCP E6(R3) / FDA eCTD",
            )
        )

    # 2) Typed model and regulatory cross-field rules.
    if not records:
        try:
            doc = RegulatoryDocument.model_validate(payload)
            return {"status": "PASS", "data": doc.model_dump(mode="json")}
        except ValidationError as exc:
            for err in exc.errors():
                # err["loc"] is a tuple; join into a dotted path.
                field_path = ".".join(str(part) for part in err["loc"]) or "<model>"
                records.append(
                    ErrorRecord(
                        severity=classify_pydantic(err["type"]).value,
                        field_path=field_path,
                        rule=err["type"],
                        message=err["msg"],
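                        # Placeholder: one fixed citation for every model error.
                        # A per-rule lookup would be more accurate.
                        regulatory_ref="21 CFR 312.53",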
                    )
                )

    # Critical blockers first so triage and on-call see them immediately.
    tier_order = {"CRITICAL_BLOCKER": 0, "SEMANTIC_GAP": 1,
                  "STRUCTURAL_WARNING": 2, "OPERATIONAL_INFO": 3}
    records.sort(key=lambda r: tier_order[r["severity"]])

    logger.warning(
        "validation_failed",
        extra={"document_id": payload.get("document_id"), "error_count": len(records)},
    )
    return {"status": "FAIL", "errors": records}

Because every record is structured and the classification is pure (no I/O, no clock reads inside the mapping), the gate is safe to run inside concurrent workers — the foundation for fan-out scaling covered in Async Batch Processing for Site Packets.
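As a rough illustration of that fan-out, the sketch below runs a gate function over a batch with a thread pool from the standard library. Threads are only one option and are an assumption here, not necessarily the approach the batch-processing guide takes. validate_batch and stub_gate are hypothetical names, and stub_gate stands in for validate_document so the example runs on its own:

```python
"""Fan a pure validation gate out over a batch of payloads."""
from concurrent.futures import ThreadPoolExecutor
from typing import Any, Callable, Dict, Iterable, List

Payload = Dict[str, Any]
Result = Dict[str, Any]


def validate_batch(
    payloads: Iterable[Payload],
    gate: Callable[[Payload], Result],
    max_workers: int = 4,
) -> List[Result]:
    """Run the gate over every payload; results come back in input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map preserves the order of its input iterable.
        return list(pool.map(gate, payloads))


def stub_gate(payload: Payload) -> Result:
    """Stand-in for validate_document: passes when document_id is present."""
    status = "PASS" if "document_id" in payload else "FAIL"
    return {"status": status, "document_id": payload.get("document_id")}


results = validate_batch([{"document_id": "SITE-001-DOC-01"}, {}], stub_gate)
print([r["status"] for r in results])
```

Because the gate shares no mutable state between calls, no locking is needed, and ordered results make it straightforward to pair each outcome with its source document.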

Audit Trail and ALCOA+ Considerations

Validation outcomes are regulatory records and must satisfy ALCOA+ data-integrity principles. Each event should be persisted to append-only storage with:

Original file SHA-256 and the normalized-payload hash
Schema version identifier and the regulatory guideline release in effect
Every categorized error with field-level traceability
Execution context — library versions, container ID, UTC timestamp
Operator or system attribution for 21 CFR Part 11 attributable records
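The items above map naturally onto a single immutable record. This is a minimal sketch, assuming the record is serialized to append-only storage elsewhere; AuditEvent, its field names, and installed_version are hypothetical, chosen to mirror the list:

```python
"""One immutable audit record per validation event."""
from __future__ import annotations

import platform
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone
from importlib import metadata
from typing import Any


def installed_version(package: str) -> str:
    """Return the installed version of a package, or a marker if absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return "not-installed"


@dataclass(frozen=True)
class AuditEvent:
    raw_sha256: str                    # hash of the original file
    payload_sha256: str                # hash of the normalized payload
    schema_version: str                # version of the contract applied
    guideline_release: str             # regulatory guideline release in effect
    errors: tuple[dict[str, Any], ...]  # categorized errors with field paths
    actor: str                         # operator or system attribution
    container_id: str                  # execution context
    library_versions: dict[str, str] = field(default_factory=dict)
    python_version: str = field(default_factory=platform.python_version)
    recorded_at_utc: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )


event = AuditEvent(
    raw_sha256="a" * 64,
    payload_sha256="b" * 64,
    schema_version="example-1.0.0",     # illustrative value
    guideline_release="example-release",  # illustrative value
    errors=(),
    actor="validation-service",
    container_id="example-container",
    library_versions={"jsonschema": installed_version("jsonschema")},
)
record = asdict(event)  # plain dict, ready for serialization
```

Freezing the dataclass prevents accidental mutation between creation and persistence, and the timestamp is taken in UTC with an explicit offset so it is unambiguous in review.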

A useful integrity check on the audit record itself is a content hash. If $H$ is SHA-256 and $r$ is the canonical JSON serialization of the validation record, store $h = H(r)$; any later change to $r$ changes $h$, so alteration is detectable as long as $h$ is kept where someone editing the record cannot also rewrite it. Logs must never contain PHI or PII beyond the minimum needed for identification; apply field-level masking before persistence.
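The content hash can be sketched with the standard library alone. "Canonical" is taken here to mean sorted keys, compact separators, and UTF-8 encoding, which is an assumption; any fixed serialization works as long as writer and verifier agree on it. The function names are illustrative:

```python
"""Content-hash an audit record so later alteration is detectable."""
import hashlib
import hmac
import json
from typing import Any, Dict


def canonical_json(record: Dict[str, Any]) -> bytes:
    """Serialize deterministically: sorted keys, no insignificant whitespace."""
    return json.dumps(
        record, sort_keys=True, separators=(",", ":"), ensure_ascii=True
    ).encode("utf-8")


def record_hash(record: Dict[str, Any]) -> str:
    """h = H(r) with H = SHA-256 and r = canonical JSON of the record."""
    return hashlib.sha256(canonical_json(record)).hexdigest()


def is_unaltered(record: Dict[str, Any], stored_hash: str) -> bool:
    """Recompute the hash and compare it with the stored value."""
    return hmac.compare_digest(record_hash(record), stored_hash)


original = {"document_id": "SITE-001-DOC-01", "status": "FAIL", "error_count": 2}
stored = record_hash(original)

tampered = dict(original, status="PASS")
print(is_unaltered(original, stored), is_unaltered(tampered, stored))
```

Sorting keys matters: two dicts with the same content but different insertion order must hash identically, or a harmless re-serialization would look like tampering.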

FAQ

Should I use jsonschema or Pydantic for clinical document validation?

Use both. jsonschema gives you a portable, language-neutral contract you can share with sponsors and CROs and version in change control. Pydantic v2 enforces the same structure in-process plus the cross-field regulatory rules (like the conditional FDA 1572 signature requirement) that JSON Schema cannot express cleanly. Run the structural jsonschema check first, then the Pydantic model.

How do I collect every validation error instead of stopping at the first?

With jsonschema, build a Draft202012Validator and iterate iter_errors(payload) rather than calling validate(), which raises on the first failure. With Pydantic v2, a single ValidationError already aggregates all field failures — call .errors() to get the full list of structured records.

Why is `extra="forbid"` important in the Pydantic model?

Upstream parsers and OCR steps can silently introduce or rename fields. Setting extra="forbid" (via ConfigDict) turns an unexpected key into a caught extra_forbidden error instead of letting unvalidated data flow downstream. It is one of the cheapest defenses against parser drift between staging and production.
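A minimal sketch of that behavior, using a hypothetical one-field model (StrictPayload) and an illustrative helper (unexpected_keys) that lists the keys Pydantic rejected:

```python
"""Surface unexpected keys as extra_forbidden errors."""
from typing import List

from pydantic import BaseModel, ConfigDict, ValidationError


class StrictPayload(BaseModel):
    """Hypothetical one-field model; extra keys are rejected, not ignored."""

    model_config = ConfigDict(extra="forbid")

    country: str


def unexpected_keys(payload: dict) -> List[str]:
    """Return the names of keys the model does not declare."""
    try:
        StrictPayload.model_validate(payload)
    except ValidationError as exc:
        return [
            str(err["loc"][0])
            for err in exc.errors()
            if err["type"] == "extra_forbidden"
        ]
    return []


# A parser that starts emitting a misspelled key is caught immediately.
print(unexpected_keys({"country": "US", "countrry_code": "US"}))
```

Without extra="forbid", the default is to ignore unknown keys, so the misspelled field would be dropped silently and the document would pass.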

How are validation errors routed after categorization?

Each error is mapped to a severity tier, and the tier determines the action: critical blockers halt and quarantine the document and notify regulatory affairs; semantic gaps route to gap analysis; structural warnings trigger a normalization retry and an ops flag; operational info is logged and the pipeline continues. The full escalation-matrix and human-in-the-loop design is detailed in Categorizing validation errors in regulatory document pipelines.
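The tier-to-action mapping described in this answer can be expressed as a small lookup table. This is a sketch only: the action names are hypothetical labels for the behaviors listed above, and a real router would call into quarantine, notification, and retry services rather than return strings:

```python
"""Map severity tiers to routing actions, most severe first."""
from __future__ import annotations

from enum import Enum


class Tier(str, Enum):
    # Declared in triage order: iteration follows definition order.
    CRITICAL_BLOCKER = "CRITICAL_BLOCKER"
    SEMANTIC_GAP = "SEMANTIC_GAP"
    STRUCTURAL_WARNING = "STRUCTURAL_WARNING"
    OPERATIONAL_INFO = "OPERATIONAL_INFO"


# Hypothetical action names; the pairing follows the routing described above.
ACTIONS: dict[Tier, str] = {
    Tier.CRITICAL_BLOCKER: "halt_quarantine_notify_regulatory_affairs",
    Tier.SEMANTIC_GAP: "route_to_gap_analysis",
    Tier.STRUCTURAL_WARNING: "retry_normalization_and_flag_ops",
    Tier.OPERATIONAL_INFO: "log_and_continue",
}


def plan_actions(severities: list[str]) -> list[str]:
    """Return one action per tier present, most severe first.

    An unknown severity string raises ValueError instead of being ignored.
    """
    present = {Tier(value) for value in severities}
    return [ACTIONS[tier] for tier in Tier if tier in present]


print(plan_actions(["STRUCTURAL_WARNING", "CRITICAL_BLOCKER", "STRUCTURAL_WARNING"]))
```

Failing loudly on an unknown tier is deliberate: a typo in a severity string should never result in a document being routed as if it had no errors.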

Related

Categorizing validation errors in regulatory document pipelines: the full escalation matrix, retry ladder, and human-in-the-loop routing built on the severity tiers above.
PDF/DOCX Parsing for Clinical Docs: the upstream step that turns native documents into the payloads this gate validates.
OCR & Metadata Extraction Pipelines: how image-only documents become validated dictionaries with confidence scores attached.
FDA/EMA Submission Schema Design: authoring the shared Draft 2020-12 contracts this page validates against.
Async Batch Processing for Site Packets: running the pure validation gate inside concurrent workers at filing-window throughput.

