Building FDA eCTD-Compliant JSON Schemas for Clinical Trials

This guide shows how to model the electronic Common Technical Document (eCTD) as JSON Schema Draft 2020-12 and pydantic v2 models. The result captures harmonized CTD Modules 2-5, region-specific Module 1, document leaf metadata, lifecycle operations, and file checksums in a validatable, regulator-aligned data contract.

The eCTD is the structured format health authorities such as FDA and EMA use to receive marketing applications, INDs, and their amendments. Underneath the regulator-supplied DTDs and validation criteria sits a logical model: a tree of modules, headings, and document leaves, each carrying metadata and a lifecycle operation. This article builds that logical model in JSON Schema and pydantic v2 so your internal pipeline can author, validate, and version submission metadata before anything is rendered to the official XML backbone. We deliberately treat the eCTD specification version and region-specific rules as configuration, not hardcoded constants, because those values change and differ by region.

This is the deep, code-first walkthrough beneath FDA/EMA Submission Schema Design, which maps the whole design space and sits inside the Core Architecture & Regulatory Mapping for Clinical Trials foundation. It pairs closely with Regulatory Data Dictionary Construction for the controlled vocabularies these models reference, and with Schema Validation & Error Categorization for how to surface and triage the failures these schemas produce.

Why naive approaches fail

Teams reach for the same shortcuts when they first model a submission in Python, and each one leaks invalid metadata straight to a gateway:

One Module 1 model with optional fields. Because Module 1 genuinely differs by region, a single “just make everything Optional” model cannot express “application_number is required for FDA but meaningless for EMA.” Invalid region/field combinations validate cleanly and fail only at the authority.
A loose oneOf with no discriminator. Without a discriminator key, both pydantic and downstream JSON Schema tooling have to guess which branch a payload matches, producing cryptic multi-branch errors instead of a single clear “unknown region” message.
Default additionalProperties. JSON Schema and pydantic both permit unknown keys unless you forbid them. eCTD backbones are strict; a silently accepted typo like medai_type is exactly how malformed leaf metadata reaches the official validator.
Hardcoding the spec version. Baking a literal like "3.3" into the model forces a code fork every time a region publishes a new specification version — and guarantees drift between your FDA and EMA pipelines.
Trusting a client-supplied checksum string. A digest that is not computed from the actual file bytes is decoration. It passes your schema and fails the authority’s integrity check, after transmission.
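Ignoring format assertions. jsonschema treats format as advisory by default, so a malformed value in any field that relies on format (a date-time or uri, say) slips through unless you explicitly attach a FormatChecker. Constraints expressed as pattern, such as media_type and language in the models below, are always enforced.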

The rest of this page removes each of those failure modes by construction.

Architecture overview

The model is a tree: a submission envelope carries a region-specific Module 1 and the harmonized Modules 2-5, each module is a tree of heading sections, and every section holds document leaves. Each leaf carries the three facts that must survive the round-trip into the XML backbone — its metadata, its lifecycle operation, and its file checksum.

We model each of the three explicitly:

Leaf metadata — title, file href, MIME/media type, language, and a stable identifier per document.
Lifecycle operation — every leaf in a submission declares what it does relative to prior sequences: new, replace, append, or delete. A replace or delete must point at the leaf it modifies.
Checksums — each leaf carries a file checksum so the receiving authority can verify integrity. The checksum algorithm is region- and version-configurable, so we store both the algorithm name and the digest.
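To make the lifecycle rule concrete, here is a minimal sketch of how one sequence's leaves might be folded into a "current view" of the dossier. The apply_sequence helper and the dict-based leaf shape are illustrative assumptions, not part of the model built later. Keeping an appended leaf alongside its target is one reasonable reading of append, not a regulatory statement.

```python
from __future__ import annotations


def apply_sequence(current: dict[str, dict], leaves: list[dict]) -> dict[str, dict]:
    """Fold one sequence's leaves into a copy of the current view.

    `current` maps leaf_id -> leaf dict for every leaf still in force.
    """
    view = dict(current)
    for leaf in leaves:
        op = leaf["operation"]
        target = leaf.get("modified_leaf_id")
        if op == "new":
            if target is not None:
                raise ValueError(f"{leaf['leaf_id']}: 'new' must not reference a prior leaf")
        elif op in {"replace", "append", "delete"}:
            if target not in view:
                raise ValueError(f"{leaf['leaf_id']}: {op} targets unknown leaf {target!r}")
            if op != "append":
                del view[target]  # replace and delete retire the prior leaf
        else:
            raise ValueError(f"{leaf['leaf_id']}: unknown operation {op!r}")
        if op != "delete":
            view[leaf["leaf_id"]] = leaf  # a delete adds nothing to the view
    return view


seq_0000 = [{"leaf_id": "m3.spec.001", "operation": "new"}]
seq_0001 = [{"leaf_id": "m3.spec.002", "operation": "replace",
             "modified_leaf_id": "m3.spec.001"}]
view = apply_sequence(apply_sequence({}, seq_0000), seq_0001)
```

After both sequences are applied, the view holds only m3.spec.002, which is why a replace or delete without a resolvable target cannot be interpreted.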

Setup and configuration

Two maintained libraries do all the work: pydantic v2 for typed authoring and JSON Schema emission, and jsonschema for validating untrusted payloads that arrive from other systems.

python -m pip install "pydantic>=2.6" "jsonschema>=4.21"

Nothing about a submission’s region, spec version, or checksum algorithm belongs in source. Resolve those from the environment (or your secrets manager) so one codebase serves FDA and EMA, current and prior spec versions:

"""Runtime configuration for the eCTD model. Nothing regulatory is hardcoded."""
from __future__ import annotations

import logging
import os
from dataclasses import dataclass

logging.basicConfig(
    level=os.environ.get("ECTD_LOG_LEVEL", "INFO"),
    format="%(asctime)s %(levelname)s %(name)s %(message)s",
)
logger = logging.getLogger("ectd.schema")


@dataclass(frozen=True)
class EctdConfig:
    """Region/version validation criteria, resolved from the environment."""
    region: str
    spec_version: str
    checksum_algorithm: str
    audit_log_path: str

    @classmethod
    def from_env(cls) -> "EctdConfig":
        """Fail fast on missing configuration rather than defaulting silently."""
        try:
            return cls(
                region=os.environ["ECTD_REGION"],                 # e.g. "fda" | "ema"
                spec_version=os.environ["ECTD_SPEC_VERSION"],      # e.g. "us-regional-3.3"
                checksum_algorithm=os.environ["ECTD_CHECKSUM_ALGO"],  # e.g. "sha-256"
                audit_log_path=os.environ["ECTD_AUDIT_LOG_PATH"],
            )
        except KeyError as missing:
            raise RuntimeError(f"missing required eCTD env var: {missing}") from missing

The concrete values ("us-regional-3.3" and friends) are tokens your configuration supplies. The model neither asserts nor invents a regulatory version — it records whatever your region/version config declares and lets the exporter and the official validation tool enforce the authority’s current criteria. Pull the real allow-lists from your Regulatory Data Dictionary Construction registry at startup.
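A startup check against those allow-lists could look like the sketch below. The ALLOWED table and the check_config helper are hypothetical, and the tokens in the table are placeholders standing in for whatever your registry returns. Nothing here states which versions or algorithms an authority actually accepts.

```python
from __future__ import annotations

# Placeholder registry contents; real values come from your data dictionary.
ALLOWED: dict[str, dict[str, set[str]]] = {
    "fda": {
        "spec_versions": {"us-regional-3.3"},
        "checksum_algorithms": {"md5", "sha-256"},
    },
    "ema": {
        "spec_versions": {"eu-regional-placeholder"},
        "checksum_algorithms": {"md5", "sha-256"},
    },
}


def check_config(
    region: str,
    spec_version: str,
    checksum_algorithm: str,
    allowed: dict[str, dict[str, set[str]]] = ALLOWED,
) -> list[str]:
    """Return a list of problems; an empty list means every token is known."""
    rules = allowed.get(region)
    if rules is None:
        return [f"unknown region {region!r}"]
    problems: list[str] = []
    if spec_version not in rules["spec_versions"]:
        problems.append(f"spec version {spec_version!r} not allowed for {region}")
    if checksum_algorithm not in rules["checksum_algorithms"]:
        problems.append(f"checksum algorithm {checksum_algorithm!r} not allowed for {region}")
    return problems
```

Calling this once after EctdConfig.from_env() and refusing to start on a non-empty result keeps an unrecognized token from ever reaching a submission.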

Full working implementation

Document leaves and lifecycle operations

We start at the leaf because it is the atom of an eCTD submission. The lifecycle operation is the trickiest part: replace, append, and delete reference a prior leaf, while new must not. That is a natural fit for a discriminated union keyed on the operation name, which keeps the “modified-leaf-ID is required here, forbidden there” rule inside the type system instead of in scattered if statements.

"""eCTD logical model: document leaves and lifecycle operations.

Validated with pydantic v2. Emits JSON Schema (Draft 2020-12) via
model_json_schema(), which is checkable with jsonschema's Draft202012Validator.
"""
from __future__ import annotations

import hashlib
from enum import Enum
from pathlib import Path
from typing import Annotated, Literal, Union

from pydantic import BaseModel, ConfigDict, Field, field_validator


class LifecycleOp(str, Enum):
    """Lifecycle operations an eCTD leaf may declare against prior sequences."""
    NEW = "new"
    REPLACE = "replace"
    APPEND = "append"
    DELETE = "delete"


class ChecksumAlgo(str, Enum):
    """Checksum algorithm. The required algorithm is configured per region
    and eCTD version rather than hardcoded, so multiple values are allowed."""
    MD5 = "md5"
    SHA256 = "sha-256"


class FileChecksum(BaseModel):
    """Integrity descriptor for a single physical file."""
    model_config = ConfigDict(extra="forbid")

    algorithm: ChecksumAlgo
    digest: Annotated[str, Field(pattern=r"^[0-9a-f]+$", min_length=32)]

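    @field_validator("digest", mode="before")
    @classmethod
    def _lowercase_hex(cls, value: object) -> object:
        """Normalize case before the lowercase-hex pattern is checked.

        In the default "after" mode the pattern runs first and rejects any
        upper-case digest, so the normalization would never take effect.
        """
        return value.lower() if isinstance(value, str) else value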


class LeafBase(BaseModel):
    """Fields common to every document leaf regardless of operation."""
    model_config = ConfigDict(extra="forbid")

    leaf_id: Annotated[str, Field(pattern=r"^[A-Za-z][A-Za-z0-9_.\-]{2,63}$")]
    title: Annotated[str, Field(min_length=1, max_length=512)]
    href: Annotated[str, Field(min_length=1, max_length=2048)]
    media_type: Annotated[str, Field(pattern=r"^[\w.+-]+/[\w.+-]+$")] = "application/pdf"
    language: Annotated[str, Field(pattern=r"^[a-z]{2}(-[A-Z]{2})?$")] = "en"
    checksum: FileChecksum

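    @field_validator("href")
    @classmethod
    def _reject_absolute_paths(cls, value: str) -> str:
        """Leaf hrefs are submission-relative; reject traversal and absolute paths."""
        # Normalize separators so the check behaves the same on every host OS;
        # pathlib alone would not split "C:\\x" or "..\\x" on POSIX.
        segments = value.replace("\\", "/").split("/")
        if segments[0] == "" or ":" in segments[0] or ".." in segments:
            raise ValueError("href must be a relative path without '..' segments")
        return value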


class NewLeaf(LeafBase):
    operation: Literal[LifecycleOp.NEW] = LifecycleOp.NEW


class ReplaceLeaf(LeafBase):
    operation: Literal[LifecycleOp.REPLACE] = LifecycleOp.REPLACE
    modified_leaf_id: str = Field(description="leaf_id this leaf replaces")


class AppendLeaf(LeafBase):
    operation: Literal[LifecycleOp.APPEND] = LifecycleOp.APPEND
    modified_leaf_id: str = Field(description="leaf_id this leaf appends to")


class DeleteLeaf(LeafBase):
    operation: Literal[LifecycleOp.DELETE] = LifecycleOp.DELETE
    modified_leaf_id: str = Field(description="leaf_id this leaf deletes")


# Discriminated union: pydantic and JSON Schema both route on `operation`.
Leaf = Annotated[
    Union[NewLeaf, ReplaceLeaf, AppendLeaf, DeleteLeaf],
    Field(discriminator="operation"),
]

The discriminator="operation" annotation is what makes this production-grade rather than a loose oneOf. pydantic uses it to pick the right model with a clear error when operation is missing or unknown, and model_json_schema() emits a JSON Schema oneOf plus a discriminator mapping, so downstream tooling that understands discriminators (OpenAPI-style) routes the same way. extra="forbid" maps to additionalProperties: false, which closes the silent-unknown-key failure mode from the top of the page.
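One gap remains in FileChecksum as written: min_length=32 admits a 40-character string labelled sha-256. A stricter check ties digest length to the declared algorithm, sketched here in plain Python. The HEX_LENGTHS table and digest_matches_algorithm are hypothetical names. The lengths follow from the digest sizes, 128 bits for MD5 and 256 bits for SHA-256.

```python
from __future__ import annotations

# Hex characters per digest: 128-bit MD5 -> 32, 256-bit SHA-256 -> 64.
# Keys reuse this article's ChecksumAlgo values.
HEX_LENGTHS: dict[str, int] = {"md5": 32, "sha-256": 64}


def digest_matches_algorithm(algorithm: str, digest: str) -> bool:
    """True when the digest has exactly the length the algorithm produces."""
    try:
        expected = HEX_LENGTHS[algorithm]
    except KeyError:
        raise ValueError(f"unsupported checksum algorithm: {algorithm!r}") from None
    return len(digest) == expected
```

Inside the pydantic model the same comparison could run in a model-level validator, which sees the algorithm and the digest together.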

Region-specific Module 1 vs. harmonized Modules 2-5

Module 1 is where regional divergence lives, so we model the region itself as a discriminator. An FDA Module 1 and an EMA Module 1 are different shapes; a payload tagged region: "fda" can only ever satisfy the FDA shape.

class ModuleSection(BaseModel):
    """A heading node that groups leaves and/or nested sections."""
    model_config = ConfigDict(extra="forbid")

    section_code: Annotated[str, Field(pattern=r"^[0-9](\.[0-9A-Za-z]+)*$")]
    title: Annotated[str, Field(min_length=1, max_length=512)]
    leaves: list[Leaf] = Field(default_factory=list)
    subsections: list["ModuleSection"] = Field(default_factory=list)


class FDAModule1(BaseModel):
    """US region-specific administrative module."""
    model_config = ConfigDict(extra="forbid")

    region: Literal["fda"] = "fda"
    submission_type: Literal["ind", "nda", "bla", "anda"]
    application_number: Annotated[str, Field(pattern=r"^[0-9]{6}$")]
    cover_letter: Leaf
    sections: list[ModuleSection] = Field(default_factory=list)


class EMAModule1(BaseModel):
    """EU region-specific administrative module."""
    model_config = ConfigDict(extra="forbid")

    region: Literal["ema"] = "ema"
    procedure_type: Literal["centralised", "national", "mrp", "dcp"]
    cover_letter: Leaf
    sections: list[ModuleSection] = Field(default_factory=list)


Module1 = Annotated[
    Union[FDAModule1, EMAModule1],
    Field(discriminator="region"),
]


class HarmonizedModule(BaseModel):
    """Modules 2-5 share one shape across regions (ICH-harmonized content)."""
    model_config = ConfigDict(extra="forbid")

    module_number: Literal[2, 3, 4, 5]
    sections: list[ModuleSection] = Field(default_factory=list)

The application_number pattern and the procedure_type values are illustrative shapes, not regulatory facts; parametrize them per region and eCTD version from your data dictionary. The point is structural: Module 1 varies by region and is selected by a discriminator; Modules 2-5 do not.
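One way to parametrize those shapes is to look the constraint up by region, spec version, and field name instead of writing it into the class. The sketch below assumes a hypothetical FIELD_PATTERNS table and check_field helper. The single row reuses this article's illustrative six-digit shape and is not an authority's rule.

```python
from __future__ import annotations

import re

# Hypothetical data-dictionary rows keyed by (region, spec_version, field).
FIELD_PATTERNS: dict[tuple[str, str, str], str] = {
    ("fda", "us-regional-3.3", "application_number"): r"[0-9]{6}",
}


def check_field(
    region: str,
    spec_version: str,
    field: str,
    value: str,
    patterns: dict[tuple[str, str, str], str] = FIELD_PATTERNS,
) -> bool:
    """Match a value against the pattern registered for this region/version."""
    pattern = patterns.get((region, spec_version, field))
    if pattern is None:
        # A missing rule is a configuration error, not a passing value.
        raise LookupError(f"no pattern registered for {(region, spec_version, field)}")
    return re.fullmatch(pattern, value) is not None
```

Raising on a missing row is deliberate in this sketch: silently accepting a field with no registered rule would reintroduce the "everything Optional" problem.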

The submission envelope and configurable spec version

The top-level model ties everything together and carries the sequence metadata that drives lifecycle. Crucially, the eCTD specification version is a plain string field, not a literal baked into the schema. The model constrains only its length; checking the value against the allow-list for the configured region is left to the runtime configuration layer.

class Submission(BaseModel):
    """Root logical model rendered into an eCTD backbone by an exporter."""
    model_config = ConfigDict(extra="forbid")

    ectd_spec_version: Annotated[str, Field(min_length=1, max_length=32)]
    sequence: Annotated[str, Field(pattern=r"^[0-9]{4}$")]
    module_1: Module1
    harmonized_modules: list[HarmonizedModule] = Field(default_factory=list)

    @field_validator("harmonized_modules")
    @classmethod
    def _unique_modules(cls, mods: list[HarmonizedModule]) -> list[HarmonizedModule]:
        numbers = [m.module_number for m in mods]
        if len(numbers) != len(set(numbers)):
            raise ValueError("each harmonized module (2-5) may appear at most once")
        return mods

Treating ectd_spec_version as data means one codebase serves FDA and EMA, current and prior spec versions, without forking the model. The exporter that renders the XML backbone reads this field to choose the correct DTD and validation profile.
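The exporter-side dispatch can be as small as a lookup that fails loudly on an unknown version. ExportProfile, PROFILES, and select_profile below are hypothetical, and the DTD and profile names are placeholders, not real file names published by any authority.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class ExportProfile:
    """What the exporter needs to know for one spec version (placeholder fields)."""
    dtd_name: str
    validation_profile: str


# Placeholder entries; populate from your configuration, not from source code.
PROFILES: dict[str, ExportProfile] = {
    "us-regional-3.3": ExportProfile(
        dtd_name="placeholder-us-regional.dtd",
        validation_profile="placeholder-us-criteria",
    ),
}


def select_profile(
    spec_version: str, profiles: dict[str, ExportProfile] = PROFILES
) -> ExportProfile:
    """Return the export profile for a spec version or raise if none is configured."""
    try:
        return profiles[spec_version]
    except KeyError:
        raise ValueError(
            f"no export profile configured for spec version {spec_version!r}"
        ) from None
```

Because the lookup raises rather than defaulting, a submission carrying an unexpected ectd_spec_version stops at export instead of being rendered against the wrong DTD.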

Computing leaf checksums deterministically

Checksums are integrity facts, not free text. Compute them from the actual file bytes with a streaming read so large clinical PDFs do not exhaust memory, and store the algorithm alongside the digest so verification is unambiguous.

def compute_checksum(path: Path, algorithm: ChecksumAlgo) -> FileChecksum:
    """Stream a file through the configured hash and return a FileChecksum.

    Reads in fixed-size chunks to bound memory on large submission documents.
    """
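    # Explicit mapping so a newly added ChecksumAlgo member fails loudly (KeyError)
    # instead of silently falling back to MD5.
    algo_name = {ChecksumAlgo.MD5: "md5", ChecksumAlgo.SHA256: "sha256"}[algorithm]
    digest = hashlib.new(algo_name)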
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1024 * 1024), b""):
            digest.update(chunk)
    return FileChecksum(algorithm=algorithm, digest=digest.hexdigest())

The algorithm is passed in, never assumed: the required checksum algorithm is part of the region/version validation criteria, so resolve it from EctdConfig and feed it here.
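As one way to wire that up, the sketch below maps a configured algorithm token to a hashlib name, hashes a file in chunks, and compares the result with a declared digest. HASHLIB_NAMES, file_digest, and verify_declared_digest are hypothetical helpers introduced here. The token spellings follow this article's ChecksumAlgo values.

```python
from __future__ import annotations

import hashlib
from pathlib import Path

# Configuration token -> hashlib constructor name (assumption of this sketch).
HASHLIB_NAMES: dict[str, str] = {"md5": "md5", "sha-256": "sha256"}


def file_digest(path: Path, algorithm_token: str, chunk_size: int = 1024 * 1024) -> str:
    """Hash a file in fixed-size chunks and return the lowercase hex digest."""
    try:
        name = HASHLIB_NAMES[algorithm_token]
    except KeyError:
        raise ValueError(f"unsupported checksum algorithm: {algorithm_token!r}") from None
    digest = hashlib.new(name)
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_declared_digest(path: Path, algorithm_token: str, declared: str) -> bool:
    """True when the declared digest equals the digest of the bytes on disk."""
    return file_digest(path, algorithm_token) == declared.lower()
```

Running verify_declared_digest on every leaf just before export turns a client-supplied checksum from decoration into a checked fact.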

Emitting, checking, and recording an ALCOA+ audit entry

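pydantic v2 emits Draft 2020-12 JSON Schema directly. We validate raw, untrusted JSON with jsonschema’s Draft202012Validator, which is useful when input arrives from a system that does not import your Python models, and we attach a FormatChecker so any format assertions in the schema are actually enforced. Every build-and-validate event then appends a record to the audit log in the spirit of 21 CFR Part 11 and ALCOA+: attributable (the actor), contemporaneous (a UTC timestamp), and anchored to the payload by its SHA-256. Opening a file in append mode does not by itself make the log immutable; tamper resistance has to come from the storage layer, for example write-once media or restricted permissions.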

import json
from dataclasses import asdict, dataclass
from datetime import datetime, timezone

from jsonschema import Draft202012Validator
from jsonschema.validators import validator_for


def build_submission_schema() -> dict:
    """Generate a Draft 2020-12 JSON Schema for the Submission model."""
    return Submission.model_json_schema()


def make_validator(schema: dict) -> Draft202012Validator:
    """Build a draft-pinned validator with format assertions enforced.

    check_schema confirms the emitted schema is itself well-formed before use;
    the runtime validator is pinned explicitly to Draft 2020-12.
    """
    validator_cls = validator_for(schema)
    validator_cls.check_schema(schema)
    return Draft202012Validator(
        schema, format_checker=Draft202012Validator.FORMAT_CHECKER
    )


def validate_payload(payload: dict, schema: dict) -> list[str]:
    """Return human-readable error strings; empty list means valid.

    Errors are sorted by JSON path so the report is deterministic and
    digestible by an error-categorization stage downstream.
    """
    validator = make_validator(schema)
    errors = sorted(validator.iter_errors(payload), key=lambda e: list(e.absolute_path))
    return [f"{'/'.join(map(str, e.absolute_path)) or '<root>'}: {e.message}" for e in errors]


@dataclass(frozen=True)
class AuditRecord:
    """ALCOA+ record for a submission build/validate event (append-only)."""
    event: str
    actor: str
    when_utc: str
    sequence: str
    region: str
    spec_version: str
    outcome: str
    payload_sha256: str
    error_count: int


def emit_audit_record(
    submission: Submission, payload: dict, errors: list[str], cfg: EctdConfig, actor: str
) -> AuditRecord:
    """Append an immutable ALCOA+ record for this build/validate event."""
    record = AuditRecord(
        event="ectd.submission.validate",
        actor=actor,
        when_utc=datetime.now(timezone.utc).isoformat(),
        sequence=submission.sequence,
        region=cfg.region,
        spec_version=submission.ectd_spec_version,
        outcome="valid" if not errors else "invalid",
        payload_sha256=hashlib.sha256(
            json.dumps(payload, sort_keys=True).encode("utf-8")
        ).hexdigest(),
        error_count=len(errors),
    )
    with open(cfg.audit_log_path, "a", encoding="utf-8") as log:
        log.write(json.dumps(asdict(record), sort_keys=True) + "\n")
    logger.info("audit %s sequence=%s outcome=%s", record.event, record.sequence, record.outcome)
    return record

Returning all errors via iter_errors (rather than raising on the first) is what lets a categorization layer group failures by type and location. That handoff is covered in Schema Validation & Error Categorization; here we make sure the validator surfaces every problem with a stable path, then record the outcome before returning it.
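If you want the audit log itself to be tamper-evident, one common technique is to chain records by hash so that editing any earlier line invalidates everything after it. The sketch below is an assumption layered on top of the article's JSON-lines log, not part of emit_audit_record. The names GENESIS, append_chained, and verify_chain are hypothetical.

```python
from __future__ import annotations

import hashlib
import json
from pathlib import Path

GENESIS = "0" * 64  # prev_hash used by the first record in a log


def _record_hash(body: dict) -> str:
    """SHA-256 over the canonical (sorted-key) JSON form of a record body."""
    return hashlib.sha256(json.dumps(body, sort_keys=True).encode("utf-8")).hexdigest()


def _last_hash(log_path: Path) -> str:
    """Hash of the final record, or GENESIS for a missing or empty log."""
    if not log_path.exists():
        return GENESIS
    lines = [ln for ln in log_path.read_text(encoding="utf-8").splitlines() if ln.strip()]
    return json.loads(lines[-1])["record_hash"] if lines else GENESIS


def append_chained(log_path: Path, record: dict) -> dict:
    """Append a record that embeds the previous record's hash and its own."""
    body = dict(record, prev_hash=_last_hash(log_path))
    entry = dict(body, record_hash=_record_hash(body))
    with log_path.open("a", encoding="utf-8") as log:
        log.write(json.dumps(entry, sort_keys=True) + "\n")
    return entry


def verify_chain(log_path: Path) -> bool:
    """Recompute every hash and check each record points at its predecessor."""
    prev = GENESIS
    for line in log_path.read_text(encoding="utf-8").splitlines():
        if not line.strip():
            continue
        entry = json.loads(line)
        claimed = entry.pop("record_hash")
        if entry["prev_hash"] != prev or _record_hash(entry) != claimed:
            return False
        prev = claimed
    return True
```

This sketch rereads the whole file to find the last hash, which is fine for illustration but would need an index or cached tail for a large log. It also detects edits only; it does not prevent truncation of the tail.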

Putting it together

A minimal end-to-end flow constructs a typed Submission, dumps it to JSON, regenerates the schema, validates the round-tripped payload, and writes the audit record — proving the model and the emitted schema agree.

def build_and_validate(cfg: EctdConfig, actor: str) -> AuditRecord:
    leaf = NewLeaf(
        leaf_id="m1.cover.001",
        title="Cover Letter",
        href="m1/us/cover-letter.pdf",
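        checksum=FileChecksum(
            algorithm=ChecksumAlgo.SHA256,
            # Placeholder: the SHA-256 of empty input. Real pipelines call
            # compute_checksum() on the actual file instead.
            digest="e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855",
        ),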
    )
    submission = Submission(
        ectd_spec_version=cfg.spec_version,  # configurable, not hardcoded
        sequence="0000",
        module_1=FDAModule1(
            submission_type="ind",
            application_number="123456",
            cover_letter=leaf,
        ),
        harmonized_modules=[HarmonizedModule(module_number=3)],
    )

    schema = build_submission_schema()
    payload = json.loads(submission.model_dump_json())
    errors = validate_payload(payload, schema)
    return emit_audit_record(submission, payload, errors, cfg, actor)


if __name__ == "__main__":
    result = build_and_validate(EctdConfig.from_env(), actor="pipeline@example")
    print(f"submission {result.sequence}: {result.outcome} ({result.error_count} errors)")

Validation and edge-case handling

The design closes each failure mode named earlier, but a few edge cases deserve explicit checks in your pipeline:

replace/append/delete with a dangling modified_leaf_id. pydantic enforces that the field is present; it cannot know the referenced leaf_id exists in a prior sequence. Cross-check modified_leaf_id against your sequence index and raise before export — an orphan reference is a gateway rejection.
Unknown region or operation. Because both are discriminated unions, an unmapped region or operation value raises a single pydantic error naming the discriminator (union_tag_invalid, or union_tag_not_found when the key is missing), rather than a fan-out of branch failures.
Non-canonical digests. The _lowercase_hex validator normalizes case and the ^[0-9a-f]+$ pattern rejects non-hex characters, so a mixed-case or space-padded digest never reaches the checksum comparison.
Path traversal in href. _reject_absolute_paths blocks /etc/..., C:\..., and any .. segment, so a leaf can only ever point inside the submission directory.
Duplicate harmonized modules. _unique_modules prevents two Module 3 blocks from coexisting, a common merge artifact when packages are assembled from multiple contributors.
Silent format skips. Building the validator through make_validator guarantees the FormatChecker is attached; never construct a bare Draft202012Validator without it.
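The dangling-reference cross-check from the first item can be sketched over the plain dicts that model_dump() produces. iter_leaves and dangling_references are hypothetical helpers, and known_leaf_ids stands in for whatever your sequence index returns for prior sequences.

```python
from __future__ import annotations

from typing import Iterator


def iter_leaves(section: dict) -> Iterator[dict]:
    """Yield every leaf in a section and, recursively, in its subsections."""
    yield from section.get("leaves", [])
    for child in section.get("subsections", []):
        yield from iter_leaves(child)


def dangling_references(sections: list[dict], known_leaf_ids: set[str]) -> list[str]:
    """Report leaves whose modified_leaf_id is absent from the prior-sequence index."""
    problems: list[str] = []
    for section in sections:
        for leaf in iter_leaves(section):
            target = leaf.get("modified_leaf_id")
            if target is not None and target not in known_leaf_ids:
                problems.append(f"{leaf['leaf_id']} -> {target}")
    return problems
```

Collecting every orphan in one pass, rather than raising on the first, matches how validate_payload reports schema errors and gives the author a complete list to fix.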

Testing and verification

Pin the guarantees with pytest. These assertions confirm the model and its emitted schema agree and that each guard actually rejects the bad input it targets.

import pytest
from pydantic import ValidationError

GOOD_CHECKSUM = FileChecksum(
    algorithm=ChecksumAlgo.SHA256,
    digest="e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855",
)


def _cover() -> NewLeaf:
    return NewLeaf(leaf_id="m1.cover.001", title="Cover", href="m1/us/cl.pdf",
                   checksum=GOOD_CHECKSUM)


def test_valid_submission_round_trips_against_its_own_schema() -> None:
    sub = Submission(
        ectd_spec_version="us-regional-3.3", sequence="0000",
        module_1=FDAModule1(submission_type="ind", application_number="123456",
                            cover_letter=_cover()),
        harmonized_modules=[HarmonizedModule(module_number=3)],
    )
    schema = build_submission_schema()
    payload = json.loads(sub.model_dump_json())
    assert validate_payload(payload, schema) == []


def test_replace_requires_modified_leaf_id() -> None:
    with pytest.raises(ValidationError):
        ReplaceLeaf(leaf_id="m3.spec.002", title="Spec", href="m3/spec.pdf",
                    checksum=GOOD_CHECKSUM)  # modified_leaf_id missing


def test_new_leaf_forbids_modified_leaf_id() -> None:
    with pytest.raises(ValidationError):
        NewLeaf(leaf_id="m3.spec.003", title="Spec", href="m3/spec.pdf",
                checksum=GOOD_CHECKSUM, modified_leaf_id="m3.spec.001")  # extra=forbid


def test_unknown_region_is_rejected_by_discriminator() -> None:
    with pytest.raises(ValidationError):
        Submission.model_validate({
            "ectd_spec_version": "x", "sequence": "0000",
            "module_1": {"region": "pmda"},  # not fda|ema
        })


def test_traversal_href_is_rejected() -> None:
    with pytest.raises(ValidationError):
        NewLeaf(leaf_id="m1.bad.001", title="Bad", href="../../etc/passwd",
                checksum=GOOD_CHECKSUM)


def test_duplicate_harmonized_module_is_rejected() -> None:
    with pytest.raises(ValidationError):
        Submission(
            ectd_spec_version="x", sequence="0000",
            module_1=FDAModule1(submission_type="ind", application_number="123456",
                                cover_letter=_cover()),
            harmonized_modules=[HarmonizedModule(module_number=3),
                                HarmonizedModule(module_number=3)],
        )

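Run them with pytest -q. The first test is the load-bearing one: a payload the models produce is validated against the schema they emit, so a regression that makes the two disagree on a valid payload fails the build. It does not show the reverse. Rules implemented as Python validators (the href check, module uniqueness) are not expressed in the emitted JSON Schema, so a payload checked only with jsonschema still has to pass through Submission.model_validate before export.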

Validation and design checklist

extra="forbid" on every model so the schema emits additionalProperties: false.
Lifecycle operations modeled as a discriminated union; replace/append/delete require modified_leaf_id, new forbids it.
Module 1 modeled per region behind a region discriminator; Modules 2-5 share one harmonized shape.
eCTD spec version, region rules, and checksum algorithm read from configuration, never hardcoded.
Draft202012Validator built via validator_for + check_schema, with a FormatChecker attached.
Leaf href is submission-relative with no traversal segments.
Checksums computed from real file bytes via streaming reads, with algorithm stored alongside the digest.
Validation returns all errors with stable JSON paths for downstream categorization.
Every build/validate event appends an ALCOA+ audit record to a log whose storage enforces append-only access.

FAQ

Does this JSON model replace the eCTD XML backbone?

No. The eCTD transport remains the regulator-specified XML backbone with its DTDs. This model is the upstream, machine-checkable source of truth that an exporter renders into that backbone. Validating the JSON early catches structural and metadata errors before they reach official validation.

Why use a discriminated union instead of a single Module 1 model with optional fields?

Because Module 1 genuinely differs by region. A single model full of optional fields cannot express “application_number is required for FDA but meaningless for EMA,” so invalid combinations pass silently. A discriminator keyed on region makes each shape exact and produces clear validation errors.

Where do the real version numbers and validation criteria come from?

From each authority’s published region-specific specification and validation criteria, resolved through your regulatory data dictionary at runtime. This article keeps those values configurable on purpose so the model stays correct as specifications are revised. See Regulatory Data Dictionary Construction.

How should I surface the validation errors this produces?

Collect them with iter_errors, keep the JSON path stable, then group and prioritize them in a dedicated categorization stage. That pattern is detailed in Schema Validation & Error Categorization.
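As a first step toward that grouping, the "path: message" strings returned by validate_payload can be bucketed by their top-level path segment. The helper below is a hypothetical sketch and assumes the path itself contains no ": " sequence.

```python
from __future__ import annotations

from collections import defaultdict


def group_by_top_level(errors: list[str]) -> dict[str, list[str]]:
    """Bucket 'path: message' strings by the first segment of the path."""
    grouped: dict[str, list[str]] = defaultdict(list)
    for line in errors:
        path, _, _message = line.partition(": ")
        top = path.split("/", 1)[0]
        grouped[top].append(line)
    # Sort keys so the report order is deterministic.
    return dict(sorted(grouped.items()))
```

A categorization stage would typically go further, for example by error keyword or severity, but even this split separates Module 1 problems from harmonized-module problems.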

