FDA/EMA Submission Schema Design

Designing the data schemas that turn internal clinical-trial records into FDA eCTD and EMA CTD submissions: how to model the five CTD modules, keep Module 1 region-specific, validate against a strict contract, and map your operational data into a submission-ready structure without manual reconciliation.

A clinical submission is not a folder of PDFs — it is a structured electronic dossier governed by the ICH Common Technical Document (CTD) and its electronic implementation, the electronic Common Technical Document (eCTD). When you automate site activation and regulatory filing, the schema you design becomes the contract that every downstream document, validation gate, and routing decision depends on. Get the schema right and the rest of the pipeline becomes deterministic; get it wrong and you inherit silent data drift, rejected sequences, and audit findings.

This page sits inside the Core Architecture & Regulatory Mapping for Clinical Trials section. It maps the design space; the deep, code-first walkthrough lives in the companion guide, Building FDA eCTD-compliant JSON schemas for clinical trials.

The problem this schema solves

Every regulatory affairs team maintains the same data twice: once in operational systems (CTMS, EDC, document management) and once in the shape a regulator will accept. The gap between those two representations is where submissions rot. A study coordinator files a document under an internal category; three weeks later someone hand-places it into an eCTD folder, guesses at a module path, forgets to bump the lifecycle operation, and the sequence bounces at the gateway. The compliance stakes are concrete: a rejected sequence delays a filing window, and an untraceable transformation is an ALCOA+ finding waiting to happen.

The fix is to make the submission structure a typed, validated artifact — the single source of truth that internal data is mapped into exactly once, before anything is transmitted. This page walks the design of that artifact end to end: the CTD as a data model, how a record is routed to the correct regional variant, the Python schema, the JSON Schema contract, the mapping layer, and how all of it feeds the append-only audit log that 21 CFR Part 11 expects.

The CTD as a data model

The CTD organizes a marketing or investigational application into five modules. Modules 2 through 5 are harmonized across ICH regions (originally the United States, the European Union, and Japan), while Module 1 is region-specific: its contents and structure are defined by each regulator, not by ICH. That single fact drives the most important design decision in submission schema work: a shared, jurisdiction-agnostic core (Modules 2–5) plus a swappable regional envelope (Module 1).

A practical way to read this for schema purposes:

| Module | Scope | Harmonized? | Schema implication |
| --- | --- | --- | --- |
| 1 | Regional administrative information and prescribing/product information | No (region-specific) | Model as a discriminated variant keyed on jurisdiction |
| 2 | CTD summaries (quality, nonclinical, clinical overviews) | Yes | Shared base model |
| 3 | Quality (chemistry, manufacturing, controls) | Yes | Shared base model |
| 4 | Nonclinical study reports | Yes | Shared base model |
| 5 | Clinical study reports | Yes | Shared base model |
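
The same table can be carried into code as a small lookup, so the harmonized/regional split is a property of data rather than of scattered conditionals. The following is a minimal standard-library sketch; the names `CtdModule`, `CTD_MODULES`, and `is_regional` are illustrative and do not come from any specification.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CtdModule:
    """One row of the CTD module table."""
    number: int
    scope: str
    harmonized: bool


CTD_MODULES = {
    1: CtdModule(1, "Regional administrative and product information", False),
    2: CtdModule(2, "CTD summaries", True),
    3: CtdModule(3, "Quality", True),
    4: CtdModule(4, "Nonclinical study reports", True),
    5: CtdModule(5, "Clinical study reports", True),
}


def is_regional(module_path: str) -> bool:
    """True when a path such as 'm1/us/cover-letter' sits in a non-harmonized module."""
    number = int(module_path[1])  # paths start with 'm' followed by the module digit
    return not CTD_MODULES[number].harmonized
```

A check like `is_regional(path)` is one way to decide whether a leaf needs the regional envelope's rules or only the shared core's.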

eCTD is the electronic format used to assemble and exchange these modules with regulators. The FDA uses the Electronic Submissions Gateway as its transmission channel and has long required eCTD format for many application types; the EMA and EU national agencies likewise mandate electronic submission. Newer eCTD specifications exist, but adoption and required versions differ by region and application type — so treat the target version as configuration, never as a hard-coded constant. (Do not assume a specific version number applies everywhere; confirm the current requirement per submission.)
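
One way to keep the target version and region out of the code is to read them once, at startup, from the environment. This is a sketch under stated assumptions: the variable names `SUBMISSION_JURISDICTION` and `SUBMISSION_ECTD_VERSION` and the `TargetConfig` type are hypothetical, and the version string is passed through unvalidated because the accepted values depend on the regulator and application type.

```python
import os
from dataclasses import dataclass
from typing import Mapping

SUPPORTED_JURISDICTIONS = frozenset({"FDA", "EMA"})


@dataclass(frozen=True)
class TargetConfig:
    """Submission target resolved from configuration, never from literals."""
    jurisdiction: str
    ectd_version: str


def load_target_config(env: Mapping[str, str] = os.environ) -> TargetConfig:
    """Read the submission target from the environment; fail loud if unset."""
    try:
        jurisdiction = env["SUBMISSION_JURISDICTION"].strip().upper()
        ectd_version = env["SUBMISSION_ECTD_VERSION"].strip()
    except KeyError as missing:
        raise RuntimeError(f"missing required setting: {missing.args[0]}") from None
    if jurisdiction not in SUPPORTED_JURISDICTIONS:
        raise RuntimeError(f"unsupported jurisdiction: {jurisdiction!r}")
    if not ectd_version:
        raise RuntimeError("SUBMISSION_ECTD_VERSION must not be empty")
    return TargetConfig(jurisdiction, ectd_version)
```

Accepting any mapping rather than only `os.environ` keeps the loader testable without touching the process environment.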

Routing a record to the right schema

Before any code runs, be explicit about the branching logic a record passes through on its way to a submission-ready leaf. Two decisions dominate: which jurisdiction’s Module 1 envelope applies, and what the validator says about the result. Everything downstream — audit record, routing, quarantine — hangs off those branches.

The important property of this flow is that there is exactly one place where a category becomes a module path, exactly one place where a jurisdiction selects a Module 1 variant, and exactly one place where structure is judged. Fan any of those decisions out across the codebase and you lose the ability to prove, under inspection, why a given document landed where it did.
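
The three single decision points can be sketched as three small functions and one orchestrator. Everything here is a hypothetical stand-in: the lookup tables are truncated, the variant names are plain strings, and the `validate` callable represents whatever validator the pipeline actually uses.

```python
from typing import Callable

# Hypothetical, truncated stand-ins for the governed tables.
CATEGORY_TO_MODULE = {"clinical_study_report": "m5/5-3-5/clinical-study-report"}
MODULE1_VARIANTS = {"FDA": "FdaModule1", "EMA": "EmaModule1"}


def resolve_module_path(category: str) -> str:
    """The only place a category becomes a module path."""
    return CATEGORY_TO_MODULE[category]  # KeyError on an unmapped category


def select_module1_variant(jurisdiction: str) -> str:
    """The only place a jurisdiction selects a Module 1 variant."""
    return MODULE1_VARIANTS[jurisdiction]


def route(record: dict, jurisdiction: str,
          validate: Callable[[dict], list[str]]) -> dict:
    """Run the decisions in order and return a routing outcome."""
    candidate = {
        "module_path": resolve_module_path(record["category"]),
        "module1_variant": select_module1_variant(jurisdiction),
    }
    errors = validate(candidate)  # the only place structure is judged
    outcome = "quarantined" if errors else "validated"
    return {**candidate, "outcome": outcome, "errors": errors}
```

Because each decision has exactly one call site, the returned dictionary is enough to explain after the fact why a document landed where it did.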

Library and tooling landscape

Schema modeling in Python has several viable substrates, but a clinical-grade contract has non-negotiable requirements: closed models (unknown fields rejected, not dropped), discriminated unions for the regional envelope, and a standards-track JSON Schema export so non-Python partners can validate the same contract. That narrows the field.

| Concern | Common options | Recommended for clinical-grade use |
| --- | --- | --- |
| Schema modeling | Pydantic v2, `marshmallow`, `attrs` + hand-written validation, plain `dataclasses` + `jsonschema` | Pydantic v2: closed models via `extra="forbid"`, native discriminated unions, and 2020-12 JSON Schema export in one library |
| Cross-language contract | `model_json_schema()` (Pydantic), hand-authored JSON Schema | Pydantic-generated 2020-12 schema: one source of truth, no drift between the model and the contract |
| Raw-payload validation | `jsonschema` (Draft 2020-12), `fastjsonschema` | `jsonschema` `Draft202012Validator`: `iter_errors` yields the full error set for downstream categorization |
| Checksums / integrity | `hashlib` (stdlib), external tools | `hashlib` SHA-256, streamed over file bytes: no dependency, provably matches the transmitted artifact |
| eCTD packaging | Commercial publishing suites, in-house renderer | Keep packaging behind the validated schema; render from the contract, never author the XML backbone by hand |

Deprecated / unsafe: do not use. Pydantic v1 is end-of-life, and its idioms do not carry over cleanly: @validator and class Config are deprecated in v2 (they still run but emit deprecation warnings), and Field(regex=...) has been removed. Migrate to @field_validator, model_config = ConfigDict(...), and Field(pattern=...). Avoid marshmallow for this job unless you already run it site-wide: you would maintain the model and the JSON Schema separately, and they will drift. And never validate submissions with extra="ignore", which is Pydantic's default when no extra setting is configured: silently dropped fields are an ALCOA+ completeness failure.

The recommendation throughout this page is Pydantic v2 as the modeling layer and jsonschema as the interoperable validator, with hashlib supplying integrity — three well-maintained pieces, no bespoke schema engine to audit.

A layered schema strategy

Treat the submission as a typed tree, not a free-form bag of metadata. Three layers keep the design honest:

Core layer — fields common to every CTD submission: a stable document identifier, the module path, a semantic document version, lifecycle operation (new, replace, append, delete), checksum, and a media type. These never vary by region.
Regional layer — the Module 1 envelope, modeled as a discriminated union so the validator selects the correct sub-schema from a jurisdiction discriminator. This is where FDA cover forms and EU application-form fields live, and where they stay isolated from the core.
Mapping layer — the translation from your internal systems (CTMS, EDC, document management) into core + regional fields. This is the part teams under-invest in, and the part that causes the most rework.

The goal is that adding a new region means adding one Module 1 variant — not rewriting validation for Modules 2–5.

Step-by-step: modeling the schema in Pydantic v2

The build proceeds in four stages. The following uses current Pydantic v2 APIs: model_config = ConfigDict(...), field_validator, Annotated[..., Field(pattern=...)], and a discriminated union via Field(discriminator=...).

1. Constrain the primitives and the regional variants

Model the core node’s constrained string types first, then the two region-specific Module 1 envelopes joined into a discriminated union. Configuration that varies by environment — the target jurisdiction, the eCTD version — is read from environment variables at the edge of the system, never hard-coded into the model.

"""CTD/eCTD submission schema (Pydantic v2).

Models a jurisdiction-agnostic core for Modules 2-5 plus a region-specific
Module 1 envelope selected by a discriminator. Designed to be the single
source of truth that internal data is mapped into before validation.
"""
from __future__ import annotations

from datetime import datetime, timezone
from enum import Enum
from typing import Annotated, Literal, Union
from uuid import uuid4

from pydantic import BaseModel, ConfigDict, Field, field_validator


class Jurisdiction(str, Enum):
    FDA = "FDA"
    EMA = "EMA"


class Operation(str, Enum):
    """eCTD lifecycle operations for a leaf document."""
    NEW = "new"
    REPLACE = "replace"
    APPEND = "append"
    DELETE = "delete"


# Module path like "m1/us/...", "m3/...", "m5/..." -- Module 1 is regional.
ModulePath = Annotated[str, Field(pattern=r"^m[1-5](/[a-z0-9._-]+)+$")]
SemVer = Annotated[str, Field(pattern=r"^\d+\.\d+\.\d+$")]
Sha256Hex = Annotated[str, Field(pattern=r"^[0-9a-f]{64}$")]


class FdaModule1(BaseModel):
    """FDA region-specific Module 1 administrative envelope."""
    model_config = ConfigDict(extra="forbid")

    jurisdiction: Literal[Jurisdiction.FDA] = Jurisdiction.FDA
    application_number: Annotated[str, Field(pattern=r"^(IND|NDA|BLA|ANDA)\d{4,6}$")]
    cover_form_present: bool = True


class EmaModule1(BaseModel):
    """EMA/EU region-specific Module 1 administrative envelope."""
    model_config = ConfigDict(extra="forbid")

    jurisdiction: Literal[Jurisdiction.EMA] = Jurisdiction.EMA
    eu_procedure_number: Annotated[str, Field(min_length=3, max_length=64)]
    application_form_present: bool = True


# Discriminated union: the validator picks the variant from `jurisdiction`.
Module1 = Annotated[Union[FdaModule1, EmaModule1], Field(discriminator="jurisdiction")]
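
To see what the three constrained string types accept before wiring them into a model, the same patterns can be exercised with the standard-library `re` module. This is only a sketch for exploring the patterns: the helper name `conforms` is illustrative, and Pydantic applies its own regex engine, so treat this as a check on the patterns themselves rather than on Pydantic's behavior.

```python
import re

# The same patterns used by ModulePath, SemVer, and Sha256Hex above.
MODULE_PATH = re.compile(r"^m[1-5](/[a-z0-9._-]+)+$")
SEMVER = re.compile(r"^\d+\.\d+\.\d+$")
SHA256_HEX = re.compile(r"^[0-9a-f]{64}$")


def conforms(pattern: re.Pattern, value: str) -> bool:
    """True when the whole value matches the pattern."""
    return pattern.fullmatch(value) is not None


samples = {
    "m5/5-3-5/clinical-study-report": conforms(MODULE_PATH, "m5/5-3-5/clinical-study-report"),
    "m6/extra": conforms(MODULE_PATH, "m6/extra"),                    # no Module 6
    "m3": conforms(MODULE_PATH, "m3"),                                # needs a sub-path
    "m3/Drug-Substance": conforms(MODULE_PATH, "m3/Drug-Substance"),  # uppercase rejected
}
```

Note that the module-path pattern is lowercase-only, so any identifier interpolated into a path has to be lowercase as well or the leaf will fail validation.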

2. Model the leaf and its lifecycle metadata

Each eCTD leaf carries the integrity and provenance fields that make it inspectable: a stable id, a checksum, and a timezone-aware effective date. Naive datetimes are rejected outright so an audit timestamp can never be ambiguous.

class SubmissionLeaf(BaseModel):
    """A single eCTD leaf document plus its lifecycle metadata."""
    model_config = ConfigDict(extra="forbid", use_enum_values=True)

    document_id: str = Field(default_factory=lambda: str(uuid4()))
    module_path: ModulePath
    version: SemVer
    operation: Operation = Operation.NEW
    checksum_sha256: Sha256Hex
    media_type: str = "application/pdf"
    effective_date: datetime

    @field_validator("effective_date")
    @classmethod
    def _must_be_tz_aware(cls, value: datetime) -> datetime:
        """Reject naive datetimes so audit timestamps are unambiguous (UTC)."""
        if value.tzinfo is None:
            raise ValueError("effective_date must be timezone-aware (use UTC)")
        return value.astimezone(timezone.utc)
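
The normalization the validator performs can be tried in isolation with the standard library. This is a minimal sketch; `to_utc` is an illustrative helper, not part of the model above, and the sample timestamp is made up.

```python
from datetime import datetime, timedelta, timezone


def to_utc(value: datetime) -> datetime:
    """Reject naive datetimes; convert aware ones to UTC."""
    if value.tzinfo is None or value.utcoffset() is None:
        raise ValueError("timestamp must be timezone-aware")
    return value.astimezone(timezone.utc)


# 09:30 at UTC-05:00 is the same instant as 14:30 UTC.
local = datetime(2024, 3, 1, 9, 30, tzinfo=timezone(timedelta(hours=-5)))
normalized = to_utc(local)
```

The instant is preserved and only the representation changes, which is what makes stored audit timestamps comparable across sites. Pydantic v2 also provides an `AwareDatetime` type that rejects naive values at the type level, which may suit fields that need no further normalization.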

3. Assemble the submission

The top-level model binds a regional Module 1 to a non-empty list of harmonized leaves. The same Submission model validates both an FDA and an EMA dossier; only the module1 branch differs.

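from pydantic import ValidationInfo  # fold into the pydantic import at the top of the module

# Regional folder that each jurisdiction's Module 1 leaves must sit under.
# Assumption: "m1/us" and "m1/eu" are the regional prefixes in use; confirm
# them against the regional eCTD specification you are targeting.
MODULE1_PREFIX: dict[Jurisdiction, str] = {
    Jurisdiction.FDA: "m1/us/",
    Jurisdiction.EMA: "m1/eu/",
}


class Submission(BaseModel):
    """Top-level submission: regional Module 1 envelope + leaf documents."""
    model_config = ConfigDict(extra="forbid")

    sequence: Annotated[str, Field(pattern=r"^\d{4}$")]
    module1: Module1
    leaves: list[SubmissionLeaf] = Field(min_length=1)

    @field_validator("leaves")
    @classmethod
    def _module1_leaves_match_region(
        cls, leaves: list[SubmissionLeaf], info: ValidationInfo
    ) -> list[SubmissionLeaf]:
        """Module 1 leaves must sit under the declared region's m1/ folder."""
        module1 = info.data.get("module1")
        if module1 is None:  # module1 failed validation; that error is reported separately
            return leaves
        prefix = MODULE1_PREFIX[module1.jurisdiction]
        for leaf in leaves:
            path = leaf.module_path
            if path.startswith("m1/") and not path.startswith(prefix):
                raise ValueError(f"{path!r} is outside {prefix!r} for {module1.jurisdiction}")
        return leaves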

Two design notes worth calling out:

extra="forbid" makes the schema closed. Unknown fields raise an error instead of being silently dropped — essential for ALCOA+ completeness and traceability.
The discriminated union is the mechanism that keeps Module 1 region-specific. Adding Japan (PMDA) later means adding one more variant to the union, not touching the core.
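
Adding a region can be sketched as follows. To stay self-contained, this example uses plain string literals for the discriminator instead of the `Jurisdiction` enum, and trimmed-down envelopes; `PmdaModule1` and its `pmda_reference` field are hypothetical placeholders, not fields taken from a PMDA specification.

```python
from typing import Annotated, Literal, Union

from pydantic import BaseModel, ConfigDict, Field, TypeAdapter, ValidationError


class FdaModule1(BaseModel):
    model_config = ConfigDict(extra="forbid")
    jurisdiction: Literal["FDA"] = "FDA"
    application_number: str


class EmaModule1(BaseModel):
    model_config = ConfigDict(extra="forbid")
    jurisdiction: Literal["EMA"] = "EMA"
    eu_procedure_number: str


class PmdaModule1(BaseModel):
    """Hypothetical Japan envelope; the field below is a placeholder."""
    model_config = ConfigDict(extra="forbid")
    jurisdiction: Literal["PMDA"] = "PMDA"
    pmda_reference: str  # illustrative name only


# The only change to existing code: one more member in the union.
Module1 = Annotated[
    Union[FdaModule1, EmaModule1, PmdaModule1],
    Field(discriminator="jurisdiction"),
]
module1_adapter = TypeAdapter(Module1)

envelope = module1_adapter.validate_python(
    {"jurisdiction": "PMDA", "pmda_reference": "EXAMPLE-001"}
)
```

The validator reads `jurisdiction` first and checks the payload against that variant alone, so an FDA-shaped payload tagged as PMDA fails against the PMDA envelope rather than silently matching another branch.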

4. Emit the interoperable contract

Pydantic enforces structure at the Python layer, but partners rarely run your code. Export a JSON Schema so a CRO, a vendor, or a non-Python validator checks against the identical contract. This is covered in the next section.

Validating against a JSON Schema contract

For interoperability — sharing the contract with a vendor, a partner CRO, or a non-Python validator — emit a JSON Schema and validate raw payloads with the jsonschema library. Pydantic v2 generates a 2020-12 dialect schema directly. This raw-payload check is the same discipline the ingestion side applies in Schema Validation & Error Categorization; the difference here is that the contract is generated from the submission model itself, so it can never drift from what your code enforces.

"""Validate raw submission payloads against the JSON Schema contract.

Pydantic emits the contract; `jsonschema` validates arbitrary JSON without
needing to import the model. Errors are collected, not raised one-at-a-time.
"""
from typing import Any

from jsonschema import Draft202012Validator

from submission_schema import Submission  # the models defined above


def build_validator() -> Draft202012Validator:
    """Compile a reusable validator from the Pydantic-generated contract."""
    schema = Submission.model_json_schema()
    Draft202012Validator.check_schema(schema)  # fail fast on a bad contract
    return Draft202012Validator(schema)


def validate_payload(
    payload: dict[str, Any],
    validator: Draft202012Validator | None = None,
) -> list[dict[str, Any]]:
    """Return a list of categorized validation errors (empty list == valid).

    Pass a validator built once with build_validator() when checking many
    payloads; otherwise one is compiled for this call.
    """
    if validator is None:
        validator = build_validator()
    findings: list[dict[str, Any]] = []
    for error in sorted(validator.iter_errors(payload), key=lambda e: list(e.path)):
        findings.append({
            "field_path": "/".join(str(p) for p in error.path) or "<root>",
            "message": error.message,
            "validator": error.validator,  # name of the failing schema keyword
            "severity": "CRITICAL",  # structural failures block routing
        })
    return findings

Collecting all errors with iter_errors (rather than raising on the first) gives regulatory reviewers a complete, machine-readable list in one pass — which is what error categorization downstream depends on.
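
A complete findings list is easiest to act on once it is grouped. The sketch below assumes findings in the dictionary shape that `validate_payload` returns; `summarize_findings` is an illustrative helper, and the sample messages are hand-written rather than real `jsonschema` output.

```python
from collections import Counter, defaultdict
from typing import Any


def summarize_findings(findings: list[dict[str, Any]]) -> dict[str, Any]:
    """Group findings by field path and count them by failing keyword."""
    by_path: dict[str, list[str]] = defaultdict(list)
    for finding in findings:
        by_path[finding["field_path"]].append(finding["message"])
    return {
        "total": len(findings),
        "blocking": any(f["severity"] == "CRITICAL" for f in findings),
        "by_validator": dict(Counter(f["validator"] for f in findings)),
        "by_path": dict(by_path),
    }


# Hand-written sample in the shape validate_payload returns.
sample = [
    {"field_path": "leaves/0/version", "message": "version does not match the pattern",
     "validator": "pattern", "severity": "CRITICAL"},
    {"field_path": "leaves/0/checksum_sha256", "message": "checksum does not match the pattern",
     "validator": "pattern", "severity": "CRITICAL"},
    {"field_path": "<root>", "message": "sequence is required",
     "validator": "required", "severity": "CRITICAL"},
]
report = summarize_findings(sample)
```

A summary like this gives a reviewer the count per failing keyword and the affected paths in one object, without re-running validation.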

Mapping internal data into the submission format

The mapping layer is where most schema projects succeed or fail. Your CTMS, EDC, and document management systems do not speak CTD; they speak study IDs, site numbers, and document categories. A mapping function must translate those into module paths, lifecycle operations, and checksums — and it must be the only place that translation happens.

"""Map an internal document record into a validated CTD submission leaf."""
import hashlib
from datetime import datetime, timezone
from pathlib import Path

from submission_schema import Operation, SubmissionLeaf

# Internal document categories -> CTD module paths. Keep this table version
# controlled; it is the contract between operational systems and the dossier.
CATEGORY_TO_MODULE: dict[str, str] = {
    "clinical_overview": "m2/clinical-overview",
    "quality_summary": "m2/quality-overall-summary",
    "drug_substance": "m3/3-2-s/drug-substance",
    "nonclinical_report": "m4/4-2/study-report",
    "clinical_study_report": "m5/5-3-5/clinical-study-report",
}


def sha256_of(path: Path) -> str:
    """Stream a file to compute its SHA-256 checksum without loading it fully."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 16), b""):
            digest.update(chunk)
    return digest.hexdigest()


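def map_record_to_leaf(record: dict, source_file: Path) -> SubmissionLeaf:
    """Translate an internal document record into a validated CTD leaf.

    Raises:
        KeyError: if the record's category has no CTD mapping (fail loud,
            never guess a module path).
        pydantic.ValidationError: if the resulting leaf violates the schema,
            for example a document id containing characters ModulePath rejects.
    """
    module_root = CATEGORY_TO_MODULE[record["category"]]
    document_id = str(record["document_id"])
    return SubmissionLeaf(
        # Carry the source id through so the leaf, its path, and its audit
        # record all refer to the same document (the default would mint a
        # fresh UUID on every mapping run).
        document_id=document_id,
        module_path=f"{module_root}/{document_id}.pdf",
        version=record["version"],
        operation=Operation(record.get("operation", "new")),
        checksum_sha256=sha256_of(source_file),
        effective_date=datetime.now(timezone.utc),
    )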

Two rules keep this safe: an unmapped category raises rather than guessing a destination, and the checksum is computed from the actual file bytes so the validated artifact and the transmitted artifact are provably identical. The controlled vocabulary behind CATEGORY_TO_MODULE should be governed centrally — see Regulatory Data Dictionary Construction for how to maintain that mapping table, and Regulatory Taxonomy Standardization for keeping categories consistent across global sites.
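
The "raise rather than guess" rule can be made more useful to the person who has to fix the mapping by raising a dedicated error that names the category. A minimal sketch: `UnmappedCategoryError` and `resolve_module_root` are hypothetical names, and the table is a one-entry stand-in for the governed dictionary.

```python
class UnmappedCategoryError(KeyError):
    """Raised when an internal category has no governed CTD mapping."""


def resolve_module_root(category: str, table: dict[str, str]) -> str:
    """Look up a module root, failing loud with the governed categories listed."""
    try:
        return table[category]
    except KeyError:
        known = ", ".join(sorted(table))
        raise UnmappedCategoryError(
            f"no CTD mapping for category {category!r}; governed categories: {known}"
        ) from None


# One-entry stand-in for the version-controlled mapping table.
table = {"clinical_overview": "m2/clinical-overview"}
root = resolve_module_root("clinical_overview", table)
```

Subclassing `KeyError` keeps any existing `except KeyError` handling working while making the failure searchable in logs by its own name.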

Validation and audit-trail integration

A validated leaf is only half the deliverable; the other half is proof that the transformation happened correctly. Every successful mapping and every validation outcome must emit a record to the append-only audit log required by 21 CFR Part 11 — and that record must be free of protected health information. Store the identity and integrity of the artifact, not its body.

"""Emit a PHI-free audit record for each schema decision."""
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass(frozen=True, slots=True)
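class SchemaAuditRecord:
    """Immutable, PHI-free record of one schema transformation.

    Freezing prevents in-process mutation only; tamper evidence has to come
    from the append-only log the record is written to.
    """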
    document_id: str
    module_path: str
    checksum_sha256: str
    outcome: str          # "validated" | "quarantined" | "deprecation"
    error_category: str | None
    recorded_at: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc)
    )


def audit_from_leaf(leaf: "SubmissionLeaf", outcome: str,
                    error_category: str | None = None) -> SchemaAuditRecord:
    """Build the audit record that accompanies a leaf into the log."""
    return SchemaAuditRecord(
        document_id=leaf.document_id,
        module_path=leaf.module_path,
        checksum_sha256=leaf.checksum_sha256,
        outcome=outcome,
        error_category=error_category,
    )

The record carries the document id, module path, checksum, and outcome — enough to reconstruct exactly what was submitted and why, and to reconcile against a gateway receipt later, without ever exposing patient data. The same append-only log is inherited by every downstream stage, so a schema decision made here is traceable through submission and routing.
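
One common way to make an append-only log tamper-evident is to chain entries by hash, so altering any earlier record invalidates every later one. The source does not prescribe a mechanism, so the following is an assumption-labeled sketch: `append_entry`, `verify_chain`, and the in-memory list are illustrative, records are assumed to be JSON-serializable (timestamps as ISO strings), and hash chaining by itself is not a claim of Part 11 compliance.

```python
import hashlib
import json

GENESIS = "0" * 64  # placeholder previous-hash for the first entry


def _digest(prev_hash: str, record: dict) -> str:
    """Hash the previous entry's hash together with a canonical record body."""
    body = json.dumps(record, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256((prev_hash + body).encode("utf-8")).hexdigest()


def append_entry(log: list[dict], record: dict) -> dict:
    """Append a record, linking it to the hash of the previous entry."""
    prev_hash = log[-1]["entry_hash"] if log else GENESIS
    entry = {
        "record": record,
        "prev_hash": prev_hash,
        "entry_hash": _digest(prev_hash, record),
    }
    log.append(entry)
    return entry


def verify_chain(log: list[dict]) -> bool:
    """Recompute every link; any edited or reordered entry breaks the chain."""
    prev_hash = GENESIS
    for entry in log:
        if entry["prev_hash"] != prev_hash:
            return False
        if entry["entry_hash"] != _digest(prev_hash, entry["record"]):
            return False
        prev_hash = entry["entry_hash"]
    return True
```

In a real deployment the entries would go to durable, write-once storage rather than a Python list; the chain only detects changes, it does not prevent them.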

Error categorization and recovery

Not every validation failure means the same thing, and treating them uniformly is how filing windows get burned. Classify each finding into one of the four failure classes below and route it by severity.

| Failure class | How to detect it programmatically | Recovery strategy |
| --- | --- | --- |
| Structural (CRITICAL) | `jsonschema` `iter_errors` returns entries; Pydantic `ValidationError` raised | Block routing, quarantine the payload, escalate. Never transmit: the sequence would be rejected at the gateway anyway. |
| Unmapped category (CRITICAL) | `KeyError` from `CATEGORY_TO_MODULE` lookup | Fail loudly. Add the mapping to the governed dictionary and re-run; do not guess a module path. |
| Deprecation (WARNING) | Legacy field present but tolerated by a compatibility shim | Pass the leaf and log a deprecation finding so historical sequences still validate. Promote to CRITICAL only at a major schema version. |
| Integrity mismatch (CRITICAL) | Recomputed SHA-256 differs from a stored checksum | Quarantine: the validated artifact and the file on disk are not the same bytes. Investigate before any resubmission. |

The recovery contract is deliberately blunt: structural and integrity failures are deterministic, so retrying them unchanged only wastes the window — quarantine and escalate. Only additive, deprecation-class findings are allowed to pass with a logged warning. When a validated submission reaches a gateway that is itself down, that is a transport failure, not a schema one — hand it to Fallback Routing for Portal Outages, which owns retries and durable queueing.
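
The classification and the recovery decision can be sketched as two small pure functions. The names `FailureClass`, `classify`, and `decide`, and the outcome strings, are illustrative assumptions; the inputs stand for signals the earlier stages already produce (schema findings, an unmapped-category flag, legacy fields seen by a shim, and the two checksums).

```python
from enum import Enum


class FailureClass(str, Enum):
    STRUCTURAL = "structural"
    UNMAPPED_CATEGORY = "unmapped_category"
    INTEGRITY_MISMATCH = "integrity_mismatch"
    DEPRECATION = "deprecation"


SEVERITY = {
    FailureClass.STRUCTURAL: "CRITICAL",
    FailureClass.UNMAPPED_CATEGORY: "CRITICAL",
    FailureClass.INTEGRITY_MISMATCH: "CRITICAL",
    FailureClass.DEPRECATION: "WARNING",
}


def classify(schema_errors: list, unmapped: bool, legacy_fields: list,
             stored_checksum: str, recomputed_checksum: str) -> list[FailureClass]:
    """Translate raw signals into the failure classes from the table."""
    found = []
    if schema_errors:
        found.append(FailureClass.STRUCTURAL)
    if unmapped:
        found.append(FailureClass.UNMAPPED_CATEGORY)
    if stored_checksum != recomputed_checksum:
        found.append(FailureClass.INTEGRITY_MISMATCH)
    if legacy_fields:
        found.append(FailureClass.DEPRECATION)
    return found


def decide(found: list[FailureClass]) -> str:
    """Any CRITICAL class quarantines; warnings alone pass with a log entry."""
    if any(SEVERITY[item] == "CRITICAL" for item in found):
        return "quarantine"
    return "pass_with_warning" if found else "pass"
```

Keeping `decide` free of retry logic mirrors the contract in the text: deterministic failures are never retried here, and transport problems are handled elsewhere.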

Versioning and schema evolution

Trials run for years; your schema will change underneath them. Govern it like an API:

Apply semantic versioning to the schema itself, not just to documents.
Make the target eCTD version and region a configuration input (read from the environment), never a literal in code.
Add new fields as optional with defaults; promote to required only in a major version.
Emit a deprecation finding (not a hard failure) when legacy fields appear, so historical sequences still validate.
Pin the JSON Schema dialect (here, 2020-12) and check it with check_schema in CI.
Keep extra="forbid" so additive drift is caught immediately.

This discipline lets a five-year-old sequence and a brand-new one validate against compatible contracts without forking the codebase.
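
Because a closed schema rejects unknown fields, legacy field names have to be translated before validation if historical payloads are to pass with only a warning. The sketch below shows one way to do that; the legacy name `sha256`, the rename table, and both helper names are hypothetical.

```python
# Hypothetical legacy-to-current field names; govern this table like the schema.
LEGACY_FIELD_RENAMES = {"sha256": "checksum_sha256"}


def apply_compat_shim(payload: dict) -> tuple[dict, list[dict]]:
    """Rename legacy fields and return the upgraded payload plus warnings."""
    upgraded = dict(payload)  # never mutate the caller's payload
    findings = []
    for old, new in LEGACY_FIELD_RENAMES.items():
        if old in upgraded:
            value = upgraded.pop(old)
            upgraded.setdefault(new, value)  # a current-name value wins if both exist
            findings.append({
                "field_path": old,
                "message": f"'{old}' is deprecated; use '{new}'",
                "severity": "WARNING",
            })
    return upgraded, findings


def is_major_bump(old_version: str, new_version: str) -> bool:
    """True when the schema's major version increases (a breaking change)."""
    return int(new_version.split(".")[0]) > int(old_version.split(".")[0])
```

Run the shim before the closed-schema validation step, then log its findings alongside the validation result; at a major version bump the rename table is where entries get removed and the warning becomes a hard failure.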

Compliance checklist

Before a submission schema ships, confirm it upholds the ALCOA+ and 21 CFR Part 11 attributes it will be inspected against:

Every model uses extra="forbid" — unknown fields raise, never drop (complete)
All timestamps are timezone-aware UTC; naive datetimes are rejected (contemporaneous)
Checksums are SHA-256 over actual file bytes, matching the transmitted artifact (accurate, original)
Module 1 is a discriminated union keyed on jurisdiction; the core is untouched per region
Category-to-module mapping lives in one governed table; unmapped categories fail loud (attributable)
Target eCTD version and region come from environment configuration, not literals
The JSON Schema contract is generated from the model and checked with check_schema in CI
Every schema decision emits a PHI-free audit record to the append-only log (consistent, legible)
Structural and integrity failures are quarantined, not retried; only deprecations pass with a warning

Where this fits in the pipeline

Schema design is the front of a larger automation chain. Validated submissions still need to reach a regulator, and gateways have outages — design your routing to degrade gracefully, as covered in Fallback Routing for Portal Outages. And because submissions carry regulated content, every transformation must respect the data-protection model in Security Boundaries for Clinical Data. When you are ready to implement the schema end to end with worked examples, continue to the child guide: Building FDA eCTD-compliant JSON schemas for clinical trials.

FAQ

What is the difference between CTD and eCTD?

The CTD is the content and organization standard — the agreed five-module structure for a regulatory dossier defined by ICH. The eCTD is the electronic format used to assemble, transmit, and lifecycle-manage that content with a regulator. You design your schema against the CTD module structure, then render and submit it in eCTD format.

Why is Module 1 modeled separately from Modules 2 through 5?

Because Modules 2–5 are harmonized across ICH regions, while Module 1 is region-specific: each regulator defines its own administrative and product-information requirements. Modeling Module 1 as a discriminated variant keyed on jurisdiction lets a single core schema serve multiple regions, with only the regional envelope swapped per target.

Should I hard-code the eCTD version in my schema?

No. Required eCTD versions and specifications differ by region and application type and change over time. Treat the target version and region as configuration so a single codebase can produce compliant output for different regulators and different submission types without code changes.

How does this support 21 CFR Part 11 and data integrity?

A closed schema (extra="forbid"), timezone-aware timestamps, file-derived SHA-256 checksums, and a single governed mapping table support ALCOA+ principles — attributable, legible, contemporaneous, original, accurate, complete, and consistent records — and give you the provenance and completeness controls that Part 11 and EU GMP Annex 11 expect. The schema is the foundation; audit trails and e-signatures are layered on top in the routing and submission stages.

Related

Building FDA eCTD-compliant JSON schemas for clinical trials — the code-first, worked-example build of everything framed here.
Regulatory Data Dictionary Construction — how to govern the category-to-module mapping table this schema depends on.
Regulatory Taxonomy Standardization — keeping internal document categories consistent across global sites.
Schema Validation & Error Categorization — the severity tiering that classifies and routes the findings this contract produces.
Fallback Routing for Portal Outages — what happens to a validated submission when the gateway is down.
Security Boundaries for Clinical Data — the trust zones every regulated transformation must respect.

