Core Architecture & Regulatory Mapping for Clinical Trials

Clinical trial site activation and regulatory submission automation sits where rigid ICH, FDA, and EMA mandates meet a distributed data ecosystem. This guide is written for the people who have to make that collision deterministic: clinical-operations managers accountable for activation cycle time, regulatory-affairs engineers who own the submission of record, and the Python automation builders who turn 21 CFR Part 11, ALCOA+, ICH E6(R3), and eCTD requirements into an audit-ready system. It maps the reference architecture, the shared regulatory data model, and the production Python patterns that make every artifact reproducible and defensible.

The hard problem is not moving files. It is guaranteeing that every transformation, signature, and routing decision is reproducible, attributable, and defensible during an inspection. The sections below decompose that problem into nine domains, each owning a distinct concern, and show how they compose into one coherent platform. For the document-handling half of the same platform (parsing, OCR, schema validation, and batch ingestion), see the companion guide on Automated Document Ingestion & Validation Workflows, which feeds normalized artifacts into the ingestion layer described here.

How this platform is organized

Nine domains cover the full path from a raw site packet to an accepted submission sequence. Each has a dedicated deep dive; start with the domain that owns the failure you are trying to eliminate.

Domain	What it governs	Deep dive
Submission schema	eCTD structure, JSON/XML backbone, file-level validation	FDA/EMA Submission Schema Design
Taxonomy	Controlled vocabularies, code lists, cross-jurisdiction mapping	Regulatory Taxonomy Standardization
Data dictionary	Field-level definitions, value sets, lineage	Regulatory Data Dictionary Construction
Site readiness	Feasibility, infrastructure, GCP qualification gates	Clinical Site Readiness Assessment Frameworks
IRB/ethics	Submission state machine, human-in-the-loop review	IRB/Ethics Workflow Mapping
Resilience	Portal timeouts, retries, fallback routing	Fallback Routing for Portal Outages
Security	Network segmentation, PHI isolation, zero trust	Security Boundaries for Clinical Data
Scheduling	Timezone-aware deadlines, safety-report clocks, escalation	Submission Deadline & Scheduling Automation
Inspection	Audit-trail reconstruction, hash-chain verification, export	Inspection Readiness & Audit-Trail Reporting

Reference architecture

A maintainable clinical platform abandons the single shared database in favor of layered, event-driven services joined by an append-only audit log. Inputs — site qualification packets, protocol amendments, IRB approvals, regulatory clearances — enter an ingestion layer over authenticated APIs or SFTP, where each artifact is hashed and stamped with provenance metadata before anything else touches it. A normalization layer maps heterogeneous payloads onto the canonical model defined by the data dictionary. A validation layer applies deterministic, jurisdiction-aware rules. A submission layer assembles the eCTD sequence and routes it to the correct portal over the submission gateway, applying a 21 CFR Part 11 electronic signature before transport.

Two properties make this topology compliant rather than merely tidy. First, the audit log is write-once: every layer emits an event but no layer can mutate a prior one, which is what makes the trail trustworthy under 21 CFR Part 11. Second, the boundaries between layers are also trust boundaries — PHI never crosses into a layer that does not need it, a constraint elaborated in Security Boundaries for Clinical Data.
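As a minimal sketch of the ingestion step described above, an artifact can be hashed and stamped before any other layer sees it. The `Provenance` record, the `ingest` function, and the channel names are assumptions made for illustration; they are not part of the platform's published interface.

```python
"""Sketch: hash and stamp an artifact at the ingestion boundary."""
import hashlib
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class Provenance:
    """Hypothetical provenance stamp; field names are illustrative."""

    sha256: str
    source: str
    channel: str
    received_at: datetime


def ingest(payload: bytes, source: str, channel: str) -> Provenance:
    """Hash the raw bytes first, then attach server-side metadata."""
    if channel not in {"api", "sftp"}:
        raise ValueError(f"Unsupported ingestion channel: {channel!r}")
    digest = hashlib.sha256(payload).hexdigest()
    return Provenance(
        sha256=digest,
        source=source,
        channel=channel,
        # The timestamp is taken on the server, never from the sender.
        received_at=datetime.now(timezone.utc),
    )


stamp = ingest(b"site qualification packet", source="site-001", channel="sftp")
```

Hashing the untouched bytes before normalization means every later derivation can point back to exactly what was received.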

The regulatory contract as non-functional requirements

Four regulatory frameworks bound every design decision on this platform. Treat them not as documentation to cite after the fact but as non-functional requirements the code must satisfy on every run:

21 CFR Part 11 governs electronic records and signatures. It requires attributable, tamper-evident records, secure time stamps, and access controls — implemented here as the hash-chained append-only audit log and authenticated actor identity on every event.
ALCOA+ is the data-integrity standard regulators apply to those records (detailed as a design contract below).
ICH E6(R3), the Good Clinical Practice revision adopted in 2025, sets expectations for computerized systems, data governance, and risk-proportionate validation — which is why activation gates and the IRB/Ethics Workflow Mapping are modeled as enforceable state machines rather than advisory checklists.
eCTD defines the electronic structure the submission layer must produce; encoding it as a schema, covered in FDA/EMA Submission Schema Design, catches structural defects before the portal does.

ALCOA+ as a design contract

ALCOA+ is the most useful checklist for architecture decisions because each attribute maps cleanly onto a testable system property. Treat each as a requirement, not a policy slogan:

Attributable — every event carries an authenticated actor and timestamp
Legible — records are human-readable and machine-parseable (UTF-8, ISO 8601)
Contemporaneous — events are written at the moment of action, server-side
Original — the first capture is preserved; derivations link back to it
Accurate — validated against the data dictionary before persistence
Complete — including failed attempts, retries, and overrides
Consistent — ordered, with timezone-aware, monotonic sequencing
Enduring — retained per protocol and archival policy
Available — retrievable for inspection without reconstruction

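A practical way to encode the contract is a typed, immutable audit event. The model below uses Pydantic v2 and stamps recorded_at server-side by default; because the field can still be set explicitly, the service that constructs events should never pass through a client-supplied timestamp: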

"""Append-only audit event for a 21 CFR Part 11 compliant clinical platform."""
from __future__ import annotations

import hashlib
from datetime import datetime, timezone
from enum import Enum

from pydantic import BaseModel, ConfigDict, Field


class AuditAction(str, Enum):
    INGESTED = "ingested"
    NORMALIZED = "normalized"
    VALIDATED = "validated"
    SUBMITTED = "submitted"
    REROUTED = "rerouted"


class AuditEvent(BaseModel):
    """One immutable entry in the audit trail.

    The event is frozen after construction so application code cannot
    backdate or mutate it, satisfying the ALCOA+ 'original' and
    'contemporaneous' attributes.
    """

    model_config = ConfigDict(frozen=True)

    actor_id: str = Field(..., min_length=1, description="Authenticated user or service identity")
    action: AuditAction
    artifact_sha256: str = Field(..., pattern=r"^[0-9a-f]{64}$")
    recorded_at: datetime = Field(default_factory=lambda: datetime.now(timezone.utc))
    prev_hash: str = Field(default="", pattern=r"^([0-9a-f]{64})?$")

    def chain_hash(self) -> str:
        """Hash this event together with its predecessor.

        Linking each entry to the previous hash turns the log into a
        tamper-evident chain: altering any record invalidates every hash
        after it.
        """
        payload = "|".join(
            (
                self.prev_hash,
                self.actor_id,
                self.action.value,
                self.artifact_sha256,
                self.recorded_at.isoformat(),
            )
        )
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()
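
To show how the chain is checked rather than only built, the sketch below walks a trail and reports the first broken link. It uses a plain dataclass named `Entry` as a hypothetical stand-in for `AuditEvent` (same hashed fields, same `|`-joined payload) so that it runs with the standard library alone; `verify_chain` is likewise an illustrative name.

```python
"""Sketch: verify a hash-chained audit trail (standard library only)."""
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class Entry:
    """Hypothetical stand-in for AuditEvent with the same hashed fields."""

    prev_hash: str
    actor_id: str
    action: str
    artifact_sha256: str
    recorded_at: str  # ISO 8601 string, as produced by datetime.isoformat()

    def chain_hash(self) -> str:
        payload = "|".join(
            (self.prev_hash, self.actor_id, self.action,
             self.artifact_sha256, self.recorded_at)
        )
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def verify_chain(entries: list[Entry]) -> int:
    """Return the index of the first broken link, or -1 if the chain holds."""
    expected_prev = ""  # the first entry has no predecessor
    for index, entry in enumerate(entries):
        if entry.prev_hash != expected_prev:
            return index
        expected_prev = entry.chain_hash()
    return -1


artifact = hashlib.sha256(b"protocol amendment").hexdigest()
first = Entry("", "svc-ingest", "ingested", artifact, "2025-01-06T09:00:00+00:00")
second = Entry(first.chain_hash(), "svc-validate", "validated", artifact,
               "2025-01-06T09:00:05+00:00")
trail = [first, second]

# Simulate tampering: rewrite the actor on the first record.
tampered = [Entry("", "someone-else", "ingested", artifact, first.recorded_at), second]
```

Here `verify_chain(trail)` reports no break, while the tampered trail breaks at the second entry because its `prev_hash` no longer matches. The final entry is only protected if its own chain hash is stored or anchored somewhere outside the log.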

Regulatory mapping: taxonomy, dictionary, schema

Rejected submissions typically fail on data semantics rather than on transport. The fix is three coordinated layers of regulatory metadata, each owning a distinct concern.

Taxonomy standardizes the controlled vocabularies — sponsor study phases, document types, country codes, IRB decision states — so that a single concept has one canonical code regardless of which site or system produced it. Doing this once, centrally, is what lets a global program reconcile data across regions; the patterns are covered in Regulatory Taxonomy Standardization.
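A minimal sketch of that resolution step follows. The vocabulary, the canonical codes, and the `resolve_doc_type` helper are all hypothetical; a real code list would be loaded from the governed taxonomy rather than written inline.

```python
"""Sketch: resolve site-specific labels to one canonical taxonomy code."""

# Hypothetical vocabulary for illustration only.
CANONICAL_DOC_TYPES: dict[str, frozenset[str]] = {
    "IRB_APPROVAL": frozenset(
        {"irb approval", "ethics approval", "ec favourable opinion"}
    ),
    "PROTOCOL_AMENDMENT": frozenset({"protocol amendment", "amendment"}),
}


def _normalize(label: str) -> str:
    """Lowercase, trim, and collapse internal whitespace."""
    return " ".join(label.strip().lower().split())


# Invert the mapping once and refuse a synonym claimed by two codes.
_SYNONYM_INDEX: dict[str, str] = {}
for code, synonyms in CANONICAL_DOC_TYPES.items():
    for synonym in synonyms:
        if _SYNONYM_INDEX.setdefault(_normalize(synonym), code) != code:
            raise ValueError(f"Synonym {synonym!r} maps to two canonical codes")


def resolve_doc_type(label: str) -> str:
    """Return the canonical code or raise; never guess at an unknown label."""
    try:
        return _SYNONYM_INDEX[_normalize(label)]
    except KeyError:
        raise LookupError(f"No canonical code for document type {label!r}") from None
```

Raising on an unknown label, instead of falling back to a default, keeps unmapped site vocabulary visible so it can be added to the taxonomy deliberately.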

The data dictionary binds each canonical field to its definition, data type, permitted value set, and downstream lineage. It is the authoritative source the normalization and validation layers consult, and it is version-controlled so that a schema change is reviewable against the regulatory update that motivated it. See Regulatory Data Dictionary Construction.
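The sketch below illustrates one way a dictionary entry could drive validation. `FieldSpec`, the two example fields, and the version label are assumptions made for this example; the point is only that type, value set, and lineage live in one versioned place.

```python
"""Sketch: validate a record against versioned data-dictionary entries."""
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class FieldSpec:
    """Hypothetical dictionary entry: type, permitted values, and lineage."""

    name: str
    py_type: type
    allowed: Optional[frozenset] = None
    lineage: str = ""


DICTIONARY_VERSION = "2025.1"  # illustrative version label
DICTIONARY: dict[str, FieldSpec] = {
    "country": FieldSpec(
        "country", str, frozenset({"US", "DE", "JP"}), "site packet, section 1"
    ),
    "enrollment_target": FieldSpec(
        "enrollment_target", int, None, "feasibility questionnaire"
    ),
}


def validate_record(record: dict[str, object]) -> list[str]:
    """Return human-readable errors; an empty list means the record passed."""
    errors: list[str] = []
    for name, spec in DICTIONARY.items():
        if name not in record:
            errors.append(f"{name}: missing")
            continue
        value = record[name]
        # bool is a subclass of int; this sketch has no boolean fields,
        # so booleans are rejected outright instead of being coerced.
        if isinstance(value, bool) or not isinstance(value, spec.py_type):
            errors.append(f"{name}: expected {spec.py_type.__name__}")
        elif spec.allowed is not None and value not in spec.allowed:
            errors.append(f"{name}: {value!r} not in permitted value set")
    for name in sorted(record.keys() - DICTIONARY.keys()):
        errors.append(f"{name}: not defined in dictionary {DICTIONARY_VERSION}")
    return errors
```

A string such as `"40"` is rejected for an integer field rather than converted, which matches the reject-rather-than-coerce rule used elsewhere in this guide.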

The submission schema expresses structure: the eCTD backbone, module placement, file naming, and PDF metadata that FDA and EMA pre-validation enforce. The U.S. FDA accepts marketing applications in eCTD format via the Electronic Submissions Gateway, and the EMA operates its own gateway for centralized procedures. Both organize content into Modules 1 through 5: Modules 2 through 5 follow the harmonized Common Technical Document, while Module 1 is the region-specific administrative module. Encoding these rules as Pydantic or JSON Schema validators catches structural defects at the transformation boundary rather than at the portal. See FDA/EMA Submission Schema Design.
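As a rough illustration of a structural check at the transformation boundary, the sketch below validates a single leaf path using only the standard library. The naming rule and length limit are stated here as assumptions for the example and should be confirmed against the current regional specification; `check_leaf` is a hypothetical helper.

```python
"""Sketch: structural checks on a leaf path before assembling a sequence."""
import re
from pathlib import PurePosixPath

# Top-level module folders: m1 is regional, m2 through m5 are harmonized.
MODULE_FOLDERS = frozenset({"m1", "m2", "m3", "m4", "m5"})

# Assumed naming rule for this sketch: lowercase letters, digits, and
# hyphens, one extension, at most 64 characters in total.
FILE_NAME = re.compile(r"^[a-z0-9-]+\.[a-z0-9]+$")
MAX_NAME_LENGTH = 64


def check_leaf(path: str) -> list[str]:
    """Return structural errors for one file path inside a sequence folder."""
    parts = PurePosixPath(path).parts
    if len(parts) < 2:
        return [f"{path}: expected <module>/.../<file>"]
    errors: list[str] = []
    if parts[0] not in MODULE_FOLDERS:
        errors.append(f"{path}: unknown module folder {parts[0]!r}")
    name = parts[-1]
    if len(name) > MAX_NAME_LENGTH or not FILE_NAME.match(name):
        errors.append(f"{path}: file name {name!r} violates the naming rule")
    return errors
```

Returning a list of errors instead of raising on the first one lets the validation layer report every structural defect in a sequence in a single pass.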

The relationship is strictly layered, and each layer depends only on the one beneath it: the submission schema depends on the data dictionary, which in turn depends on the taxonomy.

Core data model shared across every domain

The nine domains do not each invent their own entities; they operate on one small set of shared records, joined by stable identifiers and threaded through the audit log. Getting these entities right once is what keeps normalization, validation, and submission consistent. The table below is the contract every deep dive builds against.

Entity	Key fields	Owns / relates to
`Site`	`site_id`, `country`, `readiness_state`, `sops_version`	Feeds the activation state machine; qualified by site-readiness gates
`Artifact`	`artifact_id`, `sha256`, `doc_type` (taxonomy code), `source`, `received_at`	Every ingested document; hashed once, referenced everywhere
`RegulatoryConcept`	`concept_id`, `canonical_code`, `synonyms[]`, `jurisdiction`	Resolves any artifact’s `doc_type` to one canonical meaning
`ValidationResult`	`artifact_id`, `passed`, `errors[]`, `dictionary_version`	Binds an artifact to the data-dictionary rules it was checked against
`SubmissionSequence`	`sequence_id`, `application_id`, `ctd_module_map`, `target_gateway`	Assembles validated artifacts into an eCTD sequence
`AuditEvent`	`actor_id`, `action`, `artifact_sha256`, `recorded_at`, `prev_hash`	Immutable, hash-chained; references the artifact or sequence acted on

Three relationships carry most of the platform’s semantics: an Artifact resolves its doc_type through a RegulatoryConcept, a ValidationResult pins each Artifact to a specific dictionary_version, and every state change on any entity emits exactly one AuditEvent. Because Artifact.sha256 is the join key into the audit trail, the same hash proves provenance from ingestion all the way to the submitted SubmissionSequence.
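The sketch below models a few of these entities as frozen dataclasses and shows the `sha256` join in action. The field lists follow the table above; `AuditRecord` is a trimmed, hypothetical view of `AuditEvent`, and `history` is an illustrative helper.

```python
"""Sketch: shared entities and the sha256 join into the audit trail."""
import hashlib
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class Artifact:
    artifact_id: str
    sha256: str
    doc_type: str  # canonical taxonomy code
    source: str
    received_at: datetime


@dataclass(frozen=True)
class ValidationResult:
    artifact_id: str
    passed: bool
    dictionary_version: str
    errors: tuple[str, ...] = ()


@dataclass(frozen=True)
class AuditRecord:
    """Trimmed, hypothetical view of AuditEvent for this example."""

    actor_id: str
    action: str
    artifact_sha256: str


def history(artifact: Artifact, trail: list[AuditRecord]) -> list[str]:
    """List the actions recorded against one artifact, in trail order."""
    return [e.action for e in trail if e.artifact_sha256 == artifact.sha256]


digest = hashlib.sha256(b"irb approval letter").hexdigest()
artifact = Artifact("art-1", digest, "IRB_APPROVAL", "site-001",
                    datetime.now(timezone.utc))
result = ValidationResult("art-1", passed=True, dictionary_version="2025.1")
trail = [
    AuditRecord("svc-ingest", "ingested", digest),
    AuditRecord("svc-validate", "validated", digest),
    AuditRecord("svc-ingest", "ingested", "0" * 64),  # a different artifact
]
```

Because the join key is a content hash, `history(artifact, trail)` selects only the events for these exact bytes, regardless of which service wrote them.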

Site activation and IRB workflow as state machines

Site activation is inherently stateful: feasibility, contract execution, IRB or ethics approval, and regulatory clearance must complete in order, and an automated trigger should fire only when every prerequisite is genuinely met. Modeling activation and the IRB lifecycle as explicit finite-state machines makes the gates auditable and prevents illegal transitions — for example, dispatching study drug before the ethics committee has issued a favorable opinion.

The IRB transitions in particular must preserve a human-in-the-loop decision point; automation routes reminders and assembles packets but never manufactures an approval. The mapping from real submission workflows to enforceable state machines is detailed in IRB/Ethics Workflow Mapping, and the upstream qualification gates that feed the Feasibility state come from Clinical Site Readiness Assessment Frameworks.

A minimal, correct transition guard keeps the rules in one place:

"""Guarded transitions for the site-activation state machine."""
from __future__ import annotations

ALLOWED: dict[str, frozenset[str]] = {
    "feasibility": frozenset({"contract_execution"}),
    "contract_execution": frozenset({"irb_review"}),
    "irb_review": frozenset({"regulatory_clearance", "irb_review"}),
    "regulatory_clearance": frozenset({"activated"}),
    "activated": frozenset(),
}


def transition(current: str, target: str) -> str:
    """Return the next state or raise if the move is not permitted.

    Centralizing the allow-list prevents skipping a compliance gate such
    as activating a site before IRB clearance.
    """
    if current not in ALLOWED:
        raise ValueError(f"Unknown state: {current!r}")
    if target not in ALLOWED[current]:
        raise ValueError(f"Illegal transition {current!r} -> {target!r}")
    return target
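
The human-in-the-loop requirement for IRB decisions can be sketched the same way. The state names, the `Actor` record, and its `is_human` flag are assumptions for illustration; in a real system the flag would be derived from the authenticated identity, not supplied by the caller.

```python
"""Sketch: an IRB decision that only an authenticated human may record."""
from dataclasses import dataclass

# Hypothetical IRB lifecycle; state names are illustrative.
IRB_ALLOWED: dict[str, frozenset[str]] = {
    "drafted": frozenset({"submitted"}),
    "submitted": frozenset({"under_review"}),
    "under_review": frozenset({"approved", "changes_requested", "rejected"}),
    "changes_requested": frozenset({"submitted"}),
    "approved": frozenset(),
    "rejected": frozenset(),
}

# Transitions into these states represent a committee decision.
HUMAN_ONLY = frozenset({"approved", "changes_requested", "rejected"})


@dataclass(frozen=True)
class Actor:
    actor_id: str
    is_human: bool


def irb_transition(current: str, target: str, actor: Actor) -> str:
    """Apply the allow-list first, then the human-decision rule."""
    if target not in IRB_ALLOWED.get(current, frozenset()):
        raise ValueError(f"Illegal transition {current!r} -> {target!r}")
    if target in HUMAN_ONLY and not actor.is_human:
        raise PermissionError(
            f"{actor.actor_id!r} may not record an IRB decision"
        )
    return target
```

A service account can still move a packet from `submitted` to `under_review`, but any attempt by automation to record an outcome raises `PermissionError`.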

Resilience and operational continuity

Regulatory portals enforce rate limits, maintenance windows, and submission deadlines that do not move because a gateway is down. The platform must degrade safely: distinguish a permanent failure (a malformed sequence — fail fast and surface it) from a transient fault (a gateway timeout — retry with bounded exponential backoff and jitter), and route around an outage to an alternate channel or a durable queue without losing the submission’s state. These patterns, including circuit breaking and dead-letter handling, are the focus of Fallback Routing for Portal Outages.
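A minimal sketch of that classification follows. The two exception classes and `submit_with_fallback` are hypothetical names, and an in-memory `deque` stands in for what would need to be a durable queue in production.

```python
"""Sketch: classify failures, retry transient ones, queue on exhaustion."""
from collections import deque
from typing import Callable


class PermanentFailure(Exception):
    """The sequence itself is wrong; retrying cannot help."""


class TransientFailure(Exception):
    """The gateway is unavailable; the same sequence may succeed later."""


def submit_with_fallback(
    send: Callable[[str], None],
    sequence_id: str,
    max_retries: int,
    fallback_queue: deque,
) -> str:
    """Try the primary gateway, then park the sequence instead of dropping it."""
    for _attempt in range(max_retries + 1):
        try:
            send(sequence_id)
            return "submitted"
        except PermanentFailure:
            raise  # fail fast and surface the defect to the caller
        except TransientFailure:
            continue  # a real loop would sleep for the backoff delay here
    # Retries are exhausted: keep the submission's state for later delivery.
    fallback_queue.append(sequence_id)
    return "queued"
```

Both outcomes, `"submitted"` and `"queued"`, are explicit return values so the caller can write the corresponding audit event, including the reroute.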

Backoff is simply a geometric schedule with a cap. For attempt $n$ with base delay $b$ and ceiling $C$, the deterministic component is:

$t_n = \min\bigl(C,\; b \cdot 2^{\,n}\bigr)$

Adding bounded random jitter on top of $t_n$ prevents synchronized retry storms when many site packets queue behind the same recovering gateway.
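The schedule can be sketched directly from the formula. The function name `backoff_delay`, the additive jitter bound `max_jitter`, and the example values (a 1 second base and a 60 second ceiling) are assumptions chosen for illustration.

```python
"""Sketch: capped exponential backoff with bounded additive jitter."""
import random


def backoff_delay(
    attempt: int,
    base: float,
    ceiling: float,
    max_jitter: float,
    rng: random.Random,
) -> float:
    """Return min(ceiling, base * 2**attempt) plus jitter in [0, max_jitter]."""
    if attempt < 0:
        raise ValueError("attempt must be non-negative")
    deterministic = min(ceiling, base * 2 ** attempt)
    return deterministic + rng.uniform(0.0, max_jitter)


rng = random.Random(0)

# With max_jitter set to zero only the deterministic component remains.
schedule = [backoff_delay(n, 1.0, 60.0, 0.0, rng) for n in range(8)]
```

The deterministic schedule doubles from 1 second until it reaches the 60 second ceiling at the seventh attempt. The `random` module is appropriate here because jitter is not a security control; tokens and keys still belong to the `secrets` module.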

Python platform conventions

The same engineering discipline runs through every layer:

Reproducible builds — pin dependencies in pyproject.toml and a lockfile so a submission can be reconstructed from a known toolchain.
Validated input — never persist external data before it passes the data-dictionary rules; reject rather than coerce ambiguous values.
Structured, tamper-evident logging — emit JSON audit events (for example with structlog) and chain them as shown above.
No bare excepts, no swallowed errors — catch specific exceptions, classify them as permanent or transient, and record both outcomes.
No hardcoded secrets — read credentials and keys from the environment or a secrets manager; generate tokens with secrets, never random.
Tested compliance logic — cover validation and state-transition code with pytest, mapping each test to the regulatory requirement it defends.
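
Two of these conventions can be shown with the standard library alone: a structured JSON event line and a token generated with the `secrets` module. This is a sketch under the assumption that plain `json` is acceptable in place of structlog; `audit_line` and `new_correlation_id` are hypothetical helpers.

```python
"""Sketch: a structured JSON log line and a securely generated token."""
import json
import secrets
from datetime import datetime, timezone


def audit_line(actor_id: str, action: str, artifact_sha256: str) -> str:
    """Serialize one event as a single JSON line with stable key order."""
    event = {
        "actor_id": actor_id,
        "action": action,
        "artifact_sha256": artifact_sha256,
        "recorded_at": datetime.now(timezone.utc).isoformat(),
    }
    # Sorted keys and fixed separators keep the serialized form stable,
    # which matters if the line is later hashed into a chain.
    return json.dumps(event, sort_keys=True, separators=(",", ":"))


def new_correlation_id() -> str:
    """Return an unpredictable identifier drawn from the OS entropy source."""
    return secrets.token_urlsafe(16)


line = audit_line("svc-validate", "validated", "0" * 64)
```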

The environment-variable configuration contract

Every service reads its configuration from the environment, never from source. A single typed settings object makes the contract explicit, fails fast on a missing secret at startup rather than mid-submission, and keeps credentials out of the audit log and the repository:

"""Environment-driven configuration for a clinical submission service."""
from __future__ import annotations

import os
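from dataclasses import dataclass, field


@dataclass(frozen=True)
class Settings:
    """Immutable, validated runtime configuration.

    Secrets are read from the environment (or a mounted secrets manager),
    never hardcoded, satisfying the 'no secrets in source' control.
    Secret-bearing fields are excluded from repr() so that logging a
    Settings instance cannot leak them.
    """

    fda_esg_endpoint: str
    ema_gateway_endpoint: str
    audit_db_dsn: str = field(repr=False)  # may embed a database password
    signing_key: str = field(repr=False)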
    max_retries: int = 5

    @classmethod
    def from_env(cls) -> "Settings":
        def required(name: str) -> str:
            value = os.environ.get(name)
            if not value:
                raise RuntimeError(f"Missing required environment variable: {name}")
            return value

        return cls(
            fda_esg_endpoint=required("FDA_ESG_ENDPOINT"),
            ema_gateway_endpoint=required("EMA_GATEWAY_ENDPOINT"),
            audit_db_dsn=required("AUDIT_DB_DSN"),
            signing_key=required("SUBMISSION_SIGNING_KEY"),
            max_retries=int(os.environ.get("MAX_RETRIES", "5")),
        )

Compliance here is not a layer bolted on at the end; it is expressed as code and enforced in CI, so that schema definitions, transition guards, and audit chaining are verified on every change.
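One way to make that CI enforcement traceable is to tag each compliance test with the requirement it defends. The sketch below is an assumption about how such a mapping might look: the `defends` decorator and `REQUIREMENT_INDEX` are hypothetical, the trimmed `ALLOWED` table exists only for the example, and the requirement label should be confirmed against the regulation text. The test functions are plain functions that pytest would also collect.

```python
"""Sketch: tag compliance tests with the requirement they defend."""
from typing import Callable

# Maps a requirement label to the names of the tests that cover it.
REQUIREMENT_INDEX: dict[str, list[str]] = {}


def defends(requirement: str) -> Callable:
    """Register the decorated test under a requirement label."""
    def register(test: Callable) -> Callable:
        REQUIREMENT_INDEX.setdefault(requirement, []).append(test.__name__)
        return test
    return register


# Trimmed copy of the activation allow-list, for this example only.
ALLOWED: dict[str, frozenset[str]] = {
    "irb_review": frozenset({"regulatory_clearance", "irb_review"}),
    "regulatory_clearance": frozenset({"activated"}),
    "activated": frozenset(),
}

SEQUENCING = "21 CFR 11.10(f) operational sequencing checks"


@defends(SEQUENCING)
def test_cannot_activate_from_irb_review() -> None:
    assert "activated" not in ALLOWED["irb_review"]


@defends(SEQUENCING)
def test_activated_is_terminal() -> None:
    assert ALLOWED["activated"] == frozenset()
```

A CI step could then fail the build when a requirement in the index has no tests, or export the index as evidence of which checks defend which control.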

Failure modes and inspection readiness

An inspector does not test the happy path; they probe the seams. The architecture is designed so that each common failure mode is either impossible by construction or leaves an explicit, retrievable record.

Failure mode under audit	What an inspector finds without the design	How the architecture prevents it
Backdated or edited record	A timestamp that cannot be corroborated	Server-side `recorded_at`, frozen events, and the hash chain make any alteration detectable
A skipped approval gate	A site activated before a favorable IRB opinion	The guarded state machine raises on any illegal transition; the gate is code, not convention
Untraceable transformation	A field whose value has no origin	The `sha256` join key ties every derivation back to its original `Artifact` and dictionary version
Silent validation bypass	Data persisted despite failing a rule	Validation runs before persistence and every `ValidationResult` — pass or fail — is written to the trail
Ambiguous terminology	The same document typed differently across sites	Every `doc_type` resolves through one `RegulatoryConcept`, so a concept has a single canonical meaning
Lost submission during an outage	A sequence that vanished when a gateway timed out	Fallback routing persists state to a durable queue and records the reroute, so nothing is dropped

Inspection readiness is therefore a property of the design rather than a documentation exercise: because every state change emits exactly one immutable, attributable event, reconstructing “who did what, when, and against which rule” is a query, not a forensic project.
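To make the "query, not a forensic project" claim concrete, the sketch below stores events in an in-memory SQLite table and reconstructs one artifact's history. The table name, columns, and `who_did_what` helper are assumptions for illustration; the platform's actual audit store is not specified here.

```python
"""Sketch: reconstruct an artifact's history with one query."""
import hashlib
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE audit_event (
           seq INTEGER PRIMARY KEY,
           actor_id TEXT NOT NULL,
           action TEXT NOT NULL,
           artifact_sha256 TEXT NOT NULL,
           recorded_at TEXT NOT NULL
       )"""
)

digest = hashlib.sha256(b"site qualification packet").hexdigest()
other = hashlib.sha256(b"unrelated document").hexdigest()
conn.executemany(
    "INSERT INTO audit_event (actor_id, action, artifact_sha256, recorded_at)"
    " VALUES (?, ?, ?, ?)",
    [
        ("svc-ingest", "ingested", digest, "2025-01-06T09:00:00+00:00"),
        ("svc-ingest", "ingested", other, "2025-01-06T09:00:01+00:00"),
        ("svc-validate", "validated", digest, "2025-01-06T09:00:05+00:00"),
        ("reg-affairs-04", "submitted", digest, "2025-01-06T10:30:00+00:00"),
    ],
)


def who_did_what(
    connection: sqlite3.Connection, artifact_sha256: str
) -> list[tuple[str, str, str]]:
    """Return (actor, action, timestamp) rows for one artifact, in order."""
    cursor = connection.execute(
        "SELECT actor_id, action, recorded_at FROM audit_event"
        " WHERE artifact_sha256 = ? ORDER BY seq",
        (artifact_sha256,),
    )
    return cursor.fetchall()
```

Ordering by the insertion sequence rather than by timestamp text keeps the result stable even if two events share the same recorded time.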

FAQ

What is the difference between the taxonomy, the data dictionary, and the submission schema?

The taxonomy standardizes vocabulary (the canonical codes for a concept), the data dictionary defines fields (type, value set, lineage for each data element), and the submission schema defines structure (how validated fields are assembled into an eCTD sequence). They form a strict dependency chain: schema depends on the dictionary, which depends on the taxonomy.

How does this architecture satisfy 21 CFR Part 11?

Part 11 governs electronic records and signatures. The append-only, hash-chained audit log provides attributable, tamper-evident records; server-side timestamps enforce contemporaneity; and role-based access plus authenticated actor identity on every event support the electronic-signature and access-control expectations. The state machines add an enforced record of who approved what, and in what order.

Where does document parsing and OCR fit?

Parsing, OCR, schema validation, and batch ingestion are the document-handling counterpart to this regulatory-mapping work. They are covered in the companion guide Automated Document Ingestion & Validation Workflows, which feeds normalized artifacts into the ingestion layer described here.

Why model activation and IRB review as state machines instead of a checklist?

A checklist records whether steps are done; a state machine enforces order and legality of transitions. That distinction is what prevents an automated trigger from skipping a compliance gate — such as activating a site or shipping drug before a favorable ethics opinion exists.
