Clinical Site Readiness Assessment Frameworks

A clinical site readiness assessment framework turns site activation from a subjective judgment into a reproducible, auditable score. This guide maps the moving parts — document completeness, regulatory milestone gating, weighted scoring models, and hard activation gates — and shows how to model them in correct, idiomatic Python so readiness becomes computed rather than declared.

Problem framing

Most trials still track site activation with a spreadsheet of binary checkboxes. That approach hides three failure modes that delay first-patient-in: it treats every artifact as equally important, it ignores dependencies between regulatory milestones, and it produces no defensible record of why a site was deemed ready on a given date. When an inspector asks how a site cleared activation, “the checklist was all green” is not an answer — the record has to reconstruct exactly which items, which weights, and which mandatory approvals produced the decision.

A framework replaces the spreadsheet with three explicit layers:

A completeness layer that knows which artifacts a site owes and whether each is present, valid, and unexpired.
A scoring layer that combines those artifacts into a weighted readiness score, so a missing investigator CV does not weigh the same as a missing IRB approval.
A gating layer that enforces non-negotiable prerequisites — a high aggregate score never substitutes for a mandatory regulatory approval.

This build area sits under the Core Architecture & Regulatory Mapping for Clinical Trials platform. Readiness scoring consumes the structured data those upstream systems produce: it depends on a stable Regulatory Data Dictionary Construction so artifact types mean the same thing across sites, it normalizes site-local artifact names through Regulatory Taxonomy Standardization, and it reads ethics-approval state from the IRB/Ethics Workflow Mapping state machine. Once a site clears its gates, the resulting artifacts feed FDA/EMA Submission Schema Design for downstream filing.

Three dimensions of readiness

A defensible framework scores along distinct, non-overlapping dimensions so that a weakness in one cannot be masked by strength in another:

Dimension	What it measures	Example artifacts
Document completeness	Are required artifacts present, current, and structurally valid?	Investigator CV, financial disclosure, lab certification, delegation log
Regulatory milestones	Have authority and ethics approvals been granted?	IRB/EC approval, regulatory authority acknowledgment, signed protocol
Operational capability	Can the site actually run the protocol?	Trained staff, equipment qualification, executed clinical trial agreement

The first dimension is largely mechanical and automatable. The second and third include hard gates: no weighting scheme should ever let a site activate without IRB approval, regardless of how complete its document binder is. That distinction — soft weighting versus hard gating — is the heart of the model.

Decision flowchart

Scoring is not a one-shot calculation; it runs continuously as artifacts arrive, expire, and get re-validated. Each incoming artifact passes through three branches on its way to an activation decision: validity routing (is the artifact validated and unexpired?), the hard-gate check (is every mandatory approval satisfied?), and the threshold comparison (does the weighted score meet the activation threshold?).

Because the score recomputes on every state change, an expiring lab certification automatically pulls a previously ready site back below threshold — readiness decays rather than silently going stale.
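
The branching logic can be sketched as one pure function. This is a minimal illustration rather than the scorer built later in this guide: the `Decision` enum, the `decide` function, and the `(weight, satisfied, is_gate)` tuple layout are hypothetical names chosen for the example, and `satisfied` is assumed to already reflect validity routing (validated and unexpired).

```python
from enum import Enum


class Decision(Enum):
    BLOCKED = "blocked"  # at least one hard gate is unsatisfied
    PENDING = "pending"  # gates pass, weighted score below threshold
    READY = "ready"      # gates pass and weighted score meets threshold


def decide(items: list[tuple[float, bool, bool]], threshold: float) -> Decision:
    """Route (weight, satisfied, is_gate) tuples to an activation decision."""
    if not items:
        raise ValueError("no items to assess")
    # Hard-gate check: any unsatisfied gate blocks activation outright.
    if any(is_gate and not satisfied for _, satisfied, is_gate in items):
        return Decision.BLOCKED
    # Threshold comparison on the weighted share of satisfied items.
    total = sum(weight for weight, _, _ in items)
    earned = sum(weight for weight, satisfied, _ in items if satisfied)
    return Decision.READY if earned / total >= threshold else Decision.PENDING


print(decide([(5.0, True, True), (1.0, False, False)], 0.9))  # Decision.PENDING
print(decide([(5.0, False, True), (1.0, True, False)], 0.9))  # Decision.BLOCKED
```

Calling `decide` again whenever an item's `satisfied` flag changes is what makes readiness decay visible instead of stale.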

The weighted readiness score

We define a site’s readiness as a normalized weighted sum over its assessment items, multiplied by a binary gate term. Let $I$ be the set of scored items. Each item $i$ has a weight $w_i > 0$ and a fractional satisfaction score $s_i \in [0, 1]$, where $0$ means absent or invalid and $1$ means present, valid, and unexpired. The aggregate readiness $R$ is:

$R = \left( \frac{\sum_{i \in I} w_i \, s_i}{\sum_{i \in I} w_i} \right) \cdot \prod_{g \in G} \mathbb{1}[\,g\ \text{satisfied}\,]$

The left factor is a weighted average bounded in $[0, 1]$. The right factor is a product of indicator functions over the set of hard gates $G$: if any mandatory gate is unsatisfied, the product is $0$ and the entire readiness score collapses to zero, which is exactly the behavior regulation requires. A site is eligible for activation only when:

$R \geq \tau \quad \text{and} \quad \forall g \in G : \mathbb{1}[\,g\ \text{satisfied}\,] = 1$

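where $\tau$ is the activation threshold (commonly $\tau = 1.0$ for a “fully ready” policy, or a lower value such as $0.9$ when the sponsor permits conditional activation with tracked open items). Because a failed gate forces $R = 0$, the gate condition is already implied by $R \geq \tau$ whenever $\tau > 0$; stating it separately keeps the two checks independently visible in the audit record. Keeping the threshold $\tau$ and the gate set $G$ in version-controlled configuration, never hardcoded, lets regulatory affairs adjust policy without code changes.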
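
As a worked instance of the formula, assume a hypothetical site with four items weighted 5, 4, 2, and 1 (illustrative values, not a recommended weighting), where the two heaviest items are the hard gates and only the weight-1 item is unsatisfied:

```latex
R = \left( \frac{5 \cdot 1 + 4 \cdot 1 + 2 \cdot 1 + 1 \cdot 0}{5 + 4 + 2 + 1} \right) \cdot (1 \cdot 1)
  = \frac{11}{12} \approx 0.917
```

Under $\tau = 0.9$ this site is eligible; under $\tau = 1.0$ it is not. If either gate were unsatisfied, the indicator product would be $0$ and $R = 0$ regardless of the other items.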

Library and tooling landscape

The scoring engine is small, deterministic policy logic — the wrong instinct is to reach for a heavyweight framework. The clinical-grade recommendation is plain typed Python (dataclasses + enum) for the model, pydantic v2 to parse and validate the version-controlled policy file, and standard-library logging for the audit trail. Everything below is version-pinnable and unit-testable, which is what makes an assessment defensible under inspection.

Option	Role	Clinical-grade fit
`dataclasses` + `enum` (stdlib)	Assessment items, statuses, scoring result	Recommended. Transparent, typed, trivially testable; the scoring function reads like the spec.
`pydantic` v2	Parse/validate the policy (weights, threshold, gate set) from config	Recommended. Catches malformed policy at load time; pairs with `pydantic-settings` for env-driven paths.
`pandas`	Ad-hoc tabular scoring across many sites	Situational. Fine for reporting rollups, weak for the gating decision — no static types and easy to mutate silently.
`great-expectations`	Data-quality validation on inbound artifact tables	Situational. Useful upstream on ingestion, overkill and opaque for the activation gate itself.
`pyke`	Forward-chaining rules engine for gate logic	Deprecated. Do not use. Long unmaintained and dating from the Python 2 era; externalizing gates into an unaudited rules DSL also weakens traceability. Keep gates in reviewed code.
Excel / manual checklist	The status quo	Not defensible. No reproducibility, no audit trail, no dependency modeling: the exact failure this framework removes.

Deprecated tooling warning. Do not build gate logic on pyke; it dates from the Python 2 era and has been unmaintained for years. Resist the temptation to externalize hard regulatory gates into a rules-engine DSL at all: a mandatory-approval check that lives in reviewed, version-controlled Python is far easier to defend to an inspector than one buried in an opaque rule graph.

Step-by-step implementation

The model separates data (assessment items and their state) from policy (weights, thresholds, gates) and from computation (the scoring function), so each can be tested and audited independently. All code uses type hints and dataclasses, and no policy value is hardcoded: the activation threshold and policy version load from a version-controlled file whose path comes from the environment.

1. Model the assessment item and its satisfaction

Each item carries a stable key (matching the shared data dictionary), a positive weight, an is_gate flag, a validation status, and an optional expiry. Satisfaction is derived deterministically and takes the evaluation date as an explicit argument, so scoring never reads the wall clock and stays reproducible.

"""Weighted clinical site readiness scoring with hard regulatory gates."""
from __future__ import annotations

from dataclasses import dataclass
from datetime import date
from enum import Enum


class ItemStatus(str, Enum):
    """Lifecycle status of a single readiness artifact."""

    MISSING = "missing"
    SUBMITTED = "submitted"
    VALIDATED = "validated"
    REJECTED = "rejected"


@dataclass(frozen=True)
class AssessmentItem:
    """One scored readiness requirement for a site.

    Attributes:
        key: Stable identifier matching the regulatory data dictionary.
        weight: Relative importance; must be positive.
        is_gate: If True, the item is a hard activation prerequisite.
        status: Current validation status of the artifact.
        expires_on: Optional expiry date for time-bounded artifacts.
    """

    key: str
    weight: float
    is_gate: bool
    status: ItemStatus
    expires_on: date | None = None

    def __post_init__(self) -> None:
        if self.weight <= 0:
            raise ValueError(f"weight for {self.key!r} must be positive")

    def satisfaction(self, as_of: date) -> float:
        """Return s_i in [0, 1] for this item, evaluated at a fixed date.

        An artifact only contributes if it is VALIDATED and not expired.
        Passing ``as_of`` explicitly keeps scoring reproducible.
        """
        if self.status is not ItemStatus.VALIDATED:
            return 0.0
        if self.expires_on is not None and self.expires_on < as_of:
            return 0.0
        return 1.0
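
A quick check of the expiry boundary may help here, since the comparison is strict: an artifact still counts on its expiry date and stops counting the day after. The snippet repeats a condensed copy of the item model only so that it runs on its own; the dates are arbitrary example values.

```python
from __future__ import annotations

from dataclasses import dataclass
from datetime import date
from enum import Enum


class ItemStatus(str, Enum):
    MISSING = "missing"
    SUBMITTED = "submitted"
    VALIDATED = "validated"
    REJECTED = "rejected"


@dataclass(frozen=True)
class AssessmentItem:
    """Condensed copy of the item model, kept so this example is standalone."""

    key: str
    weight: float
    is_gate: bool
    status: ItemStatus
    expires_on: date | None = None

    def satisfaction(self, as_of: date) -> float:
        if self.status is not ItemStatus.VALIDATED:
            return 0.0
        if self.expires_on is not None and self.expires_on < as_of:
            return 0.0
        return 1.0


cert = AssessmentItem(
    "lab_certification", 2.0, False, ItemStatus.VALIDATED,
    expires_on=date(2026, 12, 31),
)
print(cert.satisfaction(date(2026, 12, 31)))  # 1.0: still valid on the expiry date
print(cert.satisfaction(date(2027, 1, 1)))    # 0.0: expired the day after
```

Whether the expiry date itself should count as valid is a policy question; if the sponsor treats it as already expired, the comparison becomes `<=`.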

2. Load the activation policy from version-controlled config

Weights, the threshold $\tau$, and the gate set are policy, not code: regulatory affairs must be able to change them without a deployment, and every assessment must record which policy version was in force. Load them from a reviewed file whose path comes from the environment, and validate on load so a malformed policy fails fast instead of silently mis-scoring. The loader below covers the threshold and version; in this guide's model, weights and gate flags are carried on each AssessmentItem.

"""Load and validate the sponsor activation policy from config."""
from __future__ import annotations

import json
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class ReadinessPolicy:
    """Sponsor-configurable activation policy (load from version control)."""

    threshold: float
    version: str

    def __post_init__(self) -> None:
        if not 0.0 < self.threshold <= 1.0:
            raise ValueError("threshold must be in the interval (0, 1]")
        if not self.version:
            raise ValueError("policy must carry a version identifier")


def load_policy(env_var: str = "READINESS_POLICY_PATH") -> ReadinessPolicy:
    """Read the activation policy from a path given by an env var.

    Raises:
        KeyError: If the environment variable is unset.
        ValueError: If the policy file is malformed.
    """
    path = os.environ[env_var]  # no hardcoded paths or secrets
    with open(path, encoding="utf-8") as handle:
        raw = json.load(handle)
    return ReadinessPolicy(
        threshold=float(raw["threshold"]),
        version=str(raw["version"]),
    )
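
The loader above reads only `threshold` and `version`. One way to keep weights and the gate set in the same reviewed file is sketched below using only the standard library. The JSON layout, the `items` key, and the names `ItemRule`, `FullPolicy`, and `parse_policy` are assumptions made for this illustration, not part of the model above.

```python
from __future__ import annotations

import json
from dataclasses import dataclass


@dataclass(frozen=True)
class ItemRule:
    weight: float
    is_gate: bool


@dataclass(frozen=True)
class FullPolicy:
    version: str
    threshold: float
    rules: dict[str, ItemRule]

    @property
    def gates(self) -> frozenset[str]:
        """Keys of every item configured as a hard gate."""
        return frozenset(k for k, rule in self.rules.items() if rule.is_gate)


def parse_policy(text: str) -> FullPolicy:
    """Parse and validate a policy document; raise on anything malformed."""
    raw = json.loads(text)
    threshold = float(raw["threshold"])
    if not 0.0 < threshold <= 1.0:
        raise ValueError("threshold must be in the interval (0, 1]")
    version = str(raw["version"])
    if not version:
        raise ValueError("policy must carry a version identifier")
    rules: dict[str, ItemRule] = {}
    for key, spec in raw["items"].items():
        weight = float(spec["weight"])
        if weight <= 0:
            raise ValueError(f"weight for {key!r} must be positive")
        is_gate = spec.get("is_gate", False)
        if not isinstance(is_gate, bool):
            raise ValueError(f"is_gate for {key!r} must be a JSON boolean")
        rules[key] = ItemRule(weight=weight, is_gate=is_gate)
    if not rules:
        raise ValueError("policy must define at least one item")
    return FullPolicy(version=version, threshold=threshold, rules=rules)


EXAMPLE = """
{
  "version": "example-1",
  "threshold": 0.9,
  "items": {
    "irb_approval": {"weight": 5.0, "is_gate": true},
    "investigator_cv": {"weight": 1.0}
  }
}
"""

policy = parse_policy(EXAMPLE)
print(sorted(policy.gates))  # ['irb_approval']
```

With a layout like this, a change to the gate set shows up as a reviewable diff in the policy file rather than as a code change.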

3. Compute the gated readiness score

The scoring function is the direct translation of $R$: a weighted average multiplied by the hard-gate product. It returns a structured result rather than a bare number, so the caller can log the weighted average, the failed gates, and the ready flag independently.

"""Aggregate scoring: weighted average times the hard-gate product."""
from __future__ import annotations

from dataclasses import dataclass, field
from datetime import date

# AssessmentItem and ReadinessPolicy are the classes defined in steps 1 and 2.


@dataclass(frozen=True)
class ReadinessResult:
    """Computed readiness outcome for a single site."""

    score: float
    weighted_average: float
    gates_satisfied: bool
    failed_gates: tuple[str, ...]
    is_ready: bool


@dataclass
class SiteReadiness:
    """Aggregate readiness model for one clinical site."""

    site_id: str
    items: list[AssessmentItem] = field(default_factory=list)

    def assess(self, policy: ReadinessPolicy, as_of: date) -> ReadinessResult:
        """Compute the weighted, gated readiness score for this site.

        Raises:
            ValueError: If the site has no scored items.
        """
        if not self.items:
            raise ValueError(f"site {self.site_id!r} has no assessment items")

        total_weight = sum(item.weight for item in self.items)
        weighted = sum(
            item.weight * item.satisfaction(as_of) for item in self.items
        )
        weighted_average = weighted / total_weight

        failed_gates = tuple(
            item.key
            for item in self.items
            if item.is_gate and item.satisfaction(as_of) < 1.0
        )
        gates_satisfied = not failed_gates

        # Hard gates collapse the score to zero (the product term in R).
        score = weighted_average if gates_satisfied else 0.0
        # Decide on the unrounded score; the rounding below is for reporting only.
        is_ready = gates_satisfied and score >= policy.threshold

        return ReadinessResult(
            score=round(score, 4),
            weighted_average=round(weighted_average, 4),
            gates_satisfied=gates_satisfied,
            failed_gates=failed_gates,
            is_ready=is_ready,
        )

4. Exercise the gating behavior as an assertion

A short, deterministic exercise of the model — the kind you keep as a unit test — makes the gating behavior explicit and pins it against regression.

from datetime import date

policy = ReadinessPolicy(threshold=0.9, version="2026.06-rev3")
today = date(2026, 6, 18)

site = SiteReadiness(
    site_id="US-014",
    items=[
        AssessmentItem("irb_approval", 5.0, is_gate=True,
                       status=ItemStatus.VALIDATED),
        AssessmentItem("clinical_trial_agreement", 4.0, is_gate=True,
                       status=ItemStatus.VALIDATED),
        AssessmentItem("investigator_cv", 2.0, is_gate=False,
                       status=ItemStatus.SUBMITTED),  # not yet validated
        AssessmentItem("lab_certification", 2.0, is_gate=False,
                       status=ItemStatus.VALIDATED,
                       expires_on=date(2026, 12, 31)),
    ],
)

result = site.assess(policy, as_of=today)
# Gates pass, but the un-validated CV leaves (5 + 4 + 2) / 13, about 0.846,
# which is below the 0.9 threshold.
assert result.gates_satisfied is True
assert result.is_ready is False
assert result.score < policy.threshold

If irb_approval were anything other than VALIDATED, failed_gates would contain "irb_approval", score would be exactly 0.0, and is_ready would be False no matter how complete the rest of the binder was. That is the single most important property to test, because it is the property regulators care about.
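
That property can be pinned with a test along these lines. To stay runnable on its own, the example uses a condensed stand-in (`Item`, `assess`) rather than the classes defined earlier, and the weights are deliberately skewed (hypothetical values) so that the weighted average alone would clear the threshold.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Item:
    key: str
    weight: float
    is_gate: bool
    satisfied: bool


def assess(
    items: list[Item], threshold: float
) -> tuple[float, tuple[str, ...], bool]:
    """Return (score, failed_gates, is_ready) using the gated formula for R."""
    total = sum(item.weight for item in items)
    average = sum(item.weight for item in items if item.satisfied) / total
    failed = tuple(i.key for i in items if i.is_gate and not i.satisfied)
    score = 0.0 if failed else average
    return score, failed, (not failed and score >= threshold)


def test_failed_gate_collapses_score() -> None:
    items = [
        Item("irb_approval", 1.0, is_gate=True, satisfied=False),
        Item("everything_else", 99.0, is_gate=False, satisfied=True),
    ]
    score, failed, is_ready = assess(items, threshold=0.9)
    # The weighted average alone would be 0.99, yet the gate forces zero.
    assert score == 0.0
    assert failed == ("irb_approval",)
    assert is_ready is False


test_failed_gate_collapses_score()
```

Giving the gate a tiny weight is the point of the test: it shows that gating does not depend on the weighting scheme at all.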

Validation and audit-trail integration

An assessment result is not just returned to a caller; it is a regulated record. Every assess call must append an entry to an append-only audit log, in line with the secure, time-stamped audit-trail controls of 21 CFR Part 11 (§11.10(e)), capturing the score, the failed gates, and the policy version that produced them. Because the scoring inputs come from validated documents, the readiness layer is a consumer of the ingestion and validation stack: item statuses are set by Schema Validation & Error Categorization, and the open-item deltas that drive SUBMITTED vs MISSING come from Checklist Sync & Gap Analysis.

The audit record uses structured logging and records everything needed to reconstruct the decision: which site, evaluated at what date, under which policy version, with what result. No secrets or environment-specific paths are embedded.

"""Emit a structured, append-only audit record for one assessment."""
from __future__ import annotations

import json
import logging
from datetime import date, datetime, timezone

logger = logging.getLogger("readiness.audit")


def record_assessment(
    *,
    site: SiteReadiness,
    result: ReadinessResult,
    policy: ReadinessPolicy,
    as_of: date,
    assessed_by: str,
) -> dict:
    """Append a reproducible readiness audit record and return it.

    The record answers the inspection questions: which site, on what
    evaluation date, under which policy version, with what outcome, and
    attributable to whom.
    """
    record = {
        "site_id": site.site_id,
        "as_of": as_of.isoformat(),           # reproducible evaluation date
        "assessed_at": datetime.now(timezone.utc).isoformat(),  # Contemporaneous
        "policy_version": policy.version,     # exactly which policy applied
        "threshold": policy.threshold,
        "score": result.score,               # Accurate
        "weighted_average": result.weighted_average,
        "gates_satisfied": result.gates_satisfied,
        "failed_gates": list(result.failed_gates),
        "is_ready": result.is_ready,
        "assessed_by": assessed_by,           # Attributable
    }
    # Put the JSON in the message itself so the line is structured under any
    # formatter; ship to WORM/append-only storage in prod.
    logger.info("readiness_assessment %s", json.dumps(record, sort_keys=True),
                extra={"record": record})
    return record

Persist these records as JSON Lines to write-once (WORM) storage so the history cannot be retroactively edited — that is what satisfies the Enduring and Available attributes of the ALCOA+ chain. Recompute and re-log on every artifact state change, so an expiring certification that drops a site below threshold produces its own audit event rather than a silent status flip.
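
A minimal sketch of the JSON Lines persistence step follows. It only shows append-mode writing to a local file; the immutability guarantee has to come from the storage layer (for example an object store with a retention lock), which this code cannot enforce. The names `append_audit_record` and `read_audit_records`, and the file name, are assumptions for illustration.

```python
import json
import os
import tempfile
from pathlib import Path


def append_audit_record(path: Path, record: dict) -> None:
    """Append one record as a single JSON line and flush it to disk."""
    line = json.dumps(record, sort_keys=True, separators=(",", ":"))
    # Mode "a" always writes at the end of the file; earlier lines are untouched.
    with open(path, "a", encoding="utf-8") as handle:
        handle.write(line + "\n")
        handle.flush()
        os.fsync(handle.fileno())


def read_audit_records(path: Path) -> list[dict]:
    """Read the history back, one dict per non-empty line."""
    with open(path, encoding="utf-8") as handle:
        return [json.loads(line) for line in handle if line.strip()]


log_path = Path(tempfile.mkdtemp()) / "readiness_audit.jsonl"
# Two assessments of the same site: each recomputation is its own event.
append_audit_record(log_path, {"site_id": "US-014", "score": 0.0, "is_ready": False})
append_audit_record(log_path, {"site_id": "US-014", "score": 1.0, "is_ready": True})
print(len(read_audit_records(log_path)))  # 2
```

Because each line is a complete JSON document, the history of a site can be replayed in order to show how its status changed over time.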

Error categorization and recovery

Not every “not ready” outcome means the same thing, and conflating them either blocks activation needlessly or lets a site slip through with an open mandatory approval. Classify each failure so the recovery path is proportionate — and so downstream tooling can route the blocker to the right owner.

Failure class	How to detect it programmatically	Recovery strategy
Failed hard gate	`result.failed_gates` is non-empty	Block activation and route the named gate (e.g. `irb_approval`) to its owner; no weighting can compensate.
Below threshold, gates pass	`gates_satisfied and not is_ready`	Conditional/pending; surface the specific low-satisfaction items so open work is targeted, not guessed.
Expired artifact	`satisfaction(as_of) == 0` for a previously validated item with a past `expires_on`	Re-request the artifact; the site reverts to pending automatically on recompute.
Rejected document	item `status is ItemStatus.REJECTED`	Return to the submitter via remediation; do not count as merely missing — it failed validation.
Empty or malformed assessment	`assess` raises `ValueError` (no items) or `load_policy` raises on bad config	Fail loudly at load/score time; never emit a zero-item “ready” result.

The governing rule is that no “not ready” outcome is ever resolved by editing the score directly. The state changes only when an underlying artifact changes and the score recomputes — and each of those recomputations is itself an audit event, so the recovery path stays as traceable as the happy path.
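
The detection rules in the table can be folded into small routing functions. This is a sketch under assumptions: `FailureClass`, `classify`, `classify_item`, and their plain-argument signatures are hypothetical, and checking the hard gate before the threshold is one reasonable precedence rather than a prescribed one.

```python
from __future__ import annotations

from datetime import date
from enum import Enum


class FailureClass(Enum):
    NONE = "ready"
    FAILED_HARD_GATE = "failed_hard_gate"
    BELOW_THRESHOLD = "below_threshold"


def classify(failed_gates: tuple[str, ...], is_ready: bool) -> FailureClass:
    """Classify a site-level outcome from the fields of a scoring result."""
    if failed_gates:
        return FailureClass.FAILED_HARD_GATE
    if not is_ready:
        return FailureClass.BELOW_THRESHOLD
    return FailureClass.NONE


def classify_item(status: str, expires_on: date | None, as_of: date) -> str:
    """Classify one item so the blocker can be routed to the right owner."""
    if status == "rejected":
        return "rejected"  # failed validation; goes back to the submitter
    if status == "validated":
        if expires_on is not None and expires_on < as_of:
            return "expired"  # re-request the artifact
        return "satisfied"
    return "open"  # missing, or submitted but not yet validated


today = date(2026, 6, 18)
print(classify(("irb_approval",), is_ready=False))           # FailureClass.FAILED_HARD_GATE
print(classify_item("validated", date(2026, 1, 31), today))  # expired
```

Keeping classification separate from scoring means the routing rules can change without touching the computation that produces the audited score.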

Compliance checklist

Source item keys and expected artifact sets from the shared regulatory data dictionary, not per-site spreadsheets.
Keep weights, the threshold $\tau$, and the gate set $G$ in version-controlled configuration loaded at runtime; never hardcode them.
Load the policy path from an environment variable and validate the policy on load so malformed config fails fast.
Pass the evaluation date explicitly into scoring so results are reproducible and auditable, never reading the wall clock inside the computation.
Recompute on every artifact state change so expirations and rejections lower the score immediately.
Record each assessment outcome — score, failed gates, and the policy version used — to an append-only, WORM-backed audit log aligned with 21 CFR Part 11.
Attribute every assessment to a user and stamp it with a contemporaneous timestamp (ALCOA+ Attributable and Contemporaneous).
Treat a passing score as a recommendation; reserve final activation sign-off for a human reviewer.

FAQ

Should the readiness score ever override a missing IRB approval?

No. IRB/EC approval is a hard gate, modeled by the product term in $R$. When a gate is unsatisfied the score is forced to zero, so a high weighted average can never compensate for a missing mandatory approval. This is deliberate: weighting expresses priority among optional items, while gating expresses non-negotiable legal prerequisites.

How do expiring documents affect the score?

Each time-bounded artifact carries an expiry date, and its satisfaction value drops to 0.0 once that date passes. Because the model recomputes on every change and accepts the evaluation date as an explicit input, an expired lab certification or investigator license lowers the site’s score automatically — a previously ready site can revert to pending without any manual edit.

What threshold should we set for activation?

A threshold of $\tau = 1.0$ enforces full readiness before activation. Sponsors that permit conditional activation with tracked open items may set a lower value such as $0.9$, provided every hard gate still passes. Keep $\tau$ in configuration so regulatory affairs can adjust policy without a code deployment, and log which threshold and policy version were in force for each assessment.

Is an automated score a substitute for regulatory sign-off?

No. The framework computes a deterministic, defensible readiness measure, but final activation remains a human-gated decision. Automation’s value is consistency and an audit trail showing exactly which items and gates produced a given score on a given date — not replacing the accountable reviewer.

Related

Regulatory Data Dictionary Construction: the source of stable item keys and expected artifact sets the scorer reads.
IRB/Ethics Workflow Mapping: the state machine that sets the ethics-approval gate this framework depends on.
Checklist Sync & Gap Analysis: computes the open-item deltas that drive artifact statuses.
Schema Validation & Error Categorization: the validator that flips an item to VALIDATED or REJECTED.
FDA/EMA Submission Schema Design: the downstream consumer once a site clears its gates.

