Extracting Metadata from Scanned Clinical Trial PDFs Using Tesseract

Scanned clinical trial PDFs—site activation packets, signed 1572s, faxed lab certificates—often have no text layer, so a regex over the raw PDF returns nothing. This guide builds a production Python pipeline that detects missing text layers, rasterizes pages at 300+ DPI, preprocesses with OpenCV, runs Tesseract LSTM OCR with word-level confidence, and extracts protocol number, site ID, dates, and investigator name under an ALCOA+ audit trail.

This is a deep how-to under the OCR & Metadata Extraction Pipelines cluster, part of the Automated Document Ingestion & Validation Workflows pillar. For documents that already carry a digital text layer, the techniques in PDF/DOCX Parsing for Clinical Docs are faster and more accurate—use OCR only when no text layer exists.

When to reach for OCR

OCR is a last resort, not a default. It is slower, lossier, and harder to validate than direct text extraction. Reach for it only when a page genuinely lacks a recoverable text layer. The cheapest reliable signal is: try pypdf first, and rasterize only the pages that come back empty.

flowchart TD
    A[Open PDF with pypdf] --> B{Page has text layer}
    B -->|Yes| C[Parse text directly]
    B -->|No| D[Rasterize page at 300 dpi]
    D --> E[OpenCV preprocess deskew threshold denoise]
    E --> F[Tesseract image_to_data oem 1]
    F --> G[Word level confidence scoring]
    G --> H[Anchored regex metadata extraction]
    H --> I{Field confidence tier}
    I -->|High| J[Commit to EDC and CTMS]
    I -->|Low| K[Human review queue]
    C --> H
    J --> L[Append ALCOA plus audit record]
    K --> L

The pipeline below mirrors this flow: detect, rasterize, preprocess, OCR, score, extract, route, audit.

Step 1: Detect pages with no text layer

Rasterizing every page wastes compute and degrades accuracy on pages that already have clean text. Use pypdf to inspect each page; treat a page as scanned when extract_text() returns negligible content. pypdf is the maintained successor to the deprecated PyPDF2, so all examples use it.

"""Detect which PDF pages lack a usable text layer."""
from __future__ import annotations

from pathlib import Path

from pypdf import PdfReader

# A genuine text page usually yields well over this many non-space characters.
MIN_TEXT_CHARS = 20


def scanned_page_indices(pdf_path: Path) -> list[int]:
    """Return zero-based indices of pages that need OCR.

    A page is considered "scanned" when pypdf extracts fewer than
    MIN_TEXT_CHARS non-whitespace characters from it.
    """
    if not pdf_path.is_file():
        raise FileNotFoundError(f"PDF not found: {pdf_path}")

    reader = PdfReader(str(pdf_path))
    needs_ocr: list[int] = []
    for index, page in enumerate(reader.pages):
        try:
            text = page.extract_text() or ""
        except (ValueError, KeyError) as exc:
            # Malformed page object: extract_text can raise on broken
            # content streams. Treat as scanned and let OCR try.
            text = ""
            _log_extract_error(index, exc)
        if len("".join(text.split())) < MIN_TEXT_CHARS:
            needs_ocr.append(index)
    return needs_ocr


def _log_extract_error(index: int, exc: Exception) -> None:
    import logging

    logging.getLogger("clinical_ocr").warning(
        "pypdf text extraction failed on page %d: %s", index, exc
    )

Step 2: Rasterize at 300 DPI or higher

Tesseract’s accuracy degrades noticeably below roughly 300 DPI; as a rule of thumb, the LSTM engine does best when lowercase letters (the x-height) are at least 20–30 pixels tall. Render scanned pages at 300 DPI (400 for small-font fax artifacts). Two solid options:

  • PyMuPDF (fitz) renders without external dependencies and is fast.
  • pdf2image wraps Poppler—reliable, but requires the poppler-utils system package.

This pipeline uses PyMuPDF to avoid a system dependency and to render single pages on demand (lower memory than converting a whole document at once).
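
To see why 300 DPI is a sensible floor, it helps to convert point sizes into pixels. The sketch below is back-of-envelope arithmetic only: the helper names are hypothetical, and the x-height ratio of 0.5 is a rough typographic assumption, not a Tesseract constant.

```python
"""Back-of-envelope pixel sizes for rendered text (illustrative only)."""

POINTS_PER_INCH = 72.0  # PDF user space uses 72 units per inch


def zoom_for_dpi(dpi: float) -> float:
    """Scale factor a renderer needs to reach the target DPI."""
    return dpi / POINTS_PER_INCH


def approx_x_height_px(font_pt: float, dpi: float, x_height_ratio: float = 0.5) -> float:
    """Approximate x-height in pixels for a font size given in points.

    x_height_ratio is an assumed typical value; real fonts vary.
    """
    return font_pt * zoom_for_dpi(dpi) * x_height_ratio


for dpi in (150, 300, 400):
    print(dpi, round(approx_x_height_px(10, dpi), 1))
# 150 10.4
# 300 20.8
# 400 27.8
```

Under these assumptions, 10 pt text has roughly 21 pixels of x-height at 300 DPI but only about 10 pixels at 150 DPI, which is why low-resolution fax pages benefit from the 400 DPI setting.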

"""Rasterize a single PDF page to an OpenCV BGR image at a target DPI."""
from __future__ import annotations

from pathlib import Path

import cv2
import fitz  # PyMuPDF
import numpy as np

DEFAULT_DPI = 300


def rasterize_page(pdf_path: Path, page_index: int, dpi: int = DEFAULT_DPI) -> np.ndarray:
    """Render one page to a BGR ndarray suitable for OpenCV and Tesseract."""
    if dpi < 300:
        raise ValueError("Use at least 300 DPI for reliable OCR accuracy")

    with fitz.open(str(pdf_path)) as doc:
        if not 0 <= page_index < doc.page_count:
            raise IndexError(f"Page {page_index} out of range for {pdf_path}")
        page = doc.load_page(page_index)
        # PDF user space is 72 units per inch; scale the matrix to reach the target DPI.
        matrix = fitz.Matrix(dpi / 72.0, dpi / 72.0)
        pix = page.get_pixmap(matrix=matrix, colorspace=fitz.csRGB, alpha=False)

    rgb = np.frombuffer(pix.samples, dtype=np.uint8).reshape(pix.height, pix.width, 3)
    return cv2.cvtColor(rgb, cv2.COLOR_RGB2BGR)

Step 3: Preprocess with OpenCV

Raw scans carry skew, uneven lighting, speckle, and fax compression noise. A focused preprocessing chain—grayscale, deskew, denoise, adaptive threshold—repairs these before OCR. Order matters: deskew on the grayscale image (rotation interpolates cleanly on continuous tones), then denoise, then binarize.

"""OpenCV preprocessing for degraded clinical scans."""
from __future__ import annotations

import cv2
import numpy as np


def _estimate_skew_angle(gray: np.ndarray) -> float:
    """Estimate page skew in degrees using the minimum-area rectangle.

    Returns a small angle in roughly [-45, 45]; 0.0 when no foreground
    pixels are found.
    """
    # Foreground = dark text on light paper, so invert before thresholding.
    inverted = cv2.bitwise_not(gray)
    _, binary = cv2.threshold(inverted, 0, 255, cv2.THRESH_BINARY | cv2.THRESH_OTSU)
    coords = cv2.findNonZero(binary)
    if coords is None:
        return 0.0

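    angle = cv2.minAreaRect(coords)[-1]
    # The reported range depends on the OpenCV version: older releases use
    # (-90, 0], while newer ones (around 4.5.1 onward) use (0, 90]. Fold both
    # into a small rotation; without the second branch, newer versions would
    # report a slight counter-clockwise skew as ~90 and it would be skipped.
    if angle < -45:
        angle += 90
    elif angle > 45:
        angle -= 90
    return float(angle)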


def _deskew(gray: np.ndarray, max_correction_deg: float = 15.0) -> np.ndarray:
    """Rotate the image to remove skew, capping absurd corrections."""
    angle = _estimate_skew_angle(gray)
    if abs(angle) < 0.1 or abs(angle) > max_correction_deg:
        # Tiny skew is not worth interpolating; huge angles usually mean
        # a misdetection (e.g. a landscape form) and should be left alone.
        return gray

    height, width = gray.shape
    center = (width / 2.0, height / 2.0)
    rotation = cv2.getRotationMatrix2D(center, angle, 1.0)
    return cv2.warpAffine(
        gray,
        rotation,
        (width, height),
        flags=cv2.INTER_CUBIC,
        borderMode=cv2.BORDER_REPLICATE,
    )


def preprocess(frame: np.ndarray) -> np.ndarray:
    """Grayscale, deskew, denoise, and adaptively threshold a scan."""
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    deskewed = _deskew(gray)
    # Edge-preserving denoise removes fax speckle without smearing strokes.
    denoised = cv2.fastNlMeansDenoising(deskewed, h=10, templateWindowSize=7, searchWindowSize=21)
    # Adaptive (local) threshold handles uneven scanner lighting better than
    # a single global Otsu cut.
    binary = cv2.adaptiveThreshold(
        denoised,
        maxValue=255,
        adaptiveMethod=cv2.ADAPTIVE_THRESH_GAUSSIAN_C,
        thresholdType=cv2.THRESH_BINARY,
        blockSize=31,
        C=10,
    )
    return binary

A few practical notes:

  • Do not over-clean. Aggressive morphology can erode thin strokes until a digit disappears (1572 becomes 572) or fuse neighboring digits together. Validate on a held-out set of real scans.
  • blockSize must be odd and large enough to span a full glyph at your DPI; 31 works well at 300 DPI.
  • Skip deskew on born-clean renders (text-layer pages you chose to rasterize anyway)—the skew estimate adds noise with no benefit.
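
If you render some pages at 400 DPI, the blockSize note above implies the threshold window should grow with resolution. A minimal sketch of one way to do that, assuming simple linear scaling from the 31-pixel window at 300 DPI (the helper name and the scaling rule are illustrative assumptions, not an OpenCV recommendation):

```python
def block_size_for_dpi(dpi: int, base_block: int = 31, base_dpi: int = 300) -> int:
    """Scale adaptiveThreshold's blockSize with DPI, keeping it odd and >= 3."""
    scaled = round(base_block * dpi / base_dpi)
    if scaled % 2 == 0:
        scaled += 1  # cv2.adaptiveThreshold requires an odd blockSize
    return max(scaled, 3)


print(block_size_for_dpi(300))  # 31
print(block_size_for_dpi(400))  # 41
print(block_size_for_dpi(600))  # 63
```

Treat the result as a starting point and confirm it on real scans, since the best window also depends on font size and stroke weight.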

Step 4: OCR with word-level confidence

Use image_to_data rather than image_to_string. It returns per-word bounding boxes and confidence scores, which are essential for routing low-confidence fields to human review and for building the audit trail. Run the LSTM engine with --oem 1. For full-page forms, --psm 6 (assume a single uniform block of text) is a robust default; for narrow cropped fields, --psm 7 (single line) is better.

Tesseract reports word confidence on a 0–100 scale and emits -1 for layout boxes with no recognized text. Filter those out before aggregating.

"""Run Tesseract and return clean, confidence-bearing words."""
from __future__ import annotations

import shutil
from dataclasses import dataclass

import numpy as np
import pytesseract


@dataclass(frozen=True)
class Word:
    text: str
    confidence: float  # 0-100
    left: int
    top: int
    width: int
    height: int


def assert_tesseract_ready(required_langs: tuple[str, ...] = ("eng",)) -> None:
    """Fail fast if Tesseract or a required language pack is missing."""
    if shutil.which("tesseract") is None:
        raise RuntimeError("Tesseract binary not found on PATH")
    installed = set(pytesseract.get_languages(config=""))
    missing = [lang for lang in required_langs if lang not in installed]
    if missing:
        raise RuntimeError(f"Missing Tesseract language packs: {missing}")


def ocr_words(image: np.ndarray, psm: int = 6, lang: str = "eng") -> list[Word]:
    """OCR an image and return words with valid (non -1) confidence."""
    config = f"--oem 1 --psm {psm}"
    data = pytesseract.image_to_data(
        image, lang=lang, config=config, output_type=pytesseract.Output.DICT
    )

    words: list[Word] = []
    for text, conf, left, top, width, height in zip(
        data["text"], data["conf"], data["left"], data["top"], data["width"], data["height"]
    ):
        cleaned = text.strip()
        confidence = float(conf)
        if cleaned and confidence >= 0:
            words.append(Word(cleaned, confidence, int(left), int(top), int(width), int(height)))
    return words
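
To illustrate why the -1 entries must be dropped before aggregating, here is a small sketch on hand-made data. The sample dictionary is invented and only mimics a subset of the keys that image_to_data returns with Output.DICT; no Tesseract call is made.

```python
from statistics import mean

# Invented sample shaped like a subset of pytesseract's Output.DICT result.
sample = {
    "text": ["", "Protocol", "No.", "ABC-1234", " "],
    "conf": [-1, 96, 91, 88, -1],
}


def valid_confidences(data: dict) -> list[float]:
    """Confidences for entries that carry recognized text (conf is not -1)."""
    return [
        float(conf)
        for text, conf in zip(data["text"], data["conf"])
        if text.strip() and float(conf) >= 0
    ]


naive = mean(float(conf) for conf in sample["conf"])
filtered = mean(valid_confidences(sample))
print(round(naive, 2), round(filtered, 2))  # 54.6 91.67
```

The two layout rows drag the naive average from about 92 down to about 55, which would be enough to send a clean page to the review queue.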

Step 5: Extract metadata with anchored regex

Keyword matching (“does the word PROTOCOL appear?”) is far too weak for regulated extraction—it cannot tell you which value belongs to the field. Instead, reconstruct the OCR text in reading order, then use anchored regular expressions: a label anchor (Protocol No.) immediately followed by a value pattern. This captures the value, not just the presence of a label, and lets you attach a per-field confidence by averaging the confidences of the words that produced the match.

The patterns below are illustrative; tune them to your sponsor’s document templates. Note that OCR routinely confuses O/0, 1/l/I, and S/5, so value patterns should be permissive and you should normalize before validating.
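
As a sketch of the normalization step, the helper below maps common look-alike letters to digits, but only for a field that is known to be numeric. The four-digit, zero-padded site ID format is a hypothetical sponsor convention, and the function names are illustrative.

```python
from __future__ import annotations

import re

# OCR look-alikes mapped to digits; apply only where a digit is expected.
_DIGIT_LOOKALIKES = str.maketrans({"O": "0", "o": "0", "I": "1", "l": "1", "S": "5"})


def normalize_numeric(value: str) -> str:
    """Replace common letter/digit confusions in a value expected to be numeric."""
    return value.translate(_DIGIT_LOOKALIKES)


def normalize_site_id(raw: str) -> str | None:
    """Return a zero-padded 4-digit site ID, or None if it is still not numeric.

    The 4-digit format is an assumed convention for illustration.
    """
    candidate = normalize_numeric(raw.strip()).replace("-", "")
    return candidate.zfill(4) if re.fullmatch(r"\d{1,4}", candidate) else None


print(normalize_site_id("l0S"))   # 0105
print(normalize_site_id("O12I"))  # 0121
print(normalize_site_id("AB12"))  # None
```

Do not apply this mapping blindly to alphanumeric values such as protocol numbers, where O, I, and S may be legitimate letters. It is also worth keeping the raw OCR value alongside the normalized one so a reviewer can see what was changed.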

"""Anchored, confidence-aware metadata extraction from OCR words."""
from __future__ import annotations

import re
from dataclasses import dataclass

# Each pattern has one capturing group for the value. Anchors are the field
# labels as they appear on the form; values are kept permissive on purpose.
FIELD_PATTERNS: dict[str, re.Pattern[str]] = {
    "protocol_number": re.compile(
        r"protocol\s*(?:no\.?|number|#)?\s*[:\-]?\s*([A-Z0-9][A-Z0-9\-]{3,20})",
        re.IGNORECASE,
    ),
    "site_id": re.compile(
        r"site\s*(?:id|no\.?|number|#)\s*[:\-]?\s*([A-Z0-9\-]{2,12})",
        re.IGNORECASE,
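    ),
    # Only the label is case-insensitive. The \b anchors keep "pi" from
    # matching inside words such as "hospital", and each name part must
    # start with a capital letter.
    "investigator": re.compile(
        r"\b(?i:principal\s+investigator|investigator\s+name|pi)\b\s*[:\-]?\s*"
        r"([A-Z][A-Za-z.\-]+(?:\s+[A-Z][A-Za-z.\-]+){1,3})"
    ),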
    # ISO, US, and dd-Mon-yyyy (hyphen- or space-separated) date styles
    # seen on clinical forms.
    "date": re.compile(
        r"(\d{4}-\d{2}-\d{2}|\d{1,2}[/-]\d{1,2}[/-]\d{2,4}"
        r"|\d{1,2}(?:\s+|-)[A-Za-z]{3,9}(?:\s+|-)\d{4})"
    ),
}


@dataclass(frozen=True)
class Field:
    name: str
    value: str
    confidence: float  # mean confidence of the words spanning the match


def _spanning_confidence(words, start: int, end: int, joined: str) -> float:
    """Mean confidence of words whose text falls inside [start, end)."""
    cursor = 0
    confidences: list[float] = []
    for word in words:
        token_start = joined.find(word.text, cursor)
        if token_start == -1:
            continue
        token_end = token_start + len(word.text)
        cursor = token_end
        if token_start < end and token_end > start:
            confidences.append(word.confidence)
    return round(sum(confidences) / len(confidences), 2) if confidences else 0.0


def extract_fields(words) -> dict[str, Field]:
    """Apply anchored patterns to OCR words and score each match."""
    joined = " ".join(word.text for word in words)
    fields: dict[str, Field] = {}
    for name, pattern in FIELD_PATTERNS.items():
        match = pattern.search(joined)
        if match is None:
            continue
        value = match.group(1).strip()
        confidence = _spanning_confidence(words, match.start(1), match.end(1), joined)
        fields[name] = Field(name=name, value=value, confidence=confidence)
    return fields

Step 6: Route by confidence and write the audit trail

Extraction confidence drives routing. High-confidence fields can flow straight to the EDC/CTMS; anything below threshold goes to a human-in-the-loop review queue. Critically, nothing is dropped silently—every page produces an immutable audit record whether it succeeds, gets queued, or fails.

Field confidence          Action
≥ 90                      Auto-accept; commit to downstream system
70 to < 90                Accept value but flag for spot-check
< 70 or field missing     Route to human review queue
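
The three tiers can be sketched as a small routing function. This is a minimal illustration that reuses the 90 and 70 thresholds from the table; the Route enum and route_field name are hypothetical, and None stands in for a missing field.

```python
from __future__ import annotations

from enum import Enum

AUTO_ACCEPT = 90.0
REVIEW_FLOOR = 70.0


class Route(str, Enum):
    AUTO_ACCEPT = "auto_accept"
    SPOT_CHECK = "spot_check"
    HUMAN_REVIEW = "human_review"


def route_field(confidence: float | None) -> Route:
    """Map a field confidence (None means the field is missing) to a tier."""
    if confidence is None or confidence < REVIEW_FLOOR:
        return Route.HUMAN_REVIEW
    if confidence < AUTO_ACCEPT:
        return Route.SPOT_CHECK
    return Route.AUTO_ACCEPT


print(route_field(93.5).value)  # auto_accept
print(route_field(89.9).value)  # spot_check
print(route_field(None).value)  # human_review
```

Using half-open ranges means a value such as 89.9 always lands in exactly one tier.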

The audit record is the regulatory backbone. 21 CFR Part 11 (§11.10(e)) calls for secure, computer-generated, time-stamped audit trails that record who did what and when. ALCOA+ adds that data be Attributable, Legible, Contemporaneous, Original, Accurate, Complete, Consistent, Enduring, and Available. We support this by hashing the source PDF, the rendered image, and the extracted payload, and by appending one record per page to an append-only JSON Lines log.

"""Confidence routing and an append-only ALCOA+ audit record."""
from __future__ import annotations

import hashlib
import json
import os
from datetime import datetime, timezone
from pathlib import Path

from pydantic import BaseModel, Field as PydanticField

AUTO_ACCEPT = 90.0
REVIEW_FLOOR = 70.0


class PageExtraction(BaseModel):
    pdf_sha256: str
    image_sha256: str
    page_index: int
    used_ocr: bool
    fields: dict[str, dict[str, float | str]]
    needs_human_review: bool
    extracted_at_utc: str
    operator: str
    payload_sha256: str = PydanticField(default="")


def _sha256(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


def route_and_audit(
    pdf_path: Path,
    page_index: int,
    image_bytes: bytes,
    fields: dict,
    used_ocr: bool,
    audit_log: Path,
) -> PageExtraction:
    """Decide review routing and append an immutable audit record."""
    required = {"protocol_number", "site_id", "investigator"}
    serialized = {
        name: {"value": f.value, "confidence": f.confidence} for name, f in fields.items()
    }

    missing_required = required - serialized.keys()
    low_confidence = any(
        entry["confidence"] < REVIEW_FLOOR for entry in serialized.values()
    )
    needs_review = bool(missing_required) or low_confidence

    record = PageExtraction(
        pdf_sha256=_sha256(pdf_path.read_bytes()),
        image_sha256=_sha256(image_bytes),
        page_index=page_index,
        used_ocr=used_ocr,
        fields=serialized,
        needs_human_review=needs_review,
        extracted_at_utc=datetime.now(timezone.utc).isoformat(),
        # Attributable: read the operator/service identity from config, never hardcode.
        operator=os.environ.get("OCR_OPERATOR_ID", "unknown"),
    )
    record.payload_sha256 = _sha256(
        record.model_dump_json(exclude={"payload_sha256"}).encode("utf-8")
    )

    # Append-only: open in "a" mode and never rewrite prior lines (Enduring + Original).
    with audit_log.open("a", encoding="utf-8") as handle:
        handle.write(record.model_dump_json() + "\n")
    return record
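
One way to check later that the log still matches the archived sources is to re-hash the PDFs and compare them with the pdf_sha256 values in the log. The sketch below uses only the standard library and assumes the archived PDFs sit in a single directory; the function names and that layout are hypothetical.

```python
from __future__ import annotations

import hashlib
import json
from pathlib import Path


def sha256_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large PDFs are not loaded into memory at once."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def records_without_source(audit_log: Path, archive_dir: Path) -> list[dict]:
    """Return audit records whose pdf_sha256 matches no PDF in archive_dir."""
    known = {sha256_file(pdf) for pdf in archive_dir.glob("*.pdf")}
    orphans: list[dict] = []
    with audit_log.open(encoding="utf-8") as handle:
        for line in handle:
            if not line.strip():
                continue  # tolerate blank lines between records
            record = json.loads(line)
            if record.get("pdf_sha256") not in known:
                orphans.append(record)
    return orphans
```

A non-empty result means either an archived PDF has changed or gone missing, or a log line has been altered, and should be investigated rather than corrected in place.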

Tying the stages together

"""Top-level driver: detect, rasterize, OCR, extract, route, audit."""
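from __future__ import annotations

from pathlib import Path

from pypdf import PdfReader

# Assumes the names defined in Steps 1-6 are in scope: scanned_page_indices,
# rasterize_page, preprocess, assert_tesseract_ready, ocr_words, Word,
# extract_fields, route_and_audit, and PageExtraction.


def process_pdf(pdf_path: Path, audit_log: Path) -> list[PageExtraction]:
    assert_tesseract_ready(("eng",))
    ocr_pages = set(scanned_page_indices(pdf_path))
    records: list[PageExtraction] = []

    reader = PdfReader(str(pdf_path))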
    for page_index in range(len(reader.pages)):
        if page_index in ocr_pages:
            frame = rasterize_page(pdf_path, page_index)
            image = preprocess(frame)
            words = ocr_words(image, psm=6)
            image_bytes = image.tobytes()
            used_ocr = True
        else:
            # Text-layer page: reuse the parsing cluster's techniques instead.
            text = reader.pages[page_index].extract_text() or ""
            words = [Word(token, 100.0, 0, 0, 0, 0) for token in text.split()]
            image_bytes = text.encode("utf-8")
            used_ocr = False

        fields = extract_fields(words)
        records.append(
            route_and_audit(pdf_path, page_index, image_bytes, fields, used_ocr, audit_log)
        )
    return records

Validation, security, and PHI handling

  • Validate, don’t trust. After extraction, validate values against expected formats (a protocol number pattern, an ISO date) and against known reference lists (the site IDs assigned to this study). Reject or queue mismatches. The categorization patterns in Categorizing validation errors in regulatory document pipelines help you bucket and triage these failures.
  • Minimize PHI exposure. Scanned consent and source documents may contain subject PHI. Extract only the protocol-level metadata you need; do not persist full OCR text of pages that contain subject identifiers. If you must store raster crops for review, restrict them to the metadata regions.
  • No secrets in code. The operator identity, audit-log location, and any downstream credentials come from environment or config, never literals.
  • Treat the audit log as write-once. Store it on append-only or WORM-backed storage and verify integrity by re-hashing archived payloads on a schedule.
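
The first bullet can be sketched as a small validator that checks formats and a reference list. The protocol format and the site ID set below are hypothetical placeholders; substitute your study's real rules.

```python
from __future__ import annotations

import re
from datetime import date

# Hypothetical sponsor conventions; replace with your study's real rules.
PROTOCOL_FORMAT = re.compile(r"[A-Z]{2,4}-\d{3,6}")
STUDY_SITE_IDS = frozenset({"0101", "0102", "0215"})


def validation_errors(fields: dict[str, str]) -> list[str]:
    """Return human-readable problems; an empty list means the fields passed."""
    errors: list[str] = []

    protocol = fields.get("protocol_number", "")
    if not PROTOCOL_FORMAT.fullmatch(protocol):
        errors.append(f"protocol_number has unexpected format: {protocol!r}")

    site_id = fields.get("site_id", "")
    if site_id not in STUDY_SITE_IDS:
        errors.append(f"site_id not assigned to this study: {site_id!r}")

    raw_date = fields.get("date")
    if raw_date is not None:
        try:
            date.fromisoformat(raw_date)  # also rejects impossible dates
        except ValueError:
            errors.append(f"date is not a valid ISO date: {raw_date!r}")

    return errors


print(validation_errors({"protocol_number": "ABC-1234", "site_id": "0101", "date": "2024-01-15"}))
# []
```

Any non-empty result should send the page to the review queue even when OCR confidence is high, because a confidently misread value is still wrong.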

Tuning checklist

  • Confirmed pypdf text detection threshold against real scanned and born-digital samples
  • Rendering at 300 DPI minimum (400 for small-font fax artifacts)
  • Validated deskew on skewed scans and confirmed it is a no-op on clean pages
  • --oem 1 and the right --psm per region (6 for blocks, 7 for cropped fields)
  • Anchored regex tuned to sponsor templates, with O/0 and 1/l normalization
  • Confidence thresholds calibrated against a labeled validation set
  • Audit records written append-only with PDF, image, and payload hashes
  • PHI minimized at extraction; no subject identifiers persisted unnecessarily

FAQ

Why not just use --psm 3, Tesseract’s default?

--psm 3 (fully automatic page segmentation, no orientation and script detection) runs general-purpose layout analysis, which can split the columns and boxed fields common on clinical forms into separate blocks and pull labels away from their values. For full-page forms, --psm 6 treats the page as a single uniform block and is more predictable; for cropped single fields, --psm 7 is better. Test each mode on your own templates.

Should I ever fall back to the legacy engine with --oem 0?

No. The legacy pattern recognizer is less accurate than the LSTM engine, and it is not included in the LSTM-only tessdata_fast and tessdata_best models that many Tesseract 4+ installations use, so requesting it can fail outright. Standardize on --oem 1 and fail fast if a required language pack is missing.

How do I OCR non-English site documents?

Install the relevant language packs (for example deu, fra, jpn) and pass them to Tesseract via the lang argument, optionally combined like eng+deu. Verify availability at startup with pytesseract.get_languages() and refuse to run if a required pack is absent—silent language fallback degrades accuracy invisibly.
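
A minimal sketch of that startup check, written as a pure function so it can be tested without Tesseract installed. The function name is hypothetical; in the pipeline, the installed set would come from pytesseract.get_languages().

```python
def lang_argument(required: list[str], installed: set[str]) -> str:
    """Build Tesseract's lang value, refusing to run if any pack is missing."""
    missing = [lang for lang in required if lang not in installed]
    if missing:
        raise RuntimeError(f"Missing Tesseract language packs: {missing}")
    return "+".join(required)


# Example with a hand-written installed set (assumed, not queried from Tesseract).
print(lang_argument(["eng", "deu"], {"eng", "deu", "osd"}))  # eng+deu
```

The returned string can be passed as the lang argument to the OCR call.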

When should I prefer direct parsing over OCR?

Whenever a page has a real text layer. Direct extraction is faster, exact, and easier to validate. Use the detection step in this pipeline to route text-layer pages to the methods in PDF/DOCX Parsing for Clinical Docs, and reserve OCR for genuinely scanned pages.