Extracting Metadata from Scanned Clinical Trial PDFs Using Tesseract

Scanned clinical trial PDFs—site activation packets, signed 1572s, faxed lab certificates—often have no text layer, so a regex over the raw PDF returns nothing. This guide builds a production Python pipeline that detects missing text layers, rasterizes pages at 300+ DPI, preprocesses with OpenCV, runs Tesseract LSTM OCR with word-level confidence, and extracts protocol number, site ID, dates, and investigator name under an ALCOA+ audit trail.

This is a deep how-to under OCR & Metadata Extraction Pipelines, part of Automated Document Ingestion & Validation Workflows. For documents that already carry a digital text layer, the techniques in PDF/DOCX Parsing for Clinical Docs are faster and more accurate—use OCR only when no text layer exists. Every field this pipeline extracts must ultimately survive schema validation and land in the append-only audit log that 21 CFR Part 11 requires.

Why naive approaches fail

Three shortcuts look reasonable and quietly corrupt a regulated data set:

Running a regex over the raw PDF bytes. An image-only page has no embedded text stream, so extract_text() returns an empty string and every pattern misses. The scan looks like a document to a human but is a picture to the parser. You must detect the missing text layer and rasterize before any pattern can match.
Rasterizing at screen resolution. Tesseract’s LSTM engine works best when lowercase letters have an x-height of roughly 20–30 pixels, which for typical body text means about 300 DPI. A naive get_pixmap() call renders at PyMuPDF’s default of 72 DPI, far below that floor: character shapes collapse, and a faxed 1572 is read as l572 or 572. Sub-300 DPI rendering is one of the most common causes of silent OCR garbage on clinical scans.
Keyword matching instead of anchored capture. Asking “does the word PROTOCOL appear?” tells you a label exists but not which value belongs to it. On a form that names the protocol in a header, a footer, and a signature block, keyword logic cannot pick the right token. Regulated extraction needs a label anchor bound to a value pattern, plus a per-field confidence so uncertain reads are queued rather than trusted.
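The difference between a keyword check and an anchored capture can be sketched with the standard library alone. The sample text and the pattern below are illustrative assumptions, not a sponsor template; the point is only that the keyword test reports presence while the anchored pattern returns the value bound to the label.

```python
import re

# Hypothetical form text: the word "protocol" appears three times,
# but only one occurrence is a label followed by a value.
SAMPLE = (
    "PROTOCOL SUMMARY FORM\n"
    "Protocol No: ABC-12345\n"
    "Site ID: 0042\n"
    "Signed per protocol requirements\n"
)

# Keyword check: confirms the word exists somewhere, yields no value.
has_keyword = "protocol" in SAMPLE.lower()
keyword_hits = SAMPLE.lower().count("protocol")

# Anchored capture: the label must be followed by a value-shaped token.
ANCHORED = re.compile(
    r"protocol\s+no\.?\s*[:\-]?\s*([A-Z0-9][A-Z0-9\-]{3,20})",
    re.IGNORECASE,
)
match = ANCHORED.search(SAMPLE)
protocol_number = match.group(1) if match else None

print(has_keyword, keyword_hits, protocol_number)
```

The header and the signature line both satisfy the keyword test, yet neither satisfies the anchor, so only the labelled value is captured.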

OCR is therefore a last resort, not a default: slower, lossier, and harder to validate than direct text extraction. Reach for it only when a page genuinely lacks a recoverable text layer. The cheapest reliable signal is to try pypdf first and rasterize only the pages that come back empty.

Architecture overview

The pipeline is a linear flow with two branches: pages that already have text skip OCR entirely, and pages whose extracted fields fall below a confidence threshold divert to human review before their data is trusted. Every page, whether it succeeds, is queued, or fails, appends one immutable audit record.

The stages below mirror this flow: detect, rasterize, preprocess, OCR, score, extract, route, audit.
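As a rough sketch of that flow, the two branches can be written as a pair of small functions. The stage names come from the list above; the function names, the outcome labels, and the default threshold of 70 are assumptions chosen for illustration.

```python
# Stage names in pipeline order, as listed in the text.
FULL_FLOW = ["detect", "rasterize", "preprocess", "ocr", "score", "extract", "route", "audit"]

# Stages that only apply to pages without a text layer.
OCR_ONLY = {"rasterize", "preprocess", "ocr"}


def stages_for_page(has_text_layer: bool) -> list[str]:
    """Return the stages a page passes through (hypothetical helper)."""
    if has_text_layer:
        return [stage for stage in FULL_FLOW if stage not in OCR_ONLY]
    return list(FULL_FLOW)


def outcome(min_field_confidence: float, threshold: float = 70.0) -> str:
    """Second branch: low-confidence pages divert to human review."""
    return "queue_for_review" if min_field_confidence < threshold else "accept"


print(stages_for_page(has_text_layer=True))
print(stages_for_page(has_text_layer=False))
print(outcome(64.0), outcome(93.5))
```

Both paths end in the audit stage, which is the property the rest of the pipeline relies on.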

Setup and configuration

Install the maintained libraries and confirm the Tesseract binary and language packs are present on the host. pytesseract is only a thin wrapper—the engine itself is a system dependency.

# System engine (Debian/Ubuntu). Tesseract 4+ ships the LSTM engine.
sudo apt-get install -y tesseract-ocr tesseract-ocr-eng

# Python side. pypdf is the maintained successor to PyPDF2.
pip install "pypdf>=4.0" "PyMuPDF>=1.24" "opencv-python-headless>=4.9" \
            "pytesseract>=0.3.10" "pydantic>=2.6"

Deprecated library warning. Do not gate OCR with PyPDF2. The project was merged back into pypdf, and PyPDF2 is deprecated and no longer receives bug or security fixes. Use pypdf throughout, and treat any legacy import PyPDF2 in the codebase as technical debt to retire.

Configuration comes from the environment, never from literals in the source. The pipeline reads three values:

Variable	Purpose	Example
`OCR_OPERATOR_ID`	Attributable identity written into every audit record	`svc-ingest-prod`
`OCR_AUDIT_LOG`	Path to the append-only JSON Lines audit trail	`/var/audit/ocr.jsonl`
`OCR_MIN_DPI`	Floor for rasterization; refuse to run below it	`300`

Initialise structured logging once at startup so every stage emits to the same named logger:

"""Logging and configuration bootstrap for the OCR pipeline."""
from __future__ import annotations

import logging
import os
from pathlib import Path

logging.basicConfig(
    level=logging.INFO,
    format='{"ts":"%(asctime)s","level":"%(levelname)s","logger":"%(name)s","msg":"%(message)s"}',
)
LOGGER = logging.getLogger("clinical_ocr")

# Fail fast on missing required configuration rather than defaulting silently.
OPERATOR_ID = os.environ["OCR_OPERATOR_ID"]  # required: audit records must be attributable
AUDIT_LOG = Path(os.environ["OCR_AUDIT_LOG"])  # required: raise KeyError if unset
MIN_DPI = int(os.environ.get("OCR_MIN_DPI", "300"))  # optional: defaults to the 300 DPI floor
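One caveat with building JSON through a format string: a log message that contains a double quote or a newline yields a line that no longer parses as JSON. A minimal sketch of an alternative, assuming the same four field names, is a formatter that serializes through json.dumps. The names JsonFormatter and build_logger are hypothetical.

```python
import json
import logging


class JsonFormatter(logging.Formatter):
    """Serialize each record with json.dumps so quotes and newlines are escaped."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "ts": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "msg": record.getMessage(),
        }
        return json.dumps(payload)


def build_logger(name: str = "clinical_ocr") -> logging.Logger:
    """Attach a single JSON-emitting stream handler to the named logger."""
    logger = logging.getLogger(name)
    handler = logging.StreamHandler()
    handler.setFormatter(JsonFormatter())
    logger.handlers = [handler]
    logger.setLevel(logging.INFO)
    logger.propagate = False  # avoid duplicate lines via the root logger
    return logger


logger = build_logger()
logger.warning('OCR read "l572" on page %d', 3)
```

Each emitted line can then be parsed back with json.loads regardless of what the message contains.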

Full working implementation

Step 1: Detect pages with no text layer

Rasterizing every page wastes compute and degrades accuracy on pages that already have clean text. Use pypdf to inspect each page; treat a page as scanned when extract_text() returns negligible content.

"""Detect which PDF pages lack a usable text layer."""
from __future__ import annotations

import logging
from pathlib import Path

from pypdf import PdfReader

# A genuine text page usually yields well over this many non-space characters.
MIN_TEXT_CHARS = 20


def scanned_page_indices(pdf_path: Path) -> list[int]:
    """Return zero-based indices of pages that need OCR.

    A page is considered "scanned" when pypdf extracts fewer than
    MIN_TEXT_CHARS non-whitespace characters from it.
    """
    if not pdf_path.is_file():
        raise FileNotFoundError(f"PDF not found: {pdf_path}")

    reader = PdfReader(str(pdf_path))
    needs_ocr: list[int] = []
    for index, page in enumerate(reader.pages):
        try:
            text = page.extract_text() or ""
        except (ValueError, KeyError) as exc:
            # Malformed page object: extract_text can raise on broken
            # content streams. Treat as scanned and let OCR try.
            text = ""
            _log_extract_error(index, exc)
        if len("".join(text.split())) < MIN_TEXT_CHARS:
            needs_ocr.append(index)
    return needs_ocr


def _log_extract_error(index: int, exc: Exception) -> None:
    logging.getLogger("clinical_ocr").warning(
        "pypdf text extraction failed on page %d: %s", index, exc
    )

Step 2: Rasterize at 300 DPI or higher

Tesseract’s accuracy drops sharply below roughly 300 DPI, because the LSTM engine works best with an x-height of about 20–30 pixels. Render scanned pages at 300 DPI, or at 400 for small fonts and fax artifacts. Two solid options:

PyMuPDF (fitz) renders without external dependencies and is fast.
pdf2image wraps Poppler—reliable, but requires the poppler-utils system package.

This pipeline uses PyMuPDF to avoid a system dependency and to render single pages on demand (lower memory than converting a whole document at once).

"""Rasterize a single PDF page to an OpenCV BGR image at a target DPI."""
from __future__ import annotations

from pathlib import Path

import cv2
import fitz  # PyMuPDF
import numpy as np

DEFAULT_DPI = 300


def rasterize_page(pdf_path: Path, page_index: int, dpi: int = DEFAULT_DPI) -> np.ndarray:
    """Render one page to a BGR ndarray suitable for OpenCV and Tesseract."""
    if dpi < 300:
        raise ValueError("Use at least 300 DPI for reliable OCR accuracy")

    with fitz.open(str(pdf_path)) as doc:
        if not 0 <= page_index < doc.page_count:
            raise IndexError(f"Page {page_index} out of range for {pdf_path}")
        page = doc.load_page(page_index)
        # PDF user space is 72 units per inch, so dpi / 72 scales to the target DPI.
        matrix = fitz.Matrix(dpi / 72.0, dpi / 72.0)
        pix = page.get_pixmap(matrix=matrix, colorspace=fitz.csRGB, alpha=False)

    rgb = np.frombuffer(pix.samples, dtype=np.uint8).reshape(pix.height, pix.width, 3)
    return cv2.cvtColor(rgb, cv2.COLOR_RGB2BGR)
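The 300 DPI floor follows from simple arithmetic on glyph size. The sketch below assumes 10-point body text and an x-height of about half the font size; both numbers are assumptions that vary by typeface and form, so treat the result as an estimate rather than a measurement.

```python
POINTS_PER_INCH = 72.0


def x_height_pixels(font_pt: float, dpi: int, x_height_ratio: float = 0.5) -> float:
    """Estimate rendered x-height in pixels.

    x_height_ratio is an assumed typeface property (roughly 0.5 for many
    text fonts); measure it on your own templates if precision matters.
    """
    em_pixels = font_pt / POINTS_PER_INCH * dpi
    return em_pixels * x_height_ratio


for dpi in (72, 150, 300, 400):
    print(dpi, round(x_height_pixels(10.0, dpi), 1))
```

Under these assumptions, 72 DPI gives an x-height of about 5 pixels, 300 DPI gives about 21, and 400 DPI gives about 28, which is why 300 is the floor and 400 helps with small print.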

Step 3: Preprocess with OpenCV

Raw scans carry skew, uneven lighting, speckle, and fax compression noise. A focused preprocessing chain—grayscale, deskew, denoise, adaptive threshold—repairs these before OCR. Order matters: deskew on the grayscale image (rotation interpolates cleanly on continuous tones), then denoise, then binarize.

"""OpenCV preprocessing for degraded clinical scans."""
from __future__ import annotations

import cv2
import numpy as np


def _estimate_skew_angle(gray: np.ndarray) -> float:
    """Estimate page skew in degrees using the minimum-area rectangle.

    Returns a small angle in roughly [-45, 45]; 0.0 when no foreground
    pixels are found.
    """
    # Foreground = dark text on light paper, so invert before thresholding.
    inverted = cv2.bitwise_not(gray)
    _, binary = cv2.threshold(inverted, 0, 255, cv2.THRESH_BINARY | cv2.THRESH_OTSU)
    coords = cv2.findNonZero(binary)
    if coords is None:
        return 0.0

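    angle = cv2.minAreaRect(coords)[-1]
    # The reported range depends on the OpenCV version: older builds return a
    # non-positive angle, while OpenCV 4.5 and later return one in (0, 90].
    # Fold both conventions into a small rotation in [-45, 45].
    if angle < -45:
        angle += 90
    elif angle > 45:
        angle -= 90
    return float(angle)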


def _deskew(gray: np.ndarray, max_correction_deg: float = 15.0) -> np.ndarray:
    """Rotate the image to remove skew, capping absurd corrections."""
    angle = _estimate_skew_angle(gray)
    if abs(angle) < 0.1 or abs(angle) > max_correction_deg:
        # Tiny skew is not worth interpolating; huge angles usually mean
        # a misdetection (e.g. a landscape form) and should be left alone.
        return gray

    height, width = gray.shape
    center = (width / 2.0, height / 2.0)
    rotation = cv2.getRotationMatrix2D(center, angle, 1.0)
    return cv2.warpAffine(
        gray,
        rotation,
        (width, height),
        flags=cv2.INTER_CUBIC,
        borderMode=cv2.BORDER_REPLICATE,
    )


def preprocess(frame: np.ndarray) -> np.ndarray:
    """Grayscale, deskew, denoise, and adaptively threshold a scan."""
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    deskewed = _deskew(gray)
    # Edge-preserving denoise removes fax speckle without smearing strokes.
    denoised = cv2.fastNlMeansDenoising(deskewed, h=10, templateWindowSize=7, searchWindowSize=21)
    # Adaptive (local) threshold handles uneven scanner lighting better than
    # a single global Otsu cut.
    binary = cv2.adaptiveThreshold(
        denoised,
        maxValue=255,
        adaptiveMethod=cv2.ADAPTIVE_THRESH_GAUSSIAN_C,
        thresholdType=cv2.THRESH_BINARY,
        blockSize=31,
        C=10,
    )
    return binary

A few practical notes:

Do not over-clean. Aggressive morphology can erode thin strokes until a digit vanishes (1572 becomes 572) or dilate neighboring glyphs until they merge. Validate on a held-out set of real scans.
blockSize must be odd and large enough to span a full glyph at your DPI; 31 works well at 300 DPI.
Skip deskew on born-clean renders (text-layer pages you chose to rasterize anyway)—the skew estimate adds noise with no benefit.
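If you render at more than one resolution, the block size has to follow the DPI. One possible helper, assuming that scaling the 300 DPI value of 31 proportionally is good enough as a starting point (an assumption to confirm on real scans), keeps the result odd as adaptive thresholding requires:

```python
def odd_block_size(dpi: int, base_size: int = 31, base_dpi: int = 300) -> int:
    """Scale the adaptive-threshold block size with DPI and force it odd.

    Proportional scaling is an assumption, not a tuned rule; the only hard
    requirement is that the block size is odd and greater than 1.
    """
    scaled = round(base_size * dpi / base_dpi)
    if scaled % 2 == 0:
        scaled += 1
    return max(scaled, 3)


for dpi in (300, 400, 600):
    print(dpi, odd_block_size(dpi))
```

The returned value can be passed as blockSize to cv2.adaptiveThreshold in place of the literal 31.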

Step 4: OCR with word-level confidence

Use image_to_data rather than image_to_string. It returns per-word bounding boxes and confidence scores, which are essential for routing low-confidence fields to human review and for building the audit trail. Run the LSTM engine with --oem 1. For full-page forms, --psm 6 (assume a single uniform block of text) is a robust default; for narrow cropped fields, --psm 7 (single line) is better.

Tesseract reports word confidence on a 0–100 scale and emits -1 for layout boxes with no recognized text. Filter those out before aggregating.

"""Run Tesseract and return clean, confidence-bearing words."""
from __future__ import annotations

import shutil
from dataclasses import dataclass

import numpy as np
import pytesseract


@dataclass(frozen=True)
class Word:
    text: str
    confidence: float  # 0-100
    left: int
    top: int
    width: int
    height: int


def assert_tesseract_ready(required_langs: tuple[str, ...] = ("eng",)) -> None:
    """Fail fast if Tesseract or a required language pack is missing."""
    if shutil.which("tesseract") is None:
        raise RuntimeError("Tesseract binary not found on PATH")
    installed = set(pytesseract.get_languages(config=""))
    missing = [lang for lang in required_langs if lang not in installed]
    if missing:
        raise RuntimeError(f"Missing Tesseract language packs: {missing}")


def ocr_words(image: np.ndarray, psm: int = 6, lang: str = "eng") -> list[Word]:
    """OCR an image and return words with valid (non -1) confidence."""
    config = f"--oem 1 --psm {psm}"
    data = pytesseract.image_to_data(
        image, lang=lang, config=config, output_type=pytesseract.Output.DICT
    )

    words: list[Word] = []
    for text, conf, left, top, width, height in zip(
        data["text"], data["conf"], data["left"], data["top"], data["width"], data["height"]
    ):
        cleaned = text.strip()
        confidence = float(conf)
        if cleaned and confidence >= 0:
            words.append(Word(cleaned, confidence, int(left), int(top), int(width), int(height)))
    return words
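To see the filter in isolation, without a Tesseract install, the same logic can be run over a hand-written dictionary shaped like the Output.DICT result. The sample values are invented for illustration, and the confidences are given as strings on the assumption that some pytesseract versions return them that way.

```python
from statistics import mean

# Invented sample in the shape of image_to_data(..., output_type=Output.DICT).
# Layout-only boxes carry empty text and a confidence of -1.
sample = {
    "text": ["", "Protocol", "No:", "ABC-12345", "  "],
    "conf": ["-1", "96", "91", "88.5", "-1"],
}


def usable_words(data: dict) -> list[tuple[str, float]]:
    """Keep only boxes with recognized text and a non-negative confidence."""
    words = []
    for text, conf in zip(data["text"], data["conf"]):
        cleaned = text.strip()
        confidence = float(conf)  # handles both "96" and 96
        if cleaned and confidence >= 0:
            words.append((cleaned, confidence))
    return words


words = usable_words(sample)
page_mean = mean(confidence for _, confidence in words)
print(words, round(page_mean, 2))
```

Leaving the -1 boxes in would drag the mean down and could push a clean page into the review queue.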

Step 5: Extract metadata with anchored regex

Reconstruct the OCR text in reading order, then use anchored regular expressions: a label anchor (Protocol No.) immediately followed by a value pattern. This captures the value, not just the presence of a label, and lets you attach a per-field confidence by averaging the confidences of the words that produced the match.

The patterns below are illustrative; tune them to your sponsor’s document templates. Note that OCR routinely confuses O/0, 1/l/I, and S/5, so value patterns should be permissive and you should normalize before validating.

"""Anchored, confidence-aware metadata extraction from OCR words."""
from __future__ import annotations

import re
from dataclasses import dataclass

# Each pattern has one capturing group for the value. Anchors are the field
# labels as they appear on the form; values are kept permissive on purpose.
FIELD_PATTERNS: dict[str, re.Pattern[str]] = {
    "protocol_number": re.compile(
        r"protocol\s*(?:no\.?|number|#)?\s*[:\-]?\s*([A-Z0-9][A-Z0-9\-]{3,20})",
        re.IGNORECASE,
    ),
    "site_id": re.compile(
        r"site\s*(?:id|no\.?|number|#)\s*[:\-]?\s*([A-Z0-9\-]{2,12})",
        re.IGNORECASE,
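    ),
    "investigator": re.compile(
        # Case-insensitive label with word boundaries, so "pi" cannot match
        # inside another word. The name stays case-sensitive (capitalized tokens).
        r"\b(?i:principal\s+investigator|investigator\s+name|pi)\b\s*[:\-]?\s*"
        # Stop before a token that is itself a label (a word followed by a colon).
        r"([A-Z][A-Za-z.\-]+(?:\s+(?![A-Za-z]+\s*:)[A-Z][A-Za-z.\-]+){1,3})",
    ),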
    # ISO, US, and dd-Mon-yyyy date styles seen on clinical forms.
    "date": re.compile(
        r"(\d{4}-\d{2}-\d{2}|\d{1,2}[/-]\d{1,2}[/-]\d{2,4}"
        r"|\d{1,2}\s+[A-Za-z]{3,9}\s+\d{4})"
    ),
}


@dataclass(frozen=True)
class Field:
    name: str
    value: str
    confidence: float  # mean confidence of the words spanning the match


def _spanning_confidence(words, start: int, end: int, joined: str) -> float:
    """Mean confidence of words whose text falls inside [start, end)."""
    cursor = 0
    confidences: list[float] = []
    for word in words:
        token_start = joined.find(word.text, cursor)
        if token_start == -1:
            continue
        token_end = token_start + len(word.text)
        cursor = token_end
        if token_start < end and token_end > start:
            confidences.append(word.confidence)
    return round(sum(confidences) / len(confidences), 2) if confidences else 0.0


def extract_fields(words) -> dict[str, Field]:
    """Apply anchored patterns to OCR words and score each match."""
    joined = " ".join(word.text for word in words)
    fields: dict[str, Field] = {}
    for name, pattern in FIELD_PATTERNS.items():
        match = pattern.search(joined)
        if match is None:
            continue
        value = match.group(1).strip()
        confidence = _spanning_confidence(words, match.start(1), match.end(1), joined)
        fields[name] = Field(name=name, value=value, confidence=confidence)
    return fields
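The normalization step mentioned before the patterns can be sketched as follows. The confusable map covers only the pairs named in the text, and the protocol layout (a prefix, a hyphen, then digits) is a hypothetical template; adapt both to the formats your sponsor actually uses.

```python
# Confusable pairs named in the text: O/0, l and I/1, S/5.
_CONFUSABLES = str.maketrans({"O": "0", "o": "0", "l": "1", "I": "1", "S": "5"})


def normalize_numeric(segment: str) -> str:
    """Apply the confusable map to a segment that is expected to be all digits."""
    return segment.translate(_CONFUSABLES)


def normalize_protocol(value: str) -> str:
    """Normalize only the digit run after the last hyphen.

    Assumes a hypothetical PREFIX-DIGITS layout such as ABC-12345, so the
    alphabetic prefix is left untouched.
    """
    prefix, sep, tail = value.rpartition("-")
    if not sep:
        return value  # no hyphen: layout unknown, leave the value alone
    return f"{prefix}-{normalize_numeric(tail)}"


print(normalize_numeric("OO42"))
print(normalize_protocol("SOS-l23O5"))
```

Restricting the map to the numeric segment matters: applied to the whole value, it would turn the letters in a prefix such as SOS into digits.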

Step 6: Route by confidence and write the audit trail

Extraction confidence drives routing. High-confidence fields can flow straight to the EDC/CTMS; anything below threshold goes to a human-in-the-loop review queue. Critically, nothing is dropped silently—every page produces an immutable audit record whether it succeeds, gets queued, or fails.

Field confidence	Action
≥ 90	Auto-accept; commit to downstream system
70–89	Accept value but flag for spot-check
< 70 or field missing	Route to human review queue
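The three tiers in the table map onto a small pure function. This is a sketch under the table's thresholds; the function name and the returned labels are hypothetical, and a missing field is represented as None.

```python
from typing import Optional

AUTO_ACCEPT = 90.0
REVIEW_FLOOR = 70.0


def route_field(confidence: Optional[float]) -> str:
    """Map a field confidence (or None for a missing field) to an action."""
    if confidence is None or confidence < REVIEW_FLOOR:
        return "human_review"
    if confidence < AUTO_ACCEPT:
        return "accept_with_spot_check"
    return "auto_accept"


for value in (None, 64.0, 70.0, 89.5, 90.0):
    print(value, route_field(value))
```

Because confidences are floats, the middle tier is treated as the half-open range from 70 up to, but not including, 90.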

The audit record is the regulatory backbone. 21 CFR Part 11 calls for secure, computer-generated, time-stamped audit trails that record who did what and when. The ALCOA+ data-integrity principles add that data be Attributable, Legible, Contemporaneous, Original, Accurate, Complete, Consistent, Enduring, and Available. The pipeline supports both by hashing the source PDF, the rendered image, and the extracted payload, and by writing one JSON Lines record per page to an append-only log.

"""Confidence routing and an append-only ALCOA+ audit record."""
from __future__ import annotations

import hashlib
import os
from datetime import datetime, timezone
from pathlib import Path

from pydantic import BaseModel, Field as PydanticField

AUTO_ACCEPT = 90.0
REVIEW_FLOOR = 70.0


class PageExtraction(BaseModel):
    pdf_sha256: str
    image_sha256: str
    page_index: int
    used_ocr: bool
    fields: dict[str, dict[str, float | str]]
    needs_human_review: bool
    extracted_at_utc: str
    operator: str
    payload_sha256: str = PydanticField(default="")


def _sha256(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


def route_and_audit(
    pdf_path: Path,
    page_index: int,
    image_bytes: bytes,
    fields: dict,
    used_ocr: bool,
    audit_log: Path,
) -> PageExtraction:
    """Decide review routing and append an immutable audit record."""
    required = {"protocol_number", "site_id", "investigator"}
    serialized = {
        name: {"value": f.value, "confidence": f.confidence} for name, f in fields.items()
    }

    missing_required = required - serialized.keys()
    low_confidence = any(
        entry["confidence"] < REVIEW_FLOOR for entry in serialized.values()
    )
    needs_review = bool(missing_required) or low_confidence

    record = PageExtraction(
        pdf_sha256=_sha256(pdf_path.read_bytes()),
        image_sha256=_sha256(image_bytes),
        page_index=page_index,
        used_ocr=used_ocr,
        fields=serialized,
        needs_human_review=needs_review,
        extracted_at_utc=datetime.now(timezone.utc).isoformat(),
        # Attributable: read the operator/service identity from config, never hardcode.
        operator=os.environ.get("OCR_OPERATOR_ID", "unknown"),
    )
    record.payload_sha256 = _sha256(
        record.model_dump_json(exclude={"payload_sha256"}).encode("utf-8")
    )

    # Append-only: open in "a" mode and never rewrite prior lines (Enduring + Original).
    with audit_log.open("a", encoding="utf-8") as handle:
        handle.write(record.model_dump_json() + "\n")
    return record
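The stored hashes only help if something re-checks them later. A minimal sketch of such a check for the source document, assuming the archived PDF is available on disk and the record has been parsed from its JSON line into a dict, could look like this. The helper names are hypothetical, and the demonstration uses a throwaway file in place of a real archive.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large PDFs are not read into memory at once."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def source_matches(record: dict, archived_pdf: Path) -> bool:
    """True when the archived PDF still hashes to the recorded pdf_sha256."""
    return record.get("pdf_sha256") == sha256_file(archived_pdf)


# Demonstration with a temporary file standing in for an archived PDF.
with tempfile.TemporaryDirectory() as tmp:
    pdf = Path(tmp) / "src.pdf"
    pdf.write_bytes(b"%PDF-1.4 fixture")
    record = {"pdf_sha256": sha256_file(pdf)}
    intact = source_matches(record, pdf)
    pdf.write_bytes(b"%PDF-1.4 tampered")
    tampered = source_matches(record, pdf)

print(intact, tampered)
```

A scheduled job can run the same comparison over every line of the audit log and alert on any mismatch.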

Tying the stages together

"""Top-level driver: detect, rasterize, OCR, extract, route, audit."""
from pathlib import Path

from pypdf import PdfReader


def process_pdf(pdf_path: Path, audit_log: Path) -> list[PageExtraction]:
    assert_tesseract_ready(("eng",))
    ocr_pages = set(scanned_page_indices(pdf_path))
    records: list[PageExtraction] = []

    reader = PdfReader(str(pdf_path))
    for page_index in range(len(reader.pages)):
        if page_index in ocr_pages:
            frame = rasterize_page(pdf_path, page_index)
            image = preprocess(frame)
            words = ocr_words(image, psm=6)
            image_bytes = image.tobytes()
            used_ocr = True
        else:
            # Text-layer page: reuse the parsing techniques instead of OCR.
            text = reader.pages[page_index].extract_text() or ""
            words = [Word(token, 100.0, 0, 0, 0, 0) for token in text.split()]
            image_bytes = text.encode("utf-8")
            used_ocr = False

        fields = extract_fields(words)
        records.append(
            route_and_audit(pdf_path, page_index, image_bytes, fields, used_ocr, audit_log)
        )
    return records

Validation and edge-case handling

Validate, don’t trust. After extraction, validate values against expected formats (a protocol number pattern, an ISO date) and against known reference lists (the site IDs assigned to this study). Reject or queue mismatches. The categorization patterns in Categorizing validation errors in regulatory document pipelines help you bucket and triage these failures.
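Normalize OCR confusables before validating. Map O→0 and l/I→1 only inside the segments a template defines as numeric, such as a site ID or the digit run of a protocol number, before you check them against reference lists. Without this step valid values are rejected as malformed; applied to alphabetic segments, it corrupts genuine letters.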
Minimize PHI exposure. Scanned consent and source documents may contain subject PHI. Extract only the protocol-level metadata you need; do not persist full OCR text of pages that contain subject identifiers. If you must store raster crops for review, restrict them to the metadata regions.
No secrets in code. The operator identity, audit-log location, and any downstream credentials come from environment or config, never literals.
Treat the audit log as write-once. Store it on append-only or WORM-backed storage and verify integrity by re-hashing archived payloads on a schedule.
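The first point above, checking formats and reference lists, can be sketched with the standard library. The protocol pattern and the set of site IDs are hypothetical placeholders for a study's real template and site roster, and the function returns a list of problems rather than raising so that callers can queue the page for review.

```python
import re
from datetime import date

# Hypothetical template and reference list; replace with the study's own.
PROTOCOL_RE = re.compile(r"[A-Z]{2,5}-\d{4,6}")
KNOWN_SITE_IDS = frozenset({"0041", "0042", "0107"})


def validate(fields: dict[str, str]) -> list[str]:
    """Return human-readable validation errors; an empty list means valid."""
    errors = []
    if not PROTOCOL_RE.fullmatch(fields.get("protocol_number", "")):
        errors.append("protocol_number: unexpected format")
    if fields.get("site_id") not in KNOWN_SITE_IDS:
        errors.append("site_id: not assigned to this study")
    try:
        date.fromisoformat(fields.get("date", ""))
    except ValueError:
        errors.append("date: not an ISO 8601 calendar date")
    return errors


good = {"protocol_number": "ABC-12345", "site_id": "0042", "date": "2026-03-14"}
bad = {"protocol_number": "ABC-12345", "site_id": "9999", "date": "14/03/2026"}
print(validate(good))
print(validate(bad))
```

Each error string names the field first, which makes it straightforward to bucket failures by field when triaging.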

Testing and verification

Regression-test the pipeline against a small fixture set of real (de-identified) scans so a preprocessing tweak that helps one template cannot silently break another. For fast unit tests, you can also synthesize a scanned page on the fly by rendering text to an image, which needs no binary fixtures, and assert that anchored extraction recovers the known values with usable confidence.

"""pytest checks for the scanned-PDF extraction pipeline."""
from __future__ import annotations

import numpy as np
import pytest
from PIL import Image, ImageDraw, ImageFont


def _render_form(lines: list[str], size=(1000, 600)) -> np.ndarray:
    """Render text lines to a white BGR image at a legible size."""
    img = Image.new("RGB", size, "white")
    draw = ImageDraw.Draw(img)
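    # The size argument needs Pillow 10.1 or newer; on older versions,
    # load a TrueType font with ImageFont.truetype() instead.
    font = ImageFont.load_default(size=28)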
    for i, line in enumerate(lines):
        draw.text((40, 40 + i * 60), line, fill="black", font=font)
    rgb = np.asarray(img)
    return rgb[:, :, ::-1].copy()  # RGB -> BGR for the OpenCV path


def test_anchored_extraction_recovers_known_fields() -> None:
    frame = _render_form(
        [
            "Protocol No: ABC-12345",
            "Site ID: 0042",
            "Principal Investigator: Jane A. Smith",
            "Date: 2026-03-14",
        ]
    )
    words = ocr_words(preprocess(frame), psm=6)
    fields = extract_fields(words)

    assert fields["protocol_number"].value == "ABC-12345"
    assert fields["site_id"].value == "0042"
    assert "Smith" in fields["investigator"].value
    assert fields["date"].value == "2026-03-14"
    # Clean synthetic text should score well above the review floor.
    assert fields["protocol_number"].confidence >= REVIEW_FLOOR


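def test_low_dpi_render_is_rejected() -> None:
    from pathlib import Path  # local import keeps this snippet self-contained

    # The DPI guard fires before the file is opened, so the path need not exist.
    with pytest.raises(ValueError):
        rasterize_page(Path("unused.pdf"), 0, dpi=150)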


def test_negative_confidence_boxes_are_filtered() -> None:
    # -1 confidence layout boxes must never reach the extractor.
    blank = np.full((200, 600), 255, dtype=np.uint8)
    words = ocr_words(blank, psm=6)
    assert all(w.confidence >= 0 for w in words)


def test_audit_record_is_append_only(tmp_path) -> None:
    log = tmp_path / "audit.jsonl"
    log.write_text('{"prior":"record"}\n', encoding="utf-8")
    pdf = tmp_path / "src.pdf"
    pdf.write_bytes(b"%PDF-1.4 fixture")

    route_and_audit(pdf, 0, b"img", {}, used_ocr=True, audit_log=log)

    lines = log.read_text(encoding="utf-8").splitlines()
    assert lines[0] == '{"prior":"record"}'  # earlier lines untouched
    assert len(lines) == 2

Tuning checklist

Confirmed pypdf text detection threshold against real scanned and born-digital samples
Rendering at 300 DPI minimum (400 for small-font fax artifacts)
Validated deskew on skewed scans and confirmed it is a no-op on clean pages
--oem 1 and the right --psm per region (6 for blocks, 7 for cropped fields)
Anchored regex tuned to sponsor templates, with O/0 and 1/l normalization
Confidence thresholds calibrated against a labeled validation set
Audit records written append-only with PDF, image, and payload hashes
PHI minimized at extraction; no subject identifiers persisted unnecessarily

FAQ

Why not just use `--psm 3`, Tesseract’s default?

--psm 3 (fully automatic page segmentation, no orientation detection) assumes a generic document and often reads across the columns and boxed fields common on clinical forms. For full-page forms, --psm 6 treats the page as a single uniform block and is more predictable; for cropped single fields, --psm 7 is better. Test both on your templates.

Should I ever fall back to the legacy engine with `--oem 0`?

No. The legacy pattern recognizer is less accurate than the LSTM engine, and the LSTM-only tessdata_fast and tessdata_best models commonly packaged for Tesseract 4 and later do not include it, so requesting it can fail outright. Standardize on --oem 1 and fail fast if a required language pack is missing.

How do I OCR non-English site documents?

Install the relevant language packs (for example deu, fra, jpn) and pass them to Tesseract via the lang argument, optionally combined like eng+deu. Verify availability at startup with pytesseract.get_languages() and refuse to run if a required pack is absent—silent language fallback degrades accuracy invisibly.
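The startup check for a combined language string can be kept separate from Tesseract itself, which makes it easy to test. In this sketch the installed set is passed in as a plain argument; in the pipeline it would come from pytesseract.get_languages(). The function name is hypothetical.

```python
def missing_languages(lang_spec: str, installed: set[str]) -> list[str]:
    """Return the codes in a spec such as 'eng+deu' that are not installed."""
    requested = [code for code in lang_spec.split("+") if code]
    return [code for code in requested if code not in installed]


# Example with a hand-written installed set instead of a live Tesseract query.
installed = {"eng", "osd"}
print(missing_languages("eng+deu", installed))
print(missing_languages("eng", installed))
```

Refusing to start when the returned list is non-empty gives the fail-fast behavior described above.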

When should I prefer direct parsing over OCR?

Whenever a page has a real text layer. Direct extraction is faster, exact, and easier to validate. Use the detection step in this pipeline to route text-layer pages to the methods in PDF/DOCX Parsing for Clinical Docs, and reserve OCR for genuinely scanned pages.

Related

OCR & Metadata Extraction Pipelines — the pipeline architecture, engine choice, and library landscape this build implements.
PDF/DOCX Parsing for Clinical Docs — the faster, exact path for pages that already carry a digital text layer.
Categorizing validation errors in regulatory document pipelines — how to bucket and triage the field mismatches this extractor surfaces.
Handling Async Batch Processing for Multi-Site Document Ingestion — running this OCR path as one bounded stage across hundreds of site packets.
Automated Document Ingestion & Validation Workflows — how OCR fits the wider ingestion and submission lifecycle.
