PDF Evidence Table Extraction Workflow Design

Project Goal

Develop a repeatable workflow that extracts structured evidence tables from ARS (American Radium Society) and similar clinical guideline PDFs and writes the data to Excel (.xlsx).


Initial Analysis (December 20, 2025)

Source Files Analyzed

File                               Size     Pages   Description
2021_ARS_ReRT_Full_2021_09_1.pdf   1.3 MB   64      Full ARS guideline on head/neck cancer re-irradiation
ReRT_Evidence_Table.pdf            121 KB   7       Extracted "Supplemental Table 1: Evidence Table"

Evidence Table Structure

The evidence table contains clinical study data with the following columns:

Column              Data Type     Notes
Reference           Text          Author, Year + superscript reference number
Study Type          Text          RCT, MA, SR, SAT, RMI, RSI
Topic/Objective     Text          Multi-line possible
Disease             Text          Cancer type/site
Arm(s)/Cohort(s)    Text          Treatment descriptions
N                   Text/Number   Sample size, sometimes "X studies, Y patients"
Median FU (Mo.)     Text          Follow-up duration
Results             Text          Outcomes - longest field
Study Quality       Number        1-4 scale
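
The column list above could be captured as a row schema with a few basic checks. The following is a minimal sketch, not a settled design: the class name `EvidenceRow`, the field names, and the function `validate_row` are hypothetical, and the checks cover only what the table states (the study-type codes and the 1-4 quality scale).

```python
from dataclasses import dataclass

# Study-type codes exactly as listed in the column table.
STUDY_TYPES = {"RCT", "MA", "SR", "SAT", "RMI", "RSI"}


@dataclass
class EvidenceRow:
    # Field names are illustrative; they mirror the columns one-to-one.
    reference: str          # "Author, Year" plus reference number
    study_type: str         # expected to be one of STUDY_TYPES
    topic_objective: str    # may span several lines in the PDF
    disease: str
    arms_cohorts: str
    n: str                  # kept as text: may read "X studies, Y patients"
    median_fu_months: str   # kept as text, as in the column table
    results: str
    study_quality: int      # 1-4 scale


def validate_row(row: EvidenceRow) -> list[str]:
    """Return a list of problems found; an empty list means the row passed."""
    problems = []
    if not row.reference.strip():
        problems.append("reference is empty")
    if row.study_type not in STUDY_TYPES:
        problems.append(f"unknown study type: {row.study_type!r}")
    if not 1 <= row.study_quality <= 4:
        problems.append(f"study quality out of range: {row.study_quality}")
    return problems
```

Keeping N and Median FU as text avoids losing entries such as "X studies, Y patients"; numeric parsing can be layered on later if the samples show it is safe.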

Extraction Challenges Identified


Output Format Decision

Target: Excel (.xlsx)

Rationale:


Approach Options Considered

Option 1: Claude Chat (Current Session)

Option 2: Claude Code (Included with Max Plan)

Option 3: Claude API (Pay-as-you-go)


API Cost Analysis

Pricing (per million tokens)

Model        Input     Output    Best For
Haiku 3      $0.25     $1.25     Simple tasks, high volume
Sonnet 4.5   $3.00     $15.00    Balanced, good for parsing
Opus 4.5     $5.00     $25.00    Most capable

Estimated Cost Per Extraction

Based on the sample ARS document (~146,000 characters ≈ 37,000 input tokens, at roughly 4 characters per token):

Model        Input Cost   Output Cost   Total Per Document
Haiku 3      $0.01        $0.006        ~$0.02
Sonnet 4.5   $0.11        $0.08         ~$0.19
Opus 4.5     $0.19        $0.13         ~$0.32
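
The per-document figures can be reproduced with a short calculation. This sketch is illustrative only: the function names are hypothetical, the prices are copied from the pricing table, and the output size of about 5,000 tokens is an assumption inferred from the output-cost column rather than a figure stated in this document.

```python
# Prices in USD per million tokens (input, output), from the pricing table.
PRICES = {
    "Haiku 3": (0.25, 1.25),
    "Sonnet 4.5": (3.00, 15.00),
    "Opus 4.5": (5.00, 25.00),
}


def estimate_tokens(char_count: int, chars_per_token: float = 4.0) -> int:
    """Rough token estimate; 4 characters per token is a rule of thumb."""
    return round(char_count / chars_per_token)


def cost_per_document(model: str, input_tokens: int, output_tokens: int) -> float:
    """Cost in USD for one extraction with the given token counts."""
    input_price, output_price = PRICES[model]
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000


# Assumed sizes: 37,000 input tokens (from the text) and ~5,000 output
# tokens (an inference, not a measured value).
for model in PRICES:
    print(f"{model}: ${cost_per_document(model, 37_000, 5_000):.3f}")
```

Under these assumptions the unrounded totals come out within about a cent of the table; small differences arise because the table sums already-rounded input and output costs.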

Volume Projections

Volume            Haiku 3   Sonnet 4.5   Opus 4.5
10 documents      $0.20     $1.90        $3.20
100 documents     $2.00     $19.00       $32.00
1,000 documents   $20.00    $190.00      $320.00

Recommendation: Sonnet 4.5 (~$0.19/document) offers the best balance of capability and cost for complex medical table parsing.


Important Note: Max Plan vs API

The Claude Max subscription ($100-200/month) does not include API access. The API requires a separate Console account with pay-per-token billing. The two are independent systems with separate billing.


Next Steps

Current Phase: Sample Collection

Before building the extraction workflow, collect 50+ sample PDF files to:

  1. Categorize formats - group similar table structures together
  2. Identify common patterns - what is consistent across all files
  3. Document variations - what differs and how to handle each case
  4. Estimate complexity - some files may need only simple extraction, others may need AI parsing
  5. Refine cost estimates - based on actual file sizes and complexity
  6. Design a flexible workflow - one that handles variations gracefully
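
One way to approach the first step, categorizing formats, is to group samples by a normalized version of their table header row. The sketch below assumes the header cells of each sample have already been extracted by some other means; `header_signature` and `group_by_format` are hypothetical helper names, and the file names in the usage example are placeholders.

```python
from collections import defaultdict


def header_signature(header_cells):
    """Normalize header cells so case and spacing differences do not split groups."""
    return tuple(" ".join(cell.lower().split()) for cell in header_cells)


def group_by_format(samples):
    """Group filenames by header signature.

    samples: mapping of filename -> list of header cell strings.
    Returns a dict of signature tuple -> list of filenames.
    """
    groups = defaultdict(list)
    for filename, header in samples.items():
        groups[header_signature(header)].append(filename)
    return dict(groups)


# Placeholder inputs for illustration only.
samples = {
    "a.pdf": ["Reference", "Study Type"],
    "b.pdf": ["reference ", "Study  type"],
    "c.pdf": ["Author", "Design"],
}
for signature, files in group_by_format(samples).items():
    print(signature, files)
```

Exact matching on normalized headers is deliberately strict; renamed or reordered columns land in separate groups, which is useful when the aim is to document variations.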

Questions to Answer During Collection


Proposed Workflow (Pending Sample Analysis)

┌─────────────────┐     ┌──────────────────┐     ┌─────────────────┐
│  Source PDF     │────▶│  Text Extraction │────▶│  AI Parsing     │
│  (Evidence      │     │  (Python/C#)     │     │  (Claude API)   │
│   Table)        │     │                  │     │                 │
└─────────────────┘     └──────────────────┘     └────────┬────────┘
                                                          │
                                                          ▼
┌─────────────────┐     ┌──────────────────┐     ┌─────────────────┐
│  Validation     │◀────│  Excel Output    │◀────│  Structured     │
│  & Review       │     │  (.xlsx)         │     │  JSON Data      │
└─────────────────┘     └──────────────────┘     └─────────────────┘
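
The stages in the diagram could be wired together roughly as follows. This is a skeleton under stated assumptions, not a chosen implementation: `run_pipeline` and its parameters are hypothetical, each stage is passed in as a plain callable so that the PDF library, the Claude API call, and the Excel writer can be decided after sample analysis, and the AI parsing stage is assumed to return a JSON array of row objects.

```python
import json
from typing import Callable


def run_pipeline(
    pdf_path: str,
    extract_text: Callable[[str], str],         # Source PDF -> raw text
    parse_with_ai: Callable[[str], str],        # raw text -> JSON string
    write_excel: Callable[[list, str], None],   # rows + output path -> .xlsx
    xlsx_path: str,
) -> list:
    """Run the stages in diagram order and return the rows for review."""
    text = extract_text(pdf_path)
    raw_json = parse_with_ai(text)

    # Structured JSON Data: fail early if the response is not a list of rows.
    rows = json.loads(raw_json)
    if not isinstance(rows, list):
        raise ValueError("expected a JSON array of row objects")

    write_excel(rows, xlsx_path)
    return rows  # handed on to the Validation & Review step
```

Passing the stages in as callables keeps the skeleton testable with stubs and lets a simple non-AI extractor replace the AI parsing stage for formats that do not need it.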

Decision Point

Action Required: Collect 50+ sample PDFs, then resume design work.

When samples are ready, Claude can:

  1. Analyze format variations across all samples
  2. Recommend extraction strategy
  3. Build proof-of-concept
  4. Develop production workflow

Document Created: December 20, 2025
Last Updated: December 20, 2025