RadiumEx - PDF-Extraction-Workflow-Design

# PDF Evidence Table Extraction Workflow Design

## Project Goal

Develop a repeatable workflow to extract structured evidence tables from ARS (American Radium Society) and similar clinical guideline PDFs, outputting the data to Excel format.

---

## Initial Analysis (December 20, 2025)

### Source Files Analyzed

| File | Size | Pages | Description |
|------|------|-------|-------------|
| `2021_ARS_ReRT_Full_2021_09_1.pdf` | 1.3 MB | 64 | Full ARS guideline on head/neck cancer re-irradiation |
| `ReRT_Evidence_Table.pdf` | 121 KB | 7 | Extracted "Supplemental Table 1: Evidence Table" |

### Evidence Table Structure

The evidence table contains clinical study data with the following columns:

| Column | Data Type | Notes |
|--------|-----------|-------|
| Reference | Text | Author, Year + superscript reference number |
| Study Type | Text | RCT, MA, SR, SAT, RMI, RSI |
| Topic/Objective | Text | Multi-line possible |
| Disease | Text | Cancer type/site |
| Arm(s)/Cohort(s) | Text | Treatment descriptions |
| N | Text/Number | Sample size, sometimes "X studies, Y patients" |
| Median FU (Mo.) | Text | Follow-up duration |
| Results | Text | Outcomes - longest field |
| Study Quality | Number | 1-4 scale |
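
The column layout above maps naturally onto a typed record. The following is a minimal sketch rather than part of the original analysis: the class name `EvidenceRow`, its field names, and the helper `parse_sample_size` are hypothetical, and the pattern only covers the two `N` shapes noted in the table (a plain count, or "X studies, Y patients").

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass
class EvidenceRow:
    """One row of the evidence table (field names are illustrative)."""
    reference: str          # author, year, and reference number
    study_type: str         # e.g. RCT, MA, SR, SAT, RMI, RSI
    topic_objective: str
    disease: str
    arms_cohorts: str
    n: str                  # kept as text; see parse_sample_size
    median_fu_months: str   # kept as text
    results: str
    study_quality: int      # 1-4 scale


# Assumed shape for pooled analyses: "<X> studies, <Y> patients".
_STUDIES_PATIENTS = re.compile(r"(\d+)\s*studies\s*(\d+)\s*patients", re.IGNORECASE)


def parse_sample_size(n: str) -> dict[str, Optional[int]]:
    """Split an N cell into optional study and patient counts."""
    text = n.replace(",", "").strip()  # drop thousands separators and list commas
    if text.isdigit():
        return {"studies": None, "patients": int(text)}
    match = _STUDIES_PATIENTS.search(text)
    if match:
        return {"studies": int(match.group(1)), "patients": int(match.group(2))}
    # Anything else is left for manual review rather than guessed.
    return {"studies": None, "patients": None}
```

Keeping `n` and `median_fu_months` as text preserves whatever the PDF contains, and numeric interpretation happens in a separate step that can fail without losing the original cell.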

### Extraction Challenges Identified

- Multi-line cells that wrap across rows
- Superscript reference numbers embedded in text
- Text wrapping across page boundaries
- Variable PDF text quality depending on generator
- Column alignment lost in raw text extraction
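
One way the multi-line cell problem might be handled is sketched below. It assumes the rows have already been split into cells, that repeated header rows have been dropped, that every row has the same number of cells, and that a wrapped continuation line shows up as a row whose Reference cell is empty. The function name `merge_wrapped_rows` is hypothetical, and none of these assumptions has been checked against the sample PDFs yet.

```python
def merge_wrapped_rows(rows: list[list[str]]) -> list[list[str]]:
    """Fold continuation rows (empty first cell) into the row above them."""
    merged: list[list[str]] = []
    for cells in rows:
        is_continuation = bool(merged) and not cells[0].strip()
        if is_continuation:
            previous = merged[-1]
            for index, cell in enumerate(cells):
                if cell.strip():
                    # Append the wrapped fragment to the same column of the prior row.
                    previous[index] = f"{previous[index]} {cell.strip()}".strip()
        else:
            merged.append([cell.strip() for cell in cells])
    return merged
```

Because page breaks simply produce more continuation rows, the same pass would also cover text that wraps across page boundaries. It would not cover a Reference cell that itself wraps onto a second line, which would need a different signal.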

---

## Output Format Decision

**Target: Excel (.xlsx)**

Rationale:
- Easy to review and validate extracted data
- Supports filtering and sorting for analysis
- Can be further processed or imported to other systems

---

## Approach Options Considered

### Option 1: Claude Chat (Current Session)
- Upload PDFs to conversation
- Claude extracts and parses data
- Generate Excel file for download
- **Limitation:** Manual process each time

### Option 2: Claude Code (Included with Max Plan)
- Command-line tool included with Pro/Max subscription
- Can interact with local files
- Could script a workflow for PDF → Excel
- Uses same usage allocation as chat

### Option 3: Claude API (Pay-as-you-go)
- Separate account at console.anthropic.com
- Pay per token used
- Build fully automated C# tool
- Additional cost beyond Max subscription

---

## API Cost Analysis

### Pricing (per million tokens)

| Model | Input | Output | Best For |
|-------|-------|--------|----------|
| Haiku 3 | $0.25 | $1.25 | Simple tasks, high volume |
| Sonnet 4.5 | $3.00 | $15.00 | Balanced, good for parsing |
| Opus 4.5 | $5.00 | $25.00 | Most capable |

### Estimated Cost Per Extraction

Based on the sample ARS document (~146,000 characters ≈ 37,000 input tokens) and roughly 5,000 output tokens per document, the figure implied by the output costs below:

| Model | Input Cost | Output Cost | Total Per Document |
|-------|------------|-------------|-------------------|
| Haiku 3 | $0.01 | $0.006 | ~$0.02 |
| Sonnet 4.5 | $0.11 | $0.08 | ~$0.19 |
| Opus 4.5 | $0.19 | $0.13 | ~$0.32 |

### Volume Projections

| Volume | Haiku 3 | Sonnet 4.5 | Opus 4.5 |
|--------|-------|--------|------|
| 10 documents | $0.20 | $1.90 | $3.20 |
| 100 documents | $2.00 | $19.00 | $32.00 |
| 1,000 documents | $20.00 | $190.00 | $320.00 |

**Recommendation:** Sonnet 4.5 (~$0.19/document) offers the best balance of capability and cost for complex medical table parsing.
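
The arithmetic behind these tables can be reproduced with a short sketch. The prices and the 37,000-token input come from the tables above. The 5,000-token output figure is an assumption back-calculated from the output-cost column, and the function names are hypothetical.

```python
# USD per million tokens as (input, output), copied from the pricing table.
PRICING = {
    "Haiku 3": (0.25, 1.25),
    "Sonnet 4.5": (3.00, 15.00),
    "Opus 4.5": (5.00, 25.00),
}


def cost_per_document(model: str, input_tokens: int, output_tokens: int) -> float:
    """Return the estimated USD cost of one extraction."""
    input_price, output_price = PRICING[model]
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000


def projected_cost(
    model: str,
    documents: int,
    input_tokens: int = 37_000,   # sample ARS document
    output_tokens: int = 5_000,   # assumed; implied by the output-cost column
) -> float:
    """Scale the per-document estimate to a batch of similar documents."""
    return documents * cost_per_document(model, input_tokens, output_tokens)


if __name__ == "__main__":
    for model in PRICING:
        per_doc = cost_per_document(model, 37_000, 5_000)
        print(f"{model}: ${per_doc:.3f} per document, "
              f"${projected_cost(model, 1_000):.2f} per 1,000 documents")
```

Under these assumptions the unrounded Sonnet 4.5 figure is about $0.186 per document, or $186 per 1,000 documents. The volume table shows $190 because it multiplies the rounded $0.19.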

---

## Important Note: Max Plan vs API

The Claude Max subscription ($100-200/month) does **not** include API access. The API requires a separate Console account with pay-per-token billing. These are independent systems:

- **Max Plan:** Fixed monthly fee; access via claude.ai, Claude Desktop, and Claude Code
- **API:** Pay-per-token, requires separate account at console.anthropic.com

---

## Next Steps

### Current Phase: Sample Collection

Before building the extraction workflow, collect 50+ sample PDF files to:

1. **Categorize formats** - Group similar table structures together
2. **Identify common patterns** - What's consistent across all files
3. **Document variations** - What differs and how to handle each case
4. **Estimate complexity** - Some may need simple extraction, others may need AI parsing
5. **Refine cost estimates** - Based on actual file sizes and complexity
6. **Design flexible workflow** - That handles variations gracefully
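
To support the categorization steps above, the collected samples could be listed in a simple inventory that is filled in during review. This is only a sketch: the function name `build_inventory`, the CSV column names, and the flat-folder layout are assumptions.

```python
import csv
from pathlib import Path

INVENTORY_FIELDS = ["file", "size_kb", "category", "notes"]


def build_inventory(sample_dir, out_csv) -> list[dict]:
    """List every *.pdf in sample_dir and write a CSV to annotate by hand."""
    rows = []
    for pdf in sorted(Path(sample_dir).glob("*.pdf")):
        rows.append({
            "file": pdf.name,
            "size_kb": round(pdf.stat().st_size / 1024, 1),
            "category": "",  # filled in manually: which table format this is
            "notes": "",     # filled in manually: variations worth handling
        })
    with open(out_csv, "w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=INVENTORY_FIELDS)
        writer.writeheader()
        writer.writerows(rows)
    return rows
```

The file sizes recorded here can also feed the refined cost estimates once token counts are measured for a few representative samples.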

### Questions to Answer During Collection

- Are all of the documents from ARS, or from multiple organizations?
- Are they all "Evidence Tables", or do other table types appear too?
- What is the date range of the publications? (Older PDFs may have different formatting.)
- Are they all text-based, or are some image-based (requiring OCR)?
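
The text-based versus image-based question could be screened automatically once a text extraction step exists. The heuristic below is a sketch under stated assumptions: it takes the per-page text already produced by whichever extraction library is chosen, and both the 50-character threshold and the "more than half the pages" rule are arbitrary starting points, not measured values. The name `likely_needs_ocr` is hypothetical.

```python
def likely_needs_ocr(page_texts: list[str], min_chars_per_page: int = 50) -> bool:
    """Guess whether a PDF is image-based from how little text its pages yield."""
    if not page_texts:
        return True  # nothing extracted at all
    sparse_pages = sum(
        1 for text in page_texts if len(text.strip()) < min_chars_per_page
    )
    # Flag the file when most pages produced almost no text.
    return sparse_pages / len(page_texts) > 0.5
```

Files flagged this way would go to manual inspection first, since a sparse page can also be a cover page or a figure rather than a scanned table.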

---

## Proposed Workflow (Pending Sample Analysis)

```
┌─────────────────┐     ┌──────────────────┐     ┌─────────────────┐
│  Source PDF     │────▶│  Text Extraction │────▶│  AI Parsing     │
│  (Evidence      │     │  (Python/C#)     │     │  (Claude API)   │
│   Table)        │     │                  │     │                 │
└─────────────────┘     └──────────────────┘     └────────┬────────┘
                                                          │
                                                          ▼
┌─────────────────┐     ┌──────────────────┐     ┌─────────────────┐
│  Validation     │◀────│  Excel Output    │◀────│  Structured     │
│  & Review       │     │  (.xlsx)         │     │  JSON Data      │
└─────────────────┘     └──────────────────┘     └─────────────────┘
```
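
The "Structured JSON Data" to "Excel Output" hand-off in the diagram might look like the sketch below. It assumes the parsing step returns a JSON array of objects keyed by the evidence table's column names. That response shape, the function name `json_to_sheet_rows`, and the issue messages are all hypothetical.

```python
import json

COLUMNS = [
    "Reference", "Study Type", "Topic/Objective", "Disease",
    "Arm(s)/Cohort(s)", "N", "Median FU (Mo.)", "Results", "Study Quality",
]


def json_to_sheet_rows(json_text: str) -> tuple[list[list], list[str]]:
    """Convert parsed JSON records into spreadsheet rows plus review notes."""
    records = json.loads(json_text)
    sheet = [list(COLUMNS)]  # header row first
    issues = []
    for number, record in enumerate(records, start=1):
        missing = [column for column in COLUMNS if column not in record]
        if missing:
            issues.append(f"row {number}: missing {', '.join(missing)}")
        quality = record.get("Study Quality")
        if quality is not None and quality not in (1, 2, 3, 4):
            issues.append(f"row {number}: Study Quality {quality!r} not on the 1-4 scale")
        # Missing cells become empty strings so column positions stay aligned.
        sheet.append([record.get(column, "") for column in COLUMNS])
    return sheet, issues
```

The returned rows are plain lists, so they could be handed to a spreadsheet writer such as openpyxl (for example, one `Worksheet.append` call per row), while the issue list feeds the Validation & Review stage.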

---

## Decision Point

**Action Required:** Collect 50+ sample PDFs, then resume design work.

When samples are ready, Claude can:
1. Analyze format variations across all samples
2. Recommend extraction strategy
3. Build proof-of-concept
4. Develop production workflow

---

**Document Created:** December 20, 2025  
**Last Updated:** December 20, 2025
