Extract Structured Data from Scientific PDFs

A comprehensive pipeline for extracting standardized data from scientific literature PDFs using Claude AI.

Overview

This skill provides an end-to-end workflow for:

Organizing PDF literature and metadata from various sources
Filtering relevant papers based on abstract content (optional)
Extracting structured data from full PDFs using Claude's vision capabilities
Repairing and validating JSON outputs
Enriching data with external scientific databases
Exporting to multiple analysis formats (Python, R, Excel, CSV, SQLite)

Quick Start

1. Installation

Create a conda environment:

conda env create -f environment.yml
conda activate pdf_extraction

Or install with pip:

pip install -r requirements.txt

2. Setup API Keys

Set your Anthropic API key:

export ANTHROPIC_API_KEY='your-api-key-here'

For geographic validation (optional):

export GEONAMES_USERNAME='your-geonames-username'

3. Run the Skill

The easiest way is to use the skill through Claude Code:

claude-code

Then activate the skill by mentioning it in your conversation. The skill will guide you through an interactive setup process.

Documentation

The skill includes comprehensive reference documentation:

references/setup_guide.md - Installation and configuration
references/workflow_guide.md - Complete step-by-step workflow with examples
references/validation_guide.md - Validation methodology and metrics interpretation
references/api_reference.md - External API integration details

Manual Workflow

You can also run the scripts manually:

Step 1: Organize Metadata

python scripts/01_organize_metadata.py \
  --source-type bibtex \
  --source path/to/library.bib \
  --pdf-dir path/to/pdfs \
  --organize-pdfs \
  --output metadata.json

Step 2: Filter Papers (Optional)

First, customize the filtering prompt in scripts/02_filter_abstracts.py for your use case.

Option A: Claude Haiku (Fast & Cheap - ~$0.25/M tokens)

python scripts/02_filter_abstracts.py \
  --metadata metadata.json \
  --backend anthropic-haiku \
  --use-batches \
  --output filtered_papers.json

Option B: Local Model via Ollama (FREE)

# One-time setup:
# 1. Install Ollama from https://ollama.com
# 2. Pull model: ollama pull llama3.1:8b
# 3. Start server: ollama serve

python scripts/02_filter_abstracts.py \
  --metadata metadata.json \
  --backend ollama \
  --ollama-model llama3.1:8b \
  --output filtered_papers.json

Recommended Ollama models:

llama3.1:8b - Good balance (8GB RAM)
mistral:7b - Fast, good for simple filtering
qwen2.5:7b - Good multilingual support
llama3.1:70b - Better accuracy (64GB RAM)

Step 3: Extract Data from PDFs

First, create your extraction schema by copying and customizing assets/schema_template.json.

python scripts/03_extract_from_pdfs.py \
  --metadata filtered_papers.json \
  --schema my_schema.json \
  --method batches \
  --output extracted_data.json

Step 4: Repair JSON

python scripts/04_repair_json.py \
  --input extracted_data.json \
  --schema my_schema.json \
  --output cleaned_data.json

Step 5: Validate with APIs

First, create your API configuration by copying and customizing assets/api_config_template.json.

python scripts/05_validate_with_apis.py \
  --input cleaned_data.json \
  --apis my_api_config.json \
  --output validated_data.json

Step 6: Export

# For Python/pandas
python scripts/06_export_database.py \
  --input validated_data.json \
  --format python \
  --flatten \
  --output results

# For R
python scripts/06_export_database.py \
  --input validated_data.json \
  --format r \
  --flatten \
  --output results

# For CSV
python scripts/06_export_database.py \
  --input validated_data.json \
  --format csv \
  --flatten \
  --output results.csv

Validation & Quality Assurance (Optional but Recommended)

Validate extraction quality using precision and recall metrics:

Step 7: Prepare Validation Set

python scripts/07_prepare_validation_set.py \
  --extraction-results cleaned_data.json \
  --schema my_schema.json \
  --sample-size 20 \
  --strategy stratified \
  --output validation_set.json

Sampling strategies:

random - Random sample
stratified - Sample by extraction characteristics
diverse - Maximize diversity

Step 8: Manual Annotation

Open validation_set.json
For each sampled paper:
- Read the PDF
- Fill in ground_truth field with correct extraction
- Add annotator name and annotation_date
- Use notes for ambiguous cases
Save the file

Step 9: Calculate Metrics

python scripts/08_calculate_validation_metrics.py \
  --annotations validation_set.json \
  --output validation_metrics.json \
  --report validation_report.txt

This produces:

Precision: % of extracted items that are correct
Recall: % of true items that were extracted
F1 Score: Harmonic mean of precision and recall
Per-field metrics: Accuracy by field type

Use these metrics to:

Identify weak points in extraction prompts
Compare models (Haiku vs Sonnet vs Ollama)
Iterate and improve schema
Report quality in publications

Customization

Creating Your Extraction Schema

Copy assets/schema_template.json to my_schema.json
Customize the following sections:
- objective: What you're extracting
- system_context: Your scientific domain
- instructions: Step-by-step guidance for Claude
- output_schema: JSON schema defining your data structure
- output_example: Example of desired output

See assets/example_flower_visitors_schema.json for a real-world example.

Configuring API Validation

Copy assets/api_config_template.json to my_api_config.json
Map your schema fields to appropriate validation APIs
See available APIs in scripts/05_validate_with_apis.py and references/api_reference.md

See assets/example_api_config_ecology.json for an ecology example.

Cost Estimation

PDF processing costs approximately 1,500-3,000 tokens per page:

10-page paper: ~20,000-30,000 tokens
100 papers: ~2-3M tokens
With Sonnet 4.5: ~$6-9 for 100 papers

Tips to reduce costs:

Use abstract filtering (Step 2) to reduce full PDF processing
Enable prompt caching with --use-caching
Use batch processing (--method batches)
Consider using Haiku for simpler extractions

Supported Data Sources

Bibliography Formats

BibTeX (Zotero, JabRef, etc.)
RIS (Mendeley, EndNote, etc.)
Directory of PDFs
List of DOIs

Output Formats

Python (pandas DataFrame pickle)
R (RDS file)
CSV
JSON
Excel
SQLite database

Validation APIs

Biology: GBIF, World Flora Online, NCBI Gene
Geography: GeoNames, OpenStreetMap Nominatim
Chemistry: PubChem
Medicine: (extensible - add your own)

Examples

See the beetle flower visitors repository for a real-world example of this workflow in action.

Troubleshooting

PDF Size Limits

Maximum file size: 32MB
Maximum pages: 100
Solution: Use chunked processing for larger PDFs

JSON Parsing Errors

The json-repair library handles most common issues
Check your schema validation
Review Claude's analysis output for clues

API Rate Limits

Add delays between requests (implemented in scripts)
Use batch processing when available
Check specific API documentation for limits

Contributing

To add support for additional validation APIs:

Add validator function to scripts/05_validate_with_apis.py
Register in API_VALIDATORS dictionary
Update api_config_template.json with examples

Citation

If you use this skill in your research, please cite:

@software{pdf_extraction_skill,
  title = {Extract Structured Data from Scientific PDFs},
  author = {Your Name},
  year = {2025},
  url = {https://github.com/your-repo}
}

License

MIT License - see LICENSE file for details

extract-from-pdfs

安装到 Claude

Category

Risk Level

Details