by brunoasm
extract-from-pdfs: an AI skill for Claude, Codex, and Claude Code
A comprehensive pipeline for extracting standardized data from scientific literature PDFs using Claude AI.
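This skill provides an end-to-end workflow for organizing paper metadata, filtering papers by abstract, extracting structured data from PDFs, repairing the JSON output, validating records against external APIs, exporting the results, and measuring extraction quality.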
Create a conda environment:
conda env create -f environment.yml
conda activate pdf_extraction
Or install with pip:
pip install -r requirements.txt
Set your Anthropic API key:
export ANTHROPIC_API_KEY='your-api-key-here'
For geographic validation (optional):
export GEONAMES_USERNAME='your-geonames-username'
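Before running any script, it can help to confirm that the variables above are visible to Python. The following is a minimal sketch, not part of the skill itself; the helper name `check_environment` is hypothetical.

```python
import os

REQUIRED_VARS = ["ANTHROPIC_API_KEY"]   # needed for the Claude-backed steps
OPTIONAL_VARS = ["GEONAMES_USERNAME"]   # only needed for geographic validation


def check_environment(env=None):
    """Return (missing_required, missing_optional) lists of variable names."""
    env = os.environ if env is None else env
    missing_required = [name for name in REQUIRED_VARS if not env.get(name)]
    missing_optional = [name for name in OPTIONAL_VARS if not env.get(name)]
    return missing_required, missing_optional


if __name__ == "__main__":
    required, optional = check_environment()
    if required:
        print("Missing required variables:", ", ".join(required))
    if optional:
        print("Missing optional variables:", ", ".join(optional))
```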
The easiest way is to use the skill through Claude Code:
claude-code
Then activate the skill by mentioning it in your conversation. The skill will guide you through an interactive setup process.
The skill includes comprehensive reference documentation:
- references/setup_guide.md - Installation and configuration
- references/workflow_guide.md - Complete step-by-step workflow with examples
- references/validation_guide.md - Validation methodology and metrics interpretation
- references/api_reference.md - External API integration details

You can also run the scripts manually:
python scripts/01_organize_metadata.py \
--source-type bibtex \
--source path/to/library.bib \
--pdf-dir path/to/pdfs \
--organize-pdfs \
--output metadata.json
First, customize the filtering prompt in scripts/02_filter_abstracts.py for your use case.
Option A: Claude Haiku (Fast & Cheap - ~$0.25/M tokens)
python scripts/02_filter_abstracts.py \
--metadata metadata.json \
--backend anthropic-haiku \
--use-batches \
--output filtered_papers.json
Option B: Local Model via Ollama (FREE)
# One-time setup:
# 1. Install Ollama from https://ollama.com
# 2. Pull model: ollama pull llama3.1:8b
# 3. Start server: ollama serve
python scripts/02_filter_abstracts.py \
--metadata metadata.json \
--backend ollama \
--ollama-model llama3.1:8b \
--output filtered_papers.json
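If the Ollama backend fails to connect, a quick check that the local server is up and the model has been pulled can save time. This sketch assumes Ollama's default address (`http://localhost:11434`) and its `/api/tags` endpoint, which lists local models; the helper name `list_ollama_models` is hypothetical and not part of the skill.

```python
import json
import urllib.error
import urllib.request


def list_ollama_models(base_url="http://localhost:11434", timeout=3.0):
    """Return names of locally available models, or None if the server is unreachable."""
    try:
        with urllib.request.urlopen(f"{base_url}/api/tags", timeout=timeout) as resp:
            payload = json.load(resp)
    except (urllib.error.URLError, OSError, ValueError):
        # Connection refused, timeout, or a non-JSON response.
        return None
    return [model.get("name") for model in payload.get("models", [])]


# Usage (requires a running Ollama server):
#   models = list_ollama_models()
#   if models is None: the server is not running, try `ollama serve`
#   elif "llama3.1:8b" not in models: run `ollama pull llama3.1:8b`
```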
Recommended Ollama models:
- llama3.1:8b - Good balance (8GB RAM)
- mistral:7b - Fast, good for simple filtering
- qwen2.5:7b - Good multilingual support
- llama3.1:70b - Better accuracy (64GB RAM)

First, create your extraction schema by copying and customizing assets/schema_template.json.
python scripts/03_extract_from_pdfs.py \
--metadata filtered_papers.json \
--schema my_schema.json \
--method batches \
--output extracted_data.json
python scripts/04_repair_json.py \
--input extracted_data.json \
--schema my_schema.json \
--output cleaned_data.json
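To illustrate the kind of damage a repair step deals with, here is a standard-library sketch that tolerates two common problems in model output: a surrounding Markdown code fence and trailing commas. It is not the logic of scripts/04_repair_json.py, whose internals are not documented here; the function name `lenient_json_loads` and the sample record are made up for illustration.

```python
import json
import re

FENCE = "`" * 3  # a Markdown code fence, built here to avoid writing it literally


def lenient_json_loads(text):
    """Parse JSON, tolerating a surrounding code fence and trailing commas."""
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        pass
    # Strip an opening fence (optionally tagged "json") and a closing fence.
    cleaned = re.sub(rf"^\s*{FENCE}(?:json)?\s*|\s*{FENCE}\s*$", "", text)
    # Drop trailing commas before a closing brace or bracket.
    # Caveat: this naive pattern would also touch ",}" inside a string value.
    cleaned = re.sub(r",\s*([}\]])", r"\1", cleaned)
    return json.loads(cleaned)


if __name__ == "__main__":
    raw = FENCE + 'json\n{"species": "Apis mellifera", "counts": [1, 2,],}\n' + FENCE
    print(lenient_json_loads(raw))
```

A dedicated library covers far more cases (unquoted keys, truncated output); this sketch only shows the general idea of cleaning before parsing.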
First, create your API configuration by copying and customizing assets/api_config_template.json.
python scripts/05_validate_with_apis.py \
--input cleaned_data.json \
--apis my_api_config.json \
--output validated_data.json
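This document mentions an API_VALIDATORS dictionary in scripts/05_validate_with_apis.py. Its actual structure is not shown, so the following is only a sketch of a typical registry pattern, assuming each validator takes a record and returns a result dictionary. The validator `validate_nonempty_name` and the helper `run_validators` are hypothetical, and no real API is called.

```python
from typing import Callable, Dict, List


def validate_nonempty_name(record: dict) -> dict:
    """Hypothetical validator: flag records whose 'name' field is empty."""
    name = str(record.get("name", "")).strip()
    return {"valid": bool(name), "normalized": name}


# Registry mapping a configuration key to a validator function (assumed shape).
API_VALIDATORS: Dict[str, Callable[[dict], dict]] = {
    "nonempty_name": validate_nonempty_name,
}


def run_validators(record: dict, enabled: List[str]) -> dict:
    """Apply each enabled validator and collect the results by key."""
    results = {}
    for key in enabled:
        validator = API_VALIDATORS.get(key)
        if validator is None:
            results[key] = {"valid": False, "error": "unknown validator"}
        else:
            results[key] = validator(record)
    return results
```

With a registry like this, supporting a new API means writing one function and adding one dictionary entry; the calling code does not change.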
# For Python/pandas
python scripts/06_export_database.py \
--input validated_data.json \
--format python \
--flatten \
--output results
# For R
python scripts/06_export_database.py \
--input validated_data.json \
--format r \
--flatten \
--output results
# For CSV
python scripts/06_export_database.py \
--input validated_data.json \
--format csv \
--flatten \
--output results.csv
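How --flatten handles nested records is not described here. One common approach, sketched below as an assumption rather than the script's actual behavior, is to join nested keys with a separator and collapse lists into a single cell. The function name `flatten_record` and the sample record are illustrative.

```python
import csv
import io


def flatten_record(record, parent_key="", sep="."):
    """Flatten nested dicts into one level, joining keys with `sep`."""
    flat = {}
    for key, value in record.items():
        full_key = f"{parent_key}{sep}{key}" if parent_key else str(key)
        if isinstance(value, dict):
            flat.update(flatten_record(value, full_key, sep))
        elif isinstance(value, list):
            # Lists are joined into one cell so each record stays on one row.
            flat[full_key] = "; ".join(str(item) for item in value)
        else:
            flat[full_key] = value
    return flat


if __name__ == "__main__":
    rows = [flatten_record({"paper": {"year": 2020}, "taxa": ["a", "b"]})]
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(rows[0]))
    writer.writeheader()
    writer.writerows(rows)
    print(buffer.getvalue())
```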
Validate extraction quality using precision and recall metrics:
python scripts/07_prepare_validation_set.py \
--extraction-results cleaned_data.json \
--schema my_schema.json \
--sample-size 20 \
--strategy stratified \
--output validation_set.json
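The idea behind a stratified strategy can be sketched as follows: group papers by some characteristic, then sample from each group in proportion to its size. Which characteristics the script actually uses is not stated here, so the `n_records` field below is a hypothetical stand-in, as is the function name.

```python
import random
from collections import defaultdict


def stratified_sample(items, key, sample_size, seed=0):
    """Draw a roughly proportional sample from each stratum defined by key(item)."""
    rng = random.Random(seed)  # fixed seed keeps the validation set reproducible
    strata = defaultdict(list)
    for item in items:
        strata[key(item)].append(item)
    total = len(items)
    sample = []
    for group in strata.values():
        # At least one item per stratum, otherwise proportional to its size.
        k = max(1, round(sample_size * len(group) / total))
        sample.extend(rng.sample(group, min(k, len(group))))
    # Rounding can overshoot slightly; trim to the requested size.
    return sample[:sample_size]


if __name__ == "__main__":
    papers = [{"id": i, "n_records": i % 3} for i in range(30)]
    chosen = stratified_sample(papers, key=lambda p: p["n_records"], sample_size=6)
    print([p["id"] for p in chosen])
```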
Sampling strategies:
- random - Random sample
- stratified - Sample by extraction characteristics
- diverse - Maximize diversity

Then annotate validation_set.json by hand. For each sampled entry:
- Fill in the ground_truth field with the correct extraction
- Record the annotator name and annotation_date
- Use notes for ambiguous cases

Finally, calculate the metrics:
python scripts/08_calculate_validation_metrics.py \
--annotations validation_set.json \
--output validation_metrics.json \
--report validation_report.txt
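For readers unfamiliar with the metrics, precision is the share of extracted items that are correct, and recall is the share of true items that were extracted. The sketch below computes both over sets of items, plus F1 as a common summary that is an addition here, not something this document says the script reports. How the script matches extracted values to ground truth is not documented; exact set membership is an assumption.

```python
def precision_recall_f1(extracted, ground_truth):
    """Compare two collections of hashable items by exact match."""
    extracted, ground_truth = set(extracted), set(ground_truth)
    true_positives = len(extracted & ground_truth)
    precision = true_positives / len(extracted) if extracted else 0.0
    recall = true_positives / len(ground_truth) if ground_truth else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


if __name__ == "__main__":
    # Four items extracted, three in the annotation, two in common.
    p, r, f1 = precision_recall_f1({"a", "b", "c", "d"}, {"a", "b", "e"})
    print(f"precision={p:.2f} recall={r:.2f} f1={f1:.2f}")
```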
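The command above writes the metrics to validation_metrics.json and a report to validation_report.txt.
Use these metrics to assess extraction quality.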
Copy assets/schema_template.json to my_schema.json, then customize its fields:
- objective: What you're extracting
- system_context: Your scientific domain
- instructions: Step-by-step guidance for Claude
- output_schema: JSON schema defining your data structure
- output_example: Example of desired output

See assets/example_flower_visitors_schema.json for a real-world example.
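A quick sanity check before running the extraction is to confirm that a schema file contains the five fields listed above. The sketch below treats all five as required, which is an assumption. The field values are invented placeholders, and the value types (for example, instructions as a list of strings) are guesses; consult the template for the real layout.

```python
import json

EXPECTED_FIELDS = (
    "objective",
    "system_context",
    "instructions",
    "output_schema",
    "output_example",
)


def missing_schema_fields(schema):
    """Return the expected top-level fields that are absent from the schema."""
    return [field for field in EXPECTED_FIELDS if field not in schema]


# Placeholder schema for illustration only; the values are invented.
example_schema = {
    "objective": "Extract plant-pollinator interaction records",
    "system_context": "Pollination ecology",
    "instructions": ["Read the methods and results", "List each interaction"],
    "output_schema": {
        "type": "object",
        "properties": {"records": {"type": "array"}},
    },
    "output_example": {"records": []},
}

if __name__ == "__main__":
    print(json.dumps(example_schema, indent=2))
    print("Missing fields:", missing_schema_fields(example_schema))
```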
Copy assets/api_config_template.json to my_api_config.json and customize it for your data.
See scripts/05_validate_with_apis.py and references/api_reference.md for API integration details.
See assets/example_api_config_ecology.json for an ecology example.
PDF processing costs approximately 1,500-3,000 tokens per page.
Tips to reduce costs:
- Enable caching with --use-caching
- Use batch processing (--method batches)

See the beetle flower visitors repository for a real-world example of this workflow in action.
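The per-page token figure supports a rough estimate of input volume before committing to a large run. The sketch below is back-of-envelope arithmetic only. The price per million tokens is a parameter you must supply for your chosen model, output tokens are ignored, and the function names are hypothetical.

```python
def estimate_tokens(pages, low=1_500, high=3_000):
    """Return the (low, high) input-token estimate for a number of PDF pages."""
    return pages * low, pages * high


def estimate_cost_usd(tokens, usd_per_million):
    """Convert a token count to dollars at a given price per million tokens."""
    return tokens / 1_000_000 * usd_per_million


if __name__ == "__main__":
    # Example: 100 papers of 12 pages each, priced at a placeholder 0.25 USD/M.
    low, high = estimate_tokens(100 * 12)
    print(f"tokens: {low:,} to {high:,}")
    print(f"cost: ${estimate_cost_usd(low, 0.25):.2f} to ${estimate_cost_usd(high, 0.25):.2f}")
```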
The json-repair library handles most common issues.
To add support for additional validation APIs:
- Add a validator function to scripts/05_validate_with_apis.py
- Register it in the API_VALIDATORS dictionary
- Update api_config_template.json with examples

If you use this skill in your research, please cite:
@software{pdf_extraction_skill,
title = {Extract Structured Data from Scientific PDFs},
author = {Your Name},
year = {2025},
url = {https://github.com/your-repo}
}
MIT License - see LICENSE file for details