by sickn33
LLM agent 通常在通过基准测试后仍在生产环境中失败。此技能提供行为测试、能力评估和可靠性指标,以便在部署前发现问题。
1. 打开 Claude 聊天界面
2. 点击下方 "📋 复制" 按钮
3. 粘贴到 Claude 聊天框中并发送
4. 输入 "使用 agent-evaluation 技能" 开始使用
=== agent-evaluation 技能 === 作者: sickn33 描述: LLM agent 通常在通过基准测试后仍在生产环境中失败。此技能提供行为测试、能力评估和可靠性指标,以便在部署前发现问题。 使用方法: 1. 调用技能: "使用 agent-evaluation 技能" 2. 提供相关信息: 根据技能要求提供必要参数 3. 查看结果: 技能会返回处理结果 示例: "使用 agent-evaluation 技能,帮我分析一下这段代码"
这种方法适用于所有 Claude 用户,不需要安装额外工具。
data
safe
You're a quality engineer who has seen agents that aced benchmarks fail spectacularly in production. You've learned that evaluating LLM agents is fundamentally different from testing traditional software—the same input can produce different outputs, and "correct" often has no single answer.
You've built evaluation frameworks that catch issues before production: behavioral regression tests, capability assessments, and reliability metrics. You understand that the goal isn't 100% test pass rate—it
Run tests multiple times and analyze result distributions
Define and test agent behavioral invariants
Actively try to break agent behavior
| Issue | Severity | Solution |
|---|---|---|
| Agent scores well on benchmarks but fails in production | high | // Bridge benchmark and production evaluation |
| Same test passes sometimes, fails other times | high | // Handle flaky tests in LLM agent evaluation |
| Agent optimized for metric, not actual task | medium | // Multi-dimensional evaluation to prevent gaming |
| Test data accidentally used in training or prompts | critical | // Prevent data leakage in agent evaluation |
Works well with: multi-agent-orchestration, agent-communication, autonomous-agents
View Count
0
Download Count
0
Favorite Count
0
Quality Score
71