[LLM Agents F24] Measuring Agent Capabilities and Anthropic's RSP — Ben Mann

Field	Value
Author/compiled by	Compiled from public course materials
Source	Berkeley RDI
Date	November 25, 2024

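Introduction: Anthropic and AI Safety

Ben Mann is a co-founder of Anthropic and was one of the first authors of the GPT-3 paper. Anthropic was founded in 2021 and works on building safe AI systems. This lecture focuses on two core questions: how do we measure the capabilities of AI agents, and how do we build a responsible scaling policy on top of those capability evaluations?

Anthropic's Research Portfolio

Anthropic's research covers Transformer circuit interpretability (understanding a model's internal mechanisms), Constitutional AI (having the AI revise its own outputs against a set of principles), the development and deployment of the Claude model family, and the design of the Responsible Scaling Policy.

Section Summary

Anthropic's core principle is that the more capable an AI system is, the stricter its safety measures must be. Measuring capability is the prerequisite for setting those measures.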

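Evaluating Agent Capabilities

Why Agent Capabilities Need to Be Evaluated

The two purposes of evaluation

Tracking progress: quantify how capable agents are across different kinds of tasks
Anticipating risk: identify capability dimensions that are approaching dangerous thresholds, and deploy safety measures ahead of time

Dimensions of evaluation

Autonomy: can the agent complete long-horizon tasks without human supervision?
Tool use: how effectively can the agent use external tools (code execution, search, APIs)?
Persuasion: can the agent influence human decisions? This bears on the risk of manipulation.
CBRN knowledge: does the agent have expert knowledge in the chemical, biological, radiological, and nuclear domains?
Cybersecurity: can the agent find and exploit software vulnerabilities?

Evaluation methodology

Design real-world tasks (rather than synthetic benchmarks) and observe the agent's end-to-end performance
Compare against a human baseline: have domain experts complete the same tasks and compare the agent's results with theirs
Watch for "tail capabilities": what the model may be able to do in extreme cases

The fundamental difficulty of evaluation

Evaluation is symmetric: if we can evaluate a capability, we already understand that capability. The most dangerous capabilities, however, may be exactly the ones we have not yet identified (unknown unknowns).

Section Summary

Evaluating agent capabilities has to cover several dimensions, including autonomy, tool use, social influence, and expert knowledge. It also has to go beyond standard benchmarks and look at real-world performance and tail risks.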

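Responsible Scaling Policy (RSP)

The Core Framework of the RSP

ASL (AI Safety Level) tiers

Anthropic proposes a system of AI Safety Levels, modeled loosely on biosafety levels (BSL):

ASL-1: AI systems that pose no significant risk
ASL-2: the level of current large models; standard safety measures are required
ASL-3: significant autonomous capabilities or expert knowledge; strengthened safety measures are required
ASL-4+: approaching superhuman capabilities; extreme safety measures are required

Each level carries its own safety requirements: deployment conditions, monitoring intensity, and usage restrictions.

How the RSP Operates

Regular evaluation: whenever a new model is trained or an existing one is significantly updated, evaluate whether its capabilities cross an ASL boundary
Trigger mechanism: if the evaluation shows the model approaching the next level of capability, pause deployment until the corresponding safety measures are in place
Safety investment: tie the safety R&D budget to the growth in model capability
Transparency: publish capability evaluation reports on a regular basis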
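
The trigger mechanism in the list above can be sketched as a simple gate. This is a minimal illustration under stated assumptions, not Anthropic's implementation: `EvalResult`, `required_asl`, and `may_deploy` are hypothetical names, and encoding each ASL as an integer is a simplification made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EvalResult:
    """Outcome of one capability evaluation (hypothetical structure)."""
    dimension: str          # e.g. "autonomy", "cybersecurity"
    crosses_next_asl: bool  # True if the model reaches the next level's threshold


def required_asl(current_asl: int, results: list[EvalResult]) -> int:
    """Return the ASL whose safeguards the model now requires."""
    # Assumption: one dimension crossing the boundary is enough to raise the level.
    if any(r.crosses_next_asl for r in results):
        return current_asl + 1
    return current_asl


def may_deploy(required: int, safeguards_in_place: int) -> bool:
    """Deployment proceeds only when safeguards match the required level."""
    return safeguards_in_place >= required


results = [
    EvalResult("autonomy", crosses_next_asl=False),
    EvalResult("cybersecurity", crosses_next_asl=True),
]
level = required_asl(current_asl=2, results=results)
print(level, may_deploy(level, safeguards_in_place=2))  # 3 False
```

In this sketch the model stays paused until `safeguards_in_place` is raised to the new level, which mirrors the "pause until safety measures are in place" rule.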

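The Significance of the RSP

Why the RSP matters

The RSP's core contribution is to bind "model capability" explicitly to "safety measures": not "release first, remediate later" but "make it safe first, then proceed". This gives the industry a template for an actionable safety framework.

Section Summary

Through ASL tiers and the trigger mechanism, the RSP implements "capability-driven safety upgrades". It is among the most operational AI safety frameworks currently available.

Long-term Benefit Trust

Anthropic set up the Long-term Benefit Trust as a governance innovation: as AI becomes more powerful, control of the company is to pass progressively to independent directors who have no financial stake and focus solely on safety and the public interest. It is presented as the first governance structure of its kind in the AI industry.

Governance innovation

Given that AI may become one of the most powerful technologies in history, aligning the incentives of those in control with the public interest is an institutional safeguard for long-term safety.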

Benchmarking Real-World Agents

Benchmark Pipeline

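Anthropic breaks agent capability down into several metrics: autonomy, tool calling, depth of reasoning, and preference consistency. Each evaluation round chains the following stages:

Design scenarios: choose long-horizon tasks (research, planning, execution, reporting)
Establish baselines: use human teams or previous-generation models as the comparison
Instrument data: collect the agent's action logs, API calls, and audit trail
Score metrics: automated scoring plus human review
Feedback to training: turn each failure mode found into a dataset entry or behavioral trace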

Agent Benchmark pipeline illustrated on slide 3.

Source: Slides emphasize multi-step evaluation loops.

Case Study: Decision Support Task

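A typical task asks the agent to read a 20-page internal strategy document, produce a POAP (Plan, Options, Actions, Priority) summary, and schedule a follow-up meeting.

The agent has to stay coherent across pages (nested sections, cross-references)
The output has to balance facts, the reasoning chain, and recommended actions
A human reviewer scores the final result against the POAP template

Tailored judgment rubric

Anthropic defines a rubric for each task type: factual accuracy, structural completeness, tool-call compliance, and whether the output offers actionable recommendations. The process moves on to an ASL upgrade only when every dimension passes.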
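
The "every dimension must pass" rule can be sketched as an all-or-nothing check. The dimension names below paraphrase the rubric in the text; the 0.8 pass mark, the 0 to 1 score scale, and the function name `rubric_passes` are assumptions made for illustration only.

```python
RUBRIC_DIMENSIONS = (
    "factual_accuracy",
    "structural_completeness",
    "tool_call_compliance",
    "actionable_recommendations",
)


def rubric_passes(scores: dict[str, float], pass_mark: float = 0.8) -> bool:
    """Return True only if every rubric dimension meets the pass mark.

    A missing dimension counts as a failure rather than being skipped.
    """
    return all(scores.get(dim, 0.0) >= pass_mark for dim in RUBRIC_DIMENSIONS)


review = {
    "factual_accuracy": 0.95,
    "structural_completeness": 0.90,
    "tool_call_compliance": 1.00,
    "actionable_recommendations": 0.70,  # below the assumed pass mark
}
print(rubric_passes(review))  # False
```

Using `all` rather than an average keeps a strong score on one dimension from masking a failure on another.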

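Section Summary

Anthropic's benchmark pipeline links task scenarios, data collection, scoring, and training feedback into a closed loop. Each scenario has its own rubric, which turns capability evaluation into an actionable governance signal.

Evidence Matrix

To turn capability evaluations into decision-ready information, Anthropic uses an evidence matrix: for each task stage it lists the key metrics, data sources, and current performance, and assigns a safety band (green/orange/red).

Stage	Metric	Data source	Current risk signal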
Planning	clarity of plan, dependency awareness	prompt logs, human evaluation	plan misses key stakeholders
Tool invocation	tool success rate, error rate	API logs, retry counts	spike in `api_error` events
Execution	prospect completion quality	manager reviews, outcome diff	5% of tasks reworked
Reporting	factual accuracy	summary diff + external documents	hallucinated reference detected

Evidence matrix used for capability snapshots

Evidence matrix refresh cadence

Matrix rows are refreshed every week or after any significant dataset/model update. Without timely refresh, the safety gates rely on stale performance snapshots.
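
A staleness check for matrix rows could look like the sketch below. The weekly interval comes from the text; the `MatrixRow` fields and the `is_stale` helper are hypothetical, and treating a row as stale whenever it predates the latest model update is one reading of "after any significant dataset/model update".

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

REFRESH_INTERVAL = timedelta(days=7)  # "every week", per the text


@dataclass
class MatrixRow:
    stage: str
    metric: str
    status: str            # "green", "orange", or "red"
    refreshed_at: datetime


def is_stale(row: MatrixRow, now: datetime, last_model_update: datetime) -> bool:
    """A row is stale if it is over a week old or predates the latest update."""
    too_old = now - row.refreshed_at > REFRESH_INTERVAL
    predates_update = row.refreshed_at < last_model_update
    return too_old or predates_update


now = datetime(2024, 11, 25)
row = MatrixRow("Tool invocation", "tool success rate", "orange",
                refreshed_at=datetime(2024, 11, 20))
print(is_stale(row, now, last_model_update=datetime(2024, 11, 10)))  # False
print(is_stale(row, now, last_model_update=datetime(2024, 11, 22)))  # True
```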

Monitoring, Telemetry, and Incident Response

Telemetry Stack

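Every tool call and every API output of a deployed agent is written to the observability pipeline, which feeds three kinds of systems:

Alerting: thresholds based on anomaly detection, for example a sudden burst of automated deployments or high-frequency requests
Replay: a complete record of behavior for reproduction, including the timeline, embeddings, and inputs/outputs
Metrics: custom metrics (alignment drift, policy deviation, human override rate)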
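
As one concrete case of the alerting path, the sketch below flags high-frequency requests with a sliding time window. It is a deliberately simple fixed-threshold stand-in for anomaly detection, not a description of any production system; the class name `RateAlert` and its parameters are hypothetical.

```python
from collections import deque


class RateAlert:
    """Fires when more than `limit` requests arrive within `window_s` seconds."""

    def __init__(self, limit: int, window_s: float) -> None:
        self.limit = limit
        self.window_s = window_s
        self._times: deque[float] = deque()

    def record(self, timestamp: float) -> bool:
        """Record one request; return True if the rate threshold is exceeded."""
        self._times.append(timestamp)
        # Drop requests that have fallen out of the window.
        while self._times and timestamp - self._times[0] > self.window_s:
            self._times.popleft()
        return len(self._times) > self.limit


alert = RateAlert(limit=3, window_s=1.0)
burst = [alert.record(t) for t in (0.0, 0.1, 0.2, 0.3)]
print(burst)              # [False, False, False, True]
print(alert.record(5.0))  # False: the earlier burst has aged out
```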

Observability isn't optional

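Once an agent departs from its normal pattern, it must be possible to replay the full call chain and see the system prompt and the external context. Without that visibility, investigation and governance are close to impossible.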

Incident Response Loop

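When monitoring raises an alert, Anthropic's response process includes:

Quickly determine the risk level (green/orange/red)
Automatically switch to a "puppet" mode in which every output requires human approval
Open a post-mortem and record the timeline and root cause
Update the checklist and prompt instructions to prevent recurrence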
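
The four steps above can be sketched as a small response function. Everything here is illustrative: the `Incident` fields, the anomaly-score cut-offs in `triage`, and the choice to apply containment only to orange and red alerts are assumptions, not details from the lecture.

```python
from dataclasses import dataclass, field


@dataclass
class Incident:
    alert: str
    risk_level: str = "unknown"      # green / orange / red
    approval_required: bool = False  # True once outputs need human sign-off
    timeline: list[str] = field(default_factory=list)


def triage(anomaly_score: float) -> str:
    """Map an anomaly score in [0, 1] to a risk level (assumed cut-offs)."""
    if anomaly_score >= 0.8:
        return "red"
    if anomaly_score >= 0.5:
        return "orange"
    return "green"


def respond(alert: str, anomaly_score: float) -> Incident:
    """Run triage, containment, post-mortem, and update in order."""
    incident = Incident(alert)
    incident.risk_level = triage(anomaly_score)
    incident.timeline.append(f"triaged as {incident.risk_level}")
    if incident.risk_level != "green":
        # Containment: every output now needs human approval.
        incident.approval_required = True
        incident.timeline.append("switched to human-approval mode")
        incident.timeline.append("post-mortem opened")
        incident.timeline.append("checklist and prompt update scheduled")
    return incident


incident = respond("spike in api_error events", anomaly_score=0.9)
print(incident.risk_level, incident.approval_required, len(incident.timeline))
# red True 4
```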

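Section Summary

Agent monitoring stands on three legs: telemetry, alerting, and replay. Once an alert fires, the organization has to respond quickly through a structured loop (risk triage, containment, post-mortem, update) to keep capabilities from sliding into the risk zone.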

Data Pipelines and Instrumentation

Telemetry data is ingested into a multi-tier pipeline: real-time streaming, batch aggregation, and backing datastore for audit. Each agent interaction emits structured events with the following fields:

task_id, agent_variant, score
tools_used list with duration/evidence per tool
alignment_drift delta vs baseline
human_override flag

Why instrumentation matters

Instrumentation not only powers alerts but also enables retrospective capability audits: you can slice by agent version, time window, tool set, or base capability and replay the exact logs that triggered a red flag.

The ingestion stack keeps both raw and derived layers. Raw events are immutable and tagged with event_hash to ensure idempotency. Derived views pre-aggregate per ASL level and feed dashboards that teams monitor in daily standups.
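
One way to get the idempotency described above is to derive `event_hash` from the event's content and use it as the storage key, so a redelivered event is ignored. The sketch below assumes SHA-256 over canonical JSON; the text does not say how `event_hash` is computed, and `RawEventStore` is a hypothetical in-memory stand-in for the raw layer. The event fields follow the list given earlier.

```python
import hashlib
import json


def event_hash(event: dict) -> str:
    """Deterministic hash of an event's content (key order does not matter)."""
    canonical = json.dumps(event, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


class RawEventStore:
    """Append-only store that ignores events it has already seen."""

    def __init__(self) -> None:
        self._events: dict[str, dict] = {}

    def ingest(self, event: dict) -> bool:
        """Store the event; return False if it was a duplicate delivery."""
        key = event_hash(event)
        if key in self._events:
            return False
        self._events[key] = dict(event)  # shallow copy of the top-level fields
        return True

    def __len__(self) -> int:
        return len(self._events)


store = RawEventStore()
event = {
    "task_id": "t-001",
    "agent_variant": "v1",
    "score": 0.91,
    "tools_used": [{"name": "search", "duration_s": 1.4}],
    "alignment_drift": 0.02,
    "human_override": False,
}
print(store.ingest(event), store.ingest(dict(event)), len(store))  # True False 1
```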

Dashboards themselves include alignment drift, tool success, red-team exposure, and human override latency. When any metric crosses thresholds, the telemetry system automatically tickets the incident response crew.

Deployment Guardrails and Governance

Pre-production Gates

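The key gates before deployment include:

Red team test: an independent team looks for ways the system could be misused
Explainability check: use interpretability tools to confirm the model's key evidence chain
Configuration hardening: lock down the toolset, rate limits, and system prompts, and record them in CLAUDE.md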

Governance Artifacts

Anthropic maintains a set of artifacts (policy docs, runbooks, SLAs) and requires that:

Every new delivery ships with a Capability Report
Stakeholders may request a Transparency Brief
Long-term Benefit Trust trustees hold a veto

Artifacts are versioned in a governance repository and cross-referenced with the deployment pipeline. Every update must cite the specific evidence matrix row it addresses and include the slide page demonstrating the capability.

Action Timeline

Slide 7 outlines governance milestones and Trust handoff.

Source: Trust handoff timeline with independent board oversight.

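Section Summary

Guardrails consist of pre-production gates, governance artifacts, and the Trust veto. Only by applying these controls at every point where capability increases can AI safety avoid becoming an after-the-fact fix.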

Action Timeline

Deployment is treated as a sequence of checkpoints. A typical timeline:

Phase	Controls
Pre-launch	red team, trust briefing, capability report, artifact sign-off, slides deck review
Soft launch	limited roll-out, parallel human monitor, data capture instrumentation, immediate rollback plan
Full launch	full monitoring, trustee oversight, public communications, community feedback loop

Pre-launch checklist includes red-team signoff, evidence matrix alignment, and trustee update.
Soft launch extends agent access to a limited set of collaborators while instrumentation is verified.
Full launch triggers marketing communication plus continuous monitoring with trustee review cadence.

Action timelines must not shortcut

Skipping any checkpoint (e.g., moving to soft launch before artifact sign-off) is a common root cause of incidents. The trust briefer explicitly checks for missing evidence pages in slides.pdf before signing off.
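
The "no shortcuts" rule can be enforced mechanically by refusing any sign-off that is out of order. This is a minimal sketch; the phase names come from the table above, while `DeploymentTimeline` and its methods are hypothetical.

```python
from typing import Optional

PHASES = ("pre-launch", "soft launch", "full launch")


class DeploymentTimeline:
    """Records phase sign-offs and refuses to skip or repeat a phase."""

    def __init__(self) -> None:
        self.completed: list[str] = []

    def next_phase(self) -> Optional[str]:
        """Return the phase awaiting sign-off, or None when all are done."""
        if len(self.completed) < len(PHASES):
            return PHASES[len(self.completed)]
        return None

    def sign_off(self, phase: str) -> None:
        expected = self.next_phase()
        if phase != expected:
            raise ValueError(f"cannot sign off {phase!r}; expected {expected!r}")
        self.completed.append(phase)


timeline = DeploymentTimeline()
timeline.sign_off("pre-launch")
try:
    timeline.sign_off("full launch")  # skips soft launch
except ValueError as err:
    print(err)  # cannot sign off 'full launch'; expected 'soft launch'
```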

Open Questions and Research Agenda

Measurement Gaps

Anthropic has publicly identified several open gaps:

How to quantify “alignment drift” in the wild?
How to estimate human override fatigue when multiple Agent versions run concurrently?
How to instrument slow-moving capabilities (e.g., strategic reasoning) before they surface in benchmarks?

Action Items

Every new capability launch must include (1) measurement plan, (2) failure signature sheet, (3) human-in-the-loop trigger.

Research Directions

Ben Mann highlights:

“Capability contour” research—mapping the Pareto frontier of autonomy vs. controllability
Multimodal evaluation—ensuring Agent maintains alignment when tool outputs are visual/audio
Policy orchestration—how to mix synchronous/asynchronous Agent families safely

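Section Summary

The biggest research challenge at this stage is moving measurement, safety, and governance forward together. That requires finer-grained alignment metrics, multi-agent orchestration studies, and evidence that governance artifacts actually work in practice.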

Metrics Glossary

Key Capability Metrics

Anthropic tracks a catalog of capability metrics that feed into the evidence matrix. The main categories are:

Integrity: factual accuracy, hallucination rate, citation precision
Autonomy: plan depth, tool use depth, self-corrections
Control: human override ratio, rollback events, command adherence
Impact: productive output, decision quality, human satisfaction index

Each metric is tied to a measurement method: Integrity uses reference corpora and automated fact-checkers, Autonomy uses action logs, Control uses human override logs, Impact uses structured surveys.

Glossary Table

Metric	Description	Data Source	Safety Threshold
Fact Density	Share of claims verified by docs	embeddings + text diff	>82%
Plan Completeness	Steps with dependency resolution	plan transcripts	no missing references
Tool Success	Successful tool calls / tool calls invoked	telemetry logs	>95%
Override Lag	Time between agent action and human override	workflow logs	>3 min to override
Drift Index	Deviation from baseline policy	alignment scores	<0.07 drift
Policy Noise	Entropy in action selection	policy logs	<=0.3
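
The numeric thresholds in the table can be checked with a small lookup, sketched below. The bounds are copied from the table; the metric keys, the `breaches` helper, and the sample snapshot are hypothetical. Plan Completeness is left out because its threshold is not numeric, and Override Lag is left out because the direction of its threshold is not clear from the table.

```python
import operator

# (comparison, bound) pairs copied from the glossary table.
THRESHOLDS = {
    "fact_density": (operator.gt, 0.82),
    "tool_success": (operator.gt, 0.95),
    "drift_index": (operator.lt, 0.07),
    "policy_noise": (operator.le, 0.3),
}


def breaches(metrics: dict[str, float]) -> list[str]:
    """Return the names of metrics that fall outside their safe range."""
    failed = []
    for name, (compare, bound) in THRESHOLDS.items():
        if name in metrics and not compare(metrics[name], bound):
            failed.append(name)
    return failed


snapshot = {"fact_density": 0.88, "tool_success": 0.93,
            "drift_index": 0.05, "policy_noise": 0.3}
print(breaches(snapshot))  # ['tool_success']
```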

Metric calibration

Metrics must be recalibrated when models change architecture, since scores are relative to behavior and can drift even if underlying capability remains stable.

Operational Checklists

Incident Triage Checklist

Anthropic operationalizes incidents using a multi-step checklist:

Document the deviation and gather log artifacts
Estimate likelihood/severity using ASL heuristics
Engage second reviewer and determine containment actions
Log the event in the incident tracker and assign owner
Update capability notes/training data to reflect the failure

Deployment Playbooks

Each deployment uses two playbooks—Launch Playbook and Rollback Playbook. The launch playbook enumerates gating evidence (matrix rows, telemetry dashboards, trustee review). The rollback playbook spells out evaluation metrics, check for repeated incursion, and communication plan.

Section Summary

Metrics glossary + checklists keep tooling grounded: each number in the evidence matrix is tied to a data source, each incident follows a documented escalation path. These artifacts make evaluation defensible even under audit scrutiny.

Slide Gallery

The provided slides illustrate the evidence-heavy thinking behind these policies. Each slide screenshot is presented to anchor the textual summary above.

Slide 6: capability contours and instrumentation.

Slide 8: telemetry architecture sketch.

Slide 9: incident response steps.

Slide 11: governance artifact checklist.

Slide 13: Trust transfer timeline.

Slide 14: measurement gaps & research agenda.

Slide 15: operational checklist for deployments.

Source: Slides exported from the provided deck; pages chosen to illustrate the written narrative.

Operational Deep Dives

Red Team Report Workflow

After every capability upgrade, the red team produces a narrative report. It includes timeline, trigger evidence, and mitigation steps. The narrative is structured as (1) pre-incident context, (2) capability vector, (3) observed behavior, (4) hypothesized failure mode, (5) recommended controls.

The report is reviewed by the governance council and ingested into the Capability Report. When multiple reports cite the same failure mode, an Issue Cluster is created and assigned to a dedicated Capability Owner.
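
The Issue Cluster rule amounts to grouping reports by failure mode and keeping the modes cited more than once. A minimal sketch, assuming each report is a dict with hypothetical `id` and `failure_mode` keys:

```python
from collections import defaultdict


def issue_clusters(reports: list[dict]) -> dict[str, list[str]]:
    """Group report IDs by failure mode; keep modes cited more than once."""
    by_mode: dict[str, list[str]] = defaultdict(list)
    for report in reports:
        by_mode[report["failure_mode"]].append(report["id"])
    return {mode: ids for mode, ids in by_mode.items() if len(ids) > 1}


reports = [
    {"id": "RT-1", "failure_mode": "hallucinated reference"},
    {"id": "RT-2", "failure_mode": "tool misuse"},
    {"id": "RT-3", "failure_mode": "hallucinated reference"},
]
print(issue_clusters(reports))
# {'hallucinated reference': ['RT-1', 'RT-3']}
```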

Compliance and Transparency Reporting

Anthropic publishes regular transparency briefs describing: model changes, test results, and outstanding concerns. Each brief is validated against the Long-term Benefit Trust board schedule.

The transparency workflow includes three sign-offs: research lead, safety lead, trustee. Any deviation before sign-off triggers a pause and a formal explanation to the board.

Data Logging Strategy

All instrumentation data (capability metrics, tool calls, overrides) are streamed into a multi-tenant warehouse with [tag, timestamp, event_id, payload] records. The ingestion pipeline tags events with project, agent_version, ASL_level.

Raw logs are retained for 90 days; aggregated dashboards (alignment drift, tool success) are kept for 365 days. This allows retrospective audits long after a deployment.
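
The two retention windows can be expressed as a small lookup. The 90-day and 365-day values come from the text; the layer names and the `is_expired` helper are hypothetical.

```python
from datetime import date, timedelta

RETENTION = {
    "raw": timedelta(days=90),
    "aggregated": timedelta(days=365),
}


def is_expired(layer: str, created: date, today: date) -> bool:
    """True once a record is older than its layer's retention window."""
    return today - created > RETENTION[layer]


created = date(2024, 1, 1)
today = date(2024, 4, 15)  # 105 days later
print(is_expired("raw", created, today), is_expired("aggregated", created, today))
# True False
```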

Section Summary

These deep dives illustrate how red teaming, transparency, and logging are not ad-hoc but woven into regular governance rituals. Each ritual produces artifacts that feed benchmarking, monitoring, and deployment guardrails.

Summary and Extensions

Summary Table

Area	Core strategy	Implementation	Open question
Benchmarking	Scenario-based evaluation + rubric	Decision support case + POAP + slide rubric	How can alignment drift be quantified?
Monitoring	Telemetry + replay + alerts	Metrics/alert logs + incident loop	How can interpretability be preserved when multiple agents run at once?
Governance	Pre-deploy gates + Trust oversight	red teams + capability reports + trustee veto	How can governance artifacts be embedded in CI/CD?
Research	Capability contour + multi-agent orchestration	Long-term Benefit Trust + measurement plan	How can tail-end risks be caught before deployment?

Lecture 02 condensed into four layers: capability, safety, governance, and research

Further Reading

Anthropic, “Responsible Scaling Policy” 2023
Anthropic, “Claude Model Cards and Capability Reports” 2024
Anthropic, “Long-term Benefit Trust” governance brief
Ben Mann, talk materials from Berkeley RDI (slides included in repo)
Shevlane et al., “Model Evaluation for Extreme Risks,” 2023

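Section Summary

Lecture 02 ties capability evaluation, monitoring, deployment guardrails, and institutional governance into a closed loop: continuous evaluation, monitoring, pre-deploy gates, and trustee oversight. The research agenda pushes benchmark measurement, policy orchestration, and governance artifacts further, laying out an action framework for future agent safety.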

Slide Evidence

The lecture slides underline the emphasis on evidence-backed governance. Slide 4 distills the capability matrix, while slide 5 captures the Trust handoff timeline.

Slide 4: Capability matrix and risk signals.

Slide 5: Long-term Benefit Trust timeline.

Source: Screenshots extracted from the provided slide deck.