[CS25] Retrieval Augmented Language Models — Douwe Kiela

Field	Value
Author / compiled by	Notes compiled from Douwe Kiela's lecture
Source	Stanford CS25: Transformers United
Date	April 3, 2026

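Introduction: A Systems View of Retrieval Augmentation

The Speaker and the Arc of the Talk

Douwe Kiela opens by saying "Retrieval augmentation is one of the topics right now in our field" (00:01:28), and at 00:01:55 uses "If you thought OpenAI, I'm angry with you" to stress how far back language models go. A neural language model from 1991 had already proposed factorizing the sequence probability, so today's large models are essentially "old idea + scale". From there the speaker frames retrieval augmentation as a matter of engineering traceability and governance.

Historical Roots of Language Models

Early work already covered probability factorization, word embeddings, and neural architectures. Viewing today's LLMs as "old idea + scale" helps shift attention from brand names to traceability and instrumentation.

Lecture Structure and Expectations

The lecture follows three threads: why retrieval is necessary, how the stack and the models break down, and how observability and governance work in practice. From 00:05:00 onward the terms "retrieval pipeline", "cache", and "hallucination" recur, which signals that the speaker is building toward a closed loop of retrieval + generation.

Factuality vs. Creativity

For a dialogue system to have both imagination and accuracy, the retrieval process has to be written explicitly into the pipeline: the knowledge path (retrieval, augmentation, evidence) is responsible for truth, and the generation path is responsible for fluency. The notes describe this balance as the core of Contextual AI's approach.

Chapter Summary

This chapter uses history and structure to sketch the systems view of retrieval augmentation and sets up the later material on engineering, governance, and evaluation.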

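Why Retrieval Augmentation Is Needed

The Memory Limits of LLMs

Kiela repeatedly stresses the "aging" of parametric models: once facts are baked into the weights they cannot be updated quickly, only by retraining, so when facts change the model tends to fall back on hallucination. The speaker argues that up-to-date facts have to come from an external knowledge base at inference time, and presents retrieval as the only viable mechanism for that.

Parametric vs. Non-Parametric Memory

Stale knowledge: content fixed at the training cutoff cannot be edited
Unchecked hallucination: without evidence the model still answers confidently
No audit trail: parameters carry no provenance

Retrieval augmentation treats the external knowledge base as a "non-parametric cache" that fills the blind spots of parametric memory at inference time.

The RAG Workflow and Knowledge Guarantees

Stage	Core task	Output
Retrieve	BM25 / dense / vector search	top-k documents + score + source ID
Augment	Concatenate documents, attach citation metadata	Augmented prompt (with evidence tokens)
Generate	LLM generates from the evidence	High-confidence answer + citations

Breakdown of the RAG workflow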
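
The three stages in the table can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not the pipeline from the lecture: the word-overlap scorer stands in for BM25 or a dense retriever, `generate` is a stub rather than a real LLM call, and the names `Doc`, `retrieve`, `augment`, and `generate` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Doc:
    source_id: str
    text: str


def retrieve(query, corpus, k=2):
    """Toy retriever (assumption): rank documents by word overlap with the query."""
    q = set(query.lower().split())
    scored = [(len(q & set(d.text.lower().split())), d) for d in corpus]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [(score, doc) for score, doc in scored[:k] if score > 0]


def augment(query, hits):
    """Concatenate evidence, each line tagged with its source ID, ahead of the question."""
    lines = [f"[{doc.source_id}] {doc.text}" for _, doc in hits]
    return "\n".join(lines) + f"\n\nQuestion: {query}\nAnswer with citations:"


def generate(prompt, hits):
    """Stub generator (assumption): a real system would send `prompt` to an LLM."""
    cited = ", ".join(doc.source_id for _, doc in hits)
    return f"(answer grounded in: {cited})"


corpus = [
    Doc("doc-1", "BM25 is a sparse retrieval function based on term frequency"),
    Doc("doc-2", "Dense retrieval encodes queries and documents as vectors"),
    Doc("doc-3", "Bananas are rich in potassium"),
]
query = "how does dense retrieval work"
hits = retrieve(query, corpus)
prompt = augment(query, hits)
answer = generate(prompt, hits)
```

Keeping the source ID attached to every hit from retrieval through generation is what makes the final citations possible.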

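An Auditable RAG Pipeline

Each stage should log a timestamp and a hash: retrieval logs scores and sources, augmentation logs document order, and generation logs the final citations. When a hallucination occurs, the trail shows exactly which stage was misaligned.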
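
One way to realize the timestamp-plus-hash record is sketched below. The field names and the choice of SHA-256 over a canonical JSON dump are assumptions made for illustration; the lecture notes do not prescribe a format.

```python
import hashlib
import json
import time


def audit_record(stage, payload, now=None):
    """Return a log entry with a timestamp and a SHA-256 hash of the payload."""
    body = json.dumps(payload, sort_keys=True, ensure_ascii=False)
    return {
        "stage": stage,
        "timestamp": time.time() if now is None else now,
        "sha256": hashlib.sha256(body.encode("utf-8")).hexdigest(),
        "payload": payload,
    }


def verify(record):
    """Recompute the hash to detect a payload that was altered after logging."""
    body = json.dumps(record["payload"], sort_keys=True, ensure_ascii=False)
    return hashlib.sha256(body.encode("utf-8")).hexdigest() == record["sha256"]


# Illustrative values only: one record per pipeline stage.
trail = [
    audit_record("retrieve", {"hits": [["doc-2", 0.91], ["doc-1", 0.40]]}),
    audit_record("augment", {"order": ["doc-2", "doc-1"]}),
    audit_record("generate", {"citations": ["doc-2"]}),
]
```

Serializing with `sort_keys=True` keeps the hash stable regardless of dictionary insertion order, so the same payload always verifies.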

Chapter Summary

Retrieval augmentation follows directly from the memory bottleneck of LLMs, and the three RAG steps give the whole chain traceable knowledge control.

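The Retrieval Stack

Sparse and Dense Retrieval

Kiela reviews BM25 and DPR: the former matches keywords with predictable latency, the latter captures semantics. Many teams adopt a hybrid architecture with BM25 upstream and a dense reranker downstream: sparse retrieval supplies the candidate pool and the dense reranker improves precision.

How Sparse and Dense Complement Each Other

Sparse retrieval provides recall; dense retrieval provides semantic alignment. In the pipeline, BM25 produces the candidates, which are then handed to a DPR-based reranker, delivering high-quality documents within 1–2 ms.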
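
The candidate-then-rerank pattern can be sketched as follows. This is a minimal illustration, assuming a whitespace tokenizer, the common Okapi BM25 formula with `k1=1.5` and `b=0.75`, and hand-written 2-d vectors in place of real encoder output; `hybrid_search` is a hypothetical helper, and the dense stage here is plain cosine similarity rather than a trained DPR model.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Okapi BM25 score of every document for a whitespace-tokenized query."""
    tokenized = [d.lower().split() for d in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(t) for t in tokenized) / n_docs
    df = Counter(term for t in tokenized for term in set(t))
    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1.0)
            norm = tf[term] + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


def cosine(u, v):
    dot = sum(x * y for x, y in zip(u, v))
    return dot / (math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v)))


def hybrid_search(query, query_vec, docs, doc_vecs, pool=3, k=2):
    """Stage 1: BM25 picks a candidate pool. Stage 2: dense similarity reranks it."""
    sparse = bm25_scores(query, docs)
    ranked = sorted(range(len(docs)), key=lambda i: sparse[i], reverse=True)[:pool]
    candidates = [i for i in ranked if sparse[i] > 0]
    reranked = sorted(candidates, key=lambda i: cosine(query_vec, doc_vecs[i]), reverse=True)
    return reranked[:k]


docs = [
    "the bank raised interest rates",
    "the river bank flooded after the storm",
    "interest in river rafting is growing",
    "stock markets fell sharply",
]
# Hand-written vectors standing in for encoder output: [finance, nature].
doc_vecs = [[0.9, 0.1], [0.1, 0.9], [0.2, 0.8], [0.95, 0.05]]
top = hybrid_search("bank interest", [1.0, 0.0], docs, doc_vecs)
```

Note that the fourth document is close to the query in vector space but never reaches the reranker because it shares no keywords with the query; this is the recall limit of a sparse first stage that a hybrid design has to accept or mitigate.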

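Vector Indexes and Rerankers

FAISS, ScaNN, Milvus, and similar tools address large-scale vector indexing. Kiela particularly stresses index maintenance: whenever the encoder is updated the index must be rebuilt, otherwise index drift sets in. Once the candidate pool is formed, a cross-encoder reranker plus a contextual rerank prompt can correct the noisy ordering produced by raw dot-product scores.

Drift Caused by Embedding Updates

If the encoder is upgraded and the index is not rebuilt, the new query embeddings and the old document embeddings come from mismatched distributions. The reranker then increasingly ends up selecting semantically unrelated documents, and hallucinations become frequent.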
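
A simple guard against this failure mode is to store the encoder version with the index and refuse to search on a mismatch. The sketch below assumes a version string is available for both the index and the live encoder; `VectorIndex`, `IndexDriftError`, and `check_index` are hypothetical names, not part of FAISS, ScaNN, or Milvus.

```python
from dataclasses import dataclass, field


@dataclass
class VectorIndex:
    encoder_version: str
    vectors: dict = field(default_factory=dict)  # doc_id -> embedding


class IndexDriftError(RuntimeError):
    """Raised when queries and index come from different encoder versions."""


def check_index(index, live_encoder_version):
    """Refuse to search when the index was built by a different encoder."""
    if index.encoder_version != live_encoder_version:
        raise IndexDriftError(
            f"index built with {index.encoder_version!r}, "
            f"queries encoded with {live_encoder_version!r}: rebuild required"
        )
    return True


index = VectorIndex(encoder_version="enc-v1", vectors={"doc-1": [0.1, 0.9]})
check_index(index, "enc-v1")      # versions match, search may proceed
try:
    check_index(index, "enc-v2")  # encoder upgraded, index is stale
    stale = False
except IndexDriftError:
    stale = True
```

Failing loudly at query time is a deliberate choice here: a stale index otherwise degrades silently, which is exactly the drift the notes warn about.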

Chapter Summary

The retrieval stack is an interplay of BM25 + dense + rerank, and it needs index maintenance and drift monitoring to balance recall and precision.

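Operational Blueprint and Logging

The Retrieval and Generation Pipeline

Kiela treats retrieval + generation as one pipeline: a query passes through five stages, namely embedding, retrieval, reranking, augmentation, and generation. Each stage needs its own instrumentation, for example norm drift at the embedding stage, top-k entropy at the retrieval stage, and concatenation order at the augmentation stage.

Stage	Focus	Logged data
Embedding	query/key norm drift	mean norm, variance trend
Retrieval	top-k score distribution	logits, entropy, top-k churn
Augment	document order	concat sequence + metadata tags
Generate	final tokens	citation IDs, confidence

Pipeline stages and observability checkpoints

Pipeline Observability

Writing every stage's metrics to the same dashboard makes it possible to spot a radius collapse 5–10 checkpoints before entropy drift sets in, which leaves time to snapshot and reroute.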
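
Two of the retrieval-stage signals in the table, top-k entropy and top-k churn, are cheap to compute. The definitions below are one reasonable reading (entropy of the softmax over the top-k scores, and the fraction of newly appearing IDs), stated as assumptions since the notes do not give formulas.

```python
import math


def topk_entropy(scores):
    """Shannon entropy (in nats) of the softmax over top-k retrieval scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    probs = [e / total for e in exps]
    return -sum(p * math.log(p) for p in probs if p > 0)


def topk_churn(previous_ids, current_ids):
    """Fraction of the current top-k that was absent from the previous top-k."""
    if not current_ids:
        return 0.0
    return len(set(current_ids) - set(previous_ids)) / len(current_ids)


flat = topk_entropy([1.0, 1.0, 1.0, 1.0])     # uniform scores: entropy is ln(4)
peaked = topk_entropy([10.0, 0.0, 0.0, 0.0])  # one document dominates: near zero
churn = topk_churn(["a", "b", "c", "d"], ["a", "b", "x", "y"])  # 2 of 4 replaced
```

A falling entropy means the score mass is concentrating on fewer documents, and a rising churn means the retrieved set itself is changing between checkpoints.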

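Logging Pipeline and Incident Runbook

Contextual AI writes instrumentation into its runbook: every retrieval incident records a timestamp, the top-k churn, the magnitude of the entropy drop, and the reroute action. The incident log also needs the evidence ID plus a dump for future QA.

Incident Runbook Template

1) Record the incident time and source; 2) log entropy, top-k, and logits; 3) trace the evidence ID; 4) trigger a reroute or temperature raise and write an action note.

Chapter Summary

The operational blueprint and logging turn the retrieval pipeline into a monitorable pipeline, and the incident runbook provides reproducible handling steps.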

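End-to-End Retrieval-Augmented Models

REALM and Pre-Training

REALM builds retrieval directly into masked language modeling (MLM) during pre-training: each masked prediction retrieves documents, so the generator learns to rely on evidence from the start. Kiela notes that this joint training gives a strong gradient signal, but keeping the index in sync with the updated document encoder means re-encoding the entire corpus, which is expensive. REALM therefore refreshes the index asynchronously rather than at every update.

Observability in REALM

Because the retrieval loss and the generator loss are coupled in REALM, retrieval accuracy and MLM perplexity have to be monitored together to tell whether a change in the gradient comes from the retriever or from the generator.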

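RAG Variants: Sequence / Token / Hybrid

Lewis et al. propose two variants, RAG-Sequence and RAG-Token. Both retrieve the top-k documents once per input. RAG-Sequence generates the whole output conditioned on one document at a time and marginalizes over documents at the sequence level; RAG-Token marginalizes over the documents at every token, so different tokens can draw on different documents. At 00:10:05 Kiela stresses that Token is better suited to legal and medical tasks, but at a much higher cost. The notes also describe a Hybrid option that offers a compromise through cache + token fallback.

Variant	Evidence granularity	Generation consistency	Inference cost
Sequence	One document per sequence	High	Low
Token	Per token	Highest	Very high
Hybrid	Token + cache	Close to Token	Manageable

RAG variant comparison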
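
For reference, the two marginalizations as defined in Lewis et al. (2020) are written out below. Here $x$ is the input, $y$ the output of length $N$, $z$ a retrieved document, $p_\eta$ the retriever, and $p_\theta$ the generator. The only difference is whether the sum over documents sits outside or inside the product over tokens.

```latex
p_{\text{RAG-Sequence}}(y \mid x) \approx
  \sum_{z \in \text{top-}k(p_\eta(\cdot \mid x))} p_\eta(z \mid x)
  \prod_{i=1}^{N} p_\theta(y_i \mid x, z, y_{1:i-1})

p_{\text{RAG-Token}}(y \mid x) \approx
  \prod_{i=1}^{N} \sum_{z \in \text{top-}k(p_\eta(\cdot \mid x))}
  p_\eta(z \mid x)\, p_\theta(y_i \mid x, z, y_{1:i-1})
```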

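Variant Selection Guide

When cost has to be balanced against factuality, Sequence + cache is the default, with a token-level reroute triggered only when confidence is low. High-risk tasks (legal, medical) can go straight to Token or Hybrid despite the cost.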
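
The selection guide amounts to a small routing rule. The sketch below is an illustration only: the 0.6 confidence cutoff, the domain labels, and the function name `choose_variant` are assumptions, and the notes leave open whether high-risk tasks use Token or Hybrid (Token is chosen here for simplicity).

```python
HIGH_RISK_DOMAINS = {"legal", "medical"}


def choose_variant(domain, confidence, low_confidence=0.6):
    """Pick a RAG variant: Token for high-risk domains, Sequence + cache by default.

    The 0.6 cutoff is an illustrative assumption, not a value from the lecture.
    """
    if domain in HIGH_RISK_DOMAINS:
        return "token"
    if confidence < low_confidence:
        return "token-reroute"
    return "sequence+cache"


default_route = choose_variant("support", confidence=0.9)
fallback_route = choose_variant("support", confidence=0.3)
legal_route = choose_variant("legal", confidence=0.9)
```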

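Atlas and FiD Case Studies

Atlas shows that an 11B retrieval-augmented model can outperform the 540B PaLM on few-shot question answering (Natural Questions), which suggests that retrieval can partly substitute for scale. FiD encodes each document independently and fuses them in the decoder, which lets it handle a large number of evidence passages (up to 100 in the original paper).

The Cost Trap of FiD

FiD fuses dozens of snippets in the decoder, and according to the notes inference memory and latency can both roughly double. Industrial practice typically uses document filtering + short-circuit reranking to keep the scale in check.

Chapter Summary

End-to-end retrieval-augmented models have to balance cost, gradient signal, and instrumentation across both pre-training and inference, and the variants offer different trade-offs.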

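Training and Governance

Observability Checklist for Training

Kiela recommends logging retrieval entropy, top-k diversity, and cache hit ratio during training, and using synthetic query batches to probe for search failures. The dashboard should also track gate dropout logs and entropy drift.

Three Layers of Training Observability

Signal layer: retrieval logits and entropy. Aggregation layer: roll-ups per retriever and reranker. Control layer: temperature and mix ratio automatically adjust the retrieval weight.

QA and Truth Governance

Contextual AI's truth path writes retrieval and generation into the runbook: record the KL divergence whenever temperature is adjusted, let top-k churn trigger a reroute experiment, and validate outputs with an evidence-recall tool before every release.

Governance Procedure

1) Record the temperature spike + KL divergence; 2) check top-k churn and evidence overlap; 3) trigger a reroute + human-in-the-loop audit; 4) archive the incident log for future drift post-mortems.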
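
Step 1 of the procedure can be made concrete by comparing the score distribution before and after a temperature change. The sketch assumes the distribution of interest is a temperature-scaled softmax over retrieval scores; the logits and the log field names are illustrative.

```python
import math


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax, shifted by the max for numerical stability."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q):
    """KL(p || q) in nats; assumes q is strictly positive wherever p is."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


logits = [2.0, 1.0, 0.5, 0.1]  # illustrative retrieval scores
before = softmax(logits, temperature=1.0)
after = softmax(logits, temperature=1.5)
log_entry = {
    "temperature_before": 1.0,
    "temperature_after": 1.5,
    "kl_nats": kl_divergence(before, after),
}
```

Logging the divergence next to the temperature pair gives the post-mortem a single number for how far the change moved the distribution.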

Chapter Summary

Training and governance combine instrumentation with a runbook so that retrieval augmentation is traceable and repeatable in production.

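Benchmarks and Operational Metrics

Key Metrics and Alerts

Kiela treats entropy drift, top-k churn, and retrieval lag as the most important operational signals; they map directly onto three control levers: temperature, cache freshness, and rerouting. Once entropy has dropped by 10%, attention has become too peaked, and the response is to raise the temperature or reroute low-capacity experts.

Metric	Meaning	Engineering action
Entropy drift	Drop in softmax entropy	raise temperature / increase dropout
Top-k churn	Churn rate of the top-k tokens	inspect retrieval quality and run a reroute experiment
Retrieval lag	Rising latency	prewarm cache / adjust rerank threshold

Benchmark metrics and engineering actions

From Benchmark to Operations

Map benchmark metrics to production alerts: entropy drift → change the temperature schedule, top-k churn → reroute experiment, retrieval lag → caching action.

Comparison and Post-Mortem

Contextual AI links benchmark logs with the incident log for post-mortems: after each reroute, check whether entropy, top-k, and retrieval lag recover together, to avoid repeated drift.

Post-Mortem Matrix

Arranging incidents in a matrix over four dimensions (temperature, entropy, top-k, latency) quickly shows whether the problem is a retrieval failure or generation drift.

Chapter Summary

Benchmark metrics are the gate through which the lab's geometric intuition passes into production; the tables and the incident matrix enable real-time alerting and post-mortems.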

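Deployment and Inference

Inference Cost vs. Latency Trade-off

Contextual AI's production pipeline counts generator FLOPs, retrieval latency, rerank cost, and I/O overhead against a single latency budget, and uses entropy drift as a pre-alarm: a precipitous drop in entropy means attention has become too peaked and the temperature should be raised, otherwise downstream hallucinations multiply.

Strategy	Effect	Metrics to monitor
Low temperature	Sharper attention	entropy drop, top-k hit ↑
High temperature	Larger overlap	latency + GPU usage ↑
Dynamic reroute	Reroute low-capacity experts	latency spikes + reroute frequency

Cost trade-offs at inference time

Trade-offs at Inference Time

At inference time entropy reacts faster than loss, so an entropy slide can trigger a reroute or a temperature raise before quality degrades. Tying the latency budget to evidence quality is what makes this real cost control.

Multi-Turn Interaction and Cache Governance

When prompts repeat across turns the system can rely on cache hits. Kiela considers the cache invalidation policy the thorniest part of production; many teams treat cache age as a slow alarm and refresh the retriever once it exceeds a threshold.

Cache Drift

Caching saves latency, but expired documents "bake" stale knowledge into answers, so cache freshness and reroute frequency need continuous monitoring.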
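
An age-based invalidation policy can be sketched as a small cache that treats old entries as misses. This is a minimal illustration: `EvidenceCache` is a hypothetical class, and the 30-minute limit mirrors the cache-age threshold in the incident review table later in these notes rather than a value Kiela recommends.

```python
import time


class EvidenceCache:
    """Query -> documents cache that treats entries older than `max_age_s` as stale."""

    def __init__(self, max_age_s=30 * 60, clock=time.monotonic):
        self.max_age_s = max_age_s
        self.clock = clock
        self._entries = {}  # query -> (stored_at, documents)

    def put(self, query, documents):
        self._entries[query] = (self.clock(), documents)

    def get(self, query):
        """Return cached documents, or None on a miss or a stale entry."""
        entry = self._entries.get(query)
        if entry is None:
            return None
        stored_at, documents = entry
        if self.clock() - stored_at > self.max_age_s:
            del self._entries[query]  # drop it so the caller retrieves afresh
            return None
        return documents


# A fake clock makes the expiry behaviour easy to demonstrate.
now = [0.0]
cache = EvidenceCache(max_age_s=1800, clock=lambda: now[0])
cache.put("q1", ["doc-2"])
fresh = cache.get("q1")    # read inside the 30-minute window
now[0] = 1801.0
expired = cache.get("q1")  # read after the window has passed
```

Injecting the clock keeps the policy testable and makes it easy to log cache age alongside the other signals.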

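Chapter Summary

The deployment layer watches retrieval latency, temperature, cache age, and reroute events together and treats inference cost as the combination of retrieval and generation; an answer only counts as a success when it is backed by evidence.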

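Timeline and Subtitle Profile

Key Timestamps

The lecture redefines language-model history in 00:00–04:00, introduces the retrieval pipeline and the geometric viewpoint in 00:05:00–12:00, covers instrumentation and governance in 12:00–25:00, and moves to QA and the playbook from 25:00 on. Writing timestamps into the notes lets a reviewer jump to the right video segment quickly.

Timestamp	Content	Engineering reminder
00:00–04:00	History of language models and the motivation for factuality	Build a `knowledge control` mindset
04:00–12:00	Retrieval stack and retrieval pipeline	Ensure BM25 + dense recall/precision
12:00–25:00	Instrumentation and control loop	Log entropy drift + reroute action
25:00–30:00	QA/runbook drill	Run synthetic batch and incident log

Timeline and engineering focus

The Value of Subtitle Sync

Adding key timecodes to the notes makes it possible, even without slides, to trace back the geometry, control loop, and QA details mentioned in each segment, which speeds up review.

Chapter Summary

The timeline and subtitle profile turn the video's dense line of argument into structured engineering checkpoints, so QA and review teams can trace events quickly.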

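Subtitle Excerpts and Key Quotes

00:00–04:00: Setting the Historical Perspective

The opening (00:00–04:00) restates that language models are an old idea. Kiela takes the audience back to the neural language model of 1991 and stresses that "parameters cannot be updated fast enough". The engineering takeaway is to build a knowledge-control mindset and avoid treating the model as a black box.

An Engineering Statement at the Outset

However capable current LLMs are, the speaker reminds us that they rest on old ideas, so they need a non-parametric cache to keep their facts current.

04:00–12:00: Retrieval Pipeline and Geometry

This segment focuses on the retrieval stack, treats attention as a retriever + generator pipeline, emphasizes the BM25 + dense combination, and uses geometry to explain why softmax approximates high-dimensional overlap.

Subtitle Note

The subtitles mention keywords such as "retrieval pipeline", "cache", and "hallucination", reinforcing the link between softmax and SDM geometry and reminding us to break instrumentation down by pipeline stage.

12:00–25:00: Observability and Governance

The segment starting at 12:00 revolves around instrumentation and the control loop: a 3-layer stack of temperature annealing, gate dropout, and entropy drift, plus the drill steps of an evidence-driven runbook.

Don't Watch Loss Alone

The frequent appearance of "attention drift" and "top-k churn" in the subtitles is a reminder that loss alone cannot reveal entropy collapse; the dashboard has to track entropy and retrieval metrics side by side.

25:00–End: QA and Runbook

From 25:00 the lecture moves into QA and a runbook demo, including synthetic batch validation, a temperature spike experiment, and gate dropout logging, which reflects a production-ready governance cadence.

Chapter Summary

The subtitle excerpts reinforce the engineering intent of each time segment: setting a factuality mindset → structuring the retrieval pipeline → observability and governance → QA/runbook.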

Case Studies and Industrial Practice

Contextual AI in Practice

The control loop Kiela mentions at 00:25:35 includes temperature annealing, load balancing loss, expert dropout, and gradient clipping. Combined with the instrumentation dashboard (entropy, top-k, retrieval lag), this forms the truth path, with a publicly claimed evidence recall rate above 90%.

Evidence-Driven Control Loop

Letting entropy drift, top-k churn, and retrieval lag raise alarms makes it possible to launch a reroute experiment 5–10 checkpoints early and so avoid downstream hallucination.

Observability and the Runbook

The runbook Kiela shares consists of a three-step drill: 1) synthetic query batch + entropy logging; 2) temperature spike experiment + load balancing loss; 3) gate dropout logging + gradient replay. Each incident records a timestamp and an evidence ID to support future drift analysis.

The Three RAG Runbook Steps

1) Run a synthetic batch, logging entropy/top-k; 2) simulate a temperature spike and observe the load balancing loss; 3) log gate dropout + gradient change. Only when these three steps are written into the runbook can the classroom insight carry over to production.

Chapter Summary

The case study shows that RAG is not just a model architecture but a whole culture of observability + governance, in which retrieval metrics and generation metrics are tied to the same incident log.

Implementation Q&A

Temperature Scheduling

At 00:13:10 Kiela notes that "temperature annealing gives us control over overlap": anneal linearly over the first 10k steps, then raise the temperature by 0.05 whenever entropy falls below 0.4 nat. Logging temperature-entropy pairs makes every run traceable and reproducible.

A Reproducible Schedule

Record every temperature change together with the corresponding entropy and loss, and avoid ad hoc heuristic tweaks.
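
The schedule described above can be sketched as follows. The 10k-step anneal, the 0.4 nat floor, and the 0.05 bump come from the notes; the start and end temperatures (2.0 and 1.0) and the class name `TemperatureSchedule` are assumptions, since the notes do not state them.

```python
ANNEAL_STEPS = 10_000  # from the notes: linear anneal over the first 10k steps
ENTROPY_FLOOR = 0.4    # from the notes: raise when entropy < 0.4 nat
BUMP = 0.05            # from the notes: raise temperature by 0.05


def annealed(step, t_start=2.0, t_end=1.0):
    """Linear anneal from t_start to t_end (both assumed), then hold at t_end."""
    frac = min(step, ANNEAL_STEPS) / ANNEAL_STEPS
    return t_start + (t_end - t_start) * frac


class TemperatureSchedule:
    """Tracks entropy-triggered bumps and logs every (step, temperature, entropy)."""

    def __init__(self):
        self.bumps = 0
        self.history = []

    def update(self, step, entropy_nats):
        if step >= ANNEAL_STEPS and entropy_nats < ENTROPY_FLOOR:
            self.bumps += 1
        temperature = annealed(step) + BUMP * self.bumps
        self.history.append((step, temperature, entropy_nats))
        return temperature


schedule = TemperatureSchedule()
temps = [
    schedule.update(step, entropy)
    for step, entropy in [(0, 1.2), (5_000, 0.9), (10_000, 0.5), (10_500, 0.35)]
]
```

Because every update is appended to `history`, the temperature-entropy pairing asked for above is recorded as a side effect of running the schedule.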

Monitoring Cerebellum Wiring and Gate Dropout

At 00:25:45 the lecture mentions that the "cerebellum is old, widely conserved". In engineering terms, Purkinje cell activity is mapped to gate dropout, and the dropout ratio, gate smoothing, and frequency are the key instrumentation variables.

Gate Dropout Instrumentation

Putting the gate dropout ratio, gate smoothing frequency, and downstream loss on the dashboard makes it quick to tell a hardware failure from attention collapse.

Multimodal Drift Detection

Monitor top-k churn on visual tokens and trigger a reroute if patch entropy shows a micro-plateau; the same policy extends to other non-language tokens.

Modality-Specific Drift

Visual token norms are less stable in distribution and tend to cross the entropy threshold earlier than language tokens, so the dashboard needs modality-specific alarms.

Chapter Summary

Implementation Q&A turns the lecture's question-and-answer session into three concrete checkpoints: temperature schedule, gate dropout instrumentation, and cross-modal drift alarms.

Governance Drills and Post-Mortems

Synthetic Query Drill

Contextual AI's runbook requires running a synthetic query batch before each temperature change, recording top-k entropy and softmax entropy, and comparing the gate dropout ratio before and after drift.

Generate synthetic queries (including low-resource tokens)
Run the current config and record top-k churn and entropy
Compare with the previous checkpoint; if entropy drops by more than 0.1 nat, raise the temperature
Write to the incident log, tagging the evidence ID

What Synthetic Queries Are For

Simulating drift with controlled queries can force a gate collapse to surface before it happens in production, with the evidence log written to the dashboard.

Incident Review Table

Event	Trigger	Post-mortem action
Entropy drop	entropy < 0.4	raise temperature + reroute
Top-k churn spike	top-k change > 20%	inspect retriever + rerank policy
Cache age over threshold	cache age > 30 min	refresh reranker + log

Incident decision review
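
The incident table translates directly into a rule check. The thresholds and action strings below are taken from the table; the function name `triggered_actions` and the argument units (nats, fraction, minutes) are assumptions made for the sketch.

```python
# Thresholds copied from the incident review table.
THRESHOLDS = {"entropy_nats": 0.4, "topk_churn": 0.20, "cache_age_min": 30}


def triggered_actions(entropy_nats, topk_churn, cache_age_min):
    """Return the post-mortem actions whose trigger condition is met."""
    actions = []
    if entropy_nats < THRESHOLDS["entropy_nats"]:
        actions.append("raise temperature + reroute")
    if topk_churn > THRESHOLDS["topk_churn"]:
        actions.append("inspect retriever + rerank policy")
    if cache_age_min > THRESHOLDS["cache_age_min"]:
        actions.append("refresh reranker + log")
    return actions


healthy = triggered_actions(entropy_nats=0.9, topk_churn=0.05, cache_age_min=5)
degraded = triggered_actions(entropy_nats=0.3, topk_churn=0.25, cache_age_min=45)
```

Returning a list rather than a single action matters because several triggers can fire at once, which is the multi-signal view the post-mortem section argues for.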

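Focus of the Post-Mortem

A post-mortem should weigh entropy, top-k, and cache age together rather than focusing only on the loss curve; the underlying drift may have started long before.

Chapter Summary

Governance drills turn the model's theoretical assumptions into an executable incident log, and the review table provides a structured retrospective that helps avoid repeating mistakes.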

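Cross-Lingual and Regional Governance

Multilingual Keyframes

At 00:01:48 the speaker uses a poll to ask who invented language models, a reminder that language models have deep roots. In a multilingual scenario, the retrieval pipeline needs language-specific indexes (BM25 + dense per language) and a language-aware reranker.

Language-Specific Instrumentation

Keeping separate entropy drift and top-k churn records per language makes it quick to locate which language is drifting in a multilingual deployment, and keeps global thresholds from masking minority languages.

Regional Knowledge-Base Governance

For enterprise clients Contextual AI uses energy and policy compliance corpora, and different regions impose different compliance requirements on evidence, so the evidence ID and the region tag need to be logged together.

Regional Evidence Log

Adding a region tag and a compliance tag to the incident log makes it quick to answer "does this output satisfy local policy?"

Chapter Summary

Cross-lingual and regional governance requires splitting instrumentation into two layers, language-specific and region-specific, so that enterprise deployments are both trustworthy and compliant.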

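Appendix: Subtitle Summary

00:00–04:00: History and Factual Anchors

The subtitle lines "language models are old idea" and "parameters cannot be updated fast enough" stress that facts must rest on retrieval and evidence. The knowledge-control theme in these notes comes from this passage.

"If you thought OpenAI, I'm angry with you": a reminder that language models are not recent hype but a decades-old idea.

04:00–12:00: Retrieval Pipeline Details

The subtitles repeatedly mention "retrieval pipeline" and "cache", stressing that every stage of the pipeline needs instrumentation. Writing embedding norm, top-k entropy, and similar logs to the same dashboard is our response to this passage.

Direct Subtitle Quotes

"retrieval pipeline", "cache", and "hallucination" all appear in 04:00–12:00, which marks this segment as the window for designing instrumentation.

12:00–25:00: Control Loop

The subtitle phrases "temperature annealing gives us control over overlap" and "attention drift" both point to the control loop of entropy and rerouting. The three-step runbook in the QA section corresponds to this segment.

Subtitle Reminder

That "attention drift" and "gate dropout" appear together shows that loss monitoring alone cannot catch drift; both noise and entropy signals need instrumenting.

25:00–End: QA / Runbook

Near the end, the subtitles' mentions of synthetic queries, temperature spikes, and gate dropout logging provide the drill cues. The three checkpoints in Implementation Q&A are these runbook steps turned into engineering practice.

Chapter Summary

The appendix excerpts the subtitles so that a reviewer can link video quotes directly to the engineering measures in these notes and trace every major decision back to the video.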

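Next Steps and Practice Tasks

Operational Follow-ups

Based on the lecture, the action items are:

Put the entropy, top-k, and retrieval lag curves on a single dashboard
Record evidence ID, region tag, and remediation steps for every incident
Set a temperature annealing schedule (linear anneal over the first 10k steps, raise by 0.05 when entropy < 0.4 nat)
Run synthetic batches + gate dropout logging regularly to validate the reroute policy

Research Follow-ups

The open problems the speaker mentions, carried into practice, become the future research agenda: low-resource multimodal drift detection, cross-lingual reroute policies, and instrumentation metrics for SDM geometry.

Chapter Summary

This chapter lists actionable follow-ups and a research agenda to support later iterations and review meetings.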

Appendix B: Implementation Checklist

Checklist

The checklist below ties together the instrumentation and governance steps from the lecture:

Set up an entropy/top-k/retrieval lag dashboard
Pair every temperature change with a KL divergence measurement
Maintain language-specific indexes and a region-tagged evidence log
Run synthetic query batches + gate dropout logging
Record incident time + evidence ID + remediation action
Treat top-k churn as a reroute trigger
Enable automatic temperature raise for low-resource configs
Add modality-specific entropy to multimodal deployments
Attach an evidence ID and policy tag to every response
Maintain the control layer: temperature/mix ratio automation

Chapter Summary

The implementation checklist gives concrete, executable actions that a team can check against directly during retraining or deployment.

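Operational Dashboards and Tools

Dashboard Overview

Contextual AI's dashboard has three layers: Signal (entropy, logits), Aggregation (per head and per language), and Control (temperature, reroute actions). Each layer has dedicated cards for the on-call rotation.

Card	Shows	Purpose
Entropy heatmap	head-wise entropy	Monitor attention sharpness
Top-k churn trend	top-k turnover ratio	Locate the source of drift
Retrieval lag	latency over time	Assess retrieval cost

Example dashboard cards

Deploying the Dashboard

Putting entropy, top-k, and retrieval lag on the same dashboard shows several signals at once in the early phase of drift and avoids single-signal bias.

Integration with the Governance System

Once the dashboard is integrated with the governance system, any entropy drop automatically pushes an alert to the runbook, and when a reroute action fires, the event and the evidence ID are logged automatically.

Don't Let Alerts Become Noise

Every alert should carry a severity, a timestamp, and a remediation step; otherwise on-call engineers will ignore the signal.

Chapter Summary

Operational dashboards and tools cover the full chain from signal → aggregation → control and link instrumentation data to the governance process.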

Extended Case: Multimodal Enterprise Deployment

Scenario

An enterprise LLM has to support Chinese, English, and visual prompts at once. Contextual AI deploys separate BM25 + dense retrievers on the same platform and binds modality-specific metrics to the cross-modal rerank stage.

Separate entropy and top-k logs for Chinese and English
Visual patch entropy + patch-level top-k churn
Modality-specific reroute policy: visual drift → trigger patch rerank + temperature raise

Governance Strategy

In an enterprise context, compliance verification is added: every answer carries an evidence ID and a policy tag, and the logging pipeline writes region, language, and evidence to the same file.

Cross-Modality Instrumentation

Monitoring visual, audio, and text entropy side by side allows an automatic reroute to a different expert, or a fallback to safe mode, when modality-specific drift occurs.

Chapter Summary

The extended case shows a practical path for combining multimodal and multilingual support: modality-specific instrumentation plus compliance labeling lets the RAG pipeline run reliably in an enterprise deployment.

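Open Problems and Research Directions

Low-Resource and Small-Model Settings

Kiela notes that in low-resource settings or on edge devices, mix routing collapses easily when the number of attention heads shrinks. The cache-age alarm therefore has to stay in place, temperature adjustment should be tied to retrieval entropy, and cold-start reroute policies need to be introduced.

Low-Resource Drift Strategy

In small-model settings, entropy < 0.2 nat may indicate that MoE routing has collapsed; enabling a temperature raise together with a synthetic reroute is recommended to guard against few-shot drift.

Cross-Modal and Concept-Space Extensions

The SDM concept space carries over to the multimodal case: the norm distribution of visual patch embeddings is unstable and easily causes entropy drift, so visual retrieval needs additional modality-specific monitoring.

Extending Metrics Across Modalities

Placing visual patch entropy, audio token entropy, and textual entropy side by side on the dashboard makes modality-specific drift visible and helps cross-modal attention keep its SDM-like sparsity.

Chapter Summary

The open problems point to directions for the next generation of RAG: entropy control under low resources, modality-specific metrics for multimodal systems, and cross-modal rerouting all merit further work.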

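Glossary

RAG: Retrieval-Augmented Generation; chains retrieval and generation.
Cache freshness: a measure of how recent the cached documents are; keeps facts from going stale.
Entropy drift: the rate at which softmax/attention entropy falls; an early-warning signal for hallucination.
Top-k churn: turnover of the top-k tokens during training or inference; used to detect drift.
Load balancing loss: keeps a few experts from monopolizing attention.
Gate entropy: entropy of the MoE/attention gate logits; low entropy indicates collapse.
Retrieval lag: change in retrieval latency for the same request across checkpoints.

Chapter Summary

The glossary collects the metric vocabulary of the RAG pipeline to ease communication between engineering and QA teams.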

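Summary and Further Reading

The themes of retrieval + generation + instrumentation recur throughout the lecture; the table below collects topics, insights, and engineering takeaways:

Topic	Technical insight	Engineering takeaway
LLM history and memory limits	Parametric models age	Supplement knowledge with a non-parametric cache
RAG pipeline	Retrieve/augment/generate must be auditable	Log evidence metadata at every stage
Retrieval stack	BM25 + dense + rerank	Add index drift to monitoring
End-to-end models	REALM/RAG/Atlas/FiD offer different trade-offs	Combine cache and reroute to control cost
Deployment governance	latency + entropy + cache age	Treat inference cost as retrieval + generation combined

Summary table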

Further Reading

Lewis et al., “Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks”, 2020
Guu et al., “REALM: Retrieval-Augmented Language Model Pre-Training”, 2020
Izacard et al., “Atlas: Few-shot Learning with Retrieval Augmented Language Models”, 2022
Izacard & Grave, “Leveraging Passage Retrieval with Generative Models for Open Domain QA” (FiD), 2021

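Chapter Summary

The summary organizes history, the retrieval stack, evaluation, and deployment governance into an executable playbook and gives pointers for going deeper.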