Notes from the first lecture of the Modern Agents series by 五道口纳什 (Wudaokou Nash): Survey of LLM Agent Reasoning & Planning, PDDL. The lecture content serves as the main thread, with my questions, discussions, and reflections interspersed as personal annotations.
Lecture PDF: ai-course-notes/modern-agent/lecture01 Video: Bilibili BV1wjY5zyEki
1. Series Overview and Course Introduction
This is the first installment of the Modern Agents paid series. It provides a macro-level overview of the core topics the series will cover, using Reasoning & Planning as the entry point for a systematic survey of LLMs' capabilities and limitations in the planning domain.
The full series covers:
- Reasoning and Planning: Reasoning and planning capabilities of language models — the foundation for everything that follows
- Classic Workflows: Self-Critique (generate-evaluate-iterate loops), Multi-Agent Systems, with Separation of Concern as the core principle
- Context Engineering: Holistic packaging of LLM-related techniques, with focus on Memory design
- Self-Evolving: Self-evolution techniques represented by AlphaEvolve, GPA, etc.
- Framework Practice: Using LangChain / LangGraph and other frameworks
- Graph-based Methods: GraphRAG and other Neuro-Symbolic approaches
- Tool Design and Wrapping: Designing appropriate Tools / Servers to complement LLM weaknesses
- Test-Time Computing: Parallel thinking (DeepConf), Majority Voting, and other inference-time compute techniques
The series' core philosophy: master more effective know-how through continuous practice, and maximize LLM performance through Agent design and development.
💬 Personal note: What is know-how?
This term appears repeatedly in the lecture and deserves careful understanding. Know-how is not knowledge — knowledge is "knowing the principles"; know-how is "knowing how to actually make things work." Papers tell you what works; know-how tells you how to make it work in reality — the gap between the two is an order of magnitude.
For example in the Agent space:
- Knowledge: ReAct is a Thought → Action → Observation loop
- Know-how: Long workflows need checkpoints; history must be trimmed or the model gets lost in the middle; tool design shouldn't be too fine-grained
These are things you only learn by doing. A useful heuristic: if you're getting better at debugging and predicting bugs — your know-how is growing. If you're reading more papers but systems still keep breaking — know-how hasn't been established.
2. Reasoning: The Foundation of Language Model Intelligence
2.1 What Is Reasoning
Reasoning is the ability to "create something from nothing" — starting from known and limited premises, deriving richer and deeper conclusions along chains of inference.
The classic syllogism is a typical example:
- Major premise: All men are mortal
- Minor premise: Socrates is a man
- Conclusion: Socrates is mortal
A model's true intelligence is manifested through Reasoning. When we feel impressed or disappointed by a model's response, it's fundamentally a perception of its Reasoning ability.
2.2 Reasoning and Agent Capabilities
Reasoning is the foundation of all higher capabilities. Without Reasoning:
- No Planning: Planning requires task decomposition, Think Ahead, multi-step thinking
- No Reflection / Critique: Reflection requires generating new conclusions from existing premises
- No effective Agentic behavior: Agents need reasoning-based decision-making for next actions
💬 Personal note: What is Agentic? How does it differ from Chat?
Chat = single/multi-turn text response system. Input → Output, passive, no real action.
Agentic = closed-loop system with autonomous action capability: Observe → Think → Act → Observe → Think → Act...
| Dimension | Chat | Agentic |
|---|---|---|
| Essence | Answer questions | Complete tasks |
| Output | Text | Actions |
| Control | User-driven | Goal-driven |
| Time structure | Single-step | Multi-step |
| Environment interaction | No | Yes |

Common trap: Many projects are just chat + tools but call themselves agentic. The real test: does the system decide its own next step?
2.3 Two Classic Reasoning Types
- Deduction: Forward reasoning, deterministic. Derives necessary conclusions from major and minor premises.
- Induction: Probabilistic reasoning. Generalizes from limited observations — more evidence means more confidence, but can never guarantee absolute correctness (e.g., "all swans are white").
2.4 Reasoning Model: Implicit Long Reasoning
Current Reasoning Models (e.g., DeepSeek R1, OpenAI O series) pursue implicit Long Reasoning:
- Complex problems → model spontaneously produces long thinking processes
- Simple problems → produces short thinking processes
- This is a capability learned during training
💬 Personal note: What does "implicit" mean? How is this trained?
"Implicit" contrasts with explicit CoT. Explicit: you write "Let's think step by step" and the model starts reasoning — you command it. Implicit: the model itself knows when to think longer.
Training mechanisms:
- Massive diverse data: When data includes code, math proofs, multi-turn dialogues, reasoning traces — the model must learn logic, reasoning chains, and pattern abstraction just to reduce loss. Capabilities are "forced out by data"
- Sufficient model capacity: Small models can only memorize; large models form latent structure — which is why CoT, tool use, in-context learning emerge only at scale
- RL training: Reward shaping — low reward for unnecessary long reasoning on simple problems, low reward for no reasoning on complex ones, high reward for appropriate reasoning
How does the model judge complexity? It doesn't have a "complexity" concept. It learns: which inputs require more computation to reduce loss. The underlying mechanism is uncertainty — concentrated probability distribution (low entropy) → simple; spread out (high entropy) → complex. Complexity perception is fundamentally a statistical summary of historical error patterns.
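The entropy intuition above can be made concrete with a few lines of Python. This is only an illustration of the statistical idea (low entropy over the next-token distribution reads as "easy," high entropy as "hard"), not a claim about how any particular model is implemented; the example distributions are made up.

```python
import math

def token_entropy(probs):
    """Shannon entropy (in bits) of a next-token probability distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Concentrated distribution -> low entropy: the model "knows" the next token
easy = [0.97, 0.01, 0.01, 0.01]
# Spread-out distribution -> high entropy: many plausible continuations
hard = [0.30, 0.25, 0.25, 0.20]

print(round(token_entropy(easy), 2))  # ~0.24 bits
print(round(token_entropy(hard), 2))  # ~1.99 bits
```

A reasoning model that "thinks longer on hard inputs" can be pictured as allocating more compute exactly where this quantity stays high across the sequence.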
2.5 Chapter Summary
Reasoning is the cornerstone of all higher capabilities in language models. Modern Reasoning Models are enhancing implicit long-chain reasoning through RL training.
3. Planning: The Core Challenge for LLM Agents
3.1 What Is Planning
Planning is task decomposition — whether implicit or explicit, it requires:
- Identifying the user's goal
- Decomposing the task into executable sub-steps
- Think Ahead: planning multiple steps forward
- Think Multiple Stages: considering multiple phases
Planning is needed because LLMs have clear deficiencies when solving complex, long-horizon, Multi-Hop problems.
💬 Personal note: What is Multi-Hop?
Multi-hop = multi-step reasoning where you can't get the answer in one step.
- Single-hop: What's the capital of France? → Paris (one step)
- Multi-hop: What's the capital of Einstein's birth country? → Einstein → Germany → Berlin (two steps)
The real difficulty isn't "multiple steps" but dependency — each step depends on previous results, and errors cascade (error propagation).
3.2 Planning Methods Survey
Based on a 2024 survey on LLM Planning, methods can be categorized as follows:
3.2.1 Decomposition
- CoT (Chain of Thought): "Let's think step by step" — model solves step by step
- ReAct: Continuous environment interaction — Thought → Action → Observation loop
- Plan-and-Solve: Explicitly form a Plan, then Solve along it
- PoT (Program of Thought): Convert thinking into executable programs
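The PoT idea in the last bullet can be sketched in a few lines: instead of reasoning about numbers in prose, the model emits a program and an interpreter computes the answer. In a real system `model_generated_program` would come from an LLM call; here it is hard-coded for illustration.

```python
# Program of Thought (PoT), minimally: the LLM writes the program,
# the Python interpreter (not the LLM) does the arithmetic.

# In a real system this string would be generated by the model.
model_generated_program = """
price = 120
discount = 0.15
tax = 0.08
discounted = price * (1 - discount)
answer = round(discounted * (1 + tax), 2)
"""

def run_program(program: str) -> float:
    """Execute the generated program and read out its `answer` variable."""
    scope: dict = {}
    exec(program, {}, scope)  # interpreter executes; no mental math by the LLM
    return scope["answer"]

print(run_program(model_generated_program))  # 110.16
```

The point is the division of labor: the model's job is translation from word problem to code, and exact computation is delegated, which already previews the External Solver idea below.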
💬 Personal note: Key differences between these four
| Method | Essence | Has plan? | Interacts with env? |
|---|---|---|---|
| CoT | Step-by-step thinking | ❌ | ❌ |
| ReAct | Think + do loop | ❌ | ✅ |
| Plan-and-Solve | Plan first, execute | ✅ | Optional |
| PoT | Reason via programs | Semi-explicit | ❌ |

They're not competing — they're evolving: CoT → ReAct → Plan-and-Solve → Dynamic Plan-and-Solve → Tool-based Agent, progressively approaching real intelligent systems.
3.2.2 Selection
- ToT (Tree of Thoughts): Manages search using tree structure, expanding multiple candidates from each node, scoring to select the most promising for further exploration
- MCTS (Monte Carlo Tree Search): Based on UCB values balancing exploration and exploitation. Key steps are Simulation (scoring nodes) and Selection (choosing optimal nodes)
State explosion problem: Each node can produce multiple proposals, leading to exponential state explosion. Effective scoring (Score) and pruning (Prune) are essential.
💬 Personal note: Deep dive into MCTS
The four MCTS steps:
- Selection: Use UCB formula to find the most worth-exploring node, balancing exploitation (historically good) and exploration (rarely tried)
- Expansion: Generate new child nodes (new reasoning steps or actions)
- Simulation (rollout): Random play-out from current node to completion — like mentally simulating the future
- Backpropagation: Update results back through the path, accumulating statistics (visit count N, average value Q)
Critical clarification: MCTS is not a training algorithm — it's a test-time search algorithm. I initially thought it was a training method because it "updates" things, but it updates search tree statistics (N and Q), not model weights. No gradients, no optimizer. However, it can indirectly participate in training — AlphaGo uses MCTS to generate better decisions, then trains neural networks on those results. MCTS is the "teacher," the model is the "student."
In one sentence: MCTS is a test-time intelligence amplifier — trading computation time for reasoning quality.
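The four steps above can be sketched as a minimal, self-contained MCTS on a toy domain (add 1 or 2 per move, reward 1.0 for landing exactly on 10). The domain and constants are invented for illustration; the point is that backpropagation updates only the tree statistics `N` and `Q`, never any model weights.

```python
import math
import random

class Node:
    def __init__(self, state, parent=None):
        self.state = state      # here: the running total
        self.parent = parent
        self.children = {}      # action -> Node
        self.N = 0              # visit count
        self.Q = 0.0            # mean rollout value

def ucb(child, parent_n, c=1.4):
    """Upper Confidence Bound: exploitation (Q) + exploration bonus."""
    if child.N == 0:
        return float("inf")     # always try unvisited children first
    return child.Q + c * math.sqrt(math.log(parent_n) / child.N)

ACTIONS = [1, 2]                # toy domain: add 1 or 2, aim for exactly 10
TARGET = 10

def rollout(state):
    """3. Simulation: play randomly to the end; 1.0 iff we hit TARGET."""
    while state < TARGET:
        state += random.choice(ACTIONS)
    return 1.0 if state == TARGET else 0.0

def mcts(root_state, iters=2000):
    root = Node(root_state)
    for _ in range(iters):
        node = root
        # 1. Selection: descend via UCB until a leaf
        while node.children:
            node = max(node.children.values(), key=lambda ch: ucb(ch, node.N))
        # 2. Expansion: grow the tree at a non-terminal leaf
        if node.state < TARGET:
            for a in ACTIONS:
                node.children[a] = Node(node.state + a, parent=node)
            node = random.choice(list(node.children.values()))
        value = rollout(node.state)
        # 4. Backpropagation: update N and Q along the path (no gradients!)
        while node:
            node.N += 1
            node.Q += (value - node.Q) / node.N   # running mean
            node = node.parent
    # Best next action = most-visited child of the root
    return max(root.children, key=lambda a: root.children[a].N)

random.seed(0)
print(mcts(root_state=7))  # from 7, "+1" is better: 8 can still reach 10 two ways
```

From state 7, a random rollout after "+2" (state 9) wins only half the time, while "+1" (state 8) wins about 75% of the time, so the visit counts concentrate on "+1". Swapping the toy rollout for an LLM-scored reasoning step gives the LLM + MCTS combination mentioned in 3.3.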
3.2.3 External Solver
Such as PDDL Planners, used in Robotics and other planning tasks. Language models often fail when directly generating Plans; introducing external symbolic solvers significantly improves reliability.
💬 Personal note: What is an External Solver?
An external solver is a program that can precisely compute results, replacing the model's guessing. It doesn't train or learn — it only computes or simulates.
Typical external solvers: Python interpreter (math), SQL engine (databases), OS simulator (GUI), Compiler (code), Game simulator (games), Symbolic reasoning engine (geometry).
Important distinction: external solver (real environment, ground truth) vs world model (model predicts the future, approximate learned simulator) — completely different things.
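A minimal concrete example of the first solver in the list, the Python interpreter for math: rather than letting the model state a numeric result, the model emits an arithmetic expression and a small `ast`-based evaluator computes it exactly. The whitelist-of-operators design is my own illustrative choice, not from the lecture.

```python
import ast
import operator

# Whitelisted operations: this "solver" does arithmetic and nothing else.
OPS = {
    ast.Add: operator.add, ast.Sub: operator.sub,
    ast.Mult: operator.mul, ast.Div: operator.truediv,
    ast.USub: operator.neg,
}

def solve_arithmetic(expr: str) -> float:
    """Evaluate a model-produced arithmetic expression exactly (ground truth),
    instead of trusting the model's guess at the number."""
    def ev(node):
        if isinstance(node, ast.Expression):
            return ev(node.body)
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in OPS:
            return OPS[type(node.op)](ev(node.left), ev(node.right))
        if isinstance(node, ast.UnaryOp) and type(node.op) in OPS:
            return OPS[type(node.op)](ev(node.operand))
        raise ValueError("unsupported expression")
    return ev(ast.parse(expr, mode="eval"))

print(solve_arithmetic("5.9 - 5.11"))   # ~0.79, exact up to float rounding
print(solve_arithmetic("2 * (3 + 4)")) # 14
```

This is exactly the "simple calculator tool" that Section 6 argues for: the LLM translates the question into an expression; the solver produces the number.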
3.2.4 Reflection & Critique
Through Self-Refine mechanisms, models reflect on and improve their own outputs. Reflection can also be introduced at the Memory level to form higher-level conclusions.
In Generative Agents, recording only surface-level facts ("he's reading a paper") has limited value. Through Reflection — synthesizing multiple observations — higher-level conclusions emerge ("this is a person passionate about research"). That's deep memory.
💬 Personal note: Where is Reflection applied in practice?
- Coding Agents (Claude Code / Cursor / Devin): write code → run tests → find errors → fix code — this IS Reflection + External Feedback
- AI NPCs (games): not just "met you today" but Reflection produces "you're reliable," influencing future behavior
- Personal AI Assistants: long-term operation extracts "user prefers Python, often writes data processing code" — not just logging individual sessions
- Agent Learning (Voyager in Minecraft): each failure produces experience rules ("don't explore at night," "make torches first") — accumulated experience via reflection
Reflection's essence isn't just "thinking" — it's information compression + abstraction. It's an architecture-level capability for solving long-term consistency.
3.3 Personal Classification of Planning Methods
The author classifies Planning methods into five categories:
| Category | Representative Methods |
|---|---|
| Reasoning Model (implicit) | DeepSeek R1, OpenAI O series |
| Plan-based (explicit) | ReAct, Plan-and-Solve |
| Search-based | ToT, LLM + MCTS |
| Test-Time Scaling | Majority Voting, DeepConf |
| Neuro-Symbolic | LM+P, Alpha Geometry, PDDL |
💬 Personal note: Intuitive mnemonic for these five categories
- Think (Reasoning): Model thinks on its own
- Write plans (Plan): Plan first, execute second
- Try many (Search): Explore many possibilities
- Compute more (Test-time): Generate multiple times, pick the best
- Call experts (Neuro-symbolic): Hand off to specialized solvers
These aren't mutually exclusive — real systems mix them. When reading a new paper, the first question should be: which of these five categories does it belong to? Or is it a combination?
3.4 Chapter Summary
When LLMs face complex long-horizon problems, directly generating Plans is error-prone. Quality can be improved through decomposition (CoT/ReAct), search (ToT/MCTS), external solvers (PDDL), and reflection (Self-Critique).
4. Classic Case Study Deep Dives
4.1 Alpha Geometry: Neuro-Symbolic Approach
4.1.1 First Generation
Core approach:
- Translate the geometry problem into symbolic language
- Let the symbolic engine attempt direct solving
- If stuck, have the language model propose constructions (e.g., adding auxiliary lines)
- Symbolic engine derives new facts from the new construction
- Loop until a complete solution is formed
Example: Given AB = AC (isosceles triangle), prove the two base angles are equal. The symbolic engine alone can't derive this from AB = AC; but the language model suggests adding point D at the midpoint of BC and connecting AD, enabling a proof through the congruent triangles ABD and ACD.
💬 Personal note: What is symbolic language?
I got stuck here initially. Symbolic language is a formally rigorous language for describing the world that machines can fully understand and verify.
- Natural language: `AB = AC, so it's isosceles`
- Symbolic language: `Equal(Length(A,B), Length(A,C))`

It's not "readable language" but "computable language." Machines can't understand the word "isosceles triangle" — they can only understand the mathematical relation `AB = AC`.

AlphaGeometry's core division of labor: LLMs excel at creativity (thinking of which auxiliary line to add — the hardest part of geometry), symbolic systems excel at rigorous proof (every step must follow mathematical rules). Think of it as: a creative student + a strictly logical math teacher.
4.1.2 Second Generation
The second generation introduced parallel multi-path search:
- Not just one language model making proposals, but multiple neural models searching in parallel
- New facts generated across different search paths are shared, forming new premises (states) for different models to continue exploring
- This parallel thinking strategy inspired later Gemini Deep Think designs
4.2 DeepConf: Confidence-Based Majority Voting
DeepConf's core innovation:
- Introduces the Token Confidence concept
- Computes confidence for each token, then aggregates to get the confidence of an entire trace (complete solution)
- Does Majority Voting weighted by confidence, rather than simple majority counting
💬 Personal note: Understanding DeepConf in depth
Why is regular Majority Voting insufficient? If 3 out of 5 solutions are wrong and 2 are correct, majority vote picks the wrong answer. But with confidence weighting, correct answers often have much higher confidence (0.9+0.8=1.7 vs 0.3+0.4+0.2=0.9), and weighted voting picks correctly.
How is confidence aggregated? Theoretically multiplication (probability chain rule), but in practice log-prob addition (to avoid numerical underflow), usually with length normalization (divide by token count n) — otherwise short answers always win.
An important caveat: Can token probability really represent confidence? Strictly speaking, no. Token prob is only "statistical confidence," not correctness. A model might output `9.11 > 9.9 → Yes` with 0.97 probability — confidently wrong. DeepConf works because erroneous reasoning tends to show anomalies in token probabilities. Not a theoretical guarantee, but a statistical phenomenon.
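The aggregation and weighted vote described above can be sketched as follows. This is a simplified reading of the idea, not DeepConf's actual implementation: confidence per trace is the length-normalized mean log-prob (mapped back through `exp`), and votes are summed per answer weighted by that confidence. The traces and numbers are invented.

```python
import math
from collections import defaultdict

def trace_confidence(token_logprobs):
    """Length-normalized trace confidence: exp(mean log-prob).
    Adding log-probs avoids underflow; dividing by length keeps
    short answers from always winning."""
    return math.exp(sum(token_logprobs) / len(token_logprobs))

def weighted_vote(traces):
    """traces: list of (answer, token_logprobs). Sum confidence per answer."""
    scores = defaultdict(float)
    for answer, lps in traces:
        scores[answer] += trace_confidence(lps)
    return max(scores, key=scores.get)

# Toy version of the note's example: 3 hesitant wrong traces vs 2 confident
# correct ones. Plain majority voting would pick "17".
traces = [
    ("42", [math.log(0.9)] * 5),   # correct, high per-token confidence
    ("42", [math.log(0.8)] * 5),
    ("17", [math.log(0.4)] * 5),   # wrong, low per-token confidence
    ("17", [math.log(0.3)] * 5),
    ("17", [math.log(0.35)] * 5),
]
print(weighted_vote(traces))  # 42
```

With uniform per-token probabilities, the confidences come out to 0.9 + 0.8 = 1.7 for "42" versus 0.4 + 0.3 + 0.35 = 1.05 for "17", so the weighted vote overrides the raw 3-to-2 majority, matching the 1.7 vs 0.9 intuition in the note.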
4.3 UCLA IMO: Self-Critique Workflow
UCLA's approach to solving IMO problems is a classic Self-Critique workflow:
- Steps 1-2: Generate an initial proposal (solution)
- Step 3: Submit to Critique for verification, producing specific issues and feedback
- Step 4: Language model makes corrections based on feedback
- Loop until 5 consecutive passes (accept) or 10 consecutive failures (reject)
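The control flow of the four steps above can be written down directly. The `generate` / `critique` / `revise` callables stand in for LLM calls; the stub "model" at the bottom is a toy whose solution is a number nudged toward a target, invented purely to exercise the loop and the lecture's accept/reject rule.

```python
def self_critique_loop(generate, critique, revise,
                       accept_after=5, reject_after=10):
    """Generate -> critique -> revise, with the lecture's stop rule:
    accept after 5 consecutive passes, reject after 10 consecutive failures."""
    solution = generate()
    passes = fails = 0
    while True:
        feedback = critique(solution)      # None means "no issues found"
        if feedback is None:
            passes += 1
            fails = 0
            if passes >= accept_after:
                return ("accept", solution)
        else:
            fails += 1
            passes = 0
            if fails >= reject_after:
                return ("reject", solution)
            solution = revise(solution, feedback)

# Stub "model": the solution is a number we nudge toward 10.
gen = lambda: 6
crit = lambda s: None if s == 10 else f"{s} is off by {10 - s}"
rev = lambda s, fb: s + 1
print(self_critique_loop(gen, crit, rev))  # ('accept', 10)
```

Requiring several consecutive passes only matters when the critic is stochastic (as an LLM critic is); re-querying it acts as repeated verification before accepting.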
The keyword is capable — the model itself has the ability to solve the problem; it just needs the right workflow to activate it.
💬 Personal note: Why is Self-Critique stronger than DeepConf?
DeepConf is "pick the most confident answer" (probabilistic constraint); Self-Critique is "actively find errors" (logical constraint) — the latter is more reliable.
A key insight: LLMs' verification ability is often stronger than their generation ability. They might not get it right the first time, but when asked to "check Step 3," they can often spot the error. Self-Critique exploits exactly this.
Risk: Errors can self-reinforce — wrong critique → wrong correction → worse. Hence the maximum loop limit (10 consecutive failures → reject).
This pattern is used daily in coding agents: write code → run tests → find errors → fix → retest. Claude Code, Cursor — they're all doing this.
4.4 Chapter Summary
Three classic cases demonstrate Neuro-Symbolic (Alpha Geometry), Test-Time Scaling (DeepConf), and Self-Critique (UCLA IMO) — three Planning strategies that converge on the same principle: maximizing model reasoning potential through carefully designed workflows.
5. From ReAct to Plan-and-Solve
5.1 ReAct's Limitations
ReAct uses Thought → Action → Observation loops for complex problems, but has a critical issue:
When ReAct loops run long, severe context problems emerge:
- Lost in the Middle — the model gets lost in lengthy context
- Forgets the user's original task
- Forgets what it has done and what results it obtained
- Cannot produce a reasonable next Action
5.2 Plan-and-Solve Prompting
Plan-and-Solve improves on classic Zero-shot CoT with:
"Let's first understand the problem and devise a step-by-step plan to solve it. Then let's carry out the plan and solve the problem step by step."
Compared to "Let's think step by step," Plan-and-Solve explicitly introduces the Plan concept, yielding roughly several percentage points of improvement on complex problems.
5.3 Dynamic Plan-and-Solve
The community further improved Plan-and-Solve by adding dynamism:
Plans are formed in the head ("armchair thinking") without real environment interaction. Real execution produces new feedback, requiring Re-Plan — incorporating new feedback into the planning mechanism for dynamic correction.
Complete flow:
- Plan: Planner generates a top-down plan (List of Steps / To-Do List)
- Execute: Each step is handled by a ReAct Agent
- Re-Plan: Based on new feedback, decide whether to finish or re-plan
Re-Plan input includes: original problem + original Plan + previously executed steps and their results.
Economic consideration: Strong-weak model combination — Use strong models for Planner and Re-Plan (planning determines quality ceiling), weak models for concrete ReAct execution (individual steps are relatively simple). This balances cost and effectiveness.
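The Plan / Execute / Re-Plan flow above can be sketched as a short control loop. The `planner`, `executor`, and `replanner` callables stand in for the strong and weak model calls described in the economic note; the stub implementations and step names are invented to show the control flow only.

```python
def dynamic_plan_and_solve(problem, planner, executor, replanner, max_rounds=3):
    """Plan -> execute each step (ReAct-style) -> re-plan on feedback.
    planner/replanner would be strong-model calls, executor a weak-model call."""
    plan = planner(problem)                  # top-down to-do list
    done = []                                # (step, result) history
    for _ in range(max_rounds):
        for step in plan:
            done.append((step, executor(step)))
        # Re-Plan input: original problem + current plan + executed steps/results
        plan = replanner(problem, plan, done)
        if not plan:                         # empty new plan -> finished
            break
    return done

# Stubs illustrating the flow (no real LLM calls).
planner = lambda p: ["search docs", "draft answer"]
executor = lambda step: f"result of {step}"
def replanner(problem, plan, done):
    # Pretend execution feedback revealed one missing step, then we're done.
    return ["verify citations"] if len(done) == 2 else []

history = dynamic_plan_and_solve("write report", planner, executor, replanner)
print([step for step, _ in history])
# ['search docs', 'draft answer', 'verify citations']
```

Note that the re-planner sees the executed history, which is exactly how "armchair" plans get corrected by real feedback.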
5.4 Chapter Summary
From ReAct to Plan-and-Solve to Dynamic Plan-and-Solve — a clear evolution: from pure reactive interaction, to explicit planning, to dynamic plan adjustment bridging the gap between "armchair thinking" and real-world execution.
6. PDDL: Neuro-Symbolic Planning
6.1 The Difficulty of Direct LLM Planning
In classic Planning problems like Block World, language models easily make errors when directly generating Plans. For example, GPT-4 produced errors by step four — because state constraints exist (a robotic arm can only hold one block at a time), and language models easily generate obvious logical errors internally.
6.2 LM+P: Playing to Strengths
Core ideas:
- What LLMs are bad at: Executing precise Reasoning or solving Planning problems
- What LLMs are good at: Translation — converting natural language to symbolic language
- What PDDL Solvers are good at: Producing reliable execution sequences from complete Domain definitions and Goal States
So: let the language model translate, let the PDDL Solver plan.
Concrete flow:
- A PDDL Domain definition exists (describing available actions, preconditions, effects)
- User gives a natural language instruction (e.g., "put the apple in the fridge")
- Language model translates to Goal State (e.g., apple_1 in fridge_1)
- PDDL Solver solves for a concrete Plan based on Domain and Goal
- Language model translates each step back to natural language
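The five-step flow above can be sketched as a pipeline. Everything here is a stub: the PDDL action schema is a hand-written fragment, the two "LLM" functions are hard-coded translations, and `pddl_solver` returns a canned plan where a real system would invoke an actual planner on the PDDL text. Only the shape of the pipeline is the point.

```python
# LM+P pipeline sketch: the LLM translates, a PDDL solver plans.

DOMAIN_SNIPPET = """
(:action put-in
  :parameters (?obj ?container)
  :precondition (and (holding ?obj) (open ?container))
  :effect (and (in ?obj ?container) (not (holding ?obj))))
"""

def llm_translate_to_goal(instruction: str) -> str:
    # LLM step 1: natural language -> symbolic goal state (hard-coded stub).
    assert instruction == "put the apple in the fridge"
    return "(in apple_1 fridge_1)"

def pddl_solver(domain: str, goal: str) -> list:
    # Symbolic step: a real planner would search the action space; stubbed here.
    return ["(open fridge_1)", "(pick-up apple_1)", "(put-in apple_1 fridge_1)"]

def llm_translate_back(action: str) -> str:
    # LLM step 2: symbolic action -> natural language (tiny stub).
    verb, *args = action.strip("()").split()
    return f"{verb.replace('-', ' ')} {' '.join(args)}"

goal = llm_translate_to_goal("put the apple in the fridge")
for act in pddl_solver(DOMAIN_SNIPPET, goal):
    print(llm_translate_back(act))
```

The LLM never touches the plan's logic; preconditions and effects live in the domain definition, so the solver cannot produce the "holding two blocks at once" class of error described in 6.1.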
Don't over-expect from language models: Even Gemini 2.5 Pro can make errors on equations as simple as 5.9 = x + 5.11. We shouldn't expect language models to solve everything — a simple calculator tool suffices. Think at the system level: use the right tool at the right step.
💬 Personal note: This "play to strengths" philosophy is everywhere in LLM systems
LM+P represents not just a specific approach but a universal system design principle: let LLMs handle the "soft" parts, let external modules handle the "hard" parts.
LLM handles (soft) External modules handle (hard) Understanding natural language Precise computation (Calculator/Python) Translation and rewriting Fact retrieval (RAG/Search) Semantic compression Correctness verification (Tests/Compiler) Generating candidates Program execution (Interpreter) Local flexible reasoning State consistency (Rule Engine) Multi-module coordination Long-term storage (Database/Memory) More bluntly: truly strong systems aren't built with bigger models — they're built with better-drawn module boundaries.
6.3 Chapter Summary
PDDL represents the essence of the Neuro-Symbolic approach: let the language model do what it's good at (translation), let specialized symbolic solvers handle precise planning. This play-to-strengths design philosophy is key to building reliable Agents.
7. Conclusions and Extensions
7.1 Core Takeaways
- Explicit or implicit Plans are necessary and effective: Plan mechanisms significantly improve LLM performance on complex long-horizon problems
- Combining Top-Down and Bottom-Up: Top-Down for task decomposition into Plans, Bottom-Up through real interaction feedback for Re-Planning, bridging the gap between mental planning and real execution
- Understand model strengths and weaknesses, play to strengths: Design appropriate Workflows and Tools to activate model capabilities while avoiding weaknesses
- System-level design: Language model + Tools + Prompts must be considered holistically for stable and effective solutions
7.2 Extended Thoughts
- Importance of Workflow Design: The UCLA IMO case proves models are capable — the key is whether workflow design can maximally activate model potential
- From Agentic Workflow to Autonomous Agent: Most current Coding Agents (Claude Code, Codex, Cursor) use predefined workflows rather than fully autonomous agents — because they're controllable, reliable, and predictable. Fully autonomous agents are the ultimate goal but are limited by current model capabilities
- Agent design may be as important as model training: From a practical standpoint, achieving better performance through careful Agent design may be more cost-effective than training stronger models
💬 Personal note: Current Coding Agent paradigms
This final observation is crucial. The 2026 reality: mainstream coding agents aren't "pure ReAct" but rather a ReAct-like execution loop wrapped with plan / verify / repair / checkpoint layers — a workflow agent.
A typical real flow:
Task intake → Plan → Localize files → Edit code → Run verifier (tests/build/lint) → Inspect output → Repair (if failed) → Checkpoint → PR or patch

Why not fully autonomous agents? Unstable (prone to detours and loops), uncontrollable (unpredictable, hard to debug), expensive (more tokens), and most tasks don't need it.
This also raises a key question for benchmarking: are you measuring model capability or workflow design quality? A critical distinction.