Notes from the first lecture of the Modern Agents series by 五道口纳什 (Wudaokou Nash): Survey of LLM Agent Reasoning & Planning, PDDL. The lecture content serves as the main thread, with my questions, discussions, and reflections interspersed as personal annotations.
Lecture PDF: ai-course-notes/modern-agent/lecture01 Video: Bilibili BV1wjY5zyEki
1. Series Overview and Course Introduction
This is the first installment of the Modern Agents paid series. It provides a macro-level overview of the core topics the series will cover, using Reasoning & Planning as the entry point for a systematic survey of LLMs' capabilities and limitations in the planning domain.
The full series covers:
- Reasoning and Planning: Reasoning and planning capabilities of language models — the foundation for everything that follows
- Classic Workflows: Self-Critique (generate-evaluate-iterate loops), Multi-Agent Systems, with Separation of Concern as the core principle
- Context Engineering: Holistic packaging of LLM-related techniques, with focus on Memory design
- Self-Evolving: Self-evolution techniques represented by AlphaEvolve, GPA, etc.
- Framework Practice: Using LangChain / LangGraph and other frameworks
- Graph-based Methods: GraphRAG and other Neuro-Symbolic approaches
- Tool Design and Wrapping: Designing appropriate Tools / Servers to complement LLM weaknesses
- Test-Time Computing: Parallel thinking (DeepConf), Majority Voting, and other inference-time compute techniques
The series' core philosophy: master more effective know-how through continuous practice, and maximize LLM performance through Agent design and development.
💬 Personal note: What is know-how?
This term appears repeatedly in the lecture and deserves careful understanding. Know-how is not knowledge — knowledge is "knowing the principles"; know-how is "knowing how to actually make things work." Papers tell you what works; know-how tells you how to make it work in reality — the gap between the two is an order of magnitude.
For example in the Agent space:
- Knowledge: ReAct is a Thought → Action → Observation loop
- Know-how: Long workflows need checkpoints; history must be trimmed or the model gets lost in the middle; tool design shouldn't be too fine-grained
These are things you only learn by doing. A useful heuristic: if you're getting better at debugging and predicting bugs — your know-how is growing. If you're reading more papers but systems still keep breaking — know-how hasn't been established.
2. Reasoning: The Foundation of Language Model Intelligence
2.1 What Is Reasoning
Reasoning is the ability to "create something from nothing" — starting from known and limited premises, deriving richer and deeper conclusions along chains of inference.
The classic syllogism is a typical example:
- Major premise: All men are mortal
- Minor premise: Socrates is a man
- Conclusion: Socrates is mortal
A model's true intelligence is manifested through Reasoning. When we feel impressed or disappointed by a model's response, it's fundamentally a perception of its Reasoning ability.
2.2 Reasoning and Agent Capabilities
Reasoning is the foundation of all higher capabilities. Without Reasoning:
- No Planning: Planning requires task decomposition, Think Ahead, multi-step thinking
- No Reflection / Critique: Reflection requires generating new conclusions from existing premises
- No effective Agentic behavior: Agents need reasoning-based decision-making for next actions
💬 Personal note: What is Agentic? How does it differ from Chat?
Chat = single/multi-turn text response system. Input → Output, passive, no real action.
Agentic = closed-loop system with autonomous action capability: Observe → Think → Act → Observe → Think → Act...
| Dimension | Chat | Agentic |
|---|---|---|
| Essence | Answer questions | Complete tasks |
| Output | Text | Actions |
| Control | User-driven | Goal-driven |
| Time structure | Single-step | Multi-step |
| Environment interaction | No | Yes |

Common trap: Many projects are just chat + tools but call themselves agentic. The real test: does the system decide its own next step?
2.3 Two Classic Reasoning Types
- Deduction: Forward reasoning, deterministic. Derives necessary conclusions from major and minor premises.
- Induction: Probabilistic reasoning. Generalizes from limited observations — more evidence means more confidence, but can never guarantee absolute correctness (e.g., "all swans are white").
2.4 Reasoning Model: Implicit Long Reasoning
Current Reasoning Models (e.g., DeepSeek R1, OpenAI O series) pursue implicit Long Reasoning:
- Complex problems → model spontaneously produces long thinking processes
- Simple problems → produces short thinking processes
- This is a capability learned during training
💬 Personal note: What does "implicit" mean? How is this trained?
"Implicit" contrasts with explicit CoT. Explicit: you write "Let's think step by step" and the model starts reasoning — you command it. Implicit: the model itself knows when to think longer.
Training mechanisms:
- Massive diverse data: When data includes code, math proofs, multi-turn dialogues, reasoning traces — the model must learn logic, reasoning chains, and pattern abstraction just to reduce loss. Capabilities are "forced out by data"
- Sufficient model capacity: Small models can only memorize; large models form latent structure — which is why CoT, tool use, in-context learning emerge only at scale
- RL training: Reward shaping — low reward for unnecessary long reasoning on simple problems, low reward for no reasoning on complex ones, high reward for appropriate reasoning
How does the model judge complexity? It doesn't have a "complexity" concept. It learns: which inputs require more computation to reduce loss. The underlying mechanism is uncertainty — concentrated probability distribution (low entropy) → simple; spread out (high entropy) → complex. Complexity perception is fundamentally a statistical summary of historical error patterns.
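The entropy intuition above can be made concrete with a few lines of Python. This is only an illustration of the statistical idea (low entropy over the next-token distribution reads as "easy," high entropy as "hard"), not a claim about how any particular model is implemented; the example distributions are made up.

```python
import math

def token_entropy(probs):
    """Shannon entropy (in bits) of a next-token probability distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Concentrated distribution -> low entropy: the model "knows" the next token
easy = [0.97, 0.01, 0.01, 0.01]
# Spread-out distribution -> high entropy: many plausible continuations
hard = [0.30, 0.25, 0.25, 0.20]

print(round(token_entropy(easy), 2))  # ~0.24 bits
print(round(token_entropy(hard), 2))  # ~1.99 bits
```

A reasoning model that "thinks longer on hard inputs" can be pictured as allocating more compute exactly where this quantity stays high across the sequence.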
2.5 Chapter Summary
Reasoning is the cornerstone of all higher capabilities in language models. Modern Reasoning Models are enhancing implicit long-chain reasoning through RL training.
3. Planning: The Core Challenge for LLM Agents
3.1 What Is Planning
Planning is task decomposition — whether implicit or explicit, it requires:
- Identifying the user's goal
- Decomposing the task into executable sub-steps
- Think Ahead: planning multiple steps forward
- Think Multiple Stages: considering multiple phases
Planning is needed because LLMs have clear deficiencies when solving complex, long-horizon, Multi-Hop problems.
💬 Personal note: What is Multi-Hop?
Multi-hop = multi-step reasoning where you can't get the answer in one step.
- Single-hop: What's the capital of France? → Paris (one step)
- Multi-hop: What's the capital of Einstein's birth country? → Einstein → Germany → Berlin (two steps)
The real difficulty isn't "multiple steps" but dependency — each step depends on previous results, and errors cascade (error propagation).
3.2 Planning Methods Survey
Based on a 2024 survey on LLM Planning, methods can be categorized as follows:
3.2.1 Decomposition
- CoT (Chain of Thought): "Let's think step by step" — model solves step by step
- ReAct: Continuous environment interaction — Thought → Action → Observation loop
- Plan-and-Solve: Explicitly form a Plan, then Solve along it
- PoT (Program of Thought): Convert thinking into executable programs
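The PoT idea in the last bullet can be sketched in a few lines: instead of reasoning about numbers in prose, the model emits a program and an interpreter computes the answer. In a real system `model_generated_program` would come from an LLM call; here it is hard-coded for illustration.

```python
# Program of Thought (PoT), minimally: the LLM writes the program,
# the Python interpreter (not the LLM) does the arithmetic.

# In a real system this string would be generated by the model.
model_generated_program = """
price = 120
discount = 0.15
tax = 0.08
discounted = price * (1 - discount)
answer = round(discounted * (1 + tax), 2)
"""

def run_program(program: str) -> float:
    """Execute the generated program and read out its `answer` variable."""
    scope: dict = {}
    exec(program, {}, scope)  # interpreter executes; no mental math by the LLM
    return scope["answer"]

print(run_program(model_generated_program))  # 110.16
```

The point is the division of labor: the model's job is translation from word problem to code, and exact computation is delegated, which already previews the External Solver idea below.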
💬 Personal note: Key differences between these four
| Method | Essence | Has plan? | Interacts with env? |
|---|---|---|---|
| CoT | Step-by-step thinking | ❌ | ❌ |
| ReAct | Think + do loop | ❌ | ✅ |
| Plan-and-Solve | Plan first, execute | ✅ | Optional |
| PoT | Reason via programs | Semi-explicit | ❌ |

They're not competing — they're evolving: CoT → ReAct → Plan-and-Solve → Dynamic Plan-and-Solve → Tool-based Agent, progressively approaching real intelligent systems.
3.2.2 Selection
- ToT (Tree of Thoughts): Manages search using tree structure, expanding multiple candidates from each node, scoring to select the most promising for further exploration
- MCTS (Monte Carlo Tree Search): Based on UCB values balancing exploration and exploitation. Key steps are Simulation (scoring nodes) and Selection (choosing optimal nodes)
State explosion problem: Each node can produce multiple proposals, leading to exponential state explosion. Effective scoring (Score) and pruning (Prune) are essential.
💬 Personal note: Deep dive into MCTS
The four MCTS steps:
- Selection: Use UCB formula to find the most worth-exploring node, balancing exploitation (historically good) and exploration (rarely tried)
- Expansion: Generate new child nodes (new reasoning steps or actions)
- Simulation (rollout): Random play-out from current node to completion — like mentally simulating the future
- Backpropagation: Update results back through the path, accumulating statistics (visit count N, average value Q)
Critical clarification: MCTS is not a training algorithm — it's a test-time search algorithm. I initially thought it was a training method because it "updates" things, but it updates search tree statistics (N and Q), not model weights. No gradients, no optimizer. However, it can indirectly participate in training — AlphaGo uses MCTS to generate better decisions, then trains neural networks on those results. MCTS is the "teacher," the model is the "student."
In one sentence: MCTS is a test-time intelligence amplifier — trading computation time for reasoning quality.
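The four steps above can be sketched as a minimal, self-contained MCTS on a toy domain (add 1 or 2 per move, reward 1.0 for landing exactly on 10). The domain and constants are invented for illustration; the point is that backpropagation updates only the tree statistics `N` and `Q`, never any model weights.

```python
import math
import random

class Node:
    def __init__(self, state, parent=None):
        self.state = state      # here: the running total
        self.parent = parent
        self.children = {}      # action -> Node
        self.N = 0              # visit count
        self.Q = 0.0            # mean rollout value

def ucb(child, parent_n, c=1.4):
    """Upper Confidence Bound: exploitation (Q) + exploration bonus."""
    if child.N == 0:
        return float("inf")     # always try unvisited children first
    return child.Q + c * math.sqrt(math.log(parent_n) / child.N)

ACTIONS = [1, 2]                # toy domain: add 1 or 2, aim for exactly 10
TARGET = 10

def rollout(state):
    """3. Simulation: play randomly to the end; 1.0 iff we hit TARGET."""
    while state < TARGET:
        state += random.choice(ACTIONS)
    return 1.0 if state == TARGET else 0.0

def mcts(root_state, iters=2000):
    root = Node(root_state)
    for _ in range(iters):
        node = root
        # 1. Selection: descend via UCB until a leaf
        while node.children:
            node = max(node.children.values(), key=lambda ch: ucb(ch, node.N))
        # 2. Expansion: grow the tree at a non-terminal leaf
        if node.state < TARGET:
            for a in ACTIONS:
                node.children[a] = Node(node.state + a, parent=node)
            node = random.choice(list(node.children.values()))
        value = rollout(node.state)
        # 4. Backpropagation: update N and Q along the path (no gradients!)
        while node:
            node.N += 1
            node.Q += (value - node.Q) / node.N   # running mean
            node = node.parent
    # Best next action = most-visited child of the root
    return max(root.children, key=lambda a: root.children[a].N)

random.seed(0)
print(mcts(root_state=7))  # from 7, "+1" is better: 8 can still reach 10 two ways
```

From state 7, a random rollout after "+2" (state 9) wins only half the time, while "+1" (state 8) wins about 75% of the time, so the visit counts concentrate on "+1". Swapping the toy rollout for an LLM-scored reasoning step gives the LLM + MCTS combination mentioned in 3.3.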
3.2.3 External Solver
Such as PDDL Planners, used in Robotics and other planning tasks. Language models often fail when directly generating Plans; introducing external symbolic solvers significantly improves reliability.
💬 Personal note: What is an External Solver?
An external solver is a program that can precisely compute results, replacing the model's guessing. It doesn't train or learn — it only computes or simulates.
Typical external solvers: Python interpreter (math), SQL engine (databases), OS simulator (GUI), Compiler (code), Game simulator (games), Symbolic reasoning engine (geometry).
Important distinction: external solver (real environment, ground truth) vs world model (model predicts the future, approximate learned simulator) — completely different things.
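A minimal concrete example of the first solver in the list, the Python interpreter for math: rather than letting the model state a numeric result, the model emits an arithmetic expression and a small `ast`-based evaluator computes it exactly. The whitelist-of-operators design is my own illustrative choice, not from the lecture.

```python
import ast
import operator

# Whitelisted operations: this "solver" does arithmetic and nothing else.
OPS = {
    ast.Add: operator.add, ast.Sub: operator.sub,
    ast.Mult: operator.mul, ast.Div: operator.truediv,
    ast.USub: operator.neg,
}

def solve_arithmetic(expr: str) -> float:
    """Evaluate a model-produced arithmetic expression exactly (ground truth),
    instead of trusting the model's guess at the number."""
    def ev(node):
        if isinstance(node, ast.Expression):
            return ev(node.body)
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in OPS:
            return OPS[type(node.op)](ev(node.left), ev(node.right))
        if isinstance(node, ast.UnaryOp) and type(node.op) in OPS:
            return OPS[type(node.op)](ev(node.operand))
        raise ValueError("unsupported expression")
    return ev(ast.parse(expr, mode="eval"))

print(solve_arithmetic("5.9 - 5.11"))   # ~0.79, exact up to float rounding
print(solve_arithmetic("2 * (3 + 4)")) # 14
```

This is exactly the "simple calculator tool" that Section 6 argues for: the LLM translates the question into an expression; the solver produces the number.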
3.2.4 Reflection & Critique
Through Self-Refine mechanisms, models reflect on and improve their own outputs. Reflection can also be introduced at the Memory level to form higher-level conclusions.
In Generative Agents, recording only surface-level facts ("he's reading a paper") has limited value. Through Reflection — synthesizing multiple observations — higher-level conclusions emerge ("this is a person passionate about research"). That's deep memory.
💬 Personal note: Where is Reflection applied in practice?
- Coding Agents (Claude Code / Cursor / Devin): write code → run tests → find errors → fix code — this IS Reflection + External Feedback
- AI NPCs (games): not just "met you today" but Reflection produces "you're reliable," influencing future behavior
- Personal AI Assistants: long-term operation extracts "user prefers Python, often writes data processing code" — not just logging individual sessions
- Agent Learning (Voyager in Minecraft): each failure produces experience rules ("don't explore at night," "make torches first") — accumulated experience via reflection
Reflection's essence isn't just "thinking" — it's information compression + abstraction. It's an architecture-level capability for solving long-term consistency.
3.3 Personal Classification of Planning Methods
The author classifies Planning methods into five categories:
| Category | Representative Methods |
|---|---|
| Reasoning Model (implicit) | DeepSeek R1, OpenAI O series |
| Plan-based (explicit) | ReAct, Plan-and-Solve |
| Search-based | ToT, LLM + MCTS |
| Test-Time Scaling | Majority Voting, DeepConf |
| Neuro-Symbolic | LM+P, Alpha Geometry, PDDL |
💬 Personal note: Intuitive mnemonic for these five categories
- Think (Reasoning): Model thinks on its own
- Write plans (Plan): Plan first, execute second
- Try many (Search): Explore many possibilities
- Compute more (Test-time): Generate multiple times, pick the best
- Call experts (Neuro-symbolic): Hand off to specialized solvers
These aren't mutually exclusive — real systems mix them. When reading a new paper, the first question should be: which of these five categories does it belong to? Or is it a combination?
3.4 Chapter Summary
When LLMs face complex long-horizon problems, directly generating Plans is error-prone. Quality can be improved through decomposition (CoT/ReAct), search (ToT/MCTS), external solvers (PDDL), and reflection (Self-Critique).
4. Classic Case Study Deep Dives
4.1 Alpha Geometry: Neuro-Symbolic Approach
4.1.1 First Generation
Core approach:
- Translate the geometry problem into symbolic language
- Let the symbolic engine attempt direct solving
- If stuck, have the language model propose constructions (e.g., adding auxiliary lines)
- Symbolic engine derives new facts from the new construction
- Loop until a complete solution is formed
Example: Given AB = AC (isosceles triangle), prove the two base angles are equal. The symbolic engine alone can't derive this from AB = AC; but the language model suggests adding point D at the midpoint of BC and connecting AD, enabling a proof through the congruent triangles ABD and ACD.
💬 Personal note: What is symbolic language?
I got stuck here initially. Symbolic language is a formally rigorous language for describing the world that machines can fully understand and verify.
- Natural language: `AB = AC, so it's isosceles`
- Symbolic language: `Equal(Length(A,B), Length(A,C))`

It's not "readable language" but "computable language." Machines can't understand the word "isosceles triangle" — they can only understand the mathematical relation `AB = AC`.

AlphaGeometry's core division of labor: LLMs excel at creativity (thinking of which auxiliary line to add — the hardest part of geometry), symbolic systems excel at rigorous proof (every step must follow mathematical rules). Think of it as: a creative student + a strictly logical math teacher.
4.1.2 Second Generation
The second generation introduced parallel multi-path search:
- Not just one language model making proposals, but multiple neural models searching in parallel
- New facts generated across different search paths are shared, forming new premises (states) for different models to continue exploring
- This parallel thinking strategy inspired later Gemini Deep Think designs
4.2 DeepConf: Confidence-Based Majority Voting
DeepConf's core innovation:
- Introduces the Token Confidence concept
- Computes confidence for each token, then aggregates to get the confidence of an entire trace (complete solution)
- Does Majority Voting weighted by confidence, rather than simple majority counting
💬 Personal note: Understanding DeepConf in depth
Why is regular Majority Voting insufficient? If 3 out of 5 solutions are wrong and 2 are correct, majority vote picks the wrong answer. But with confidence weighting, correct answers often have much higher confidence (0.9+0.8=1.7 vs 0.3+0.4+0.2=0.9), and weighted voting picks correctly.
How is confidence aggregated? Theoretically multiplication (probability chain rule), but in practice log-prob addition (to avoid numerical underflow), usually with length normalization (divide by token count n) — otherwise short answers always win.
An important caveat: Can token probability really represent confidence? Strictly speaking, no. Token prob is only "statistical confidence," not correctness. A model might output `9.11 > 9.9 → Yes` with 0.97 probability — confidently wrong. DeepConf works because erroneous reasoning tends to show anomalies in token probabilities. Not a theoretical guarantee, but a statistical phenomenon.
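The aggregation and weighted vote described above can be sketched as follows. This is a simplified reading of the idea, not DeepConf's actual implementation: confidence per trace is the length-normalized mean log-prob (mapped back through `exp`), and votes are summed per answer weighted by that confidence. The traces and numbers are invented.

```python
import math
from collections import defaultdict

def trace_confidence(token_logprobs):
    """Length-normalized trace confidence: exp(mean log-prob).
    Adding log-probs avoids underflow; dividing by length keeps
    short answers from always winning."""
    return math.exp(sum(token_logprobs) / len(token_logprobs))

def weighted_vote(traces):
    """traces: list of (answer, token_logprobs). Sum confidence per answer."""
    scores = defaultdict(float)
    for answer, lps in traces:
        scores[answer] += trace_confidence(lps)
    return max(scores, key=scores.get)

# Toy version of the note's example: 3 hesitant wrong traces vs 2 confident
# correct ones. Plain majority voting would pick "17".
traces = [
    ("42", [math.log(0.9)] * 5),   # correct, high per-token confidence
    ("42", [math.log(0.8)] * 5),
    ("17", [math.log(0.4)] * 5),   # wrong, low per-token confidence
    ("17", [math.log(0.3)] * 5),
    ("17", [math.log(0.35)] * 5),
]
print(weighted_vote(traces))  # 42
```

With uniform per-token probabilities, the confidences come out to 0.9 + 0.8 = 1.7 for "42" versus 0.4 + 0.3 + 0.35 = 1.05 for "17", so the weighted vote overrides the raw 3-to-2 majority, matching the 1.7 vs 0.9 intuition in the note.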
4.3 UCLA IMO: Self-Critique Workflow
UCLA's approach to solving IMO problems is a classic Self-Critique workflow:
- Steps 1-2: Generate an initial proposal (solution)
- Step 3: Submit to Critique for verification, producing specific issues and feedback
- Step 4: Language model makes corrections based on feedback
- Loop until 5 consecutive passes (accept) or 10 consecutive failures (reject)
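The control flow of the four steps above can be written down directly. The `generate` / `critique` / `revise` callables stand in for LLM calls; the stub "model" at the bottom is a toy whose solution is a number nudged toward a target, invented purely to exercise the loop and the lecture's accept/reject rule.

```python
def self_critique_loop(generate, critique, revise,
                       accept_after=5, reject_after=10):
    """Generate -> critique -> revise, with the lecture's stop rule:
    accept after 5 consecutive passes, reject after 10 consecutive failures."""
    solution = generate()
    passes = fails = 0
    while True:
        feedback = critique(solution)      # None means "no issues found"
        if feedback is None:
            passes += 1
            fails = 0
            if passes >= accept_after:
                return ("accept", solution)
        else:
            fails += 1
            passes = 0
            if fails >= reject_after:
                return ("reject", solution)
            solution = revise(solution, feedback)

# Stub "model": the solution is a number we nudge toward 10.
gen = lambda: 6
crit = lambda s: None if s == 10 else f"{s} is off by {10 - s}"
rev = lambda s, fb: s + 1
print(self_critique_loop(gen, crit, rev))  # ('accept', 10)
```

Requiring several consecutive passes only matters when the critic is stochastic (as an LLM critic is); re-querying it acts as repeated verification before accepting.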
The keyword is capable — the model itself has the ability to solve the problem; it just needs the right workflow to activate it.
💬 Personal note: Why is Self-Critique stronger than DeepConf?
DeepConf is "pick the most confident answer" (probabilistic constraint); Self-Critique is "actively find errors" (logical constraint) — the latter is more reliable.
A key insight: LLMs' verification ability is often stronger than their generation ability. They might not get it right the first time, but when asked to "check Step 3," they can often spot the error. Self-Critique exploits exactly this.
Risk: Errors can self-reinforce — wrong critique → wrong correction → worse. Hence the maximum loop limit (10 consecutive failures → reject).
This pattern is used daily in coding agents: write code → run tests → find errors → fix → retest. Claude Code, Cursor — they're all doing this.
4.4 Chapter Summary
Three classic cases demonstrate Neuro-Symbolic (Alpha Geometry), Test-Time Scaling (DeepConf), and Self-Critique (UCLA IMO) — three Planning strategies that converge on the same principle: maximizing model reasoning potential through carefully designed workflows.
5. From ReAct to Plan-and-Solve
5.1 ReAct's Limitations
ReAct uses Thought → Action → Observation loops for complex problems, but has a critical issue:
When ReAct loops run long, severe context problems emerge:
- Lost in the Middle — the model gets lost in lengthy context
- Forgets the user's original task
- Forgets what it has done and what results it obtained
- Cannot produce a reasonable next Action
5.2 Plan-and-Solve Prompting
Plan-and-Solve improves on classic Zero-shot CoT with:
"Let's first understand the problem and devise a step-by-step plan to solve it. Then let's carry out the plan and solve the problem step by step."
Compared to "Let's think step by step," Plan-and-Solve explicitly introduces the Plan concept, yielding roughly several percentage points of improvement on complex problems.
5.3 Dynamic Plan-and-Solve
The community further improved Plan-and-Solve by adding dynamism:
Plans are formed in the head ("armchair thinking") without real environment interaction. Real execution produces new feedback, requiring Re-Plan — incorporating new feedback into the planning mechanism for dynamic correction.
Complete flow:
- Plan: Planner generates a top-down plan (List of Steps / To-Do List)
- Execute: Each step is handled by a ReAct Agent
- Re-Plan: Based on new feedback, decide whether to finish or re-plan
Re-Plan input includes: original problem + original Plan + previously executed steps and their results.
Economic consideration: Strong-weak model combination — Use strong models for Planner and Re-Plan (planning determines quality ceiling), weak models for concrete ReAct execution (individual steps are relatively simple). This balances cost and effectiveness.
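The Plan / Execute / Re-Plan flow above can be sketched as a short control loop. The `planner`, `executor`, and `replanner` callables stand in for the strong and weak model calls described in the economic note; the stub implementations and step names are invented to show the control flow only.

```python
def dynamic_plan_and_solve(problem, planner, executor, replanner, max_rounds=3):
    """Plan -> execute each step (ReAct-style) -> re-plan on feedback.
    planner/replanner would be strong-model calls, executor a weak-model call."""
    plan = planner(problem)                  # top-down to-do list
    done = []                                # (step, result) history
    for _ in range(max_rounds):
        for step in plan:
            done.append((step, executor(step)))
        # Re-Plan input: original problem + current plan + executed steps/results
        plan = replanner(problem, plan, done)
        if not plan:                         # empty new plan -> finished
            break
    return done

# Stubs illustrating the flow (no real LLM calls).
planner = lambda p: ["search docs", "draft answer"]
executor = lambda step: f"result of {step}"
def replanner(problem, plan, done):
    # Pretend execution feedback revealed one missing step, then we're done.
    return ["verify citations"] if len(done) == 2 else []

history = dynamic_plan_and_solve("write report", planner, executor, replanner)
print([step for step, _ in history])
# ['search docs', 'draft answer', 'verify citations']
```

Note that the re-planner sees the executed history, which is exactly how "armchair" plans get corrected by real feedback.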
5.4 Chapter Summary
From ReAct to Plan-and-Solve to Dynamic Plan-and-Solve — a clear evolution: from pure reactive interaction, to explicit planning, to dynamic plan adjustment bridging the gap between "armchair thinking" and real-world execution.
6. PDDL: Neuro-Symbolic Planning
6.1 The Difficulty of Direct LLM Planning
In classic Planning problems like Block World, language models easily make errors when directly generating Plans. For example, GPT-4 produced errors by step four — because state constraints exist (a robotic arm can only hold one block at a time), and language models easily generate obvious logical errors internally.
6.2 LM+P: Playing to Strengths
Core ideas:
- What LLMs are bad at: Executing precise Reasoning or solving Planning problems
- What LLMs are good at: Translation — converting natural language to symbolic language
- What PDDL Solvers are good at: Producing reliable execution sequences from complete Domain definitions and Goal States
So: let the language model translate, let the PDDL Solver plan.
Concrete flow:
- A PDDL Domain definition exists (describing available actions, preconditions, effects)
- User gives a natural language instruction (e.g., "put the apple in the fridge")
- Language model translates to Goal State (e.g., apple_1 in fridge_1)
- PDDL Solver solves for a concrete Plan based on Domain and Goal
- Language model translates each step back to natural language
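The five-step flow above can be sketched as a pipeline. Everything here is a stub: the PDDL action schema is a hand-written fragment, the two "LLM" functions are hard-coded translations, and `pddl_solver` returns a canned plan where a real system would invoke an actual planner on the PDDL text. Only the shape of the pipeline is the point.

```python
# LM+P pipeline sketch: the LLM translates, a PDDL solver plans.

DOMAIN_SNIPPET = """
(:action put-in
  :parameters (?obj ?container)
  :precondition (and (holding ?obj) (open ?container))
  :effect (and (in ?obj ?container) (not (holding ?obj))))
"""

def llm_translate_to_goal(instruction: str) -> str:
    # LLM step 1: natural language -> symbolic goal state (hard-coded stub).
    assert instruction == "put the apple in the fridge"
    return "(in apple_1 fridge_1)"

def pddl_solver(domain: str, goal: str) -> list:
    # Symbolic step: a real planner would search the action space; stubbed here.
    return ["(open fridge_1)", "(pick-up apple_1)", "(put-in apple_1 fridge_1)"]

def llm_translate_back(action: str) -> str:
    # LLM step 2: symbolic action -> natural language (tiny stub).
    verb, *args = action.strip("()").split()
    return f"{verb.replace('-', ' ')} {' '.join(args)}"

goal = llm_translate_to_goal("put the apple in the fridge")
for act in pddl_solver(DOMAIN_SNIPPET, goal):
    print(llm_translate_back(act))
```

The LLM never touches the plan's logic; preconditions and effects live in the domain definition, so the solver cannot produce the "holding two blocks at once" class of error described in 6.1.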
Don't over-expect from language models: Even Gemini 2.5 Pro can make errors on equations as simple as 5.9 = x + 5.11. We shouldn't expect language models to solve everything — a simple calculator tool suffices. Think at the system level: use the right tool at the right step.
💬 Personal note: This "play to strengths" philosophy is everywhere in LLM systems
LM+P represents not just a specific approach but a universal system design principle: let LLMs handle the "soft" parts, let external modules handle the "hard" parts.
LLM handles (soft) External modules handle (hard) Understanding natural language Precise computation (Calculator/Python) Translation and rewriting Fact retrieval (RAG/Search) Semantic compression Correctness verification (Tests/Compiler) Generating candidates Program execution (Interpreter) Local flexible reasoning State consistency (Rule Engine) Multi-module coordination Long-term storage (Database/Memory) More bluntly: truly strong systems aren't built with bigger models — they're built with better-drawn module boundaries.
6.3 Chapter Summary
PDDL represents the essence of the Neuro-Symbolic approach: let the language model do what it's good at (translation), let specialized symbolic solvers handle precise planning. This play-to-strengths design philosophy is key to building reliable Agents.
7. Conclusions and Extensions
7.1 Core Takeaways
- Explicit or implicit Plans are necessary and effective: Plan mechanisms significantly improve LLM performance on complex long-horizon problems
- Combining Top-Down and Bottom-Up: Top-Down for task decomposition into Plans, Bottom-Up through real interaction feedback for Re-Planning, bridging the gap between mental planning and real execution
- Understand model strengths and weaknesses, play to strengths: Design appropriate Workflows and Tools to activate model capabilities while avoiding weaknesses
- System-level design: Language model + Tools + Prompts must be considered holistically for stable and effective solutions
7.2 Extended Thoughts
- Importance of Workflow Design: The UCLA IMO case proves models are capable — the key is whether workflow design can maximally activate model potential
- From Agentic Workflow to Autonomous Agent: Most current Coding Agents (Claude Code, Codex, Cursor) use predefined workflows rather than fully autonomous agents — because they're controllable, reliable, and predictable. Fully autonomous agents are the ultimate goal but are limited by current model capabilities
- Agent design may be as important as model training: From a practical standpoint, achieving better performance through careful Agent design may be more cost-effective than training stronger models
💬 Personal note: Current Coding Agent paradigms
This final observation is crucial. The 2026 reality: mainstream coding agents aren't "pure ReAct" but rather a ReAct-like execution loop wrapped with plan / verify / repair / checkpoint layers — a workflow agent.
A typical real flow:
Task intake → Plan → Localize files → Edit code → Run verifier (tests/build/lint) → Inspect output → Repair (if failed) → Checkpoint → PR or patch

Why not fully autonomous agents? Unstable (prone to detours and loops), uncontrollable (unpredictable, hard to debug), expensive (more tokens), and most tasks don't need it.
This also raises a key question for benchmarking: are you measuring model capability or workflow design quality? A critical distinction.