What counts as an agent
Gulli's working definition: an agent is an LLM that can decide (1) which action to take next and (2) when to stop. Add tools + memory and you have the minimum: a loop of observe → think → act. Everything else (ReAct, planners, multi-agent) is a pattern on top of that loop.
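The minimal loop can be sketched in a few lines. `call_llm` and `TOOLS` below are stand-ins for a real model API and tool registry, not any particular SDK:

```python
# Minimal observe → think → act loop (sketch). call_llm is a stub that
# returns either a tool call or a final answer; a real one hits a model API.
def call_llm(messages):
    # Stub: pretend the model calls one tool, then stops.
    if not any(m["role"] == "tool" for m in messages):
        return {"type": "tool_call", "name": "search", "input": {"q": "agents"}}
    return {"type": "final", "text": "done"}

TOOLS = {"search": lambda q: f"results for {q!r}"}  # illustrative registry

def run_agent(task, max_steps=20):
    messages = [{"role": "user", "content": task}]
    for _ in range(max_steps):                 # always cap iterations
        step = call_llm(messages)              # think: model picks next action
        if step["type"] == "final":            # model decided to stop
            return step["text"]
        result = TOOLS[step["name"]](**step["input"])          # act
        messages.append({"role": "tool", "content": result})   # observe
    return "max steps reached"
```

The `max_steps` cap is not optional decoration; it is the anti-pattern fix from the checklist at the end of these notes.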
Three axes define every agentic system:
- Determinism — pure prompt chain (deterministic) vs LLM-chosen next step (non-deterministic). Anthropic's "Building Effective Agents" blog post is explicit: prefer workflows (deterministic chains) unless you genuinely need autonomy, because agents cost 4–10× more tokens and are harder to debug.
- Single vs multi-agent — one LLM with tools vs a team of specialized agents (planner, coder, critic). Multi-agent adds coordination cost; use it when roles are cleanly separable.
- Memory depth — stateless → short-term (in-context) → long-term (vector DB / summary store).
Anchor numbers: a typical ReAct agent takes 4–12 tool calls per task; each call is 1 LLM round-trip (~2s) plus tool latency. Even a "fast" agent is 30–60s end-to-end. Budget accordingly.
Source cross-reference
Gulli's Agentic Design Patterns Ch.1–18 is the canonical taxonomy: prompt chaining, routing, parallelization, reflection, tool use, planning, multi-agent, memory, MCP, monitoring, exception handling, human-in-loop, RAG, A2A, resource optimization, reasoning, guardrails, evaluation. Memorize the chapter names.
Tool use and ReAct
Tool calling
Both Anthropic's tool_use blocks and OpenAI's tools parameter let the model emit a structured JSON call. The server executes the tool, appends the result as tool_result, and loops. The key technical details:
- JSON schema on tool inputs: the narrower the schema, the fewer invalid calls. Always include enum constraints and required fields.
- Tool descriptions matter more than names: the model chooses by description text. Write them like man pages, with examples.
- Parallel tool calls: both APIs support multiple tool calls per turn. This enables fan-out and is a large latency win (e.g., 5 web searches issued at once finish in roughly the time of the slowest one, vs. 5× that when run serially).
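A hedged example of what a narrow tool definition looks like in the Anthropic `input_schema` shape; the `get_ticket` tool and its fields are invented for illustration:

```python
# Illustrative tool definition: narrow schema, enum constraint, required
# field, and a man-page-style description with an example.
get_ticket = {
    "name": "get_ticket",
    "description": (
        "Fetch a single support ticket by ID. Use when the user refers to a "
        "specific ticket, e.g. 'what happened with TICKET-123?'. Returns "
        "subject, status, and last-update timestamp."
    ),
    "input_schema": {
        "type": "object",
        "properties": {
            "ticket_id": {"type": "string", "description": "e.g. 'TICKET-123'"},
            "detail": {
                "type": "string",
                "enum": ["summary", "full"],   # enum narrows the model's choices
                "description": "Level of detail to return.",
            },
        },
        "required": ["ticket_id"],   # the model must always supply an ID
    },
}
```

The same schema body works for OpenAI's `tools` parameter with minor renaming (`input_schema` → `parameters`).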
ReAct
ReAct (Yao 2022) interleaves Reasoning traces with Actions: Thought → Action → Observation → Thought → .... The "thought" step is where the LLM explains why it is calling a tool; empirically this improves tool-choice quality but burns tokens. Claude and GPT-4 both do this natively; you don't need a special prompt.
Planning, reflection, multi-agent
Planner/executor
For tasks with >5 steps, a single-LLM ReAct loop often loses the plot. Separate the planner (expensive model, creates ordered todo) from executors (cheap models, do each task). Pattern from Plan-and-Execute (Wang 2023) and the Devin-style coding agents.
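A minimal planner/executor sketch, with both LLM calls stubbed out (swap in an expensive model for `plan` and a cheap one for `execute`):

```python
# Plan-and-execute sketch: the planner runs once up front and emits an
# ordered todo list; executors run each item with a shared scratchpad.
def plan(task):
    # Stub planner: a real one prompts a strong model for a numbered plan
    # and parses it into a list of steps.
    return [f"step {i} of {task}" for i in (1, 2, 3)]

def execute(step, scratchpad):
    # Stub executor: a real one gives a cheap model the step plus the
    # scratchpad of prior results (and its tools).
    return f"did {step}"

def plan_and_execute(task):
    scratchpad = []
    for step in plan(task):                    # ordered, planner-fixed steps
        scratchpad.append(execute(step, scratchpad))
    return scratchpad
```

The design choice vs. plain ReAct: the plan is fixed before execution starts, so a 10-step task can't drift off course mid-run (at the cost of worse handling of surprises; production variants re-plan on executor failure).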
Reflection / self-critique
After the agent produces output, a "critic" LLM call reviews it against criteria and proposes fixes. Reflexion (Shinn 2023) shows 20–30% accuracy improvement on coding and reasoning benchmarks at ~2× cost. Cheap version: ask the same model to "review your answer and fix mistakes".
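The cheap version can be sketched as a generate → critique → revise loop; `generate`, `critique`, and `revise` are stubs for LLM calls, and the OK/FIX reply protocol is an assumption, not a standard:

```python
# Self-critique loop sketch: one extra model pass (~2x cost) reviews the
# draft against criteria and triggers at most a bounded number of revisions.
def generate(task):
    return f"draft answer for {task}"          # stub for the main model call

def critique(answer, criteria):
    # Stub critic: a real one prompts "review this against <criteria>;
    # reply OK or a list of fixes".
    return "OK" if "revised" in answer else "FIX: be more specific"

def revise(answer, feedback):
    return f"revised {answer} ({feedback})"    # stub revision call

def reflect(task, criteria, max_rounds=2):
    answer = generate(task)
    for _ in range(max_rounds):                # bound the critique loop too
        feedback = critique(answer, criteria)
        if feedback == "OK":
            break
        answer = revise(answer, feedback)
    return answer
```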
Multi-agent
Patterns: Supervisor-worker (one planner dispatches to specialists), Debate (two agents argue, judge decides), Society-of-Mind. Interview warning: multi-agent is overhyped. Many production systems called "multi-agent" are just a supervisor + tools under the hood. Don't recommend multi-agent unless there are genuine specialization boundaries (e.g., a coder-reviewer-tester triad, each with distinct system prompts and tools).
```mermaid
flowchart TB
    U[User query] --> P["Planner LLM<br/>Claude-3-Opus"]
    P --> T1[Task 1]
    P --> T2[Task 2]
    P --> T3[Task 3]
    T1 --> E1["Executor<br/>Claude-3-Haiku + tools"]
    T2 --> E1
    T3 --> E1
    E1 --> M["Memory: scratchpad + vector"]
    E1 --> C[Critic LLM]
    C -->|revise| E1
    C -->|ok| A[Aggregator]
    A --> R[Final response]
```
Memory: short, long, and shared
Short-term (in-context)
The turn buffer. Letting it grow gets expensive fast: Claude-3-Opus at $15/M input tokens means a full 200k context costs ~$3 per request. Mitigation: compress by summarization after every N turns, keeping only the last K turns raw.
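The summarize-after-N / keep-last-K mitigation can be sketched as a buffer-compression step; `summarize` stands in for an LLM summarization call:

```python
# Context compression sketch: once the history exceeds N messages,
# collapse everything but the last K into a single summary message.
def summarize(messages):
    # Stub: a real implementation asks a cheap model to summarize.
    return {"role": "system", "content": f"[summary of {len(messages)} msgs]"}

def compress(history, every_n=10, keep_k=4):
    if len(history) < every_n:
        return history                       # still cheap; leave it raw
    head, tail = history[:-keep_k], history[-keep_k:]
    return [summarize(head)] + tail          # 1 summary + last-K raw turns
```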
Long-term (vector + structured)
- Episodic: past interactions indexed by embedding; retrieved by semantic query. Use vector DB (Pinecone, Milvus, pgvector).
- Semantic: distilled facts ("user prefers markdown"). Store as key-value with confidence scores.
- Procedural: learned tool sequences cached as skills.
Writing to long-term memory is the hard part. Agents typically write too much (every turn → noise) or too little (nothing persists). Solutions: reflection-gated writes (LLM decides "worth remembering?"), TTL on low-confidence memories, periodic consolidation.
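A sketch of a reflection-gated write with TTL on low-confidence memories; `judge` stands in for the LLM "worth remembering?" call, and the thresholds are illustrative:

```python
# Reflection-gated memory writes: an LLM judge decides whether a fact is
# worth persisting; low-confidence entries get a TTL so they expire
# instead of accumulating as noise.
import time

def judge(fact):
    # Stub for an LLM call returning (keep?, confidence). Here: only
    # preference-like facts are kept, as a toy heuristic.
    return ("prefers" in fact, 0.9 if "prefers" in fact else 0.3)

MEMORY = []

def maybe_remember(fact, ttl_low_conf=86_400):
    keep, conf = judge(fact)
    if not keep:
        return False                          # gate: most turns write nothing
    expires = None if conf >= 0.7 else time.time() + ttl_low_conf
    MEMORY.append({"fact": fact, "conf": conf, "expires": expires})
    return True
```

Periodic consolidation would then sweep expired entries and merge near-duplicates; that pass is omitted here.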
Shared memory
For multi-agent systems, a shared scratchpad (Google Doc-like) or event log lets agents coordinate. Durable store (Postgres row per thread) beats in-process maps for reliability across retries.
MCP, A2A, and interop
Model Context Protocol (MCP)
Anthropic open-sourced MCP in late 2024 as a standard for connecting LLMs to tools and data sources. Architecture: MCP servers (local or remote) expose tools, resources, and prompts; MCP clients (Claude Desktop, IDEs, agents) consume them over JSON-RPC stdio or HTTP+SSE. Wins: tool reuse across models, local filesystem access, structured resource URIs.
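For concreteness, a client-side server registration in the shape Claude Desktop's `claude_desktop_config.json` uses; the filesystem server shown is one of the reference MCP servers, but verify the package name and arguments against current MCP docs:

```json
{
  "mcpServers": {
    "filesystem": {
      "command": "npx",
      "args": ["-y", "@modelcontextprotocol/server-filesystem", "/path/to/allowed/dir"]
    }
  }
}
```

The client launches the server as a subprocess and speaks JSON-RPC to it over stdio; remote servers use the HTTP transport instead.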
Interview hook: MCP separates capability providers from model runners. A company wiring Claude to Jira, Slack, and S3 uses three MCP servers rather than three bespoke tool implementations inside the agent.
Anthropic-specific
MCP is an Anthropic invention. Claude Desktop and Claude Code both use MCP servers natively. Interview-level answer: MCP, combined with computer-use (screen + keyboard/mouse tools released 2024), lets Claude operate arbitrary GUI apps without per-app integrations. This is the core of Claude's "general agent" strategy.
A2A (Agent-to-Agent)
Gulli Ch.15 — standardizing agent-to-agent messaging so a planner can call a specialist across orgs. Less mature than MCP but promising for multi-company agent workflows.
OpenAI-specific
OpenAI's Assistants API provides a hosted agent runtime with threads, tools, and file search. Contrast with Anthropic's approach: OpenAI hides the loop inside their infra; Anthropic exposes tool-use blocks and expects you to run the loop. OpenAI's recent Responses API and Agents SDK narrow the gap.
Guardrails, eval, and interview checklist
Guardrails (Gulli Ch.18)
Two layers: input guardrails (reject prompts that violate policy, e.g., prompt injection from retrieved documents) and output guardrails (reject tool calls targeting dangerous actions, e.g., rm -rf /). Pattern: run a separate lightweight classifier (Llama-Guard, Anthropic's constitutional classifier) in parallel with the main agent. For high-stakes tools (DB writes, money transfer), require human-in-the-loop (Gulli Ch.13).
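A sketch of the output-guardrail layer: screen each proposed tool call before execution, hard-block dangerous patterns, and route high-stakes tools to a human. The deny patterns and tool names are illustrative, not a complete policy:

```python
# Output-guardrail gate: runs between "model proposed a tool call" and
# "server executes it". Returns allow / reject / needs_human.
import re

DENY_PATTERNS = [r"rm\s+-rf\s+/", r"DROP\s+TABLE"]   # toy denylist
HUMAN_APPROVAL = {"transfer_money", "db_write"}       # high-stakes tools

def gate(tool_name, tool_input):
    blob = str(tool_input)
    if any(re.search(p, blob, re.IGNORECASE) for p in DENY_PATTERNS):
        return "reject"            # hard block: dangerous action pattern
    if tool_name in HUMAN_APPROVAL:
        return "needs_human"       # human-in-the-loop for high stakes
    return "allow"
```

A production version would use a trained classifier (Llama-Guard or similar) rather than regexes, which are trivially bypassed; the control-flow shape stays the same.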
Evaluation (preview — see evaluation page)
Agent eval needs: (1) end-task success rate on a labeled suite; (2) trajectory analysis — did it take reasonable steps? (3) cost/latency; (4) failure mode taxonomy (wrong tool, loop, hallucinated citation).
Anti-patterns
- No max-iteration cap. Agents loop forever on ambiguous tasks. Cap to 20 steps.
- Unbounded tool output to context. A 2MB file read bloats context instantly. Truncate and summarize.
- Multi-agent by default. Start with one agent + tools; add roles only when specialization is real.
- No prompt-injection guard for RAG tools. A retrieved doc containing "ignore previous instructions" can hijack the agent.
- Stateless retries on tool failure. If Slack API returns 429, backoff and resume — don't replay the whole thread.
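A sketch of stateful retry for the last anti-pattern: back off exponentially and retry only the failed tool call, leaving the conversation thread intact. `RuntimeError` stands in for an HTTP 429:

```python
# Retry just the failed tool call with exponential backoff, instead of
# replaying the whole thread from scratch.
import time

def call_with_backoff(tool, args, retries=4, base=1.0):
    for attempt in range(retries):
        try:
            return tool(**args)               # retry only this call
        except RuntimeError:                  # stand-in for a 429 response
            if attempt == retries - 1:
                raise                         # out of retries: surface it
            time.sleep(base * 2 ** attempt)   # 1s, 2s, 4s, ...
```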
Whiteboard checklist: clarify autonomy vs workflow; pick loop shape (ReAct / plan-execute / multi-agent); define tools with JSON schema; short + long memory design; MCP for external capability surface; input/output guardrails; max-iteration cap; human-in-loop for high-stakes tools; eval harness with trajectory logging.