What counts as an agent
Gulli's working definition: an agent is an LLM that can decide (1) which action to take next and (2) when to stop. Add tools + memory and you have the minimum: a loop of observe → think → act. Everything else (ReAct, planners, multi-agent) is a pattern on top of that loop.
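The minimal loop can be sketched in a few lines. `call_llm` and `TOOLS` below are stand-ins for a real model API and tool registry, not any particular SDK:

```python
# Minimal observe → think → act loop (sketch). call_llm is a stub that
# returns either a tool call or a final answer; a real one hits a model API.
def call_llm(messages):
    # Stub: pretend the model calls one tool, then stops.
    if not any(m["role"] == "tool" for m in messages):
        return {"type": "tool_call", "name": "search", "input": {"q": "agents"}}
    return {"type": "final", "text": "done"}

TOOLS = {"search": lambda q: f"results for {q!r}"}  # illustrative registry

def run_agent(task, max_steps=20):
    messages = [{"role": "user", "content": task}]
    for _ in range(max_steps):                 # always cap iterations
        step = call_llm(messages)              # think: model picks next action
        if step["type"] == "final":            # model decided to stop
            return step["text"]
        result = TOOLS[step["name"]](**step["input"])          # act
        messages.append({"role": "tool", "content": result})   # observe
    return "max steps reached"
```

The `max_steps` cap is not optional decoration; it is the anti-pattern fix from the checklist at the end of these notes.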
Three axes define every agentic system:
- Determinism — pure prompt chain (deterministic) vs LLM-chosen next step (non-deterministic). Anthropic's "Building Effective Agents" blog post is explicit: prefer workflows (deterministic chains) unless you genuinely need autonomy, because agents cost 4–10× more tokens and are harder to debug.
- Single vs multi-agent — one LLM with tools vs a team of specialized agents (planner, coder, critic). Multi-agent adds coordination cost; use it when roles are cleanly separable.
- Memory depth — stateless → short-term (in-context) → long-term (vector DB / summary store).
Anchor numbers: a typical ReAct agent takes 4–12 tool calls per task; each call is 1 LLM round-trip (~2s) plus tool latency. Even a "fast" agent is 30–60s end-to-end. Budget accordingly.
Source cross-reference
Gulli's Agentic Design Patterns Ch.1–18 is the canonical taxonomy: prompt chaining, routing, parallelization, reflection, tool use, planning, multi-agent, memory, MCP, monitoring, exception handling, human-in-loop, RAG, A2A, resource optimization, reasoning, guardrails, evaluation. Memorize the chapter names.
Tool use and ReAct
Tool calling
Both Anthropic's tool_use blocks and OpenAI's tools parameter let the model emit a structured JSON call. The server executes the tool, appends the result as tool_result, and loops. The key technical details:
- JSON schema on tool inputs: the narrower the schema, the fewer invalid calls. Always include enum constraints and required fields.
- Tool descriptions matter more than names: the model chooses by description text. Write them like man pages, with examples.
- Parallel tool calls: both APIs support multiple tool calls per turn. This enables fan-out and is a large latency win (e.g., 5 web searches issued at once finish in roughly the time of the slowest one, vs. 5× that when run serially).
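A hedged example of what a narrow tool definition looks like in the Anthropic `input_schema` shape; the `get_ticket` tool and its fields are invented for illustration:

```python
# Illustrative tool definition: narrow schema, enum constraint, required
# field, and a man-page-style description with an example.
get_ticket = {
    "name": "get_ticket",
    "description": (
        "Fetch a single support ticket by ID. Use when the user refers to a "
        "specific ticket, e.g. 'what happened with TICKET-123?'. Returns "
        "subject, status, and last-update timestamp."
    ),
    "input_schema": {
        "type": "object",
        "properties": {
            "ticket_id": {"type": "string", "description": "e.g. 'TICKET-123'"},
            "detail": {
                "type": "string",
                "enum": ["summary", "full"],   # enum narrows the model's choices
                "description": "Level of detail to return.",
            },
        },
        "required": ["ticket_id"],   # the model must always supply an ID
    },
}
```

The same schema body works for OpenAI's `tools` parameter with minor renaming (`input_schema` → `parameters`).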
ReAct
ReAct (Yao 2022) interleaves Reasoning traces with Actions: Thought → Action → Observation → Thought → .... The "thought" step is where the LLM explains why it is calling a tool; empirically this improves tool-choice quality but burns tokens. Claude and GPT-4 both do this natively; you don't need a special prompt.
Planning, reflection, multi-agent
Planner/executor
For tasks with >5 steps, a single-LLM ReAct loop often loses the plot. Separate the planner (expensive model, creates ordered todo) from executors (cheap models, do each task). Pattern from Plan-and-Execute (Wang 2023) and the Devin-style coding agents.
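A minimal planner/executor sketch, with both LLM calls stubbed out (swap in an expensive model for `plan` and a cheap one for `execute`):

```python
# Plan-and-execute sketch: the planner runs once up front and emits an
# ordered todo list; executors run each item with a shared scratchpad.
def plan(task):
    # Stub planner: a real one prompts a strong model for a numbered plan
    # and parses it into a list of steps.
    return [f"step {i} of {task}" for i in (1, 2, 3)]

def execute(step, scratchpad):
    # Stub executor: a real one gives a cheap model the step plus the
    # scratchpad of prior results (and its tools).
    return f"did {step}"

def plan_and_execute(task):
    scratchpad = []
    for step in plan(task):                    # ordered, planner-fixed steps
        scratchpad.append(execute(step, scratchpad))
    return scratchpad
```

The design choice vs. plain ReAct: the plan is fixed before execution starts, so a 10-step task can't drift off course mid-run (at the cost of worse handling of surprises; production variants re-plan on executor failure).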
Reflection / self-critique
After the agent produces output, a "critic" LLM call reviews it against criteria and proposes fixes. Reflexion (Shinn 2023) shows 20–30% accuracy improvement on coding and reasoning benchmarks at ~2× cost. Cheap version: ask the same model to "review your answer and fix mistakes".
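The cheap version can be sketched as a generate → critique → revise loop; `generate`, `critique`, and `revise` are stubs for LLM calls, and the OK/FIX reply protocol is an assumption, not a standard:

```python
# Self-critique loop sketch: one extra model pass (~2x cost) reviews the
# draft against criteria and triggers at most a bounded number of revisions.
def generate(task):
    return f"draft answer for {task}"          # stub for the main model call

def critique(answer, criteria):
    # Stub critic: a real one prompts "review this against <criteria>;
    # reply OK or a list of fixes".
    return "OK" if "revised" in answer else "FIX: be more specific"

def revise(answer, feedback):
    return f"revised {answer} ({feedback})"    # stub revision call

def reflect(task, criteria, max_rounds=2):
    answer = generate(task)
    for _ in range(max_rounds):                # bound the critique loop too
        feedback = critique(answer, criteria)
        if feedback == "OK":
            break
        answer = revise(answer, feedback)
    return answer
```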
Multi-agent
Patterns: Supervisor-worker (one planner dispatches to specialists), Debate (two agents argue, judge decides), Society-of-Mind. Interview warning: multi-agent is overhyped. Many production systems called "multi-agent" are just a supervisor + tools under the hood. Don't recommend multi-agent unless there are genuine specialization boundaries (e.g., a coder-reviewer-tester triad, each with distinct system prompts and tools).
```mermaid
flowchart TB
    U[User query] --> P["Planner LLM<br/>Claude-3-Opus"]
    P --> T1[Task 1]
    P --> T2[Task 2]
    P --> T3[Task 3]
    T1 --> E1["Executor<br/>Claude-3-Haiku + tools"]
    T2 --> E1
    T3 --> E1
    E1 --> M["Memory: scratchpad + vector"]
    E1 --> C[Critic LLM]
    C -->|revise| E1
    C -->|ok| A[Aggregator]
    A --> R[Final response]
```
Memory: short, long, and shared
Short-term (in-context)
The turn buffer. Letting it grow gets expensive fast: Claude-3-Opus at $15/M input tokens means a full 200k context costs ~$3 per request. Mitigation: compress by summarization after every N turns, keeping only the last K turns raw.
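The summarize-after-N / keep-last-K mitigation can be sketched as a buffer-compression step; `summarize` stands in for an LLM summarization call:

```python
# Context compression sketch: once the history exceeds N messages,
# collapse everything but the last K into a single summary message.
def summarize(messages):
    # Stub: a real implementation asks a cheap model to summarize.
    return {"role": "system", "content": f"[summary of {len(messages)} msgs]"}

def compress(history, every_n=10, keep_k=4):
    if len(history) < every_n:
        return history                       # still cheap; leave it raw
    head, tail = history[:-keep_k], history[-keep_k:]
    return [summarize(head)] + tail          # 1 summary + last-K raw turns
```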
Long-term (vector + structured)
- Episodic: past interactions indexed by embedding; retrieved by semantic query. Use vector DB (Pinecone, Milvus, pgvector).
- Semantic: distilled facts ("user prefers markdown"). Store as key-value with confidence scores.
- Procedural: learned tool sequences cached as skills.
Writing to long-term memory is the hard part. Agents typically write too much (every turn → noise) or too little (nothing persists). Solutions: reflection-gated writes (LLM decides "worth remembering?"), TTL on low-confidence memories, periodic consolidation.
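A sketch of a reflection-gated write with TTL on low-confidence memories; `judge` stands in for the LLM "worth remembering?" call, and the thresholds are illustrative:

```python
# Reflection-gated memory writes: an LLM judge decides whether a fact is
# worth persisting; low-confidence entries get a TTL so they expire
# instead of accumulating as noise.
import time

def judge(fact):
    # Stub for an LLM call returning (keep?, confidence). Here: only
    # preference-like facts are kept, as a toy heuristic.
    return ("prefers" in fact, 0.9 if "prefers" in fact else 0.3)

MEMORY = []

def maybe_remember(fact, ttl_low_conf=86_400):
    keep, conf = judge(fact)
    if not keep:
        return False                          # gate: most turns write nothing
    expires = None if conf >= 0.7 else time.time() + ttl_low_conf
    MEMORY.append({"fact": fact, "conf": conf, "expires": expires})
    return True
```

Periodic consolidation would then sweep expired entries and merge near-duplicates; that pass is omitted here.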
Shared memory
For multi-agent systems, a shared scratchpad (Google Doc-like) or event log lets agents coordinate. Durable store (Postgres row per thread) beats in-process maps for reliability across retries.
MCP, A2A, and interop
Model Context Protocol (MCP)
Anthropic open-sourced MCP in late 2024 as a standard for connecting LLMs to tools and data sources. Architecture: MCP servers (local or remote) expose tools, resources, and prompts; MCP clients (Claude Desktop, IDEs, agents) consume them over JSON-RPC stdio or HTTP+SSE. Wins: tool reuse across models, local filesystem access, structured resource URIs.
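For concreteness, a client-side server registration in the shape Claude Desktop's `claude_desktop_config.json` uses; the filesystem server shown is one of the reference MCP servers, but verify the package name and arguments against current MCP docs:

```json
{
  "mcpServers": {
    "filesystem": {
      "command": "npx",
      "args": ["-y", "@modelcontextprotocol/server-filesystem", "/path/to/allowed/dir"]
    }
  }
}
```

The client launches the server as a subprocess and speaks JSON-RPC to it over stdio; remote servers use the HTTP transport instead.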
Interview hook: MCP separates capability providers from model runners. A company wiring Claude to Jira, Slack, and S3 uses three MCP servers rather than three bespoke tool implementations inside the agent.
Anthropic-specific
MCP is an Anthropic invention. Claude Desktop and Claude Code both use MCP servers natively. Interview-level answer: MCP, combined with computer-use (screen + keyboard/mouse tools released 2024), lets Claude operate arbitrary GUI apps without per-app integrations. This is the core of Claude's "general agent" strategy.
A2A (Agent-to-Agent)
Gulli Ch.15 — standardizing agent-to-agent messaging so a planner can call a specialist across orgs. Less mature than MCP but promising for multi-company agent workflows.
OpenAI-specific
OpenAI's Assistants API provides a hosted agent runtime with threads, tools, and file search. Contrast with Anthropic's approach: OpenAI hides the loop inside their infra; Anthropic exposes tool-use blocks and expects you to run the loop. OpenAI's recent Responses API and Agents SDK narrow the gap.
Guardrails, eval, and interview checklist
Guardrails (Gulli Ch.18)
Two layers: input guardrails (reject prompts that violate policy, e.g., prompt injection from retrieved documents) and output guardrails (reject tool calls targeting dangerous actions, e.g., rm -rf /). Pattern: run a separate lightweight classifier (Llama-Guard, Anthropic's constitutional classifier) in parallel with the main agent. For high-stakes tools (DB writes, money transfer), require human-in-the-loop (Gulli Ch.13).
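A sketch of the output-guardrail layer: screen each proposed tool call before execution, hard-block dangerous patterns, and route high-stakes tools to a human. The deny patterns and tool names are illustrative, not a complete policy:

```python
# Output-guardrail gate: runs between "model proposed a tool call" and
# "server executes it". Returns allow / reject / needs_human.
import re

DENY_PATTERNS = [r"rm\s+-rf\s+/", r"DROP\s+TABLE"]   # toy denylist
HUMAN_APPROVAL = {"transfer_money", "db_write"}       # high-stakes tools

def gate(tool_name, tool_input):
    blob = str(tool_input)
    if any(re.search(p, blob, re.IGNORECASE) for p in DENY_PATTERNS):
        return "reject"            # hard block: dangerous action pattern
    if tool_name in HUMAN_APPROVAL:
        return "needs_human"       # human-in-the-loop for high stakes
    return "allow"
```

A production version would use a trained classifier (Llama-Guard or similar) rather than regexes, which are trivially bypassed; the control-flow shape stays the same.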
Evaluation (preview — see evaluation page)
Agent eval needs: (1) end-task success rate on a labeled suite; (2) trajectory analysis — did it take reasonable steps? (3) cost/latency; (4) failure mode taxonomy (wrong tool, loop, hallucinated citation).
Anti-patterns
- No max-iteration cap. Agents loop forever on ambiguous tasks. Cap to 20 steps.
- Unbounded tool output to context. A 2MB file read bloats context instantly. Truncate and summarize.
- Multi-agent by default. Start with one agent + tools; add roles only when specialization is real.
- No prompt-injection guard for RAG tools. A retrieved doc containing "ignore previous instructions" can hijack the agent.
- Stateless retries on tool failure. If Slack API returns 429, backoff and resume — don't replay the whole thread.
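A sketch of stateful retry for the last anti-pattern: back off exponentially and retry only the failed tool call, leaving the conversation thread intact. `RuntimeError` stands in for an HTTP 429:

```python
# Retry just the failed tool call with exponential backoff, instead of
# replaying the whole thread from scratch.
import time

def call_with_backoff(tool, args, retries=4, base=1.0):
    for attempt in range(retries):
        try:
            return tool(**args)               # retry only this call
        except RuntimeError:                  # stand-in for a 429 response
            if attempt == retries - 1:
                raise                         # out of retries: surface it
            time.sleep(base * 2 ** attempt)   # 1s, 2s, 4s, ...
```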
Whiteboard checklist: clarify autonomy vs workflow; pick loop shape (ReAct / plan-execute / multi-agent); define tools with JSON schema; short + long memory design; MCP for external capability surface; input/output guardrails; max-iteration cap; human-in-loop for high-stakes tools; eval harness with trajectory logging.