Survey · 108-paper audit (2023–2026)

They Are Not Static.
A Survey of Dynamic Agent Skills

LLM agents externalize procedural knowledge into reusable skills. We argue that dynamic skill systems are best understood as lifecycle-managed, verified, evolving artifact stores — not just bigger libraries.

Yubo Li
Carnegie Mellon University
108 papers audited · 10 edit operators · 8 lifecycle stages · 7 empirical regularities

§Abstract

Large language model agents increasingly externalize procedural knowledge into reusable skills: invocable code, natural-language procedures, SKILL.md packages, graphs, or parametric adapters. This externalization turns adaptation into a new learning problem. The agent does not only update its prompt or weights; it updates a library of artifacts that changes what future policies can retrieve, compose, execute, and trust.

This survey studies the rapidly growing 2023–2026 literature on dynamic or self-evolving skill systems and argues that such systems are best understood as lifecycle-managed, verified, evolving artifact stores for LLM agents. We extend the options-based skill formalism to a seven-tuple $\langle C,\pi,T,R,\varphi,\nu,\prec \rangle$ that makes edits, admission verification, and provenance explicit, and we lift it to library-level dynamics $\mathcal{L}_t \to \mathcal{L}_{t+1}$ driven by a ten-operator algebra (Add, Refine, Merge, Split, Prune, Distill, Abstract, Compose, Rewrite, Rerank).

We organize 108 modern papers around a skill lifecycle: evidence acquisition, proposal, verification/admission, organization, retrieval/composition, maintenance/repair, distillation/portability, and governance. The most consistent evidence is that admission and repair matter more than raw skill count, verifier quality is often load-bearing in skill-aware RL, flat retrieval degrades in the moderate-library-size regime, and benchmarks still under-report library trajectories.

Contribution 1: Recasts dynamic skills as an editable learning-systems object via the 7-tuple.
Contribution 2: Eight-stage lifecycle reference architecture for dynamic stores.
Contribution 3: A ten-operator algebra with which we taxonomize the corpus.
Contribution 4: Mechanisms: edit · admission · storage · fast-slow consolidation.
Contribution 5: Seven empirical regularities, eight safety surfaces, eight open problems.

§From the SoK 4-tuple to a 7-tuple skill

The canonical options framework gives a 4-tuple. Dynamic libraries need three more components — an edit operator, a verification predicate, and a lineage relation — and a library-level transition rule.

Per-skill object · time-indexed
$$\mathcal{S}_t \;=\; \langle\, C_t,\; \pi_t,\; T_t,\; R_t,\; \boldsymbol{\varphi_t},\; \boldsymbol{\nu_t},\; \boldsymbol{\prec_t} \,\rangle$$
  • C   applicability — when this skill should fire
  • π   executable policy — code · prompt · adapter · lesson
  • T   termination — success / failure / budget
  • R   reusable interface — typed I/O, invocation point
  • φ   edit operator — candidate-generation rule for revisions
  • ν   verification predicate — admission gate
  • ≺   lineage — partial order for rollback & supersession
Library-level transition
$$\mathcal{L}_{t+1} \;=\; \mathrm{Apply}\!\bigl(\,\vec{u}_t(\tau_t,\, r_t),\; \mathcal{L}_t\,\bigr)$$
  • τ   trigger — task end · failure · period · user · RL step
  • r   learning signal — reward · execution · critique · judge · cross-user · teacher
  • 𝓛   library + metadata (call graphs, maturity, verifier cache)
  • 𝐮   vector of operator-instructions drawn from the algebra below
A static library is the special case $\vec{u}_t \equiv \varnothing$: edits are trivial, admission always passes, and there is no lineage. The dynamic case is what this survey audits.
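As a reading aid, the per-skill object and the library transition can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the survey's reference implementation: the field names, the `apply` function, the three operators it handles, and the `click_button` example are all hypothetical, and the lineage relation is reduced to a single `parent` pointer.

```python
from dataclasses import dataclass, replace
from typing import Callable, Optional


@dataclass(frozen=True)
class Skill:
    name: str
    applies: Callable[[str], bool]        # C: when the skill should fire
    policy: str                           # pi: body (code, prompt, lesson, ...)
    done: Callable[[str], bool]           # T: termination test on an outcome label
    interface: tuple                      # R: argument names stand in for typed I/O
    edit: Callable[["Skill"], "Skill"]    # phi: proposes a revised candidate
    verify: Callable[["Skill"], bool]     # nu: admission gate
    parent: Optional["Skill"] = None      # lineage: the version this one supersedes


def apply(instructions, library):
    """Library transition L_t -> L_{t+1}. Returns a new dict; the input is untouched."""
    nxt = dict(library)
    for op, payload in instructions:
        if op == "add":
            if payload.verify(payload):            # admit only verified skills
                nxt[payload.name] = payload
        elif op == "refine":
            old = nxt[payload]                     # payload is a skill name here
            cand = replace(old.edit(old), parent=old)
            if cand.verify(cand):                  # unverified edits are dropped
                nxt[cand.name] = cand
        elif op == "prune":
            nxt.pop(payload, None)
    return nxt


def add_retry(skill):
    # Hypothetical edit rule: append a retry step to the policy text.
    return replace(skill, policy=skill.policy + " ; retry once on failure")


click = Skill(
    name="click_button",
    applies=lambda task: "click" in task,
    policy="locate element ; click",
    done=lambda outcome: outcome in ("success", "failure"),
    interface=("selector",),
    edit=add_retry,
    verify=lambda s: len(s.policy) > 0,
)

lib1 = apply([("add", click)], {})
lib2 = apply([("refine", "click_button")], lib1)
static = apply([], lib2)   # empty instruction vector: the static special case
```

Keeping `parent` on the refined skill means a rollback is just a lookup of the superseded version, and an empty instruction list leaves the library unchanged.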

§The ten-operator library algebra

Almost no system supports all ten operators; the supported subset is one of the clearest taxonomic fingerprints we found.

  • A   Add: insert a new skill into the library. e.g. Voyager · SAGE
  • R   Refine: edit content without changing the interface. e.g. Memento · PSN
  • M   Merge: combine multiple skills into one. e.g. Trace2Skill · AutoRefine
  • S   Split: factor one skill into reusable components. e.g. SkillX
  • P   Prune: remove or quarantine; load-bearing at scale. e.g. Wild-Skills · ClawSafety
  • D   Distill: compress trajectories into a skill. e.g. CASCADE · MUSE
  • B   Abstract: lift a concrete procedure to a template. e.g. CUA-Skill · CoEvoSkills
  • C   Compose: chain skills into a verified composite. e.g. SkillCraft · SkillOrchestra
  • W   Rewrite: change body and possibly the interface. e.g. EvolveR
  • K   Rerank: change retrieval priors without changing content. e.g. SkillRouter
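One way to work with the "supported subset as fingerprint" idea is to encode each system's operators as a set of the one-letter codes (A, R, M, S, P, D, B, C, W, K) and compare sets. The sketch below is illustrative only: the helper names and the two example fingerprints are hypothetical and do not describe any surveyed system.

```python
from enum import Enum


class Op(Enum):
    ADD = "A"
    REFINE = "R"
    MERGE = "M"
    SPLIT = "S"
    PRUNE = "P"
    DISTILL = "D"
    ABSTRACT = "B"
    COMPOSE = "C"
    REWRITE = "W"
    RERANK = "K"


def fingerprint(codes):
    """Parse a string of operator letters into a set; unknown letters raise ValueError."""
    return frozenset(Op(c) for c in codes)


def missing(codes):
    """Operators a system does not support, in the algebra's canonical order."""
    have = fingerprint(codes)
    return "".join(op.value for op in Op if op not in have)


def jaccard(codes_a, codes_b):
    """Set overlap between two fingerprints, from 0.0 (disjoint) to 1.0 (identical)."""
    a, b = fingerprint(codes_a), fingerprint(codes_b)
    return len(a & b) / len(a | b)


# Hypothetical fingerprints, chosen only to exercise the helpers.
gaps = missing("ARP")
similarity = jaccard("ARP", "APK")
```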

§The eight-stage lifecycle

A dynamic skill system is a controlled state machine over an evolving artifact store. Stages are not a waterfall — systems loop between proposal and verification, retrieve before maintaining, and run distillation periodically.

[Figure: the skill lifecycle around a versioned Skill Store 𝓛 (flat · tree · DAG · graph · ontology), wrapped by governance · provenance · rollback]
  • Stage 1   Evidence: trajectories · failures · user edits · cross-user
  • Stage 2   Proposal: Add · Refine · Merge · Split · Distill · Abstract
  • Stage 3   Verify & Admit (gate): tests · judge · audit · grounded rollouts
  • Stage 4   Organize: topology & index
  • Stage 5   Retrieve · Compose: top-k · graph · rerank → Execute
  • Stage 6   Maintain · Repair: Prune · Merge · Rerank · periodic audit
  • Stage 7   Distill · Port (slow): → adapters · weights · cross-agent transfer
  • Edge labels: execution produces new evidence · admission gate · maintenance · slow loop · governance wrap

The most important boundary is the admission gate (Stage 3): dynamic systems can generate many plausible skills cheaply, but the library only improves when admission filters them with an appropriate verifier.
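A test-based admission gate can be sketched as follows. This is a minimal sketch assuming skills are plain Python callables and the verifier is a fixed set of input/expected-output cases; the `admit` function, its `threshold` parameter, and the `slug_*` candidates are hypothetical. Real systems in the corpus also use judges, audits, or grounded rollouts, which this example does not model.

```python
def admit(candidates, test_cases, threshold=1.0):
    """Split candidates into admitted and rejected by pass rate on the test cases.

    candidates: dict of name -> callable
    test_cases: list of (args_tuple, expected_output)
    Returns two dicts mapping name -> pass rate.
    """
    admitted, rejected = {}, {}
    for name, fn in candidates.items():
        passed = 0
        for args, expected in test_cases:
            try:
                if fn(*args) == expected:
                    passed += 1
            except Exception:
                pass  # a crashing candidate simply fails that case
        rate = passed / len(test_cases)
        (admitted if rate >= threshold else rejected)[name] = rate
    return admitted, rejected


# Three plausible-looking proposals for the same skill; only one is correct.
candidates = {
    "slug_v1": lambda s: s.lower().replace(" ", "-"),
    "slug_v2": lambda s: s.replace(" ", "-"),      # forgets to lowercase
    "slug_v3": lambda s: s.lower().split()[5],     # raises IndexError on short input
}
tests = [(("Hello World",), "hello-world"), (("a b",), "a-b")]

admitted, rejected = admit(candidates, tests)
```

All three candidates are cheap to generate, but only `slug_v1` reaches the library; the other two are kept out with their pass rates recorded for later repair.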

§Seven empirical regularities

Stated as regularities, not theorems: qualitative effects are consistent, but effect sizes and boundary conditions vary across methods, benchmarks, and artifact types.

R1 · strong evidence

Curated skills outperform unverified self-generated skills.

ASI: +23.5 pts on WebArena vs static. EvoSkill: +12.1 pts on SealQA via Pareto admission. CoEvoSkills: removing the surrogate verifier drops SkillsBench pass rate 71.1% → 41.1%.

R2 · strong evidence

Verification quality is often decisive in skill-aware RL.

CODE-SHARP: 24.3% → 41.0% under a learned gate. Agentic-Proposing: 68.7% → 82.3% problem validity from verifier ensemble + dynamic pruning.

R3 · moderate evidence

Flat retrieval often degrades around 64–128 skills.

Single-Agent-Skills: selection at 16–32 skills ≈ 96–98%, drops to 78% at 128, 64% at 256. SRA separates retrieval, loading, and utility over a 26K-skill corpus.
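To make the contrast with flat retrieval concrete, here is a toy comparison of flat top-1 selection against two-stage (category, then skill) routing. It is a sketch under strong assumptions: scoring is plain token overlap rather than an embedding model, and the skill names, descriptions, and category labels are invented for illustration. It shows the mechanism only and says nothing about the effect sizes reported above.

```python
def overlap(query, text):
    """Number of distinct lowercase tokens shared by query and text."""
    return len(set(query.lower().split()) & set(text.lower().split()))


def flat_retrieve(query, skills):
    """Top-1 over a flat dict of name -> description."""
    return max(skills, key=lambda name: overlap(query, skills[name]))


def tree_retrieve(query, tree):
    """Pick the best-matching category first, then the best skill inside it."""
    category = max(tree, key=lambda label: overlap(query, label))
    return flat_retrieve(query, tree[category])


# Hypothetical two-level store: category label -> {skill name -> description}.
tree = {
    "git version control": {
        "git_commit": "commit staged changes",
        "git_rebase": "rebase branch onto main",
    },
    "web browser forms": {
        "fill_form": "fill and submit a form",
        "submit_changes": "commit changes in the web form",
    },
}
flat = {name: desc for group in tree.values() for name, desc in group.items()}

query = "commit changes in git"
flat_choice = flat_retrieve(query, flat)   # the distractor shares more surface tokens
tree_choice = tree_retrieve(query, tree)   # category routing removes the distractor
```

In the flat pool the web-form skill wins on surface overlap; routing through the category label first narrows the candidates so the intended skill is selected.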

R4 · moderate evidence

Larger relative gains for weaker backbones.

SkillWeaver, MetaClaw, EvoSkill, and Agentic-Proposing all report that skills close more of the capability gap on weaker base models. This is a deployment-facing pattern, not a controlled cross-paper law.

R5 · strong evidence

Focused libraries often beat comprehensive ones.

SWE-Skills-Bench finds an average gain of only +1.2% across 49 public SWE skills. SkillX, Wild-Skills, CASCADE, SkillMOO: distractor load is the failure mechanism.

R6 · strong evidence

Maintenance becomes load-bearing at moderate-to-large library sizes.

ContractSkill and SkillDroid show the value of local repair and recompilation. AutoRefine without periodic prune+merge: pass rate drops 35.6% → 31.1%, the repository grows 4.5×, and utilization falls 0.71 → 0.08.
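A periodic maintenance pass can be sketched as a utilization check followed by a prune. The definition of utilization used here (share of library skills invoked at least once in a call log) is an assumption made for illustration, since the page does not define the metric; the function names, `min_calls` parameter, and example library are likewise hypothetical.

```python
from collections import Counter


def utilization(library, call_log):
    """Share of library skills invoked at least once in the log (assumed definition)."""
    if not library:
        return 0.0
    used = set(call_log) & set(library)
    return len(used) / len(library)


def prune(library, call_log, min_calls=1):
    """Split the library into kept skills and a quarantine of rarely used ones."""
    counts = Counter(call_log)   # missing names count as zero
    kept = {n: body for n, body in library.items() if counts[n] >= min_calls}
    quarantined = {n: body for n, body in library.items() if counts[n] < min_calls}
    return kept, quarantined


# Hypothetical library that has accumulated near-duplicates and stale entries.
library = {
    "login": "open page ; enter credentials",
    "search": "type query ; press enter",
    "login_v2": "open page ; enter credentials ; wait",
    "export_csv": "open menu ; choose export",
    "search_old": "click search box ; type query",
}
call_log = ["login", "search", "login", "search", "search"]

before = utilization(library, call_log)
kept, quarantined = prune(library, call_log)
after = utilization(kept, call_log)
```

Quarantining instead of deleting keeps the pruned skills recoverable, which matches the "remove or quarantine" reading of the Prune operator.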

R7 · moderate evidence

Write-time abstraction usually beats read-time only.

SimpleMem: removing write-time semantic compression drops LoCoMo F1 43.24 → 31.29. XSkill, CASCADE, Trace2Skill agree.

cross-cutting observation

Admission, verifier quality under RL, maintenance, and write-time abstraction are all forms of write-time discipline. Retrieval scaling and focused-library effects both argue that smaller can be better, through different mechanisms — retrieval resolution and distractor load. Lifecycle benchmarks warn that skill usage is not skill utility.

§Surveyed papers

May 17 2026 cutoff

The master coding table from §6 of the paper, reproduced here as a filterable index of representative dynamic-skill systems. Operators use the ten-letter algebra (A=Add, R=Refine, M=Merge, S=Split, P=Prune, D=Distill, B=Abstract, C=Compose, W=Rewrite, K=Rerank).

System · Cluster · Year · Operators · Headline · Link

Legend — Artifact: Code · NL · MD (SKILL.md) · LoRA · Mix. Clock: fast (in-context) · slow (parametric) · 2TS (two-timescale). Trigger: Task · Fail · Per (periodic) · User · RL. Storage: flat · tree · DAG · graph · subsp · ontol.

§Citation

If you found this survey useful for your research, please cite:

@article{li2026dynamicskills,
  title   = {They Are Not Static: A Survey of Dynamic Agent Skills},
  author  = {Li, Yubo},
  journal = {arXiv preprint},
  year    = {2026},
  url     = {https://github.com/yubol-bobo/Awesome-Dynamic-Agent-Skills}
}