LLM agents externalize procedural knowledge into reusable skills. We argue that dynamic skill systems are best understood as lifecycle-managed, verified, evolving artifact stores — not just bigger libraries.
Large language model agents increasingly externalize procedural knowledge into reusable skills: invocable code, natural-language procedures, SKILL.md packages, graphs, or parametric adapters. This externalization turns adaptation into a new learning problem. The agent does not only update its prompt or weights; it updates a library of artifacts that changes what future policies can retrieve, compose, execute, and trust.
This survey studies the rapidly growing 2023–2026 literature on dynamic or self-evolving skill systems and argues that such systems are best understood as lifecycle-managed, verified, evolving artifact stores for LLM agents. We extend the options-based skill formalism to a seven-tuple $\langle C,\pi,T,R,\varphi,\nu,\prec \rangle$ that makes edits, admission verification, and provenance explicit, and we lift it to library-level dynamics $\mathcal{L}_t \to \mathcal{L}_{t+1}$ driven by a ten-operator algebra (Add, Refine, Merge, Split, Prune, Distill, Abstract, Compose, Rewrite, Rerank).
We organize 108 modern papers around a skill lifecycle: evidence acquisition, proposal, verification/admission, organization, retrieval/composition, maintenance/repair, distillation/portability, and governance. The most consistent evidence is that admission and repair matter more than raw skill count, verifier quality is often load-bearing in skill-aware RL, flat retrieval degrades in the moderate-library-size regime, and benchmarks still under-report library trajectories.
The canonical options framework gives a 4-tuple. Dynamic libraries need three more components — an edit operator, a verification predicate, and a lineage relation — and a library-level transition rule.
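One way to picture the extended tuple and the library-level transition is the minimal Python sketch below. The source does not spell out what each symbol means, so the field names (`condition`, `termination`, `interface`) and the reading of $C$, $T$, and $R$ are assumptions made for illustration, as are the helper `bump_version` and the toy `login` skill. Only the roles of the three added components (edit operator, verification predicate, lineage relation) come from the text.

```python
from dataclasses import dataclass, replace
from typing import Callable, Dict, Tuple

State = Dict[str, object]  # hypothetical stand-in for an agent observation


@dataclass(frozen=True)
class Skill:
    skill_id: str
    condition: Callable[[State], bool]      # C (assumed: when the skill applies)
    policy: Callable[[State], str]          # pi (maps a state to an action)
    termination: Callable[[State], bool]    # T (assumed: when the skill stops)
    interface: str                          # R (assumed: the reusable signature)
    edit: Callable[["Skill"], "Skill"]      # phi (edit operator)
    verify: Callable[["Skill"], bool]       # nu (verification predicate)
    parents: Tuple[str, ...] = ()           # lineage relation, stored as parent ids


Library = Dict[str, Skill]


def step(library: Library, skill_id: str) -> Library:
    """One transition L_t -> L_{t+1}: edit one skill, admit it only if it verifies."""
    old = library[skill_id]
    new = old.edit(old)
    if not new.verify(new):
        return library  # rejected edits leave the library unchanged
    updated = dict(library)
    del updated[skill_id]
    updated[new.skill_id] = new
    return updated


def bump_version(skill: Skill) -> Skill:
    # Hypothetical edit: new id, same interface, lineage points at the old version.
    return replace(skill, skill_id=skill.skill_id + "+", parents=(skill.skill_id,))


login_v1 = Skill(
    skill_id="login",
    condition=lambda s: "login_form" in s,
    policy=lambda s: "fill_credentials",
    termination=lambda s: bool(s.get("logged_in")),
    interface="login(user, password)",
    edit=bump_version,
    verify=lambda sk: sk.interface.startswith("login("),
)

library_t = {"login": login_v1}
library_t1 = step(library_t, "login")
```

Keeping `step` pure (it returns a new dictionary rather than mutating the old one) makes the library trajectory easy to log and compare across time steps.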
Almost no system supports all ten operators; the supported subset is one of the clearest taxonomic fingerprints we found.
Add (A): Insert a new skill into the library.
Refine (R): Edit content without changing the interface.
Merge (M): Combine multiple skills into one.
Split (S): Factor one skill into reusable components.
Prune (P): Remove or quarantine a skill; load-bearing at scale.
Distill (D): Compress trajectories into a skill.
Abstract (B): Lift a concrete procedure to a template.
Compose (C): Chain skills into a verified composite.
Rewrite (W): Change body and possibly the interface.
Rerank (K): Change retrieval priors without changing content.
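A few of these operators can be sketched as pure functions over a library. The representation below (a dictionary from skill name to an `Entry` holding a body string and a retrieval prior) is a simplifying assumption for illustration, not the data model of any surveyed system. Only Add, Merge, Prune, and Rerank are shown.

```python
from dataclasses import dataclass, replace
from typing import Dict, Sequence


@dataclass(frozen=True)
class Entry:
    body: str            # skill content (hypothetical: a plain string)
    prior: float = 1.0   # retrieval prior (hypothetical scale)


Library = Dict[str, Entry]


def add(lib: Library, name: str, body: str) -> Library:
    """Add: insert a new skill."""
    return {**lib, name: Entry(body)}


def merge(lib: Library, names: Sequence[str], new_name: str) -> Library:
    """Merge: combine several skills into one and drop the originals."""
    merged = Entry("\n".join(lib[n].body for n in names))
    rest = {k: v for k, v in lib.items() if k not in names}
    return {**rest, new_name: merged}


def prune(lib: Library, name: str) -> Library:
    """Prune: remove a skill from the library."""
    return {k: v for k, v in lib.items() if k != name}


def rerank(lib: Library, name: str, prior: float) -> Library:
    """Rerank: change the retrieval prior, leave the content untouched."""
    return {**lib, name: replace(lib[name], prior=prior)}


lib: Library = {}
lib = add(lib, "open_page", "goto(url)")
lib = add(lib, "click_login", "click('#login')")
lib = add(lib, "stale_skill", "noop()")
lib = merge(lib, ["open_page", "click_login"], "open_and_login")
lib = prune(lib, "stale_skill")
lib = rerank(lib, "open_and_login", 2.0)
```

Because each operator returns a new library, a system's supported subset is simply the set of functions it exposes.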
A dynamic skill system is a controlled state machine over an evolving artifact store. Stages are not a waterfall — systems loop between proposal and verification, retrieve before maintaining, and run distillation periodically.
The most important boundary is the admission gate (Stage 3): dynamic systems can generate many plausible skills cheaply, but the library only improves when admission filters them with an appropriate verifier.
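As a rough illustration of an admission gate, the sketch below scores each candidate skill against held-out cases and admits only those above a threshold. The verifier (a pass rate over toy input/output pairs), the `0.9` threshold, and the candidate names are all assumptions chosen for the example. The surveyed systems use a variety of verifiers.

```python
from typing import Callable, Dict, List, Tuple

Case = Tuple[int, int]  # (input, expected output); a toy stand-in for held-out tasks
SkillFn = Callable[[int], float]


def pass_rate(skill: SkillFn, cases: List[Case]) -> float:
    """Fraction of held-out cases the candidate gets right; crashes count as failures."""
    if not cases:
        raise ValueError("need at least one verification case")
    passed = 0
    for x, expected in cases:
        try:
            if skill(x) == expected:
                passed += 1
        except Exception:
            pass  # a crashing candidate simply fails this case
    return passed / len(cases)


def admit(candidates: Dict[str, SkillFn], cases: List[Case],
          threshold: float = 0.9) -> Dict[str, SkillFn]:
    """Keep only candidates whose verified pass rate meets the threshold."""
    return {name: fn for name, fn in candidates.items()
            if pass_rate(fn, cases) >= threshold}


candidates: Dict[str, SkillFn] = {
    "double_ok": lambda x: 2 * x,
    "double_buggy": lambda x: x + 2,     # plausible-looking but wrong
    "double_crash": lambda x: 2 * x / 0,  # always raises ZeroDivisionError
}
cases = [(0, 0), (1, 2), (2, 4), (3, 6)]
library = admit(candidates, cases)
```

All three candidates are cheap to generate, but only one survives the gate, so the library grows by one verified skill rather than three plausible ones.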
Stated as regularities, not theorems: qualitative effects are consistent, but effect sizes and boundary conditions vary across methods, benchmarks, and artifact types.
ASI: +23.5 pts on WebArena vs static. EvoSkill: +12.1 pts on SealQA via Pareto admission. CoEvoSkills: removing the surrogate verifier drops SkillsBench pass rate 71.1% → 41.1%.
CODE-SHARP: 24.3% → 41.0% under a learned gate. Agentic-Proposing: 68.7% → 82.3% problem validity from verifier ensemble + dynamic pruning.
Single-Agent-Skills: selection at 16–32 skills ≈ 96–98%, drops to 78% at 128, 64% at 256. SRA separates retrieval, loading, and utility over a 26K-skill corpus.
SkillWeaver, MetaClaw, EvoSkill, and Agentic-Proposing all report that skills close more of the capability gap on weaker base models. This is a deployment-facing pattern, not a controlled cross-paper law.
SWE-Skills-Bench finds average gain only +1.2% across 49 public SWE skills. SkillX, Wild-Skills, CASCADE, SkillMOO: distractor load is the failure mechanism.
ContractSkill and SkillDroid show the value of local repair and recompilation. Without periodic prune+merge, AutoRefine's pass rate falls from 35.6% to 31.1%, the repository grows 4.5×, and utilization drops from 0.71 to 0.08.
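To make the maintenance idea concrete, the sketch below computes a utilization figure from an invocation log and prunes skills that were never used. The source does not define utilization, so it is assumed here to be the fraction of stored skills invoked at least once in a window. The `min_uses` cutoff and the toy skill names are likewise hypothetical.

```python
from collections import Counter
from typing import List


def utilization(skills: List[str], invocations: List[str]) -> float:
    """Assumed definition: share of stored skills invoked at least once."""
    if not skills:
        return 0.0
    used = set(invocations) & set(skills)
    return len(used) / len(skills)


def prune_unused(skills: List[str], invocations: List[str], min_uses: int = 1) -> List[str]:
    """Keep skills invoked at least `min_uses` times in the logged window."""
    counts = Counter(invocations)
    return [name for name in skills if counts[name] >= min_uses]


skills = ["login", "search", "checkout", "legacy_export"]
log = ["login", "login", "checkout"]

before = utilization(skills, log)
kept = prune_unused(skills, log)
after = utilization(kept, log)
```

In practice a system would more likely quarantine than delete, so that a pruned skill can be restored if later tasks need it.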
SimpleMem: removing write-time semantic compression drops LoCoMo F1 43.24 → 31.29. XSkill, CASCADE, Trace2Skill agree.
Admission, verifier quality under RL, maintenance, and write-time abstraction are all forms of write-time discipline. Retrieval scaling and focused-library effects both argue that smaller can be better, through different mechanisms — retrieval resolution and distractor load. Lifecycle benchmarks warn that skill usage is not skill utility.
The master coding table from §6 of the paper, reproduced here as a filterable index of representative dynamic-skill systems. Operators use the ten-letter algebra (A=Add, R=Refine, M=Merge, S=Split, P=Prune, D=Distill, B=Abstract, C=Compose, W=Rewrite, K=Rerank).
| System | Cluster | Year | Artifact | Clock | Trigger | Operators | Storage | Headline | Link |
|---|---|---|---|---|---|---|---|---|---|
Legend — Artifact: Code · NL · MD (SKILL.md) · LoRA · Mix. Clock: fast (in-context) · slow (parametric) · 2TS (two-timescale). Trigger: Task · Fail · Per (periodic) · User · RL. Storage: flat · tree · DAG · graph · subsp · ontol.
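The operator codes can be expanded programmatically when filtering the table. The sketch below assumes the Operators column holds a string of single letters such as `"ARP"`. The exact cell format is not shown here, so treat that as an assumption, along with the helper names `decode_operators` and `supports`.

```python
from typing import Iterable, List

# Letter codes as given in the table description.
OPERATORS = {
    "A": "Add", "R": "Refine", "M": "Merge", "S": "Split", "P": "Prune",
    "D": "Distill", "B": "Abstract", "C": "Compose", "W": "Rewrite", "K": "Rerank",
}


def decode_operators(code: str) -> List[str]:
    """Expand a letter string (assumed format, e.g. 'ARP') into operator names."""
    unknown = [ch for ch in code if ch not in OPERATORS]
    if unknown:
        raise ValueError(f"unknown operator codes: {unknown}")
    return [OPERATORS[ch] for ch in code]


def supports(code: str, required: Iterable[str]) -> bool:
    """True if a system's operator string covers every required letter."""
    return set(required) <= set(code)


names = decode_operators("ARP")
has_maintenance = supports("ARP", "PM")  # needs both Prune and Merge
```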
If you found this survey useful for your research, please cite:
@article{li2026dynamicskills,
title = {They Are Not Static: A Survey of Dynamic Agent Skills},
author = {Li, Yubo},
journal = {arXiv preprint},
year = {2026},
url = {https://github.com/yubol-bobo/Awesome-Dynamic-Agent-Skills}
}