Dynamic Agent Skills: A Lifecycle Survey and Taxonomy of Evolving Skill Libraries

§Abstract

Large language model agents increasingly externalize procedural knowledge into reusable skills: invocable code, natural-language procedures, SKILL.md packages, graphs, or parametric adapters. This externalization turns adaptation into a new learning problem. The agent does not only update its prompt or weights; it updates a library of artifacts that changes what future policies can retrieve, compose, execute, and trust.

This survey studies the rapidly growing 2023–2026 literature on dynamic or self-evolving skill systems and argues that such systems are best understood as lifecycle-managed, verified, evolving artifact stores for LLM agents. We extend the options-based skill formalism to a seven-tuple $\langle C,\pi,T,R,\varphi,\nu,\prec \rangle$ that makes edits, admission verification, and provenance explicit, and we lift it to library-level dynamics $\mathcal{L}_t \to \mathcal{L}_{t+1}$ driven by a ten-operator algebra (Add, Refine, Merge, Split, Prune, Distill, Abstract, Compose, Rewrite, Rerank).

We first disambiguate the term with a six-sense taxonomy of what current papers call a "skill". We then organize a 124-paper 2023–2026 audit set around a skill lifecycle: evidence acquisition, proposal, verification/admission, organization, retrieval/composition, maintenance/repair, distillation/portability, and governance. The most consistent evidence is that admission and repair matter more than raw skill count, verifier quality is often load-bearing in skill-aware RL, flat retrieval degrades in the moderate-library-size regime, and benchmarks still under-report library trajectories.

Contribution 1

A six-sense taxonomy that disambiguates what current papers call a "skill".

Contribution 2

A recasting of skills as editable artifacts, via a skill-record schema and a library-transition notation.

Contribution 3

An eight-stage lifecycle reference architecture for dynamic stores.

Contribution 4

A ten-operator vocabulary organizing mechanisms and system families.

Contribution 5

Lifecycle-aware evaluation: seven regularities, eight safety surfaces, eight open problems.

§From the SoK 4-tuple to a 7-tuple skill

The canonical options framework gives a 4-tuple. Dynamic libraries need three more components — an edit operator, a verification predicate, and a lineage relation — and a library-level transition rule.

Per-skill object · time-indexed

$$\mathcal{S}_t \;=\; \langle\, C_t,\; \pi_t,\; T_t,\; R_t,\; \boldsymbol{\varphi_t},\; \boldsymbol{\nu_t},\; \boldsymbol{\prec_t} \,\rangle$$

C applicability — when this skill should fire
π executable policy — code · prompt · adapter · lesson
T termination — success / failure / budget
R reusable interface — typed I/O, invocation point
φ edit operator — candidate-generation rule for revisions
ν verification predicate — admission gate
≺ lineage — partial order for rollback & supersession
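As a rough illustration, the seven components can be written as one record type. This is a minimal sketch, not the survey's schema: the field names, the `State` alias, the `revise` helper, and the `open_door` example are all hypothetical, and lineage is reduced to a tuple of `(name, version)` parent pointers.

```python
from dataclasses import dataclass, replace
from typing import Callable

State = dict  # hypothetical stand-in for an agent or environment state


@dataclass(frozen=True)
class Skill:
    """One time-indexed skill record <C, pi, T, R, phi, nu, prec>."""
    name: str
    version: int
    applicable: Callable[[State], bool]                 # C: when the skill should fire
    policy: str                                         # pi: code, prompt, adapter path, or lesson
    terminated: Callable[[State], bool]                 # T: success / failure / budget check
    interface: dict[str, str]                           # R: typed I/O, argument name -> type name
    propose_edits: Callable[["Skill"], list["Skill"]]   # phi: candidate revisions
    verify: Callable[["Skill"], bool]                   # nu: admission gate
    parents: tuple[tuple[str, int], ...] = ()           # prec: lineage as (name, version) pointers


def revise(skill: Skill, new_policy: str) -> Skill:
    """Return a successor with a new body, the same interface, and lineage recorded."""
    return replace(skill, policy=new_policy, version=skill.version + 1,
                   parents=((skill.name, skill.version),))


# Hypothetical example skill; the policy string is illustrative pseudo-code.
base = Skill(
    name="open_door",
    version=1,
    applicable=lambda s: s.get("door") == "closed",
    policy="walk_to(door); use(handle)",
    terminated=lambda s: s.get("door") == "open",
    interface={"door_id": "str"},
    propose_edits=lambda sk: [revise(sk, sk.policy + "; retry_once()")],
    verify=lambda sk: "use(handle)" in sk.policy,
)
candidate = base.propose_edits(base)[0]
```

Because the record is immutable, every revision yields a new version whose `parents` field points at its predecessor, which is the information a rollback or supersession check would need.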

Library-level transition

$$\mathcal{L}_{t+1} \;=\; \mathrm{Apply}\!\bigl(\,\vec{u}_t(\tau_t,\, r_t),\; \mathcal{L}_t\,\bigr)$$

τ trigger — task end · failure · period · user · RL step
r learning signal — reward · execution · critique · judge · cross-user · teacher
𝓛 library + metadata (call graphs, maturity, verifier cache)
𝐮 vector of operator-instructions drawn from the algebra below

A static library is the special case $\vec{u}_t \equiv \varnothing$: edits are trivial, admission always passes, and there is no lineage. The dynamic case is what this survey audits.
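The transition rule can be sketched as a pure function from an instruction list and the current library to the next library. The sketch below is an assumption-laden toy: the library is a plain name-to-body mapping, only three of the ten operators are implemented, and `Instruction`, `propose`, and the skill names are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable

Library = dict  # hypothetical: skill name -> skill body


@dataclass(frozen=True)
class Instruction:
    op: str         # operator name: "add", "refine", or "prune" in this sketch
    name: str
    body: str = ""


def propose(trigger: str, signal: float) -> list:
    """Toy u_t(tau_t, r_t): map a trigger and a learning signal to instructions."""
    if trigger == "failure" and signal < 0.5:
        return [Instruction("prune", "flaky_login")]
    return []  # the static case: no edits


def apply(instructions: list, library: Library,
          verify: Callable[[str], bool]) -> Library:
    """Return L_{t+1}; the input library L_t is not mutated."""
    nxt = dict(library)
    for ins in instructions:
        if ins.op == "add" or (ins.op == "refine" and ins.name in nxt):
            if verify(ins.body):          # admission gate
                nxt[ins.name] = ins.body
        elif ins.op == "refine":
            continue                      # nothing to refine
        elif ins.op == "prune":
            nxt.pop(ins.name, None)
        else:
            raise ValueError(f"unsupported operator: {ins.op}")
    return nxt


def non_empty(body: str) -> bool:
    """Hypothetical verifier: admit any non-blank body."""
    return bool(body.strip())


lib_t = {"flaky_login": "click('#login')"}
lib_static = apply(propose("task_end", 1.0), lib_t, non_empty)
lib_next = apply(propose("failure", 0.1), lib_t, non_empty)
```

With an empty instruction list the function returns an equal copy of the library, which corresponds to the static special case above.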

§The ten-operator library algebra

Almost no system supports all ten operators; the supported subset is one of the clearest taxonomic fingerprints we found.

Add

Insert a new skill into the library.

e.g. Voyager · SAGE

Refine

Edit content without changing the interface.

e.g. Memento · PSN

Merge

Combine multiple skills into one.

e.g. Trace2Skill · AutoRefine

Split

Factor one skill into reusable components.

e.g. SkillX

Prune

Remove or quarantine — load-bearing at scale.

e.g. Wild-Skills · ClawSafety

Distill

Compress trajectories into a skill.

e.g. CASCADE · MUSE

Abstract

Lift a concrete procedure to a template.

e.g. CUA-Skill · CoEvoSkills

Compose

Chain skills into a verified composite.

e.g. SkillCraft · SkillOrchestra

Rewrite

Change body and possibly the interface.

e.g. EvolveR

Rerank

Change retrieval priors without changing content.

e.g. SkillRouter
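One way to make the "supported subset as fingerprint" idea concrete is to encode each operator by the one-letter code used in the survey's coding table and compare systems by set overlap. This is a sketch under assumptions: the `fingerprint` and `jaccard` helpers are hypothetical, and the two code strings in the example are made up rather than taken from any surveyed system.

```python
from enum import Enum


class Op(Enum):
    """The ten operators, keyed by their one-letter table codes."""
    ADD = "A"
    REFINE = "R"
    MERGE = "M"
    SPLIT = "S"
    PRUNE = "P"
    DISTILL = "D"
    ABSTRACT = "B"
    COMPOSE = "C"
    REWRITE = "W"
    RERANK = "K"


def fingerprint(codes: str) -> frozenset:
    """Parse a code string such as "ARP" into a set of operators.

    Raises ValueError on a letter that is not one of the ten codes.
    """
    return frozenset(Op(c) for c in codes.upper())


def jaccard(a: frozenset, b: frozenset) -> float:
    """Overlap between two fingerprints, 1.0 when identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


# Made-up fingerprints for two hypothetical systems.
system_x = fingerprint("ARP")
system_y = fingerprint("ARDC")
overlap = jaccard(system_x, system_y)  # shared {A, R} over union {A, R, P, D, C}
```

Set overlap is only one possible similarity measure; it treats all operators as equally informative, which the survey does not claim.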

§The eight-stage lifecycle

A dynamic skill system is a controlled state machine over an evolving artifact store. Stages are not a waterfall — systems loop between proposal and verification, retrieve before maintaining, and run distillation periodically.
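The non-waterfall structure can be sketched as an enumeration of the eight stages plus a small set of backward edges. The stage names follow the lifecycle listed in the abstract; the specific loop edges in `LOOPS` are illustrative assumptions, not transitions the survey prescribes.

```python
from enum import IntEnum


class Stage(IntEnum):
    EVIDENCE = 1        # evidence acquisition
    PROPOSAL = 2
    VERIFICATION = 3    # verification / admission
    ORGANIZATION = 4
    RETRIEVAL = 5       # retrieval / composition
    MAINTENANCE = 6     # maintenance / repair
    DISTILLATION = 7    # distillation / portability
    GOVERNANCE = 8


# Assumed backward edges, in addition to the default forward step.
LOOPS = {
    Stage.VERIFICATION: {Stage.PROPOSAL},      # rejected candidate goes back to proposal
    Stage.MAINTENANCE: {Stage.VERIFICATION},   # repaired skill is re-gated
    Stage.GOVERNANCE: {Stage.EVIDENCE},        # start the next cycle
}


def allowed(src: Stage, dst: Stage) -> bool:
    """True if dst is the next stage or a declared loop target of src."""
    return int(dst) == int(src) + 1 or dst in LOOPS.get(src, set())
```

A real system would likely attach guards to these edges (for example, a periodic timer on distillation); the table form simply makes the permitted loops explicit and checkable.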

The most important boundary is the admission gate (Stage 3): dynamic systems can generate many plausible skills cheaply, but the library only improves when admission filters them with an appropriate verifier.
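A minimal sketch of such a gate, assuming skills are plain Python callables and the verifier is a pass rate over held-out input/output cases. The names `pass_rate`, `admit`, `CASES`, and the three candidate skills are hypothetical; real verifiers in the surveyed systems range from execution checks to learned judges.

```python
from typing import Callable

Skill = Callable[[int], int]

# Held-out checks for a "double the input" skill: (input, expected output).
CASES = [(0, 0), (2, 4), (-3, -6)]


def pass_rate(skill: Skill, cases: list) -> float:
    """Fraction of cases the skill gets right; a crash counts as a failure."""
    passed = 0
    for x, expected in cases:
        try:
            if skill(x) == expected:
                passed += 1
        except Exception:
            pass
    return passed / len(cases)


def admit(candidates: dict, cases: list, threshold: float = 1.0) -> dict:
    """Keep only the candidates whose pass rate reaches the threshold."""
    return {name: skill for name, skill in candidates.items()
            if pass_rate(skill, cases) >= threshold}


candidates = {
    "double_mul": lambda x: 2 * x,    # correct
    "double_sq": lambda x: x * x,     # plausible: right for 0 and 2, wrong for -3
    "double_crash": lambda x: x // 0, # always raises ZeroDivisionError
}
admitted = admit(candidates, CASES)
```

The `double_sq` candidate illustrates the point of the paragraph above: it passes two of three checks and would look fine on a narrow test set, so the quality of the cases, not the number of candidates, decides what enters the library.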

§Seven empirical regularities

Stated as regularities, not theorems: qualitative effects are consistent, but effect sizes and boundary conditions vary across methods, benchmarks, and artifact types.

strong evidence

Curated skills outperform unverified self-generated skills.

ASI: +23.5 pts on WebArena vs static. EvoSkill: +12.1 pts on SealQA via Pareto admission. CoEvoSkills: removing the surrogate verifier drops SkillsBench pass rate 71.1% → 41.1%.

strong evidence

Verification quality is often decisive in skill-aware RL.

CODE-SHARP: 24.3% → 41.0% under a learned gate. Agentic-Proposing: 68.7% → 82.3% problem validity from verifier ensemble + dynamic pruning.

moderate evidence

Flat retrieval often degrades around 64–128 skills.

Single-Agent-Skills: selection accuracy is ≈ 96–98% at 16–32 skills, then drops to 78% at 128 and 64% at 256. SRA separates retrieval, loading, and utility over a 26K-skill corpus.

moderate evidence

Larger relative gains for weaker backbones.

SkillWeaver, MetaClaw, EvoSkill, and Agentic-Proposing all report that skills close more of the capability gap on weaker base models. This is a deployment-facing pattern, not a controlled cross-paper law.

strong evidence

Focused libraries often beat comprehensive ones.

SWE-Skills-Bench finds an average gain of only +1.2% across 49 public SWE skills. SkillX, Wild-Skills, CASCADE, and SkillMOO point to distractor load as the failure mechanism.

strong evidence

Maintenance becomes load-bearing at moderate-to-large library sizes.

ContractSkill and SkillDroid show the value of local repair and recompilation. AutoRefine without periodic prune+merge: pass rate falls 35.6% → 31.1%, the repository grows 4.5×, and utilization falls 0.71 → 0.08.
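To make the utilization figure easier to reason about, the sketch below computes one plausible version of it and applies a simple prune. The definition used here (the fraction of library skills invoked at least once in a window) is an assumption; the source does not state how AutoRefine measures utilization. The function names and the toy library are hypothetical.

```python
from collections import Counter


def utilization(library: set, invocations: list) -> float:
    """Fraction of library skills invoked at least once (assumed definition)."""
    if not library:
        return 0.0
    return len(set(invocations) & library) / len(library)


def prune_unused(library: set, invocations: list, min_calls: int = 1) -> set:
    """Return the skills invoked at least min_calls times in the window."""
    counts = Counter(invocations)
    return {skill for skill in library if counts[skill] >= min_calls}


# Toy window: ten stored skills, only two of them ever invoked.
library = {f"skill_{i}" for i in range(10)}
log = ["skill_0", "skill_3", "skill_0", "skill_0"]

before = utilization(library, log)
pruned = prune_unused(library, log)
after = utilization(pruned, log)
```

A count-based prune like this is deliberately crude; it would also discard rarely needed but correct skills, which is one reason the operator card describes Prune as "remove or quarantine" rather than delete.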

moderate evidence

Write-time abstraction usually beats read-time only.

SimpleMem: removing write-time semantic compression drops LoCoMo F1 43.24 → 31.29. XSkill, CASCADE, Trace2Skill agree.

cross-cutting observation

Admission, verifier quality under RL, maintenance, and write-time abstraction are all forms of write-time discipline. Retrieval scaling and focused-library effects both argue that smaller can be better, through different mechanisms — retrieval resolution and distractor load. Lifecycle benchmarks warn that skill usage is not skill utility.

§Surveyed papers

Cutoff: May 31, 2026.

The master coding table from §6 of the paper, reproduced here as an index of representative dynamic-skill systems. Operators are abbreviated with one letter each (A=Add, R=Refine, M=Merge, S=Split, P=Prune, D=Distill, B=Abstract, C=Compose, W=Rewrite, K=Rerank).

System	Cluster	Year	Artifact	Clock	Trigger	Operators	Storage	Headline	Link

Legend — Artifact: Code · NL · MD (SKILL.md) · LoRA · Mix. Clock: fast (in-context) · slow (parametric) · 2TS (two-timescale). Trigger: Task · Fail · Per (periodic) · User · RL. Storage: flat · tree · DAG · graph · subsp · ontol.
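For readers who want to load the table programmatically, a row can be validated against the legend with a small record type. This is a sketch under assumptions: it treats each legend-coded cell as holding a single value (the table itself may combine several), it omits the Headline and Link columns, and the example row is invented rather than taken from the table.

```python
from dataclasses import dataclass

# Allowed values, copied from the legend.
LEGEND = {
    "artifact": ("Code", "NL", "MD", "LoRA", "Mix"),
    "clock": ("fast", "slow", "2TS"),
    "trigger": ("Task", "Fail", "Per", "User", "RL"),
    "storage": ("flat", "tree", "DAG", "graph", "subsp", "ontol"),
}
OPERATOR_CODES = set("ARMSPDBCWK")


@dataclass(frozen=True)
class Row:
    system: str
    cluster: str
    year: int
    artifact: str
    clock: str
    trigger: str
    operators: str   # concatenated one-letter codes, e.g. "ARP"
    storage: str

    def __post_init__(self) -> None:
        # Reject any cell that is not in the legend.
        for field_name, allowed in LEGEND.items():
            value = getattr(self, field_name)
            if value not in allowed:
                raise ValueError(f"{field_name}={value!r} not in {allowed}")
        unknown = set(self.operators) - OPERATOR_CODES
        if unknown:
            raise ValueError(f"unknown operator codes: {sorted(unknown)}")


# Invented example row; "ExampleSys" is not a surveyed system.
row = Row(system="ExampleSys", cluster="example", year=2025, artifact="MD",
          clock="fast", trigger="Fail", operators="ARP", storage="tree")
```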

§Citation