The `description` field in your `SKILL.md` frontmatter is the primary mechanism agents use to decide whether to load a skill for a given task. An under-specified description means the skill won’t trigger when it should; an over-broad description means it triggers when it shouldn’t.
This guide covers how to systematically test and improve your skill’s description for triggering accuracy.
How skill triggering works
Agents use progressive disclosure to manage context. At startup, they load only the `name` and `description` of each available skill — just enough to decide when a skill might be relevant. When a user’s task matches a description, the agent reads the full SKILL.md into context and follows its instructions.
This means the description carries the entire burden of triggering. If the description doesn’t convey when the skill is useful, the agent won’t know to reach for it.
One important nuance: agents typically only consult skills for tasks that require knowledge or capabilities beyond what they can handle alone. A simple, one-step request like “read this PDF” may not trigger a PDF skill even if the description matches perfectly, because the agent can handle it with basic tools. Tasks that involve specialized knowledge — an unfamiliar API, a domain-specific workflow, or an uncommon format — are where a well-written description can make the difference.
Writing effective descriptions
Before testing, it helps to know what a good description looks like. A few principles:
- Use imperative phrasing. Frame the description as an instruction to the agent: “Use this skill when…” rather than “This skill does…” The agent is deciding whether to act, so tell it when to act.
- Focus on user intent, not implementation. Describe what the user is trying to achieve, not the skill’s internal mechanics. The agent matches against what the user asked for.
- Err on the side of being pushy. Explicitly list contexts where the skill applies, including cases where the user doesn’t name the domain directly: “even if they don’t explicitly mention ‘CSV’ or ‘analysis.’”
- Keep it concise. A few sentences to a short paragraph is usually right — long enough to cover the skill’s scope, short enough that it doesn’t bloat the agent’s context across many skills. The specification enforces a hard limit of 1024 characters.
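Taken together, these principles might yield frontmatter like the following — the skill name and wording are illustrative, not a prescribed format:

```yaml
---
name: csv-analysis
description: >-
  Use this skill when the user wants to analyze, summarize, or visualize
  tabular data files (CSV, TSV), even if they don't explicitly mention
  "CSV" or "analysis" — e.g. "my boss wants a chart from this data file."
  Do not use it for editing Excel formulas or for loading rows into a
  database.
---
```

Note how it leads with “Use this skill when,” covers indirect phrasings, and draws an explicit boundary against adjacent tasks.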
Designing trigger eval queries
To test triggering, you need a set of eval queries — realistic user prompts labeled with whether they should or shouldn’t trigger your skill — stored in a file your test script can read, such as `eval_queries.json`.
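One possible shape for such a file — the field names here are an assumption, not a required schema:

```json
[
  {"query": "analyze my sales CSV and make a chart", "should_trigger": true},
  {"query": "my boss wants a chart from this data file", "should_trigger": true},
  {"query": "I need to update the formulas in my Excel budget spreadsheet", "should_trigger": false},
  {"query": "can you write a python script that reads a csv and uploads each row to our postgres database", "should_trigger": false}
]
```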
Should-trigger queries
These test whether the description captures the skill’s scope. Vary them along several axes:
- Phrasing: some formal, some casual, some with typos or abbreviations.
- Explicitness: some name the skill’s domain directly (“analyze this CSV”), others describe the need without naming it (“my boss wants a chart from this data file”).
- Detail: mix terse prompts with context-heavy ones — a short “analyze my sales CSV and make a chart” alongside a longer message with file paths, column names, and backstory.
- Complexity: vary the number of steps and decision points. Include single-step tasks alongside multi-step workflows to test whether the agent can discern the skill is relevant when the task it addresses is buried in a larger chain.
Should-not-trigger queries
The most valuable negative test cases are near-misses — queries that share keywords or concepts with your skill but actually need something different. These test whether the description is precise, not just broad. For a CSV analysis skill, weak negative examples would be:
- "Write a fibonacci function" — obviously irrelevant, tests nothing.
- "What's the weather today?" — no keyword overlap, too easy.

Stronger near-miss examples:
- "I need to update the formulas in my Excel budget spreadsheet" — shares “spreadsheet” and “data” concepts, but needs Excel editing, not CSV analysis.
- "can you write a python script that reads a csv and uploads each row to our postgres database" — involves CSV, but the task is database ETL, not analysis.
Tips for realism
Real user prompts contain context that generic test queries lack. Include:
- File paths (`~/Downloads/report_final_v2.xlsx`)
- Personal context ("my manager asked me to...")
- Specific details (column names, company names, data values)
- Casual language, abbreviations, and occasional typos
Testing whether a description triggers
The basic approach: run each query through your agent with the skill installed and observe whether the agent invokes it. Make sure the skill is registered and discoverable by your agent — how this works varies by client (e.g., a skills directory, a configuration file, or a CLI flag). Most agent clients provide some form of observability — execution logs, tool call histories, or verbose output — that lets you see which skills were consulted during a run. Check your client’s documentation for details. The skill triggered if the agent loaded your skill’s `SKILL.md`; it didn’t trigger if the agent proceeded without consulting it.
A query “passes” if:
- `should_trigger` is `true` and the skill was invoked, or
- `should_trigger` is `false` and the skill was not invoked.
Running multiple times
Model behavior is nondeterministic — the same query might trigger the skill on one run but not the next. Run each query multiple times (3 is a reasonable starting point) and compute a trigger rate: the fraction of runs where the skill was invoked. A should-trigger query passes if its trigger rate is above a threshold (0.5 is a reasonable default). A should-not-trigger query passes if its trigger rate is below that threshold. With 20 queries at 3 runs each, that’s 60 invocations. You’ll want to script this. Here’s the general structure — replace the `claude` invocation and detection logic in `check_triggered` with whatever your agent client provides:
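A minimal sketch in Python. The `claude -p` invocation and the `"SKILL.md" in stdout` detection heuristic are placeholders — substitute whatever invocation and observability mechanism your client actually provides:

```python
import json
import subprocess

RUNS_PER_QUERY = 3   # rerun each query to average over nondeterminism
THRESHOLD = 0.5      # trigger-rate cutoff for pass/fail

def check_triggered(query: str) -> bool:
    """Run one query through the agent; report whether the skill was loaded.

    Both the CLI invocation and the detection heuristic below are
    hypothetical — replace them with what your agent client provides.
    """
    result = subprocess.run(
        ["claude", "-p", query, "--verbose"],
        capture_output=True, text=True,
    )
    return "SKILL.md" in result.stdout

def trigger_rate(query: str, runs: int = RUNS_PER_QUERY) -> float:
    """Fraction of runs in which the skill was invoked."""
    return sum(check_triggered(query) for _ in range(runs)) / runs

def query_passes(rate: float, should_trigger: bool) -> bool:
    """Should-trigger queries need a rate above the threshold; should-not-trigger, below."""
    return rate > THRESHOLD if should_trigger else rate <= THRESHOLD

def evaluate(queries: list[dict]) -> float:
    """Run every labeled query and return the overall pass rate."""
    passed = 0
    for q in queries:
        rate = trigger_rate(q["query"])
        ok = query_passes(rate, q["should_trigger"])
        passed += ok
        print(f"{'PASS' if ok else 'FAIL'}  rate={rate:.2f}  {q['query'][:60]}")
    return passed / len(queries)

# Usage: evaluate(json.load(open("eval_queries.json")))
```
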
Avoiding overfitting with train/validation splits
If you optimize the description against all your queries, you risk overfitting — crafting a description that works for these specific phrasings but fails on new ones. The solution is to split your query set:
- Train set (~60%): the queries you use to identify failures and guide improvements.
- Validation set (~40%): queries you set aside and only use to check whether improvements generalize.
Keep the two sets in separate files — `train_queries.json` and `validation_queries.json` — and run the script against each one separately.
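The split itself can be scripted in a few lines; this sketch assumes the labeled queries are loaded as a list of dicts:

```python
import random

def split_queries(queries: list, train_frac: float = 0.6, seed: int = 0):
    """Shuffle deterministically, then cut into train and validation sets."""
    shuffled = queries[:]                  # leave the caller's list intact
    random.Random(seed).shuffle(shuffled)  # seeded so the split is reproducible
    cut = round(len(shuffled) * train_frac)
    return shuffled[:cut], shuffled[cut:]

# e.g. with 20 queries: 12 train, 8 validation
```

Shuffling before cutting matters: if you wrote your should-trigger queries first and your should-not-trigger queries last, an unshuffled split would put all of one kind in each set.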
The optimization loop
1. Evaluate the current description on both train and validation sets. The train results guide your changes; the validation results tell you whether those changes are generalizing.
2. Identify failures in the train set: which should-trigger queries didn’t trigger? Which should-not-trigger queries did? Only use train set failures to guide your changes — whether you’re revising the description yourself or prompting an LLM, keep validation set results out of the process.
3. Revise the description. Focus on generalizing:
   - If should-trigger queries are failing, the description may be too narrow. Broaden the scope or add context about when the skill is useful.
   - If should-not-trigger queries are false-triggering, the description may be too broad. Add specificity about what the skill does not do, or clarify the boundary between this skill and adjacent capabilities.
   - Avoid adding specific keywords from failed queries — that’s overfitting. Instead, find the general category or concept those queries represent and address that.
   - If you’re stuck after several iterations, try a structurally different approach to the description rather than incremental tweaks. A different framing or sentence structure may break through where refinement can’t.
   - Check that the description stays under the 1024-character limit — descriptions tend to grow during optimization.
4. Repeat steps 1-3 until all train set queries pass or you stop seeing meaningful improvement.
5. Select the best iteration by its validation pass rate — the fraction of queries in the validation set that passed. Note that the best description may not be the last one you produced; an earlier iteration might have a higher validation pass rate than later ones that overfit to the train set.
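Selecting the best iteration amounts to an argmax over the iteration history; a sketch, assuming you log one record per iteration (the field names are hypothetical):

```python
def best_iteration(history: list[dict]) -> dict:
    """Pick the iteration that generalizes best, not the one written last.

    history: one dict per iteration, e.g.
      {"description": ..., "train_rate": ..., "val_rate": ...}
    """
    return max(history, key=lambda h: h["val_rate"])
```

For example, an iteration with a validation pass rate of 0.75 beats a later one at 0.62, even if the later one scored higher on the train set.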
Applying the result
Once you’ve selected the best description:
- Update the `description` field in your `SKILL.md` frontmatter.
- Verify the description is under the 1024-character limit.
- Verify the description triggers as expected. Try a few prompts manually as a quick sanity check. For a more rigorous test, write 5-10 fresh queries (a mix of should-trigger and should-not-trigger) and run them through the eval script — since these queries were never part of the optimization process, they give you an honest check on whether the description generalizes.