Designing test cases
A test case has three parts:

- Prompt: a realistic user message — the kind of thing someone would actually type.
- Expected output: a human-readable description of what success looks like.
- Input files (optional): files the skill needs to work with.
Define the test cases in evals/evals.json inside your skill directory.
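The exact schema of evals.json is up to you; a minimal shape might look like the following sketch (the field names and the csv-cleaning example are illustrative):

```json
[
  {
    "id": "clean-csv",
    "prompt": "hey can you clean up this csv? drop the rows where column B is empty",
    "expected_output": "A cleaned CSV with null-B rows removed, other rows and columns untouched",
    "input_files": ["data/input.csv"]
  }
]
```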
- Start with 2-3 test cases. Don’t over-invest before you’ve seen your first round of results. You can expand the set later.
- Vary the prompts. Use different phrasings, levels of detail, and formality. Some prompts should be casual (“hey can you clean up this csv”), others precise (“Parse the CSV at data/input.csv, drop rows where column B is null, and write the result to data/output.csv”).
- Cover edge cases. Include at least one prompt that tests a boundary condition — a malformed input, an unusual request, or a case where the skill’s instructions might be ambiguous.
- Use realistic context. Real users mention file paths, column names, and personal context. Prompts like “process this data” are too vague to test anything useful.
Running evals
The core pattern is to run each test case twice: once with the skill and once without it (or with a previous version). This gives you a baseline to compare against.

Workspace structure
Organize eval results in a workspace directory alongside your skill directory. Each pass through the full eval loop gets its own iteration-N/ directory. Within that, each test case gets an eval directory with with_skill/ and without_skill/ subdirectories:
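One possible layout, using the csv-analyzer example that appears later in this guide (the directory names are illustrative):

```
csv-analyzer-workspace/
├── skill-snapshot/            # copy of the skill before edits, for baselines
└── iteration-1/
    ├── clean-csv/
    │   ├── with_skill/
    │   │   ├── outputs/
    │   │   ├── grading.json
    │   │   └── timing.json
    │   └── without_skill/
    │       ├── outputs/
    │       ├── grading.json
    │       └── timing.json
    └── benchmark.json
```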
You author only one of these files yourself: evals/evals.json. The other JSON files (grading.json, timing.json, benchmark.json) are produced during the eval process — by the agent, by scripts, or by you.
Spawning runs
Each eval run should start with a clean context — no leftover state from previous runs or from the skill development process. This ensures the agent follows only what the SKILL.md tells it. In environments that support subagents (Claude Code, for example), this isolation comes naturally: each child task starts fresh. Without subagents, use a separate session for each run.
For each run, provide:
- The skill path (or no skill for the baseline)
- The test prompt
- Any input files
- The output directory
Save each run's outputs to the matching subdirectory: with_skill/outputs/ or without_skill/outputs/.
When improving an existing skill, use the previous version as your baseline. Snapshot it before editing (cp -r <skill-path> <workspace>/skill-snapshot/), point the baseline run at the snapshot, and save to old_skill/outputs/ instead of without_skill/.
Capturing timing data
Timing data lets you compare how much time and tokens the skill costs relative to the baseline — a skill that dramatically improves output quality but triples token usage is a different trade-off than one that’s both better and cheaper. When each run completes, record the token count and duration in timing.json.
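A minimal per-run record might look like this (field names and values illustrative):

```json
{
  "duration_seconds": 41,
  "input_tokens": 12400,
  "output_tokens": 3100
}
```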
Writing assertions
Assertions are verifiable statements about what the output should contain or achieve. Add them after you see your first round of outputs — you often don’t know what “good” looks like until the skill has run.

Good assertions:

- “The output file is valid JSON” — programmatically verifiable.
- “The bar chart has labeled axes” — specific and observable.
- “The report includes at least 3 recommendations” — countable.

Bad assertions:

- “The output is good” — too vague to grade.
- “The output uses exactly the phrase ‘Total Revenue: $X’” — too brittle; correct output with different wording would fail.
Record the assertions in each test case’s entry in evals/evals.json.
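For example, the clean-csv test case above might grow an assertions array like this (wording illustrative):

```json
{
  "id": "clean-csv",
  "prompt": "hey can you clean up this csv? drop the rows where column B is empty",
  "assertions": [
    "The output file is valid CSV with the same columns as the input",
    "Rows where column B is null are removed",
    "No other rows are dropped"
  ]
}
```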
Grading outputs
Grading means evaluating each assertion against the actual outputs and recording PASS or FAIL with specific evidence. The evidence should quote or reference the output, not just state an opinion. The simplest approach is to give the outputs and assertions to an LLM and ask it to evaluate each one. For assertions that can be checked by code (valid JSON, correct row count, file exists with expected dimensions), use a verification script — scripts are more reliable than LLM judgment for mechanical checks and reusable across iterations. Record the results in grading.json.
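A grading.json for one run might look like this sketch (the shape — one record per assertion, with a verdict and quoted evidence — is illustrative):

```json
[
  {
    "assertion": "The output file is valid CSV with the same columns as the input",
    "verdict": "PASS",
    "evidence": "outputs/output.csv parses cleanly; header row matches data/input.csv"
  },
  {
    "assertion": "Rows where column B is null are removed",
    "verdict": "FAIL",
    "evidence": "Row 17 of outputs/output.csv still has an empty value in column B"
  }
]
```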
Grading principles
- Require concrete evidence for a PASS. Don’t give the benefit of the doubt. If an assertion says “includes a summary” and the output has a section titled “Summary” with one vague sentence, that’s a FAIL — the label is there but the substance isn’t.
- Review the assertions themselves, not just the results. While grading, notice when assertions are too easy (always pass regardless of skill quality), too hard (always fail even when the output is good), or unverifiable (can’t be checked from the output alone). Fix these for the next iteration.
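Mechanical assertions like “the output file is valid JSON” are a good fit for a verification script. A minimal sketch of one such check (the function name and the PASS/FAIL evidence format are illustrative, not a required interface):

```python
import json
import sys
from pathlib import Path


def check_valid_json(path: str) -> tuple[bool, str]:
    """Mechanical check: does the output file exist and parse as JSON?

    Returns (passed, evidence) so the evidence string can go
    straight into grading.json.
    """
    p = Path(path)
    if not p.exists():
        return False, f"FAIL: {path} does not exist"
    try:
        json.loads(p.read_text())
    except json.JSONDecodeError as e:
        return False, f"FAIL: {path} is not valid JSON ({e})"
    return True, f"PASS: {path} parses as valid JSON"


if __name__ == "__main__" and len(sys.argv) > 1:
    ok, evidence = check_valid_json(sys.argv[1])
    print(evidence)
    sys.exit(0 if ok else 1)
```

Because the script returns its evidence as text, the same check can be rerun unchanged in every iteration.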
Aggregating results
Once every run in the iteration is graded, compute summary statistics per configuration and save them to benchmark.json alongside the eval directories (e.g., csv-analyzer-workspace/iteration-1/benchmark.json).
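The aggregate file might look like the following sketch (numbers and field names illustrative; per-metric stddev fields become meaningful once each eval runs multiple times):

```json
{
  "with_skill":    { "passed": 11, "failed": 1, "pass_rate": 0.92,
                     "mean_duration_seconds": 48, "mean_total_tokens": 21500 },
  "without_skill": { "passed": 5,  "failed": 7, "pass_rate": 0.42,
                     "mean_duration_seconds": 35, "mean_total_tokens": 9800 },
  "delta":         { "pass_rate": 0.50, "mean_duration_seconds": 13,
                     "mean_total_tokens": 11700 }
}
```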
The delta tells you what the skill costs (more time, more tokens) and what it buys (higher pass rate). A skill that adds 13 seconds but improves pass rate by 50 percentage points is probably worth it. A skill that doubles token usage for a 2-point improvement might not be.
Standard deviation (stddev) is only meaningful with multiple runs per eval. In early iterations with just 2-3 test cases and single runs, focus on the raw pass counts and the delta — the statistical measures become useful as you expand the test set and run each eval multiple times.

Analyzing patterns
Aggregate statistics can hide important patterns. After computing the benchmarks:

- Remove or replace assertions that always pass in both configurations. These don’t tell you anything useful — the model handles them fine without the skill. They inflate the with-skill pass rate without reflecting actual skill value.
- Investigate assertions that always fail in both configurations. Either the assertion is broken (asking for something the model can’t do), the test case is too hard, or the assertion is checking for the wrong thing. Fix these before the next iteration.
- Study assertions that pass with the skill but fail without. This is where the skill is clearly adding value. Understand why — which instructions or scripts made the difference?
- Tighten instructions when results are inconsistent across runs. If the same eval passes sometimes and fails others (reflected as high stddev in the benchmark), the eval may be flaky (sensitive to model randomness), or the skill’s instructions may be ambiguous enough that the model interprets them differently each time. Add examples or more specific guidance to reduce ambiguity.
- Check time and token outliers. If one eval takes 3x longer than the others, read its execution transcript (the full log of what the model did during the run) to find the bottleneck.
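The aggregation itself can be a short script. A sketch, assuming each eval directory holds a grading.json shaped as a list of records with a "verdict" field (that shape, and the function names, are illustrative):

```python
import json
from pathlib import Path


def summarize(iteration_dir: str, config: str) -> dict:
    """Aggregate PASS/FAIL counts for one configuration
    (e.g. "with_skill") across every eval in an iteration."""
    passed = failed = 0
    # Expects <iteration_dir>/<eval-name>/<config>/grading.json
    for grading in Path(iteration_dir).glob(f"*/{config}/grading.json"):
        for record in json.loads(grading.read_text()):
            if record["verdict"] == "PASS":
                passed += 1
            else:
                failed += 1
    total = passed + failed
    return {
        "passed": passed,
        "failed": failed,
        "pass_rate": round(passed / total, 2) if total else 0.0,
    }


def delta(with_skill: dict, without_skill: dict) -> dict:
    """The headline number: pass-rate change attributable to the skill."""
    return {
        "pass_rate": round(
            with_skill["pass_rate"] - without_skill["pass_rate"], 2
        )
    }
```

Running summarize once per configuration and diffing the results gives the benchmark.json contents directly.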
Reviewing results with a human
Assertion grading and pattern analysis catch a lot, but they only check what you thought to write assertions for. A human reviewer brings a fresh perspective — catching issues you didn’t anticipate, noticing when the output is technically correct but misses the point, or spotting problems that are hard to express as pass/fail checks. For each test case, review the actual outputs alongside the grades. Record specific feedback for each test case and save it in the workspace (e.g., as a feedback.json alongside the eval directories).
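A feedback.json can stay as simple as one free-text note per test case (the shape here is illustrative):

```json
[
  {
    "eval": "clean-csv",
    "feedback": "Output is technically correct, but the agent also reformatted every date column, which nobody asked for."
  }
]
```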
Iterating on the skill
After grading and reviewing, you have three sources of signal:

- Failed assertions point to specific gaps — a missing step, an unclear instruction, or a case the skill doesn’t handle.
- Human feedback points to broader quality issues — the approach was wrong, the output was poorly structured, or the skill produced a technically correct but unhelpful result.
- Execution transcripts reveal why things went wrong. If the agent ignored an instruction, the instruction may be ambiguous. If the agent spent time on unproductive steps, those instructions may need to be simplified or removed.
The most effective way to act on these signals is to give all three — along with the current SKILL.md — to an LLM and ask it to propose changes. The LLM can synthesize patterns across failed assertions, reviewer complaints, and transcript behavior that would be tedious to connect manually. When prompting the LLM, include these guidelines:
- Generalize from feedback. The skill will be used across many different prompts, not just the test cases. Fixes should address underlying issues broadly rather than adding narrow patches for specific examples.
- Keep the skill lean. Fewer, better instructions often outperform exhaustive rules. If transcripts show wasted work (unnecessary validation, unneeded intermediate outputs), remove those instructions. If pass rates plateau despite adding more rules, the skill may be over-constrained — try removing instructions and see if results hold or improve.
- Explain the why. Reasoning-based instructions (“Do X because Y tends to cause Z”) work better than rigid directives (“ALWAYS do X, NEVER do Y”). Models follow instructions more reliably when they understand the purpose.
- Bundle repeated work. If every test run independently wrote a similar helper script (a chart builder, a data parser), that’s a signal to bundle the script into the skill’s scripts/ directory. See Using scripts for how to do this.
The loop
- Give the eval signals and current SKILL.md to an LLM and ask it to propose improvements.
- Review and apply the changes.
- Rerun all test cases in a new iteration-<N+1>/ directory.
- Grade and aggregate the new results.
- Review with a human. Repeat.