Build an eval set from captures
Captures become an eval set when they're collected with intent: scoped to one call site, tagged with the dimensions you'll filter by, and sampled so the set is representative rather than just recent. This tutorial is mostly about doing those three things at request time — selection afterward is then trivial.
1 — Decide the slice before collecting
An eval set answers a question, for example: "does the candidate handle production ad-relevance traffic from paying tenants?" Each qualifier in that question (the ad-relevance call site, production, paying tenants) must be recorded on the capture, or you can't select on it later:
- The call site → the workload (x-understudy-workload).
- Production → either a dedicated key per environment, or an env tag.
- Paying tenants → your dimension, so a tag: {"tier":"paid"}.
defaultHeaders: {
  "x-understudy-project": "concierge",
  "x-understudy-workload": "ad-relevance",
  "x-understudy-tags": JSON.stringify({ env: "prod", tier: "paid" }),
}

2 — Sample for representativeness
Set the workload's capture sample rate so collection spans days, not hours — time-of-day and day-of-week mix matters more than raw volume. Sampling is deterministic per request id, so retried requests never double-enter the set. A few hundred to a few thousand examples is a useful first eval set.
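To see why sampling keyed on the request id keeps retries from double-entering the set, here is a minimal sketch of the general technique. It is not the service's actual implementation: the FNV-1a hash and the names `fnv1a` and `shouldCapture` are assumptions chosen for illustration. The id is hashed to a number in [0, 1) and compared with the sample rate, so the same id always gets the same decision.

```javascript
// Illustrative only: hash choice and function names are hypothetical.
// FNV-1a (32-bit) over the string's UTF-16 code units; dependency-free and
// stable across processes.
function fnv1a(str) {
  let hash = 0x811c9dc5; // FNV offset basis
  for (let i = 0; i < str.length; i++) {
    hash ^= str.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193); // multiply by the FNV prime, mod 2^32
  }
  return hash >>> 0; // reinterpret as an unsigned 32-bit integer
}

// Map the hash to [0, 1) and capture when it falls below the sample rate.
// The decision depends only on the id, so a retry with the same id is
// either captured both times or skipped both times.
function shouldCapture(requestId, sampleRate) {
  return fnv1a(requestId) / 0x100000000 < sampleRate;
}
```

A side effect of this scheme is that raising the rate only adds ids: anything captured at 0.1 is still captured at 0.5, so a set collected at a low rate stays a subset of one collected at a higher rate.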
3 — Review a handful by hand
Before treating the stream as a dataset, open ten captures in the dashboard and read them. You're checking that the workload scoping is honest (no stray call site leaking in), tags are present, and the bodies contain what you expect — ten minutes here saves a polluted dataset later.
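Reading captures by hand stays the primary check, but the tag and body checks can also be repeated mechanically once you have envelopes in hand. The sketch below is one way to do that, under stated assumptions: `auditCaptures` and its report shape are hypothetical names, and it relies only on the envelope fields `tags`, `customer_request_body`, and `response_body`, treating both bodies as JSON strings.

```javascript
// Hypothetical helper: counts captures that would pollute the dataset.
// `captures` is an array of capture envelopes; `requiredTags` lists the tag
// names you decided on before collecting (for example ["env", "tier"]).
function auditCaptures(captures, requiredTags) {
  const report = { total: captures.length, missingTags: 0, unparsableBodies: 0 };
  for (const c of captures) {
    const tags = c.tags ?? {};
    // A capture missing any required tag cannot be selected on later.
    if (requiredTags.some(name => tags[name] === undefined)) {
      report.missingTags++;
    }
    // Assumes both bodies are JSON strings; anything else counts as unparsable.
    try {
      JSON.parse(c.customer_request_body);
      JSON.parse(c.response_body);
    } catch {
      report.unparsableBodies++;
    }
  }
  return report;
}
```

A non-zero `missingTags` count usually means some call path is not sending the tag header, which is cheaper to fix before days of sampling than after.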
4 — Export and select
Each capture is one JSON envelope with the raw customer_request_body, response_body, and your tags (full shape in Capture). During the preview, bulk export is operator-assisted — ask and you receive presigned URLs for your date range. Selection is then a filter over envelope fields:
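const evalSet = captures
  // keep only the slice the eval question asks about
  .filter(c => c.tags?.env === "prod" && c.tags?.tier === "paid")
  // drop failed calls so every reference is a real incumbent answer
  .filter(c => c.status_code === 200)
  .map(c => ({
    input: JSON.parse(c.customer_request_body),
    reference: JSON.parse(c.response_body), // incumbent output as reference
  }));

5 — Use it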
The set now serves both directions of the replacement loop: score a candidate against the incumbent's recorded outputs before routing any traffic, and re-score on each ratchet step of the replacement tutorial. The open-source agent tools run exactly this workflow — capture, eval, optimize — locally, with a coding agent driving.
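As a rough illustration of scoring a candidate against the incumbent's recorded outputs, the sketch below computes a simple agreement rate. It is an assumption-laden example, not the agent tools' scoring method: `agreementRate` is a hypothetical name, it expects you to have already run the candidate on each input and paired its output with the reference, and the default comparison (equality of the JSON serialization, which is sensitive to key order) is only a placeholder for a task-specific matcher.

```javascript
// Hypothetical scorer: `pairs` is an array of { reference, candidate } objects,
// where `reference` is the incumbent's recorded output and `candidate` is the
// output your candidate produced for the same input.
function agreementRate(pairs, isMatch = defaultMatch) {
  let matched = 0;
  for (const { reference, candidate } of pairs) {
    if (isMatch(reference, candidate)) matched++;
  }
  const total = pairs.length;
  return { total, matched, rate: total === 0 ? 0 : matched / total };
}

// Placeholder matcher: exact equality of the JSON serialization.
// Replace it with a comparison that fits the workload, such as matching
// a single label field or a tolerance on a numeric score.
function defaultMatch(reference, candidate) {
  return JSON.stringify(reference) === JSON.stringify(candidate);
}
```

Recomputing the same number on each ratchet step, with the eval set held fixed, gives a comparable baseline from one step to the next.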