Build an eval set from captures

Captures become an eval set when they're collected with intent: scoped to one call site, tagged with the dimensions you'll filter by, and sampled so the set is representative rather than just recent. This tutorial is mostly about doing those three things at request time — selection afterward is then trivial.

1 — Decide the slice before collecting

An eval set answers a question, for example: "does the candidate handle production ad-relevance traffic from paying tenants?" Each qualifier in that question (the call site, production, paying tenants) must be recorded on the capture, or you can't select on it later:

The call site → the workload (x-understudy-workload).
Production → either a dedicated key per environment, or an env tag.
Paying tenants → a dimension you define yourself, so a tag: {"tier":"paid"}.

collection setup

defaultHeaders: {
  "x-understudy-project": "concierge",
  "x-understudy-workload": "ad-relevance",
  "x-understudy-tags": JSON.stringify({ env: "prod", tier: "paid" }),
}
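If several call sites send these headers, a small helper can keep the tag set complete everywhere. The following is a minimal sketch, not part of any SDK: the function name `understudyHeaders` and the `REQUIRED_TAGS` list are hypothetical, and only the three header names are taken from the setup above.

```typescript
// Hypothetical helper: the function name and the required-tag list are illustrative;
// only the three header names come from the collection setup.
const REQUIRED_TAGS = ["env", "tier"];

function understudyHeaders(
  project: string,
  workload: string,
  tags: Record<string, string>,
): Record<string, string> {
  // Fail loudly at startup rather than discover untagged captures at selection time.
  const missing = REQUIRED_TAGS.filter(key => !(key in tags));
  if (missing.length > 0) {
    throw new Error(`missing capture tags: ${missing.join(", ")}`);
  }
  return {
    "x-understudy-project": project,
    "x-understudy-workload": workload,
    "x-understudy-tags": JSON.stringify(tags),
  };
}

const defaultHeaders = understudyHeaders("concierge", "ad-relevance", {
  env: "prod",
  tier: "paid",
});
```

Throwing on a missing tag is a design choice: a capture without `tier` cannot be recovered into the "paying tenants" slice later, so it is cheaper to reject the configuration up front.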

2 — Sample for representativeness

Set the workload's capture sample rate so collection spans days, not hours — time-of-day and day-of-week mix matters more than raw volume. Sampling is deterministic per request id, so retried requests never double-enter the set. A few hundred to a few thousand examples is a useful first eval set.
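To make the two ideas in this step concrete, here is a rough sketch of how deterministic per-request sampling can work, plus the arithmetic for picking a rate. It illustrates the general technique only; the hashing scheme is an assumption, not a description of how the service samples, and the helper names and traffic figures are made up.

```typescript
import { createHash } from "node:crypto";

// Map a request id to a stable number in [0, 1).
// Assumption: SHA-256 is used here for illustration; the real scheme may differ.
function bucketFor(requestId: string): number {
  const digest = createHash("sha256").update(requestId).digest();
  // First 4 bytes as an unsigned 32-bit integer, scaled into [0, 1).
  return digest.readUInt32BE(0) / 2 ** 32;
}

// The same id always lands in the same bucket, so a retry gets the same decision.
function isSampled(requestId: string, sampleRate: number): boolean {
  return bucketFor(requestId) < sampleRate;
}

// Rate needed to collect roughly `targetExamples` over `days` of traffic.
function sampleRateFor(targetExamples: number, days: number, requestsPerDay: number): number {
  return Math.min(1, targetExamples / (days * requestsPerDay));
}

// Made-up volume: 2,000 examples spread over 7 days at 50,000 requests per day.
const rate = sampleRateFor(2000, 7, 50000);
console.log(rate.toFixed(4)); // 0.0057
console.log(isSampled("req_123", rate) === isSampled("req_123", rate)); // true
```

Working backwards from a target size and a collection window, rather than picking a rate and waiting, is what keeps the set spread across a full week instead of filling up in the first afternoon.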

3 — Review a handful by hand

Before treating the stream as a dataset, open ten captures in the dashboard and read them. You're checking that the workload scoping is honest (no stray call site leaking in), tags are present, and the bodies contain what you expect — ten minutes here saves a polluted dataset later.
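Once captures are available as JSON (for instance after the export in step 4), the same three checks can also be scripted as a complement to reading by hand. This is a sketch under assumptions: `tags`, `customer_request_body`, and `response_body` are named in this tutorial, while the `workload` field name and the `auditCaptures` helper are hypothetical.

```typescript
// Assumed envelope shape. `workload` is a hypothetical field name for illustration.
interface Capture {
  workload?: string;
  tags?: Record<string, string>;
  customer_request_body: string;
  response_body: string;
}

interface AuditReport {
  total: number;
  wrongWorkload: number;
  missingTags: number;
  unparseableBody: number;
}

// True when the text is valid JSON.
function parses(text: string): boolean {
  try {
    JSON.parse(text);
    return true;
  } catch {
    return false;
  }
}

// Count captures that fail each of the three review checks.
function auditCaptures(
  captures: Capture[],
  workload: string,
  requiredTags: string[],
): AuditReport {
  const report: AuditReport = {
    total: captures.length,
    wrongWorkload: 0,
    missingTags: 0,
    unparseableBody: 0,
  };
  for (const c of captures) {
    if (c.workload !== workload) report.wrongWorkload++;
    if (requiredTags.some(key => c.tags?.[key] === undefined)) report.missingTags++;
    if (!parses(c.customer_request_body) || !parses(c.response_body)) report.unparseableBody++;
  }
  return report;
}
```

Counts of zero do not replace reading the bodies: a script can confirm that tags exist, but only a person can tell whether the requests look like the traffic the eval question is about.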

4 — Export and select

Each capture is one JSON envelope with the raw customer_request_body, response_body, and your tags (full shape in Capture). During the preview, bulk export is operator-assisted — ask and you receive presigned URLs for your date range. Selection is then a filter over envelope fields:

selection sketch

// captures: array of parsed capture envelopes from the export
const evalSet = captures
  .filter(c => c.tags?.env === "prod" && c.tags?.tier === "paid")
  .filter(c => c.status_code === 200)   // keep successful responses only
  .map(c => ({
    input: JSON.parse(c.customer_request_body),
    reference: JSON.parse(c.response_body),   // incumbent output as reference
  }));
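A slightly more defensive variant of the same filter is sketched below, with explicit types and with captures whose bodies fail to parse skipped instead of aborting the whole run. The `Capture` interface lists only the fields this tutorial mentions, and `selectEvalSet` is a hypothetical name.

```typescript
// Only the envelope fields used in this tutorial; the full shape has more.
interface Capture {
  status_code: number;
  tags?: Record<string, string>;
  customer_request_body: string;
  response_body: string;
}

interface EvalExample {
  input: unknown;
  reference: unknown; // incumbent output as reference
}

// Keep successful captures whose tags match every wanted key/value pair.
function selectEvalSet(
  captures: Capture[],
  wantedTags: Record<string, string>,
): EvalExample[] {
  const examples: EvalExample[] = [];
  for (const c of captures) {
    const tagsMatch = Object.entries(wantedTags).every(
      ([key, value]) => c.tags?.[key] === value,
    );
    if (!tagsMatch || c.status_code !== 200) continue;
    try {
      examples.push({
        input: JSON.parse(c.customer_request_body),
        reference: JSON.parse(c.response_body),
      });
    } catch {
      // Body was not valid JSON (for example a truncated response): skip it.
    }
  }
  return examples;
}
```

Passing the wanted tags as an argument means the same function serves other slices, such as `{ env: "prod", tier: "free" }`, without editing the filter.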

5 — Use it

The set now serves both directions of the replacement loop: score a candidate against the incumbent's recorded outputs before routing any traffic, and re-score on each ratchet step of the replacement tutorial. The open-source agent tools run exactly this workflow — capture, eval, optimize — locally, with a coding agent driving.
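As a rough illustration of the first direction, scoring can be as simple as comparing candidate outputs to the recorded references and reporting the agreement rate. The sketch below assumes you have already run the candidate over each `input` and collected its outputs in the same order; `agreementRate` and `exactMatch` are hypothetical names, and exact match is only the simplest possible judge, not a recommended metric.

```typescript
interface EvalExample {
  input: unknown;
  reference: unknown;
}

type Judge = (output: unknown, reference: unknown) => boolean;

// Simplest judge: identical JSON serialization. Sensitive to object key order,
// so most real workloads will want a task-specific comparison instead.
const exactMatch: Judge = (output, reference) =>
  JSON.stringify(output) === JSON.stringify(reference);

// Fraction of examples where the candidate output agrees with the incumbent's.
function agreementRate(
  evalSet: EvalExample[],
  candidateOutputs: unknown[],
  judge: Judge = exactMatch,
): number {
  if (evalSet.length !== candidateOutputs.length) {
    throw new Error("need exactly one candidate output per example");
  }
  if (evalSet.length === 0) return 0;
  let agreed = 0;
  evalSet.forEach((example, i) => {
    if (judge(candidateOutputs[i], example.reference)) agreed++;
  });
  return agreed / evalSet.length;
}
```

Keeping the judge as a parameter lets the same loop be reused at each ratchet step with a stricter or looser comparison.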