Evals

Evals let you test how your agents handle real questions before you put a change live. You build a suite of test cases, run it against a specific agent version, and read the scores so you ship with confidence instead of guessing.

Why use evals

When you tweak a prompt, swap a model, or update your knowledge, it is hard to know whether you made things better or worse. Evals turn that into a measurable check: the same set of questions runs against your agent every time, and Lyro scores each answer so regressions show up clearly.

You manage everything from the Evaluations page in your dashboard.

Eval sets and eval cases

An eval set is a test suite - a named collection of cases that cover one area, for example a refund flow or your product FAQs. Each eval case is a single test: an input question plus the expected outcome you want the agent to deliver.

A case can carry any of these:

  • Input - the customer message the agent will answer.
  • Expected answer - the ideal response. When set, Lyro judges how closely the actual answer matches it.
  • Expected agent - which agent the conversation should be routed to.
  • Tags - labels to help you group and filter cases.

A case without an expected answer still runs, but it is marked "No ground truth" and will not get a correctness score. Add an expected answer when you want a clear pass or fail.
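To picture the fields above, here is a minimal Python sketch of a case. This article does not describe a public schema, so the `EvalCase` class, its field names, the `label` helper, and the sample values are all hypothetical and serve only as illustration.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class EvalCase:
    """Hypothetical model of one eval case (not Lyro's actual schema)."""
    input: str                              # customer message the agent will answer
    expected_answer: Optional[str] = None   # ideal response, if you have one
    expected_agent: Optional[str] = None    # agent the conversation should route to
    tags: List[str] = field(default_factory=list)  # labels for grouping and filtering

    def has_ground_truth(self) -> bool:
        # A case without an expected answer still runs,
        # but it cannot receive a correctness score.
        return bool(self.expected_answer)

    def label(self) -> str:
        return "Scored" if self.has_ground_truth() else "No ground truth"


# Made-up sample cases.
refund = EvalCase(
    input="How do I get a refund?",
    expected_answer="Refunds are available within 30 days of purchase.",
    expected_agent="Billing",
    tags=["refunds"],
)
smoke = EvalCase(input="Do you ship abroad?")

print(refund.label())  # Scored
print(smoke.label())   # No ground truth
```

Only the input is required in this sketch; every other field is optional, which mirrors how a case can carry any combination of the items listed above.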

Creating eval sets and cases

You have a few ways to build a suite:

  • New set - create an empty set, give it a name and description, then add cases.
  • Generate from KB - auto-create a starter suite from a knowledge base (see below).
  • Save as eval - capture a real turn from the Playground and stage it as a case.

You can rename or delete sets, and delete individual cases, from the menu next to each item.

Generating eval cases from knowledge

The fastest way to get started is to generate cases from your content. Click Generate from KB, then choose:

  • A knowledge base to draw from.
  • A case count (1 to 20).
  • An optional target agent that becomes the set's default for future runs.

Lyro samples passages from your active knowledge articles and drafts a realistic customer question plus an expected answer grounded in each passage. You get a ready-to-run set in seconds, which you can then edit to taste.

Thin or boilerplate passages are skipped, so the number created may be slightly lower than requested. Make sure your articles are ingested first.

Running an eval set

Open a set, pick the agent to test against, then click Run. Cases run one after another and results stream in live, so you can watch each case flip to pass or fail.

Running against a specific agent version

When the selected agent has versions, a Version picker appears so you can choose exactly what to test:

Option          What it runs
Published       The live config, through the normal routed path.
Draft           The agent's current open draft.
Older version   Any specific past published version.

This lets you evaluate a draft before you publish it, or re-check an older version, just like the version picker in the Playground. You can also apply a Format preset to the run if you use formatting presets.

Reading results and metrics

Each case is scored on the axes that have ground truth available:

Metric                    What it measures
Matches expected answer   How closely the actual answer matches the expected one.
Grounded in sources       Whether the answer's claims are backed by the retrieved knowledge.
Routed to right agent     Whether routing sent the case to the expected agent.

Lyro turns these into a pass or fail per case, plus a pass rate and average scores for the whole run. Open any case to see the expected answer and the actual answer side by side, the retrieved sources, response latency, and the judge's rationale for the scores.
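To make the roll-up concrete, the sketch below computes a pass rate and per-axis averages from a handful of case results. The result shape, the axis names, the 0-to-1 score scale, and the numbers are assumptions made for illustration; this article does not specify how Lyro stores or weights scores. An axis with no ground truth is represented as `None` and left out of that axis's average.

```python
from typing import Dict, List, Optional

AXES = ("correctness", "groundedness", "routing")  # hypothetical axis names

# Hypothetical per-case results; scores are assumed to be on a 0-1 scale.
# Case 3 has no expected agent, so its routing score is None.
results: List[dict] = [
    {"passed": True,  "correctness": 0.9, "groundedness": 1.0, "routing": 1.0},
    {"passed": False, "correctness": 0.4, "groundedness": 0.8, "routing": 1.0},
    {"passed": True,  "correctness": 0.7, "groundedness": 0.9, "routing": None},
    {"passed": True,  "correctness": 0.8, "groundedness": 0.7, "routing": 1.0},
]


def summarize(results: List[dict]) -> dict:
    """Return the pass rate and the average score for each axis."""
    if not results:
        return {"pass_rate": None, "averages": {axis: None for axis in AXES}}

    pass_rate = sum(1 for r in results if r["passed"]) / len(results)

    averages: Dict[str, Optional[float]] = {}
    for axis in AXES:
        # Only cases that were actually scored on this axis count.
        scores = [r[axis] for r in results if r[axis] is not None]
        averages[axis] = round(sum(scores) / len(scores), 2) if scores else None

    return {"pass_rate": pass_rate, "averages": averages}


summary = summarize(results)
print(summary["pass_rate"])  # 0.75
print(summary["averages"])   # {'correctness': 0.7, 'groundedness': 0.85, 'routing': 1.0}
```

Three of the four cases pass, giving a 75% pass rate, and the routing average is taken over the three cases that had an expected agent.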

Like an answer the agent produced? Click Use as expected to promote it into the case's expected answer, so future runs are judged against it.

Run history

Every run is saved, so each set keeps a history under the Runs tab. Each entry shows the agent tested, when it ran, the pass rate, and the per-axis averages, with a status of pending, running, completed, or failed.

Comparing agent versions

Because each run records the exact agent version it tested, you can compare runs to see the impact of a change:

  • Run the set against the Published version, then against your Draft, and compare their pass rates and scores.
  • Run the same set against two different agents to see which handles the area better.

This is how you confirm a change is an improvement - and not a regression - before you publish it. To learn how evals fit into your broader quality loop, see Measure and improve, and review live performance in Analytics.
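The comparison itself is simple arithmetic. The sketch below assumes you have noted the pass rate and per-axis averages of two runs of the same set, for example from the Runs tab. The `RunSummary` class, its field names, and the figures are hypothetical, and no export or API is implied.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class RunSummary:
    """Hypothetical record of one run, copied by hand from the run history."""
    version: str
    pass_rate: float
    averages: Dict[str, float]


def compare(baseline: RunSummary, candidate: RunSummary) -> Dict[str, float]:
    """Return candidate minus baseline for the pass rate and each shared axis."""
    deltas = {"pass_rate": round(candidate.pass_rate - baseline.pass_rate, 2)}
    for axis in baseline.averages.keys() & candidate.averages.keys():
        deltas[axis] = round(candidate.averages[axis] - baseline.averages[axis], 2)
    return deltas


def regressions(deltas: Dict[str, float]) -> List[str]:
    """Names of the metrics that got worse in the candidate run."""
    return sorted(name for name, delta in deltas.items() if delta < 0)


# Made-up figures for a Published run and a Draft run of the same set.
published = RunSummary("Published", 0.75, {"correctness": 0.70, "groundedness": 0.85})
draft = RunSummary("Draft", 0.85, {"correctness": 0.78, "groundedness": 0.80})

deltas = compare(published, draft)
print(deltas["pass_rate"])   # 0.1
print(regressions(deltas))   # ['groundedness']
```

In this made-up example the draft raises the pass rate but lowers the groundedness average, which is the kind of trade-off worth checking case by case before you publish.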