Darshil Gandhi’s Post

Evals are becoming part of the product manager craft, the same way reading a funnel chart or a SQL query was in 2015.

AI agents don't behave like the products we learned to measure. Users type intent into a chat box, the agent calls tools and retrieves context, and the output changes from one run to the next. Click-and-form analytics never sees inside that.

Evals are how you measure it: repeatable tests that score an agent's output against your quality bar and run on every change.

The part most teams underrate is the last mile. A high pass rate tells you the model performed on a test set. It doesn't tell you whether good agent interactions drive retention, whether failures concentrate in your highest-value segments, or whether your most expensive queries are also your lowest-converting. You answer those by joining eval scores to product engagement under the same user identity.

I put together a getting-started guide for PMs covering traces, LLM judges, offline vs online evals, and how to wire eval scores to outcomes. Link in the comments.
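
To make the "same user identity" join concrete, here is a minimal sketch in plain Python. Everything in it is an assumption for illustration: the field names (`user_id`, `score`, `segment`, `retained_30d`), the 0.7 pass threshold, and the sample rows are hypothetical and not taken from the guide. It enriches each eval score with that user's engagement data, then answers two of the questions above: where failures concentrate, and whether good interactions go with retention.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical eval results: one row per scored agent interaction.
EVAL_SCORES = [
    {"user_id": "u1", "score": 0.9},
    {"user_id": "u1", "score": 0.8},
    {"user_id": "u2", "score": 0.4},
    {"user_id": "u2", "score": 0.5},
    {"user_id": "u3", "score": 0.95},
    {"user_id": "u4", "score": 0.3},  # no engagement record, dropped by the join
]

# Hypothetical product engagement data, keyed by the same user identity.
ENGAGEMENT = {
    "u1": {"segment": "enterprise", "retained_30d": True},
    "u2": {"segment": "enterprise", "retained_30d": False},
    "u3": {"segment": "self_serve", "retained_30d": True},
}

PASS_THRESHOLD = 0.7  # assumed quality bar; set your own


def join_scores_to_engagement(eval_scores, engagement):
    """Inner join: attach segment and retention to every scored interaction."""
    joined = []
    for row in eval_scores:
        profile = engagement.get(row["user_id"])
        if profile is None:
            continue  # skip interactions we cannot tie to a known user
        joined.append({**row, **profile})
    return joined


def failure_rate_by_segment(joined, threshold=PASS_THRESHOLD):
    """Share of interactions scoring below the threshold, per segment."""
    totals, failures = defaultdict(int), defaultdict(int)
    for row in joined:
        totals[row["segment"]] += 1
        if row["score"] < threshold:
            failures[row["segment"]] += 1
    return {seg: failures[seg] / totals[seg] for seg in totals}


def retention_by_quality(joined, threshold=PASS_THRESHOLD):
    """Retention rate for users whose average score passes vs. fails."""
    scores, retained = defaultdict(list), {}
    for row in joined:
        scores[row["user_id"]].append(row["score"])
        retained[row["user_id"]] = row["retained_30d"]
    buckets = defaultdict(list)
    for user_id, user_scores in scores.items():
        bucket = "passing" if mean(user_scores) >= threshold else "failing"
        buckets[bucket].append(retained[user_id])
    return {bucket: sum(flags) / len(flags) for bucket, flags in buckets.items()}


if __name__ == "__main__":
    joined = join_scores_to_engagement(EVAL_SCORES, ENGAGEMENT)
    print(failure_rate_by_segment(joined))
    print(retention_by_quality(joined))
```

In practice this join would more likely live in a warehouse query or a dataframe merge. The point of the sketch is only that both datasets need a shared user key before either question can be asked.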

Eval scores alone don't close the last mile. Wiring them to traces, so you can see what the model received when it produced each score, is where the "why" surfaces.

Love this, what a great explanation!
