What I look for when evaluating AI features as a PM

There's a gap between how AI features get talked about externally and how they get talked about internally. Externally, the conversation is all about capability and "AI-powered." Internally, on a real product team, the conversation is mostly about failure modes.

Most of the AI features that ship in consumer products today, and certainly the ones I've been involved in, get debated along the same handful of dimensions every time. This post is the checklist I run through when I'm evaluating whether an AI feature is ready to ship, plus a few of the questions I've learned to ask early.

It's not exhaustive, and it's biased toward consumer mobile, which is what I work on. But the structure should travel.

1. Latency

This is the first question I ask, because it's almost always the most underestimated. A model call that takes 2 seconds in the lab can take 4 in production once you add cold starts, network variance, and queueing. In a consumer mobile context, those extra 2 seconds are the difference between "magical" and "broken."

Things I've learned to ask:

  • What's the p50, p95, and p99 latency? Lab numbers are usually p50 — make sure you know the tail.
  • What does the user see during that latency? "Loading" spinners drop conversion fast.
  • Can the response be streamed? Streaming buys real perceived-latency wins.
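
To make the p50/p95/p99 question concrete, here is a minimal sketch of how the tail can be pulled out of raw latency samples. The sample values and the latency_percentiles helper are hypothetical, made up for illustration; a real team would feed in measurements from production traffic.

```python
import statistics

def latency_percentiles(samples_ms):
    """Return p50, p95, and p99 from a list of latency samples in milliseconds."""
    if len(samples_ms) < 2:
        raise ValueError("need at least two samples")
    # quantiles(n=100) returns 99 cut points; index k-1 is the k-th percentile.
    cuts = statistics.quantiles(samples_ms, n=100, method="inclusive")
    return {"p50": cuts[49], "p95": cuts[94], "p99": cuts[98]}

# Hypothetical samples: mostly fast calls, with a slow tail.
samples = [1800] * 90 + [4000] * 8 + [9000] * 2
result = latency_percentiles(samples)
print(result)
```

With a distribution like this, the median looks healthy while the p95 and p99 are several times slower, which is exactly the gap a lab number tends to hide.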

2. Hallucination tolerance

Different features tolerate different amounts of model confabulation. A creative-writing assistant is allowed to make things up; a customer service bot answering account questions is not. I categorize features into three buckets:

  • Zero tolerance. The feature touches account state, financial data, legal information, or anything where a confident wrong answer harms the user.
  • Bounded tolerance. The feature is augmentative — the user is the final arbiter — and a wrong answer wastes time but doesn't harm.
  • High tolerance. Creative or generative use cases where "wrong" is in the eye of the user and they have other options.

The bucket determines almost everything else: how much you spend on evals, what fallbacks you build, how prominent the AI labeling is.

3. Fallback behavior

When the model fails — and it will — what does the feature do? "Show an error" is a real answer, but rarely the right one. The good products have thought about this:

  • Is there a non-AI version of the feature the user can fall back to?
  • If the model is slow or down, does the feature gracefully degrade or hard-fail?
  • If a user gets a bad result, is there an explicit feedback path?

I push hard on this in design reviews. If the team can't tell me the fallback story in a sentence, the feature isn't ready.
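
For what "gracefully degrade" can look like in practice, here is a rough sketch, assuming a timeout budget and a non-AI path exist. The names call_with_fallback, slow_model, and keyword_fallback are all hypothetical stand-ins, not any particular product's implementation.

```python
import concurrent.futures
import time

def call_with_fallback(model_fn, fallback_fn, prompt, timeout_s):
    """Try the model; on timeout or error, return the non-AI fallback instead."""
    pool = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    future = pool.submit(model_fn, prompt)
    try:
        return {"source": "model", "text": future.result(timeout=timeout_s)}
    except Exception:
        # Covers both timeouts and model errors: degrade instead of hard-failing.
        return {"source": "fallback", "text": fallback_fn(prompt)}
    finally:
        # Don't make the user wait for a slow model call to finish.
        pool.shutdown(wait=False)

def slow_model(prompt):
    # Hypothetical model call that blows the latency budget.
    time.sleep(0.5)
    return "AI answer for: " + prompt

def keyword_fallback(prompt):
    # Hypothetical non-AI version of the feature.
    return "Search results for: " + prompt

response = call_with_fallback(slow_model, keyword_fallback, "refund policy", timeout_s=0.05)
print(response)
```

Returning the source alongside the text lets the UI label the result honestly and lets the team count how often the fallback actually fires.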

4. Eval coverage

This is where teams get caught. Shipping an AI feature without an eval suite is like shipping a database feature without integration tests — it works in dev and breaks in ways you can't measure in prod.

What I look for:

  • A test set of at least 100 inputs covering the realistic distribution.
  • Automated scoring on whatever dimension matters (correctness, refusal rate, tone, latency).
  • A way to re-run the eval when prompts change, models change, or data drifts.
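
The smallest version of this is not much code. Below is a sketch of a re-runnable eval loop, assuming a substring check is a good enough scorer for the dimension being measured. The stub model, the two test cases, and the function names are hypothetical; a real suite would load 100+ cases from a file and call the real model.

```python
def run_eval(model_fn, cases, scorer):
    """Run every case through the model; return the pass rate and the failures."""
    failures = []
    for case in cases:
        output = model_fn(case["input"])
        if not scorer(output, case["expected"]):
            failures.append({"input": case["input"], "output": output})
    pass_rate = (len(cases) - len(failures)) / len(cases)
    return pass_rate, failures

def contains_expected(output, expected):
    # Simplest possible scorer: case-insensitive substring match.
    return expected.lower() in output.lower()

def stub_model(prompt):
    # Hypothetical stand-in for the real model call.
    canned = {"What is your refund window?": "Refunds are accepted within 30 days."}
    return canned.get(prompt, "I'm not sure.")

cases = [
    {"input": "What is your refund window?", "expected": "30 days"},
    {"input": "Do you ship to Canada?", "expected": "yes"},
]
pass_rate, failures = run_eval(stub_model, cases, contains_expected)
print(pass_rate, failures)
```

Because the model and scorer are passed in as arguments, the same loop can be re-run unchanged when the prompt, the model, or the test set changes.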

Eval cost is real. For a serious feature, the eval suite costs more to maintain than the feature itself. That's a budget conversation I have with engineering early.

5. Cost per interaction

Most teams know the per-call cost of the model they're using. Fewer have done the math on what that translates to at the feature's expected usage level. The math has surprised me twice in my career, both times badly.

The questions:

  • What's the cost per request?
  • What's the expected request volume in the first month, third month, year one?
  • If usage grows 5x, do the unit economics still work?
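
The unit math is back-of-the-envelope arithmetic, and a sketch like the one below is enough to run it at the spec stage. Every number here is a made-up assumption for illustration (cost per request, requests per user, revenue per user), not a figure from a real feature.

```python
def monthly_cost_per_user(cost_per_request, requests_per_user):
    """Model spend per user per month, in dollars."""
    return cost_per_request * requests_per_user

def monthly_margin_per_user(revenue_per_user, cost_per_request, requests_per_user):
    """Revenue left per user per month after paying for model calls."""
    return revenue_per_user - monthly_cost_per_user(cost_per_request, requests_per_user)

# Hypothetical inputs.
COST_PER_REQUEST = 0.004   # dollars per model call
REQUESTS_PER_USER = 60     # calls per user per month at expected usage
REVENUE_PER_USER = 1.00    # dollars per user per month

margin_now = monthly_margin_per_user(REVENUE_PER_USER, COST_PER_REQUEST, REQUESTS_PER_USER)
margin_5x = monthly_margin_per_user(REVENUE_PER_USER, COST_PER_REQUEST, REQUESTS_PER_USER * 5)

print(f"margin at expected usage: ${margin_now:.2f}")
print(f"margin at 5x usage:       ${margin_5x:.2f}")
```

Under these assumptions the feature makes money at expected usage and loses money per user at 5x, which is the kind of result worth finding before the prototype ships.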

I've killed two features at the spec stage because the per-user cost wouldn't pencil. Both times the engineer who'd built the prototype hadn't done the unit math.

6. Trust signals

This one is the most underrated. AI features in consumer products live or die by whether the user trusts the output. Trust is built by:

  • Clear labeling (the user knows it's AI).
  • Visible source attribution where applicable.
  • A visible feedback affordance (thumbs, "this was wrong," etc.).
  • Conservative claims. The feature should say "draft," "suggestion," or "starting point," not "answer."

The first time we shipped an AI summary feature, the copy was too confident. The summaries were good; what killed the feature was users not trusting them, because the language oversold the model. We relabeled, didn't change the model, and engagement doubled.

7. The "what changes about the product if this works" question

This isn't a checklist item exactly, but it's the question I ask before greenlighting an AI feature. If the feature works perfectly, what does the product become?

A lot of AI features pass the technical bar but fail this one. They work, they're well-built, they ship — and they don't change anything for the user. They're features that exist because AI exists, not because they make the product better.

The strongest AI features I've seen, including ones I've worked on, are ones where the feature is genuinely impossible without AI. A summary of a hundred user reviews. A natural-language search across a messy product catalog. Adaptive onboarding that responds to what the user does. Those are the ones worth the eval cost, the latency budget, and the cognitive load on the team.

The takeaway

The pattern I've noticed across teams is that the ones who ship good AI features aren't the ones who use the latest model. They're the ones who treat AI as one more component in a product, with all the failure modes and tradeoffs that implies — not as a magic ingredient that makes a feature done. The checklist above is just an attempt to translate that mindset into a list of questions.

Related: I wrote about how I use AI to draft PRDs, and I'll have a separate post on the specific tools I open every day. If you've shipped AI features and have war stories to add, find me on LinkedIn.
