Metrics that matter: choosing the right success indicators
I've sat in a lot of metrics reviews where smart people debate the wrong things. The argument is usually about whether a number went up or down, whether the trend is statistically significant, and who owns the fix. What almost never gets debated is whether the metric itself is the right thing to measure. That question got settled months ago, often in a hastily convened planning meeting, and now we're all committed to the answer.
This is expensive. Bad metric choices compound. A team that's optimizing for the wrong thing for six months isn't just wasting time — it's actively building evidence that wrong things work, which makes changing direction later even harder.
So before anything else: the most important skill in product work isn't analyzing metrics. It's choosing them.
The metric selection problem
Here's the thing about metrics: almost any measured behavior gets gamed once people know they're being judged on it. This is Goodhart's Law, named for the British economist Charles Goodhart, who articulated it in 1975, and it applies to product teams with uncomfortable precision. In its popular phrasing: when a measure becomes a target, it ceases to be a good measure.
I've seen it happen in clean, well-intentioned ways. A team measures feature adoption by tracking whether a user has clicked into a feature at least once. Reasonable enough. Within a quarter, someone in growth is triggering a modal on login that forces users into the feature. Adoption number goes up. Engagement, retention, and actual feature value all stagnate. The metric was gamed, but nobody set out to cheat — they were just optimizing for the thing they were told mattered.
The fix isn't to track the gamed metric more carefully. It's to ask, before you commit to a metric, what behavior you're trying to encourage and what success would look like if nobody were playing games. That forces a more honest conversation about what you actually care about.
The hierarchy of metrics
Not all metrics are equal, and treating them that way is one of the most common mistakes I see. There's a natural hierarchy, and if you don't understand where each metric sits, you'll end up optimizing a support metric at the expense of the thing that actually drives your business.
At the top is the North Star metric. This is the one number that best captures the value your product delivers to customers. For Airbnb, it was nights booked. For Spotify, it's time spent listening. For a B2B SaaS product I worked on, it was "active workflows run per week" — not seats purchased, not logins, but the core action that signaled a customer was getting real value. A good North Star is something that goes up when customers succeed, not just when you acquire or retain them.
Below that are input metrics — the leading indicators that drive your North Star. If your North Star is active workflows per week, your inputs might be: number of users who've completed onboarding, number of integrations connected, number of workflow templates saved. These are the levers you can actually pull. They move faster, they're more sensitive to product changes, and they're closer to the decisions your team makes day to day.
At the base are health metrics — the guardrails. These don't go up when you succeed; they go up when something is wrong. Error rates, load times, support ticket volume, churn. You're not trying to optimize health metrics; you're trying to make sure they don't break. If your North Star is climbing but your error rate is also climbing, you have a problem that needs immediate attention even if the headline number looks great.
The hierarchy matters because it tells you what to optimize and what to protect. Input metrics are how you move the North Star. Health metrics are how you know you're not breaking things to get there. When a team loses track of the hierarchy, you get situations where churn is declining because you changed the cancellation flow, not because customers are actually happier. That's a gamed health metric, and it looks like success until it doesn't.
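One way to keep the hierarchy honest is to write it down as data rather than leave it in people's heads. What follows is a rough sketch, not a prescription. The metric names come from the workflow example above, but the `Role` enum, the `Metric` class, the `review` function, and every threshold and number are hypothetical, invented here for illustration.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Role(Enum):
    NORTH_STAR = "north_star"  # the one number that captures customer value
    INPUT = "input"            # leading levers the team can actually pull
    HEALTH = "health"          # guardrails: protect them, don't optimize them


@dataclass(frozen=True)
class Metric:
    name: str
    role: Role
    # Only used for health metrics: the value above which we worry.
    guardrail_max: Optional[float] = None


def review(metrics: list[Metric], current: dict[str, float]) -> list[str]:
    """Return the names of health metrics whose guardrail is breached."""
    breached = []
    for m in metrics:
        if m.role is Role.HEALTH and m.guardrail_max is not None:
            if current[m.name] > m.guardrail_max:
                breached.append(m.name)
    return breached


# Hypothetical stack for the workflow product described above.
STACK = [
    Metric("active_workflows_per_week", Role.NORTH_STAR),
    Metric("users_completed_onboarding", Role.INPUT),
    Metric("integrations_connected", Role.INPUT),
    Metric("workflow_templates_saved", Role.INPUT),
    Metric("error_rate", Role.HEALTH, guardrail_max=0.02),
    Metric("p95_load_time_s", Role.HEALTH, guardrail_max=2.0),
]

# Made-up snapshot: the North Star looks great, but error rate is over its limit.
snapshot = {
    "active_workflows_per_week": 12_400,
    "users_completed_onboarding": 830,
    "integrations_connected": 410,
    "workflow_templates_saved": 1_150,
    "error_rate": 0.035,
    "p95_load_time_s": 1.4,
}
breaches = review(STACK, snapshot)
```

The point of the `role` field is that the review logic treats health metrics differently from everything else: a breach gets flagged no matter how good the headline number looks.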
Common failure modes
Most metric problems I've seen fall into a few familiar categories.
Vanity metrics are the most common. These are numbers that look impressive, move in a satisfying direction, and tell you almost nothing actionable. Registered users when your problem is activation. Page views when your problem is retention. App store ratings when your problem is engagement depth. Vanity metrics make reports look good. They don't help you make decisions.
The telltale sign of a vanity metric: when you ask "what decision does this change?", there's no good answer. If you can't name a specific product, engineering, or business decision that this metric would influence differently depending on its value, you're probably looking at a vanity metric.
Metric proliferation is the second failure mode, and it tends to happen in organizations that have solved the vanity metric problem but overcorrected. They track forty-three things. Every team has its own dashboard. Every stakeholder has their own North Star. Nobody knows what to prioritize when the numbers diverge, and they always diverge.
I've found that healthy product teams track fewer metrics than you'd expect. One North Star. Three to five input metrics. A handful of health guardrails. That's it. Everything else is a diagnostic — something you pull when you're investigating a problem, not something on the weekly dashboard.
Lagging indicators are the third trap. Revenue, churn, and NPS all tell you what already happened. They're useful, but if you're only measuring lagging indicators, you're always reacting. By the time your churn number spikes, the customers who churned made their decision weeks or months ago. If you'd been watching the right leading indicators — declining login frequency, dropping feature usage, increasing support tickets — you'd have seen it coming.
The best metric stacks have a healthy ratio of leading to lagging. Lagging indicators tell you whether the strategy worked. Leading indicators tell you whether it's working.
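To make the leading-indicator idea concrete, here's a minimal sketch of a check for declining login frequency, one of the early signals mentioned above. The function name `login_decline`, the two-week comparison window, and the 50% drop threshold are all assumptions made for illustration. In practice you'd tune them against your own churn history.

```python
from statistics import mean


def login_decline(weekly_logins: list[int], recent_weeks: int = 2,
                  drop_ratio: float = 0.5) -> bool:
    """True when recent login frequency has fallen well below the earlier baseline.

    weekly_logins is ordered oldest to newest. The window and threshold
    defaults are illustrative assumptions, not recommendations.
    """
    if len(weekly_logins) < 2 * recent_weeks:
        return False  # not enough history to compare
    baseline = mean(weekly_logins[:-recent_weeks])
    recent = mean(weekly_logins[-recent_weeks:])
    if baseline == 0:
        return False  # never active: an activation problem, not a decline
    return recent < drop_ratio * baseline


# Made-up accounts with weekly login counts, oldest week first.
accounts = {
    "acme": [9, 8, 10, 9, 3, 2],   # sharp recent drop
    "globex": [5, 6, 5, 6, 5, 6],  # steady
    "initech": [4, 1],             # too little history to judge
}
flagged = [name for name, logins in accounts.items() if login_decline(logins)]
```

With this data only the first account is flagged, weeks before it would show up in a churn number.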
Defining metrics rigorously
A metric that isn't defined precisely isn't a metric — it's an aspiration. I've been in too many planning conversations where two smart people use the same word and mean completely different things.
"Engagement" is the worst offender. Does engagement mean daily active users? Time in app? Number of core actions completed per session? Feature adoption breadth? All four of those can be measured, and they can all move in opposite directions. If your growth team is optimizing for time in app and your product team is trying to reduce friction so users accomplish goals faster, they will be working against each other — and both will think they're winning.
Before any metric goes on a dashboard, I try to define it with four things:
The precise event. Not "engagement" but "number of workflow runs initiated per user per week." Not "retention" but "percentage of users who perform their first meaningful action within 7 days of signup and at least one more in days 8–30." The precision sounds pedantic until you're six months in and someone argues about whether this week's number represents a change.
The denominator. "Active users" means nothing without knowing who counts as active. Active in the last day? The last 30 days? Users with at least one login? Users who've completed onboarding? The denominator changes the shape of the metric entirely.
The time window. A metric that doesn't specify a time window will be measured inconsistently. Same-week retention versus 30-day retention versus 90-day retention are not the same thing and should not be called the same thing.
The exclusions. Which users or events should be excluded? New signups who churned before activating? Internal test accounts? Events triggered by automated processes rather than real users? These aren't minor details — they're the difference between a metric that tells the truth and one that flatters.
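Here's roughly what those four parts look like when you pin them down in code, using the retention definition from above. Treat it as a sketch under stated assumptions: the `User` and `Event` classes and the `retention_7_30` function are hypothetical, I'm reading "within 7 days of signup" as day offsets 0 through 7, and the function assumes every user passed in signed up at least 30 days ago.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class User:
    user_id: str
    signup: date
    internal: bool = False   # exclusion: internal test accounts


@dataclass(frozen=True)
class Event:
    user_id: str
    day: date
    automated: bool = False  # exclusion: not triggered by a real user


def retention_7_30(users: list[User], events: list[Event]) -> float:
    """Share of eligible users with a meaningful action in days 0-7 after
    signup and at least one more in days 8-30."""
    eligible = [u for u in users if not u.internal]  # the denominator
    if not eligible:
        return 0.0
    retained = 0
    for u in eligible:
        # Day offsets of this user's real (non-automated) events.
        offsets = [(e.day - u.signup).days for e in events
                   if e.user_id == u.user_id and not e.automated]
        early = any(0 <= d <= 7 for d in offsets)   # the time window, part 1
        later = any(8 <= d <= 30 for d in offsets)  # the time window, part 2
        if early and later:
            retained += 1
    return retained / len(eligible)


# Made-up data: four signups on the same day.
jan1 = date(2025, 1, 1)
users = [User("u1", jan1), User("u2", jan1),
         User("u3", jan1, internal=True), User("u4", jan1)]
events = [
    Event("u1", date(2025, 1, 3)), Event("u1", date(2025, 1, 20)),
    Event("u2", date(2025, 1, 2)),
    Event("u3", date(2025, 1, 2)), Event("u3", date(2025, 1, 15)),
    Event("u4", date(2025, 1, 3)), Event("u4", date(2025, 1, 15), automated=True),
]
rate = retention_7_30(users, events)
```

Only one of the three eligible users counts as retained. Drop either exclusion and the number jumps, which is exactly the kind of flattery the definition is there to prevent.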
Connecting metrics to decisions
Here's the test I use for any metric under consideration: if this number changes, what do we do differently?
If you can't answer that question concretely, the metric doesn't belong on your dashboard. Monitoring is not a strategy. A metric that you look at, nod at, and don't change anything based on is just noise that takes up attention.
The decision-forcing version of this question is even sharper: what would you have to see in this metric to change your roadmap? If the answer is "we'd have to see engagement drop significantly for an extended period," then you're describing a health metric — something you monitor, not something you optimize. If the answer is "if this drops by 10% over a two-week period, we stop our current sprint and investigate the activation funnel," then you've got an actionable metric with a real trigger.
I've found that the teams with the best metric cultures are the ones who set these triggers explicitly, in writing, before the quarter starts. Not "we'll watch engagement carefully" but "if 7-day activation drops below 35% by March 15th, we pause feature development and run two weeks of discovery focused on the onboarding experience." The explicitness isn't bureaucracy — it removes the interpretive argument later when the pressure to ship is high and the temptation to explain away a bad number is real.
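A trigger like that is simple enough to write down as data and check mechanically. Here's a minimal sketch. The `Trigger` class and `fired` function are hypothetical names, the year in the deadline is made up because the example above doesn't specify one, and I'm reading "drops below 35% by March 15th" as "any reading below 35% on or before that date", which is itself an interpretation you'd want to settle in writing.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Trigger:
    metric: str
    floor: float     # act if the metric reads below this value...
    deadline: date   # ...on or before this date
    action: str      # what the team has committed to doing


def fired(trigger: Trigger, value: float, today: date) -> bool:
    """True when the pre-agreed condition is met and the action is due."""
    return today <= trigger.deadline and value < trigger.floor


# The activation trigger from the paragraph above, with an assumed year.
activation_trigger = Trigger(
    metric="7_day_activation",
    floor=0.35,
    deadline=date(2025, 3, 15),
    action="pause feature development; run two weeks of onboarding discovery",
)

# Made-up weekly readings.
readings = [
    (date(2025, 2, 15), 0.38),
    (date(2025, 3, 1), 0.32),
]
due = [day for day, value in readings if fired(activation_trigger, value, day)]
```

There's nothing clever here, and that's the point. Once the condition is written down this plainly, there's no room left to argue about whether a bad number "really" counts.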
Metrics should also be owned, not just tracked. Every key metric should have one person who is accountable for it — not a team, not a function, one person. That person is responsible for understanding why it moves, alerting stakeholders when it's at risk, and proposing interventions. Shared ownership is no ownership. Metrics without owners become historical records, not decision tools.
Metrics in the AI era
AI products break several of the intuitions I've built about metrics over the years, and I'm still figuring out the right frameworks.
The obvious challenge is that AI outputs are probabilistic and hard to evaluate at scale. How do you measure the quality of an AI-generated response when quality is contextual, subjective, and often only apparent after the user has acted on it? Traditional error-rate metrics, which treat every error as equal, don't capture this. A model that's technically wrong 15% of the time might still be delivering tremendous value if those errors are in low-stakes contexts and the successes are in high-stakes ones.
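A quick bit of arithmetic shows why the raw error rate can mislead. In this sketch, two hypothetical models are both wrong 15% of the time, but their errors fall in different places. All of the counts and the per-error costs are invented for illustration, and weighting errors by a fixed cost per context is just one simple way to model stakes.

```python
def raw_error_rate(buckets: dict) -> float:
    """Errors divided by requests, ignoring where the errors happened."""
    total = sum(requests for requests, _, _ in buckets.values())
    errors = sum(errs for _, errs, _ in buckets.values())
    return errors / total


def error_cost(buckets: dict) -> float:
    """Errors weighted by an assumed cost per error in each context."""
    return sum(errs * cost for _, errs, cost in buckets.values())


# context: (requests, errors, assumed cost per error in arbitrary units)
model_a = {
    "low_stakes": (700, 140, 1),    # most errors land where they're cheap
    "high_stakes": (300, 10, 20),
}
model_b = {
    "low_stakes": (700, 35, 1),
    "high_stakes": (300, 115, 20),  # most errors land where they're expensive
}

rate_a, rate_b = raw_error_rate(model_a), raw_error_rate(model_b)  # both 0.15
cost_a, cost_b = error_cost(model_a), error_cost(model_b)          # 340 vs 2335
```

Same headline error rate, roughly a sevenfold difference in weighted cost under these made-up numbers. The cost weights are the hard part in practice, and they're a judgment call, not a measurement.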
What I've landed on for AI products: measure outcomes, not outputs. Don't measure whether the AI generated a response. Measure whether the user's underlying goal was accomplished. For a coding assistant, that means tasks completed, time to completion, code that actually runs, and reduced error rates. Not "responses generated" or "tokens served."
There's also a trust layer that's unique to AI products. Users develop a mental model of the AI over time, and that mental model determines how much weight they give to its suggestions. A declining "suggestion acceptance rate" might signal that the model quality dropped — or it might signal that users have learned to trust it less because of a few high-profile errors, even if the underlying quality hasn't changed. You need metrics that track the relationship between user and model, not just the raw accuracy.
The other thing I've noticed: AI products often produce value in ways that are hard to attribute. When a customer success team uses an AI tool to handle tier-1 support, their response time and resolution rate improve — but so does team morale, and the senior people start taking on harder problems. Some of that value shows up in your metrics. Some of it doesn't. Being honest about the limits of your measurement is part of the discipline.
The real problem with metrics
I want to end with something that took me years to internalize: most metric failures aren't measurement failures. They're honesty failures.
Teams know when a number doesn't mean what they're claiming it means. They know when they've gamed an adoption metric by forcing users into a feature. They know when their retention number is being held up by a lock-in mechanism rather than genuine value. They just don't say it out loud, because the number looks good and changing it requires a difficult conversation.
The discipline of good metrics starts with the willingness to be honest about what you're actually seeing, what the number actually means, and what you'd have to believe to think things are going well. That's harder than it sounds in an environment where everyone has a stake in the trajectory of the number.
But that's the work. Pick the right metric. Define it rigorously. Connect it to decisions. And when it's not telling the truth, say so.
This article is part of a series on product management in an AI-transformed landscape.