03.12 — Metrics that matter: choosing the right success indicators

Metrics that matter: choosing the right success indicators

[Figure MT-01 (JB · jarodbarrera.com): The Metric Test. Four questions a metric must survive before it earns space on the dashboard. Fail one and it's decoration.]
Test 01. Is it actionable? Can someone look at this number and know what to do differently? If yes, it's worth tracking. If no, it's scenery.
Test 02. Is it controllable? Can this team actually move this number through the work it does? If not, it belongs on someone else's dashboard.
Test 03. Does it lead or lag? Lead metrics catch the problem early. Lag metrics report after it's too late. You need at least one of each, and you need to know which is which.
Test 04. Is it gameable? Could a smart team hit this number without actually creating value? If yes, add a counter-balance metric or kill it.
A metric that passes all four is rare. That's the point.

I've sat in a lot of metrics reviews where smart people debate the wrong things. The argument is usually about whether a number went up or down, whether the trend is statistically significant, and who owns the fix. What almost never gets debated is whether the metric itself is the right thing to measure. That question got settled months ago, often in a hastily convened planning meeting, and now we're all committed to it.

This is expensive. Bad metric choices compound. A team that's optimizing for the wrong thing for six months isn't just wasting time — it's actively building evidence that wrong things work, which makes changing direction later even harder.

So before anything else: the most important skill in product work isn't analyzing metrics. It's choosing them.

The metric selection problem

Here's the thing about metrics: any measurable behavior will get gamed once people know it's being measured. This is Goodhart's Law, named for the British economist Charles Goodhart, who articulated it in 1975, and it applies to product teams with uncomfortable precision. In its popular phrasing: when a measure becomes a target, it ceases to be a good measure.

I've seen it happen in clean, well-intentioned ways. A team measures feature adoption by tracking whether a user has clicked into a feature at least once. Reasonable enough. Within a quarter, someone in growth is triggering a modal on login that forces users into the feature. Adoption number goes up. Engagement, retention, and actual feature value all stagnate. The metric was gamed, but nobody set out to cheat — they were just optimizing for the thing they were told mattered.

The fix isn't to track the gamed metric more carefully. It's to ask, before you commit to a metric, what behavior you're actually trying to encourage and what would look like success if nobody was playing games. That forces a more honest conversation about what you actually care about.
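One way to make that conversation concrete is to put the easily gamed number next to a stricter one. The following is a minimal Python sketch, not a real analytics pipeline: the event format, the `source` field, and the two-distinct-days threshold are all assumptions made up for illustration.

```python
from collections import defaultdict

# Hypothetical event log: (user_id, day_number, source).
# "source" marks whether the user opened the feature on their own
# ("organic") or was pushed into it by a forced login modal ("modal").
EVENTS = [
    ("u1", 1, "modal"),
    ("u2", 1, "modal"),
    ("u3", 1, "organic"),
    ("u3", 4, "organic"),
    ("u4", 2, "modal"),
    ("u4", 9, "organic"),
]
TOTAL_USERS = 5  # u5 never touched the feature


def naive_adoption(events, total_users):
    """Share of users with at least one feature open, however it happened."""
    return len({user for user, _, _ in events}) / total_users


def repeat_organic_adoption(events, total_users, min_days=2):
    """Share of users who chose to open the feature on at least min_days distinct days."""
    organic_days = defaultdict(set)
    for user, day, source in events:
        if source == "organic":
            organic_days[user].add(day)
    returning = [u for u, days in organic_days.items() if len(days) >= min_days]
    return len(returning) / total_users


naive = naive_adoption(EVENTS, TOTAL_USERS)
honest = repeat_organic_adoption(EVENTS, TOTAL_USERS)
print(f"clicked at least once: {naive:.0%}")
print(f"came back on their own: {honest:.0%}")
```

With this toy data the first number is 80% and the second is 20%. The gap between them is the part of "adoption" that the forced modal manufactured.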

The hierarchy of metrics

Not all metrics are equal, and treating them that way is one of the most common mistakes I see. There's a natural hierarchy, and if you don't understand where each metric sits, you'll end up optimizing a support metric at the expense of the thing that actually drives your business.

At the top is the North Star metric. This is the one number that best captures the value your product delivers to customers. For Airbnb, it was nights booked. For Spotify, it's time spent listening. For a B2B SaaS product I worked on, it was "active workflows run per week" — not seats purchased, not logins, but the core action that signaled a customer was getting real value. A good North Star is something that goes up when customers succeed, not just when you acquire or retain them.

Below that are input metrics — the leading indicators that drive your North Star. If your North Star is active workflows per week, your inputs might be: number of users who've completed onboarding, number of integrations connected, number of workflow templates saved. These are the levers you can actually pull. They move faster, they're more sensitive to product changes, and they're closer to the decisions your team makes day to day.

At the base are health metrics — the guardrails. These don't go up when you succeed; they go up when something is wrong. Error rates, load times, support ticket volume, churn. You're not trying to optimize health metrics; you're trying to make sure they don't break. If your North Star is climbing but your error rate is also climbing, you have a problem that needs immediate attention even if the headline number looks great.

[Figure MT-02, Fig. 1: The Metric Tree. One north star. Inputs that drive it. Instrumentation that detects what moved. Build the tree before tracking the leaves.]
North Star: 1 metric that defines success this year.
Input metrics (A, B, C): 3–5 metrics teams directly influence, the levers that move the north star.
Instrumentation: the events, clicks, sessions you record so the input metrics are actually measurable.
Without the tree, every dashboard is just a list of numbers.

The hierarchy matters because it tells you what to optimize and what to protect. Input metrics are how you move the North Star. Health metrics are how you know you're not breaking things to get there. When a team loses track of the hierarchy, you get situations where churn is declining because you changed the cancellation flow, not because customers are actually happier. That's a gamed health metric, and it looks like success until it doesn't.
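The hierarchy can be written down as a small data structure, which makes the "optimize versus protect" distinction explicit. This is a sketch under assumptions: the class names, the sample values, and the simple rule "a guardrail is breached if it rose since the last period" are illustrative choices, not a standard.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Metric:
    name: str
    previous: float
    current: float

    @property
    def rising(self) -> bool:
        return self.current > self.previous


@dataclass
class MetricTree:
    north_star: Metric
    inputs: List[Metric] = field(default_factory=list)
    # Guardrails are health metrics where higher means worse.
    guardrails: List[Metric] = field(default_factory=list)

    def breached_guardrails(self) -> List[str]:
        return [g.name for g in self.guardrails if g.rising]

    def summary(self) -> str:
        breached = self.breached_guardrails()
        if breached:
            headline = "north star up" if self.north_star.rising else "north star flat or down"
            return f"investigate: {headline}, but worsening: {', '.join(breached)}"
        return "ok: no guardrail is worsening"


# Hypothetical numbers for the B2B SaaS example in the text.
tree = MetricTree(
    north_star=Metric("active workflows run per week", 1200, 1350),
    inputs=[
        Metric("users who completed onboarding", 300, 340),
        Metric("integrations connected", 410, 450),
        Metric("workflow templates saved", 90, 120),
    ],
    guardrails=[
        Metric("error rate", 0.010, 0.018),
        Metric("support ticket volume", 75, 70),
    ],
)
print(tree.summary())
```

Here the headline number is climbing while the error rate is also climbing, so the summary asks for an investigation rather than a celebration.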

Common failure modes

Most metric problems I've seen fall into a few familiar categories.

Vanity metrics are the most common. These are numbers that look impressive, move in a satisfying direction, and tell you almost nothing actionable. Registered users when your problem is activation. Page views when your problem is retention. App store ratings when your problem is engagement depth. Vanity metrics make reports look good. They don't help you make decisions.

The telltale sign of a vanity metric: when you ask "what decision does this change?", there's no good answer. If you can't name a specific product, engineering, or business decision that this metric would influence differently depending on its value, you're probably looking at a vanity metric.

Metric proliferation is the second failure mode, and it tends to happen in organizations that have solved the vanity metric problem but overcorrected. They track forty-three things. Every team has their own dashboard. Every stakeholder has their own north star. Nobody knows what to prioritize when the numbers diverge — and they always diverge.

I've found that healthy product teams track fewer metrics than you'd expect. One North Star. Three to five input metrics. A handful of health guardrails. That's it. Everything else is a diagnostic — something you pull when you're investigating a problem, not something on the weekly dashboard.

Lagging indicators are the third trap. Revenue, churn, and NPS all tell you what already happened. They're useful, but if you're only measuring lagging indicators, you're always reacting. By the time your churn number spikes, the customers who churned made their decision weeks or months ago. If you'd been watching the right leading indicators — declining login frequency, dropping feature usage, increasing support tickets — you'd have seen it coming.

The best metric stacks have a healthy ratio of leading to lagging. Lagging indicators tell you whether the strategy worked. Leading indicators tell you whether it's working.
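A leading indicator like declining login frequency can be checked with very little machinery. The sketch below assumes four weeks of login counts per account and flags any account whose last two weeks fell by half or more against the two weeks before; the account names, the data, and the 50% threshold are all hypothetical.

```python
# Hypothetical weekly login counts per account, oldest week first.
WEEKLY_LOGINS = {
    "acme": [12, 11, 6, 3],
    "globex": [8, 9, 9, 10],
    "initech": [5, 5, 4, 4],
}


def at_risk_accounts(weekly_logins, drop_threshold=0.5):
    """Return accounts whose recent logins dropped by at least drop_threshold."""
    flagged = []
    for account, weeks in weekly_logins.items():
        earlier = sum(weeks[:2])   # first two weeks
        recent = sum(weeks[-2:])   # last two weeks
        if earlier > 0 and (earlier - recent) / earlier >= drop_threshold:
            flagged.append(account)
    return flagged


flagged = at_risk_accounts(WEEKLY_LOGINS)
print(flagged)
```

In this toy data only `acme` is flagged. The point is timing: a list like this exists weeks before the same account would show up in a churn number.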

[Figure MT-03, Fig. 2: Four Ways Metrics Go Wrong. Each looks like progress on the slide. None of them actually move the business.]
Vanity metrics. Big numbers that go up and to the right but don't connect to anything that matters. Page views, sign-ups, MAU without retention context. Fix: pair every count with a quality or outcome metric.
Goal replacement. Team optimizes the metric instead of the underlying thing. AI accuracy goes up, user trust goes down because the team stopped listening to feedback. Fix: always include a counter-balance metric.
Too many metrics. Dashboard has 20 KPIs. None of them are decisive. The team can't agree on what "doing well" means. Fix: pick one north star, 3–5 inputs, the rest goes in the archive.
Lagging only. Team only watches revenue, retention, and NPS. By the time they move, the cause is two quarters old. Fix: pair every lagging metric with a leading one that predicts it.
A bad metric system is worse than no metrics. It tells you you're winning while you lose.

Defining metrics rigorously

A metric that isn't defined precisely isn't a metric — it's an aspiration. I've been in too many planning conversations where two smart people use the same word and mean completely different things.

"Engagement" is the worst offender. Does engagement mean daily active users? Time in app? Number of core actions completed per session? Feature adoption breadth? All four of those can be measured, and they can move in different directions at the same time. If your growth team is optimizing for time in app and your product team is trying to reduce friction so users accomplish goals faster, they will be working against each other, and both will think they're winning.

Before any metric goes on a dashboard, I try to define it with four things:

The precise event. Not "engagement" but "number of workflow runs initiated per user per week." Not "retention" but "percentage of users who perform their first meaningful action within 7 days of signup and at least one more in days 8–30." The precision sounds pedantic until you're six months in and someone argues about whether this week's number represents a change.

The denominator. "Active users" means nothing without knowing who counts as active. Active in the last day? The last 30 days? Users with at least one login? Users who've completed onboarding? The denominator changes the shape of the metric entirely.

The time window. A metric that doesn't specify a time window will be measured inconsistently. Same-week retention versus 30-day retention versus 90-day retention are not the same thing and should not be called the same thing.

The exclusions. Which users or events should be excluded? New signups who churned before activating? Internal test accounts? Events triggered by automated processes rather than real users? These aren't minor details — they're the difference between a metric that tells the truth and one that flatters.
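The four parts of a definition map directly onto the arguments of a function. Below is a minimal sketch of a 7-day activation rate in Python. The `User` fields, the cohort dates, and the choice to count an action on exactly day 7 as inside the window are assumptions for illustration; a real definition would need those choices written down too.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import List, Optional


@dataclass
class User:
    user_id: str
    signed_up: date
    first_action: Optional[date]  # first "meaningful action", None if never
    internal: bool = False        # test or employee account


def activation_rate(users: List[User], cohort_start: date, cohort_end: date,
                    window_days: int = 7) -> Optional[float]:
    # Exclusions + denominator: real users who signed up inside the cohort.
    cohort = [
        u for u in users
        if not u.internal and cohort_start <= u.signed_up <= cohort_end
    ]
    if not cohort:
        return None
    # Precise event + time window: first action within window_days of signup.
    activated = [
        u for u in cohort
        if u.first_action is not None
        and u.first_action - u.signed_up <= timedelta(days=window_days)
    ]
    return len(activated) / len(cohort)


# Hypothetical users; the year is arbitrary.
USERS = [
    User("a", date(2025, 3, 1), date(2025, 3, 3)),
    User("b", date(2025, 3, 2), date(2025, 3, 20)),        # too late
    User("c", date(2025, 3, 3), None),                     # never acted
    User("d", date(2025, 3, 4), date(2025, 3, 4), True),   # internal, excluded
    User("e", date(2025, 2, 20), date(2025, 2, 21)),       # outside the cohort
    User("f", date(2025, 3, 5), date(2025, 3, 12)),        # exactly day 7
]

rate = activation_rate(USERS, date(2025, 3, 1), date(2025, 3, 7))
print(f"7-day activation: {rate:.0%}")
```

For this data the rate is 50% (two of four eligible users). Counting the internal account would push it to 60%, which is the kind of flattering drift the exclusions are there to prevent.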

[Figure MT-04, Fig. 3: The Metric Checklist. Before this metric makes it onto a dashboard, the proposer answers all six. In writing.]
1. What decision will this metric inform? Write the decision down. (If no decision, the metric is decoration.)
2. What threshold separates good from bad? (If you can't state the line, you can't evaluate the result.)
3. How exactly is it calculated? Including denominator, segmentation, time window. (Vague metrics get re-interpreted by every team that reads them.)
4. What could go wrong if we just optimized this number? (If a smart team could game it, add a counter-balance now.)
5. What's the counter-balance metric? The one that keeps you honest. (Every input metric needs a quality or sustainability counterpart.)
6. Who owns it? When does it get reviewed? (Unowned metrics decay into dashboards nobody looks at.)
Six checked boxes = this metric earned its place. Anything less = not yet.

Connecting metrics to decisions

Here's the test I use for any metric under consideration: if this number changes, what do we do differently?

If you can't answer that question concretely, the metric doesn't belong on your dashboard. Monitoring is not a strategy. A metric you look at, nod at, and never act on is just noise that takes up attention.

The decision-forcing version of this question is even sharper: what would you have to see in this metric to change your roadmap? If the answer is "we'd have to see engagement drop significantly for an extended period," then you're describing a health metric — something you monitor, not something you optimize. If the answer is "if this drops by 10% over a two-week period, we stop our current sprint and investigate the activation funnel," then you've got an actionable metric with a real trigger.

I've found that the teams with the best metric cultures are the ones who set these triggers explicitly, in writing, before the quarter starts. Not "we'll watch engagement carefully" but "if 7-day activation drops below 35% by March 15th, we pause feature development and run two weeks of discovery focused on the onboarding experience." The explicitness isn't bureaucracy — it removes the interpretive argument later when the pressure to ship is high and the temptation to explain away a bad number is real.
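A written trigger is specific enough to be expressed as data. The sketch below encodes the example trigger from the paragraph above. The `Trigger` class, the owner name, the year, and the reading of "by March 15th" as "the trigger stays live through that date" are assumptions for illustration.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Trigger:
    metric: str
    threshold: float   # fire when the value falls below this
    deadline: date     # the trigger is only live through this date
    action: str
    owner: str         # one accountable person, not a team

    def evaluate(self, value: float, as_of: date) -> str:
        if as_of <= self.deadline and value < self.threshold:
            return f"TRIGGERED ({self.owner}): {self.action}"
        return "no action"


# Hypothetical owner and year; threshold and action come from the text.
trigger = Trigger(
    metric="7-day activation",
    threshold=0.35,
    deadline=date(2025, 3, 15),
    action="pause feature development, run two weeks of onboarding discovery",
    owner="onboarding PM",
)

print(trigger.evaluate(0.32, date(2025, 3, 10)))
print(trigger.evaluate(0.41, date(2025, 3, 10)))
```

Freezing the dataclass is a small design choice that mirrors the intent: the threshold is agreed before the quarter starts and should not be quietly edited once the number looks bad.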

Metrics should also be owned, not just tracked. Every key metric should have one person who is accountable for it — not a team, not a function, one person. That person is responsible for understanding why it moves, alerting stakeholders when it's at risk, and proposing interventions. Shared ownership is no ownership. Metrics without owners become historical records, not decision tools.

Metrics in the AI era

AI products break several of the intuitions I've built about metrics over the years, and I'm still figuring out the right frameworks.

The obvious challenge is that AI outputs are probabilistic and hard to evaluate at scale. How do you measure the quality of an AI-generated response when quality is contextual, subjective, and often only apparent after the user has acted on it? A traditional error rate, which counts every mistake equally, doesn't capture this. A model that's technically wrong 15% of the time might still be delivering tremendous value if those errors are in low-stakes contexts and the successes are in high-stakes ones.
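One rough way to see the difference is to weight each response by what was at stake. This is only a sketch: the evaluation log is made up, and the 10-to-1 weighting of high-stakes over low-stakes responses is an arbitrary assumption that a real team would have to argue about.

```python
# Hypothetical evaluation log: (was_correct, stakes).
RESPONSES = (
    [(True, "high")] * 5
    + [(True, "low")] * 12
    + [(False, "low")] * 3
)

# Assumed weights: a high-stakes response matters ten times as much.
STAKES_WEIGHT = {"low": 1, "high": 10}


def raw_error_rate(responses):
    """Fraction of responses that were wrong, all counted equally."""
    wrong = sum(1 for correct, _ in responses if not correct)
    return wrong / len(responses)


def weighted_error_rate(responses, weights):
    """Fraction of total stakes-weight that sits on wrong responses."""
    total = sum(weights[stakes] for _, stakes in responses)
    wrong = sum(weights[stakes] for correct, stakes in responses if not correct)
    return wrong / total


raw = raw_error_rate(RESPONSES)
weighted = weighted_error_rate(RESPONSES, STAKES_WEIGHT)
print(f"raw error rate: {raw:.1%}")
print(f"stakes-weighted error rate: {weighted:.1%}")
```

The raw rate here is 15%, while the weighted rate is under 5% because every error landed in a low-stakes context. Flip the errors into the high-stakes bucket and the same 15% becomes far more alarming.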

What I've landed on for AI products: measure outcomes, not outputs. Don't measure whether the AI generated a response. Measure whether the user's underlying goal was accomplished. For a coding assistant, that means tasks completed, time to completion, code that actually runs, and fewer errors. Not "responses generated" or "tokens served."

There's also a trust layer that's unique to AI products. Users develop a mental model of the AI over time, and that mental model determines how much weight they give to its suggestions. A declining "suggestion acceptance rate" might signal that the model quality dropped — or it might signal that users have learned to trust it less because of a few high-profile errors, even if the underlying quality hasn't changed. You need metrics that track the relationship between user and model, not just the raw accuracy.

The other thing I've noticed: AI products often produce value in ways that are hard to attribute. When a customer success team uses an AI tool to handle tier-1 support, their response time and resolution rate improve — but so does team morale, and the senior people start taking on harder problems. Some of that value shows up in your metrics. Some of it doesn't. Being honest about the limits of your measurement is part of the discipline.

The real problem with metrics

I want to end with something that took me years to internalize: most metric failures aren't measurement failures. They're honesty failures.

Teams know when a number doesn't mean what they're claiming it means. They know when they've gamed an adoption metric by forcing users into a feature. They know when their retention number is being held up by a lock-in mechanism rather than genuine value. They just don't say it out loud, because the number looks good and changing it requires a difficult conversation.

The discipline of good metrics starts with the willingness to be honest about what you're actually seeing, what the number actually means, and what you'd have to believe to think things are going well. That's harder than it sounds in an environment where everyone has a stake in the trajectory of the number.

But that's the work. Pick the right metric. Define it rigorously. Connect it to decisions. And when it's not telling the truth, say so.

This article is part of a series on product management in an AI-transformed landscape.