Danister. Back to portfolio
Engineering · AI and Planning · 5 min read

AI Is Breaking Story Points. Here's What to Measure Instead.

A simple theory on why your velocity looks strange, and what to track instead.

Story points are one of those concepts that get used everywhere and understood inconsistently. Before we get into what AI does to them, it's worth making sure we're talking about the same thing. Because the confusion around story points is exactly what makes AI's impact on them so easy to misread.

This builds on a broader piece I wrote on AI's impact across the full engineering stack. If you haven't read that one, it's a good place to start.

First

What Story Points Actually Mean

Story points are not time. They are not hours. They are not a measure of how long something will take. They are a measure of complexity, relative to other work your team has done before.

When a team estimates a ticket at 3 story points, they are saying: this feels about as complex as the other things we've called 3. The smallest possible change your team could make becomes the anchor. Call it 1. Something more involved than that is a 3. Something that requires significant thought, coordination, or unknowns is a 5 or an 8. You're not predicting hours. You're comparing problems.

This is important because it means story points are relative to your team, your codebase, and your current capabilities. Change any of those variables significantly, and the scale needs to recalibrate. Most teams don't think about this until something forces them to. AI is one of those things.

"Story points are not time. They are a measure of complexity, relative to other work your team has done before."

· · ·
The Theory

The Pins in the Dark Room

An analogy

You have a box filled with various items. Your task is to find 10 pins hidden inside it. The catch: you have to do it in a completely dark room.

How complex is that task? Let's call it 3 story points.

Now you are handed a torch.

Same box. Same pins. Same room. But the complexity of the task has fundamentally changed. What was a 3 is now less than 1. The torch is AI.

This is the simplest way I can put it. AI doesn't change the task. It changes how hard the task is to complete. And when the complexity of the work drops, the story points you would have assigned to it drop with it.

The implication for teams starting to use AI seriously is straightforward: your baseline shifts. The stories you used to anchor your scale at a 3 are now 1s. The things that were 5s start feeling like 3s. The entire scale recalibrates around your new capability.

· · ·
What You'll See

The Velocity Illusion

Here is what most teams observe when they start introducing AI into their engineering workflow, and why it can be confusing if you don't know what to expect.

In the early stages, you will likely see a spike in velocity. The team is picking up more story points per sprint than before. Leadership notices. It looks like a win. And it is, partly. But it is also a measurement artefact.

What's happening is that engineers are completing work faster, but the story point estimates haven't caught up yet. The scale hasn't recalibrated. You're still assigning old complexity scores to work that has become meaningfully easier. The velocity number goes up, but it's measuring the gap between old estimates and new capability, not a sustainable increase in output.

Over time, as teams groom with AI in mind, the baseline adjusts. A story that used to be a 3 gets estimated at 1. The velocity number comes back down to something closer to what it was before. Not because the team got slower. Because the measurement got more honest.
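To make the arithmetic of the spike concrete, here's a minimal sketch. All numbers are invented for illustration; the point is only the shape of the curve, not the magnitudes.

```python
# Hypothetical sprint data illustrating the velocity artefact.
# Every number here is made up for illustration, not a benchmark.

# Before AI: the team closes ~10 tickets per sprint, averaging 3 points each.
pre_ai_velocity = 10 * 3  # 30 points

# Early AI adoption: the same kind of work completes faster, so ~15 tickets
# close, but estimates still use the old scale (3 points each).
spike_velocity = 15 * 3  # 45 points -- looks like a 50% gain

# After recalibration: the same tickets are honestly re-estimated at 1 point.
settled_velocity = 15 * 1  # 15 points -- the number drops, the team didn't

print(pre_ai_velocity, spike_velocity, settled_velocity)
```

The spike and the later "drop" are both artefacts of the same thing: the estimate scale lagging behind capability.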

"The velocity spike is real but temporary. What changes permanently is the baseline."

· · ·
What Actually Changes

Throughput Is the Number to Watch

Once the baseline recalibrates, velocity tells you less than it used to. The number that starts to matter more is throughput. How many tickets is your team actually shipping per sprint?

This is where the real impact of AI shows up. Not in inflated story point totals, but in the volume of work that moves from to-do to done. Engineers who are unblocking faster, reviewing faster, and iterating faster are closing more tickets. That's the signal worth tracking.
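One way to see why throughput is the more robust signal: compute both numbers from the same ticket data. This is a sketch with invented tickets; the contrast is that velocity moves when estimates recalibrate, while throughput only moves when more work actually ships.

```python
from collections import Counter
from dataclasses import dataclass

@dataclass
class Ticket:
    sprint: str
    points: int  # estimate, on whatever scale the team uses

# Hypothetical completed tickets: S1 is pre-recalibration, S2 is after.
done = [
    Ticket("S1", 3), Ticket("S1", 3), Ticket("S1", 5),
    Ticket("S2", 1), Ticket("S2", 1), Ticket("S2", 2), Ticket("S2", 1),
]

velocity = Counter()    # points closed per sprint (sensitive to estimate drift)
throughput = Counter()  # tickets closed per sprint (robust to recalibration)
for t in done:
    velocity[t.sprint] += t.points
    throughput[t.sprint] += 1

print(dict(velocity))    # {'S1': 11, 'S2': 5} -- velocity "fell"
print(dict(throughput))  # {'S1': 3, 'S2': 4}  -- the team shipped more
```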

This also creates a planning challenge. If your historical velocity data was built before AI adoption, it is no longer a clean baseline for forecasting. The assumptions baked into your sprint capacity planning, your quarterly roadmaps, your delivery commitments — all of them were built on a different version of your team's capability.

The honest answer is that you need to re-anchor. Give yourself a recalibration period. Run a few sprints with AI fully embedded, let the estimates settle, and build your new baseline from there. It's a short-term disruption for a more accurate long-term picture.

· · ·
For Leaders

What This Means for Engineering Leaders

Stop comparing velocity before and after AI adoption. You are not measuring the same thing anymore, and the comparison will mislead you.

Reset your planning baselines. Give your team a recalibration window, let the estimates settle with AI embedded, and build your forecasts from the new normal rather than the old one.

Track flow metrics, not estimates. Cycle time, throughput, deployment frequency. These will tell you what is actually changing. Story points, for a period, will mostly give you noise.

Expect temporary metric confusion and communicate it upward. If leadership sees velocity drop after AI adoption, they need context. The drop is not regression. It is honesty catching up with capability.

Where to Look Next

The Metrics That Will Actually Tell You Something

Velocity and story points will give you a noisy signal during this transition. The metrics that cut through that noise are the ones that measure flow, not estimation.

Cycle time is the place to start. How long does a ticket take to move from in-progress to done? Break that down further: coding time, review time, time waiting for feedback. These numbers will start showing the impact of AI in ways that story points won't. You'll see coding time compress. Review cycles shorten. The shape of how work moves through your system will change.
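The breakdown above is just timestamp arithmetic over a ticket's status transitions. Here's a minimal sketch, assuming a hypothetical ticket record with four status timestamps; your tracker's field names will differ.

```python
from datetime import datetime

# Hypothetical status-transition timestamps for one ticket.
ticket = {
    "in_progress": "2024-05-01T09:00",
    "review_requested": "2024-05-01T15:00",
    "approved": "2024-05-02T11:00",
    "done": "2024-05-02T12:00",
}

def hours(start: str, end: str) -> float:
    """Elapsed hours between two ISO-ish timestamps."""
    fmt = "%Y-%m-%dT%H:%M"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return delta.total_seconds() / 3600

coding = hours(ticket["in_progress"], ticket["review_requested"])  # 6.0 h
review = hours(ticket["review_requested"], ticket["approved"])     # 20.0 h
merge_wait = hours(ticket["approved"], ticket["done"])             # 1.0 h
cycle_time = hours(ticket["in_progress"], ticket["done"])          # 27.0 h
```

Averaged across tickets, it's the ratio between these stages (and how it shifts) that shows AI's impact, more than any single total.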

DORA metrics sit in the same category. Deployment frequency, lead time for changes, change failure rate. These are the indicators that tell you whether AI is genuinely improving how your team delivers, or just changing how you count the work.
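Three of the four DORA numbers fall out of a simple deploy log. A sketch, with an invented log format (date, lead time in days from first commit to deploy, and whether the deploy caused a failure):

```python
from datetime import date

# Hypothetical deploy log: (deploy_date, lead_time_days, caused_failure)
deploys = [
    (date(2024, 5, 1), 2.0, False),
    (date(2024, 5, 3), 1.5, True),
    (date(2024, 5, 6), 0.5, False),
    (date(2024, 5, 7), 1.0, False),
]

# Inclusive observation window, in days.
window_days = (deploys[-1][0] - deploys[0][0]).days + 1

deploy_frequency = len(deploys) / window_days                    # deploys/day
lead_time = sum(d[1] for d in deploys) / len(deploys)            # mean, days
change_failure_rate = sum(d[2] for d in deploys) / len(deploys)  # fraction

print(round(deploy_frequency, 2), lead_time, change_failure_rate)
```

If AI is genuinely helping, you'd expect lead time to compress and deploy frequency to rise without the failure rate climbing with them.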

I'll be going deeper on each of these in the next post. How they get impacted, how to measure them in practice, and what to do when the numbers don't move the way you expect.

For now: if your velocity looks strange, don't panic. You're probably just holding a torch in a room that used to be dark. The task hasn't changed. Your capability has.

· · ·