When Your Sprint Backlog Includes an AI Model

The estimate was 70 seconds per unit. The number went up on the whiteboard with the kind of confidence that clean arithmetic can create. We were standing up a multi-agent system, and the number came out of an honest exercise: add up the model latencies, the retrieval calls, a little overhead, round up to be safe. Seventy seconds. The team committed to it. Then the first real production run came back, and the number was 18.6 minutes.

Date: 7/1/2026

Author: Shailesh Patel

Eighteen point six minutes against a 70-second estimate. That is 1,116 seconds against 70, roughly a 16x gap between what we planned and what the system actually did, and it is the reason I am writing this post. Not because the estimate was sloppy; it was not. The math was clean and the people were good. The problem is that the entire planning apparatus we used to produce that number assumes something that stops being true the moment an AI model lands in your backlog. It assumes the work is deterministic.

I am a SAFe Program Consultant. I have spent more of my career than I would like inside the machinery of estimation, velocity, and Definition of Done. So, I want to be precise about what I am claiming here. I am not saying SAFe is broken. I am saying that a specific assumption underneath it, that a sufficiently decomposed unit of work has a knowable cost, quietly fails when the deliverable is a trained model or a stochastic pipeline. And most of the ceremonies we build on top of that assumption fail with it.

The discourse has this backwards

There is a lot of content right now about AI and agile, and almost all of it points the same direction: how to use AI tools to make your agile practice better. Scaled Agile Inc. itself has published guidance on using AI to improve backlog refinement, generate test cases, and accelerate Product Owner workflows. That is AI for agile, and it is useful. But it is not the hard problem.

The hard problem is agile for AI: how the ceremonies, the estimation, and the planning themselves have to change when the system under development is stochastic rather than deterministic. On that question the discourse is wide and shallow. Plenty of keyword content, almost nothing that engages the mechanical conflicts. That is the territory this post is about, and in my experience it is nearly empty, which is strange given how many teams are now shipping models inside Agile Release Trains and discovering the same 16x surprise I did.

Where 70 seconds went

It helps to know where the time actually went, because the failure was not one big mistake. It was a dozen small, reasonable assumptions compounding. The agents ran sequentially when the estimate implicitly assumed parallelism. The evidence-retrieval step looped more often than the happy-path diagram suggested, because real inputs are messier than test inputs. And the large language model calls sat behind an API with latency variance and rate throttling that no whiteboard number captured, so under real load the calls queued.
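The way those three factors compound can be sketched in a few lines. This is a minimal illustration, not a reconstruction of the actual system: the `Step` type, the three identical agents, and every number in it (12-second model calls, 3-second retrievals, 4 retrieval loops, a 6-second queue wait) are hypothetical values chosen only to show the shape of the problem.

```python
from dataclasses import dataclass


@dataclass
class Step:
    name: str
    latency_s: int   # assumed latency of one call, in seconds (hypothetical)
    calls: int = 1   # how many times the step runs per unit of work


def agent_time(steps, queue_wait_s=0):
    """Total seconds one agent spends, with an optional per-call queue wait."""
    return sum(s.calls * (s.latency_s + queue_wait_s) for s in steps)


def whiteboard_estimate(agents, overhead_s):
    """Happy path: agents overlap, so the unit costs as much as the slowest agent."""
    return max(agent_time(a) for a in agents) + overhead_s


def observed_time(agents, overhead_s, queue_wait_s):
    """Production path: agents run one after another and every call queues."""
    return sum(agent_time(a, queue_wait_s) for a in agents) + overhead_s


def build_agents(retrieval_loops):
    """Three identical hypothetical agents: one model call plus N retrieval calls."""
    return [
        [Step("llm_call", 12), Step("retrieval", 3, calls=retrieval_loops)]
        for _ in range(3)
    ]


# Whiteboard assumptions: parallel agents, one retrieval pass, no queueing.
planned = whiteboard_estimate(build_agents(retrieval_loops=1), overhead_s=5)

# Production assumptions: sequential agents, four retrieval passes, queued calls.
actual = observed_time(build_agents(retrieval_loops=4), overhead_s=5, queue_wait_s=6)

print(f"planned: {planned} s, observed: {actual} s, ratio: {actual / planned:.1f}x")
```

No single input here is dramatic on its own. The multiplier comes from stacking sequential execution, extra loops, and per-call queueing on top of each other, and each of those inputs is only observable once the system runs.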

Here is the part that matters for planning. Not one of those factors was knowable at estimation time. We could not have story-pointed our way to 18.6 minutes, because the information that produced 18.6 minutes did not exist until the system ran against production data at production scale. The estimate was not wrong because we were careless. It was wrong because the work was non-deterministic and we planned it as if it were not.

Story points were the first casualty

Story points are a relative measure of effort, and they work because effort, for traditional software, is roughly knowable once a story is well understood. A developer who has built three similar endpoints can size the fourth with real confidence. That confidence is the whole point of the mechanism.

Now watch it fail. I was in a PI Planning session where the data science team estimated a model feature at 8 story points. They were experienced, the story was well written, the acceptance criteria were clear. By the end of the PI they had burned 34 points and the model still was not production-ready. Nobody was negligent. The 8-point estimate was a guess about how quickly a stochastic process would converge, dressed up as a measure of effort. Convergence is not effort. You do not estimate it, you discover it, and you discover it by running the experiment.

That is the conceptual move teams keep missing. For an AI feature, the dominant cost is not how hard the team works. It is how the model behaves, and the model does not negotiate with your sprint boundary.

Experiment capacity is not feature velocity

The fix that has held up for me is to stop pretending experiments are features. Feature velocity measures the team's throughput on work whose cost is estimable. AI development also contains a second kind of work, the kind where the deliverable is knowledge rather than a shippable increment: does this architecture converge, does this data improve accuracy, is this approach viable at all. That work is real, it consumes capacity, and it frequently ships nothing. Forcing it into the same velocity number that tracks features corrupts both.

So, plan it separately. I carve out what I call Experiment Capacity, an explicit, bounded allocation of the train's capacity to experiments with defined hypotheses and timeboxes rather than story-point commitments. An experiment is committed to as a question and a deadline (e.g., we will spend two weeks determining whether the hybrid architecture holds latency under the adjudication window), not as a number of points that yields a feature. This does two things. It protects feature velocity from being polluted by experimental variance, and it gives the team honest permission to learn that something does not work, which is itself a valid PI outcome.
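One way to make that separation concrete is to keep experiments in their own structure with their own budget, so they can never be added into the feature number. The sketch below is an assumption about how such a plan could be modeled, not a SAFe artifact: `Experiment`, `PIPlan`, the 20 percent budget, and the 120-point commitment are all hypothetical names and values.

```python
from dataclasses import dataclass, field
from datetime import date, timedelta


@dataclass
class Experiment:
    hypothesis: str      # the question being asked, not a feature to ship
    start: date
    timebox_weeks: int
    capacity_pct: int    # share of train capacity this experiment consumes

    @property
    def deadline(self):
        """The commitment is a date by which the question gets answered."""
        return self.start + timedelta(weeks=self.timebox_weeks)


@dataclass
class PIPlan:
    feature_points: int          # story-point commitment, features only
    experiment_budget_pct: int   # capacity reserved for experiments
    experiments: list = field(default_factory=list)

    def experiment_pct_used(self):
        return sum(e.capacity_pct for e in self.experiments)

    def add_experiment(self, exp):
        """Admit an experiment only if it fits the reserved budget."""
        if self.experiment_pct_used() + exp.capacity_pct > self.experiment_budget_pct:
            raise ValueError("experiment exceeds the reserved experiment capacity")
        self.experiments.append(exp)


# Hypothetical plan: 120 feature points, 20% of capacity held for experiments.
plan = PIPlan(feature_points=120, experiment_budget_pct=20)
latency_check = Experiment(
    hypothesis="Hybrid architecture holds latency under the adjudication window",
    start=date(2026, 7, 1),
    timebox_weeks=2,
    capacity_pct=10,
)
plan.add_experiment(latency_check)
print(latency_check.deadline, plan.feature_points)
```

Adding an experiment changes the experiment budget and nothing else. The feature commitment stays untouched, which is the property that keeps experimental variance out of velocity.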

Planning a 10-week PI around a training run you cannot time

Here is the scheduling problem in its sharpest form. A SAFe PI is a fixed timebox, typically 8 to 12 weeks. A model training or tuning cycle might take two weeks, or it might take twelve, depending on convergence you cannot predict at planning time. You cannot fit an unknown-length activity into a fixed-length container by committing harder.

What you can do is treat foundation model selection and infrastructure provisioning as architectural runway, the same way SAFe already treats enabling work that has to exist before features can flow. The architectural runway concept maps almost perfectly onto AI: before the train can build features on a model, someone has to have selected the model, provisioned the inference infrastructure, established the evaluation harness, and proven the data pipeline. That is runway work, it belongs to the System Architect, and it should be planned and funded as runway rather than smuggled into feature stories where it will blow up the estimate.

This is where the System Architect role gets more important in an AI-native train, not less. In a deterministic train the architect can hand off and step back. In an AI-native train the architect owns the runway that determines whether the next PI is even feasible, because the choice of model and the shape of the inference infrastructure set the ceiling on everything the feature teams can promise.

Model readiness runs parallel to feature readiness

The last adjustment is to PI objectives themselves. In a normal train, PI objectives track Feature Readiness: is the feature built, tested, and ready to release. For AI work, I run a parallel track, Model Readiness, with its own objectives such as (1) has the model reached its accuracy threshold, (2) has it held that threshold against a regression dataset, and (3) has it been validated under production-like load. A feature can be code-complete while the model behind it is nowhere near ready, and if your PI objectives only track one of those, you will declare victory on a system that cannot ship. Tracking the two in parallel is what keeps the train honest about what done actually means, which is a problem big enough that it is the subject of my next post.
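The three Model Readiness objectives above can be expressed as a gate that sits next to the feature check. This is a minimal sketch under stated assumptions: the `ModelReadiness` fields, the single shared `threshold`, and the sample values (0.91, 0.87, 0.90) are hypothetical, and a real train would likely use metrics specific to its model.

```python
from dataclasses import dataclass


@dataclass
class ModelReadiness:
    accuracy: float             # measured on the evaluation set
    regression_accuracy: float  # measured on the regression dataset
    load_validated: bool        # passed a production-like load test


def model_blockers(m, threshold):
    """Return the Model Readiness objectives that are still unmet."""
    blockers = []
    if m.accuracy < threshold:
        blockers.append("accuracy threshold not reached")
    if m.regression_accuracy < threshold:
        blockers.append("threshold not held on regression dataset")
    if not m.load_validated:
        blockers.append("not validated under production-like load")
    return blockers


def can_ship(feature_ready, m, threshold):
    """Both tracks must be green before the objective counts as met."""
    return feature_ready and not model_blockers(m, threshold)


# Hypothetical status: the feature is code-complete, the model is not ready.
status = ModelReadiness(accuracy=0.91, regression_accuracy=0.87, load_validated=False)
blockers = model_blockers(status, threshold=0.90)
print(can_ship(feature_ready=True, m=status, threshold=0.90), blockers)
```

Returning the list of unmet objectives, rather than a bare boolean, gives the train something specific to discuss when the feature track is green and the model track is not.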

Traditional PI planning, and the AI-adapted version

If you take one artifact from this post into your next planning session, make it this. Here is the side-by-side I now use to brief teams before their first AI-native PI.

| Traditional PI Planning Element | AI-Adapted PI Planning Element |
| --- | --- |
| Story points estimate effort on knowable work | Experiment capacity timeboxes questions whose answers are discovered, not estimated |
| Velocity is the single throughput measure | Feature velocity and experiment capacity are tracked separately |
| Features fit inside the PI timebox | Training and tuning cycles may exceed the timebox; convergence is unpredictable |
| Enabler work supports feature delivery | Model selection and inference provisioning are first-class architectural runway |
| PI objectives track feature readiness | Model readiness runs as a parallel objective track |
| Definition of Done is binary | Definition of Done is a threshold on a distribution (next post in this series) |
| System Architect can hand off early | System Architect owns the runway that sets the ceiling for the whole train |

What I would do first

If you are about to run your first PI with a model in the backlog, do this before the planning event, not during it. Sit down with the System Architect and the data science lead and separate the work into two piles: (1) the features whose effort you can honestly estimate, and (2) the experiments whose outcomes you can only discover. Plan the first pile with story points and velocity the way you always have. Plan the second pile with experiment capacity, defined hypotheses, and timeboxes. Do not let the two piles share a number.
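The two-pile triage can be sketched as a small function that refuses to let an experiment carry story points. Everything here is illustrative: `BacklogItem`, `split_backlog`, and the three sample items are hypothetical, and the `estimable` flag stands in for the judgment call made with the System Architect and the data science lead.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class BacklogItem:
    title: str
    estimable: bool                      # can the effort be honestly sized?
    points: Optional[int] = None         # features only
    timebox_weeks: Optional[int] = None  # experiments only


def split_backlog(items):
    """Separate estimable features from discover-only experiments."""
    features = [i for i in items if i.estimable]
    experiments = [i for i in items if not i.estimable]
    for exp in experiments:
        if exp.points is not None:
            raise ValueError(f"'{exp.title}' is an experiment; give it a timebox, not points")
    return features, experiments


# Hypothetical backlog with two features and one experiment.
backlog = [
    BacklogItem("Results export endpoint", estimable=True, points=5),
    BacklogItem("Audit log screen", estimable=True, points=3),
    BacklogItem("Does the new data improve accuracy?", estimable=False, timebox_weeks=2),
]
features, experiments = split_backlog(backlog)
committed_points = sum(f.points for f in features)
print(committed_points, [e.title for e in experiments])
```

The point total is computed from the feature pile only, so the experiment cannot inflate or deflate it no matter how the experiment turns out.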

That one act of separation is most of the battle. The 16x surprise does not come from teams being bad at estimation. It comes from teams applying an estimation mechanism, built for deterministic work, to work that is fundamentally stochastic, and then being blindsided when the model does not behave like code. Name the two kinds of work, plan them differently, and the surprise mostly goes away. What is left is honest uncertainty, which you can manage, instead of false precision, which you cannot.

In the federal AI piece earlier in this series I argued that the authorization boundary stops being a line on a diagram the moment agents start calling each other, because the behavior that matters is produced at runtime rather than at design time. This is the same lesson one floor down in the delivery stack. The plan that looks settled on the board is settled only for the deterministic part. The stochastic part reveals itself when the system runs, and our planning has to make room for that instead of pretending it away.

In the next post I will take this into the Definition of Done, because once you accept that model behavior is a distribution, done can no longer be a checkbox, and a sprint review of a confusion matrix is a different animal than a feature demo. After that, the convergence piece: where continuous ATO and continuous delivery finally have to meet in one pipeline, which is exactly where the federal thread and this SAFe thread become the same conversation.

The models are already in the backlog. The estimates are already on the boards. The only question is whether we plan for what these systems actually are or keep being surprised by 18.6 minutes.

——

Shailesh Patel is CTO of Keystone International Ventures and a SAFe Program Consultant (SPC). He writes about the intersection of AI architecture, federal technology, and the delivery frameworks that connect them.