No correlation between estimated size and actual time taken

Story size is a relative estimate, commonly arrived at using planning poker. The Fibonacci scale is one option, but there are others, including 1, 2, 4, 8, 16 and S, M, L, XL.

Planning poker cards - Fibonacci

Cycle time is the measure of how long a story takes to move from the point at which work starts on it to the point at which it is considered done.

basic card wall with cycle time label
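To make the definition concrete, here is a minimal Python sketch of a cycle time calculation. The function name and the example timestamps are made up for illustration, and it measures elapsed calendar time; your team may prefer to count working days only.

```python
from datetime import datetime


def cycle_time_days(started: datetime, done: datetime) -> float:
    """Elapsed days between work starting on a story and the story being done."""
    if done < started:
        raise ValueError("done must not be earlier than started")
    return (done - started).total_seconds() / 86400


# Hypothetical story: work starts Monday 09:00 and is done Thursday 15:00.
started = datetime(2024, 3, 4, 9, 0)
done = datetime(2024, 3, 7, 15, 0)
print(cycle_time_days(started, done))
```

Which columns on your wall count as "started" and "done" is a team decision; the calculation stays the same whichever pair you pick.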

When teams use planning poker, I’ve observed in pretty much every case that they infer a correlation between relative size and how long a story should take to complete. For example, it wouldn’t be unreasonable to assume that a story estimated at 13 points will take longer to complete than a story estimated at 1 point. In other cases I’ve observed an assumption that defects are generally smaller than stories. Viewed in a chart, these preconceptions might look something like this:

Story Size and Cycle Time Scatter - Perceived

As you can see, the amount of time it takes for the work to be done is neatly aligned to the relative size estimates.

However, back in the real world the picture is somewhat different. The reality is, there is no correlation between story size and the amount of time it takes to do the work.

I base this statement on data collected from 25 teams across 4 different organisations. Each dataset contained on average 500 data points from a mixture of stories and defects. In total, the data contained the estimated size and total time taken for over 12,500 work items.
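If you want to run a similar check on your own team's data, one option is a rank correlation between estimated size and cycle time. The sketch below computes Spearman's rank correlation in plain Python. It is not necessarily how the charts in this article were produced, and the function names and sample numbers are hypothetical.

```python
from math import sqrt


def ranks(values):
    """Return 1-based ranks, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined when one input is constant")
    return cov / (spread_x * spread_y)


def spearman(xs, ys):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(xs), ranks(ys))


# Made-up sample: story points and cycle time in days for eight work items.
sizes = [1, 1, 2, 3, 5, 8, 8, 13]
days = [4, 1, 9, 2, 6, 1, 12, 5]
print(round(spearman(sizes, days), 2))
```

A value near +1 would mean bigger estimates reliably take longer, and a value near 0 means the estimate tells you little about the time taken. Ranks are used here because story points are an ordinal scale with lots of ties.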

The chart below shows the data for one of the teams. This chart is interesting because it shows that the larger the estimate, the less predictable the time taken to complete the work. The exception is the three 8 point stories, which arguably should have been sized as 1 point stories.

Story Size and Cycle Time Scatter
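One way to look at this spread in your own numbers is to group work items by estimated size and summarise the cycle times in each group. This is a rough sketch with a hypothetical function name and made-up data, not the analysis behind the chart.

```python
from collections import defaultdict
from statistics import median


def spread_by_size(items):
    """Group (size, cycle_time_days) pairs by size and summarise each group."""
    groups = defaultdict(list)
    for size, days in items:
        groups[size].append(days)
    summary = {}
    for size in sorted(groups):
        times = groups[size]
        summary[size] = {
            "count": len(times),
            "median": median(times),
            "range": max(times) - min(times),  # crude measure of spread
        }
    return summary


# Made-up work items: (story points, cycle time in days).
items = [(1, 1), (1, 3), (1, 2), (5, 4), (5, 15), (8, 2), (8, 30)]
for size, stats in spread_by_size(items).items():
    print(size, stats)
```

A wide range within a single size bucket is the pattern described above: the estimate is the same, but the time taken varies a lot.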

I then wondered whether some teams only consider the dev effort when sizing, so I narrowed the definition of done to measure just the cycle time from Ready for Dev to Dev Done/Ready for Test. Now clearly, considering your code to be done if it hasn’t been tested, UAT’d, and deployed to Live is pretty insane and I’m certainly not recommending it. What I’m interested in here, though, is the accuracy of story sizing when considering dev effort alone. The following chart plots just the cycle time for stories from Ready for Dev to Dev Done.

Dev Only Sizing vs Cycle Time

Again, no correlation. Here’s data from two more teams measuring cycle time from Ready for Dev to Ready for Live:

Team 1  Team 2

So if sizing stories can’t give us any indication of how long it will take to deliver work, then what’s the point of story sizing? In my next article I’ll cover The Purpose of Story Sizing, but for this article let’s look at why story size doesn’t correlate with cycle time.


To understand why estimated size rarely aligns with time taken, you need to explore the causes of variation within your delivery process. Variation comes in many forms, and each company and team will have its own nuances. Here are just a few sources of variation common to most delivery teams:

  • Lack of visibility of work.
  • Waiting for people to become available.
  • Availability of people in different time-zones (for distributed development teams).
  • Story size variation.
  • Story complexity variation.
  • The time it takes to resolve questions when relying upon other team members.
  • Context switching due to having too much work in progress.
  • Unrealistic timelines.
  • Poor prioritisation of work.
  • Hand-offs to other teams or 3rd parties.
  • Lack of automated test coverage resulting in regression and extended manual testing time.
  • Incorrect sequencing of work resulting in local dependencies.
  • Staff churn / turnover.
  • Variation in capabilities of individuals.
  • Misinterpretation of requirements.
  • Effect of productivity variance.
  • Waiting time when people are diverted to deal with incidents or other work.
  • Technical environment issues.
  • Poorly written user stories.
  • Sharing staff across multiple projects.
  • Release contention across multiple teams.
  • Environment contention – availability of technical environments at the right time.
  • Processes external to the team introducing waiting time.
  • Waiting for sign-off from departments external to the team.
  • Waiting for assets to be delivered from other teams, e.g. UI assets.
  • Build up of technical debt.
  • High defect rates resulting in rework.

There are many tools in the Agile / Kanban toolbox that can help us to reduce variation (you can’t eliminate variation). If you can reduce variation you become more predictable. If you become more predictable it makes planning easier. If you make planning easier you can manage expectations more effectively.

What are the sources of variation affecting your team?

5 Comments on "No correlation between estimated size and actual time taken"

  1. Perfect timing Ian. I literally needed this list tomorrow. I keep a similar list but yours has better coverage I think. I wonder if these points could be measured and scored across teams as an indicator of health.

  2. Interesting read. Your stats suggest that there is a correlation between story size estimate and uncertainty: the bigger the estimate, the greater the uncertainty. This all comes back to small batches being more predictable. BTW I would generally consider anything with a cycle time from Ready for Dev to Ready for Live of more than 14 days as “too big” to be a single story.

  3. This is great!! I wonder what the correlation (or lack thereof) would look like if you limited it to teams that frequently hit their deadlines (i.e. that were very predictable).

  4. That’s interesting – is the dataset public? I’m learning a bit of R at the moment and I’d love to have an explore of the data.

    What weirds me out most is the very long dev times on stories estimated as 1- and 2-point stories. What is going on in a team when a one-pointer isn’t the sort of thing you agree to work on after a morning standup and have finished by mid-afternoon? If a team had agreed that a story was a one-pointer and the dev doing it took five days, I’d be really unimpressed, but that seems usual in the dataset. It possibly suggests a culture of chronic under-estimation.

    Similarly, there is a lack of larger stories. Almost everything seems to be a 1 or 2. Where are the stories for producing, say, three pages on a website to administer a few entities? The sort of thing you’d give to a competent developer and expect done comfortably in a 2-week sprint? Shouldn’t these be 8- and 16-pointers that mid-level and senior developers are taking on regularly?

    One reason for the lack of correlation, then, might be in the lack of variety of the predictor variable here. And that might point to something cultural.
