Estimating Effort – Adaptation

I’ve been running the informed intuition (or if you prefer, “disciplined intuition”)  approach to estimating effort for close to nine months now. For the most part, it has gone very well. The primary objective – inspire and support a conversation around the effort needed to complete a story – has most definitely been realized. Along the way the process has shifted to better support both the conversation and the team’s ability to internalize the process.

Originally, it was proposed that teams rate each of the effort characteristics on a sliding scale – 1 to 10 or 1 to 15, or whatever the team decided was most useful. Feedback from the teams led to the discovery that it is easier to evaluate each effort characteristic using the modified Fibonacci scale rather than a sliding scale. This provides continuity across the method in that everything about a story’s effort value is considered using the same scale. It also reinforces the rationale behind the use of the Fibonacci scale and seems to facilitate the team’s ability to internalize the method. They are moving more quickly when deriving effort values.

A second adaptation is the use of several sets of characteristics, depending on the type of story, the predominant functional area represented by the team, and the nature of the work. For example, a story that involves the development of a computer board has a different set of criteria from stories that involve the creation of firmware for the board or the UI/UX features of the hardware product. The sets usually contain 3 or 4 common characteristics, such as “complexity” or “dependencies.” However, the hardware board may include something like “part sourcing” or “compliance testing.” This illustrates the importance of having the team deconstruct what “effort” means in the context of their world. When they determine the characteristics, the follow-on conversations about the effort are much more robust and meaningful.

In essence, this method is a reflection of the product owner’s responsibility for the “what” of the story and the team’s responsibility for figuring out the “how” of the story. “What I want,” says the product owner, “is an estimate of the effort involved to complete this story.” The team’s effort criteria demonstrate to the product owner how they arrive at any particular value.

Intuition and Effort Estimates

In his book “Blink,” Malcolm Gladwell describes an interview between Gary Klein and a fire department commander. When the commander was still a lieutenant, his crew was attempting to put out a kitchen fire that didn’t “behave” like a kitchen fire should. The lieutenant ordered his men out of the house moments before the floor collapsed; the fire was in the basement, not the kitchen. Klein later deconstructed the event with the commander and revealed a surprisingly rich set of experience-based characteristics that the commander had used to quickly evaluate the situation and respond. The lieutenant’s quick, well-calibrated intuition undoubtedly saved them from serious injury or worse.

Intuition, however, is domain-specific. This same experience-based intuition most probably wouldn’t have served the commander well if he suddenly found himself in a different situation – at the helm of a sailboat in rough water, for example, assuming the commander had never been on a sailboat before.

In the context of a software development environment, a highly experienced individual may have very good intuition on the amount of work needed to complete a specific piece of work assigned to them. But that intuition breaks down when the work effort necessarily includes several people or an entire team. So while intuition can serve a useful role in estimating work effort, that value is generally over-estimated, particularly when it needs to be a team estimate.

Consider work effort estimates when framed by Daniel Kahneman’s work with System One and System Two thinking. System One is fast, based on experiences, and automatic. However, it isn’t very flexible and it’s difficult to train. This is the source of intuition. System Two, however, is analytical, methodical, intentional, deliberate, and slower. Also, it’s more trainable. It’s when the things that are trained in System Two sink into System One that new behaviors become automatic. With work effort estimates, we must first deliberately train our System Two using a method that is more deliberate about estimating before we can comfortably rely on our System One abilities.

Once calibrated, any number of changes could signal the need to re-calibrate by employing the deliberate process. Change the team composition and the team will need some measure of re-training of System One via System Two. Change a team’s project and the same re-training will need to occur.

The trained intuition approach to estimating effort develops what Kahneman called “disciplined intuition.” Begin with a deliberate, statistical approach to thinking about work effort. Establish a base rate using the value ranges for the effort characteristics. With experience, the team can begin to integrate their intuition later in the project process. If teams lead with their intuition (as is the case with planning poker and t-shirt sizes), they will filter for things that confirm their System One evaluation. With experience and a track record of success from training their intuition, teams can eventually lead with an intuitive approach. But it isn’t a very effective way to begin.

This method also leverages the work of Anders Ericsson and deliberate practice. The key here is the notion of increasing feedback into the process of estimating work effort. The deliberate action of working through a conversation that evaluates each of the work effort characteristics introduces more and better feedback loops that help the team evaluate the quality of their decision. Over time, they get better and better at correcting course and internalizing the lessons.

It’s like learning to drive a car. A new driver will leverage System Two heavily before they can comfortably rely on System One while driving. This is good enough for most driving situations. However, it wouldn’t be good enough if that same driver who is competent at driving in city traffic was suddenly placed on a NASCAR track in a powerful machine going 200 miles per hour.

A NASCAR track might be where we would go look for expert drivers but not where we would look for competent delivery truck drivers. For work estimates on software projects, we’re looking for a level of good enough that’s a reasonable match for the project work at hand. And we’re looking for better than untrained intuitive guesses.

Determining Effort Value – Tactics

While the concept and practice are straightforward, shifting a team from intuitive guesses about story points to a more deliberate approach for determining effort value (a.k.a. story points) can be a challenge at first. The following approach may help start the process.

  1. Begin by focusing on product backlog items (PBIs) that the team has estimated using their previous approach that are at a 5 or greater. There isn’t much to be gained by applying this approach to PBIs estimated at 1 or 2. PBIs that the team knows are a bigger effort but may not be able to articulate why that is the case are good candidates for learning how to apply this technique.
  2. Ask the team how much time it may take to complete a PBI. While I have written before about the importance of excluding time criteria when determining effort values, this can be a good place to start. It is what teams are most familiar with – for better or worse. Teams usually have no problem throwing out a time: 8 hours, 16 hours, etc.
  3. With the time estimate in hand ask the team:

“If you sit in front of your computer and start the clock, will the PBI be done if you do nothing and the estimated time elapses?”

I would hope the team would answer “No.”

  4. With the answer to the first question in hand, ask the following question:

If the passage of time alone won’t get the PBI work completed, what will you be doing (actions and behaviors) to complete the work?

The conversation that follows from this question is the basis for determining the effort criteria the team needs to better describe what they will be doing on their way to completing the PBI. The techniques around establishing effort criteria are described in an earlier post.

Time Out!

In Estimating Effort – An Explicitly Implicit Approach I stated that time cannot be one of the attributes the team uses to describe what they mean by “effort.” The importance of this warrants the need for a deeper dive into the rationale behind this rule and how excluding time can lead to better predictability for team performance.

The primary objective for coaching teams to think about effort independent of time constraints is so that they can improve their skills for thinking about the actual work involved. Certainly they will spend time completing the work. But the simple passage of time won’t get the work done. Someone has to actually DO something. That something is the effort.

For example, maybe someone on the team says the product backlog item requires a lot of documentation. It isn’t complex and there aren’t any dependencies, it’s just going to take a lot of time – 7 days, maybe. So they want to give that PBI an effort value of 5 or 8 (or 5 or 8 story points, if that’s what you’re using) because it’s going to take a lot of time.

Remember, the purpose of these criteria is to generate a conversation around what the actual effort is. The criteria are just a set of guideposts that help the team hold a meaningful conversation about the effort.  So when someone on a team insists that they estimate using time, I ask them “What are you doing as the time you’ve estimated is passing? Are you just sitting there, watching the seconds tick away?” Of course they aren’t just sitting there. I’m asking the questions to elicit a comment about the actual work they are doing. Maybe they answer with something a little less vague, like “typing words.” That’s good. “What’s the difference between typing those words in a word processor and typing code in Vim?”

Continuing down this line of inquiry usually leads to the realization that typing documentation has many similar traits to coding. It can be complex. It may have dependencies.  It may require research for accuracy and it certainly will need a lot of debugging (professional writers call this “editing.”) Coders typically don’t like writing documentation. To them it’s just about the tedium of banging something out that’s not as fun as code. Sussing out the effort like this will lead to better acceptance criteria and definition of done associated with the PBI.

The downside of time estimates is that they hide all manner of sins and rabbit holes. The planning fallacy, precision bias, availability heuristic, and survivorship bias are just a few of the mental obstacles guaranteed to reduce the accuracy of time estimates. Or you may have to deal with a team member who wants to estimate using time because they know full well it offers the opportunity to hide slow work. (Gamers gotta game.) When teams have run the gauntlet of effort criteria, they are more likely to end up with a better picture of how much work they are being asked to do when time is excluded from the conversation. Effort criteria force the team to be more explicit about the activities they are engaged with as the clock ticks.

The investment in identifying time-independent effort criteria yields further benefits in the retrospective. Was the team unable to complete a PBI in the sprint? Was all the work finished two days early? Have a look at the effort criteria and ask which of them were a factor in making the PBIs a bigger or smaller effort than initially estimated. This is how teams learn and improve their skill at estimating. The better they are at estimating the more predictable their productivity.

OK, so let’s say you have a team doing a great job of determining the effort needed to complete a PBI and they do so without including time. No doubt, management will be unimpressed. They want time estimates. Good news! We can give them time estimates…in two week increments.

With the team focused on figuring out time-independent effort values for every PBI in the backlog and an ongoing experience of how much effort they can reliably complete in two-week increments, product owners can provide a reasonable forecast for when the release or project will be complete. The team focuses on accurate time-independent effort estimates. The scrum master and product owner worry about the performance metrics and time projections.
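To make the forecasting step concrete, here is a minimal sketch of how a product owner might turn effort values into a time projection. The function name, the backlog total, and the sprint history are all hypothetical, and using a simple average of past sprints is an assumption of this sketch rather than a prescribed method.

```python
import math
from statistics import mean

SPRINT_LENGTH_WEEKS = 2  # the two-week increment discussed above


def sprints_to_finish(remaining_effort, effort_completed_per_sprint):
    """Estimate sprints left, using the average effort completed per sprint."""
    average = mean(effort_completed_per_sprint)
    if average <= 0:
        raise ValueError("the team must have completed some effort to forecast")
    # Round up: a partly used sprint still occupies a full increment.
    return math.ceil(remaining_effort / average)


# Hypothetical numbers: 120 effort points left, three sprints of history.
sprints = sprints_to_finish(120, [18, 22, 20])
print(f"{sprints} sprints, about {sprints * SPRINT_LENGTH_WEEKS} weeks")
```

The team never supplies a time estimate here. Time enters only through the sprint length and the team's observed history.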

It’s surprising how hard of a sell this can be for teams. They are hard wired to think in terms of time because that’s what traditional project management has hounded them for since before coding was a thing. I tell teams, “With Agile and scrum, you no longer have to worry about time. That’s the product owner’s job. But you do have to develop very good skills at estimating effort.” It’s common for them to have a hard time adjusting to the new paradigm.

Estimating Effort – An Explicitly Implicit Approach

It is difficult to make predictions, especially about the future. – Unknown

Sage advice.

So why bother estimating the amount of work needed to complete a product backlog item? After all, since estimates are about the future the probability is high that they will be wrong. Actually, they may very well be guaranteed to be wrong. It’s just that some of the guesses will be more accurate than others. And if they happen to match what the effort ended up being, they just look like they were “right.”

I’ve written in the past expressing my thoughts about estimating the effort needed to complete product backlog items, particularly with respect to story points. I believe working to find a relative gauge to how well teams are estimating work is important. Without one, cognitive biases such as the optimism bias and planning fallacy can significantly distort a project delivery timeline. However, the phrase “story point” is burdened with a lot of baggage. It has been abused and misused such that invoking the phrase often causes more harm than good.

I’ve been experimenting recently with a different approach to estimating effort. The method I’ll describe in this post got a bit of a boost after listening to a recent interview with Psychologist and Nobel laureate Daniel Kahneman. In this interview, Kahneman describes an experience he had while serving in the Israeli army some sixty years ago. He was assigned the job of setting up an interview process that would determine how well a recruit would do as a combat soldier. For this process, he selected six traits and instructed the interviewers to ask questions designed to evaluate each trait independently and score them. The interviewers were not happy with this approach. As a compromise, Kahneman instructed the interviewers, when they were finished asking about the six traits, to close their eyes and just jot down a number they felt matched how good a soldier the recruit might be. What he discovered:

When we validated the results of the interview, it was a big improvement on what had gone on before. But the other surprise was that the final intuitive judgments added, it was good. It was as good as the average of the six traits, and not the same. It added information, so actually we ended up with a score that was half determined by the specific ratings, and the intuition got half the weight. That, by the way, stayed in the Israeli army for well over 50 years. – Daniel Kahneman

This intuitive evaluation made by the interviewers is similar to what Agile methods ask of development teams when determining a value for “story points.” T-shirt sizes, planning poker, dot voting, affinity mapping and many similar techniques are all designed to elicit an intuitive sense of the effort involved. If there is a disagreement between team members, then a dialog follows to understand what the discrepancy is all about. This continues until there is alignment on what the team believes the effort to be. When it works, it works well.

So on to the details of the approach I’ve been experimenting with. (It doesn’t have a name yet.) The result of this approach is a number I call the “effort value.” The word “value” is a reference to the actual elementary mathematics value being derived. Much like the answer to the question “What value results from adding 2 and 2?” Answer: 4. The word “value” also suggests an intrinsic worth, something beyond a hard number. My theory is that this will help teams think beyond the mere number and think also about the value they are delivering to stakeholders. The word “point” correlates to a hard number and lacks any association to intrinsic worth or value.

Changing the words introduces a simple and small shift that nonetheless has a significant impact. With the change, teams are more open to considering a different approach to determining estimates.

So how is the effort value derived?

I begin by having the team define 4-5 characteristics or attributes that, to them, describe what they mean by “effort.” It is important for the team to define these attributes. By doing so, they own the definition and it becomes much harder for them to dismiss the attributes as “someone else’s” and thereby object to their use in deriving an effort value. These attributes can be anything that is meaningful to the team. Examples:

  • Complexity – Is the work straightforward (e.g. code a bubble sort function) or does it involve interrelated systems (e.g. code a predictive inventory control algorithm)?
  • Dependencies – How dependent is the product backlog item on other backlog items or other teams?
  • Familiarity – Is this work very similar to work the team has done in the past or something quite new? Tasking a coder with documenting a piece of straightforward code may actually be a difficult effort because the coding language they spend most of their day with is familiar whereas writing clear sentences that non-technical people can understand is unfamiliar.
  • Information – Is the detail in the product backlog item complete? Are the acceptance criteria and definition of done clear?
  • Technical Debt Risk – Does the PBI require any refactoring of related code? Is any technical debt being incurred with the PBI?
  • Design Stability – Is there a lot of discovery and exploration needed to complete the PBI?
  • Confidence for Completing a PBI within the Sprint – This category may roll up several categories.
  • Tedium – Perhaps the effort involves a lot of repetitive copy and paste that nonetheless requires careful attention to avoid simple mistakes.

The team can define any attribute they wish. However, there are a few criteria to consider:

  • Keep the list limited to 4-6 attributes. More than that risks turning the derivation of an effort value into the equivalent of a product backlog item navel-gazing exercise.
  • Time cannot be one of the attributes.
  • The attributes should be reasonable. Assessing a product backlog item’s effort value by evaluating its “aura” or the current position of the stars is generally not useful. On the other hand, I’ve listened to arguments against evaluating estimates in terms of “complexity” as being similarly useless. I see the point of those arguments, but my view is that the attributes must first and foremost be meaningful to the entire team. In the end, it’s an educated guess and arguments about the definition of terms like “complexity” are counterproductive to the overall intent of deriving an effort value.

Each of these attributes is then given a scale, the same scale for each attribute – 1 to 10, 1 to 15 – whatever the team feels is most appropriate. The team then goes through each of these attributes and evaluates the product backlog item attribute on the scale. (NB: After nine months of Plan-Do-Check-Adapt, a better approach for scoring the attributes has been determined.) The low number on the scale represents very little impact. If dependency, for example, is one of the attributes then a 1 might mean that the product backlog item is entirely self-contained. A 10 might represent a case where the product backlog item is dependent on several other product backlog items or perhaps the output from other teams.

When this is done, ask the team where on the modified Fibonacci scale they think this particular product backlog item’s effort value should be. If they’re struggling you can do the math: find the average for all the attributes and match that number in the modified Fibonacci scale. If the average is a decimal, for example 3.1, match the value to the next highest modified Fibonacci scale number. In this case the value would be 5. Then ask the team if they feel that number is a good representation of the effort value for the product backlog item.
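For teams that want to see the arithmetic spelled out, the averaging and matching step might be sketched as follows. The scale values listed are one common form of the modified Fibonacci scale and are an assumption here, as are the function name and the sample scores.

```python
from statistics import mean

# One common form of the modified Fibonacci scale (an assumption; teams vary).
MODIFIED_FIBONACCI = [1, 2, 3, 5, 8, 13, 20, 40, 100]


def effort_value(attribute_scores):
    """Average the attribute scores and match the result to the scale.

    Returns the smallest scale value that is >= the average, so an
    average of 3.1 maps to 5 while an average of exactly 3 stays at 3.
    """
    average = mean(attribute_scores)
    for step in MODIFIED_FIBONACCI:
        if step >= average:
            return step
    # Averages above the top of the scale are capped at the largest value.
    return MODIFIED_FIBONACCI[-1]


# Hypothetical scores on a 1-to-10 scale for a single backlog item.
scores = {"complexity": 4, "dependencies": 2, "familiarity": 3, "information": 4}
print(effort_value(list(scores.values())))  # average is 3.25, so this prints 5
```

The calculated number is only a starting point for the question that follows it: does the team feel it represents the effort?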

This may seem like a lot of unnecessary gyrations, but for technical people it’s a simple process they can understand. The bonus is a number they can calculate. The number isn’t what’s important here. What’s important is the conversation that happens around the attributes and what the team feels about the number that results from the conversation. This exercise is meant to develop their intuitive muscles for considering multiple aspects and dimensions behind the “effort” needed for them to get the work done.

Use this process enough times and eventually calculating the average can be dropped from the process. Continue using this process and eventually calculating the numbers for the individual attributes can be dropped from the process. I don’t know if it’s a good idea to drop the use of the attributes for generating the needed conversation around the effort needed, but it will certainly be valuable to reconsider the list of attributes from time to time so as to fine tune the list to match what the team feels is important.

With this approach I’m turning the estimation process on its head (or back on its feet, if Kahneman is right.) Rather than seek the intuitive response first (e.g. t-shirt size) and elicit details later if there is a mismatch between team members, this method seeks to better prime and develop the team’s intuition about the effort value by having them explicitly consider a list of self-selected attributes (or traits) for effort first and then include an intuitive evaluation for effort.

Don’t try to form an intuition quickly, which was what we normally do. Focus on the separate points, and then when you have the whole profile, then you can have an intuition and it’s going to be better. Because people form intuitions too quickly, and the rapid intuitions are not particularly good. If you delay intuition until you have more information, it’s going to be better. – Daniel Kahneman

Update

See Time Out! and Determining Effort Value – Tactics for additional information on this technique.

The Practice of Sizing Spikes with Story Points

Every once in a while it’s good to take a tool out of its box and find out if it’s still fit for purpose. Maybe even find if it can be used in a new way. I recently did this with the practice of sizing spikes with story points. I’ve experienced a lot of different projects since last revisiting my thinking on this topic. So after doing a little research on current thinking, I updated an old set of slides and presented my position to a group of scrum masters to set the stage for a conversation. My position: Estimating spikes with story points is a vanity metric and teams are better served with time-boxed spikes that are unsized.

While several colleagues came with an abundance of material to support their particular position, no one addressed the points I raised. So it was a wash. My position hasn’t changed appreciably. But I did gain from hearing several arguments for how spikes could be used more effectively if they were to be sized with story points. And perhaps the feedback from this article will further evolve my thinking on the subject.

To begin, I’ll answer the question of “What is a spike?” by accepting the definition from agiledictionary.com:

Spike

A task aimed at answering a question or gathering information, rather than at producing shippable product. Sometimes a user story is generated that cannot be well estimated until the development team does some actual work to resolve a technical question or a design problem. The solution is to create a “spike,” which is some work whose purpose is to provide the answer or solution.

The phrase “cannot be well estimated” is suggestive. If the work cannot be well estimated, then what is the value of estimating it in the first place? Any number placed on the spike is likely to be for the most part arbitrary. Any number greater than zero will therefore arbitrarily inflate the sprint velocity and make it less representative of the value being delivered. It may make the team feel better about their performance, but it tells the stakeholders less about the work remaining. Nowhere can I find a stated purpose of Agile or scrum to be making the team “feel better.” In practice, by masking the amount of value being delivered, the opposite is probably true. The scrum framework ruthlessly exposes all the unhelpful and counterproductive practices and behaviors an unproductive team may be unconsciously perpetuating.

Forty points of genuine value delivered at the end of a sprint is 100% of rubber on the road. Forty points delivered of which 10 are points assigned to one or more spikes is 75% of rubber on the road. The spike points are slippage. If they are left unpointed then it is clear what is happening. A spike here and there isn’t likely to have a significant impact on the velocity trend over, for example, 8 or 10 sprints. One or more spikes per sprint will cause the velocity to sink and suggests a number of corrective actions – actions that may be missed if the velocity is falsely kept at a certain desired or expected value. In other words, pointing spikes hides important information that could very well impact the success of the project. Bad news can inspire better decisions and corrective action. Falsely positive news most often leads to failures of the epic variety.

Consider the following two scenarios.

Team A has decided to add story points to their spikes. Immediately they run into several significant challenges related to the design and the technology choices made. So they create a number of spikes to find the answers and make some informed decisions. The design and technology struggles continue for the next 10 sprints. Even with the challenges they faced, the team appears to have quickly established a stable velocity.

The burndown, however, looks like this:

If the scrum master were to use just the velocity numbers it would appear Team A is going to finish their work in about 14 sprints. This might be true if Team A were to have no more spikes in the remaining sprints. The trend, however, strongly suggests that’s not likely to happen. If a team has been struggling with design and technical issues for 10 sprints, it is unlikely those struggles will suddenly stop at sprint 11 and beyond unless there have been deliberate efforts to mitigate that potential. By pointing spikes and generating a nice-looking velocity chart it is more probable that Team A is unaware of the extent to which they may be underestimating the amount of time to complete items in the backlog.

Team B finds themselves in exactly the same situation as Team A. They immediately run into several significant challenges related to the design and the technology choices made and create a number of spikes to find the answers and make some informed decisions. However, they decided not to add story points to their spikes. The design and technology struggles continue for the next 10 sprints. The data show that Team B is clearly struggling to establish a stable velocity.

And the burndown looks like this, same as Team A after 10 sprints:

However, it looks like it’s going to take Team B 21 more sprints to complete the work. That they’re struggling isn’t good. That it’s clear they are struggling is very good. This isn’t apparent with Team A’s velocity chart. Since it’s clear they are struggling it is much easier to start asking questions, find the source of the agony, and make changes that will have a positive impact. It is also much more probable that the changes will be effective because they will have been based on solid information as to what the issues are. There is less guesswork involved with Team B than with Team A.

However, any scrum master worth their salt is going to notice that the product backlog burndown doesn’t align with the velocity chart. It isn’t burning down as fast as the velocity chart suggests it should be. So the savvy Team A scrum master starts tracking the burndown of value-add points vs spike points. Doing so might look like the following burndown:

Using the average from the parsed burndown, it is much more likely that Team A will need 21 additional sprints to complete the work. And for Team B?

The picture of the future based on the backlog burndown is a close match to the picture from the velocity data, about 22 sprints to complete the work.
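The gap between the two forecasts can be illustrated with a short sketch. The sprint history and backlog size below are hypothetical and are not the data behind the charts in this article; they only show how counting spike points in velocity shortens the apparent time to completion.

```python
import math


def forecast(remaining_points, velocities):
    """Sprints needed to burn down the backlog at the average velocity."""
    average = sum(velocities) / len(velocities)
    return math.ceil(remaining_points / average)


# Hypothetical sprint history: (value-add points, spike points) per sprint.
history = [(20, 10), (22, 8), (18, 12), (20, 10)]
remaining_backlog = 300  # hypothetical value-add points still to deliver

pointed = [value + spike for value, spike in history]
value_only = [value for value, _ in history]

print(forecast(remaining_backlog, pointed))     # spikes counted in velocity
print(forecast(remaining_backlog, value_only))  # spikes left unpointed
```

With these made-up numbers the pointed velocity suggests 10 sprints while the value-add velocity suggests 15. Only the second figure is tied to the work that actually burns down the backlog.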

If you were a product owner, responsible for keeping the customer informed of progress, which set of numbers would you want to base your report on? Would you rather surprise the customer with a “sudden” and extended delay or would you rather communicate openly and accurately?

Summary

Leaving spikes unpointed…

  • Increases the probability that performance metrics will reveal problems sooner and thus allow for corrective actions to be taken earlier in a project.
  • Makes the team’s velocity and backlog burndown a more accurate reflection of value actually being created for the customer and therefore allows for greater confidence in any predictions based on the metrics.

I’m interested in hearing your position on whether or not spikes should be estimated with story points (or some other measure.) I’m particularly interested in hearing where my thinking described in this article is in need of updating.

[This article originally appeared on the Agile Alliance blog.]

Agile Money

In a recent conversation with colleagues we were debating the merits of using story point velocity as a metric for team performance and, more specifically, how it relates to determining a team’s predictability. That is to say, how reliable the team is at completing the work they have promised to complete. At one point, the question of what is a story point came up and we hit on the idea of story points not being “points” at all. Rather, they are more like currency. This solved a number of issues for us.

First, it interrupts the all too common assumption that story points (and by extension, velocities) can be compared between teams. Experienced scrum practitioners know this isn’t true and that nothing good can come from normalizing story points and sprint velocities between teams. And yet this is something non-agile savvy management types are wont to do. Thinking of a story’s effort in terms of currency carries with it the implicit assumption that one team’s “dollars” are not another team’s “rubles” or another team’s “euros.” At the very least, an exchange evaluation would need to occur. Nonetheless, dollars, rubles, and euros convey an agreement of value, a store of value that serves as a reliable predictor of exchange. X number of story points will deliver Y value from the product backlog.

The second thing thinking about effort as currency accomplished was to clarify the consequences of populating the product backlog with a lot of busy work or non-value adding work tasks. By reducing the value of the story currency, the measure of the level of effort becomes inflated and the ability of the story currency to function as a store of value is diminished.

There are a host of other interesting economics-derived thought experiments that can be played out with this frame around story effort. What’s the effect of supply and demand on available story currency (points)? What’s the state of the currency supply (resource availability)? Is there such a thing as counterfeit story currency? If so, what does that look like? How might this mesh with the idea of technical or dark debt?

Try this out at your next backlog refinement session (or whenever it is you plan to size story efforts): Ask the team what you would have to pay them in order to complete the work. Choose whatever measure you wish – dollars, chickens, cookies – and use that as a basis for determining the effort needed to complete the story. You might also include in the conversation the consequences to the team – using the same measures – if they do not deliver on their promise.

How To Run an Agile Death March

Found on the Internet…

An experienced scrum master describes their work cycles as going “from being very busy during sprint end/start weeks to be [sic] very bored.” While this scrum master works very hard to fill in the gaps with 1:1’s with the team members and providing regular training opportunities, they nonetheless ask the question, “Does anyone have any suggestions of things I am maybe not doing that I should be doing?” One response included the following:

“Now, it could be that you have worked to create a hyper-performing team and there is no further room for improvement. A measure of this is that velocity (or similar metric) has increased by an order of magnitude in the last year.

However, the most likely scenario is that you and your team have become ‘comfortable’ and velocity has not increased significantly in the last few Sprints and/or there is a high variance in velocity.”

This reflects a common misunderstanding of “velocity” and its confusion with “acceleration.” (It also reflects the “more is better” and “winners vs losers” thinking derived from the scrum sports metaphor and points as a way of keeping score. I’ve written about that elsewhere.) Nor does the commenter understand what “order of magnitude” means. A velocity that increases by an order of magnitude in a year isn’t a velocity, it’s an acceleration. That’s a bad thing. This wouldn’t be a “hyper-performing” team. This would be a team headed for a crash, as a continual acceleration in story points completed is untenable. More and more points each sprint isn’t the goal of scrum. A product owner cannot predict when their team might complete a feature or a project if the delivery of work is accelerating throughout the project.

Assuming a typical project, something that continues for a year or more, the team and the project will eventually crash as they’ve been pressured to work more and more hours and cut more and more corners in the interests of completing more and more points. The accumulation of bugs, small and large, will slow progress. Team fatigue will increase and morale will decrease, resulting in turnover and further delays. In common parlance, this is referred to as a “death march.”

Strictly speaking, velocity is some displacement over time. In the case of scrum, it is the number of story points completed in a sprint. We’ve “displaced” some number of story points from being “not done” to “done.” By itself, a single sprint’s velocity isn’t particularly useful. Looking at the velocity of a number of successive sprints, however, is useful. There are two pieces of information from looking at successive sprint velocities that, when considered together, can reveal useful aspects of how well a team is performing or not. The first is the average over the previous 5 to 8 sprints, a rolling average. As a yardstick, this can provide a measure of predictability. Using this average, a product owner can make a rough calculation for how many sprints remain before completing components or the project based on the story point information in the product backlog.
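As a rough sketch of that calculation, here is how a rolling average and a sprints-remaining estimate might be computed. The velocity history, the backlog size, the six-sprint window, and the function names are all made up for illustration; they are assumptions, not data from any real team.

```python
import math
from statistics import mean


def rolling_average(velocities, window=6):
    """Mean velocity over the most recent `window` sprints.

    The window of 6 is an assumption chosen from the 5-to-8 sprint range.
    """
    recent = velocities[-window:]
    return mean(recent)


def sprints_remaining(backlog_points, velocities, window=6):
    """Rough count of sprints needed to finish the remaining backlog."""
    avg = rolling_average(velocities, window)
    if avg <= 0:
        raise ValueError("average velocity must be positive to forecast")
    # Round up: a partially used sprint still has to be run.
    return math.ceil(backlog_points / avg)


# Hypothetical velocity history (story points completed per sprint).
velocities = [18, 22, 20, 23, 21, 19, 24, 21]
# Hypothetical number of story points left in the product backlog.
backlog_points = 130

print(f"Rolling average: {rolling_average(velocities):.1f} points per sprint")
print(f"Sprints remaining: {sprints_remaining(backlog_points, velocities)}")
```

The result is only as good as the sizing of the stories still in the backlog, so treat it as a starting point for a conversation rather than a commitment.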

The measure of confidence for this prediction would come from an analysis of the variance demonstrated in the sprint velocity values over time. Figures 1 and 2 show the distinction between the value provided by a rolling average and the value provided by the variance in values over time.

Figure 1

Figure 2

In both cases the respective teams have an average velocity of 21 points per sprint. However, the variability in the values over time shows that the team in Figure 1 would have a much higher level of confidence in any predictions based on their past performance than the team shown in Figure 2.
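To make the distinction concrete, here is a minimal sketch that compares two velocity histories with the same average but very different spread. The numbers are invented for illustration and are not the data behind Figures 1 and 2. Using the sample standard deviation (and its ratio to the mean) as the measure of variability is my assumption; other measures would work as well.

```python
from statistics import mean, stdev


def velocity_summary(velocities):
    """Return the mean, sample standard deviation, and their ratio."""
    avg = mean(velocities)
    spread = stdev(velocities)
    return {"mean": avg, "stdev": spread, "cv": spread / avg}


# Hypothetical teams: both average 21 points per sprint.
steady_team = [20, 22, 21, 20, 22, 21]
erratic_team = [8, 35, 14, 30, 12, 27]

for name, history in (("steady", steady_team), ("erratic", erratic_team)):
    summary = velocity_summary(history)
    print(
        f"{name}: mean={summary['mean']:.1f}, "
        f"stdev={summary['stdev']:.1f}, cv={summary['cv']:.2f}"
    )
```

The smaller the spread relative to the mean, the more weight a product owner can put on a forecast built from the average.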

What matters is the trend, each sprint’s velocity over a number of sprints. The steady completion of story points (i.e. work) sprint to sprint is the desirable goal. Another way to say this is that a steady velocity makes it possible to predict project delivery dates. In real life, there will be a variance (up and down) of sprint velocity over time and the goal is to guide the project such that this variance is within a manageable range.

If a team were to set as its goal an increase in the number of story points completed from sprint to sprint then their performance chart might initially look like Figure 3.

Figure 3

Such a pace is unsustainable and eventually the team burns out. Fatigue, decreased morale, and overall dissatisfaction with the project cause team members to quit and progress grinds to a halt. The fallout of such a collapse is likely to include the buildup of significant technical debt and code errors as the run-up to the crescendo forced team members to cut corners, take shortcuts, and otherwise compromise the quality of their effort. [1] The resulting performance chart would look something like Figure 4.

Figure 4

All that said, I grant that there is merit in coaching teams to make reasonable improvements in their overall sprint performance. An increase in the overall average velocity might be one way to measure this. However, to press the team into achieving an order of magnitude increase in performance is a fool’s errand and more than likely to end in disaster for the team and the project.

References

[1] Lyneis, J.M, Ford, D.N. (2007). System dynamics applied to project management: a survey, assessment, and directions for future research. System Dynamics Review, 23 (2/3), 157-189.

Story Points and Fuzzy Bunnies

The scrum framework is forever tied to the language of sports in general and rugby in particular. We organize our project work around goals, sprints, points, and daily scrums. An unfortunate consequence of organizing projects around a sports metaphor is that the language of gaming ends up driving behavior. For example, people have a natural inclination to associate the idea of story points with a measure of success rather than an indicator of the effort required to complete the story. The more points you have, the more successful you are. This is reflected in an actual quote from a retrospective on things a team did well:

We completed the highest number of points in this sprint than in any other sprint so far.

This was a team that lost sight of the fact they were the only team on the field. They were certain to be the winning team. They were also destined to be the losing team. They were focused on story point acceleration rather than a constant, predictable velocity.

More and more I’m finding less and less value in using story points as an indicator for level of effort estimation. If Atlassian made it easy to change the label on JIRA’s story point field, I’d change it to “Fuzzy Bunnies” just to drive this idea home. You don’t want more and more fuzzy bunnies, you want no more than the number you can commit to taking care of in a certain span of time typically referred to as a “sprint.” A team that decides to take on the care and feeding of 50 fuzzy bunnies over the next two weeks but has demonstrated – sprint after sprint – they can only keep 25 alive is going to lose a lot of fuzzy bunnies over the course of the project.

It is difficult for people new to scrum or Agile to grasp the purpose behind an abstract idea like story points. Consequently, they are unskilled in how to use them as a measure of performance and improvement. Developing this skill can take considerable time and effort. The care and feeding of fuzzy bunnies, however, they get. Particularly with teams that include non-technical domains of expertise, such as content development or learning strategy.

A note here for scrum masters. Unless you want to exchange your scrum master stripes for a saddle and spurs, be wary of your team turning story pointing into an animal farm. Sizing story cards to match the exact size and temperament of all manner of animals would be just as cumbersome as the sporting method of story points. So, watch where you throw your rope, Agile cowboys and cowgirls.

(This article cross-posted at LinkedIn)


Image credit: tsaiproject (Modified in accordance with Creative Commons Attribution 2.0 Generic license)

Relative Team Expertise and Story Sizing

In Parkinson’s Law of Triviality and Story Sizing, I touched on the issue of relative expertise among team members during collaborative efforts to size story cards. I’d like to expand on that idea by considering several types of team compositions.

Team 1 is a tight knit band of four software developers represented in Figure 1.

Figure 1 – Team 1

Their preferred domain and depth of experience is represented by the color and area of their respective circles. While they each have their own area of expertise, there is a significant overlap in common knowledge. All four of them understand the underlying architecture, common coding practices, and fundamental coding principles. Furthermore, there is a robust amount of inter-domain expertise. When needed, the HTML5/CSS developer can probably help out with JavaScript issues, for example. The probability of this team successfully working together to size the stories in the product backlog is high.

Team 1 represents a near-ideal team composition for a typical software-related project. However, the real world isn’t so generous in its allocation of near-ideal, let alone ideal, teams. A typical team for a software-related project is more likely to resemble Team 2, as represented in Figure 2.

Figure 2 – Team 2

In Team 2, the JavaScript developer is fresh out of college, new to the company, and new to the business. His real-world experience is limited so his circle of expertise is smaller relative to his teammates. The HTML5/CSS developer has been working for the company for 10 years and knows the business like the back of her hand, so she has a much wider view of how her work impacts the company and product development. As a team, there is much less overlap, and the options for helping each other through a sprint are diminished. As for collaborative story sizing efforts, the HTML5/CSS and C# developers are likely to dominate the conversation while the JavaScript developer agrees with just about anything not JavaScript related.

As Agile practices become more ubiquitous in the business world, team composition begins to resemble Team 3, as shown in Figure 3.

Figure 3 – Team 3

The mix now includes non-technical people: content developers and editors, strategists, and designers. Even assuming an equal level of experience in their respective domains, the company, and the business environment, there is very little overlap. Arriving at a consensus during a story sizing exercise now becomes a significant challenge. But again, the real world isn’t even so kind as this. We are increasingly likely to encounter teams that resemble Team 4, as shown in Figure 4.

Figure 4 – Team 4

As before, the relative circle of expertise among team members can vary quite a bit. When a team resembles the composition of Team 4, the software developers (HTML5/CSS and C#) will have trouble understanding what the Learning Strategist is asking for while the Learning Strategist may not understand why what he wants the software developers to deliver isn’t possible.

When I’ve attempted to facilitate story sizing sessions with teams that resemble Team 4 they either become quite contentious (and therefore time consuming) or team members that don’t have the expertise to understand a particular card simply accept the opinion of the stronger voices. Neither one of these situations is desirable.

To counteract these possibilities, I’ve found it much more effective to have the card assignee determine the card size (points and time estimate) and to have the other team members ask questions about the work described on the card so that the assignee and the team better understand the context in which the card is positioned. The team members that lack domain expertise, it turns out, are in a good position to help craft good acceptance criteria. Questions like these help:

  • Who will consume the work product that results from the card? (dependencies)
  • What cards need to be completed before a particular card can be worked on? (dependencies)
  • Is everything known about what a particular card needs before it can be completed? (dependencies, discovery, exploration)

At the end of a brief conversation where the entire team is working to evaluate the card for anything other than level of effort (time) and complexity (points), it is not uncommon for the assignee to reconsider their sizing, break the card into multiple cards, or determine the card shouldn’t be included in the sprint backlog. In short, it ends up being a much more productive conversation if teammates aren’t haggling over point distinctions or passively accepting what more experienced teammates are advocating. The benefit to the product owner is that they now have additional information that will undoubtedly influence the product backlog prioritization.