Transparency, Source Code Quality, and Metrics

I’ve been reading “Hello World: Being Human in the Age of Algorithms” by Hannah Fry. She relates this story:

In 2012, a number of disabled people in Idaho were informed that their Medicaid assistance was being cut. Although they all qualified for benefits, the state was slashing their financial support – without warning – by as much as 30 per cent, leaving them struggling to pay for their care. This wasn’t a political decision; it was the result of a new ‘budget tool’ that had been adopted by the Idaho Department of Health and Welfare – a piece of software that automatically calculated the level of support that each person should receive.

Unable to understand why their benefits had been reduced, or to effectively challenge the reduction, the residents turned to the American Civil Liberties Union (ACLU) for help.

[The ACLU] began by asking for details on how the algorithm worked, but the Medicaid team refused to explain their calculations. They argued that the software that assessed the cases was a ‘trade secret’ and couldn’t be shared. Fortunately, the judge presiding over the case disagreed. The budget tool that wielded so much power over the residents was then handed over, and revealed to be – not some sophisticated AI, not some beautifully crafted mathematical model, but an Excel spreadsheet.

Within the spreadsheet, the calculations were supposedly based on historical cases, but the data was so badly riddled with bugs and errors that it was, for the most part, entirely useless. Worse, once the ACLU team managed to unpick the equations, they discovered ‘fundamental statistical flaws in the way that the formula itself was structured’. The budget tool had effectively been producing random results for a huge number of people. The algorithm – if you can call it that – was of such poor quality that the court would eventually rule it unconstitutional.

My first thoughts were, “How bad a spreadsheet hack do you gotta be to have your work be declared unconstitutional? And just how many hacks does it take to build an unconstitutional spreadsheet?”

To be fair, math is hard. Government is complex. And I’m comfortable with the assumption that everyone who had a hand in building this spreadsheet had good intentions. Venturing a guess, the breakdown happened at the manager/politician/lawyer level.

It is probable that the complexity of the task quickly overtook the abilities of the spreadsheet author(s) and the capabilities of the tool. Eventually, no single person understood how the whole thing worked. Consequently, making a change in one place affected how the spreadsheet worked in n other places and no one was capable of regression testing the beast. But the manager/politician/lawyer types knew what to do: Hide behind the “trade secret” smoke.

There are many lessons from this story. Plenty of points of failure. What I’m interested in writing about is the importance of transparency and how a good set of performance metrics can help in maintaining transparency.

The externally facing opacity in this story is readily apparent. What we don’t (and probably never will) see is the lack of transparency prevalent internally to the Idaho Department of Health and Welfare and whoever designed and built the spreadsheet tool. I’d bet a round of drinks that neither has heard of Agile, much less employed its principles and practices. These by themselves, when actually practiced long term, go a long way toward establishing a culture of transparency. This is the key. Long term practice. A period of time is needed to change behaviors, mindsets, attitudes, beliefs, and when necessary, personnel. Even over the long term, implementing an Agile methodology isn’t improvisational theater. A strategy and a way to measure progress are needed.

Which gets me to metrics.

Selecting metrics and tuning them over time is critical to measuring team performance and developing improvement plans. Metrics that inform meaningful actions are the goal. Leave the vanity metrics that verify what managers want to hear or already “know” to the competition.

I’ve encountered my share of overly complex ways to measure the performance of individuals and teams. Often the metrics taken from machine-like task work (for example, assembly line work) are applied to creative or intellectual/knowledge tasks. This type of re-purposing results in, for example, counting lines of code or the number of source code check-ins as an indicator of software developer productivity. It never ends well.

When working to define a set of metrics to track an individual’s or a team’s performance, it is more effective to begin by asking several questions.

  • What problems are you trying to solve?
  • What questions will your chosen metrics answer?
  • What questions will your chosen metrics not answer?
  • How, specifically, will you know you can trust your metrics? How will you know when they are right and how will you know when they are wrong?
  • How well do your metrics complement each other? That is, by combining them do you end up with a much better picture of individual or team performance than you do by considering individual metrics?
  • Do your metrics support any planned actions for improvement? Are you collecting actionable metrics or vanity metrics?

Finally, it is important to understand the limits of performance metrics. Displaying velocity charts that have fractions of story points implies an accuracy that simply isn’t there. Significantly adjusting project timelines based on the first three sprints’ worth of velocity data can have adverse secondary effects on the project.
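To make the point about false precision concrete, here is a minimal sketch in Python. It is only an illustration under assumptions: the function name `velocity_summary`, the choice of one sample standard deviation as the "range," and the sprint numbers are all hypothetical, not taken from any real project or tool. The idea is to report velocity as a whole number with a range around it, so a three-sprint sample looks as shaky as it is.

```python
from statistics import mean, stdev


def velocity_summary(points_per_sprint):
    """Return (whole-point average, (low, high)) for a list of sprint totals.

    The range is the average plus or minus one sample standard deviation,
    rounded to whole story points. This is an assumed convention for
    illustration, not a standard Agile formula.
    """
    if len(points_per_sprint) < 2:
        raise ValueError("need at least two sprints to estimate a range")
    avg = mean(points_per_sprint)
    spread = stdev(points_per_sprint)
    low = max(0, round(avg - spread))  # velocity cannot be negative
    high = round(avg + spread)
    return round(avg), (low, high)


# Hypothetical story points completed in a team's first three sprints.
first_three = [18, 31, 24]
average, (low, high) = velocity_summary(first_three)
print(f"velocity about {average} points, plausible range {low} to {high}")
```

With so few data points the range is wide relative to the average, which is a reasonable argument against both charting fractional story points and re-planning a project timeline this early.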

There is no perfect set of metrics, no divine set of measures that match an impossible standard of perfect objectivity and fairness. The best possible set of metrics is one that supports useful decisions rather than simply instructs managers where to apply the stick. They should help show the way to performance improvement rather than simply report results.

I work to have 3-5 metrics, depending on the individual, the team, and the project. Fewer than 3 and the picture starts to look rather flat. More than 5 and the task of performance monitoring can become overly complicated and cumbersome. Keep it lean and manageable. That way, it’s easier to tell when things aren’t working and your metrics are much less likely to violate your team’s constitutional rights.

Haiku #19

Are you here or there?
Working, in isolation.
The sprints continue.

How to Develop a Team Identity with A/B Testing

If you’ve ever been fitted for prescription glasses, you’ve no doubt had the experience of the eye exam where the doctor flips between different lens strengths and asks “Is this better or worse than before?” It’s basically A/B testing.

This came to mind after reading a research paper authored by Dan Gilbert and Jane Ebert [1] and listening to Gilbert’s TED Talk, “The surprising science of happiness.” The key bit, as described by Gilbert:

Let me first show you an experimental paradigm that’s used to demonstrate the synthesis of happiness among regular old folks. This isn’t mine, it’s a 50-year-old paradigm called the “free choice paradigm.” It’s very simple. You bring in, say, six objects, and you ask a subject to rank them from the most to the least liked. In this case, because this experiment uses them, these are Monet prints. Everybody ranks these Monet prints from the one they like the most to the one they like the least. Now we give you a choice: “We happen to have some extra prints in the closet. We’re going to give you one as your prize to take home. We happen to have number three and number four,” we tell the subject. This is a bit of a difficult choice, because neither one is preferred strongly to the other, but naturally, people tend to pick number three, because they liked it a little better than number four.

Sometime later — it could be 15 minutes, it could be 15 days — the same stimuli are put before the subject, and the subject is asked to re-rank the stimuli. “Tell us how much you like them now.”

The result was that their previous #3 was ranked as #2 and their previous #4 was ranked as #5. This reflects what Gilbert calls “synthetic happiness.” Having been denied their #1 and #2 choices, experiment participants were forced to “settle” for a lesser choice. However, having made the choice, they increased their preference for the lesser choice and thereby synthesized happiness with that choice. Just as interesting, the previous #4 choice was pushed further down the scale, as if to put some distance between it and the previous #3 choice, in effect distinguishing the decision to take home #3 as clearly the better choice.

All this gave me an idea for something to try with a team I’ve been working with that needed to rehabilitate their team identity into something healthier. Typically, teams sour on the idea of going through an exercise like this. The team I was working with was no exception. They likened it to defining team goals, a largely tedious and uninspiring chore.

I wanted to know if I could present two possible team identity statements, A/B style, of which one would be clearly undesirable and the other more in line with what I suspected the team might be comfortable with. The A/B presentation would keep this simple (presenting a selection of six team identity statements, as in the experiment with pictures described by Gilbert, would be a non-starter).

Offering a choice should compel them to choose one over the other. I’m counting on their brains to do what brains do. When faced with a choice, they make one. If I were to present them with a single identity statement and ask “How would you like to change this to be more in line with the identity you want?”, I’ve every confidence the room would be filled with silence.

The very first presentation had a blank page on the left and my intentionally lame and inaccurate team goal on the right.

The team was well aware “no goal” wasn’t an option and wouldn’t reflect well on their performance review with HR and management. My theory was that when faced with an empty goal and one that was inaccurate, they’d suggest something, however minimal, that was an improvement on the initial goal. This is what happened and the team then spent a few minutes tuning the goal into something a little less cringe-worthy. This began the process of converting the goal from the scrum master’s goal to the team’s goal.

Then I deliberately let a week or more pass.

On the next presentation, the goal on the left was the goal they chose and tuned previously. The second choice was similar but contained one or two slight modifications intended to move the team’s identity in a more positive and healthy direction. Over the course of several months I tested, A/B/Eye Exam style, numerous team goals. “Which goal do you prefer, the one on the right or the one on the left?”

So we had a start. From here on out it was just a matter of improvement. Keying off of things the team said or did, I’d modify the “accepted” goal and present it as an option at the next opportunity.

The key or driver in this approach, the hypothesis goes, is to set it up so that the team makes the decisions rather than having something foisted upon them. They are virtually guaranteed to reject or strongly resist the latter. With the former, they have ownership in the decision. To reject their decision is to say, in essence, that they made a bad or wrong choice, a bad or wrong decision. In general, people don’t like to admit such a thing, so they stick with a decision, for better or worse, if it’s a decision they made and are responsible for.

Update (2020.04.13)

Another important element in play with this approach is the anchoring cognitive bias, particularly early on. People are much more comfortable making comparisons between things than they are with coming up with something original. By presenting a blank goal and one that reflects a direction in which I want the team to move – from nothing to something positive – the hypothesis is that the team will assimilate toward more positive goals and that this assimilation will become self-reinforcing over time.

References

[1] D. T. Gilbert, J. E. J. Ebert (2002) Decisions and Revisions: The Affective Forecasting of Changeable Outcomes, Journal of Personality and Social Psychology, Vol. 82, No. 4, 503–514

 

Photo credit: Max Pixel