
Afterwood

Overload Journal #140 - August 2017 + Programming Topics   Author: Chris Oldwood
Chris Oldwood reminds us to fix the problem, not to blame.

As a child, I struggled with honesty; often shooting from the hip meant I never really considered the consequences of my actions, and so I found myself facing the ire of my parents more frequently than I should have. It was never anything serious, but I began to find it easy to explain why anything that went wrong was never entirely my fault. I suspect the innocence of childhood also gave me the benefit of the doubt more often than I deserved. Fortunately, my parents did a good enough job that my moral compass ensured I never strayed far from the straight and narrow.

Software development has always appeared more forgiving than many other jobs. In my early days as a programmer there were a million different reasons why things never went smoothly. For example, the very environment we worked in was brittle – you would routinely have to restart 16-bit Windows and pray your hard disk was still intact. I’ve lost many hours trying to debug even user-mode applications that would cause the earlier versions of Windows to crash. Then you had compilers with code generation bugs, which you needed to work around by writing the same code in a different way. And then there’s the language itself, with its many traps and pitfalls for the unwary programmer that resulted in ‘undefined behaviour’ (UB), which you often found out about the hard way. Add to this the ambiguities of natural language leading to poorly specified requirements and you’ll find it’s fairly easy to avoid having the finger pointed directly at you as the root cause of most problems.

My safe little bubble eventually burst one afternoon courtesy of a support incident at a large financial organisation. I was genuinely surprised when a colleague quietly asked me why I had just rebooted our system’s main servers in the production environment. He confirmed that he’d double checked the security logs and, yes, my login was in there as being the instigator of the machine restarts, which seemed pretty conclusive. It didn’t take long before I realised what had happened and how the mistake had been made. Yes, I had a pretty good clue where things had gone wrong but ultimately the mistake was mine. My stomach churned as I waited for the fallout. You often hear tales about how people have been marched straight off the premises for misconduct and can’t help but wonder if there's a grain of truth somewhere in those friend-of-a-friend stories. Being a contractor and fairly new to the company didn’t feel like it was exactly going to help my cause either.

Of course, nothing happened. In retrospect, my mistake was insignificant compared to the many others that occurred around me and, whilst there was a loss of service as everything slowly came back up and recovered, the actual loss to the business was probably less than the time taken to work out what it would have cost. Naturally the first code change I made straight after was to my custom admin tool so that machine names starting with a P and D were more easily discernible – bright red for the former, something contrasting for the latter. Oh, and it popped up a warning message too, for good measure.

The notion of holding a post-mortem is not a new one, although the emphasis on it being a blameless post-mortem seems to have gained more recognition in recent years. My earliest recollection of the idea of post-mortems (outside the medical ones on TV shows like Quincy) was through the embedded software column in Dr. Dobb’s Journal, written by Ed Nisley. NASA had started making their post-mortems publicly available and so Ed would publish extracts in his column along with some additional commentary. I’ve never worked in that kind of environment but certainly marvelled at the ingenuity that working within such constraints creates. It would be easy to laugh at some of the extremely costly and yet seemingly trivial mistakes they’ve made in the past, but the main take-away for me was always that if the super-smart people at somewhere like NASA couldn’t get this software development lark right all the time, then what chance did I stand?

Reading about the recent major outage at a British airline and the apparent scapegoating of a system administrator reminded me of some of my own little mishaps and how they had been dealt with. I’ve clearly been fortunate enough to have worked within teams where the level of trust and respect both within the team and around it are sufficiently high that the occasional mistake is dealt with appropriately. One can only assume that senior managers and shareholders are after a scalp when something of that magnitude goes awry and therefore it takes real courage to stand up and blame the process that let this happen rather than single out the person who was likely the victim of a weak process.

The prime directive, which is read out at the start of a team’s retrospective, is a very clear statement which attempts to make the team comfortable so they can get down to the business of improving the process they use, rather than blaming the people themselves for their actions. Martin Fowler has suggested in the past (based on the work of Pat Kua) how important reading this statement has become, as repeatedly hearing it could potentially change the culture of the team through the psychological effect known as priming.

This also ties in very nicely with ‘psychological safety’, which Google’s recent Project Aristotle [NYTimes] managed to bolster with some qualitative data. This is nothing entirely new though, as ‘safety’ also features in the lower layers of Maslow’s Hierarchy of Needs, which he published back in the 1940s and which has no doubt been the subject of many other research projects in the intervening years. What caused Project Aristotle to surface on my (and many other programmers’) radar was no doubt down to the workforce studied.

The TL;DR of that research appears to be that we are at our happiest when we work with nice people, although I’m sure it’s highly disingenuous to try and distil it to such a simplistic outcome. I don’t think I’ll ever truly get over the small amount of fear I feel when administering a production system, but I also think that may be a healthy attitude, to some degree, to ensure I’m not reckless through complacency. Either way, as long as any fear we do feel is born of our own self-restraint, and not of a weak process and the retribution that follows, then our productivity will remain at its highest.

Reference

[NYTimes] https://www.nytimes.com/2016/02/28/magazine/what-google-learned-from-its-quest-to-build-the-perfect-team.html
