Best Practices vs Witch Hunts

By Sergey Ignatchenko

Overload, 23(125):4-6, February 2015


Best practices can be a Good Thing. Sergey Ignatchenko considers when they can turn into Really Bad Things.

Disclaimer: as usual, the opinions within this article are those of ‘No Bugs’ Bunny, and do not necessarily coincide with the opinions of the translators and Overload editors; also, please keep in mind that translation difficulties from Lapine (like those described in [Loganberry04]) might have prevented an exact translation. In addition, the translator and Overload expressly disclaim all responsibility for any action or inaction resulting from reading this article.

In any field, software development included, there are lots of well-known and less-known best practices. As a rule of thumb, best practices are very useful and in general should be followed. Unfortunately, all too often developers start to take best practices as gospel, and/or become obsessed with them. This turns best practices into outright witch hunts. In this article, we’ll discuss the mechanism behind witch hunts, and why they’re Really Bad Things.

Life cycle: from best practice to witch hunt

Let’s see how a best practice usually evolves (YMMV, but the picture is usually rather similar):

  1. A few teams start (usually independently) using a certain practice within their projects.
  2. The practice becomes more widespread.
  3. Somebody publishes the practice as a ‘rule of thumb’, usually with the rationale behind it, and with certain applicability exceptions noted.
  4. The practice is officially named a ‘best practice’ by some authoritative source. The rationale is usually still present, but some of the applicability exceptions are often lost for lack of space.
  5. The practice becomes pretty much universal. At the same time, both the rationale and the applicability exceptions are gradually forgotten.
  6. Then some people (‘zealots’) emerge, who think that the Earth will stop turning (read: the project is hopelessly deficient) if there is even one single case of the best practice being violated. Neither the rationale nor the applicability exceptions are taken into account.
  7. When such a ‘zealot’ lays his hands on a real-world project, he starts to enforce the practice mercilessly.
  8. At first, the ‘zealot’ normally starts by eliminating the least controversial of the best practice violations. Among other things, this means that at this stage he usually eliminates violations in cases where the rationale does apply and the applicability exceptions do not. As a result, his efforts at this stage tend to benefit the project in general, which is usually more or less obvious to the whole team. This fosters the natural feeling that the ‘zealot’ is right in his fight for the best practice ideals.
  9. As a natural consequence, only a few people dare to challenge the ‘zealot’ in his further fight for the best practice. Also, a substantial chunk of the team starts to believe that this specific best practice is an Absolute Good Thing. And this is the point where the best practice effectively becomes a witch hunt.
  10. Applying this specific best practice is never questioned, and both the rationale and the applicability exceptions are completely ignored. In the name of this best practice, pretty much anything can be done within the team; in particular, this includes violating any other best practices (especially those lacking their own ‘zealots’ in the team). All kinds of weird things can and will happen in such teams (for practical examples, see below). As an additional ‘benefit’, if there are two best practices with ‘zealots’ behind them in the same team, any conflict between these best practices can result in a ‘holy war’.

Wait, but it is still a best practice, isn’t it?

A best practice is useful only when both its rationale and its applicability are carefully taken into account. Without them, applying the best practice often leads to conflicts with other best practices, which in turn can easily lead to grave consequences.

What can possibly go wrong when applying a best practice?

Naming something a ‘best practice’ doesn’t really change its impact on a project. As with pretty much anything out there, and despite the name, best practices are not absolute virtues. Rather, they are rules of thumb, with all kinds of considerations which need to be taken into account. The very name ‘best practice’ is misleading: a more appropriate name would be ‘usually a best practice’. Below are a few examples of best practices whose application went wrong at some point.

Example 1. Mild example – magic numbers

The use of ‘magic numbers’ (unnamed numerical constants) is generally frowned upon in the programming community. The associated best practice is to replace them with named constants. This best practice exists for two good reasons.

Rationale:

  1. Better readability
  2. Better maintainability in case the constant changes

So far, so good. But recently I ran into a discussion where the author asked whether, in the code

  kbytes = bytes / 1024;

1024 is a ‘magic number’ that should be replaced with a named constant BYTES_PER_KBYTE . Moreover, there were quite a few people saying that 1024 is indeed a ‘magic number’ which should be replaced. However, I contend (and most of the practical programmers I’ve spoken to about it agree) that this is wrong, and that 1024 should be left as is (unless an exception applies, as described below).

To see why applying this best practice is wrong here, one should take a look at the very reasons behind it. In our case, neither of the reasons for the best practice applies. First, bytes / 1024 is more readable than bytes / BYTES_PER_KBYTE , and second, the chances of BYTES_PER_KBYTE ever changing are infinitesimally small (OK, unless you’re into nitpicking between kilobytes and kibibytes, and are planning to change the semantic meaning of the term KBYTE in the context of your project, which would be a Very Bad Thing per se).

This means that BYTES_PER_KBYTE is not a good thing to have, and plain 1024 is generally preferable. However, as we are speaking about applicability, this observation is not an absolute either, and there are possible exceptions. One such exception arises if you need a consistent way of displaying byte counts to the end-user. In this case, a constant such as BYTES_TO_DISPLAY_KBYTES might make sense (it may change later, which justifies its existence), but the generic BYTES_PER_KBYTE is still pretty much useless.
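To make the distinction concrete, here is a minimal C sketch (the display function and its name are invented for illustration; only the constant names come from the discussion above):

  #include <stdio.h>

  /* How sizes are shown to the end-user is a genuine project decision
     which may change later (say, to 1000), so a named constant is
     justified here: */
  #define BYTES_TO_DISPLAY_KBYTES 1024

  void display_size(unsigned long bytes)
  {
      printf("%lu KB\n", bytes / BYTES_TO_DISPLAY_KBYTES);
  }

  /* ...while in a plain unit conversion, 1024 itself is the clearest
     possible name, and BYTES_PER_KBYTE would only obscure it: */
  unsigned long to_kbytes(unsigned long bytes)
  {
      return bytes / 1024;
  }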

OK, this was a very mild example of a witch hunt, with very mild negative implications.

Example 2. Stronger example – memory leaks

Those dealing with C/C++ are familiar with memory leaks [1]. A common best practice is to avoid them at all costs.

Rationale:

  1. To avoid using unnecessary memory which can exhaust computer/process resources

Applicability Exception:

  1. All memory allocated via malloc() will be freed on program exit anyway, so calling free() right before program exit is not necessary.

One of the common methods for detecting memory leaks is to use a tool (such as Valgrind for Linux, or the debugger’s built-in facilities for Windows) which runs on program exit and lists all the blocks which were allocated via malloc() but never deallocated via free() . It is indeed a very useful kind of tool, and it is a very good practice to run it and eliminate as many of the reported memory leaks as possible. However, should we always aim to eliminate all of them?
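For reference, under Linux such a check is typically run along the lines of (the program name is of course just a placeholder):

  valgrind --leak-check=full ./myprogram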

More than once in my programming career, I have needed to allocate a once-per-program buffer (for example, for logging purposes) which was used by all the program’s threads. When I faced this task for the very first time, 20 or so years ago, I tried to deallocate this buffer properly on program exit. However, in some weird scenarios [2] deallocating it caused race conditions, which, once in a hundred or so exits, caused a program crash (a thread writing to an already-deallocated buffer) for no real reason.

At the time, I spent almost two weeks trying to fix this elusive problem (which kept reappearing in a new form after each fix). It continued as a kind of weird ding-dong battle until I asked myself: what would change if I didn’t deallocate this buffer at all? The whole rationale for fighting memory leaks doesn’t really apply when we’ve already decided to exit the program, as in a few microseconds it will all be freed anyway (and in a much more efficient manner, BTW). As soon as I realized this, the problem went away for good, and I was able to throw away a few dozen rather weird lines of synchronization code which I had written while trying to fix the problem by other means.
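A minimal sketch of the resulting pattern (all names invented for illustration):

  #include <stdlib.h>

  /* Once-per-program buffer, shared by all threads for the whole
     process lifetime (e.g. for logging) */
  static char *g_log_buffer;

  void log_init(size_t size)
  {
      g_log_buffer = malloc(size);
      /* Deliberately no matching free() at exit: the OS reclaims the
         memory anyway, and freeing the buffer while some thread might
         still be writing a last log entry is exactly the race
         described above. */
  }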

In this example, I myself was guilty of witch hunting, though I humbly ask the sentencing judge to consider my inexperience at that point as a mitigating circumstance.

Our next example is more severe, and involves breaking code as a result of a witch hunt.

Example 3. Breaking code while fighting -Wall

This one I’ve seen myself more than once. In some project, there is perfectly valid C code which involves some implicit conversions (such as unsigned-to-signed conversions of the same size, or well-defined integer truncations/promotions). All these conversions are perfectly well-defined in C, and lead to perfectly well-defined results both in theory and in practice. The code is reasonably readable too. In short, there are no (zero, nada) problems with the code whatsoever.

Then a ‘zealot’ comes in and adds -Wall to the compiler command line. Right away, the compiler starts to complain about these conversions (why the compiler does so is beyond me, but that is out of scope here). These warnings are overzealous, as the code is well-defined and standard-compliant.

Then our ‘zealot’ starts to get rid of these overzealous warnings – not by disabling them, and not even by thoughtfully adjusting types so that there are fewer conflicts, but by merely inserting explicit casts as he sees fit. He needs to insert dozens of such casts, and he doesn’t pay much attention to what the-code-he-changes really does (he’s on the job of eliminating warnings, he has much more to do, and anything else is merely an obstacle on his way to achieving the Greater Good of -Wall without warnings).

Sooner or later he makes a mistake when inserting a cast, and the previously perfectly valid code becomes broken. Even worse, the code might be broken only in fringe cases, which may go unnoticed for a while. Not to mention that code with tons of explicit casts becomes much less readable. I rest my case and ask the jury to sentence our ‘zealot’ to a life of abstinence from programming.
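As an exhibit for the jury, here is a typical way such a ‘fix’ goes wrong (a hypothetical sketch, not taken from any specific project):

  #define CHUNK_SIZE 4096

  int count_chunks(long file_size)  /* file_size may exceed INT_MAX */
  {
      /* Original code: the compiler may warn about the implicit
         long-to-int conversion, but the chunk count is known to fit
         into an int, so the code is correct: */
      int chunks = file_size / CHUNK_SIZE;

      /* The careless ‘fix’ puts the cast around the wrong expression,
         truncating file_size BEFORE the division – broken for files
         larger than INT_MAX bytes: */
      chunks = (int)file_size / CHUNK_SIZE;

      /* If an explicit cast is wanted at all, it belongs around the
         whole expression: */
      chunks = (int)(file_size / CHUNK_SIZE);
      return chunks;
  }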

Now to our next, and final, example of witch hunting.

Example 4. The ultimate example – Debian RNG disaster

Once upon a time, there was a library named OpenSSL. As a part of it, there was a random number generator, which in turn had two lines of code that used uninitialized memory. This wasn’t a real problem, as the whole idea was to gather as much entropy as they could get their hands on, and uninitialized memory couldn’t possibly hurt.

Then an overzealous Debian supporter, in a holy fight to eliminate all Valgrind warnings, commented out these two lines of code [Mason08] [Schneier08] [Kroll13]. Nothing changed, except that all the keys generated by all the Debian-based distributions (yes, this does include the all-popular Ubuntu) became easily guessable until the bug was fixed (which took over two years). Worse than that, when the problem was discovered, it meant that all the SSL exchanges between such Debian-based distributions suddenly became retrospectively vulnerable (i.e. if somebody had recorded them, he’d be able to decrypt them now, based on the nature of this vulnerability).
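In outline, the pattern looked something like the following sketch (heavily simplified, with invented type and function names – emphatically NOT the actual OpenSSL source; see [Cox08] for that):

  #include <stddef.h>

  typedef struct entropy_pool entropy_pool;  /* hypothetical type and */
  void read_os_entropy(unsigned char *buf, size_t len);  /* helpers  */
  void mix_into_pool(entropy_pool *pool,
                     const unsigned char *buf, size_t len);

  static void add_randomness(entropy_pool *pool)
  {
      unsigned char buf[256];  /* deliberately left uninitialized: any
                                  garbage it holds can only ADD entropy */
      read_os_entropy(buf, sizeof buf);      /* may fill only part of buf */
      mix_into_pool(pool, buf, sizeof buf);  /* reading uninitialized
                                                bytes here is what Valgrind
                                                flagged; commenting such a
                                                call out discarded the real
                                                entropy too */
  }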

The whole incident was, at the time, the worst security disaster in the history of the open source community, and it resulted precisely from an overzealous developer trusting tools and making changes to eliminate warnings without understanding either the context or the potential implications. [3]

It is interesting to note that when the disaster was analyzed, all kinds of explanations (read: excuses) were given for it, including the outright ridiculous “The OpenSSL code was too clever by half” [Cox08]. Very few sources (if any) have mentioned the real culprit: an obsession with the enforcement of best practices, which unfortunately too easily becomes a witch hunt. BTW, I’m not trying to condemn the person who made the change, but rather the whole culture in which such witch hunts (without taking related circumstances into consideration) are considered the Right Thing To Do.

I’m scared, what should I do?

There are at least two lessons to be learned from this article. The first one is:

Whenever you’re in doubt about applying a best practice – think about the rationale behind it. If the rationale doesn’t apply in your case, forget about the best practice: it doesn’t apply here.

The second lesson applies to much more dangerous witch hunts within existing code:

Whenever there is the slightest risk that by enforcing a best practice on existing code you’ll break it – think more than twice before going ahead with your change. And no, the fact that the change is just commenting out two lines, with no apparent downsides, does not guarantee that it is safe.

Merciless refactoring is good; merciless refactoring without a clue about what you’re doing is not.

Acknowledgement

Cartoon by Sergey Gordeev from Gordeev Animation Graphics, Prague.

References

[Cox08] Russ Cox, ‘Lessons from the Debian/OpenSSL Fiasco’, http://research.swtch.com/openssl

[Kroll13] Joshua Kroll, ‘The Debian OpenSSL Bug: Backdoor or Security Accident?’, https://freedom-to-tinker.com/blog/kroll/software-transparency-debian-openssl-bug/

[Loganberry04] David ‘Loganberry’, Frithaes! – an Introduction to Colloquial Lapine!, http://bitsnbobstones.watershipdown.org/lapine/overview.htm

[Mason08] Justin Mason, ‘Serious Debian/Ubuntu openssl/openssh bug found’, http://taint.org/2008/05/13/153959a.html

[NoBugs12] ‘No Bugs’ Bunny, ‘Memory Leaks and Memory Leaks’, http://accu.org/index.php/journals/1936

[Schneier08] Bruce Schneier, ‘Random Number Bug in Debian Linux’, https://www.schneier.com/blog/archives/2008/05/random_number_b.html

  1. While those writing in garbage-collected languages may not be familiar with memory leaks, and may even think that there can be no memory leaks in a JVM, this is wrong – at least because of so-called semantic memory leaks [NoBugs12].
  2. AFAIR, most of the trouble was about threads not being terminated for various reasons; this does happen when dealing with network stuff, where you don’t want your user to wait forever for no apparent reason.
  3. The argument that he did try to ask the OpenSSL team but asked in the wrong place, and the question of who was responsible for it happening (which resulted in lots of finger-pointing between Debian and OpenSSL at the time), are beyond the scope of this article.




