We are taught that adding comments to code is a good thing and adds value. In practice this value is seldom, if ever, realised. Mark Easterbrook makes the case for a better way.
At the ACCU 2006 conference, the question was asked "Is the total value of all the comments in all of the code out there less than or more than zero?" [ 1 ] If the answer is less than zero, which judging from participants' reactions and war stories is quite possibly the case, then the ability to put comments in source code has had a negative impact on the programming community, and is therefore A Bad Thing™.
Learning from the experts
Why do programmers put comments in code? To answer this we need to look at how the language is taught, so I took a number of books by the most respected authors in the industry that are aimed at beginners or start from the basics, and looked to see how these experts recommend beginners use comments, and they all had something along the lines of:
i++; /* Increment i */
So, if all these highly recommended books are consistent on what a code comment should look like, it must be right, right? It is a brave programmer to tell these esteemed writers that the code samples they use in their books are poor examples. Taking one particular example at random, on page 24 of The C++ Programming Language, 3rd Ed , by Bjarne Stroustrup we find the following example of how to write a C++ function:
void some_function() // function that doesn't return a
                     // value
  {
    double d = 2.2   // initialise floating-point
                     // number
    int i = 7;       // initialise integer
    d = d+i;         // assign sum to d
    i = d*i;         // assign product to i
  }
Thus Bjarne's book is teaching single character variables are good and comments should tell the code reader what the code is doing. Much much later on page 138 Bjarne does describe better use of comments, but by this time most beginners have given up reading and are out there hacking code with looming deadlines. Even those who do read more of the book have what they learned on page 24 reinforced with "[indent style]...I have my preferences, and this book reflects them. The same applies to comments". That's right, just copy Bjarne's "preferable" commenting style from his book. (Note: Before you email me, I have read to the end of the section, my point is that not everyone bothers to read all the way to the end.)
Examples of this Good Practice™ of commenting can be found everywhere. One example from some telecoms software had, in revision 1.1 in CVS:
channel++; //increment channel
The good thing about this line was that both the code and the comment agreed (experience shows this is quite often not the case). The bad thing about this code is that they are both wrong, the problem being that the original coder didn't understand the difference between channel and timeslot [ 2 ] and chose the wrong variable name. Fortunately, when he (or someone who came along later) realised the mistake he used a snazzy re-factoring tool that avoids the problems with global "find and replace" and understands the code enough to only change the relevant variable name and not other occurrences of the string or the same name in another scope. The code now reads:
timeslot++; //increment channel
You might ask why the re-factoring programmer didn't change the comment to match and it was because he wasn't looking at that particular line, but at this one:
  uint32_t channel; // I think this is probably
                    // timeslot, not channel.
The first person to come along realised the poor choice of variable name and "fixed" it by adding a comment. The second decided it would be better to change the variable name. Obviously, the second programmer believes in saying it in the code and disregarding comments, because the changed line now reads:
   uint32_t timeslot;  // I think this is probably
                       // timeslot, not channel.
It passed programmer testing, because the test suite only tested the code and disregarded the comments. It then passed code review with only an observation because it had already been tested and checked in, and the company policy was not to modify the source file after testing unless a potential customer visible fault is found. The rationale being that modifying the comment modifies the file, forcing a rebuild and requiring a re-test [ 3 ].
"If the code and comments disagree, both are probably wrong".
Norm Schryer
Better comments
The problem with all these code gurus is that they are code gurus, and comments are not code. If they stuck to teaching code and left teaching comments to comment gurus we would get much better comments. Or would we? I trawled various sources for good practice in commenting and found the following pieces of advice:
- Say why and not how.
- Meaningful.
- Say it in the code when you can, and in the comments when you must.
- Only use comments where the code is non-obvious and/or non-portable.
- Don't write comments that repeat the code.
- Do write illuminating comments that explain approach and rationale.
This is not a complete list, but a representative sample. The last two are from C++ Coding Standards [ 4 ] where Herb and Andrei dedicate only 4 lines of their book to comments. This sends the message that the authors do not regard good comments as a subject to spend much time on.
I have stolen a good example of someone following the above guidelines from Roger Orr [ 5 ]:
    // Clear collection before we start
    myCollection.empty();
The code by itself would hopefully be spotted at code review because the reader would have to think about why there is a statement with no side-effects, and make sure it fails the review. Add in the comment following the above guidelines and the "why?" is obvious so the reader doesn't have to ponder the code and moves on to more important things. Would this code be better with or without the comment? Would the coding error be more obvious without the comment?
What is a comment?
If you ask a group of programmers what they understand by "comment" you will almost certainly start an argument. This will not be quite as heated or frequent as "vi or emacs", but neither will it be a minor disagreement. The descriptions I have heard vary from "line comments" - the "block headers are documentation not comments" school of thought, through to "everything the compiler ignores", bringing in commented-out code and all #if 0 blocks into the definition. I suspect that more than a few out there will include comments that aren't even in the source code file (CVS check-in comments anyone? RCS maybe?).
And then there is self-documenting code: If comments are a form of documentation, and a programmer writes self-documenting code, surely the code is then a "comment". Just because the compiler reads it does not mean that it is checked for correctness - the compiler does not care if a variable is called channel , timeslot , or foo - variable names are for the human reader, not the machine, and thus are a type of commenting!
Mandating useless comments
Almost every code generating place I have worked has some form of mandating comments in code, either by having a coding standard which is selectively checked at code review (and comments are always selected, sometimes the only thing selected), or just going straight to a tick-box list somewhere that has "adequate comments" as a line item. Adequate by quantity is easy to check. Adequate by quality is not so easy. Most reviews choose the path of least resistance and go for a quantity check.
I know that the misguided individuals who write coding standards and code review checklists don't intend to mandate useless comments: the desire is to promote good practice. However, programmers are only human (for the sake of this argument I'm assuming this is the case, you may know some exceptions) and will take the easiest path to conformance - which is to follow what is actually mandated rather than what is intended to be mandated.
The most useless mandated comments in the wild are block headers. Some development teams even put an empty block header into an easily accessible file somewhere (they call it a skeleton header or some other similar excuse) so it can easily be copied and pasted into every source file. Mostly these headers remain as function separators making sure each function or procedure is kept away from the adjacent one by lots of screen real estate so you cannot ever see more than one on the screen at once.
Sometimes the cut-and-pasters go that bit further and fill out the fields in the block header. Unfortunately these are typically:
- Name of function - Don't copy and paste from the function name a few lines down, type it in to increase the chance of error.
- Purpose of function - Take the function name and add spaces between the words (and maybe put back the vowels and other letters if they are missing).
- Inputs - Another name for the argument list.
- Outputs - Another name for the return type.
- Algorithm - Pseudo code, which means the code written in the language you would prefer to use rather than the language that the compiler understands.
- Comments or Remarks - Your chance to demonstrate your wit and charm, or lack of it, to your fellow coders.
Some development teams even discourage the filling-in of block headers by surrounding them with a box of asterisks or similar characters. Any changes make the right hand side of the box irregular, so it is better to leave it in its original regular form.
It seems to me that the unwritten rule is: If you have a function that is similar to the one you want to write, or not similar at all, just copy both the header and the code. Because someone has already filled in the header for you it saves some time and allows you to concentrate on the code.
Furthermore, if the coding standards mandate block headers with an asterisk in column 2, it is possible to grep for just the headers and produce the documentation from the code files. Counting the duplicate functions then allows an estimate of how much cut-and-paste programming has taken place. But look at how much documentation there is!
Block Headers: Please, just say no.
Automatic comment generation
Integrated development tools now take a lot of the effort out of adding comments to code, saving the programmer valuable minutes of coding time. Picking a popular IDE and using the tool to create a new Dialog Box obtains comments such as:
  return TRUE;  // return TRUE unless you set the
                // focus to a control
                // EXCEPTION: OCX Property Pages
                // should return FALSE
Now I thought the idea of automatic code generation was that the code was generated automatically for the programmer, or am I being particularly naïve about this? Surely the system, either at code generation time or later, can work out if the focus is set to a control or not, and whether this is an OCX property page or not. This particular example was taken from production code - did the original programmer actually look at the code and decide that TRUE was correct but was too lazy to modify the comment appropriately, or not?
The nice thing about good code generators is that quite often the default behaviour is what is required, so there is no need to even go and look at the automatically generated code. Thus the presence of
// TODO: Add extra validation here CDialog::OnOK();
throughout the code leaves the maintenance programmer with an uneasy feeling that perhaps the original coder did not finish the job.
If the above example had not auto-generated comments but pseudo code instead:
Perform custom validation (if any) then call: CDialog::OnOK();
The coder would have been forced to look at the generated text and turn it into the correct code. The compiler would have insisted on it!
In another universe
Just imagine a world in which the compiler reads the whole of the source file instead of skipping the bits the coder cannot be bothered to code in the language. This is a fancy way of saying "No comments".
To imagine what this nirvana might be like, let's look at something else that the designers of programming languages decided wisely not to add to the language: Version Control. The first version control system [ 6 ], RCS, foolishly added the history of the code to the code file itself, but everyone saw this was bad (all right, almost everyone) and caused files to grow beyond the file system limit and lo, CVS was born where the code is the payload and the version information is the metadata. CVS makes putting the version information in the file so difficult nobody bothers to do it. [ 7 ]
The nice things about having code and version control system loosely coupled are:
- If you don't like your version control system you can change it without changing your programming language.
- If you don't like your programming language you can change it without changing your version control system.
- You can view just the code, or you can view just the revision information, or a bit of one and a bit of the other. You choose to view just what is appropriate for the current task.
- There is a big market for version control systems, and add-ons for version control systems, and books and websites on version control systems.
In our brave new (comment-free) world, the more creative individuals will soon work on adding what is missing...
Someone will write a pre-processor that allows annotations and code to be mixed, but strips them out before the compiler or interpreter sees them. This will be a great development because the way annotations are added is language independent. The alternative, the Balkanisation of annotations, would be awful because you could end up with a crazy situation where lines starting with a certain symbol mean an annotation in an interpreted language but, say, a pre-processor instruction in a compiled language.
An obvious tool to support annotations is the programmer's editor, whether it be stand-alone such as Unix vi, or within a feature-rich IDE such as Microsoft Visual Studio. The potential in the integrated development tools is large: hover the cursor over the variable and a balloon pops up with any comment assigned to the variable. There could be options to open up further information such as seeing the definition or any comments assigned to the containing code (class if it is a member variable, function if it is a local variable, etc.). Annotations could be applied at any level, even right up to whole application or project. Even the simpler tools can use code colouring or other hints to indicate the presence of annotations, perhaps showing them in a separate but linked scrolling window that can be re-sized or hidden when not required.
Annotations could be a valuable part of pair programming with one of the pair taking the role of coder and the other the annotator. A positive feedback loop would occur: when the coder reads the annotations he obtains a qualitative measure of how much the annotator has understood so that both the code and the annotations can be improved. All this happens in a short timeframe when the original implementation knowledge is still fresh.
The majority of code in the wild comes under the category of legacy code. The constraints of maintaining such code are the rule of minimum change and no gratuitous modifications. Considerable time and effort is spent analysing and understanding the code in order to add necessary enhancements or fix problems without violating those constraints. As annotations don't modify the deliverable code or force automatic rebuilds (e.g. by make or similar tools), they can be added as the code is examined with little risk. Just imagine the understanding that would be gained and then immediately thrown away if those new annotations were hard-wired to the code and couldn't be modified!
Conclusion
The direction that programming comments, a.k.a. code annotation, has taken has helped keep the quality and maintainability of code low. The way that different programming languages implement comments is inconsistent and in some cases conflicting. The authors of books, industry gurus, and the quality [ 8 ] procedures in most companies have all perpetuated the status quo with the result that innovation, which could lift the usefulness of code annotation to higher levels, has been stifled.
Is it time to change the mindset of the programming community so that the total value of all the comments in all of the code moves into positive territory and then continues this upwards trend? This requires those with influence to lead the change:
- Authors of books and articles should stop putting bad examples of comments in code snippets. The topic being described should appear in the text, not in the code.
- Coding standards should stop mandating comments and move towards making code readable and understandable. A comment should reflect a failure to make the intent of the code clear, and therefore be a fall-back position and not a primary goal.
- Leaders of code reviews should look for justification for comments, and reject cases where the comments are only added because the coder is too lazy to write the code properly in the first place.
- Block headers should be actively discouraged. The total value of block headers is so far into negative territory that elimination of all block headers would be a significant improvement.
- Programming tools should be enhanced to support annotations to code that improve the readability, instead of comments that pollute it and make it more difficult and confusing to understand.
These auto-generated comments are the result of lazy implementation of the tools (just write a comment instead of working out what the code should be) followed by lazy users (the computer generated it so it must be right). Definitely a case of two wrongs make an even worse wrong.
It may not be possible to go back and make existing commenting consistent and useful across multiple languages and the millions of files of existing source code, but it is possible to change the approach of programmers so that newly written code is appropriately annotated. All it needs is the realisation that we have been wrong all these years and that there is another way.
Notes and references
[ 1 ] By Russel Winder in Peter Sommerlad's session called "Only the Code tells the Truth".
[ 2 ] Timeslots typically start at zero and channels often start at one. In 2Mbit/s ISDN timeslots 0 and 16 are reserved so that channel 1 is in timeslot 1 and channel 16 is in timeslot 17. As long as you only test the first 16 channels you don't have to bother about the subtle difference.
[ 3 ] This raises the perennial question: When do you do code reviews, before check-in and testing, or after? Both have their pros and cons, and neither stands out as being significantly better.
[ 4 ] C++ Coding Standards . Herb Sutter and Andrei Alexandrescu. 2005.
[ 5 ] Producing Better Bugs. Roger Orr. ACCU conference 2006.
[ 6 ] I'm taking a little artistic licence here with the history of source code revision systems, but you get the picture.
[ 7 ] It also makes it so hard to get it out again, nobody can work out how to, and thus exports and imports into a fresh CVS archive occur. This gets rid of the pollution in the code, but it gets rid of all the history as well.
[ 8 ] As in ISO9000 type quality, as opposed to goodness.









 
                     
                    