Write to Learn
Jon Jagger
<jjagger@qatraining.com>
Why do I write articles for ACCU, you might ask? Well, if truth be told, it's not as altruistic as it appears: I write to learn! Writing forces me to order my thoughts (which generally end up slightly less disorganised :-). Readers send me feedback, which usually sparks new thoughts. This ties in with John's last editorial: "We are all continually learning and re-learning, and that process isn't just listening and reading, it's speaking and writing". To try and convince you of this I'm going to write about what I learned from my own Overload 29 article. I hope this will convince some of you out there that writing articles is in your own self-interest. Now, where to start? Something trivial.
How to say nothing?
date::date(int dd, int mm, int ccyy)
    : day(dd), month(mm), year(ccyy)
{
    // empty
}
Constructors often do all they need to in their member initialisation list. Not unreasonable, since constructors do initialisation. The comment tries to emphasise that the null body is not accidental. While reading this empty comment I recalled a classically bad comment (don't laugh now, wait till you see it in real code)
value++; // increment value
As I thought about it I realised my empty comment was almost exactly the same! It wasn't conveying the message I wanted it to: that the body is deliberately empty. How to say this succinctly? For now I've settled on
// all done
Here's some more ado about nothing. I noticed my previous article contained code fragments that used ... in place of large chunks of code. This is visually confusing as catch-all handlers also use ellipses [1]. So I'm trying something different.
reference or value?
namespace accu
{
    class string
    {
    public:
        char & operator[](size_t index);
        const char & operator[](size_t index) const;
        ... ... ...
    };
}
Just after this fragment (again from my previous article) I wrote "An alternative version of the const array-subscript operator could return a plain char (by value). There is not much to choose between the two, ..." While reading this paragraph I decided it would be useful to list just what differences there are. There is a difference worth highlighting. Consider
const char & example()
{
    const string greeting("Hello");
    return greeting[0];
}
If the const subscript operator returns a char reference then example will return a dangling const reference to the initial char of greeting which will go out of scope when example returns. Ooops. However, if the const subscript operator returns a value
char string::operator[](size_t index) const;
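the subscript expression yields a copy of the initial char of greeting rather than a reference into it. A caller who binds that copy to a local const reference (const char & initial = greeting[0];) is safe, because the temporary lives as long as the reference does. The example function itself is not rescued, though: a temporary bound to a returned reference does not outlive the return, so example should return a plain char too.
As I write this article I've noticed the previous article said "array-subscript operator". What have arrays to do with this? Nothing. It's the string subscript operator. Terminology matters.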
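Returning to the by-value alternative, here is a minimal sketch of how it behaves. It assumes a cut-down stand-in for the string class (sketch::string, backed by std::string purely for brevity) and a hypothetical helper called local_binding; neither comes from the previous article. One caveat worth stating: a const reference only keeps the returned copy alive within the scope that declares the reference, so the sketch has example return a plain char rather than a const char &.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

namespace sketch
{
    // Cut-down stand-in for the article's string class (an assumption:
    // std::string is used as the backing store purely for brevity).
    class string
    {
    public:
        explicit string(const char * initial) : text(initial) {}

        // The by-value alternative: the caller receives a copy.
        char operator[](std::size_t index) const
        {
            assert(index < text.size());
            return text[index];
        }
    private:
        std::string text;
    };

    // Returns a copy, so nothing refers to greeting after it is destroyed.
    char example()
    {
        const string greeting("Hello");
        return greeting[0];
    }

    // Binding the returned copy to a local const reference is safe: the
    // temporary lives as long as the reference called initial does.
    char local_binding()
    {
        const string greeting("Hello");
        const char & initial = greeting[0];
        return initial;
    }
}
```

Had the subscript operator returned const char & instead, local_binding would still be fine (greeting outlives initial), but any reference handed back out of the function would dangle.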
size_t or int?
A reader asked whether the use of size_t as the index type for the subscript operator might be slower than using a plain (signed) int. I can see the thinking behind this. size_t could be typedef'd to be an unsigned long, leading to the question of whether long arithmetic is slower than int arithmetic. Does it matter? It might; it depends on the context. However, my primary concern when writing code is that my code mirrors my intent. As Dan Saks put it so eloquently at the ACCU conference: "Say it in Code". size_t is sensible for a string subscript parameter. It says a negative value doesn't make sense in this context.
But suppose the application needs to be speeded up, profiling shows the string subscript operator to be a prime candidate for optimisation, and a test reveals the plain int version is indeed quicker. Would I change the size_t to an int in the subscript operators? Well, yes and no. I'd be tempted to try
namespace accu
{
    class string
    {
    public: // types
        class position
        {
            ... ... ...
        };
    public:
        char & operator[](position index);
        char operator[](position index) const;
        ... ... ...
    };
}
and give string::position lots of checking. This would force clients to write
for (string::position index = 0; index != limit; ++index)
{
    ... ... ...
}
but would allow me to redeclare position if I wanted to
typedef int position;
// OR
// typedef size_t position;
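A minimal sketch of what such a checked position might look like follows. The class body, the choice of int as the underlying type, the std::out_of_range exception and the count_steps helper are all assumptions made for illustration; the fragment above leaves the class body elided.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>

namespace sketch
{
    // Hypothetical checked index type; the underlying int is an assumption.
    class position
    {
    public:
        // Deliberately not explicit, so that "position index = 0;" compiles.
        position(int initial) : value(checked(initial)) {}

        position & operator++()
        {
            ++value;
            return *this;
        }

        // Converts to a raw index once the checking has been done.
        std::size_t raw() const
        {
            assert(value >= 0); // class invariant
            return static_cast<std::size_t>(value);
        }

        friend bool operator==(position lhs, position rhs)
        {
            return lhs.value == rhs.value;
        }
        friend bool operator!=(position lhs, position rhs)
        {
            return !(lhs == rhs);
        }
    private:
        static int checked(int candidate)
        {
            if (candidate < 0)
                throw std::out_of_range("position cannot be negative");
            return candidate;
        }
        int value;
    };

    // The client loop from the text, counting iterations so it can be tested.
    int count_steps(position limit)
    {
        int steps = 0;
        for (position index = 0; index != limit; ++index)
        {
            ++steps;
        }
        return steps;
    }
}
```

Swapping the class for typedef int position; leaves count_steps compiling unchanged, which is the flexibility being sought; only the checking disappears.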
string::reference or string::char_reference?
In my code the smart-reference class nested inside string was called char_reference. A sharp-eyed reader asked why I'd used the name char_reference and not just reference. After all, the standard STL containers have a nested type called reference.
namespace std
{
    template<typename type> // simplified
    class vector
    {
    public:
        typedef ... ... ... reference;
        ... ... ...
    };
}
With a moment's reflection you will quickly see that in a template container you cannot name the type that reference is a reference to because that is the name of the template parameter type, which of course will vary. My string class is not a template class, and is not constrained in this way. I can choose any name for the smart-reference class and still be "conforming" with a simple typedef
namespace accu
{
    class string
    {
    public:
        class char_reference;
        typedef char_reference reference; // idiomatic
    public:
        char_reference operator[](size_t index);
        char operator[](size_t index) const;
        ... ... ...
    };

    class string::char_reference
    {
    public:
        ... ... ...
    };
}
reference or pointer?
In the article my string::char_reference class looked like this...
namespace accu
{
    ... ... ...
    class string::char_reference
    {
        ... ... ...
    private:
        string & s;
        size_t index;
    };
}
Ugh, s is an awful variable name. Something like target is much more expressive. But should it be a reference? Why not a pointer? I think it's perfectly reasonable for the string client to write
string::reference marker = greeting[0];
and I can see the wisdom of using a pointer data member to emphasise the association between two separate objects with separate (but related) lifetimes. On the other hand a reference has to be initialised and cannot be re-bound. But using a reference might confuse the reader: they might think the smart reference class is making the raw reference data member smart and the size_t data member is just some extra unrelated gubbins. Of course I could make the pointer const. On balance I think I prefer the pointer version.
namespace accu
{
    ... ... ...
    class string::char_reference
    {
    public:
        char_reference(string * target, size_t index);
        // default copy constructor OK
    public:
        char_reference operator=(char new_value);
        operator char() const;
    private:
        string * const target;
        size_t index;
    };
}
As I write this I wonder why index is not also const.
string * const target;
const size_t index;
So, why not this?
string * const target;
size_t const index;
That somehow seems clearer.
primitive or idiomatic?
Some string behaviour was not covered in the previous article. Obvious examples are comparison and input/output. Here's how I would do output
namespace accu
{
    class string
    {
        ... ... ...
    public: // primitive output
        void write(ostream & out) const;
    };

    // idiomatic output
    ostream & operator<<(ostream & out, const string & to_write);
}
and the implementation would be
namespace accu // string : input/output
{
    // primitive output
    void string::write(ostream & out) const
    {
        ... ... ...
    }

    // idiomatic output
    ostream & operator<<(ostream & out, const string & s)
    {
        s.write(out);
        return out;
    }
}
The use of << and >> as streaming operators is very specific to C++. It's easy to forget this. The difference between primitives and idioms is important. Primitives seem right during early design, idioms during late design, as a refinement. It also seems right that the idioms do nothing except forward to the primitive (just like << forwards to write). I mention this in nauseous detail because it relates strongly to the last section of the previous article, where I discussed the pros and cons of making string::assign public or private. Looking at this again I realise this is really the same primitive/idiom idea.
void example(accu::string & s)
{
    s.assign(0, 'J'); // this is the primitive use
    s[0] = 'J';       // this is idiomatic use
}
This has helped me make up my mind. I've settled on making string::assign public, and removing the friendship.
I've just noticed a tiny bit of Hungarian notation in my previous article
void string::assign(size_t index,char new_ch);
Slap. In my defence I plead that I wrote the article tight to the copy deadline. Here's a fragment of the "final" version. The primitive and idiomatic access methods are declared together in their own section. The implementation code is chunked at the same section level.
namespace accu
{
    class string
    {
    public:
        class char_reference;
        typedef char_reference reference;
        ... ... ...
    public: // access, idiomatic and primitive
        char_reference operator[](size_t index);
        char operator[](size_t index) const;
        void assign(size_t index, char new_value);
    private:
        ... ... ...
    };

    class string::char_reference
    {
    public:
        char_reference(string * target, size_t index);
        ... ... ...
    private:
        string * const target;
        size_t const index;
    };
}

// string : access, primitive and idiomatic
namespace accu
{
    // primitive
    void string::assign(size_t index, char new_value)
    {
        bounds_check(index);
        unshare_state();
        text[index] = new_value;
    }

    // idiomatic
    string::char_reference string::operator[](size_t index)
    {
        return char_reference(this, index);
    }

    char string::operator[](size_t index) const
    {
        bounds_check(index);
        return text[index];
    }
}

// string::char_reference - assignment
namespace accu
{
    string::char_reference
    string::char_reference::operator=(char new_value)
    {
        target->assign(index, new_value);
        return *this;
    }

    string::char_reference::operator char() const
    {
        // this was ro in the previous article
        // a needless abbreviation
        const string & readonly = *target;
        return readonly[index];
    }
}
I'm constantly amazed just how often apparently simple code benefits from further simplification. The above is simpler than the previous article in two ways. The unfettered use of primitives is one. The other is the visual separation of the two class definitions. In other words, I don't write
namespace accu
{
    class string
    {
    public:
        class char_reference
        {
            ... ... ...
        };
        ... ... ...
    };
}
Here's a thought. Should the char_reference conversion operator be const? Suppose someone writes
const string::char_reference eh = s[0];
cout << eh << endl;
In an expression a reference will automatically "decay" into the thing it is a reference to: a reference is implicitly const, so the explicit const is meaningless. However, if the conversion operator was non-const the second line would no longer compile. This is a step too far. Don't forget code such as
void parameter(const string::char_reference & ah)
{
    cout << ah << endl;
}
or (and this is a clincher), code like this
template<typename type>
void parameter(const type & ah)
{
    cout << ah << endl;
}
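Pulling this section's fragments together, the following is a compilable sketch rather than the real class. It assumes std::string as the backing store, drops the shared-state handling (there is no unshare_state here), and has bounds_check throw std::out_of_range. The shown helper is hypothetical: it stands in for the parameter template by streaming into a string instead of cout, so the result can be inspected.

```cpp
#include <cassert>
#include <cstddef>
#include <sstream>
#include <stdexcept>
#include <string>

namespace sketch
{
    class string
    {
    public:
        class char_reference;
        typedef char_reference reference;

        explicit string(const char * initial) : text(initial) {}
    public: // access, idiomatic and primitive
        char_reference operator[](std::size_t index);
        char operator[](std::size_t index) const;
        void assign(std::size_t index, char new_value);
    private:
        // Assumption: an out-of-range index is reported by throwing.
        void bounds_check(std::size_t index) const
        {
            if (index >= text.size())
                throw std::out_of_range("sketch::string index out of range");
        }
        std::string text; // assumption: unshared std::string backing store
    };

    class string::char_reference
    {
    public:
        char_reference(string * owner, std::size_t at)
            : target(owner), index(at)
        {
            assert(owner != 0);
        }
        // Assignment forwards to the primitive.
        char_reference operator=(char new_value)
        {
            target->assign(index, new_value);
            return *this;
        }
        // const, so that const char_reference objects remain readable.
        operator char() const
        {
            const string & readonly = *target;
            return readonly[index];
        }
    private:
        string * const target;
        std::size_t const index;
    };

    // primitive
    inline void string::assign(std::size_t index, char new_value)
    {
        bounds_check(index);
        text[index] = new_value;
    }

    // idiomatic
    inline string::char_reference string::operator[](std::size_t index)
    {
        return char_reference(this, index);
    }

    inline char string::operator[](std::size_t index) const
    {
        bounds_check(index);
        return text[index];
    }

    // Hypothetical stand-in for the parameter template above.
    template<typename type>
    std::string shown(const type & ah)
    {
        std::ostringstream out;
        out << ah;
        return out.str();
    }
}
```

Removing the const from operator char() makes shown fail to compile for a char_reference argument, because ah is a reference to const. That is the clincher in miniature.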
expressions or types?
In the article I wrote the subscript operator with the & token and the operator token together. A comment from Sean Corfield made me re-look at this. He pointed out that Stroustrup consistently uses a different style, one where the char token and the & token have no intervening whitespace. It's easier to see it than read it
// Previous article version
char &operator[](size_t index);
// Stroustrup version
char& operator[](size_t index);
// This article. I'm trying it out
char & operator[](size_t index);
To take the simplest example, let's look at this bit of C
int answer = 42;
int *ptr = &answer;
*ptr = answer;
This is how Kernighan and Ritchie declare pointers in their white book [2]. They decided to make the syntax of a declaration mirror the syntax of an expression. Hence the * and the ptr are together in both. The effect is the declaration emphasises how to use the identifier in an expression. Which is exactly the point I think they intended. But in C++ I think Stroustrup would write
int answer = 42;
int* ptr = &answer;
*ptr = answer;
Why the difference? Well, Stroustrup emphasises the type of ptr in the declaration of ptr. Which again is exactly the point I think he intends. In other words in C the focus is on expressions, whereas in C++ the focus is on types. A natural consequence of this is that Stroustrup never declares more than one pointer in a declaration. If he did he would have to write something like
int* ptr, *another; // version 1
Note however that Stroustrup is quite happy to declare more than one value in a single declaration [3]. How would Stroustrup declare two pointers of the same type? Perhaps he'd write this
int* ptr; // version 2
int* another;
Is this different to version 1? In a sense it is. In version 1 if I change int to double I'm changing the type of ptr and the type of another. To change them both in version 2 I have to edit twice. In effect version 1 is saying ptr and another are deliberately the same type, whereas version 2 is saying ptr and another are coincidentally the same type. Of course you could write
typedef int* int_pointer;
int_pointer ptr, another;
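The difference between these declaration styles can be checked mechanically. The sketch below assumes a C++11 compiler (for static_assert, decltype and the type_traits header); the variables lonely, value, first and second are made up for illustration. It also shows the trap that writing int* invites: the * binds to the declarator, not to the type.

```cpp
#include <type_traits>

namespace sketch
{
    int* ptr, *another;  // version 1: both are pointers to int
    int* lonely, value;  // the trap: value is a plain int, not a pointer

    typedef int* int_pointer;
    int_pointer first, second; // both are pointers to int

    static_assert(std::is_same<decltype(another), int*>::value,
                  "each declarator needs its own *");
    static_assert(std::is_same<decltype(value), int>::value,
                  "the * belongs to lonely only");
    static_assert(std::is_same<decltype(second), int_pointer>::value,
                  "the typedef applies to every declarator");
}
```

With the typedef, changing int to double in one place changes both pointers, just as in version 1, without needing a * on each declarator.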
Another point of interest is that all the introductory QA C++ courses use a third style using two spaces...
int * ptr = ...;
int & ref = ...;
The rationale is simple. We want each token to be clearly visible: we don't really want to lexically bind the asterisk to the type name "more" than to the identifier name. After all, C++ newcomers have enough to cope with as it is!
That's all for now.
[1] And in variadic functions, but you're not using those in C++, right?
[2] Kernighan and Ritchie, The C Programming Language.
[3] Thanks to Kevlin Henney for pointing this out.