ABSTRACT
This article describes a C++ compiler for the Extended DOS, Windows and Windows NT environments with greatly enhanced run-time diagnostics. Illegal references to memory, such as references through NULL or unset pointers - which would normally corrupt a program, - are detected explicitly. A method of trapping references to unset data is also discussed. The need to cater for 'real' programs containing at least some 'dirty' code, and even in-line assembler, is emphasised. It is argued that the ability to debug C++ programs free from the insidious effects of memory corruption is vital if the language is to enjoy continued popularity and be capable of producing robust maintainable programs.
INTRODUCTION
Despite the many advantages of the C++ language over its parent C, both languages suffer a serious flaw. Using either language it is excessively easy to write programs which inadvertently access memory outside their data structures. This problem is so acute that many teachers prefer 'safer' languages such as Pascal. Even professional programmers waste a great deal of effort chasing bugs associated with corrupt programs.
As an example of how easy it is to write illegal C code, consider the following: fragment:
char x[10]="1234567890";
int k=strlen(x);
The value of k is undefined because the terminating NULL character is not contained inside the array x . The value of k will depend on the bytes of data that follow the array x . In more complex contexts such bugs are extremely hard to spot, especially as the value of k will in all probability vary unpredictably as debugging code is added to the program in order to try to resolve the problem.
The author knows of at least one self-taught C programmer who believed that the statement
char* ptr;
created a pointer variable ptr and initialised it to point at a newly created character variable. Using a well known MS-DOS compiler, he had programmed for some months without discovering his error!
Environments such as DOS and Windows compound the problem because undefined automatic variables acquire values from the stack which may have been used previously by interrupt code. This can make the program extremely unpredictable.
It is possible to program in C++ with intelligent objects which detect or prevent misuse at run-time[ 1 ]. Such objects can be constructed in a known state, thus avoiding the problems of unset variables. Array and pointer operations can then be overloaded and implemented with suitable checks. It would even be possible to selectively enable such checks depending on a macro setting (say). However, few programs will actually get written this way, and even when this approach is adopted, the original class writer will not have the benefit of this mechanism.
Traditionally programmers have relied on static code analysis using Lint[ 2 ], or on tools that detect references to memory that does not belong to the program[ 3 ]. Some attempts have been made to extend the C language in order to specify restrictions that the compiler can enforce[ 5 ]. One notable attempt to provide run-time pointer checks relied on changing the size of pointers and also made several rather restrictive assumptions about the code[ 6 ].
A solution to these problems is contained in the 32-bit Salford Software C++ compiler (SCC) [ 4 ], which runs under DOS and Windows, and is currently being ported to Windows NT. This compiler can either compile a program conventionally, for speed of execution, or can include the run-time checks which are described in the rest of this paper. Immediately an error is detected the debugger is entered before program corruption can begin.
CHECKING POINTERS AND REFERENCES
A program which performs any of the following pointer/reference operations can certainly be classed as illegal:
- A reference through a NULL pointer.
- A reference through an uninitialised pointer.
- A reference which does not point to a data object whose address has been taken in the program (and which could therefore be a legitimate pointer target).
- A reference to an automatic object which has ceased to exist.
- A reference to an allocated object which has since been deleted.
- A reference which would alter a program constant.
-
An attempt to treat a pointer to a function as a pointer to data,
or vice-versa.
Because programmers are free to manipulate pointers in many ways - such as casting them to integers and performing arithmetic on the result - it is difficult to provide a more restrictive set of conditions without causing some valid programs to fail. The SCC CHECK mode mechanism handles all the above cases.
Despite the fact that a 'random' memory reference may modify some other target of a pointer without detection, the above set of tests has been found to be very effective in practice at detecting programming errors. One reason for this is that most objects which are targets for pointers end up allocated on the heap in CHECK mode (see below). This means that no such object will be contiguous with another (because of the red-tape required for storage allocation). Programs which contain pointer references which stray off either end of a data structure (e.g. using strcpy ) will therefore always fail cleanly.
REFERENCES THROUGH UNINITIALISED AND NULL POINTERS
All programs which reference through a NULL pointer cause a fault under SCC, whether compiled in CHECK mode or not. Our representation of the NULL pointer is 0x80000000 (a negative 32-bit integer) and we use a 2 Gigabyte (31-bit) segment for both code and data. This means that negative addresses, such as the NULL pointer, are illegal.
This representation of the NULL pointer is ordinarily invisible to the programmer because the NULL pointer is converted in the case of casts or in a simple statement such as:
char* p=0;
A reference through a NULL pointer can be diagnosed by the debugger so that the error message "Reference through NULL pointer" can be produced by the debugger. The same program run with an ordinary real-mode DOS compiler and even some DOS-extender systems would destroy part of the system interrupt table and continue executing!
In CHECK mode, uninitialised pointers will contain the bit pattern 0x80808080 (see the following section), which also causes a fault when used as an address. Again, the debugger will recognise the particular bit pattern and generate a specific error message - "Reference through uninitialised pointer".
Under Windows NT, the address 0x00000000 is illegal. The NULL pointer will therefore be given its conventional value of zero in this environment, and the exception associated with a reference through a zero address will be interpreted appropriately.
INITIALISING UNSET VARIABLES
When a function is entered, all its automatic variables are initially unset. Since in general the initial value of such variables will depend on the previous user of the stack (i.e. typically the previous function call), the result of using an uninitialised automatic variable can be very unpredictable. In order to fault the use of uninitialised pointers in CHECK mode, code is planted at function entry to set the entire stack area to 0x80 in every byte. This produces illegal pointers and a large negative number in signed integer variables.
In order to make this mechanism more powerful, at the expense of faulting some valid programs, three compiler options are provided:
- An option is provided to force unset static and public data to the unset bit pattern, rather than 0/NULL, as the ANSI C standard requires. Since it is generally considered to be bad practice to rely on such default values, it is anticipated that many users will use this option.
- An option (UNDEF) is provided to check the value of every non-character variable before use to ensure that it does not contain the unset value. This check will obviously fail programs which manipulate numbers corresponding to the unset bit pattern. In practice, because the default integer size in SCC is 32 bits (so that the probability of a 'random' integer having the value 0x80808080 is very low) this check is useful for the vast majority of programs.
-
An option is provided to extend the UNDEF check to include
character variables. Clearly, this option is not
appropriate for programs which perform
arithmetic in character variables, but is very useful otherwise,
since 0x80 is not a frequently used
character.
Unfortunately, programs which use uninitialised bit-packed data will not be faulted by the above scheme (although the result will become predictable).
GENERAL POINTER MISUSE - THE STORAGE ACCESSIBILITY INDEX
In order to identify illegal pointer references of types c to g (above) it is necessary to keep track of every piece of store which can legally be the target of a pointer reference. This includes all global or local constants and variables whose address is taken implicitly or explicitly. Other variables cannot be validly referenced by a pointer. This information is recorded in a table known as the Storage Accessibility Index (SAI). Every reference through a pointer is compiled into code which includes a system call to check that a suitable SAI entry exists. Operations which acquire store from the system, such as new , add an entry in the SAI which is removed when the storage is freed.
The SAI also records if storage is read-only. This enables the system to diagnose programs containing mistakes such as:
char* p="Test";
*p=0;
Data is considered read-only if it is ultimately derived from a constant or from some system object which is meant to be read-only (such as the arguments supplied to the function main ). The const qualifier is not used for this purpose because the user is permitted to override the effect of const using a cast. The type data associated with SAI entries is not recorded (because casts make such information unreliable), however SAI entries associated with function pointers are distinguished, so that a function pointer cannot be cast to a data pointer and used to corrupt the code of the function. Likewise, a pointer to data cannot be cast to a function pointer and be subjected to a call (this would normally send a program totally out of control).
Most of the cost of checking C/C++ code reduces to maintaining the SAI and checking pointer references against it. Typically the use of CHECK will increase execution times by a factor of 10. This cost is acceptable in a debugging context, and can be eliminated automatically by recompilation.
C++ REFERENCES
In general, the use of a reference variable can cause program corruption in the same way as a reference through a pointer. Thus, for example, it is possible to initialise reference variables to objects which disappear before the reference is used:
char* p=new(char[10]);
char& q=*(p+5);
delete p;
q=0;
Furthermore, reference members of classes can become corrupted because currently the class object is entered into the SAI as a whole. For these reasons, accesses to reference variables are also protected by the SAI.
DANGLING POINTERS
A common mistake is to use a pointer to an object which is no longer available. The simplest example of this involves code such as:
char* p=new(char[10]);
delete p;
...
if(*p==0)foo();
Usually the delete operation and the subsequent pointer reference are well separated in the code, making this type of bug particularly insidious. If the SAI were implemented exactly as described above, many such faults might never be diagnosed, because the freed storage might be reallocated by a subsequent new (or malloc ) to an entirely different object which would then become corrupted.
As a partial solution to this problem, storage released by delete (or free etc.) is kept in a special table of obsolete objects rather than being returned to the memory pool. Only when memory becomes short is the store returned to the pool on a first in first out basis.
Any reference to an obsolete object can therefore be diagnosed explicitly as a reference through a dangling pointer. Furthermore, the more memory available on the machine (SCC runs in 32-bit mode, so there is no 640K limit) the less danger there is of a piece of freed storage being reused before a spurious pointer reference.
The common mistake of corrupting the C heap storage by freeing a region of store more than once is a special case of a dangling pointer reference, and is detected at the line where it occurs.
An option is available to force the run-time system never to release dangling storage, thus ensuring that every dangling reference is caught. Since storage is never re-used in this mode, this is only useful for testing small examples, or where very large amounts of memory are available.
DANGLING REFERENCES TO STACK VARIABLES
The following function illustrates a slightly different form of dangling reference:
char* func()
{
char buf[10];
...
return buf;
}
This function will return a pointer to an object which has ceased to exist. Because the storage lies on the stack it will be immediately reused by the next function to be invoked -leading to total confusion. The example above would produce a compiler warning, but this is not possible in more complex cases. In general, a pointer to an automatic (stack) variable may be assigned to a global variable, or even placed inside a structure, and then referenced after the automatic storage has been reclaimed.
Because variable buf is not a heap variable, it would appear at first sight that such a dangling pointer reference could not be detected by the CHECK mechanism. The solution to this problem is that variables which would ordinarily be allocated on the stack (in non-CHECK mode), and whose addresses are taken (and which must therefore be placed in the SAI) are treated specially. These variables are allocated on the heap and placed in the SAI on function entry (more generally on entry to the scope in which they are defined). They are deallocated on exit from the function and are added to the table of dangling references.
Since the arguments of a C function are copies (i.e. passed by value), the entire argument strip is placed on the heap in CHECK mode so that pointers to arguments also become dangling after the function exits.
Operations which unwind the stack, such as the C++ exception-handling mechanism, or the traditional C setjmp / longjmp must take special actions in CHECK mode to deallocate automatic variables placed on the heap when the stack is unwound. The problem is similar to that of ensuring that destructors for automatic objects are executed as the stack is unwound, and is implemented in the same way.
MIXING CHECKED AND UNCHECKED CODE
SCC's sister compiler, FTN77, implements similar checks for the FORTRAN 77 language. The FTN77 run-time system goes to some trouble to ensure that checked and unchecked code can be mixed in one program. This facility was considered essential because FORTRAN programs often make use of scientific libraries, which would usually be supplied precompiled.
Unfortunately, it is easy to see that no such mixing is possible with C++ checked code because of the existence of pointers in the language. For example, an unchecked portion of code could return a pointer to a piece of store which would not be present in the SAI (even worse, the unchecked code might communicate the pointer via a global variable). A reference to such a pointer would generate a spurious diagnostic.
To enforce the restriction that a whole C program must be compiled in CHECK mode, we append __C to all names sent to the linker. This has a very beneficial side effect - a library may contain checked and unchecked versions of the same code and the linker will extract the right copy automatically.
Functions compiled with the UNDEF option to check for the use of undefined data can be freely mixed with functions compiled with the CHECK option.
THE SYSTEM LIBRARY
At first sight it might be thought that the system library need only be compiled in CHECK mode to be suitable for linking into CHECK mode programs. However, if that approach were taken, program errors which manifested themselves inside library routines would produce very obscure diagnostics. Furthermore, some routines, such as new and delete , must be implemented specially for CHECK mode, since they must adjust the contents of the SAL.
In order to provide high quality diagnostics for faults manifesting themselves inside library functions (and to avoid substantial performance costs), CHECK versions of library routines perform the necessary checks on their arguments before passing control to the normal function to perform the bulk of the work. Sometimes checks are performed after the usual routine has executed.
As an example, consider the strlen function in string.h, which calculates the length of a string of characters terminated by a null byte. The CHECK version of strlen checks that its pointer argument is a legal address (a negative address would cause a protection fault inside strlen ) and passes it to the standard strlen routine. When the length result is returned a check is performed to ensure that the pointer argument points to an object in the SAI, and that the string (including the terminating NULL) is completely contained within the same object. A few functions, such as memcmp are implemented specially in order to detect comparisons which run off the end of a valid data blocks.
Some traditional C library functions, such as printf and scanf , take a variable number of arguments depending on one of the arguments (e.g. a format string). Mistakes in calls to these functions are very common.
As explained above, a function's arguments are contained in a single SAI entry. It is therefore easy for functions such as printf to report if too few arguments are supplied. Unfortunately no check is possible that the arguments are indeed of the required type (although SCC 'knows' about the standard C functions which take format arguments, and will warn at compile time if there is a format/data mismatch).
A further problem with the traditional C system library, is that most error conditions (for example a negative argument to sqrt ) are handled by setting the global variable errno and returning. Programs which do not check errno after every library function call which could set it (i.e. almost all programs) can be hard to debug. Other languages, such as FORTRAN, either permit, or require the implementation to fault on such errors. SCC provides a function which may be called to activate certain classes of error, such as arithmetic domain errors, so that the program fails after such an error and enters the debugger.
It is possible for users to set up (hopefully well tested) libraries to operate in this way. This may be very valuable for general purpose class libraries.
C++ CLASSES
In general, C++ classes contain data which the user should never alter. Pointers to the virtual function table, and pointers to virtual base classes are set by the system only. Other objects, such as reference member variables, should not change their values after initialisation. Currently the CHECK mechanism does not address these issues, except that it checks that the virtual function pointer points to a valid virtual function table before using it.
In principle, the SAI entry for such a class could be split into several entries covering each user-accessible region, or some other scheme to restrict access to system parts of the object could be implemented.
DIFFICULT PROGRAMS
By overloading the new operator, or by the use of casts, the C++ programmer can take over the allocation of storage for his data structures. Typically, he may allocate storage from a single large region of memory obtained from the system. Since this large storage region will appear in the SAI, such programs will work in CHECK mode, however the checks will be much less effective. Because of this, a macro (__CHECK__) is set to inform the user that CHECK mode is in operation. The user may then revert to the standard storage allocators in CHECK mode, or even manipulate the SAI directly via system function calls. Since in general it is important that a program executes in the same way in CHECK mode as it will in normal use, wise users will use this facility sparingly!
SCC can handle inline 32-bit Intel assembler. At first sight this poses serious problems in check mode. In particular, automatic objects that are moved to the heap (see above) require an extra level of indirection in order to access them. Assembler instructions which reference such variables directly are therefore faulted in CHECK mode. For example:
void foo(int&);
int k; //Automatic variable k
foo(k); //This takes the address of 'k'
//and forces SCC to allocate it
//on the heap in CHECK mode
asm{
mov eax,k; //Illegal in CHECK mode
A pseudo instruction to load the address of a variable (LAD) is supplied in order to write assembler code that will work in CHECK MODE. For example:
lad eax,k;
will load the address of k into register EAX. This assembles as an LEA instruction, which loads the address of its operand, in normal mode, or if k does not require allocation on the heap, and as a MOV instruction accessing the pointer to k if k is allocated on the heap.
Pseudo instructions are also supplied to check the contents of a register to ensure that they point to a suitable SAI entry. These pseudo instructions assemble into code which preserves all registers over the check. In non-check mode these pseudo instructions generate no code.
CHECK mode does not, of course, solve the problems caused by incorrect assembler code, which may still corrupt the system.
CONSTRAINTS ON FEASIBLE CHECKING SCHEMES
In designing the above checking mechanism considerable care was taken to avoid constructions which would have had unfortunate side-effects. In particular, it was decided that:
- Any scheme which increased the size of pointers was likely to break code which assumed the size of pointers to be 4-bytes.
- Since pointers may be cast to 4-byte (default) integers in normal mode, it was essential that this was possible in CHECK mode. (The standard allows implementations to fail such conversions if there are insufficient bits in the integer to hold a pointer, but real programs impose stronger constraints on implementations).
- Structures, classes, and unions should retain the same size when compiled in CHECK mode.
-
Because unions can contain pointers overlaid on other objects, it
was decided that any scheme that tracked the contents of pointer
variables (as opposed to the data to which they could point) was
impractical.
Because of the above constraints it was decided that any attempt to detect uninitialised data using additional bits was impractical. Furthermore, it was decided that any attempt to validate a pointer reference using the past history of the pointer was impractical. For example:
int k=100;
char* p=new(char[10]);
*(p+k)=0;
This program is clearly wrong (since the programmer can have no knowledge about the memory location he is altering), and will fault in CHECK mode unless the pointer p + k happens to point to a read-write pointer target. Only by tracking the history of the pointer p would it be possible to fault this program in all cases.
CHECKING FUNCTION INTERFACES
As with FORTRAN 77, traditional C programs which do not use function prototypes can become corrupt if a function is called with the wrong number or types of arguments, or if the return type does not match. FTN77 provides a comprehensive check to diagnose such problems at run-time in FORTRAN code. However, since SCC can enforce the use of function prototypes with C code, and these are in any case mandatory with C++ code, we have made the decision not to provide an independent run-time check for argument mismatches.
ARITHMETIC CHECKS
Floating point errors, such as overflow, can be detected by the hardware without special action by the compiler. In order to detect integer overflow, the compiler would have to plant extra instructions to check the condition codes. Unfortunately, many C/C++ programmers rely on the twos-complement implementation of signed integer arithmetic, and perform operations which are strictly illegal. For example, many programmers would be surprised if the left shift operator (' << ') caused an overflow when it forced a change of sign of a signed integer. Also, many programmers mix signed and unsigned data in ways which technically cause overflow. Since integer overflow is a much less serious problem in a 32 bit environment, it was decided not to attempt to detect this problem.
PERFORMANCE
The performance consequences of using CHECK mode vary widely from to program. On average, we find that a reduction of speed by a factor of 10 is common. CHECK mode programs also use much more memory. Since it is anticipated that CHECK mode will only be used as a debugging tool, this loss of performance should be acceptable except for those programs with a real-time component. In principle, some of the overhead of constantly checking accesses to the SAI could be compiled away to improve performance in CHECK mode. However, since some sources of program corruption - such as incorrect assembler code - may invalidate the assumptions which the compiler might use to make such optimisations, it is not clear whether this would be worthwhile.
CONCLUSION
It is a sobering fact that almost every large FORTRAN or C program contains at least a few undetected instances of memory corruption. These problems can be almost eliminated at debug-time by the use of suitable compilation techniques.
ACKNOWLEDGEMENTS
I would like to acknowledge the many helpful discussions with my colleagues Ewan Cunningham, Tim Bartle, Mark Stevens, Robert Chafer, Keng Low, Dinesh Patel, and Tony Webster, which have contributed to this paper and to the software which it describes.
REFERENCES
1. B. Stroustrup, "The Evolution of C++ : 1985 to 1989", Computing Systems, Vol 2,(3), 191-250 (1989).2. PC-lint Reference Manual, Gimpel Software.
3. Nu-Mega Bounds Checker, Nu-Mega Technologies Inc.
4. Salford C/386 Reference Manual, Salford Software.
5. D.W. Flater, Y. Yesha, and E.K. Park, "Extensions to the C Programming Language for enhanced fault detection", Software Practice and Experience, Vol 23 (6), 617-628 (1993).
6. J.L. Steffen, "Adding Run-time checking to the Portable C Compiler", Software Practice and Experience, Vol 22(4),305-316 (1992).