Embedding a value in a type can lead to more efficient code
Andrew Koenig is a former AT&T researcher and programmer. He is the author of C Traps and Pitfalls and, with Barbara Moo, coauthor of Ruminations on C++ and Accelerated C++. Barbara Moo is an independent consultant with 20 years of experience in the software field.
There have been a number of books and articles in the past few years about techniques for doing arbitrarily complicated compile-time computations in C++. These techniques are collectively called "template metaprogramming," and there is a whole community out there doing remarkable things with them. Although many of these techniques are specialized and esoteric, there are some metaprogramming-related techniques that we have found to be useful for everyday programs. This article will illustrate some such techniques.
Programming languages have traditionally been divided into two categories: statically typed languages and dynamically typed languages. C++ adds a third category: Its generic-programming features make it easy to write programs that deal with types that first become definite during compilation. For example, in the copy function from the Standard Library, which might be implemented somewhat like this:
template <class In, class Out>
Out copy(In begin, In end, Out dest)
{
    while (begin != end)
        *dest++ = *begin++;
    return dest;
}
we do not know what types In and Out represent at the time we write the function, but the compiler knows these types by the time it has to generate machine code. Accordingly, programs that use functions such as copy can often obtain much of the convenience of dynamic typing while retaining the error checking and speed of static typing.
By now, most of our readers are probably aware of such techniques. What is less well known is that we can apply similar techniques to values, not just types, that become known during compilation. Effectively, we can embed a value in a type, making it possible to do some computations that involve that value during compilation. The resulting programs can be as fast as if the values in question were known at the time the program was written, and more convenient than they would be if the values were fixed as part of writing the program.
Array Types and the Preprocessor
We tend to forget the familiar. If you ask a fish about its environment, the last thing it will describe is the water. It is therefore worth pointing out, however obvious it may be, that C programmers have been including numbers as compile-time parts of types since the 1970s. In C, this technique rests on two fundamental language features:
- C programs, like C++ programs, pass through a preprocessor as part of compilation, and the preprocessor can substitute numbers for particular identifiers.
- Wherever a numeric constant (such as 42) is permitted, one can also write a constant expression (such as 2+40).
These two features together make it possible to write code that depends on values that are computed during compilation. For example:
#define HDRSIZE 100
#define BODYSIZE 1000

struct buffer {
    char b[HDRSIZE+BODYSIZE];
};
Here we have defined a structure type named buffer that includes a member named b with a size that is the sum of HDRSIZE and BODYSIZE. This computation takes place during compilation, which means that by changing HDRSIZE or BODYSIZE and recompiling, we will cause the buffer type, and presumably any code that uses it, to reflect the revised size.
We can even use the preprocessor to allow us to pass the desired size as a parameter to a macro that will define the type we want:
#define MAKEBUFFER(name,size) \
struct name { \
    char b[size]; \
};

MAKEBUFFER(buffer,HDRSIZE+BODYSIZE)
Here, the \ at the end of a line tells the preprocessor to treat the next line as a continuation of the current line, so that we can spread the definition of MAKEBUFFER over four lines instead of one.
The call to MAKEBUFFER expands into
struct buffer {
    char b[HDRSIZE+BODYSIZE];
};
which does the same compile-time computation as the previous version.
In C++, we can achieve a similar, but more general, effect by using non-type template parameters:
template<int N> struct buffer {
    char b[N];
};
Now we can use types such as buffer <HDRSIZE+BODYSIZE> without having to resort to macros, or even to give the type a distinct name. Moreover, if we still want to use a name to represent a buffer of a particular size, we can do so with typedef:
typedef buffer<HDRSIZE+BODYSIZE> fixedbuffer;
Whichever form we use, we are taking advantage of the notion that a type can potentially include a number.
One reason that it is interesting to know numbers during compilation is that the compiler can use that knowledge to generate more efficient code. This strategy can be particularly useful when the buffer is small. For example, if a buffer contains only a few bytes, the compiler may recognize that it will occupy only a single word (or a few words) of memory and generate word-oriented instructions to copy it. If the buffer is larger, the compiler may instead generate a loop or use a specialized machine instruction that can copy large amounts of memory quickly.
Compile-Time Computation
We can use non-type template parameters for purposes well beyond merely indicating data-structure sizes. For example, suppose we are writing a program that includes a data structure that is tricky to get right and also requires high performance. If we didn't care about performance, we would probably write a member function with a name such as audit that we would call from time to time to verify that the data structure is behaving as we expect. That function might test a debugging flag that we would then be able to set when we wanted to verify that our program was working properly.
However, if we want to wring the last bit of performance out of our program, we would rather avoid testing a debugging flag at runtime. Instead, how about making it a template parameter:
template <bool debug> class Tricky {
    // ...
    void audit() { }
    // ...
};

template <>
void Tricky<true>::audit()
{
    // The auditing code goes here
}
We have defined a template class that includes an audit function that does nothing. Because this function is defined (and not just declared) within the class body, it is implicitly inline. Therefore, if we call it from other functions in the class body, a reasonably decent compiler will generate no code at all for the call.
We have also used template specialization to create an exception for the audit member of Tricky<true>. If we call Tricky<true>::audit, the effect will be to execute the auditing code.
In other words, if we define an object of class Tricky<false>, any calls to audit from that class's objects will have no effect. If, on the other hand, we change the type to Tricky<true>, then each call to audit in objects of that class will execute the debugging code. By including the debugging flag as part of the class type, we can arrange to test the flag during compilation, where it costs nothing during execution.
Compile-Time Loop Unrolling
We can view the audit technique as a way of making a decision at compile time about whether to execute a piece of code. That is, we are effectively executing the code n times, where n is zero or one.
It is possible to generalize this notion to other values of n. For example, suppose we want to copy a given number of objects from one place to another. We can imagine writing a variation on the standard copy function along the following lines:
template<class In, class Out>
Out copy_n(In src, Out dest, size_t n)
{
    while (n != 0) {
        *dest++ = *src++;
        --n;
    }
    return dest;
}
If we wish to use this function to copy a large number of objects, we may wish to avoid the repeated decrementing and testing of n. One way of doing so is called "loop unrolling." The idea is to copy more than one element at a time, so long as it is conveniently possible to do so. For example, we could unroll the loop by a factor of two this way:
template<class In, class Out>
Out copy_n(In src, Out dest, size_t n)
{
    while (n >= 2) {
        *dest++ = *src++;
        *dest++ = *src++;
        n -= 2;
    }
    if (n)
        *dest++ = *src;
    return dest;
}
Now n is decremented only half as many times as the number of items copied, which presumably reduces the total overhead associated with decrementing and testing by half. However, this optimization comes at a price: When n is small, the overhead actually increases, because our revision tests n a second time after the first loop finishes.
If we are willing to require the value of n to be known during compilation, we can avoid the runtime cost of testing entirely. Our strategy will be to use a specialization technique similar to the one that we used for the debugging flag:
template<size_t n> struct Copy_n {
    template<class In, class Out>
    static Out copy_n(In src, Out dest)
    {
        *dest++ = *src++;
        return Copy_n<n-1>::copy_n(src, dest);
    }
};

template<> struct Copy_n<0> {
    template<class In, class Out>
    static Out copy_n(In src, Out dest)
    {
        return dest;
    }
};
We define a template named Copy_n that will have a (static) member function named copy_n. The idea is that Copy_n<n>::copy_n will be a function that copies exactly n elements.
We define this function recursively: To copy n elements, we copy one element and then call copy_n to copy the remaining n-1 elements. The recursion ends when n reaches 0. Because size_t is an unsigned type, n cannot be negative.
Here is an example of how we might use this class to copy data:
char src[] = "This is a test. This is only a test.";
char dest[sizeof(src)];
Copy_n<sizeof(src)>::copy_n(src, dest);
Of course, this code involves a recursive function call for each element that it copies. However, when we examine our compiler's assembly-language code, we learn that it is capable of expanding these calls inline, at least when we ask for compiler optimization.
Irrespective of whether the compiler can eliminate the recursive function calls, there is the question of whether it can handle all of the nested template instantiations. For example, if we were to use this class to copy a 1000-element array, it would ask the compiler to expand 1000 nested template instantiations, which might well be more than the compiler can handle. It is possible to solve that problem by changing the recursion basis: Instead of copying one element and then making a recursive call to copy the rest, we can make a recursive call to copy half the elements, and then make a second call to copy the other half. This technique reduces the recursion depth from order n to order log(n). We shall leave the details as an exercise.
Summary
Types that are static during execution and dynamic during compilation can be nearly as useful as types that can vary during execution. Such types are the foundation of generic programming, which C++ supports via templates. Because types can include numbers, both as the size of an array and in other contexts, the C++ features that allow types to vary during compilation also allow compile-time computation on numbers.
These computations can sometimes be made to substitute for computations that would normally happen during execution. This substitution makes it possible to avoid runtime overhead in contexts such as debugging flags and loop control.
The C++ template mechanism is capable, at least in principle, of doing anything computable. The template-metaprogramming community is exploring the limits of what kinds of template computations are possible. While these explorations proceed, there are plenty of related ideas that we can use today in ordinary programs.
CUJ