Andrew Koenig is a former AT&T researcher and programmer for more than 35 years. He is author of C Traps and Pitfalls and, with Barbara, coauthor of Ruminations on C++ and Accelerated C++. Barbara Moo is an independent consultant with 20 years of experience in the software field.
In our previous column (CUJ, April 2004), we noted that the choice of algorithm can affect a program's performance so much that other performance decisions are irrelevant by comparison. In this article, we are going to look at some examples from the other end of the spectrum to see how much a specific, well-known compiler optimization (or lack thereof) can affect a program's execution time. We will begin by estimating how long it takes to call a trivial function, and then see how much of that time can be saved by asking the compiler to expand function calls inline.
All of the timing figures presented here refer to a particular C++ implementation on a particular computer. We anticipate that different implementations will vary greatly, both in absolute speed and in the relative effect of different optimizations. Nevertheless, we believe that the numbers we cite should give a general idea of what to expect.
How Fast Is a Function Call?
Functions are an important way of organizing large computations in most programming languages, and C++ is no exception. Accordingly, any effort to understand a large program's performance should include at least a cursory measurement of how long it takes to call a function.
Such measurements are easy, as long as we don't demand too much precision. We can begin with the following program:
int main() { int n = 1000000000; while (--n >= 0) { } return 0; }
This simple program counts down from a large, arbitrary number chosen to give reasonable execution times on our computer. When we ran this program, we found that it takes 1.66 seconds without compiler optimization, and 0.64 seconds with optimization. In other words, compiler optimization speeds up this program by a factor of 2.6, and with optimization, we can execute 1.56 billion (!) iterations of this loop per second.
If we change the definition of n to say that it is in a register:
register int n = 1000000000;
the program's execution time stays the same. Obviously, this particular compiler does a good job of putting variables in registers without our help.
To see how long it takes to call a function, we can add a function call to the loop:
void f() { } int main() { register int n = 1000000000; while (--n >= 0) { f(); } return 0; }
Doing so increases the execution time to 5.87 seconds without optimization and 4.1 with optimization. We can infer that the function call itself takes approximately 4.2 nanoseconds without optimization and 3.5 nanoseconds with optimizationso with optimization, a function call with no arguments takes more than five times as long as an entire trip through the loop. Evidently, the penalty for calling a function can be significant. Of course, we may not care: If a function takes 1000 times as long as the function-call overhead, removing that overhead will make little difference.
If we do care enough about function-call overhead to wish to avoid it, we can define the function as being inline:
inline void f() { }
If we turn on compiler optimization, the execution drops to 0.64 seconds, the same as if we had not called the function at all. Without optimization, however, we get our first surprise: The program takes 5.8 seconds to runessentially the same amount of time as if we had not specified inline at all. Apparently, this compiler ignores inline requests unless you also ask it to optimize.
What if we make the function virtual? We can begin by making it into a member function so that we can see if member functions have any overhead that ordinary functions do not share:
class Foo { public: void f() { } }; int main() { int n = 1000000000; Foo foo; while (--n >= 0) { foo.f(); } return 0; }
After this change, the execution times are 5.92 and 0.67 seconds, respectively, so we see that making f a member function has little effect on the execution time. Now, if we make f a virtual function:
virtual void f() { }
the execution time doesn't change at all. It seems the compiler is smart enough to realize that when you call a member of an object, it doesn't need to use the virtual calling mechanism.
Let's try again, this time forcing a virtual call by using a pointer to refer to the member function's object. First, we remove the virtual from f's definition, so you can see what, if any, overhead is associated with using the pointer:
class Foo { public: void f() { } }; int main() { int n = 1000000000; Foo* p = new Foo; while (--n >= 0) { p->f(); } return 0; }
In this version, we dynamically allocate a Foo object and use a pointer to call that object's f member function. The execution times are now 5.9 and 0.68 seconds, respectively, so there is a small amount of overhead associated with the indirection. If we now define f as virtual:
class Foo { public: virtual void f() { } }; int main() { int n = 1000000000; Foo* p = new Foo; while (--n >= 0) { p->f(); } return 0; }
the execution times go up to 8.17 and 5.77 seconds, respectively. This last number is important because it implies that even with compiler optimization, calling a virtual function takes 5.13 nanoseconds, or as much time as eight trips through the loop. We now have enough numbers that they are worth presenting as Table 1.
A More Complicated Example
Having seen that calling a function involves significant overhead, and making the function inline eliminates that overhead, let's see how this overhead affects the execution of a more complicated program. To do so, we shall write a program to create and sort an integer array, and explore how long it takes to do its work. In doing so, we do not expect to learn anything about function-call overhead for the sort function itself. Rather, we hope to be able to measure the function-call overhead associated with element comparisons during the sort.
The Standard Library sort function accepts an optional comparison function as one of its arguments. Supplying that optional argument asks the sort function to use the supplied function instead of the built-in comparison operator whenever it compares the values that it is sorting. Because it calls the comparison function many times, we can reasonably expect that the function-call overhead will contribute significantly to the total amount of time that it takes to sort an array.
One problem with measuring how long it takes to sort an array is that once we have sorted it, we no longer have the original array. Therefore, each time we sort an array we have to recreate it. We do so by creating one array, filling it with random numbers, and then copying it each time we sort it.
To begin, we write a program to create the array and copy it many times without sorting it, so that we know how much overhead is involved:
#include <cstdlib> const int N = 1000; struct Data { int x[N]; }; int main() { Data init; for (int i = 0; i != N; ++i) init.x[i] = std::rand(); int n = 100000; while (--n >= 0) { Data array = init; // We shall sort the array here } return 0; }
Running this program takes 0.1 seconds, with or without optimization. In other words, copying a 1000-element array takes about a microsecond, a measurement that is commensurate with our earlier measurement of 1.56 billion loop iterations per second.
If we replace the comment with a call to std::sort:
std::sort(array.x, array.x + N);
and insert an #include directive for <algorithm>, the execution time becomes 11.8 or 7.3 seconds, respectively, depending on optimization. In other words, sorting a 1000-element array takes 118 or 73 times as long as copying it, depending on compiler optimization.
The next step is to see how the execution time changes when we ask the sort function to call our own comparison function instead of using the built-in < operator to compare its operands. It is reasonable to expect that the program takes longer to run because the sort library function must call our comparison function for each comparison. Moreover, if we make the comparison function inline, it is reasonable to expect that the execution time will not increase as much. The question is how much.
Here is the program:
#include <cstdlib> #include <algorithm> const int N = 1000; struct Data { int x[N]; }; /* inline */ bool compare(int x, int y) { return x < y; } int main() { Data init; for (int i = 0; i != N; ++i) init.x[i] = std::rand(); int n = 100000; while (--n >= 0) { Data array = init; std::sort(array.x, array.x + N, compare); } return 0; }
Here, the inline comment represents the keyword inline or not, depending on whether we wish the comparison function to be inline.
The results are surprising. Without inline, the execution times are 18.2 and 15 seconds, respectively; with it, they are exactly the same. What's going on here? Shouldn't inline functions be faster? Why is std::sort so much faster with its default comparison?
This last question is more subtle than it looks. You might think that the default behavior if you don't give a comparison function to std::sort is to use the built-in < operator, but the truth is more complicated. In fact, the C++ Standard says that if you don't provide a comparison function, std::sort uses an instance of a library template named std::less. Such an instance is a function object that compares two values.
In other words, when we call
std::sort(array.x, array.x + N);
the effect is supposed to be the same as that of
std::sort(array.x, array.x + N, std::less<int>());
Suppose we supply that third argument explicitly. What happens to the execution time?
The answer is both satisfying and puzzling: It drops to 17.4 and 7.3 seconds, respectively, depending on optimization. At this point, we have enough data to merit Table 2.
But how can std::less compare two numbers so much faster than our tiny compare function?
The answer is that when we pass compare as an argument to std::sort, we are implicitly passing the address of our compare function. Although in principle, the compiler has enough information to determine the identity of the compare function when it instantiates the std::sort template, in practice, this particular compiler does not realize that the address of compare is something that it can determine during compilation. In effect, when std::sort calls the comparison function, the compiler always calls it as if it were an arbitrary function, and does not expand the call inline.
In contrast, when the compiler calls a member of a function object, the identity of that member depends only on the type of the function object, and not on the function object's value. The compiler knows the object's type during compilation, of course, and therefore it has enough information to expand the call inline.
To prove this theory, we can rewrite our comparison function as a function object. The resulting program looks like this:
#include <cstdlib> #include <algorithm> #include <functional> const int N = 1000; struct Data { int x[N]; }; struct Compare { bool operator()(int x, int y) { return x < y; } }; int main() { Data init; for (int i = 0; i != N; ++i) init.x[i] = std::rand(); int n = 100000; while (--n >= 0) { Data array = init; std::sort(array.x, array.x + N, Compare()); } return 0; }
We define a structure named Compare that has an operator() member that compares its two arguments. Then we construct an object of that type and pass it as an argument to std::sort. When the compiler instantiates std::sort, it can glean enough information from the type of compare to be able to expand the comparison function inline.
The proof of this theory is that this program executes in 17.9 and 7.4 seconds, respectively, without and with optimization, which is very close to the execution time with the default comparison.
Conclusion
Although the programs we have presented, and our measurements of them, are simple, we found a few surprises along the way:
- Modern desktop computers are easily capable of exceeding a billion iterations of a simple counting loop per second.
- Compiler optimization can make a huge difference in execution time. Perhaps this fact does not surprise you, but we remember C compilers for which the optimizer rarely changed execution time by more than 10 percent.
- Calling even a trivial function takes longer than five trips through the loop; calling a virtual function takes eight times as long.
- If you pass a function as an argument to a template function, the particular compiler that we are using does not expand it inline, even if you define the function as inlineat least, not for the examples we tried.
- If you pass a function object as an argument to a template function, the compiler expands it inline.
- Expanding the comparison function inline doubles the speed of sorting an integer array.
Of course, these measurements, and the corresponding surprises, apply only to this particular machine and compiler. What we expect to be generally true, however, is that any implementation you measure will have similar surprises in its performance.
That expectation is why we say that if you care about low-level performance, there is no substitute for measuring your own environment.