Introduction
It is hard to get into a technical discussion these days without the conversation turning at some point to security, particularly the security of computer systems connected to a network. Over and over, people who believe that their systems are secure discover that they really aren't. If they're lucky, the discovery takes the form of a bug report from their quality-assurance organization, or a warning and security update from their vendor. If they're unlucky, they discover the problems only after someone has broken into their systems.
Large software systems can be vulnerable to attack for many reasons. This article discusses one common reason, often called buffer overrun: a program attempts to write past the end of an array, thereby overwriting whatever is in the memory next to the array. A prospective intruder who understands enough about the structure of the program can often use a buffer overrun to set an apparently unrelated variable to a value that allows the system to be penetrated.
As an example, consider a program fragment as simple as this one:
char buffer[100];
std::cin >> buffer;
This fragment reads a word (a sequence of non-whitespace characters) from the standard input stream into an array named buffer, which has room for 100 characters, including the null character that marks the end of the input. If the corresponding input is a word of 100 or more characters, part of the input will spill over the end of buffer and overwrite whatever follows buffer in memory.
We cannot forestall this problem by making buffer larger, because an input line might be larger still. However, if we change the definition of buffer so that it has a type that is capable of expanding with its input, such as:
std::string buffer;
we can avoid this overrun entirely: as more input arrives, the std::string class will take care of enlarging buffer dynamically to be as large as needed.
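For example, a complete little program along these lines (the variable name is ours) might look like this:

#include <iostream>
#include <string>

int main()
{
    std::string buffer;

    // buffer grows as needed, so even a very long word cannot overrun it
    while (std::cin >> buffer)
        std::cout << buffer << '\n';

    return 0;
}

If we wanted whole lines rather than whitespace-delimited words, std::getline(std::cin, buffer) would grow the string in the same way.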
Reading into a fixed-size array obviously risks buffer overrun whenever the input exceeds the pre-allocated array size. In general, one can avoid such overruns by using a data structure that grows dynamically. Sometimes, however, buffer overruns can occur even when we don't appear to need a dynamic data structure. For example, consider the following code that appeared in C Traps and Pitfalls (Addison-Wesley, 1988). This example creates a buffer overrun even though we might think we know at compile time that the array is large enough to contain what we're putting into it. Here, a simple bug can lead to disastrous results:
int i, a[10];
for (i=1; i<=10; i++)
    a[i] = 0;
The problem with this example is that i<=10 should have been i<10. Accordingly, i takes on values from 1 through 10 inclusive, but the array a doesn't have an element with index 10. Therefore, the effect of assigning to a[10] is to assign a value to whatever happens to follow array a in memory.
Many C implementations allocate consecutive memory locations for variables that are defined together. The implementation that was used to test this particular example allocates local variables, such as i and a, in reverse order of their definition, because the variables are part of a stack that grows downward in memory. Accordingly, i occupies the same memory that a[10] would occupy if it existed.
On this particular implementation, then, this example has the same effect as if it had been written as:
int i, a[10];
for (i=1; i<=10; i++) {
    if (i != 10)
        a[i] = 0;
    else
        i = 0;
}
The effect is to set i to 0 the last time through the loop, which, of course, has the effect of starting the loop over again and making the program fragment run forever.
Of course, other implementations may well behave differently. For example:
- The implementation might store local variables in the same order as their definitions, instead of storing them in reverse order. In that case, a[10] wouldn't share memory with i, and assigning to a[10] would have a different effect. Just what that effect might be is impossible to say.
- The implementation might optimize the program by keeping i in a register during the loop. In that case, assigning to a[10] would overwrite the memory normally associated with i, but because i was temporarily in a register, overwriting that memory would have no effect.
- The implementation might be kind enough to detect the attempt to overrun the bounds of a, in which case it might terminate the program with an appropriate diagnostic message.
- Something else might happen that hasn't occurred to us.
As you can see, array overruns are particularly insidious because their effects can differ from one C++ implementation to another, which means that no amount of testing can absolutely guarantee their absence. Accordingly, it is important for programmers to write their programs in such a way as to make it unlikely that such problems will occur in the first place.
The rest of this article explains some guidelines that, if followed consistently, can make bounds errors much less likely. We understand that it may sometimes be impossible to follow these guidelines. In such cases, it is worth giving extra scrutiny to the code that violates the guidelines to see if that code might be exceeding an array bound.
Use Library Containers and Iterators Instead of Arrays and Pointers
Pointers and arrays are low-level data structures that are intended to mirror the behavior of typical computer hardware. This mirroring is good for performance, because it implies that the hardware will be able to execute array and pointer operations efficiently. However, this performance comes at a price: allocating an array fixes its size, and that size is not easy to change later.
How many times have you seen structures such as this one?
struct Customer {
    char name[20];
    char street_address[40];
    // ...
};
In C and C++, the size of an array that is a member of a structure is fixed at compile time, so if we want to store a customer's name and address as part of a structure, we have little choice but to put an absolute upper bound on the size. However, there is rarely an absolute need to store such information directly as part of the structure.
In C, we might define a data structure that represents a variable-length string:
struct String {
    char *data;
    size_t length;
};
and then use instances of that structure instead of fixed-length arrays as part of our customer information:
struct Customer {
    struct String name;
    struct String street_address;
    /* ... */
};
This strategy would then require us to write or obtain a collection of subroutines that would deal with these String structures.
In practice, few C programmers adopt such techniques because of the difficulty of making them work well with common operations such as built-in assignment. For example, suppose we had defined a String structure such as this one, and suppose further that we had two Customer records:
struct Customer c1, c2;
In that case, we would have a problem with a statement such as:
c1 = c2;
The trouble is that in C, assigning one structure to another copies the contents of the structure. In particular, it copies the data and length members of each of the String structures that constitute c2. After such a copy, the pointers in c1 and c2 point to the same memory, which makes it difficult to decide when to deallocate that memory. It is such difficulties that discourage C programmers from creating abstractions such as the String structure above.
The story is quite different in C++, because along with a data structure that defines a variable-length string, it is possible to define the algorithm to follow when someone tries to copy such a string. Indeed, the C++ Standard library has already done this work for us:
struct Customer {
    std::string name;
    std::string street_address;
    // ...
};
Now, if we write:
Customer c1, c2;
and execute:
c1 = c2;
the effect will be to deallocate whatever memory c1 was previously using, allocate enough memory to hold a copy of each component of c2, and copy all of those components. By offering the possibility of richer abstractions, C++ makes it much more straightforward to use those abstractions.
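To see the difference in practice, consider this small example (the field values are made up for illustration):

#include <iostream>
#include <string>

struct Customer {
    std::string name;
    std::string street_address;
};

int main()
{
    Customer c1, c2;
    c2.name = "Pat Jones";
    c2.street_address = "100 Main Street";

    c1 = c2;                        // each string member gets its own copy
    c2.name = "Someone Else";       // changing c2 afterward does not affect c1

    std::cout << c1.name << '\n';   // prints Pat Jones
    return 0;
}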
Fortunately, it is straightforward to learn how to use such abstractions in ordinary contexts; more advanced uses become possible once we understand how and when to apply them. The rest of this article illustrates several guidelines for avoiding buffer overruns in C++ programs.
Append to Vectors Instead of Pre-allocating Them
Just as the library string class is often a good substitute for character arrays, the library vector template is often a good substitute for arrays of other types. However, it is usually wise to use a vector slightly differently from the way one would use the corresponding array. The key difference is that there is a reasonably efficient way of appending an element to a vector, namely to use push_back.
Suppose, for example, that we have defined an operator>> that is capable of reading Customer records from a file:
std::istream& operator>>(std::istream& in, Customer& cust)
{
    // ...
    return in;
}
and we want to read the entire contents of the standard input file, which contains Customer records. To do so, we can define a vector with no elements and append each Customer record to the vector as we read it:
std::vector<Customer> customers;
Customer c;
while (std::cin >> c)
    customers.push_back(c);
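Here is a sketch of what such a program might look like in full; the two-lines-per-record input format and the body of operator>> are our invention, intended only to make the example self-contained:

#include <iostream>
#include <string>
#include <vector>

struct Customer {
    std::string name;
    std::string street_address;
};

// Hypothetical input format: one line for the name,
// one line for the street address.
std::istream& operator>>(std::istream& in, Customer& cust)
{
    std::getline(in, cust.name);
    std::getline(in, cust.street_address);
    return in;
}

int main()
{
    std::vector<Customer> customers;
    Customer c;
    while (std::cin >> c)
        customers.push_back(c);

    std::cout << "read " << customers.size() << " customers\n";
    return 0;
}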
This technique is likely to be familiar only after using vectors for a while. If we are accustomed to using arrays rather than vectors, we might instead be tempted to estimate how many elements the vector will contain and pre-allocate that many elements:
// a dubious approach: a vector with 42 elements
std::vector<Customer> customers(42);
int count = 0;
Customer c;
while (std::cin >> c)
    customers[count++] = c;    // Wrong!
The trouble with this approach is that it is sure to overrun the bounds of the customers vector eventually. To forestall this overrun, it would be necessary to replace:
customers[count++] = c;
with statements that check whether count has reached customers.size(), and, if so, to increase the number of elements in customers. Such checking and adjustment is just what push_back does for us automatically.
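To make the comparison concrete, here is a sketch of the kind of bookkeeping we would otherwise have to write by hand (we use int elements to keep the example self-contained, and the doubling strategy is merely one plausible choice, not necessarily what any particular library implementation does):

#include <iostream>
#include <vector>

int main()
{
    std::vector<int> values;
    std::vector<int>::size_type count = 0;

    int x;
    while (std::cin >> x) {
        // The check push_back would otherwise do for us: grow the
        // vector whenever the next index would be out of bounds.
        if (count == values.size())
            values.resize(values.size() == 0 ? 1 : 2 * values.size());
        values[count++] = x;
    }

    values.resize(count);    // discard the unused elements
    std::cout << values.size() << " values read\n";
    return 0;
}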
Aside from force of habit, there are two contexts in which you might want to define the size of a vector in advance. One context is when you really do know, without having to guess, how many elements the vector will have. Even in such a context, however, you may well want to use push_back anyway. The reason is that giving a vector a size defines its elements, and initializing those elements takes time. For example, if we define:
std::vector<int> v(10000);
we are not merely defining a 10,000-element vector; we are also initializing its elements to zero. There is no reason to take the time for such initialization unless we intend to use the elements.
The other context for defining the size of a vector is to avoid having to reallocate the vector when we use push_back to append to it. In this context, however, we don't really want to define the number of elements in the vector; rather, we want the vector to have enough memory that when it comes time to append elements, we do not need to allocate memory again.
As an example, let's look again at our vector of Customer records. Instead of actually creating 42 records, let's suppose that we had measured the performance of our program and were convinced that we needed to avoid allocating more memory until we actually had more than 42 records. Then, instead of writing:
std::vector<Customer> customers(42); // construct 42 records
we would write:
std::vector<Customer> customers;
customers.reserve(42);    // leave room for 42 records
In the first of these examples, we define customers as a vector with 42 elements, each of which has whatever default value is appropriate for a Customer. In the second example, the call to reserve does not change the number of elements in customers. That number was zero before we called reserve, and it is still zero. What the call does is to ensure that customers has enough memory that it can grow to at least 42 elements without needing to allocate more memory.
By using reserve, we can make it unnecessary for subsequent calls to push_back to have to allocate additional memory and therefore avoid the associated overhead, without having to say for certain how many elements our vector will eventually have.
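A tiny example shows the difference between size and capacity; the exact capacity values are up to the implementation, but the sizes are guaranteed:

#include <iostream>
#include <vector>

int main()
{
    std::vector<int> v;
    v.reserve(42);                    // capacity is now at least 42; size is still 0
    std::cout << v.size() << ' ' << v.capacity() << '\n';

    for (int i = 0; i != 42; ++i)
        v.push_back(i);               // guaranteed not to reallocate

    std::cout << v.size() << ' ' << v.capacity() << '\n';
    return 0;
}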
Use Inserters as Destinations
By appending elements to containers instead of guessing the containers' size, we can avoid overrunning the containers' bounds when we guess wrong. This principle also comes into play when we use algorithms that store sequences of elements into containers. For example, suppose we have a vector named v:
std::vector<int> v;
for (int i = 0; i != 10; ++i)
    v.push_back(i);
and we want to copy its elements into another vector named w. Here is a common beginner's mistake:
std::vector<int> w;
std::copy(v.begin(), v.end(), w.begin());    // Wrong!
The difficulty is in the third argument to std::copy. The first two arguments represent the source; the third argument represents the destination. By giving w.begin() as the third argument to std::copy, we are saying that we want std::copy to overwrite elements of w, starting at the initial element.
The trouble with this strategy is that std::copy will go on overwriting elements as long as there are elements of v to copy. In other words, unless w has at least as many elements as v before we call std::copy, the call to std::copy will overrun w's bounds. In this particular example, w has no elements at all, because we never gave it any, so if v has so much as a single element, the program won't work.
The way around this problem is to use inserters instead of plain iterators on the receiving end of algorithms such as std::copy whenever possible. For example, if we know that w has no elements, and we want to copy v's elements into w, we might do so this way:
// Much better!
std::copy(v.begin(), v.end(), std::back_inserter(w));
Now, each element that we copy will extend w appropriately, so there is no possibility of overrunning w's bounds.
Equivalently, we might write:
w.insert(w.end(), v.begin(), v.end());
This call to insert will insert copies of the elements delimited by v.begin() and v.end() into w immediately before w.end(). In other words, it will append the elements of v to those of w.
Even more straightforwardly, if we know that w is empty before we start, we can write:
w = v;
which will cause w's elements to be copies of those of v. However, this straightforward assignment works only if v and w are the same type, whereas the calls to copy and insert will work even if v and w are different kinds of containers.
From a safety viewpoint, the important property of an inserter is that it is an iterator that causes its corresponding container to grow automatically to contain the inserted elements. By far the most common inserter is back_inserter, which, as its name suggests, yields an iterator that inserts elements at the back of its container. As we have already seen, we can use copy and back_inserter together to append the contents of one container to another.
If a container supports efficient insertion at the beginning, it also supports front_inserter, which yields an iterator that inserts elements at the front of the container. For example, if l is a list<int>:
std::list<int> l;
then we can insert a copy of the elements of v at the beginning of l by writing:
std::copy(v.begin(), v.end(), std::front_inserter(l));
Equivalently, we can use insert directly:
l.insert(l.begin(), v.begin(), v.end());
Finally, for containers, such as list, that support efficient insertion in the middle, there is an inserter called inserter that allows for insertion anywhere in the container. For example, std::front_inserter(l) is equivalent to std::inserter(l, l.begin()), and std::back_inserter(l) is equivalent to std::inserter(l, l.end()).
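For example, here is a small sketch (the particular values are made up) that copies the elements of a vector into the middle of a list:

#include <algorithm>
#include <iostream>
#include <iterator>
#include <list>
#include <vector>

int main()
{
    std::vector<int> v;
    for (int i = 0; i != 3; ++i)
        v.push_back(i);                      // v: 0 1 2

    std::list<int> l;
    l.push_back(100);
    l.push_back(200);                        // l: 100 200

    std::list<int>::iterator mid = l.begin();
    ++mid;                                   // points at 200

    // Insert v's elements just before 200.
    std::copy(v.begin(), v.end(), std::inserter(l, mid));

    // l is now: 100 0 1 2 200
    for (std::list<int>::iterator it = l.begin(); it != l.end(); ++it)
        std::cout << *it << ' ';
    std::cout << '\n';
    return 0;
}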
In all cases, inserters provide a place where algorithms can put their output without the programmer having to worry whether there is enough room for that output.
Use == and != for Bounds Comparisons
Let's look again at the example from C Traps and Pitfalls:
int i, a[10];
for (i=1; i<=10; i++)
    a[i] = 0;
Surprising as it may seem, one way to avoid mistakes of this kind is to use != and == to terminate loops instead of <, <=, >, or >=. For example, if the loop had been written this way:
for (i=1; i!=10; i++)
    a[i] = 0;
the bounds overrun in the original example would not have occurred [1].
This guideline stems from the fundamental tendency of C and C++ programs to express ranges in terms of the first element of the range and a value that is one past the last element. For example, if we have an array with 10 elements, then 0 is the index of the first element and 10 is one past the index of the last element.
Because we express ranges asymmetrically in this way, no element in the range is equal to the range's upper bound. Accordingly, we can iterate over the elements of a range this way:
for (var = lower_bound; var != upper_bound; ++var)
    // ...
We know that the != condition is correct because the upper bound is not part of the range. Using != in this way means that we never have to worry about whether to use < or <=.
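For example, if we rewrite the zeroing loop from the beginning of this article to use a vector and its iterators, the asymmetric bounds come for free (the explicit loop is redundant here, because constructing the vector already zeroes its elements, but it shows the pattern):

#include <vector>

int main()
{
    std::vector<int> a(10);    // ten elements, already initialized to zero

    // begin() is part of the range; end() is one past the last element
    for (std::vector<int>::iterator it = a.begin(); it != a.end(); ++it)
        *it = 0;

    return 0;
}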
There is another reason to avoid <, <=, >, and >=: these operators are not general across all iterator types. For example, look again at our vector named customers, and see how we might write the contents of that vector on the standard output stream:
std::vector<Customer>::iterator it;
for (it = customers.begin(); it != customers.end(); ++it)
    std::cout << *it << std::endl;
In this example, we use != instead of < to compare it to customers.end(). We have already argued that it is wise to use != instead of < so that we do not need to think about whether to use < or <=. What we are saying now is that the other advantage of using != is that this code would continue to work even if we were to use a list instead of a vector:
// We can write a list as easily as we can write a vector
// because we used != instead of <.
std::list<Customer> customer_list;
std::list<Customer>::iterator it;
for (it = customer_list.begin(); it != customer_list.end(); ++it)
    std::cout << *it << std::endl;
The point is that unlike vector iterators, list iterators support only the == and != comparisons. Therefore, the only way to express the bounds of a list is by using values that the == and != comparisons can detect. By restricting iterations to these operations, then, we gain not only safety but also flexibility.
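To see the flexibility, consider a small function template of our own that prints every element of a container; because it compares iterators only with !=, it compiles unchanged for both vector and list:

#include <iostream>
#include <list>
#include <vector>

template <class Container>
void print_all(const Container& c)
{
    for (typename Container::const_iterator it = c.begin(); it != c.end(); ++it)
        std::cout << *it << '\n';
}

int main()
{
    std::vector<int> v(3, 7);    // three elements, each 7
    std::list<int> l(2, 9);      // two elements, each 9

    print_all(v);                // vector iterators support <, but we don't need it
    print_all(l);                // list iterators support only == and != comparisons
    return 0;
}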
Summary
We have discussed three ways of avoiding bounds overruns:
- Using data structures that allow bounds to change before they can be overrun.
- Using inserters to allow output sequences' bounds to grow as necessary.
- Stopping at the right place in an input sequence.
By far the most important of these strategies is to use standard library containers instead of built-in arrays. Particularly useful is starting with an empty container and using push_back to append elements to the container as needed.
If your program needs to transfer data from one container to another, it is often wise to use an inserter to represent the transfer's destination. Using an inserter will create new container elements instead of overwriting elements that are already there and will cause the container's bounds to grow automatically to accommodate those elements.
Finally, once you have a container full of data, you can avoid reading past the end of the container by using asymmetric bounds, which you get automatically if you express those bounds with begin and end, and by using only the == and != operators to check whether you have reached the end.
Note
[1] Of course, there is still the problem that the first value of i in the loop is one instead of zero, but that is not a bounds overrun error. Indeed, it might not even be an error at all. In any event, if we use library containers in place of arrays, we will write a.begin() instead of 0, solving that problem too.
Andrew Koenig is a member of the Large-Scale Programming Research Department at AT&T's Shannon Laboratory, and the Project Editor of the C++ standards committee. A programmer for more than 30 years, 15 of them in C++, he has published more than 150 articles about C++ and speaks on the topic worldwide. He is the author of C Traps and Pitfalls and co-author of Ruminations on C++.
Barbara E. Moo is an independent consultant with 20 years' experience in the software field. During her nearly 15 years at AT&T, she worked on one of the first commercial projects ever written in C++, managed the company's first C++ compiler project, and directed the development of AT&T's award-winning WorldNet Internet service business. She is co-author of Ruminations on C++ and lectures worldwide.