Introduction
This article was inspired by a question on Usenet:
The binary_search function takes two iterators and a value, and returns a bool that says whether the iterators specify a range that contains an element with the given value. However, if binary_search finds such an element, it doesnt tell me where the element is. How can I find out the location of the element?
Thinking about this question, and how to answer it, helped us clarify our own understanding of binary search, and how the C++ Standard library implements them, so we want to take this opportunity to share that understanding with our readers.
As it happens, the library offers a binary-search function that solves part of the problem: if the sequence contains an element with the requested value, the function returns an iterator denoting the element. However, we shall see that there are a few subtleties: the function is not named binary_search, as the poster expected, and its return value is a bit more complicated than the poster might have imagined.
We will begin this article with a brief look at what a binary search is and how it works. Then we will don our library-designers hats and explore how best to make binary searches available to library users. In doing so, well see that the standard library provides several binary-search functions, all of which modify the classical binary-search algorithm slightly for a better fit to the C++ Standard librarys constraints. Finally, we will look at how to apply what we have learned when using the library to do binary searches.
Requirements
A binary search is a neat piece of everyday magic: it finds a given value in a collection without looking at the whole collection. We might think such a task impossible had we never looked for a word in a dictionary. For example, if we are looking for the word sandbox in a dictionary, and we open the dictionary to a page marked rakerapier, we know that the word we seek is further on in the dictionary, and we can disregard the entire part of the dictionary that precedes the page in front of us.
The key to making a binary search possible is that the sequence in which it searches is already ordered. The fundamental idea is to start with a pair of locations that represent the (sorted) range in which the value sought must appear if it appears at all. In C++, we would use iterators to represent such a range and follow the usual convention that one iterator represents the first element of the range and the second represents one element past the end of the range. We might call these iterators low and high, and the value sought x, with the notion that if x is in the sequence at all, it must be in the range [low, high).
Binary-search algorithms can take many forms. One classic form repeats the following steps:
- If the range is empty that is, if low == high then we didnt find x, because we know that if x is anywhere in the sequence, it is in [low, high), and we have just determined that [low, high) is empty.
- Otherwise, we compute the location of an element, which we shall call mid, that is approximately halfway between low and high. If low and high are close together, mid might be equal to low, but mid must not be equal to high because high does not necessarily denote an element. The following picture may be helpful:
- At this point, there are three possibilies:
- The element at mid is equal to x, in which case weve found the value that we seek.
- The element at mid is greater than x, in which case, the value we seek must be in the range [low, mid) (not [low, mid], because the element at mid is greater than x, so we can exclude that element from the range.
- The element at mid is less than x, in which case, the value we seek must be in the range [mid+1, high).
In either of the two latter cases, we adjust the values of low and high appropriately and continue the iteration until we find the value or reduce the range to nothing. Of course, if we reduce the range to nothing, then we couldnt find the element, and we can conclude that the element was not in the original sequence.
This algorithm imposes several requirements on the sequence. We have already noted the most important requirement, which is that the sequence be ordered. Another is that it be possible to compute mid efficiently, which is easy if our iterators are random-access iterators [1].
Translating these requirements into a C++ implementation, we can see that normally binary searches will be most useful with sequences that dont change very often. For example, it is fairly common to keep a collection of sorted values in a vector. In that case, searching for a value is quick, but adding a new value is slow because it involves copying the part of the vector that follows the insertion (and might involve copying the whole vector if reallocation is necessary).
What about the notions of greater, less, and equal? Standard library algorithms generally try to minimize the requirements that they impose on their users, and these three relations are closely intertwined. In fact, the standard library defines binary searches entirely in terms of a single relation, which we can think of as being less than. It infers the nature of the other two relations from this one. We can name this relation anything we like; the default is <.
There are several reasons for this design that are worth understanding. First, by requiring only one rather than three relations, the library makes it easier to define types that can be the subject of binary searches. Second, by relying on only one relation, the library makes it easy to provide a named function for the relation. If algorithms needed both less and equal, a user who wanted to provide named functions for these operators would have to provide two function parameters instead of one. In addition to being a nuisance, this requirement would lead to problems when users provided the functions in the wrong order, or when the functions were mutually inconsistent. Moreover, as well see in the next section, it turns out that we can eliminate one of the two tests on each iteration when we rely solely on the less relation.
The final reason is perhaps the most subtle. By relying solely on an appropriate < relation (which we might choose to give a different name) the library allows for a more flexible definition of equality than might be possible if we were to use the usual == relation. The library assumes that < is defined on the elements of the sequence, and that for any two elements x and y, either x < y, y < x, or x and y are effectively equal. We say effectively here because they do not have to be truly equal. For example, < might represent case-insensitive comparisons between strings. There is no problem with the fact that reading and Reading are two different strings, as long as you dont care which one of them you get when you search for either of them.
Accordingly, the standard library binary-search algorithms never use the notion of equality explicitly, and we shall have to use a slightly different algorithm than the classical one we have already described. Before we look at the algorithm, however, we should think about what information we would like to be able to obtain from a binary search.
Fundamental Operations
Suppose you are searching a sequence for a particular value. What will you do next if you find it? What if you dont find it? We can think of five possibilities:
- If you find the value, that fact might be enough by itself. In other words, you might care only that you found it; its location in the sequence might be irrelevant. For example, the sequence might represent all the people who are authorized to use your system, a system in which you care only whether a particular person is authorized and dont need any other information about the people.
- Similarly, if you dont find the value, that fact might also be enough by itself.
- If you find the value, knowing its location might be necessary for further processing. Like the person who posted the Usenix question that opened this article, you might want to access the element once it was found. Continuing our previous example, the representations of your authorized people might include additional information about them information that doesnt participate in the < comparison and you might want to use that information once youve found it.
- If the value sought isnt in the sequence, you might want to know where to insert it so that it will be there the next time.
- There is one other possibility, which the person who asked the original Usenet question did not mention: there might be more than one element with the given value. For example, dictionaries have multiple entries for the same word. For that matter, you might want to insert the value into a sequence even though it already has an element with that value. In either of these cases, you will want to know the range of elements that contain the given value.
What kind of functions might a library define to cater to these various possibilities? What would be their arguments and return values? The arguments are simple enough: two iterators that define the sequence to search, a value to seek, and an optional comparison operation. But what about the return value?
In the first two cases, its pretty obvious: all we need to return is whether or not the value sought was found. For this usage, the library provides a function named binary_search, which returns a bool that indicates whether an element with the given value was found.
The other cases are more complicated, because on the surface, it appears to be necessary to return more information to the caller than a single iterator can contain.
Before looking at how the library solves this problem, lets take a deeper look at the remaining cases. Suppose, for example, that we intend to insert the value into the sequence if it is not already there, and we want to use an element with the given value if we find it. This example subsumes cases (3) and (4), as well as the case posed by the Usenet post that motivated this article. Evidently, the search function, whatever it is, needs to return an iterator that refers to the given element if one exists, and one that refers to the appropriate insertion point otherwise. Moreover, for a search function to be useful in this context, it has to return an additional piece of information: a value that indicates whether the iterator is referring to an element equal to x or to a prospective insertion point for such an element.
Suppose, for example, that iter is the type of our iterators, and c is the container that contains the sequence in which we are searching. We might imagine writing something like this:
pair<iter, bool> search_result = <I>search-function-1</I>(c.begin(), c.end(), x); if (search_result.second) { // search_result.first contains an appropriate iterator } else c.insert(search_result.first, x);
Our hypothetical search function therefore returns two values: an iterator and a bool that says whether or not that iterator refers to an element with the sought value.
This interface is slightly messy: to use it, you must create a local variable to hold the two results and then inspect both of them. Moreover, in principle the bool is redundant in the sense that the program can figure out what its value must be without looking at it directly. For example, we might have a search function that returns only the iterator and doesnt even give us a clue as to whether that iterator refers to an element or only to a place where we could insert the element. In that case, we might write code that looks like this:
iter search_result = <I>search-function-2</I>(c.begin(), c.end(), x); if (search_result != c.end() && *search_result == x) { // search_result contains an // appropriate iterator } else c.insert(search_result, x);
This second usage example is only slightly more complicated than the first. Moreover, if we want to insert x into the sequence in all cases, our code becomes still simpler:
c.insert(<I>search-function-2</I>(c.begin(), c.end(), x), x);
Apparently the functions to handle cases (3) and (4) need to return only a single iterator, which the user may test to distinguish whether the iterator return denotes the element sought or the insertion point for that element.
There is a subtle additional reason why search-function-2 is more fundamental than search-function-1: it is possible for the search function to locate the appropriate point to insert a given value without even knowing whether an element with that value exists! As well see, the standard library functions that perform binary searches all use this modified algorithm.
The algorithm begins as before, with the low and high iterators but this time, we think of the insertion point we seek as being in the range [low, high] rather than thinking of an element being in the range [low, high). In other words, we are now thinking about iterator values that include the off-the-end iterator, not just about iterators that refer to elements, low == high; the range has a single iterator value, and that value is the insertion point that we seek. This insertion point might be off the end if x is greater than any element that is already in the sequence.
We are looking for the right place to insert x a point with the property that all elements before that point are strictly less than x, and all elements after that point are greater than or equal to x. The algorithm to do so works as follows:
- If low == high, were done; the result is either low or high. Note that the == here is a comparison between iterators, not between container values, so it uses an operation that is always available to us because of the definition of iterators.
- Otherwise, compute mid as before to be about halfway between low and high, being permitted to be equal to low but not high. Because mid is not equal to high, we know that mid refers to an element.
- There are now only two cases:
- The element at mid is less than x, in which case the insertion point must be in the range [mid+1, high].
- Otherwise, the insertion point must be in the range [low, mid].
Perhaps surprisingly, doing without == for the value type has made our algorithm simpler. In particular, it is no longer necessary to do two comparisons each time through the loop. Because of this greater simplicity, and because it is no longer ever necessary to do equality comparisons, the standard library offers search-function-2, not search-function-1. It does so under the name lower_bound, which returns an iterator that indicates the first point at which it is possible to insert the given value in the sequence without disturbing the order.
As you might suspect from the name, there is also an upper_bound function, which returns the last possible insertion point. There is also a function named equal_range, which returns both insertion points as a pair of iterators. These last two operations cater to our fifth case above, handling properly the possibility that there might be more than one instance of the element were looking for in the sequence.
Finally, as weve already noted, if you want to know whether a given value appears at all, and you dont care where it appears, there is a function named binary_search that returns an appropriate bool. This function can use the same modified algorithm we just described to find where the element would be if it existed. It then tests the element at that point (if there is one) to determine whether the element is the one we seek or not. If so, it returns true; otherwise, it returns false.
Examples
In the following examples, we assume that we are looking for a value x in a container c that has iterators of type iter.
- Starting with our posters query, we can find the location of the element x if it exists in c by testing the return value from lower_bound:
iter search_result = lower_bound(c.begin(), c.end(), x); if (search_result != c.end() && *search_result == x) // process *search_result as necessary
This program fragment has one problem: it relies on !=, when it ought to be relying only on < for consistency with the standard library itself. We can remedy that problem by noting that if search_result refers to an element, that element cannot possibly be less than x. If it were, inserting x at that point would disrupt the ordering of the sequence, but the definition of lower_bound is that it returns a point at which inserting x will not cause disruption. Therefore, the comparison *search_result == x must have the same result as the comparison !(*search_result > x). Because we wish to rely only on <, we will rewrite this sub-expression as !(x < *search_result), leading to the following program fragment:
iter search_result = lower_bound(c.begin(), c.end(), x); if (search_result != c.end() && !(x < *search_result)) // process *search_result as // necessary
Does x appear in c? if (binary_search(c.begin(), c.end(), x)) { // x is in c } else { // x is not in c }
Insert a copy of x in c. If elements with value x are already there, the newcomer will appear before all of them. c.insert(lower_bound(c.begin(), c.end(), x), x);
Insert a copy of x in c. If elements with value x are already there, the newcomer will appear after all of them. c.insert(upper_bound(c.begin(), c.end(), x), x);
Insert a copy of x into c if and only if a copy is not already there. iter search_result = lower_bound(c.begin(), c.end(), x); if (search_result == c.end() || *search_result != x) c.insert(search_result, x);
As before, we can make this code fragment more consistent by avoiding the use of !=:
iter search_result = lower_bound(c.begin(), c.end(), x); if (search_result == c.end() || x < *search_result) c.insert(search_result, x);
Process all of the elements of c that are equal to x: pair<iter, iter> range = equal_range(c.begin(), c.end(), x); for (iter i = range.first; i != range.second; ++i) { // process the element at *i }
Conclusion
Part of the difficulty in defining and using binary-search functions is that the search yields more information than will fit easily in a single value, because when the search fails, it is useful to know the point at which the value sought would have appeared. In other words, if we are searching for a value in an n-element sequence, there are 2n+1 distinct plausible results: any element in the sequence, the point just before any element, and the point just after the last element. There are only n+1 distinct iterator values for such a sequence, so it is necessary to return additional information somehow.
In our hypothetical search-function-1, we chose to return an iterator and a bool. This choice would have allowed us to return 2n+2 possible values, but at the cost of being ungainly to use.
The standard library does something cleverer: by returning only the point at which an element might be inserted, it reduces the number of possible return values to n+1 exactly the number of values that an iterator can describe. It appears at first that this strategy would lose information, but when we set out to define an appropriate algorithm, we found that because the library uses only < and not ==, that information was never there in the first place.
Accordingly, the library offers two kinds of binary-search function: a family of functions that locates an appropriate insertion point for a value, and a function that determines whether that value is present in the sequence without saying where. In addition, it caters in three different ways to the possibility that a value might appear more than once: the upper_bound function yields the last possible insertion point to go with the lower_bound functions first possible insertion point, and the equal_range function locates all of the elements with a given value at once.
Note
[1] It turns out that it is possible to do a binary search reasonably efficiently given only forward iterators, provided that the cost of incrementing an iterator is enough less than the cost of comparing elements. However, it requires different algorithms, and doesnt affect the main point of the article, so were mentioning it only in a footnote.
Andrew Koenig is a member of the Large-Scale Programming Research Department at AT&Ts Shannon Laboratory, and the Project Editor of the C++ standards committee. A programmer for more than 30 years, 15 of them in C++, he has published more than 150 articles about C++ and speaks on the topic worldwide. He is the author of C Traps and Pitfalls and co-author of Ruminations on C++.
Barbara E. Moo is an independent consultant with 20 years experience in the software field. During her nearly 15 years at AT&T, she worked on one of the first commercial projects ever written in C++, managed the companys first C++ compiler project, and directed the development of AT&Ts award-winning WorldNet Internet service business. She is co-author of Ruminations on C++ and lectures worldwide.