When I told a friend, "You know, I plan on writing an article on efficient sorting and searching in C++," he replied with badly dissimulated disappointment: "What? Weren't those issues put to rest a long time ago?"
And boy does he have a point. After all the seminal work of Hoare [1], after Donald Knuth's fantastic work in The Art of Computer Programming [2], after Stepanov and Musser's contribution (specifically applied to C++) [3], after Plauger's refinements [4], and after all the work done by so many other people on the subject of sorting and searching, it would appear that there's really nothing new to say on sorting and searching in general, or on generic incarnations thereof in C++.
The interest in the subject is not incidental. It is estimated that programs spend up to 80% of their non-I/O time searching. This means that better methods for searching data, and for sorting it to make it easier to search, are always welcome.
After five years of industry experience with implementing the standard library, you would think that every possible bit of efficiency has been squeezed out of today's world-class implementations of std::sort, std::find, std::lower_bound, std::binary_search, and the like.
Not quite. This article offers some tips, accompanied by measurements, on improving generic sorting and searching in C++. Wishful thinking leads me to also consider it an open suggestion for the few standard library implementers out there. The numbers are quite interesting: under certain circumstances, you can get as much as double mileage over current implementations of the Standard. That, however, doesn't always happen, nor with all kinds of data.
So if you are itching for faster sorting and searching, you are in the right place. If you enjoy generic programming, you are also in the right place. If you enjoy expanding your horizons with new ideas that don't have anything to do with sorting and searching, again you are in the right place. That covers pretty much everyone, so by now you should feel really compelled to read, and guilty not to.
Guess I could sell used cars.
How to Send a Christmas Card to Yourself
Let's start with the easiest routine of all: linear search.
In a popular British TV comedy routine, a comedian sends himself a Christmas card. That way, if no one else sends him a card, at least he has received one.
If you use an instant messaging program (e.g., Yahoo Messenger, MSN Messenger), you can add your own ID as a friend. Whenever you log on, you know there's at least one friend online for you. (Yeah, I actually tried that.) Never feel lonely again!
How about applying the same principle to linear searching? You plant the element that you are looking for directly in the sequence to be searched. If there was none before, there is one now.
Why would one choose to do this? Let's review the classic generic linear search algorithm, the one that you can write with your toes while sleeping:
template <class Iter, class ValueType>
Iter find(Iter from, const Iter to, const ValueType& val)
{
    for (; from != to && !(*from == val); ++from)
    {
    }
    return from;
}
The above code is identical to most of today's STL implementations of std::find. We will use it as a reference for evaluating alternative implementations. That code is as simple as it gets, partly because of the clever STL iteration style that we all know and love: ranges are closed to the left and open to the right [begin, end) so there are no contortions needed whatsoever. You iterate while the condition is met, and then you can tell how far you went.
Given that things are so clear and simple, you might ask, how could it possibly be any better? Well, the answer is: send yourself a Christmas card, so you'll at least receive one! Look at the condition. On each iteration, you test for two things: "end of range" and "found." The "end of range" test means, eek, what if the whole range was spanned without finding anything? But if you know there is at least one match, you don't have to test for end of range, because you are confident you'll find one.
If comparing iterators is not much cheaper than comparing values, the savings can be significant. On most machines, for example, comparing two pointers costs exactly as much as comparing two integers. This means that if you're searching for integers, you could cut the search time in half.
So, an idea would be to plant the value you are looking for instead of the last element of the sequence. Save that last element of the sequence in a temporary. Search with only one comparison per iteration, because you are sure to never exhaust the whole range without finding at least one occurrence. When the search ends, you can sort things out.
We'll put all the sample code (also available for download) in namespace Achilles, named after the famed fast runner of Greek legend. Ironically, in spite of his speed, he'd be unable to overtake a tortoise given only a 100-step head start [5]. For completeness, there's even a function Heel in the Achilles namespace, so if you call Achilles::Heel, you know what to expect. Anyhow, Achilles is a bit of a misnomer: the code for this article comes straight from YASLI (Yet Another Standard Library Implementation), the legendary portable STL implementation that puts absolutely all other implementations on welfare by being overwhelmingly faster.
// Inside namespace Achilles, actually yasli
template <class Iter, class ValueType>
Iter mutative_find(Iter b, Iter e, const ValueType& v)
{
    using std::swap;
    if (b == e) return e;
    typename std::iterator_traits<Iter>::value_type tmp(v);
    --e;
    swap(*e, tmp); // plant the sought value at the end
    for (; !(*b == v); ++b)
    {
    }
    swap(*e, tmp); // restore the original last element
    if (b == e && !(v == *b)) ++b; // false positive?
    return b;
}
The name of the new routine is mutative_find, because it mutates the sequence being searched. Besides, "mutative" lends a Hollywood-scary-movie atmosphere to an, oh, so geeky article.
So here's how it all works: first off, empty sequences are handled as a degenerate case. Then, a temporary is created holding the value you're looking for and then swapped with the last element in the sequence. Effectively, the temporary holds the old last value, and the last value of the sequence holds the element you're looking for. Now you can search with only one test per iteration. When the search ends (necessarily), you have to swap back the last element in place (so you don't leave the sequence mutated). Finally, one more little test distinguishes between a false positive and a genuine match for the last element. That's it!
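To make the contract concrete, here's a quick usage sketch (assuming mutative_find is declared in namespace Achilles, as above). It behaves exactly like std::find, and the sequence comes out untouched:

#include <vector>
#include <cassert>

void test_mutative_find()
{
    std::vector<int> v;
    v.push_back(3);
    v.push_back(1);
    v.push_back(4);
    // A hit somewhere in the middle
    assert(*Achilles::mutative_find(v.begin(), v.end(), 1) == 1);
    // A miss returns the end iterator, just like std::find
    assert(Achilles::mutative_find(v.begin(), v.end(), 9) == v.end());
    // The sequence is left exactly as it was
    assert(v[0] == 3 && v[1] == 1 && v[2] == 4);
}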
Achilles::mutative_find is predictably faster than the run-of-the-mill find implementation, but it also imposes quite a few more requirements on the types it operates on. Let's enumerate them:
- The iterators must be at least bidirectional. Decrementing e is instrumental to the algorithm's functioning, so input and forward iterators are out of the question.
- The sequence searched must be mutable. Oops. That is, if you search through a range of constant elements (say, you deal with const_iterators), Achilles::mutative_find won't compile.
- The type iterated by Iter must accept a non-throwing swap. If swap could throw, a very unpleasant thing might happen: before you search, you mutate the last element of the sequence, and after the search, as you try to put the toothpaste back in the tube, the last call to swap throws an exception. So, you end up with a failing function, which is pretty bad, and with unpredictably changed data (what caller would think searching through a range would change that range?), which is, to put it mildly, a catastrophe of epic proportions.
- Compared to the reference find implementation, there's one more copy in sight. Achilles' routine makes a copy of the searched object so it can swap it with the last element of the sequence. That shouldn't be a problem in and of itself, but if the copy operation might throw an exception, Achilles::mutative_find is not Standard-compliant anymore. That's not as bad as the apocalyptic scenario above, because if Achilles::mutative_find fails, at least it didn't change your data. Still, it's easy to imagine situations in which you have data that you'd enjoy searching through, but that's extremely awkward to copy unwittingly. Think, for example, of searching for a File in a vector of Files, or a DatabaseConnection in a... you get my drift.
In conclusion, Achilles::mutative_find is fast, but not universal. On the other hand, you don't want to put everyone at a disadvantage for the sake of some. To exploit speed when possible, while still keeping generality, the algorithm we're looking at is:
if (Iter is bidirectional or random &&
    range is mutable &&
    Iter's value type supports non-throwing swap &&
    Iter's value type's copy constructor doesn't throw)
{
    ... cool, do it the fast way ...
}
else
{
    ... alright, do it the conservative way ...
}
Now, honestly, all that business with the non-throwing swap and the copy constructor that won't throw is just too vague to be convincing. Ladies and gentlemen, I'd like to make a motion to reduce this notion to the coarser, yet clearer, requirement: Iter's value type must be a POD (Plain Old Data) type. PODs don't throw at all, so creating temporaries is okay, plus the default std::swap will work for them just fine.
Seems like there are quite a few situations where Achilles isn't quite that fast, so we should be looking at ways to improve the "conservative" way.
Duff Hits Again
A past installment of "Generic<Programming>" [6] uses Duff's device to improve the speed of filling and copying memory, with excellent results. If you're not familiar with Duff's device, you may want to skim that article for a description. Now is the time to put the same device to work for linear searching.
// Inside namespace Achilles
template <class Iter, class ValueType>
Iter duff_find(Iter b, const Iter e, const ValueType& v)
{
    switch ((e - b) & 7u)
    {
    case 0:
        while (b != e)
        {
            if (*b == v) return b; ++b;
    case 7: if (*b == v) return b; ++b;
    case 6: if (*b == v) return b; ++b;
    case 5: if (*b == v) return b; ++b;
    case 4: if (*b == v) return b; ++b;
    case 3: if (*b == v) return b; ++b;
    case 2: if (*b == v) return b; ++b;
    case 1: if (*b == v) return b; ++b;
        }
    }
    return b;
}
Duff's device is a loop unrolling technique that can be applied to a variety of tasks, including, well, linear searching. The initial switch statement places execution flow inside the loop at the exact point that will smoothly handle the termination condition.
From an efficiency viewpoint, Achilles::duff_find performs eight times fewer iterator comparisons than the reference find implementation. This places its efficiency somewhere in between Achilles::mutative_find (which does no repeated iterator comparison) and the reference find.
Let's analyze what strain duff_find puts on Iter and its corresponding value type. Very nicely, duff_find doesn't ask for extra copies, swaps, or any other tricks. However, the expression e - b reveals that the difference between two iterators must be computed presto, so Iter must be a random-access iterator, period.
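As a quick illustration (a sketch, again assuming duff_find is declared in namespace Achilles as above), pointers into a const array qualify nicely: they are random-access iterators, and duff_find neither copies nor swaps elements, so const data is fine:

#include <cassert>

void test_duff_find()
{
    const int primes[] = { 2, 3, 5, 7, 11 };
    // Pointers are random-access iterators; const elements are fine here
    const int* p = Achilles::duff_find(primes, primes + 5, 7);
    assert(p - primes == 3);
    // A miss yields the "to" iterator, just like std::find
    assert(Achilles::duff_find(primes, primes + 5, 13) == primes + 5);
}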
Ultimately, this leaves us with the following outline of the best-of-the-breed find implementation:
if (Iter is bidirectional or random &&
    range is mutable &&
    Iter's value type is a POD)
{
    ... cool, do it the fast way ...
}
else if (Iter is random)
{
    ... still cool, do it the Duff way ...
}
else
{
    ... alright, do it the conservative way ...
}
That's a definite improvement. I'm sure there are more techniques that would apply to other kinds of sequences, but this sounds like a great exercise for the reader. For now, let's see how to dispatch to one of the three algorithm implementations we now have.
Choosing Algorithms at Compile Time
Only a couple of years ago, this section on type manipulation would have been the meat of the article. "Wow, you can detect that a type is const, how cool!" Nowadays, however, there are known techniques and libraries that explain all the magic, so all we have to do is put it to work.
We detect Iter's category by looking at std::iterator_traits<Iter>::iterator_category, which can be one of std::input_iterator_tag, std::forward_iterator_tag, std::bidirectional_iterator_tag, or std::random_access_iterator_tag.
We could use Loki's type traits facility to play hide-and-seek with types [7], but quite frankly, I prefer Boost's [8]. Boost's type traits are more complete and better designed. (So in correlation with the "Not Invented Here" syndrome, yours truly must be suffering from the "Invented Here" syndrome.)
With Boost's type traits, we can detect whether, for example, Iter is random by checking the compile-time constant:
boost::is_same<
    typename std::iterator_traits<Iter>::iterator_category,
    std::random_access_iterator_tag>::value
To detect whether the sequence pointed to by Iter is mutable, we can look at std::iterator_traits<Iter>::pointer. If that pointer type is not a pointer to const, then the sequence is mutable. We detect const-ness like this (say PointerType is the pointer type):
boost::is_const<
    typename boost::remove_pointer<PointerType>::type>::value
Finally, Boost also provides POD detection; you write:
boost::is_POD<
    typename std::iterator_traits<Iter>::value_type>::value
Now that we have all the information in Boolean form, we're halfway there. Simple if statements wouldn't work for selecting the appropriate algorithm, because the compiler will try to compile both branches of the if statement, even though one branch will never be executed. We need to use a different idiom, which uses templates combined with overloading, to select the appropriate algorithm. Refer to [9] for details.
In essence, what we'll do is compute a compile-time integral constant: 0 for the straight algorithm, 1 for the Duff algorithm, or 2 for the mutative algorithm. Then, we'll invoke the appropriate algorithm by using the Loki::Int2Type class template.
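If you don't have Loki handy, Int2Type is trivial to replicate; a minimal stand-in is essentially this, mapping each integral constant to a distinct type that overload resolution can then discriminate on:

// A minimal stand-in for Loki::Int2Type: each constant becomes
// its own type, usable for selecting overloads at compile time
template <int v>
struct Int2Type
{
    enum { value = v };
};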
So here's what the dispatcher looks like:
template <class Iter, class ValueType>
Iter find(Iter from, Iter to, const ValueType& val)
{
    using namespace std;
    typedef typename iterator_traits<Iter>::iterator_category
        Category;
    enum { isBidir = boost::is_same<Category,
        bidirectional_iterator_tag>::value };
    enum { isRand = boost::is_same<Category,
        random_access_iterator_tag>::value };
    typedef typename iterator_traits<Iter>::pointer PointerType;
    typedef typename boost::remove_pointer<PointerType>::type
        IteratedType;
    enum { isMutableSeq = !boost::is_const<IteratedType>::value };
    // Don't reuse the name ValueType -- it's already a template parameter
    typedef typename iterator_traits<Iter>::value_type IteratedValue;
    enum { isPod = boost::is_POD<IteratedValue>::value };
    enum { selector = (isBidir || isRand) && isPod && isMutableSeq
        ? 2
        : (isRand ? 1 : 0) };
    Loki::Int2Type<selector> sel;
    return find_impl(from, to, val, sel);
}
I know there's some clutter in there, but that's the way it is. Funny how only the last line actually does something; all the rest just sets the stage.
There are three overloads of find_impl. Here's a synopsis of them:
// Inside namespace Achilles
template <class Iter, class ValueType>
Iter find_impl(Iter from, const Iter to, const ValueType& val,
    Loki::Int2Type<0>)
{
    ... conservative find implementation ...
}

template <class Iter, class ValueType>
Iter find_impl(Iter from, const Iter to, const ValueType& val,
    Loki::Int2Type<1>)
{
    ... Duff-based find implementation ...
}

template <class Iter, class ValueType>
Iter find_impl(Iter from, Iter to, const ValueType& val,
    Loki::Int2Type<2>)
{
    ... mutative find implementation ...
}
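For instance, the Int2Type<2> overload can simply forward to the mutative implementation shown earlier (a sketch; the downloadable code may organize this differently):

// Inside namespace Achilles (sketch): dispatch target for the mutative case
template <class Iter, class ValueType>
Iter find_impl(Iter from, Iter to, const ValueType& val,
    Loki::Int2Type<2>)
{
    return mutative_find(from, to, val);
}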
Well, so much for the implementation. On to the next step.
Measurements
Making good speed measurements is hard. Without much pretense, here are some results that I obtained with some simple test code. The code runs a repeated number of calls to Achilles::find or std::find over an array of integers and reports the average time spent per element searched.
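For the curious, the harness was along these lines (a hypothetical minimal sketch, not the actual downloadable test code): fill an array, repeatedly search for a value that isn't there so every call scans the full range, and divide the elapsed time by the total number of elements visited.

#include <algorithm>
#include <cstddef>
#include <cstdio>
#include <ctime>
#include <vector>

int main()
{
    const std::size_t size = 2048;
    const unsigned repeats = 100000;
    std::vector<int> data(size);
    for (std::size_t i = 0; i != size; ++i) data[i] = int(i);
    const int needle = -1; // absent, so each search spans the whole range
    const std::clock_t start = std::clock();
    for (unsigned r = 0; r != repeats; ++r)
    {
        // Swap in Achilles::find here to measure the improved version
        if (std::find(data.begin(), data.end(), needle) != data.end())
            return 1; // never taken; defeats over-eager optimizers
    }
    const double seconds =
        double(std::clock() - start) / CLOCKS_PER_SEC;
    std::printf("%g ns/element\n",
        seconds * 1e9 / (double(repeats) * double(size)));
    return 0;
}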
Elements searched   std (ns/element)    Achilles (ns/element)   Improvement (%)
                    MSVC      MWCW      MSVC      MWCW          MSVC      MWCW

int
    128             1.640     1.637     0.910     1.189         80.22     27.37
    512             1.570     1.554     0.753     1.063         108.50    31.60
  2,048             1.550     1.535     0.715     1.031         116.78    32.83
  8,192             1.548     1.530     0.755     1.024         105.03    33.07
 32,768             1.557     1.536     0.760     1.024         104.87    33.33
131,072             1.662     1.988     0.885     1.248         87.80     37.22
524,288             2.884     2.874     2.884     2.874         0.00      0.00

const int
    128             1.640     1.637     1.234     1.225         32.90     25.17
    512             1.570     1.554     1.099     1.084         42.86     30.24
  2,048             1.550     1.535     1.067     1.060         45.27     30.94
  8,192             1.548     1.530     1.153     1.157         34.26     24.38
 32,768             1.557     1.536     1.158     1.170         34.46     23.83
131,072             1.662     1.988     1.290     1.472         28.84     25.96
524,288             2.884     2.874     2.880     2.880         0.14      -0.21
The test was performed with two compilers: Microsoft Visual C++ .NET 7.1 alpha ("For evaluation purpose only, not for production use") and Metrowerks CodeWarrior 7.0. The elements searched were ints (exercising the mutative algorithm) and const ints (exercising the implementation based on Duff's device).
For large amounts of data, there is no improvement at all. This effect is discussed in [6]: when operating on very large ranges of memory, the speed is limited by memory bandwidth, not by the actual operations going on inside the processor. Your data is much bigger than your cache, so you end up hitting that memory bus hard. Because of the slow speed of that bus compared to the processor and the cache, large-block memory operations get more of an I/O feel. So if your task is to occasionally search over huge amounts of data, the techniques presented in this article won't help much.
However, for repeated searches on small and medium amounts of data, our effort wasn't in vain; the table shows consistent improvements of Achilles over std when searching integers. The savings, however, are not as significant for all types of data. Don't forget that what we save is iterator comparisons. If the value types are much more expensive to compare than the iterators, then the savings will be insignificant. If, however, you have fast value-type comparisons and/or costly iterator comparisons, then you'll enjoy considerable benefits.
Random Thoughts
Was it worth it? The answer, as always, is it depends. If you are searching through a vector of matrices, you won't notice any improvement at all: the price of comparing matrices will eclipse the price of comparing iterators. Or, if you write code for a machine with limited program memory, the increased size of Achilles::find might play against you.
Wasn't it Donald Knuth who said radix omnia malorum prematurae optimisatia est? Well, I couldn't agree more. But when it comes to libraries, especially the standard library, I have a slightly different view.
Often, a library writer doesn't know exactly how that library will be used. The library might be used for mild tasks or in a speed-hungry context. While a fast library can be used for mellow stuff, a slow library can't be used in inner loops. My opinion in the matter is: bottom line, a library writer's duty is to offer as good a set of tradeoffs as possible.
Such thoughts apply all the more to the standard library, which is likely to be used by virtually all C++ applications. In addition, a standard library implementation often doesn't have to worry about portability. You expect to find, in a world-class implementation of the standard library, "surprising" little nuggets of magic that mere application programmers wouldn't have thought of. The standard library should be the ideal towards which we all enviously gravitate.
So far, only one std implementation passes that litmus test: the legendary YASLI.
But wait until the next column, which discusses generic sorting. We'll talk about a new algorithmic technique, and we'll implement together a sorting algorithm. We'll compare that with a couple of current std::sort implementations.
Acknowledgments
Many thanks are due to Florin Trofin, Dang-Thao Truong, and Attila Feher, who provided thorough reviews.
Bibliography and Notes
[1] C. A. R. Hoare. "Quicksort," Computer Journal, 1962, 5, 10-15.
[2] D.E. Knuth. The Art of Computer Programming, Vol. 3: Sorting and Searching (Addison-Wesley Longman, 1998).
[3] D. Musser. The Standard Template Library (Rensselaer Polytechnic Institute, 1994).
[4] P.J. Plauger. "Standard C/C++: A Better Sort," C/C++ Users Journal, October 1999.
[5] <http://web.mit.edu/18.02-esg/www/notes/zeno.pdf>
[6] A. Alexandrescu. "Generic<Programming>: Typed Buffers (II)," C/C++ Users Journal C++ Experts Forum, October 2001, <www.cuj.com/experts/1910/alexandr.htm>.
[7] A. Alexandrescu. Modern C++ Design (Addison-Wesley, 2001). The Loki library is available at <www.moderncppdesign.com>.
[8] <www.boost.org>. Check the boost/type_traits.hpp header.
[9] A. Alexandrescu. "Generic<Programming>: Mappings between Types and Values," C/C++ Users Journal C++ Experts Forum, October 2000, <www.cuj.com/experts/1810/alexandr.htm>.
Andrei Alexandrescu is a Ph.D. student at the University of Washington in Seattle and the author of the acclaimed book Modern C++ Design. He may be contacted at www.moderncppdesign.com. Andrei is also one of the featured instructors of The C++ Seminar (<http://thecppseminar.com>).