Dr. Dobb's is part of the Informa Tech Division of Informa PLC

This site is operated by a business or businesses owned by Informa PLC and all copyright resides with them. Informa PLC's registered office is 5 Howick Place, London SW1P 1WG. Registered in England and Wales. Number 8860726.


Channels ▼
RSS

C++ Made Easier : Using Library Algorithms


January 2001/C++ Made Easier


Excerpted from Accelerated C++: Practical Programming by Example by Andrew Koenig and Barbara E. Moo, © 2000 by AT&T, Inc., and Barbara E. Moo. Reprinted by permission of Addison Wesley Longman, Inc. All rights reserved.

Introduction

This article is a slightly modified excerpt from our book, Accelerated C++ [1], which became available in September 2000. Our aim in this book is to teach as quickly as possible how to write useful C++ programs. Therefore, we start by explaining the most useful parts of C++. This strategy may seem obvious, but it has the radical implication that we do not begin by teaching a C-like subset of C++. Instead, we use high-level data structures from the start, explaining only later the foundations on which those data structures rest. This approach makes it possible to begin writing idiomatic C++ programs immediately.

Our approach is unusual in another way: We concentrate on solving problems, rather than on exploring language and library features. We explain the features, of course, but we do so in order to support the programs, rather than using the programs as an excuse to demonstrate the features.

Because this book teaches C++ programming, not just features, it is useful not only for beginners, but also for for readers who are already familiar with C++, and who want to use the language in a more natural, effective style. Too often, people new to C++ learn the language mechanics without learning how to apply the language to everyday problems. Our book offers the opportunity to move beyond the mechanics.

We used to teach a week-long intensive C++ course every summer at Stanford University. We originally adopted a traditional approach to that course: Assuming that the students already knew C, we started by showing them how to define classes, and then moved systematically through the rest of the language. We found that our students would be confused and frustrated for about two days — until they had learned enough that they could start writing useful programs. Once they got to that point, they learned quickly.

When we got our hands on a C++ implementation that supported enough of what was then the brand-new standard library, we overhauled the course. The new course used the library right from the start, concentrating on writing useful programs and going into details only after the students had learned enough to use those details productively.

The results were dramatic: After one day in the classroom, our students were able to write programs that had taken them most of the week in the old course. Moreover, their frustration vanished.

The following excerpt comes from a point about a third of the way through the book. In it, we talk about how to use standard-library algorithms to solve everyday programming problems. At this point in the book, we have already explained, and used, standard-library containers and iterators. The part of the book that you are about to see introduces several new algorithms and programming techniques — but its main goal is to make the parts of the library that it describes so comfortable to use that readers will reach for them automatically when they want to solve appropriate problems.

Analyzing Strings

Earlier in the book, we introduced iterators, and gave an example that used a loop to concatenate copies of the elements of a vector<string>, which we called bottom, onto the end of another vector<string>, which called ret:

for (vector<string>::const_iterator
   it = bottom.begin();
   it != bottom.end(); ++it)

   ret.push_back(*it);

We noted that this loop was equivalent to inserting a copy of the elements of bottom at the end of ret, an operation that vectors provided directly:

ret.insert(ret.end(), bottom.begin(), bottom.end());

This problem has even a more general solution: We can separate the notion of copying elements from that of inserting elements at the end of a container, as follows:

copy(bottom.begin(), bottom.end(), back_inserter(ret));

Here, copy is an example of a generic algorithm, and back_inserter is an example of an iterator adaptor.

A generic algorithm is an algorithm that is not part of any particular kind of container, but instead takes a cue from its arguments’ types regarding how to access the data it uses. The standard library’s generic algorithms usually take iterators as their arguments, which they use to manipulate the elements of the underlying containers. So, for example, the copy algorithm takes three iterators, which we’ll call begin, end, and out, and copies all the elements in the range [begin, end) (i.e., the elements between begin and end, including begin but excluding end) to a sequence of elements starting at out and extending as far as necessary. In other words,

copy(begin, end, out);

has the same effect as

while (begin != end) {
   *out++ = *begin++;
}

except that the while statement changes the values of the iterators, and copy doesn’t.

Before we describe iterator adaptors, we should note that this loop depends on the use of the postfix version of the increment operators. These operators differ from the prefix versions in that begin++ returns a copy of the original value of begin, incrementing the stored value of begin as a side effect. Thus,

it = begin++;

is equivalent to

it = begin;
++begin;

The increment operators and * have the same precedence, and they are both right-associative, which means that *out++ has the same meaning as *(out++). Thus,

*out++ = *begin++;

is equivalent to the more verbose

out = begin;
++out;
++begin;

Let’s return to iterator adaptors, which are functions that yield iterators with properties that are related to their arguments in useful ways. The iterator adaptors, such as back_inserter, are defined in <iterator>. The most common iterator adaptor is back_inserter, which takes a container as its argument and yields a kind of iterator that, when used as a destination, appends values to the container. For example, back_inserter(ret) is an iterator that, when used as a destination, appends elements to ret. Therefore,

copy(bottom.begin(), bottom.end(), back_inserter(ret));

copies all of the elements of bottom and appends them to the end of ret. After this function completes, the size of ret will be its old size plus bottom.size().

Notice that we could not call

// error--ret is not an iterator
copy(bottom.begin(), bottom.end(), ret);

because copy’s third argument is an iterator, not a container. Nor could we call

// error--no element at ret.end()
copy(bottom.begin(), bottom.end(), ret.end());

This latter mistake is particularly insidious, because the program will compile. What it does when you try to run it is another story entirely. The first thing copy will try to do is assign a value to the element at ret.end(). There’s no element there, so what the implementation will do is anybody’s guess.

Why is copy designed this way? Because separating the notions of copying elements and expanding a container allows programmers to choose which operations to use. For example, we might want to copy elements on top of elements that already exist in a container, without changing the container’s size. As another example, we might want to use back_inserter to append elements to a container that are not merely copies of another container’s elements.

Another Way to Split

Another function that we can write more directly using the standard algorithms is the split function, which we wrote earlier. This function takes a string as its argument and returns a vector<string> with elements that represent the words that constitute the string. The hard part of writing the earlier version of that function was dealing with the indices that delimited each word in the input line. We can replace the indices by iterators and use standard-library algorithms to do much of the work for us:

bool not_space(char c)
{ return !isspace(c); }
vector<string>
split(const string& str)
{
   typedef string::const_iterator
      iter;
   vector<string> ret;
   iter i = str.begin();
   while (i != str.end()) {
      // ignore leading blanks
      i = find_if(i, str.end(),
             not_space);

      // find end of next word
      iter j = find_if(i, str.end(),
                  isspace);
      // copy the characters
      // in [i, j)
      if (i != str.end())
        ret.push_back(string(i, j));
      i = j;
   }
   return ret;
}

This code uses a lot of new functions, so it will take a bit of explanation. The key thing to keep in mind is that it implements the same algorithm as the original, using i and j to delimit each word in str. Once we’ve found a word, we copy it from str and push the copy onto the back of ret.

This time, i and j are iterators, not indices. We use typedef to abbreviate the iterator type so that we can use iter instead of the longer string::const_iterator. Although the string type does not support all of the container operations, it does support iterators, so we can use the standard-library algorithms on the characters of a string.

The algorithm that we use in this example is find_if. Its first two arguments are iterators that denote a sequence; the third is a predicate, which tests its argument and returns true or false. The find_if function calls the predicate on each element in the sequence, stopping when it finds an element for which the predicate yields true.

The first call to find_if seeks the first nonblank character, which will denote the beginning of the word. Remember that one or more spaces might begin a line or might separate adjacent words in the input. We do not want to include these spaces in the output. Because the library provides an isspace function, but not its logical opposite, we wrote our own simple predicate to return the negation of isspace. Thus not_space(c) will be true if and only if c is a nonblank character.

After the call to find_if, i will denote the first nonblank character, if any, in str. We use i in the next call to find_if, which looks for the first space in [i, str.end()). If find_if fails to find a value for which the predicate is true, it returns its second argument. Therefore, j will be initialized to denote the blank that separates the next word in str from the rest of the line or, if we are on the last word in the line, j will be equal to str.end().

At this point, i and j delimit a word in str. All that’s left is to use these iterators to copy the data from str into ret. In the earlier version of split, we used string::substr to create the copy. However, that version of split operated on indices, not iterators, and there isn’t a version of substr that operates on iterators. Instead, we construct a new string directly from the iterators that we have. The expression string(i, j) will create a string that is a copy of the characters in the range [i, j). We push this new string onto the back of ret.

It is worth pointing out that one thing that is missing from this version of the program is the various tests of the index i against str.size(). Nor are there the obvious equivalent tests of the iterator against str.end(). The reason is that the library algorithms are written to handle gracefully calls that pass an empty range. For example, at some point the first call to find_if will set i to the value returned by str.end(), but there is no need to check i before passing it to the second call to find_if. The reason is that find_if will look in the empty range [i, str.end()) and will return str.end() to indicate that there is no match.

Palindromes

Another problem in character manipulation that we can use library facilities to solve succinctly is deciding whether a word is a palindrome. Palindromes are words that are spelled the same way front to back as back to front. For example, "Eve" and "radar" are both palindromes. Notice that our definition deliberately ignores case distinctions; we want to consider e and E as equivalent.

Here is a compact solution that uses the library:

bool is_palindrome(string s)
{
   transform(s.begin(), s.end(),
      s.begin(), tolower);
   return equal(s.begin(), s.end(),
      s.rbegin());
}

First, we convert the string to its lowercase equivalent by calling the transform library function. This function takes three iterators and a function. The first two iterators specify a range of elements to transform; the function defines the transformation. The third iterator is the destination into which to put the result of running the function. When we call transform, we are responsible for ensuring that the destination has room for the values from the input sequence. In this case, we are storing back into the same string that supplies the input, so we know that there is enough room. We made s a string, rather than a const string&, because the call to transform modifies s, and we don’t want is_palindrome to change its argument.

The effect of calling transform is to run tolower on each character in s and put the result back into s at the same position. The tolower function comes from <cctype> and does what its name suggests: It returns the lowercase version of its argument. If its argument is not an uppercase letter, it returns a copy of the argument.

The return statement calls the library function equal and uses rbegin, a container function that we have not yet seen. Like begin, the rbegin function returns an iterator, but this time it is an iterator that starts with the last element in the container and moves backward through the container.

The equal function compares two sequences to determine whether the values in them are equal. As usual, the first sequence is specified by the first two iterators passed to equal. The third argument is the starting point for the second sequence. The equal function assumes that the second sequence is the same size as the first, so it does not need an ending iterator. Because we pass rbegin as the starting point for the second sequence, the effect of this call is to compare values from the back of s to values in the front. The equal function will compare the first character in s with the last. Then it will compare the second to the next to last, and so on. This behavior is precisely what we want.

Finding URLs

As the last of our examples of character manipulation, let’s write a function that finds web addresses, called URLs (uniform resource locators), that are embedded in a string. We might use such a function by creating a single string that holds the entire contents of a document. The function would then scan the document and find all the URLs in it.

A URL is a sequence of characters of the form protocol- name://resource-name where protocol-name contains only letters, and resource-name may consist of letters, digits, and certain punctuation characters. Our function will take a string argument and will look for instances of :// in that string. Each time we find such an instance, we’ll look for the protocol-name that precedes it and the resource-name that follows.

Because we want our function to find all the URLs in its input, we’ll want it to return a vector<string>, with one element for each URL. The function executes by moving the iterator b through s, looking for the characters :// that might be a part of a URL. If we find these characters, it looks backward to find the protocol-name and it looks forward to find the resource-name:

vector<string> find_urls(const string& s)
{
   vector<string> ret;
   typedef string::const_iterator iter;
   iter b = s.begin(), e = s.end();
   // look through the entire input
   while (b != e) {
      // look for one or more letters followed by ://
      b = url_beg(b, e);
      // if we found it
      if (b != e) {
         // get the rest of the URL
         iter after = url_end(b, e);
         // remember the URL
         ret.push_back(string(b, after));
         // advance b and check for more URLs on this line
         b = after;
      }
   }
   return ret;
}

We start by declaring ret, which is the vector into which we will put the URLs as we find them, and by obtaining iterators that delimit the string. We will have to write the url_beg and url_end functions, which will find the beginning and end of any URL in the input. The url_beg function will be responsible for identifying whether a valid URL is present and if so, for returning an iterator that refers to the first character of the protocol-name. If it does not identify a URL in the input, then it will return its second argument to indicate failure.

If url_beg finds a URL, the next task is to find the end of the URL by calling url_end. That function will search from the given position until it reaches either the end of the input or a character that cannot be part of a URL. It will return an iterator positioned one past the last character in the URL.

Thus, after the calls to url_beg and url_end, the iterator b denotes the beginning of a URL, and the iterator after denotes the position one past the last character in the URL:

We construct a new string from the characters in this range and push that string onto the back of ret.

All that remains is to increment the value of b and look for the next URL. Because URLs cannot overlap one another, we set b to (one past) the end of the URL that we just found and continue the while loop until we’ve looked at all the input. Once that loop exits, we return the vector that contains the URLs to our caller.

Now we have to think about url_beg and url_end. The url_end function is simpler, so we’ll start there:

string::const_iterator
url_end(string::const_iterator b, string::const_iterator e)
{
   return find_if(b, e, not_url_char);
}

This function just forwards its work to the library find_if function. The predicate that we pass to find_if is one that we will write, called not_url_char. It will return true when passed a character that cannot be in a URL:

bool not_url_char(char c)
{
   // characters, in addition to alphanumerics, that can
   // appear in a URL
   static const string url_ch = "~;/?:@=&$-_.+!*'(),";

   // see whether c can appear in a URL and return the negative
   return !(isalnum(c) ||
      find(url_ch.begin(), url_ch.end(), c) != url_ch.end());
}

Despite being small, this function uses a fair bit of new material. First is the use of the static storage class specifier. Local variables that are declared to be static are preserved across invocations of the function. Thus, we will create and initialize the string url_ch only on the first call to not_url_char. Subsequent calls will use the object that the first call created. Because url_ch is a const string, its value will not change once we have initialized it.

The not_url_char function also uses the isalnum function, which the <cctype> header defines. This function tests whether its argument is an alphanumeric character (a letter or a digit).

Finally, find is another algorithm that we haven’t used yet. It is similar to find_if, except that instead of calling a predicate, it looks for the specific value given as its third argument. As with find_if, if the value sought is present, the function returns an iterator denoting the first occurrence of the value in the given sequence. If the value is not found, then find returns its second argument.

With this information in hand, we can now understand the not_url_char function. Because we negate the value of the entire expression before we return it, not_url_char will yield false if c is a letter, a digit, or any of the characters in url_ch. If c is any other value, the function returns true.

Now the hard part begins: implementing url_beg. This function is messy, because it must deal with the possibility that the input might contain :// in a context that cannot be a valid URL. In practice, we’d probably have a list of acceptable protocol-names and look only for those. For simplicity, though, we’ll limit ourselves to being sure that one or more letters precede the :// separator, and at least one character follows it:

string::const_iterator
url_beg(string::const_iterator b, string::const_iterator e)
{
   static const string sep = "://";
   typedef string::const_iterator iter;

   // i marks where the separator was found
   iter i = b;
   while (
      (i = search(i, e, sep.begin(), sep.end())) != e) {
      // make sure the separator isn't at the beginning
      // or end of the line
      if (i != b &&
             i + sep.size() != e) {
         // beg marks the beginning of the protocol-name
         iter beg = i;
         while (beg != b &&
            isalpha(beg[-1]))
            --beg;

         // is there at least one appropriate character
         // before and after the separator?
         if (beg != i &&
             && !not_url_char(i[sep.size()]))
            return beg;
      }
      // the separator we found wasn't part of a URL;
      // advance : past this separator
      i += sep.size();
   }
   return e;
}

The easy part is to write the function header. We know that we’ll be passed two iterators denoting the range in which to look, and that we’ll return an iterator that denotes the beginning of the first URL in that range, if one exists. We also declare and initialize a local string, which will hold the characters that make up the separator that identifies a potential URL. Like url_ch in the not_url_char function, this string is static and const. Thus, we will not be able to change the string, and its value will be created only on the first invocation of url_beg.

The function executes by placing two iterators into the string delimited by b and e:

The iterator i will denote the beginning of the URL separator, if any, and beg will indicate the beginning of the protocol-name, if any.

The function first looks for the separator by calling search, a library function that we haven’t used before. This function takes two pairs of iterators: The first pair denotes the sequence in which we are looking, and the second pair denotes the sequence that we wish to locate. As with other library functions, if search fails, it returns the second iterator. Therefore, after the call to search, either i denotes (one past) the end of the input string, or it denotes a : that is followed by //.

If we found a separator, the next task is to get the letters (if any) that make up the protocol-name. We first check whether the separator is at the beginning or end of the input. If the separator is in either of those places, we know that we don’t have a URL, because a URL has at least one character on either side of its separator. Otherwise, we need to try to position the iterator beg. The inner while loop moves the iterator beg back through the input until it hits either a non-alphabetic character or the beginning of the string. It uses yet another new idea: the notion that if a container supports indexing, so do its iterators. In other words, beg[-1] is the character at the position immediately before the one that beg denotes. We can think of beg[-1] as an abbreviation for *(beg - 1).

If we were able to advance the iterator over as much as a single character, we assume we’ve found a protocol-name. Before returning beg, we still have to check that there’s at least one valid character following the separator. This test is more complicated. We know that there is at least one more character in the input, because we’re inside the body of an if that compares the value of i + sep.size() with e. We can access the first such character as i[sep.size()], which is an abbreviation for *(i + sep.size()). We test whether that character can appear in a URL by passing it to not_url_char. This function returns true if the character is not valid, so we negate the return to check that it is valid.

If the separator is not part of a URL, then the function advances i past the separator and keeps looking.

This code uses the decrement operator. It works like the increment operator, but it decrements its operand instead. As with the increment operator, it comes in prefix and postfix versions. The prefix version, which we use here, decrements its operand and returns the new value.

Reference

[1] Andrew Koenig and Barbara E. Moo. Accelerated C++: Practical Programming by Example (Addison-Wesley, 2000), ISBN 0-201-70353-X.

Andrew Koenig is a member of the Large-Scale Programming Research Department at AT&T’s Shannon Laboratory, and the Project Editor of the C++ standards committee. A programmer for more than 30 years, 15 of them in C++, he has published more than 150 articles about C++ and speaks on the topic worldwide. He is the author of C Traps and Pitfalls and co-author of Ruminations on C++.

Barbara E. Moo is an independent consultant with 20 years’ experience in the software field. During her nearly 15 years at AT&T, she worked on one of the first commercial projects ever written in C++, managed the company’s first C++ compiler project, and directed the development of AT&T’s award-winning WorldNet Internet service business. She is co-author of Ruminations on C++ and lectures worldwide.


Related Reading


More Insights






Currently we allow the following HTML tags in comments:

Single tags

These tags can be used alone and don't need an ending tag.

<br> Defines a single line break

<hr> Defines a horizontal line

Matching tags

These require an ending tag - e.g. <i>italic text</i>

<a> Defines an anchor

<b> Defines bold text

<big> Defines big text

<blockquote> Defines a long quotation

<caption> Defines a table caption

<cite> Defines a citation

<code> Defines computer code text

<em> Defines emphasized text

<fieldset> Defines a border around elements in a form

<h1> This is heading 1

<h2> This is heading 2

<h3> This is heading 3

<h4> This is heading 4

<h5> This is heading 5

<h6> This is heading 6

<i> Defines italic text

<p> Defines a paragraph

<pre> Defines preformatted text

<q> Defines a short quotation

<samp> Defines sample computer code text

<small> Defines small text

<span> Defines a section in a document

<s> Defines strikethrough text

<strike> Defines strikethrough text

<strong> Defines strong text

<sub> Defines subscripted text

<sup> Defines superscripted text

<u> Defines underlined text

Dr. Dobb's encourages readers to engage in spirited, healthy debate, including taking us to task. However, Dr. Dobb's moderates all comments posted to our site, and reserves the right to modify or remove any content that it determines to be derogatory, offensive, inflammatory, vulgar, irrelevant/off-topic, racist or obvious marketing or spam. Dr. Dobb's further reserves the right to disable the profile of any commenter participating in said activities.

 
Disqus Tips To upload an avatar photo, first complete your Disqus profile. | View the list of supported HTML tags you can use to style comments. | Please read our commenting policy.