C++ Theory and Practice: Standard C++ as a High-Level Language?
Dan Saks
Is C++ a high-level language with low-level roots, or a low-level language with high-level aspirations? Dan gives us his view.
Copyright © 1999 by Dan Saks
By the time you read this, my son Ben will be completing his second season running cross-country for Springfield Shawnee High School. I enjoy watching him run. I really want him to do well at his sport and have fun at the same time.
Ben had a disappointing spring track season marred by injury. After he rested for about a month, we agreed that it would be fun for me to coach him back into shape for cross-country. Really. He thought it would be fun, too. Recognizing that training techniques have changed considerably in the twenty-some years since I competed in school, I talked to a few coaches and scoured my local library for books on running that postdated the running boom of the late 1970s.
One of the books I read had a section on the relationship between runner and coach. It draws from the words of Arthur Lydiard, a prominent track coach from New Zealand. Recognizing the individualism of distance runners, Lydiard urged runners to hold their coaches accountable:
If the coach says, "Okay, go out and run 20 times 400 meters," you say, "Coach, why am I doing this? What physiological effects is this going to have on me?" If the coach can't tell you, then go out and get another coach, because this one is going to hurt you. It's your career. [1]
I suspect that many coaches, particularly in regimented sports such as American football, might not take too kindly to such questioning. ("Okay, smartass, drop and give me 20.") However, I found this passage gratifying. This has been my attitude for quite some time, not just toward coaching, but toward teaching. Especially when it comes to C++.
C++ has a lot of details. Despite my best efforts in the classroom to keep my presentation of language issues grounded in program design issues, the big picture gets lost once in a while. Consequently, I tell my students that they should never hesitate to ask, "Why are you telling me this? How is this going to help me write better C++ programs?"
I have the same attitude toward my writing. And well I should, because one of our readers is holding CUJ in general, and me in particular, accountable. Since I suspect he's not alone in his concerns, I've elected to respond here. Besides, it gives me an opportunity to address a related issue I've been itching to discuss.
Dear Editor,
Considering that Bjarne Stroustrup's article "Learning Standard C++ as a New Language" (CUJ, May 1999) was quite visionary, I think CUJ itself is full of examples of a paradigm that is only very slowly shifting towards C++. Dan Saks' "Isolating Design Decisions, Part 1" (CUJ, August 1999) is a typical example. The article has a very interesting topic, but he fails to show us how to do things easier and in a state-of-the-art manner.
He mentions that he could declare the cross-reference table as:
typedef map < string, set<unsigned> > table;
This would actually replace more than 100 lines of his own code, but he decides to consider these possibilities "not just yet." Also, the main file of his sample code is full of C-- instead of C++ style. Why is he not using standard functions like istream::getline to read a line of text? Why is he not using C++ strings?
Writing something like:
while (isalnum(c = fgetc(stdin)) || c == '_')
is (without reason) very hard to understand for not-so-experienced programmers. His whole function get_token is a typical example of "C" code that is small yet not well designed. This is of course OK for a homegrown utility, but not for published code in a programmer's magazine.
With Standard C++ done right, the whole code would be less than half the size of the code that Dan is showing us.
I am working on software projects with hundreds of thousands of lines of code. People working on those projects may, if they do it right, write modules in 5,000 very well structured lines, or they might write the same module in 15,000 lines that I, and sometimes they, do not understand. We read CUJ because we want to learn how to make our work better and quicker, and our code smaller, faster, and easier to maintain. I usually appreciate many of the things CUJ points out, but there is much too much C code in it that does not show us how to make things easier, but on the contrary makes them unreasonably difficult.
Yours sincerely,
Stefan Woerthmueller
Application Programming, Music & Emotions Unlimited
[email protected]
The letter poses a few interesting questions. Let's get the minor one out of the way first: why didn't I use istream::getline to read a line of text?
Scanning Techniques
The cross-reference program reads identifiers (as in C or C++) from standard input. In addition to scanning identifiers, the program must note the number of the line on which each identifier appears. If you try using an istream extractor to read one identifier at a time, as in:
cin >> token;
you'll miss the newlines. By default, extractors skip over newlines along with all other whitespace characters.
istream::getline ought to solve that problem. You can use getline to read a line at a time into a string. Then you can use an istringstream to extract tokens from the line, as in:
unsigned ln = 0;
string id, line;
while (getline(cin, line)) {
    ++ln;
    istringstream iss(line);
    while (iss >> id) {
        // place id and ln into the cross-reference
    }
}
The problem here is that the extraction:
iss >> id
interprets a token as any sequence of characters separated by whitespace. Given the input line:
get_token(std::string &s)
the extractor finds:
get_token(std::string
as the first token and:
&s)
as the second. This does not satisfy the program's requirements. The tokens on the line are supposed to be:
get_token std string s
Neither <iostream> nor <stdio.h> has formatted input functions that will scan identifiers directly. You have to write a function to do it.
One approach is to read a line at a time and scan the identifiers from each line. Another approach is to regard newline characters, as well as identifiers, as tokens, and scan them directly from the input stream. My get_token (Listing 1) uses the latter. It scans standard input for either an identifier or a newline, and ignores everything else. This is a pretty classic approach to input scanning. Although I admit the expression in:
while (isalnum(c = fgetc(stdin)) || c == '_')
is a little more complicated than I'd like it to be, I don't see any way to rewrite it without complicating other parts of the program. And I don't think using <iostream> instead of <stdio.h> makes much difference here.
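For readers who don't have Listing 1 handy, here is a rough sketch of what a scanner in that style looks like. The token codes IDENTIFIER, NEWLINE, and END are stand-ins of my own choosing, and the listing itself differs in its details:

#include <ctype.h>
#include <stdio.h>
#include <string>

enum token_kind { IDENTIFIER, NEWLINE, END };

// Scan standard input for the next identifier or newline,
// ignoring every other character. On IDENTIFIER, s holds the text.
token_kind get_token(std::string &s)
{
    int c;
    for (;;) {
        c = fgetc(stdin);
        if (c == EOF)
            return END;
        if (c == '\n')
            return NEWLINE;
        if (isalpha(c) || c == '_')
            break;              // the start of an identifier
    }
    s = "";
    do {
        s += char(c);
    } while (isalnum(c = fgetc(stdin)) || c == '_');
    if (c != EOF)
        ungetc(c, stdin);       // give back the terminating character
    return IDENTIFIER;
}

Pushing back the terminating character with ungetc is what lets a newline that ends an identifier come back as the very next token.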
Now to the larger concern: that CUJ in general, and I in particular, have been too slow to embrace C++ as a truly high-level language.
Shifting Gears Too Slowly
In his article, Stroustrup advocates an approach to learning and teaching C++ with a heavy emphasis on using libraries, especially the Standard C++ library, from the outset. In particular, he advises using <iostream> in preference to <stdio.h>, using the string class in preference to character arrays, and using vector and iterator classes in preference to the language's native arrays and pointers. Coverage of lower-level language and library facilities should come later.
I whole-heartedly agree with some of Stroustrup's points, but I have serious reservations about others. I'll elaborate shortly. But regardless of my position, this is, after all, the C/C++ Users Journal, not just the C++ Users Journal. We cater to C as well as C++ programmers over a wide range of ability and experience. In articles that demonstrate techniques of interest to both C and C++ programmers, it makes perfect sense to use the features that are common to both languages.
I skimmed over the CUJ issues from May through August 1999 to find C++ articles that used a "retro" style (using classes, but not the newer features of the Standard C++ library). It wasn't obvious to me that any articles other than mine were guilty of this offense. So what's my excuse for not using streams, strings, and all that other good stuff?
My series on isolating design decisions is actually the continuation of a thread I started well over a year ago. It's been a while since I explained my motives for choosing this example, so I guess I should explain them again.
I started writing regularly for this magazine in 1991. In those days, my column was entitled "Stepping Up to C++." My first few articles were a series in which I transformed a C program into an object-oriented C++ program. That program was the cross-reference generator.
In the years since, C++ changed and so did I. In looking back at that early series, I found a number of things in the program that I'd do differently. I found other things that, while I'd do them the same, I'd explain them differently. Thus, I decided to revisit the example.
Why am I so fond of this particular programming example? My focus is on how to use C++ to build large systems by dividing those systems into simpler abstractions. For this, I need a programming example that's complicated enough so that you can see if the chosen techniques really do make it simpler. On the other hand, the program has to be small enough to fit into the magazine format. I think the cross-reference generator strikes a nice balance.
As I explained in the article that the letter criticizes, I started with an existing program rather than one that I wrote from scratch because I didn't want to spend time developing algorithms and data structures. Rather, my focus is on wrapping those algorithms and data structures into tidy abstract bundles.
In 1991, C++ had no standard library. There wasn't a standard string class, and there certainly weren't any containers and iterators. There wasn't any question about my failure to use them. Now that we have these library components, you can ask why I'm not using them. I think I explained that, but I'll try to say it more clearly.
The standard library's containers are very useful, but they aren't the solution to every problem. The library doesn't provide data structures such as circular lists, skip lists, or graphs. And the containers that are in the library might not be fast enough or compact enough for some applications. In short, the standard library does not eliminate the need to write your own classes and data structures. Not by a long shot. When you do write your own, you should think carefully about packaging those structures as abstract types. C++ gives you lots of packaging options, and it's those options I want to discuss.
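To make the point concrete, here is the flavor of what I mean by packaging: a home-grown data structure, in this case a dynamically grown array, hidden behind a class interface. This sketch is only an illustration; it is not the line-number sequence class from my columns:

#include <cstddef>
#include <cstdlib>

// A home-grown, dynamically grown array packaged as an abstract type.
// (Error handling omitted to keep the sketch short.)
class line_sequence
{
public:
    line_sequence() : numbers(0), used(0), capacity(0) { }
    ~line_sequence() { std::free(numbers); }
    void add(unsigned n)
    {
        if (used > 0 && numbers[used - 1] == n)
            return;             // skip a repeated line number
        if (used == capacity)
            grow();
        numbers[used++] = n;
    }
    std::size_t size() const { return used; }
    unsigned operator[](std::size_t i) const { return numbers[i]; }
private:
    void grow()
    {
        capacity = capacity ? 2 * capacity : 8;
        numbers = static_cast<unsigned *>(
            std::realloc(numbers, capacity * sizeof(unsigned)));
    }
    unsigned *numbers;          // the hidden representation
    std::size_t used, capacity;
    line_sequence(const line_sequence &);               // no copying,
    line_sequence &operator=(const line_sequence &);    // for now
};

The point of the packaging is that the representation can later become a linked list or a vector without disturbing any code that uses the class.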
Would it be better for me to choose a programming example with data structures that don't map so obviously into library components? Maybe, if I could find one that's not too big and not platform specific. If any of you have ideas for programs on the scale of the cross-reference program that use different data structures, and you're willing to share them with me, I'd be delighted to hear from you.
Did I really have to start this programming example as a C program? Wouldn't it work just as well if I had used streams instead of C files? As I explained last year [3]:
I debated whether to start with a C++ program rather than a C program, but decided to stick with C. C provides a well defined starting point for this exercise. If I had started with a C++ program, I'd have had a hard time deciding which C++ features to use initially. For example, the iostream library uses parameters of reference types. If I had used iostream instead of stdio, should I have also used references extensively? The iostream components are members of namespace std. Had I used iostream, should I have used the explicitly qualified names for the components, or should I have used using directives? It is my intent to discuss these issues, but I want to take them one at a time.
Performance Issues
I have another reason for starting my analysis with a C program, which I haven't explained until now; namely, I want to use the C program as a reference for measuring the cost of various abstraction techniques.
When I worked with the cross-reference program back in 1991, I found that restructuring the C program into an object-oriented C++ program increased the size of the executable code by less than two percent and had a negligible effect on execution speed. With all the changes in the C++ language and library over the past several years, I've heard a lot of programmers express doubt that a well written class-based C++ program could still compete favorably with its C counterpart. I want to find out for myself.
Although I have not completely rewritten the cross-reference program to my satisfaction, I made some measurements just to see how things are going. I compared the program I started with, which had no classes [3], against the version from my last column, which has classes for cross-reference tables and for line-number sequences [4]. I considered both code size and execution speed.
Much to my chagrin, I realized that my initial version of the program used new instead of malloc, so it wasn't really a C program after all. So I cranked out a version in straight C and compiled it as such. I also produced two more versions of the cross-reference program just to gain a sense of what I might confront in future versions of the program. One version uses the cross-reference table and line-number sequence classes, but it uses streams and strings instead of C files and character arrays. The other uses the map and vector classes to implement the cross-reference table and line number sequences. (Earlier I suggested using set, but vector turns out to be faster.)
In summary, I compared five versions of the cross-reference program:
1. a program written in Standard C, similar in style to the one appearing in exercise 6-3 of Tondo and Gimpel [5],
2. the program from which I started last year, identical to (1) except that it uses new instead of malloc,
3. the version of the program from my last column, with classes for cross-reference tables and line number sequences, but still using C files and character arrays,
4. a program just like (3), but with streams and strings,
5. a program that uses maps and vectors as well as strings and streams.
I will examine (4) in an upcoming column. It may be a while longer until I discuss (5) in any detail, so I'll just show it to you without much comment. It appears in Listing 2. (Actually, I do have one comment. It uses getline to read a line at a time and scans the tokens from each line. I did this just to try out the alternative I described earlier.)
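To give you the shape of (5) without reproducing the listing, here is a simplified sketch along the same lines. It is not Listing 2 verbatim, but it reads a line at a time with getline and scans the identifiers from each line, just as the listing does:

#include <cctype>
#include <iostream>
#include <map>
#include <string>
#include <vector>

using namespace std;

typedef vector<unsigned> line_numbers;
typedef map<string, line_numbers> table;

int main()
{
    table xref;
    string line;
    unsigned ln = 0;
    while (getline(cin, line)) {
        ++ln;
        // scan the line for identifiers: a letter or underscore,
        // then letters, digits, or underscores
        string::size_type i = 0;
        while (i < line.size()) {
            if (isalpha((unsigned char)line[i]) || line[i] == '_') {
                string::size_type j = i + 1;
                while (j < line.size() &&
                       (isalnum((unsigned char)line[j]) || line[j] == '_'))
                    ++j;
                line_numbers &lns = xref[line.substr(i, j - i)];
                if (lns.empty() || lns.back() != ln)
                    lns.push_back(ln);  // one entry per line
                i = j;
            } else {
                ++i;
            }
        }
    }
    // write each identifier followed by the lines it appears on
    for (table::const_iterator t = xref.begin(); t != xref.end(); ++t) {
        cout << t->first << ':';
        for (line_numbers::const_iterator n = t->second.begin();
             n != t->second.end(); ++n)
            cout << ' ' << *n;
        cout << '\n';
    }
    return 0;
}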
I compiled each program using the latest vintage of three different C/C++ compilers for Windows 9x. (I do not want to turn this from a language-feature comparison into a compiler comparison, so I won't name names. At least not yet.) I used only the default compiler options. I did not explore options for optimizing program size or speed.
I made the measurements on my aging HP Omnibook 5000 notebook with a 120 MHz Pentium and 32 MB of RAM running Windows 95. I ran each program from the command line with a minimum of tasks running in the background. I used command-line redirection (< and >) to specify the input and output files. For input, I used a copy of the C++ Standard in plain text (1,951,043 characters).
Table 1 shows the execution times for each of the programs. Table 2 shows the sizes of the executable (.exe) files for each program. Just look at the numbers and you'll see why I'm only inching my way toward using streams and strings.
The changes I've made to the cross-reference program so far (through version 3) have slowed the program by no more than 30%, and increased the executable file size by no more than 40% (and typically much less). That's more than I was expecting, so I want to discover what caused the increase. Still, it's not too shabby.
On the other hand, adding streams and strings to the program slowed the program by at least another factor of four. With one compiler, the program ran a whopping 63 times slower than the C program. (I don't think that's a mismeasurement. It's the average of three runs, all within 3% of that average.) Adding streams and strings increased the executable file size by anywhere from 50K to 100K bytes.
Using the map and vector classes on top of streams and strings apparently makes the program bigger and slower still. However, using map and vector did bring that one very slow program back from 63 times to only 23 times slower than its C counterpart.
I'm trying not to read too much into these numbers. Again, I have not exercised any compiler options for optimizing the code. I've only started to experiment with programming techniques that will improve the performance of container classes. It looks like I've got a bunch of interesting stuff to explore in future columns.
I plan to tune the programs and make more measurements. (Here's a teaser: I believe I have a technique that enables the version of the program using the map to run nearly as fast as the original C. It'll be the subject of a future column.) Nonetheless, I think these numbers explain why I've been cautious about using certain library components.
Diff'rent Strokes for Diff'rent Folks
In his article, Stroustrup acknowledges that there is no one right way to learn and teach C++. Nonetheless, he suggests that early emphasis on the abstractions in the C++ library is appropriate for experienced programmers as well as beginners. Whereas this approach might be just right for beginning programmers, I have my doubts that it's practical, or even desirable, for the vast majority of on-the-job professionals.
Learning C++ takes time. Months, if not years. It would be wonderful if programmers who want to learn C++ could take time off from work to study and experiment with the language. Few professionals have that luxury. At best they can take a few days, maybe a week, at a time for training, and then it's back to work. They've got to learn stuff they can apply fairly quickly, or else they'll forget it.
Beyond the time constraints, many newcomers to C++ must work with legacy code (C and/or C++), and even more must interact with legacy interfaces (APIs), especially if they develop applications for a certain popular family of desktop operating systems. Legacy interfaces traffic heavily in pointers, built-in arrays, and other low-level "C" features. If not contained by classes, that low-level stuff just worms its way into places where it doesn't belong, which is just about everywhere.
One of the virtues of C++ is that experienced C programmers can learn it in steps and apply it right away. If they start by learning how to use classes to encapsulate pointers and arrays, most will see the value immediately. If they start by learning to use library abstractions too far removed from what they deal with from day to day, I suspect many will have trouble applying the lessons in their work.
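A first step can be as modest as the following sketch: a little class that owns a FILE * so the raw pointer stays out of the rest of the program. The class is mine, purely for illustration; it doesn't come from any particular library:

#include <cstdio>

// A first step in encapsulation: a class that owns a FILE *
// so the raw pointer can't wander through the program.
class input_file
{
public:
    explicit input_file(const char *name)
        : fp(std::fopen(name, "r")) { }
    ~input_file()
    {
        if (fp)
            std::fclose(fp);
    }
    bool is_open() const { return fp != 0; }    // callers must check this
    int get() { return std::fgetc(fp); }        // returns EOF at end
private:
    std::FILE *fp;
    input_file(const input_file &);             // exactly one owner,
    input_file &operator=(const input_file &);  // so no copying
};

The destructor guarantees the file gets closed, and the private copy operations keep two objects from ever owning the same FILE *. Programmers who wrestle with legacy APIs every day see the payoff from something like this right away.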
I think the essence of my disagreement with Stroustrup over learning and teaching C++ is in how each of us perceives C++. I get the impression he thinks C++ is a high-level language, which lets you descend to lower levels as needed. I think C++ is a fairly low-level language, which lets you program at much higher levels. We don't disagree about the nature of good C++ programs, only about the focus it takes to develop them in an industrial setting.
If C++ were a high-level language, you could program most of the time as if the low-level stuff weren't there. I don't think you can. Even when you program with pretty high-level abstractions, the low-level stuff is lurking just below the surface waiting to sneak bugs into your program. You have to respect the low-level nature of C++ so you can keep it at bay and get on with your work.
References
[1] Joe Henderson. Think Fast: Mental Toughness Training for Runners (Plume, 1991).
[2] Dan Saks. "C++ Theory and Practice: Basing Style on Design Principles," CUJ, March 1998.
[3] Dan Saks. "C++ Theory and Practice: Partitioning with Namespaces, Part 1," CUJ, April 1998.
[4] Dan Saks. "C++ Theory and Practice: Isolating Design Decisions, Part 2," CUJ, September 1999.
[5] Clovis Tondo and Scott Gimpel. The C Answer Book, 2nd edition (Prentice Hall, 1989).
Dan Saks is the president of Saks & Associates, which offers training and consulting in C++ and C. He is active in C++ standards, having served nearly seven years as secretary of the ANSI and ISO C++ standards committees. Dan is coauthor of C++ Programming Guidelines, and codeveloper of the Plum Hall Validation Suite for C++ (both with Thomas Plum). You can reach him at 393 Leander Dr., Springfield, OH 45504-4906 USA, by phone at +1-937-324-3601, or electronically at [email protected].