An Editor's View: A First, Uncertain Step
The Perils of Research
This conformance roundup is, in a sense, a research project, something that has never been attempted before. Research is a risky business even for very capable people, and the participants in this roundup are among the most capable in the world. There may be false starts and dead ends; the best and brightest people can arrive at inconclusive answers. That is the possibility I want to address here.
First, I want to say that Herb Sutter, the testers, the compiler and library vendors, the experts panel, and the many other participants in this project deserve high commendations for their efforts. They have, in combination, spent literally thousands of hours putting together this roundup. The participants also have quite laudable goals, which include raising awareness of conformance issues, and coaxing C++ implementations toward higher levels of conformance. Yet as an outside observer I must say, quite frankly, I am not sure the results are meaningful. I am especially doubtful about the library portion of the roundup. In this sidebar I want to make readers aware of some of the issues that impinge on the validity of the data. Then I will make recommendations for how the testing process might be improved, under the assumption that this roundup represents but the first step in a long and difficult (but ultimately worthwhile) process.
Please note that this roundup was not put together overnight. It is the end product of a long and labor-intensive process that Herb Sutter initiated in June of 2000. The participants generated volumes of correspondence that passed back and forth on a reflector (sort of like a newsgroup or bulletin board). What follows are the most significant issues and questions (in my view, anyway) that came up on the reflector, or that came up for me as I reviewed the correspondence on the reflector.
Testing Issues
Validation suites: the wrong, but possibly the best, tools? The tests performed for this article were done with what are known as validation suites. These are suites of test cases (typically numbering in the thousands) which are designed for just one purpose: to help vendors find areas in which their own products do not conform to the Standard. Typically, these findings are kept private between the vendor and the validation suite provider; often this privacy is even enforced via contract. Furthermore, the test cases that make up a validation suite are usually not made public. (If they were, the validation suite provider's business would be effectively ruined.)
Using validation suites for a conformance roundup is problematic for several reasons. Since the test code is not available for public scrutiny, accountability is lacking. Even if the code were available, reviewing the thousands of test cases would be a daunting task for most people, as it is even for implementers who have full access to the code. Finally, validation suites are simply not designed for comparative purposes.
So this roundup represents, if you will, the abuse of a tool for a good cause. Although there are alternative testing techniques (see Recommendations below), there are currently no other tools that can test conformance to the same level of detail. And it is highly unlikely that anyone will develop one, much less make it available to the public.
Testing errors: isolated incidents or tip of the iceberg? Just as no software is without bugs, neither is software designed to test other software, and neither are a tester's procedures. One compiler vendor I spoke with made a very good point in this regard. He said, "Our compiler is beat on by thousands of users every day; our users let us know about bugs in our products, and we fix those bugs. But validation suites are used by maybe a few dozen people, tops. Who will find all the bugs in the validation suites?"
Several times during the process of putting together this roundup, participants pointed out test scores that did not make sense; the scores were obviously too low or, in one case, too high (a score of 5 out of 10 that plainly should have been a zero). When these errors were pointed out, the testers did correct both the scores and the testing flaws that created them, but this left unanswered questions hovering in the air: how many more such errors lie lurking in the validation suites, or in the testers' procedures? Do the errors found indicate a process that is fundamentally flawed, or would the remaining errors, if found, have a negligible impact on the scores? Alas, it is impossible for me, or for readers, to judge, because we don't have access to the test suites (and we'd be overwhelmed by code if we did).
Testing libraries: one bona fide, certified, Grade A can of worms. It is impossible to test a Standard C++ library implementation without using a compiler. Furthermore, most C++ library implementations are built on top of a Standard C library, a library which may have shortcomings of its own, and which may have been provided by a different vendor. It is difficult to tease out a C++ library's conformance score from the conformance of the underlying compiler and C library. In fact, there was much debate on the reflector as to whether we should even try to separate them.
In my view, the answer depends on the purpose of the roundup. From an end user's perspective, the only library tests that are immediately useful are ones that show how particular library/compiler combinations conform to the Standard. So if this roundup were intended to be a buyer's guide (and it most certainly is not), Perennial's way of testing would probably make the most sense. However, since the purpose of this article is to recognize vendors who seem to have placed an emphasis on conformance, I think Dinkumware's approach makes the most sense. That is, I think we should try to factor out weaknesses in the underlying compiler or C library that might negatively affect a vendor's score.
Still, the Dinkumware approach, though more reasonable, is hardly comforting. It is tantamount to assigning a kind of pseudo-conformance to a library, a conformance that presumably would have existed had all things been right below decks. In other words, in my opinion we really are driving through the fog.
Recommendations
As I stated previously, I make these recommendations under the assumption that this effort is going to continue, and that participants will seek to keep improving the tests. I also admit it is easy to make such recommendations from the sidelines! So if my ideas are somewhat unrealistic, I hope they will inspire people with better heads on their shoulders.
Special-Purpose Validation Suites?
In my wildest dreams, some altruistic folks with nothing better to do with their lives would create an Open Source validation suite. Then, at least, if our favorite compiler got a bad score, we, the public, could turn to the suite to figure out why. I am not holding my breath, though.
More realistically, it might be helpful if we selected just one validation suite for testing, and had a team of C++ experts assign weights to each of the test cases (before any testing had begun, of course). Thus, the cases the experts considered more significant would figure more prominently in the final scores. Alternatively, perhaps the experts could put together a standard comparative validation suite by picking and choosing the most relevant tests from individual suites.
Perhaps even more realistically, it would be nice if at least someone who was knowledgeable about the Standard and who had access to the test suites could study the results and point out specific ones that might mean something important to users. The Perennial tests, in particular, generated an enormous amount of data, all of which is available on the CUJ website. Perhaps such a person could glean something more from these numbers.
Better Accountability
Clearly one of the biggest problems with the current test effort is the lack of an adequate mechanism to hold testers accountable. Due to the proprietary nature of validation suites, the testers do what they do in a black hole. This is not to imply that the testers are in any way devious or ill-intentioned; in fact, Dinkumware, Perennial, and Plum Hall are to be commended for contributing many unpaid hours to the project. But testers are human, and humans make mistakes. It is common knowledge that the more eyeballs you can get focused on a piece of code, the better that code is likely to become. Again, it seems to me that some sort of team of impartial observers might be needed to monitor the tests and resolve conflicts between the vendors and the testers.
Test for Capabilities, Not Just Clauses
Validation suites are concerned with fidelity to the Standard, so they typically adopt a clause-by-clause approach to testing. The results are far more interesting to implementers than to end users. End users do care about conformance, but in a different way. For them, conformance matters insofar as it enables them to employ best practices: the techniques, idioms, and patterns that contribute to effective programming and form an essential part of a professional's evolving body of knowledge.
Thus, from an end user's perspective, it is more important to know about the wholesale presence or absence of features than about strict adherence to the Standard. For instance, it would be nice to know which compilers support function-try-blocks in some fashion. To my knowledge, the semantics of function-try-blocks cannot be simulated (if they're missing, there is no workaround), so it is more important to know whether a compiler supports them at all than whether it supports them perfectly. Alas, the test scores give us scant information in this regard.
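For readers unfamiliar with the feature, here is a minimal sketch (my own illustration, not drawn from any of the test suites used in the roundup) of the one thing a function-try-block can do that nothing else can: catch an exception thrown by a base-class or member initializer in a constructor.

```cpp
#include <iostream>
#include <stdexcept>

// Hypothetical classes, purely for illustration.
struct Resource {
    Resource() { throw std::runtime_error("initializer failed"); }
};

struct Widget {
    Resource r;
    Widget()
    try : r()                        // the member initializer throws
    {
        // constructor body (never reached here)
    }
    catch (const std::exception& e) {
        // Only a function-try-block can observe this exception; an
        // ordinary try block inside the constructor body is too late.
        std::cerr << "Widget construction failed: " << e.what() << '\n';
        // The exception is implicitly rethrown when this handler ends.
    }
};

int main() {
    try {
        Widget w;
    }
    catch (...) {
        std::cout << "caught the rethrown exception in main\n";
    }
    return 0;
}
```

A compiler that rejects this syntax outright leaves the programmer with no equivalent idiom, which is exactly the kind of capability gap end users want to know about.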
The web version of this article shows test scores for compiler and library features that were identified as important by real-world developers. I think these sorts of tests, especially, are a step in the right direction. Unfortunately, the current scores were arrived at by performing some analysis (very questionable analysis, in my view) on Plum Hall's test data. What we need in the future are publicly accessible test cases designed specifically to test the features in question. Sample applications that exercise the various features would also be beneficial. The corresponding test reports should consist of descriptive prose, not opaque numbers. This prose would include compiler diagnostics in the case of code that would not compile, and behavioral descriptions in the case of code that compiled but ran incorrectly.
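To give a sense of what I have in mind, here is a hypothetical example of such a test case (again my own, not taken from Dinkumware, Perennial, or Plum Hall). It probes exactly one capability, partial specialization of class templates, and is small enough for any reader to review.

```cpp
#include <iostream>

// Primary template: assume the two types differ.
template <typename T, typename U>
struct IsSame { enum { value = 0 }; };

// Partial specialization: both parameters are the same type.
template <typename T>
struct IsSame<T, T> { enum { value = 1 }; };

int main() {
    // A conforming compiler prints "1 0". A compiler that lacks partial
    // specialization should reject the specialization above, and the
    // accompanying report would quote its diagnostics rather than
    // reduce the outcome to an opaque score.
    std::cout << IsSame<int, int>::value << ' '
              << IsSame<int, long>::value << '\n';
    return 0;
}
```

A few dozen such cases, published alongside prose descriptions of each compiler's behavior, would tell working programmers far more than a percentage ever could.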
So Why Are We Publishing This?
The short answer is: because you have to start somewhere. If we can liken this roundup to a large software project, the results published here are like the first iteration. Even if you end up throwing out all the code, you can't skip that first step. For better or worse, this roundup represents the current state of the art in conformance testing.
All the roundup participants I spoke with have told me the same thing: when this project was first proposed, they thought it was a great idea and they were generally enthusiastic. As time went on, reality set in: testing conformance is harder than anyone could have dreamed! We welcome reader feedback on this article. Tell us if you think it was worthwhile, and especially how the testing process could be improved. My hope is that the roundup participants will be encouraged to continue the effort they started, and to continue to make the process better.