C++ syntax rules won't always catch invalid declarations, such as pointers to references. But we can still detect them by applying the appropriate semantic constraints.
Copyright © 1996 by Dan Saks
C++ uses declarators in many places. In some contexts, such as simple declarations or member declarations, each declarator must have a declarator-id. That is, each declarator must actually declare a name. In other contexts, such as cast expressions or template argument lists, a declarator must not have a declarator-id. In such cases, a declarator is not declaring a name, but rather specifying part of a type.
Last month, I focused on the problem of parsing a declarator in a context where the declarator-id is optional, such as in parameter declarations or exception declarations. For example, in the function declaration
int atexit(void (*func)());
func is the declarator-id of the declarator in the parameter declaration. Here, the declarator-id is optional. You can also write the function declaration as
int atexit(void (*)());
to refer to the same atexit function.
By the way, this function declaration actually has two declarators, one inside the other. (*)() is the declarator in the parameter declaration void (*)(). That parameter declaration in turn forms the bulk of the enclosing declarator atexit(void (*)()). In this outer declarator, the declarator-id atexit is not optional.
The grammar in the draft C++ standard refers to a declarator that must have a declarator-id as a plain ordinary declarator. It refers to a declarator that cannot have a declarator-id as an abstract-declarator. (These terms came from Kernighan and Ritchie's original description of C [1], and also appear in the C Standard.) The grammar uses one set of productions for declarator, and a nearly identical, but separate, set for abstract-declarator. In those places where a declarator-id is optional, the grammar presents an explicit choice between declarator and abstract-declarator, as in
parameter-declaration = decl-specifier-seq ( declarator | abstract-declarator ) .
Last month, I explained why using separate productions for declarators and abstract declarators leads to difficult parsing problems that require elaborate lookahead. I believe I eliminated those problems by merging the productions into a single set for all declarators. I also introduced new semantic rules so that we could still distinguish those declarators with a declarator-id from those without.
Table 1 shows the grammar for a simple-declaration with one set of productions for declarator. (As always, the grammar uses the EBNF notation, summarized in Table 2. ) In Table 1, the non-terminal declarator represents a declarator that may or may not have a declarator-id. For those contexts where the declarator must have a declarator-id, the grammar uses the non-terminal concrete-declarator defined by the production
concrete-declarator = declarator.
This production does not by itself convey the fact that a concrete declarator must have a declarator-id. You need a separate semantic statement for that. However, as I noted last time, there is ample precedent in C and C++ for using semantic rules to disambiguate the syntax. Such statements do not appear in the grammar, but rather in text (such as this) that accompanies the grammar.
The grammar in Table 1 uses plain declarator (which may or may not have a declarator-id) in the production for parameter-declaration. It uses concrete-declarator (which must have a declarator-id) in the production for declarator-list, which, in turn, appears directly in simple-declaration. A complete grammar for C++ also requires a production for
abstract-declarator = declarator.
along with a rule explaining that an abstract-declarator must not have a declarator-id. However, simple-declaration does not need this production, so it does not appear in Table 1.
By the way, I've been grappling with whether concrete and abstract are the right names for the two different flavors of declarator. Concrete is indeed an antonym for abstract. So given that both C and C++ already use the term abstract-declarator as I have, I suppose concrete-declarator is an appropriate term. On the other hand, I can't help but think that named-declarator and unnamed-declarator might be better terminology. I'm reluctant to invent too much new terminology, so for now I'll stick with concrete and abstract. But I'll keep thinking about it.
Parsing Declarators
Now, at last, it's time to put this theory into practice by applying it to my decl program, which translates C++ declarations into English. The last time I presented the parser in its entirety (in "The Column That Needs a Name: Recovering from Parsing Errors," CUJ, April 1996), it recognized function declarators, but only with empty parameter lists. The source code for a new decl parser appears in Listing 1. This program now recognizes non-empty parameter lists in function declarators. Moreover, the parameter declarations may use abstract declarators as well as concrete declarators.
As suggested in the grammar, there is a single parsing function called declarator that recognizes both concrete declarators and abstract declarators. Although a parameter declaration accepts either, a simple declaration accepts only concrete declarators. Therefore, the declarator function now has a parameter that indicates if it should only accept one kind of declarator or the other or both. That parameter has an enumeration type:
enum declarator_category { ABSTRACT, CONCRETE, EITHER };
defined as a member of the parser class.
The declarator function doesn't really do anything with its declarator_category parameter except pass it to direct_declarator, which recognizes the presence or absence of a declarator-id. direct_declarator doesn't let the declarator_category value influence the way it parses, only what it finally accepts. That is, when direct_declarator encounters a declarator-id (an identifier), it simply checks that the declarator is not supposed to be abstract before proceeding, using statements of the form:
if (tc == token::IDENTIFIER) { if (dc == ABSTRACT) error("tsk, tsk"); // continue parsing concrete declarator } else if (tc == token::LEFT_PAREN) ...
Similarly, when direct_declarator determines that the declarator has no declarator-id, it checks that the declarator is not supposed to be concrete before proceeding, using statements of the form:
if (dd == "") { if (dc == CONCRETE) error("tsk, tsk"); // continue parsing abstract declarator }
Neatness Counts
As I explained earlier, parameter lists introduce the notion of declarators nested within declarators. Indenting a nested construct with respect to its enclosing construct typically yields more readable output. Therefore, decl displays the output for a parameter list indented with respect to the output of its function name and return type.
For example,
int atexit(void (*)());
translates into
atexit is ... function with parameter(s) ... <unnamed> is ... pointer to ... function with no parameters ... returning ... void returning ... int
Notice also that <unnamed> takes the place of the parameter name for an unnamed parameter (from an abstract declarator).
The parser implements the indenting using a private data member:
string indent;
whose value is simply a sequence of zero or more spaces. (You can easily change the indent string to a single tab character if you prefer.) Various parsing functions prefix their result string with indent.
indent is initially the empty string. The parameter_list parsing function increases the indent level using
indent += tab_stop;
where += is a string concatenation operator and tab_stop is a constant string of spaces. parameter_list decreases the indent level as it leaves using
indent = string (indent, 0, indent.length() - tab_stop.length());
The string constructor string(s, p, n) constructs a string by copying n character from s starting at position p. (Positions start at zero.)
Checking Semantic Constraints
As with the grammar in the draft C++ Standard, the grammar in Table 1 allows declarations that, while syntactically correct, are semantically incorrect. For instance, the production
direct-declarator = [ declarator-id | "(" declarator ")" ] { array-suffix | function-suffix } .
suggests that a declarator can have any number of array and function suffixes in any order. That is, the grammar allows all of the following declarators:
x()() - function returning function x()[N] - function returning array x[N]() - array of function x[N][N] - array of array
when, in fact, the semantic rules of C++ allow only the last.
For example, the decl parser ( Listing 1) accepts
int f(int)();
and reports that
f is ... function with parameter(s) ... <unnamed> is ... int returning ... function with no parameters ... returning ... int
However, C++ does not actually accept a function that returns a function. This is different from a function that returns a pointer to a function, such as
int (*f(int))();
which C++ certainly does allow.
C++ semantics also severely constrains declarators involving references. Although the grammar allows
&&r - reference to reference &*r - pointer to reference &T::*r - pointer to member to reference &r[N] - array of reference
none actually make it past a C++ compiler's semantic checks.
It's tempting to try to express these constraints in the grammar. The pattern appears to be that:
- if a declarator has a function-suffix, it cannot have
an array-suffix or another function-suffix.
- if a declarator has an array-suffix, it can have another array-suffix (representing a multi-dimensional array), but it cannot have a function-suffix.
Were it not for grouping parentheses in declarators, rewriting the production for direct-declarator as
direct-declarator = [ declarator-id | "(" declarator ")" ] [ { array-suffix } | function-suffix ] .
would do the trick of allowing exactly those patterns for array and function suffixes that are also semantically correct. However, grouping parentheses render this production ineffective.
For example, the declarator (x())() satisfies the production just above, even though it specifies a function returning a function. The problem is that production prohibits consecutive function suffixes only at the same parenthetical level. Though it may be possible to devise a grammar to express the constraints properly, I think it's much easier and clearer to implement the constraints as semantic checks.
The CUJ online sources and code disk (see p. 3) include an implemention of the decl parser that includes additional checks to catch the semantic violations described above.
My implementation for the semantic checks maps each declarator operator into a type_category value, where type_category is a private member of the parser class defined as:
enum type_category { SIMPLE, ARRAY, FUNCTION, POINTER, REFERENCE };
The declarator and direct_declarator functions return a type_category value through a reference parameter. That value describes the type category of the outermost declarator operator just parsed. (The outermost operator is the last one bound to the declarator.) The outermost type category of a declarator with no operators is SIMPLE. The type categories ignore cv-qualifiers and regard a pointer to member as just POINTER.
For example, the outermost type category for the declarator f(int) is FUNCTION. For *f(int), the category is POINTER (because the * has lower precedence than the parameter list). For (*const f)(int), the outermost type category is once again FUNCTION.
When declarator recognizes another operator (by calling ptr_operator), it simply checks the type category of that operator against the outermost type category of the rest of the declarator. If the type categories conflict, declarator reports an error. direct_declarator has similar logic.
Compiling the Code
As in the past, the entire decl program consists of two source files (decl.cpp and scanner.cpp) and one header file (scanner.h). decl.cpp is as shown in Listing 1 with the changes for semantic checks described above. The scanner files are largely unchanged since I last presented them a few months ago. (See "C++ Theory and Practice: Abstract Declarators, Part 2," CUJ, July 1996). You can obtain the complete source code for decl (including the scanner) on CUJ's ftp site and code disk. See page 3 for ftp information.
I recently upgraded several of my compilers. I can now compile the entire program using Borland C++ 5.0, Microsoft Visual C++ 4.1, and Watcom C++ 10.6 (all under Windows 95). Both the Borland and Microsoft compilers come with versions of STL (the Standard Template Library that's part of the draft C++ standard). The Watcom compiler does not include the STL, but I found a copy specifically tailored to the Watcom compiler in Hewlett-Packard's ftp site at butler.hpl.hp.com, in the file stl/wat_stl.zip. The documentation for this version says it was developed for Watcom C++ 10.0, but it seems to work just fine when using version 10.6.
The Borland STL implementation (licensed from Rogue Wave Software) places the library components in namespace std. The C++ namespace feature is currently the sharpest point on the bleeding edge -- not all compilers support it, and there's still considerable variation among those that do. I had to add the using-directive
using namespace std;
to scanner.h, immediately after #include directives for the STL headers to use Borland's library. Since not all compilers support namespaces, I wrapped the using-directive inside a preprocessor conditional:
#ifdef NAMESPACE using namespace std; #endif
Thus, I could build the program using the command line
bcc32 -DNAMESPACE decl.cpp scanner.cpp
For some reason, Borland's 16-bit compiler, bcc, gags on the program.
The Microsoft compiler (version 4.1) supports namespaces, but the STL that comes with it does not place the components in namespace std. However, one of the compiler's README files explains how to tweak the headers to place the components in namespace std. I made the changes, and compiled decl with both the original headers, and again with the modified headers.
References
[1] Brian Kernighan and Dennis Ritchie. The C Programming Language (Prentice Hall, 1978).
Dan Saks is the president of Saks &Associates, which offers consulting and training in C++ and C. He is secretary of the ANSI and ISO C++ committees. Dan is coauthor of C++ Programming Guidelines, and codeveloper of the Plum Hall Validation Suite for C++ (both with Thomas Plum). You can reach him at 393 Leander Dr., Springfield OH, 45504-4906, by phone at (513)324-3601 (the area code changes to 937 after September 1996), or electronically at [email protected].