But Things Get Tricky...
Where declarations are concerned, C is much more permissive than you might think. This permissiveness may lead to poor readability and understandability, as in:
1: typedef unsigned long int ulint_t; 2: void function() 3: { 4: ulint_t another_var; /* ulint_t is used as type name */ 5: char ulint_t; /* ulint_t is used as identifier */ 6: /*...*/ 7: }
Despite first impressions, this code is perfectly legal ANSI C, but the front-end based on the lexer in Listing One fails to recognize it.
More precisely, line 1 defines ulint_t as a user type; therefore, ulint_t must be tokenized here as IDENTIFIER. Then, line 4 declares a variable of this type; therefore, ulint_t is tokenized here as TYPE_NAME, as the scoping rules require. So far, nothing is new. Now, line 5 declares a variable named ulint_tthis is allowed. Inner blocks can declare new variables and types that hide outer objects with the same names. Moreover, in the current block, there are no conflicts because no previous declarations in this block involve the name ulint_t. Since line 5 declares a variable, the lexer should tokenize ulint_t in line 5 as IDENTIFIER. Unfortunately, the lexer as specified in Listing One tokenizes it as TYPE_NAME because it finds a type declaration with that name in the outer block.
int category() { Block * blockp = Block::currentp; ... while (blockp) { if (blockp->symbol_table.name_is_defined(yytext)) { if (blockp->symbol_table.name_is_typedef(yytext)) { if (Block::currentp->type_stack.is_valid_type()) return IDENTIFIER; else return TYPE_NAME; } else return IDENTIFIER; } blockp = blockp->parentp; } return IDENTIFIER; } ...
To solve this, the lexer must cooperate with the parser and the semantic layer. Listing Two presents a solution. You force the current token (ulint_t) as IDENTIFIER when it is the last part of a variable declaration; for instance, when what appears before the current token is a valid type specifier. Although this condition seems simple to verify, the simple condition:
if (Block::currentp-> type_stack.is_valid_type())
hides a full implementation of the C type system provided by the semantic layer. Declarations in C are defined by 108 syntax rules, which is a large part of the entire grammar of the C language. The declaration syntax of C is complex and counterintuitive, and even experienced programmers find it troublesomequalifiers (const and volatile) may appear in different orders, type specifiers (such as long or unsigned) may modify another type specifier or come alone (then int is implied), declarations of pointers to functions are poorly readable, and much more. A real-life example of a nontrivial declaration is:
void (*signal(int sig, void (*func)(int))) (int);
My approach maintains a stack of type records in each block, which describes the state of the current declaration. When each element in a declaration (type record) is parsed, according to the correct precedence implied by the grammar rules, it is pushed onto a stack. Whenever a direct declarator identifier is encountered, a snapshot of the current type stack is taken, saved into the symbol table of the current block, and associated to that identifier. Figure 4 illustrates this mechanism.
Type stacks are easy data structures to process, they capture the full complexity of C declarations; and unlike declarations, they are easy to read. To get the type of a variable in plain English, just read its type stack from top to bottom. For example, "c" in Figure 4 is an array of five pointers to integers. (For brevity, I don't discuss type-stack normalization.)
Finally, type stacks are the key to solve the issue described at the beginning of this section. In short, when the records in the current type stack constitute a valid type, the lexer prevents tokens that are declared as user type names outside (such as ulint_t) to be tokenized as TYPE_NAME, and forces them to be IDENTIFIER.