Myth #2: Lexers Are "Dumb" Objects
Listing One is the simplified flex specification for a lexer that distinguishes between identifiers and type names, depending on the context. As flex requires, sections are separated by "%%" lines.
Digit [0-9] Letter [a-zA-Z_] ... %% "," { return ','; } ":" { return ':'; } "=" { return '='; } ... "+=" { return ADD_ASSIGN; } ">>=" { return RIGHT_ASSIGN; } "<<=" { return LEFT_ASSIGN; } ... "auto" { return AUTO; } "break" { return BREAK; } "case" { return CASE; } ... {Letter}({Letter}|{Digit})* { return category(); } ... %% #include <blocks.hpp> ... int category() { Block * blockp = Block::currentp; ... while (blockp) { if (blockp-> symbol_table.name_is_defined(yytext)) { if (blockp-> symbol_table.name_is_typedef(yytext)) return TYPE_NAME; else return IDENTIFIER; } blockp = blockp->parentp; } return IDENTIFIER; } ...
- The first section defines a shorthand notation for digits and letters.
- The second section gives rules for each lexical element of the language; for example, for one-character operators ("+", "=", ...), the lexer returns a token code equal to the character encoding, by convenience; for multicharacter operators ("+=", "<<=", ...) and keywords ("auto," "break," "case," ...), it returns symbolic token codes.
- Finally, for strings of letters and digits starting with a letter, the lexer invokes function category() to determine whether the string must be recognized as an IDENTIFER or a TYPE_NAME. The third section gives the implementation of function category(). The function category() complies with the C scoping rules: It looks up the current token (stored in variable yytext) in the current block, then in the block that immediately surrounds it, and so on, until an entry is found or the outmost block is hit with no success. This is the purpose of the while loop in the function.
Remember that in C, any open brace "{" creates a new block and a new scope, where new variables and types can be declared that hide the objects with the same names declared in the outer blocks. Our front-end maintains a representation of all the blocks encountered in the input source code so far, in the form of a tree of objects of class Block (and yes, I write semantic actions and support code in C++). Each block contains a pointer parentp to its surrounding block, which is NULL for the global block. Each block also contains an object of class SymbolTablea symbol table with all the variable and type declarations that have appeared so far in that block. Class SymbolTable provides a method named name_is_typedef(...), which tells whether the given name appears as a defined type name in that symbol table current block.