Regular expressions play a central role in many programming languages, including Perl and Awk, as well as in many familiar UNIX utilities such as grep and sed. The intrinsic nature of pattern matching in these languages has made them ideally suited to text-processing applications, particularly web applications that have to process HTML. Traditionally, C/C++ users have had a hard time of it, usually being forced to use the POSIX C API functions regcomp, regexec, and the like. These primitives lack support for search-and-replace operations and are tied to searching narrow-character C strings. Some time ago, I began work on a modern regular expression engine that would support both narrow- and wide-character strings, as well as standard library-style iterator-based searches. This library, regex++, became the regex library in Boost and has been accepted as part of the peer-reviewed Boost library collection (see http://www.boost.org/). In this article, I'll show how regex++ can be used to make C++ as versatile for text processing as script-based languages such as Awk and Perl.
Data Validation
One of the simplest applications of regular expressions is data-input validation. Imagine that you need to store credit-card numbers in a database. If such numbers are stored in machine-readable format they will consist of a string of either 15 or 16 digits. The regular expression:
[[:digit:]]{15,16}
can be used to verify that the number is in the correct format; here I have used the extended regular expression syntax used by egrep, Awk, and Perl. Regex++ also supports the more basic syntax used by the grep and sed utilities. However, most people find that the extended syntax is both more natural and more powerful, so that is the form I will use throughout this article. I do not intend to discuss the regular expression syntax in this article, but the syntax variations supported by regex++ are described online (http://www.boost.org/doc/libs/1_53_0/libs/regex/doc/html/index.html). The documentation for Perl, Awk, sed, and grep is another useful source of information, as is the Open UNIX Standard (http://www.opengroup.org/onlinepubs/7908799/xbd/re.html).
To use the aforementioned expression, you will need to convert it into some kind of machine-readable form. In regex++, regular expressions are represented by the template class reg_expression<charT, traits, Allocator>; this acts as a repository for the machine-readable expression and is responsible for parsing and validating the expression. reg_expression is modeled closely on the standard library class std::basic_string, and like that class, is usually used as one of two typedefs:
typedef reg_expression<char> regex;
typedef reg_expression<wchar_t> wregex;
Listing One contains some code for validating a credit-card format; in fact, this code could hardly be simpler, consisting of just two lines.
Listing One
bool validate_card_format(const std::string& s)
{
   static const boost::regex e("\\d{15,16}");
   return regex_match(s, e);
}
The first line declares a static instance of boost::regex, initialized with the regular expression string; note that I have replaced the verbose (albeit POSIX standard) [[:digit:]] with the Perl-style shorthand \d. Note also that the escape character has had to be doubled up to give \\d. This is an annoying aspect of regular expressions in C/C++: since character strings are seen by the compiler before the regular expression parser, whenever an escape character should be passed to the regular expression engine, a double backslash must be used in the C/C++ code. The second line simply calls the algorithm regex_match to verify that the input string matches the expression. My use of a static instance of boost::regex here is important: it ensures that the expression is parsed only once (the first time that it is used) and not each time that the function is called. Although the algorithm regex_match is defined inside namespace boost, I haven't prefixed the usage of the algorithm with the boost:: qualifier. This is because the Koenig lookup rules ensure that the right algorithm will be found anyway, as long as one of its arguments is of a type also declared inside namespace boost. It should be noted, however, that not all compilers currently support Koenig lookup. For these compilers, a boost:: qualifier is required in front of the call to regex_match. For simplicity, however, all the examples in this article assume that Koenig lookup is supported.
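As an aside on the doubled backslash: compilers that support C++11 raw string literals offer one way around it. The following is only a sketch, and it swaps in std::regex (the C++11 standard library facility) for boost::regex so that it builds without Boost; the function names validate_card_format_raw and validate_card_format_escaped are hypothetical and are not part of the article's listings. The calls are also written with an explicit namespace qualifier, which is the fallback described above for compilers without Koenig lookup.

```cpp
#include <regex>
#include <string>

// Sketch only: std::regex stands in for boost::regex here (an assumption
// made so the example compiles without Boost).

// Raw string literal: the backslash reaches the regex parser exactly as
// written, so \d does not need to be doubled.
bool validate_card_format_raw(const std::string& s)
{
    static const std::regex e(R"(\d{15,16})");
    return std::regex_match(s, e);  // explicitly qualified call
}

// Ordinary string literal: the compiler consumes one level of escaping,
// so the backslash has to be written twice.
bool validate_card_format_escaped(const std::string& s)
{
    static const std::regex e("\\d{15,16}");
    return std::regex_match(s, e);
}
```

Both functions compile the same pattern, so they should accept and reject exactly the same inputs; the raw form is simply easier to compare against a pattern written for Perl or egrep.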
Now suppose that at some point, the application using this code is converted to Unicode. Using traditional C APIs, this could be difficult; the library, however, makes it trivial. I just had to change std::string to std::wstring and boost::regex to boost::wregex, and give the pattern literal an L prefix (see Listing Two).
Listing Two
bool validate_card_format(const std::wstring& s)
{
   static const boost::wregex e(L"\\d{15,16}");
   return regex_match(s, e);
}
Search and Replace
Frankly, the examples given so far are not all that interesting. One of the key features of languages such as Perl is the ability to perform simple search-and-replace operations on character strings. Consider the credit-card example again: while it may be machine friendly to store credit-card numbers as long strings of digits, this is not very human friendly. Normally, people expect to see credit-card numbers as groups of three or four digits separated by spaces or hyphens. If you print out receipts containing the customer's card number, you would expect to see the number in a human-friendly form. Conversely, if you receive an order by e-mail, the chances are that the card number has not been typed in a machine-friendly form. Fortunately, regular expression search-and-replace comes to the rescue.
In Listing Three, I have defined a single regular expression that will match a card number in almost any format, along with two format strings that define how the reformatted text should look: one for a machine-readable form and one for a standardized human-readable form. The regular expression and the format strings are used by two functions (machine_readable_card_number and human_readable_card_number) that perform the text reformatting by calling the algorithm regex_merge. This algorithm searches through the input string and replaces each regular expression match with the format string. Note, however, that the format string is not treated as a string literal; instead, it acts as a template from which the actual text is generated. In this example, I've used a sed-style format string where each occurrence of \n is replaced by what matched the nth subexpression in the regular expression. Users of sed or Perl should be familiar with this kind of usage, and the library lets you choose which format string syntax you want to use by passing the appropriate flags to regex_merge. By the way, the name regex_merge comes from the idea that the algorithm merges two strings (the input text and the format string) to produce one new string.
Listing Three
// match any format with the regular expression:
const boost::regex e("\\A"                // asserts start of string
                     "(\\d{3,4})[- ]?"    // first group of digits
                     "(\\d{4})[- ]?"      // second group of digits
                     "(\\d{4})[- ]?"      // third group of digits
                     "(\\d{4})"           // fourth group of digits
                     "\\z");              // asserts end of string

// format strings using sed syntax:
const std::string machine_format("\\1\\2\\3\\4");
const std::string human_format("\\1-\\2-\\3-\\4");

std::string machine_readable_card_number(const std::string& s)
{
   std::string result = regex_merge(s, e, machine_format,
      boost::match_default | boost::format_sed | boost::format_no_copy);
   if(result.size() == 0)
      throw std::runtime_error("String is not a credit card number");
   return result;
}

std::string human_readable_card_number(const std::string& s)
{
   std::string result = regex_merge(s, e, human_format,
      boost::match_default | boost::format_sed | boost::format_no_copy);
   if(result.size() == 0)
      throw std::runtime_error("String is not a credit card number");
   return result;
}
Error handling in Listing Three is quite simple: by passing the flag boost::format_no_copy to regex_merge, sections of the input text that do not match the regular expression are ignored and do not appear in the output string. This means that if the input does not match the expression, then an empty string will be returned by regex_merge, and the appropriate exception can be thrown. The algorithm regex_merge will search the input for all possible matches, but in this case, the expression must match the whole of the input string or nothing at all. Therefore, the expression in Listing Three starts with \\A and ends with \\z. Taken together, these ensure that the expression will only match the whole of the input string and not just one part of it (these are what Perl calls "zero width assertions").
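The same empty-result check can be tried without Boost. The sketch below is not the article's code: it swaps in std::regex_replace from the C++11 standard library for regex_merge, and the helper reformat_card_number and the namespace card_sketch are hypothetical names introduced only for illustration. Because the default std::regex grammar does not recognize \A and \z, the sketch assumes that ^ and $ are an acceptable substitute for single-line input.

```cpp
#include <regex>
#include <stdexcept>
#include <string>

namespace card_sketch {  // hypothetical namespace, for illustration only

// Same grouping as Listing Three, anchored with ^ and $ instead of \A and \z
// (an assumption: std::regex's default grammar lacks \A and \z).
const std::regex e("^"
                   "(\\d{3,4})[- ]?"   // first group of digits
                   "(\\d{4})[- ]?"     // second group of digits
                   "(\\d{4})[- ]?"     // third group of digits
                   "(\\d{4})"          // fourth group of digits
                   "$");

// Format strings using sed syntax, as in Listing Three.
const std::string machine_format("\\1\\2\\3\\4");
const std::string human_format("\\1-\\2-\\3-\\4");

// Hypothetical helper: reformats s according to fmt, or throws if s does
// not match the expression at all.
std::string reformat_card_number(const std::string& s, const std::string& fmt)
{
    // format_no_copy drops every part of the input that did not match, so
    // a non-matching input produces an empty result.
    std::string result = std::regex_replace(
        s, e, fmt,
        std::regex_constants::format_sed | std::regex_constants::format_no_copy);
    if (result.empty())
        throw std::runtime_error("String is not a credit card number");
    return result;
}

}  // namespace card_sketch
```

With these definitions, passing "1234 5678 9012 3456" and machine_format should give back the sixteen digits with no separators, while an input such as "hello" leaves nothing to copy and triggers the exception.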
If you study the regular expression in Listing Three, you should notice one big improvement over script-based languages: C++ lets you specify a single string literal as a series of shorter string literals. I've taken advantage of this in Listing Three to split the regular expression up into logical sections, and then to comment each section. When the compiler sees that section of code, the comments are discarded and the adjacent strings are merged into one long string literal. Perhaps surprisingly, this makes regular expressions much more readable in C++ than in those traditional scripting languages that require regular expressions to be specified as a single long string.
Nontrivial Search and Replace
So far, the examples have concentrated on simple search-and-replace operations that use an existing syntax (either sed or Perl) for the format string. However, it is sometimes necessary to compute the new string to be inserted. A typical example would be a web application that uses a regular expression to locate a custom HTML tag in a file, then uses the match to perform a database lookup. The output would then be another HTML file with the custom tags replaced by the current database information. Imagine that the custom tag looks something like this:
<mergedata table="tablename" item="itemname" field="fieldname">