Some programs need to compare a lot of text strings quickly. Text output filters, search engines, and natural-language processors are examples of applications where rapid text string comparison is important. Besides simple character-by-character comparisons, many text comparison routines need to support matching on wildcard characters. One of the strings might treat the "*" character as a special wildcard character; for example, *sip
* should match mississippi
.
When I searched for a textbook wildcard string-matching algorithm that would fit this need, I discovered that a good algorithm is harder to find than I had expected. The algorithms that I found posted on the Web were buggy, slow, and/or needlessly complex -- typically all three. So I developed my own. Some colleagues then told me that I could have found similar algorithms among open-source regular expression matching routines. I looked at some of those routines, but the ones that I found use recursion. I find recursive algorithms difficult to understand and debug. They’re also exposed to stack overflows; for example, if the input strings are lacking proper termination. It’s difficult, if not impossible, to prevent stack overflows in a recursive algorithm and still keep it performing fast -- and stack overflows are often fatal even in the face of exception-handling logic. When no stack space is available, it’s tricky to do so much as output a diagnostic message. There’s no such difficulty with my nonrecursive approach.
Listing One is a coded-up version of my algorithm, which does all of its processing in a single while loop. It not only handles wildcard ("*") characters, but also alternate string-termination characters. And it does all of this in a minimal number of clock cycles. It’s relatively compact and understandable, compared to the wildcard comparison routines I found by searching.
Listing One
// This function compares text strings, one of which can // have wildcards ("*"). BOOL GeneralTextCompare( char * pTameText, // A string without wildcards char * pWildText, // A (potentially) corresponding string with wildcards BOOL bCaseSensitive = FALSE, // By default, match on "X" vs "x" char cAltTerminator = '\0' // For function names, for example, you can stop at the first "(" ) { BOOL bMatch = TRUE; char * pAfterLastWild = NULL; // Location after last "*", if we've encountered one char t, w; // Walk the text strings one character at a time. while (1) { t = *pTameText; w = *pWildText; // How do you match a unique text string? if (!t || t == cAltTerminator) { // Easy: unique up on it! if (!w || w == cAltTerminator) { break; // "x" matches "x" } else if (w == '*') { pWildText++; continue; // "x*" matches "x" or "xy" } bMatch = FALSE; break; // "x" doesn't match "xy" } else { if (!bCaseSensitive) { // Lowercase the characters to be compared. if (t >= 'A' && t <= 'Z') { t += ('a' - 'A'); } if (w >= 'A' && w <= 'Z') { w += ('a' - 'A'); } } // How do you match a tame text string? if (t != w) { // The tame way: unique up on it! if (w == '*') { pAfterLastWild = ++pWildText; continue; // "*y" matches "xy" } else if (pAfterLastWild) { pWildText = pAfterLastWild; w = *pWildText; if (!w || w == cAltTerminator) { break; // "*" matches "x" } else if (t == w) { pWildText++; } pTameText++; continue; // "*sip*" matches "mississippi" } else { bMatch = FALSE; break; // "x" doesn't match "y" } } } pTameText++; pWildText++; } return bMatch; }
The two text strings passed as input into the algorithm include a tame string
, which contains no wildcard characters, and a wild
string that may contain one or more wildcard characters. The two strings are traversed in parallel during processing. In Listing One, the first if clause in the loop checks for termination of the tame
string. If the wild
string has additional characters by the time processing reaches the end of the tame string, and if at least one of those additional characters in the wild string is not a wildcard character, then the strings don’t match. Most of the remaining code is in the big else
clause corresponding to this first if
clause.
In the big else
clause, the first thing that happens is that an instant character from each string (both tame
and wild
) is lowercased, if the caller has specified case insensitivity. The core wildcard matching code is found in the several lines of code following the lowercasing operation. This core code executes only when the instant characters from the tame
and wild
strings don’t match. One of those nonmatching characters will be either a wildcard character, or it won’t. If it is, then a pointer to the next character after the wildcard is saved. Otherwise, the strings don’t match, unless such a pointer has already been saved during earlier processing, meaning the algorithm has encountered a wildcard character already. When that is the case, processing continues from the character after that previous wildcard character, which is compared with the instant character from the tame string. Then the following occurs:
- The string termination condition is checked, this time for the
wild
string, to ensure that it actually has additional valid characters to compare at this point. - If there are no valid characters, then there is nothing more to do.
- Otherwise, one of two things can happen: 1. If the instant characters from the tame and wild strings now match, the algorithm continues by comparing the next characters from each string; or 2. if those characters don’t match, the algorithm continues by comparing the same instant character from the
wild
string with the next character from the tame string.
After the big else
clause, if the core code has not already continued from the top of the loop, the algorithm moves on to select the next pair of characters from each string, then it returns to the top of the loop to compare them.
If you use the debugger to check the algorithm's ability to compare the text string Mississippi
against *sip*
, this is what you’ll see looking at the Watch window view associated with the tame
and wild
string pointers. The algorithm compares the left-most characters of each string, moving through each string one character at a time, such that the debugger shows this representation of the two string pointers as you’re stepping through:
mississippi *sip* mississippi sip* ississippi sip* ssissippi sip* sissippi ip* sissippi sip* issippi ip* ssippi p* ssippi sip* sippi ip* sippi sip* ippi ip* ppi p* pi * i *
Clearly, the number of character comparisons performed by this algorithm is near to the minimum that can possibly be achieved, because that number is not much greater than the number of characters in the tame string Mississippi
. Alternate approaches, such as the use of strchr()
, typically lack equivalent computational performance, particularly when multiple wildcard characters are specified, as in this *sip* example.
The algorithm can be enhanced to support Unicode without much trouble. There are several ways to go about that, depending on which platform and toolset you’re using. If you’re a Windows programmer, for example, you could do what MFC does to arrange Unicode support. The CString
class from the MFC library provided with Visual Studio .NET 2005 provides UTF-8 support by invoking multibyte string-handling routines from the C runtime library. The CString
class includes a ::CompareNoCase()
method that invokes _mbsicmp(). If you build a sample program that passes a Unicode string to CString::CompareNoCase()
and debug it, you can follow the call into _mbsicmp()
to observe how the UTF-8 conversions are arranged. You could then code the same underlying calls to do similar conversions to endow this algorithm with CString
-like locale enablement.
Improvement from a Reader
Dear Dr. Dobb's,
Dr. Dobb's reader Benjamin Lunt has kindly done some performance comparisons using my algorithm and a recursive algorithm from wildcard text matching code used by a colleague of his. He found the recursive code slightly faster than the version of my code that appears in Dr. Dobb's, in some circumstances. In other circumstances, my code was faster. But that got me thinking about performance.
I realized that my algorithm could benefit from skipping an extra assignment of the w
variable, which occurred any time after a wildcard ("*") character had been encountered. Needless assignments can force needless cache memory updates, possibly leading to serious slowdowns on modern CPUs. Then I thought of a way to correct this. The new performance-tuned code is below. It retains the fixes for
repeating character patterns and for case sensitivity that I previously provided. I've tested the new version in every way I could think of, and I've sent it out to Ben and some other interested Dr. Dobb's readers who've also shared with me various thoughts about the algorithm. Ben has very helpfully tested the performance, as he did with the original version; his remarks appear after the new code below.
In answer to Ben's comment about nonmatching situations requiring more cycles than matching situations, let me point out the following. Because this is wildcard-based matching, you may have to do a lot of checking to know for certain that you don't have a match. Perhaps there's no way around that sort of performance limitation in wildcard text matching algorithms. At least I haven't thought of a faster approach, and I have yet to see evidence that one exists -- but then again you never know.
Regards,
--Kirk
//This function compares text strings, one of which can have wildcards ('*'). // BOOL GeneralTextCompare( char * pTameText, // A string without wildcards char * pWildText, // A (potentially) corresponding string with wildcards BOOL bCaseSensitive = FALSE, // By default, match on 'X' vs 'x' char cAltTerminator = '\0' // For function names, for example, you can stop at the first '(' ) { BOOL bMatch = TRUE; char * pAfterLastWild = NULL; // The location after the last '*', if we’ve encountered one char * pAfterLastTame = NULL; // The location in the tame string, from which we started after last wildcard char t, w; // Walk the text strings one character at a time. while (1) { t = *pTameText; w = *pWildText; // How do you match a unique text string? if (!t || t == cAltTerminator) { // Easy: unique up on it! if (!w || w == cAltTerminator) { break; // "x" matches "x" } else if (w == '*') { pWildText++; continue; // "x*" matches "x" or "xy" } else if (pAfterLastTame) { if (!(*pAfterLastTame) || *pAfterLastTame == cAltTerminator) { bMatch = FALSE; break; } pTameText = pAfterLastTame++; pWildText = pAfterLastWild; continue; } bMatch = FALSE; break; // "x" doesn't match "xy" } else { if (!bCaseSensitive) { // Lowercase the characters to be compared. if (t >= 'A' && t <= 'Z') { t += ('a' - 'A'); } if (w >= 'A' && w <= 'Z') { w += ('a' - 'A'); } } // How do you match a tame text string? if (t != w) { // The tame way: unique up on it! if (w == '*') { pAfterLastWild = ++pWildText; pAfterLastTame = pTameText; w = *pWildText; if (!w || w == cAltTerminator) { break; // "*" matches "x" } continue; // "*y" matches "xy" } else if (pAfterLastWild) { if (pAfterLastWild != pWildText) { pWildText = pAfterLastWild; w = *pWildText; if (!bCaseSensitive && w >= 'A' && w <= 'Z') { w += ('a' - 'A'); } if (t == w) { pWildText++; } } pTameText++; continue; // "*sip*" matches "mississippi" } else { bMatch = FALSE; break; // "x" doesn't match "y" } } } pTameText++; pWildText++; } return bMatch; }
From Dr. Dobb's reader Benjamin Lunt...
Hi Kirk,
Yes, much faster. At 10,000,000 iterations,
wild = "*ip*";
wild = "m*i*.*r";
wild = "m*x*.*r";tame = "mississippi.river";
wild = "*.*";
original code: 57 seconds
new code: 2 seconds
my friends code: 46 seconds
47, 1, 43 respectively
77, 3, 66 respectively
69, 23, 66 respectively
Good job. However, if you will notice, it is extremely faster if it is a match, but quite slow (though still faster than the original) if not a match. I figured it should be the other way around. Once you find the match, you stop looking.
It is finding non matches that should be the faster code.
I didn't look too closely at your code yet, but could it be switched so that it was faster on a non match than on a match?
Anyway, I was surprised in the speed increase from the last code. I thought I did something wrong, but no, it was your code that was just much faster :-)
Thanks,
--Ben
Kirk is a software developer working on IBM Rational PurifyPlus. He can be contacted at [email protected].