Dr. Dobb's is part of the Informa Tech Division of Informa PLC

This site is operated by a business or businesses owned by Informa PLC and all copyright resides with them. Informa PLC's registered office is 5 Howick Place, London SW1P 1WG. Registered in England and Wales. Number 8860726.


Channels ▼
RSS

Database

Faster String Searches


JUL89: FASTER STRING SEARCHES

Costas Menico is a senior software developer and part owner of The Software Bottling Company. He can be reached at 6600 Long Island Expressway, Maspeth, NY 11378.


While programming several of my company's products, I have many times come across the need to perform a string search to accomplish a certain task. A string search is the function used to find a string A (the pattern) in a string B (the text) and return the position of A in B.

After realizing that the built-in string search functions of many language compilers used simple brute-force algorithms, I set out to find and implement an equivalent string-search function that would perform at a fraction of the speed of the built-in function.

The Brute-Force Method

As the name implies, the brute-force method of performing a string search involves an exhaustive search from the start of the pattern until there is a match or the end of the string is reached.

Each character in the pattern is compared left to right with each character in the string. The scanning is continued until all of the characters in the pattern match an equivalent length in the string. If no match occurs, the pattern is moved one character to the right and a new search is begun.

Assume we want to search for the pattern HEAD in the string "MAXIMOODHEADROOM." The search is shown in Figure 1. The method shown is time-consuming, because every character in the string must be compared with every other character until a final match is found. Nine steps were taken to discover the sample pattern. Using a program that must perform thousands of searches is a time-consuming way of looking up a pattern.

Figure 1: Brute-force search

  Pattern= HEAD
  String=  M A X I M O O D H E A D R O O M
  1)      H                        H Does not match, move right.
  2)        H                      H Does not match, move right.
  3)          H                    H Does not match, move right.
  4)            H                  H Does not match, move right.
  5)              H                H Does not match, move right.
  6)                H              H Does not match, move right.
  7)                  H            H Does not match, move right.
  8)                    H          H Does not match, move right.
  9)                      H E A D    HEAD finally matches.

The Boyer-Moore Method

After looking through some impressive (and complex) algorithms, I came across an elegant and easily implemented algorithm called the Boyer-Moore method, named for the pair of scientists who invented it. This is a hybrid algorithm that uses pattern analysis combined with brute force.

To implement the algorithm in assembler, I used MASM and linked it into Turbo Pascal 4.0. The algorithm is also easily converted to work with C and Basic. For the assembly language program (called POSBM), see Listing One; for the benchmark test program see Listing Two.

The basic idea behind the Boyer-Moore technique is the movement of the pattern more than one character at a time during the search for a matching character. This saves search steps, thereby increasing search speed. You may wonder how a whole pattern can be moved to the right without a search through any of the characters in between. The answer is simple. All that is required is a 256-character array that defines the pattern. This array (called a SKIPARRAY) is built every time the search function is called. Each position corresponds to the value of each character in the ASCII character set.

The algorithm is simple. A SKIPARRAY of 256 bytes is filled with the length of the pattern. Starting from the right of the pattern, each character is read and its offset placed into the SKIPARRAY, using the character's ASCII value as the index (see Figure 2).

Figure 2: SKIPARRAY values with the HEAD pattern

                         A   .   .   D   E   .   .   H   . .  ..    Z
  SKIPARRAY position...  65  66  67  68  69  70  71  72  73   ...   91
  Original values        4   4   4   4   4   4   4   4   4    ....  4
  Values with "HEAD"     1   4   4   0   2   4   4   3   4    ....  4

The pattern is then placed against the string for the search. At the rightmost position of the pattern the corresponding character in the string is read and its ASCII value used as an index into the SKIPARRAY. The value in the SKIPARRAY will determine how far to slide the pattern to the right, with one exception. A value of 0 means that the rightmost character in the pattern is lined up with an exact character in the string. Because this is a possible match, a reverse compare is performed on the remaining characters in the string (that is, a reverse brute-force compare). If the pattern matches, the offset is noted in the string. Otherwise, the pattern slides over by one position and the process is repeated.

To better explain this process I will use as an example the pattern HEAD. The SKIPARRAY is first filled with the length of the pattern. Once this is completed the SKIPARRAY will contain 4s in all 256 positions.

Next, we put the skip value for each character in the pattern into the SKIPARRAY. To do this, the offset of each character is taken from the end of HEAD and placed at the ASCII offset position in the SKIPARRAY. For the pattern HEAD, the SKIPARRAY is transformed by putting in the corresponding values of 3, 2, 1, and 0 at the 72 (H), 69 (E), 65 (A) and 68 (D) positions (see Figure 3).

Figure 3: Using Boyer-Moore, we find a match in four steps

  Pattern= H E A D
  String=   M A X I M O O D H E A D R O O M
  1)        H E A D
  2)                H E A D
  3)                  H E A D
  4)                        H E A D

The pattern is then scanned right to left to determine how far to move the pattern for each character in the string. To find the pattern HEAD in MAXIMOODHEADROOM, HEAD is positioned at the start of the string. This places the character D of the pattern HEAD under the character I in the string (see step 1 in Figure 3).

Because there is no match, that is, D is not equal to I, the SKIPARRAY is indexed at the position of the value for I, which is found to be 4. As I is not in the pattern, HEAD is positioned four characters to the right of the mismatched position.

The D in the pattern (see step 2 in Figure 3) is now positioned under the D in the string. This indicates that the last character of the pattern matches the corresponding character in the string. We therefore proceed to match the remaining characters by using a reverse brute-force method. In this case, the A in the pattern will not match the O in the string. Because this was a brute-force compare, we can only move the pattern one character to the right.

The D (see step 3 in Figure 3) is now positioned under the H in the string. As D is not equal to H, we index in to the SKIPARRAY and find that the skip value for H is 3. HEAD now slides to the right three characters.

The D in the pattern (see step 4 in Figure 3) is again lined up under the D in the string. By comparing the remaining characters of HEAD backwards from the D position, we find that the pattern matches successfully and record the starting position in the string.

The program used to perform the Boyer-Moore search method is shown in Listing One. It is assembled using MASM (or TASM) and the OBJ file is linked in to the Turbo Pascal program. It is called by using the name POSBM the same way you would call the POS() function in Turbo Pascal. The way this is set up will only work on parameters of type STRING (see Listing Two), which limit the search to 255 characters. You can easily modify the program to pass arrays the same way they are passed in C.

To call POSBM from C you must modify the structure CSTK to conform to the C calling sequence. You must change the initial code to copy the addresses of the PATADDR and STRADDR from the calling stack, to PATTXT and STRTXT. The length of the strings must also be computed (by searching for a 0 using a SCASB instruction) and placed in the variables PATLEN and STRLEN. Last, the way in which you return from the function should be changed to leave the calling values on the stack. Note that the function POSBM is declared FAR.

Limitations to the Boyer-Moore Method

Some patterns may not be suitable for this method. If your pattern has repeating characters matching against a text of repeating characters, the Boyer-Moore method will actually be slower than brute force. For example, the bit-map pattern 01111111 in a bit-map string of all 1111111111..... would require a lengthy search because the 0 in the pattern would not be compared until all the 1s have been matched from right to left. The pattern would then be moved over by one position and the search tried again, thus defeating the algorithm's purpose.

There have been some clever ways to avoid this problem by analyzing the pattern but, for all intents and purposes, the simple algorithm works very well for text searches.

Benchmarks

To illustrate the speed difference between brute force and Boyer-Moore, I have put together a small test program that compares the speed of the built-in function POS( ) of Turbo Pascal 4.0 (which uses the brute-force method) and the assembly language function POSBM( ) presented in this article.

To run the program, you must first assemble the function POSBM.ASM in Listing Two with MASM or TASM. This will create an .OBJ module. Next, compile the Turbo Pascal program in Listing One. The function will be linked in when the {$L POSBM} directive is encountered. Because the function is declared as FAR, we must include the directive {$F+} in the program to notify Pascal that the function should be called with FAR instructions.

The program starts out by generating a random string 255 characters long and uses the last five characters of the string as the pattern. The pattern is then searched in string using POS( ) and POSBM( ) by putting the searches in a long loop to emphasize the time difference. You may want to increase the value of longloop if you have a fast machine to see the significantly higher start and end times. The POSBM( ) is usually 400 percent faster than the POS( ) function.

Conclusion

I have used this algorithm in a number of programs with impressive results. If you are looking for a way to speed up your program, check to see if the bottleneck is your string search functions. The Boyer-Moore algorithm will give immediate results.

Availability

All source code for articles in this issue is available on a single disk. To order, send $14.95 (Calif. residents add sales tax) to Dr. Dobb's Journal, 501 Galveston Dr., Redwood City, CA 94063, or call 800-356-2002 (from inside Calif.) or 800-533-4372 (from outside Calif.). Please specify the issue number and format (MS-DOS, Macintosh, Kaypro).

_FASTER STRING SEARCHES_ by Costas Menico

[LISTING ONE]

<a name="0157_000e">


;====================================================================
;   Faster string search routine to substitute the POS()
;   function in Turbo Pascal 4 or 5. Based on the Boyer-Moore
;   algorithm.
;   Program author: Costas Menico.
;   Declare as follows:
;   {$F+}
;   {$L search.obj}
;   function posbm(pat,str:string): byte; external;
;   Call as follows from Turbo 4 or Turbo 5
;   location := posbm(pat, str);
;====================================================================

skiparrlength equ 256      ; Length of the skip array

; Function work stack
dstk   struc
patlen    dw    ?      ; pattern length (also BP base work area)
strlen   dw    ?      ; string length
skiparr   db    skiparrlength dup(?)   ; skip array
pattxt   dd    0      ; pattern address
strtxt   dd    0      ; string text address
dstk   ends

; Total stack (Callers plus work stack)
cstk   struc
ourdata   db   size dstk dup(?); work stack size
bpsave   dw   0      ; save bp here
retaddr   dd   0      ; points to return address
straddr   dd    0      ; points to string address
pataddr   dd   0      ; points to pattern address
cstk   ends

paramsize equ size pataddr + size straddr ; Size of parameter list.

public   posbm         ; Function name declaration

code   segment para public 'code'
   assume cs:code
;
; ---- Entry point to POSBM function.
;
posbm    proc    far
   push   bp      ; Save BP
   sub   sp, size dstk   ; Create work area
   mov   bp, sp      ; Adjust our base pointer

   push   ds      ; Save callers data segment

   xor   ah, ah      ; Clear register and
   cld         ; Set direction flag

; Get and save the length and address of the pattern
   lds    si, [bp.pataddr];
   mov    word ptr [bp.pattxt][2], ds
   lodsb         ; Get length of pattern (1 byte)
   or    al, al      ; If pattern length is null then exit
   jne    notnullp
   jmp    nomatch

notnullp:
   mov   cx, ax      ; Save length to check if 1 later.

   mov    [bp.patlen], ax   ; Save length of pattern
   mov    word ptr [bp.pattxt], si; Save address

; Get and save the length and address of the string text
   lds    si, [bp.straddr]
   mov    word ptr [bp.strtxt][2], ds
   lodsb         ; Get length of string
   or    al, al      ; If string text is null then exit
   jne    notnulls
   jmp    nomatch

notnulls:
   mov    [bp.strlen], ax   ; Save length of string
   mov    word ptr [bp.strtxt], si; Save address

   cmp   cx, 1      ; Is length of pattern 1 char?
   jne   do_boyer_moore   ; No - Do Boyer-Moore.
   lds   si, [bp.pattxt]   ; Yes - Do a straight search.
   lodsb         ; Get the single character pattern.
   les   di, [bp.strtxt]   ; Get the address of the string.
   mov   cx, [bp.strlen]   ; Get length of string.
   repne   scasb      ; Search.
   jz   match1      ; Found - Adjust last DI pos.
   jmp   nomatch      ; Not Found - Exit.
match1:
   mov   si, di      ; Transfer DI pos to SI.
   sub   si, 2      ; Adjust SI position.
   jmp   exactmatch   ; Determin offset

do_boyer_moore:

; Fill the ascii character skiparray with the
; length of the pattern
   lea    di, [bp.skiparr]   ; Get skip array address
   mov    dx, ss
   mov    es, dx
   mov    al,byte ptr [bp.patlen]   ; Get size of pattern
   mov    ah,al         ; Put in to AH as well
   mov    cx, skiparrlength / 2   ; Get size of array
   rep    stosw         ; Fill with length of pat

; Replace in the ascii skiparray the corresponding
; character offset from the end of the pattern minus 1
   lds    si, [bp.pattxt]   ; Get pattern address
   lea    bx, [bp.skiparr]; Get skip array address
   mov    cx, [bp.patlen]   ; Get length minus one.
   dec    cx

   mov   bx, bp      ; save BP
   lea    bp, [bp.skiparr]; Get skip array address
   xor   ah, ah
fill_skiparray:
   lodsb         ; Get character from pattern
   mov    di, ax      ; Use it as an index
   mov    [bp+di], cl   ; Store its offset in to skip array
   loop    fill_skiparray

   lodsb
   mov   di, ax
   mov    [bp+di], cl   ; Store the last skip value
   mov   bp, bx      ; Recover BP

; Now initialize our pattern and string text pointers to
; start searching
   lds    si, [bp.strtxt]   ; Get the string address.
   lea    di, [bp.skiparr]; Get the skip array address.
   mov    dx, [bp.strlen]   ; Get the string length.
   dec    dx      ;   minus 1 for eos check.
   mov    ax, [bp.patlen]   ; Get the pattern length.
   dec    ax      ; Starting skip value.
        xor    bh, bh      ; Zero high of BX.
   std         ; Set to reverse compare.

; Get character from text. Use the character as an index
; in to the skip array, looking for a skip value of 0 .
; If found, execute a brute force search on the pattern.
searchlast:
   sub    dx, ax      ; Check if string exhausted.
   jc     nomatch      ; Yes - no match.
   add    si, ax      ; No - slide pattern with skip value.
   mov     bl, [si]   ; Get character, use as an index
   mov    al, ss:[di+bx]   ;   and get the new skip value.
   or    al, al      ; If 0, then possible match
   jne    searchlast   ;   try again by sliding to right.

; We have a possible match, therefore
; do the reverse Brute-force compare
   mov    bx, si      ; Save string address.
   mov    cx, [bp.patlen]   ; Get pattern length.
   les    di, [bp.pattxt]   ; Get pattern address
   dec    di      ;  adjust
   add    di, cx      ;  and add to point to eos.
   repe    cmpsb      ; Do reverse compare.
   je    exactmatch   ; If equal we found a match
   mov    ax, 1      ;   else set skip value to 1.
   lea    di, [bp.skiparr]; Get address of skip array.
   mov    si, bx      ; Get address of string.
        xor    bh, bh      ; No - Zero high of BX.
   jmp short searchlast   ; Try again.

exactmatch:
   mov   ax, si      ; Save current position in string.
   lds    si, [bp.strtxt]   ; Get start of strtxt.
   sub    ax, si      ; Subtract and add 2 to get position
   add    ax, 2      ;   in strtxt where pattern is found.
   jmp    short endsearch   ; Exit function

nomatch:
   xor    ax, ax      ; No match, return a 0

endsearch:
   cld
   pop    ds      ; Recover DS for Turbo Pascal

   mov   sp, bp      ; Recover last stack position.
   add    sp, size dstk   ; Clear up work area.
   pop    bp      ; Recover BP
   ret   paramsize   ; Return with ax the POSBM value.
posbm    endp

code   ends
end





<a name="0157_000f"><a name="0157_000f">
<a name="0157_0010">
[LISTING TWO]
<a name="0157_0010">

{ -- Benchmark program to demonstrate the speed difference
  -- between the POS() in Turbo Pascal 4 or 5 brute-force
  -- and the Boyer-Moore Method function POSBM()
  -- Program Author: Costas Menico
}
program search;
uses dos,crt;

{ -- Link in the POSBM Boyer-Moore function -- }
{$F+}
{$L POSBM}
function posbm(pat,str:string):byte; external;

{ Prints bencmark timing information }
procedure showtime(s:string; t: registers);
begin
  writeln(s,' Hrs:',t.ch,' Min:',t.cl,' Sec:',t.dh,' Milsec:',t.dl);
end;

var
  pat,str: string;
  i:integer;
  j:integer;
  start,finish: registers;

const
  longloop = 2000;

begin

  clrscr;

  { -- Create a random string of length 255 -- }
  randomize;
  for i:=1 to 255 do str[i]:=chr(random(255)+1);
  str[0]:=chr(255);

  { -- Initialize a pattern string with the last five characters
    -- in the random string as the pattern to search for. -- }
  pat:=copy(str,251,5);

  { -- First do a search with the regular POS function -- }
  writeln('Search using Brute-Force Method');
  start.ah := $2c;
  msdos(start);       { -- Get start time -- }

  for j:=1 to longloop do
    i:=pos(pat,str);  { -- Do search a few times Brute-Force -- }

  finish.ah := $2c;   { -- Get finish time -- }
  msdos(finish);

  showtime('Start ',start);    { -- Show start time -- }
  showtime('Finish',finish);   { -- Show finish time -- }
  writeln('Pattern found at =',i); { -- Print string position -- }
  writeln;

  { -- Now do search with the POSBM() (Boyer-Moore) function -- }
  writeln('Search using Boyer-Moore Method');
  start.ah := $2c;
  msdos(start);       { -- Get start time -- }

  for j:=1 to longloop do
    i:=posbm(pat,str);{ -- Do search a few times Boyer-Moore -- }

  finish.ah := $2c;   { -- Get finish time -- }
  msdos(finish);

  showtime('Start ',start);    { -- Show start time -- }
  showtime('Finish',finish);   { -- Show finish time -- }
  writeln('Pattern found at =',i); { -- Print string position -- }

  writeln;
  writeln('DONE.. PRESS ENTER');
  readln;
end.













Copyright © 1989, Dr. Dobb's Journal


Related Reading


More Insights






Currently we allow the following HTML tags in comments:

Single tags

These tags can be used alone and don't need an ending tag.

<br> Defines a single line break

<hr> Defines a horizontal line

Matching tags

These require an ending tag - e.g. <i>italic text</i>

<a> Defines an anchor

<b> Defines bold text

<big> Defines big text

<blockquote> Defines a long quotation

<caption> Defines a table caption

<cite> Defines a citation

<code> Defines computer code text

<em> Defines emphasized text

<fieldset> Defines a border around elements in a form

<h1> This is heading 1

<h2> This is heading 2

<h3> This is heading 3

<h4> This is heading 4

<h5> This is heading 5

<h6> This is heading 6

<i> Defines italic text

<p> Defines a paragraph

<pre> Defines preformatted text

<q> Defines a short quotation

<samp> Defines sample computer code text

<small> Defines small text

<span> Defines a section in a document

<s> Defines strikethrough text

<strike> Defines strikethrough text

<strong> Defines strong text

<sub> Defines subscripted text

<sup> Defines superscripted text

<u> Defines underlined text

Dr. Dobb's encourages readers to engage in spirited, healthy debate, including taking us to task. However, Dr. Dobb's moderates all comments posted to our site, and reserves the right to modify or remove any content that it determines to be derogatory, offensive, inflammatory, vulgar, irrelevant/off-topic, racist or obvious marketing or spam. Dr. Dobb's further reserves the right to disable the profile of any commenter participating in said activities.

 
Disqus Tips To upload an avatar photo, first complete your Disqus profile. | View the list of supported HTML tags you can use to style comments. | Please read our commenting policy.