FINDING STRING DISTANCES
A rose by any other name has a particular Levenshtein distance
Ray Valdes
Ray is a technical editor at DDJ. He can be reached at 411 Borel Avenue, San Mateo, CA 94402.
This article discusses the theory and practice of sequence comparison. This topic has lurked mostly unnoticed on the sidelines of computer science, but has proved tremendously important in biotech research and may now have widespread application in the areas of handwriting and speech recognition.
Background
Comparing two sequences of characters is ostensibly a rather narrow, simple problem. Standard texts on algorithms virtually ignore this subject, possibly because it seems relatively inconsequential. More comprehensive texts discuss the problem of finding an exact match quickly, relying on the now classic Boyer-Moore algorithm.
Finding multiple or partial matches between two strings is a subject that has been well studied, for it falls within the traditional domain of formal languages and parsing theory. The familiar grep utility (and other UNIX utilities that parse regular expressions, such as awk, sed, and lex) established finite state machines as the method of choice for efficient processing of multiple matches that can be described by regular expressions.
Since the time of those classic programs, there's been little reason for the mainstream to investigate further. Recently, however, more precise notions of sequence comparison have become important. When two strings are different, what metric can be used to describe the difference? What are the basic operations that cause sequences to change? Given two different strings, can we arrive at the optimal sequence of steps that will transform one string into the other?
Actually, some well-known computer scientists (Aho, Hopcroft, and Ullman, for instance) have published papers on what is known as "the string-to-string correction problem," but these have not been widely cited.
All this may seem academic until you realize that sequence comparison has served as one of the underpinnings of research in molecular biology over the last decade, with benefits that are now coming to fruition, as any glance at the science section of your local newspaper will show.
Comparing Genetic Sequences
As you may recall, all living creatures grow, develop, and reproduce under the control of a stored program, the biochemical instructions of which are recorded digitally on a thin polymer filament--DNA. In the case of humans, this digital tape is about three feet long, broken up into the 23 chromosomes that constitute the human genome, and wound up into a compact package within the cell nucleus.
Because researchers have not yet transcribed and disassembled this program completely (a feat likely to occur during this decade), molecular biologists must presently conduct research by trying to match snippets of DNA (sometimes haphazardly collected) against previously transcribed sequences in a database. These sequence comparisons are done by computer, using symbolic representations of biochemical instructions (basically, strings of text that represent the base pairs out of which DNA is constructed). By finding related sequences and measuring the distance between them, researchers gain a greater understanding of how life was created and has evolved, and of the different roles played by the many biochemical components of the human body.
Handwriting and Speech Recognition
Another area where it is important to know the extent to which two strings differ is handwriting recognition and signature verification. As Ron Avitzur shows in his article "Your Own Handprinting Recognition Engine" (on page 32 of this issue), you can represent the salient features of a handprinted character by a string that describes the direction and position of the stylus as it marks strokes on a surface. Avitzur's algorithm is straightforward, and uses very short strings ("NWESWS," for example) to describe the trajectory of the stylus.
A more sophisticated algorithm might use longer string descriptions and require a more precise measure of the difference and similarity between a candidate string and the stored templates in the database.
Using such an algorithm to recognize regular handprinted text would be overkill (not to mention slow). But in the case of verifying handwritten signatures, which have longer strokes and more intricate shapes, the limits of Avitzur's fast, straightforward approach--which relies on exact matching--become apparent. You're less likely to get an exact match in the case of long strokes with detailed representations. Also, a way of "warping" the data stream becomes necessary, because people may arbitrarily extend or shorten parts of strokes when writing in cursive script.
Other applications that require this kind of sequence comparison include voice recognition and speech processing. Also, some researchers have used sequence comparison to study bird songs. Other unusual applications include gas chromatography, geological analysis, and fingerprint analysis.
The Levenshtein Distance
For each of these diverse applications, a number of sequence comparison algorithms are in use. Most of these stem from the basic work of V.I. Levenshtein, published in Russian in 1965. His advances in coding theory were independently rediscovered and published in the early '70s by a dozen or so researchers in a number of fields (mostly speech processing and molecular biology). Computer scientists joined the fray rather late in the game, with a JACM paper by Wagner and Fischer in 1974.
Levenshtein introduced two measures of difference between two strings. One is the smallest number of substitutions, insertions, and deletions required to transform one string into the other. The other measure is similar, except that substitutions are not allowed. Both metrics have been called the "Levenshtein distance"; we reserve the term for the first, shorter distance. Example 1 shows an example, using the two strings "Axolotl" and "Axl Rose." The Levenshtein distance between these strings is 5, consisting of 4 substitutions and 1 insertion.
Example 1: The five steps from Axolotl to Axl Rose
Axolotl
Axolote    (Substitute E for L)
Axolose    (Substitute S for T)
AxoRose    (Substitute R for L)
Ax Rose    (Substitute space for O)
Axl Rose   (Insert L)
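The definition translates almost directly into code. The following recursive sketch is our own illustration (it is not part of the program presented later); because it recomputes the same subproblems many times, it runs in exponential time, but it makes the definition concrete:

#include <string.h>

/* Recursive sketch of the Levenshtein definition (exponential time).
   Returns the minimum number of insertions, deletions, and
   substitutions needed to transform string a into string b. */
static int lev(const char *a, const char *b)
{
    int del, ins, sub;

    if (*a == '\0') return (int)strlen(b);    /* insert the rest of b     */
    if (*b == '\0') return (int)strlen(a);    /* delete the rest of a     */
    if (*a == *b)   return lev(a + 1, b + 1); /* match: costs nothing     */

    del = lev(a + 1, b);                      /* delete a[0]              */
    ins = lev(a, b + 1);                      /* insert b[0]              */
    sub = lev(a + 1, b + 1);                  /* substitute b[0] for a[0] */
    return 1 + (del < ins ? (del < sub ? del : sub)
                          : (ins < sub ? ins : sub));
}

Called with the strings from Example 1, lev("Axolotl", "Axl Rose") returns 5. The exponential cost comes from solving the same subproblems over and over--which is precisely what the dynamic-programming approach described next avoids.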
The algorithm uses a problem-solving strategy known as "dynamic programming." Dynamic programming does not mean programming a computer with a lot of loud, animated motions. Rather, it is a term of unfortunate coinage that comes from decision theory and operations research, and refers to one particular approach to solving resource-allocation problems. The approach consists of separating an optimization problem into a series of smaller, interrelated steps (known as "stages"), each of which represents a partial optimization and builds on the stages before it. Basically, you undertake stages one at a time until you arrive at the solution, which is the final stage.
In general, the optimal value at each intermediate-stage decision depends on conditions which are unknown until the final-stage optimization has been reached. So you have to keep the results of the intermediate-stage calculations until you reach the end, when the optimal solution can be determined. These intermediate-stage results are kept in a decision table or cost matrix.
Some well-known optimization problems solved with dynamic programming are the shortest-route problem and the knapsack problem. Another example is Donald Knuth's algorithm for composing streams of text into paragraphs, to arrive at the optimal set of line breaks and hyphenations.
Finding the Levenshtein distance between two strings is similar to finding the shortest route between two cities. Using dynamic programming, there are two basic ways to solve these problems: forwards and backwards. Say you're trying to find the shortest route between two cities connected by a network of highways that traverse intermediate cities. You can either start at the first city and work forwards, finding the minimum distance between that city and its neighbors, and then moving on to the next stage. Or you can start at the destination and work backwards, until you reach the starting point. Listing One (page 107) shows the program LEVDIST.C, which implements the classic, forward version of the algorithm.
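Before stepping through the full listing, a compact version may help. The sketch below is our own distillation (the name lev_forward is ours): it works forward through the cost matrix one row at a time, keeping only the previous row, and returns the distance without the backtracking bookkeeping that LEVDIST.C adds:

#include <string.h>

#define MAXLEN 64   /* assumes strings no longer than 64 characters */

/* Forward dynamic-programming sketch: returns the Levenshtein
   distance between a and b, using unit costs for all operations. */
static int lev_forward(const char *a, const char *b)
{
    int prev[MAXLEN + 1], curr[MAXLEN + 1];
    int i, j;
    int m = (int)strlen(a), n = (int)strlen(b);

    for (j = 0; j <= n; j++)
        prev[j] = j;                          /* row 0: j insertions    */
    for (i = 1; i <= m; i++)
    {
        curr[0] = i;                          /* column 0: i deletions  */
        for (j = 1; j <= n; j++)
        {
            int sub = prev[j-1] + (a[i-1] == b[j-1] ? 0 : 1);
            int del = prev[j] + 1;
            int ins = curr[j-1] + 1;
            curr[j] = sub < del ? (sub < ins ? sub : ins)
                                : (del < ins ? del : ins);
        }
        memcpy(prev, curr, sizeof(prev));     /* this row becomes "previous" */
    }
    return prev[n];
}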
The algorithm is straightforward, as a result of various constraints. One is that string distances have the properties of a distance in metric space, such as the one found in high-school geometry. Remember the triangle inequality? This says that the distance from point A to point C cannot be greater than the distance from A to B plus the distance from B to C. In the case of travel, it cannot be shorter to go from San Francisco to Los Angeles by way of Memphis (unless you use Federal Express). In the case of strings, the cheapest way to change character A to character C is by replacing A with C directly, as opposed to replacing it first with B and then with C. These statements may sound obvious, but realize that they are ways of formally constraining a general problem that can be quite complex into one that is more tractable. Other related constraints are: distance cannot be negative, the distance between two identical strings is zero, and the distance between two strings is the same in either direction (known as the "symmetry property").
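These constraints can be spot-checked mechanically. The fragment below is again our own illustration, reusing the recursive lev() sketch shown earlier; with unit costs, the assertions hold for any three strings:

#include <assert.h>

/* Sanity check of the metric properties of the edit distance
   (assumes the lev() sketch shown earlier is in scope). */
static void check_metric(const char *a, const char *b, const char *c)
{
    assert(lev(a, a) == 0);                      /* identity            */
    assert(lev(a, b) == lev(b, a));              /* symmetry            */
    assert(lev(a, c) <= lev(a, b) + lev(b, c));  /* triangle inequality */
}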
How the Algorithm Works
The top-level routine is main( ), which asks for two strings: A and B. It then calls functions to initialize the matrix M, fill the cells of M with partial distances, and backtrack from the final stage at the lower-right corner of M to display the optimum sequence of string edits.
The procedure calculate_matrix( ) fills the cells in M with the minimum edit distances at each stage of the transformation from string A to string B. The characters in string A lie along the vertical axis of M at column 0, while those in string B lie on the horizontal axis at row 0. We start at cell M[0][0], the upper-left corner of the matrix, and work southeast over to cell M[m][n].
Moving by one unit from northwest to southeast means substituting the character at A[i] with character B[j]. Moving one unit east means inserting character B[j], while moving one unit south means deleting character A[i]. This is expressed as a recurrence relation in Example 2. At each stage, represented by cell M[i][j], the minimum distance between the partial strings A[1..i] and B[1..j] is the three-way minimum of the accumulated distance in the predecessor stages (the cells immediately north, west, and northwest of cell M[i][j]) plus the cost of moving from those neighboring cells to the current cell--the cost of deleting, inserting, or substituting the corresponding characters.
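Example 2: The recurrence relation for the partial edit distances

M[i][j] = minimum of:
    M[i-1][j]   + cost of deleting A[i]
    M[i][j-1]   + cost of inserting B[j]
    M[i-1][j-1] + cost of substituting B[j] for A[i]
                  (zero when A[i] and B[j] already match)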
At the north boundary, there are no northern predecessors. We can therefore initialize each cell M[0][j] in that first row with the cost of inserting character B[j]. Likewise, at the western boundary, we initialize each cell M[i][0] with the cost of deleting character A[i]. This happens in the routine initialize_matrix( ).
After calculate_matrix( ) has done its work, we know that the Levenshtein distance is in cell M[m][n]. We can backtrack from this southeasternmost cell to find the optimum sequence of insertions, deletions, and substitutions that will transform string A into string B. This happens in the routine backtrack_matrix( ). Note that while there may be multiple paths with the same minimum number of edit operations, only one of these is traced and displayed.
The output of the program is shown in Example 3. The matrix shows the edit distance for each cell, followed by the operation (INS, DEL, SUB, or EQU) that brought us there from its northern, western, or northwestern neighbor. The program allows you to change the weights assigned to each kind of edit operation. This is useful for speech recognition, where you want to be able to "time-warp" (stretch or compress) one signal with respect to another, and would therefore want insertions and deletions to be much cheaper than substitutions. For matching DNA and protein sequences, the reverse holds true: substitutions are cheaper than insertions and deletions.
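As a sketch of the idea (the table names and the specific weights are our own arbitrary examples, using the operation structure from Listing One):

/* Illustrative alternative cost tables for optable[].  The weights
   are arbitrary examples, not tuned values. */

operation speech_optable[] =      /* "time warping": stretching or   */
{                                 /* compressing is cheap, so INS    */
    { 0, "EQU", -1, -1 },         /* and DEL cost less than SUB      */
    { 1, "INS",  0, -1 },
    { 1, "DEL", -1,  0 },
    { 3, "SUB", -1, -1 },
};

operation genetic_optable[] =     /* point mutations are common in   */
{                                 /* DNA, so SUB costs less than     */
    { 0, "EQU", -1, -1 },         /* INS and DEL                     */
    { 2, "INS",  0, -1 },
    { 2, "DEL", -1,  0 },
    { 1, "SUB", -1, -1 },
};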
Example 3: Output of LEVDIST.C
 \    |  0  |  1  |  2  |  3  |  4  |  5  |  6  |  7  |  8  |
  \COL|-----|-----|-----|-----|-----|-----|-----|-----|-----|
ROW \ |     |  A  |  x  |  l  |     |  R  |  o  |  s  |  e  |
---------------------------------------------------------------
 0 \  | 0 DEL 1 INS 2 INS 3 INS 4 INS 5 INS 6 INS 7 INS 8 INS
 1 A  | 1 DEL 0 EQU 1 INS 2 INS 3 INS 4 INS 5 INS 6 INS 7 INS
 2 x  | 2 DEL 1 DEL 0 EQU 1 INS 2 INS 3 INS 4 INS 5 INS 6 INS
 3 o  | 3 DEL 2 DEL 1 DEL 1 SUB 2 SUB 3 SUB 3 EQU 4 INS 5 INS
 4 l  | 4 DEL 3 DEL 2 DEL 1 EQU 2 SUB 3 SUB 4 SUB 4 SUB 5 SUB
 5 o  | 5 DEL 4 DEL 3 DEL 2 DEL 2 SUB 3 SUB 3 EQU 4 INS 5 SUB
 6 t  | 6 DEL 5 DEL 4 DEL 3 DEL 3 SUB 3 SUB 4 SUB 4 SUB 5 SUB
 7 l  | 7 DEL 6 DEL 5 DEL 4 DEL 4 SUB 4 SUB 4 SUB 5 SUB 5 SUB

Levenshtein distance is 5.

Backtrace: Axolotl
D(7,8)=5 SUB -> Axolote
D(6,7)=4 SUB -> Axolose
D(4,5)=3 SUB -> AxoRose
D(3,4)=2 SUB -> Ax Rose
D(2,3)=1 INS -> Axl Rose
The costs are stored in the optable[] array. A more general way of allowing varied costs would be a two-dimensional weight matrix, covering all letters of the alphabet and specifying the cost of substituting any one letter for any other. Even this is rudimentary in the case of genetic sequences, where researchers use more elaborate algorithms that vary operation weights dynamically.
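A minimal sketch of such a weight matrix follows; the names sub_weight and sub_cost and the vowel rule are our own hypothetical example:

#include <ctype.h>

/* Hypothetical two-dimensional substitution-weight table:
   sub_weight[i][j] is the cost of replacing letter 'a'+i with
   letter 'a'+j.  Filled with an arbitrary rule (vowel-for-vowel
   swaps are cheap); assumes input characters are letters. */
static int sub_weight[26][26];

static void init_sub_weights(void)
{
    const char *vowels = "aeiou";
    int i, j;

    for (i = 0; i < 26; i++)
        for (j = 0; j < 26; j++)
            sub_weight[i][j] = (i == j) ? 0 : 2;   /* default costs */

    for (i = 0; vowels[i] != '\0'; i++)            /* cheap vowel swaps */
        for (j = 0; vowels[j] != '\0'; j++)
            if (vowels[i] != vowels[j])
                sub_weight[vowels[i] - 'a'][vowels[j] - 'a'] = 1;
}

/* cost of substituting character b for character a */
static int sub_cost(char a, char b)
{
    return sub_weight[tolower((unsigned char)a) - 'a']
                     [tolower((unsigned char)b) - 'a'];
}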
Complexity
The algorithm is of complexity O(nm), where n and m are the lengths of the two strings. If the strings are equal in length, this of course means O(n^2). This complexity means it is probably unsuitable for recognizing handprinted characters rapidly, but likely to work well in the case of signature verification.
One way of speeding it up is known as the "Four Russians" algorithm, a rather juvenile name coined in a 1980 paper by American researchers Masek and Paterson, after the four authors of the 1970 publication it builds on: Arlazarov, Dinic, Kronrod, and Faradzev. The algorithm works faster by splitting the computation into many smaller computations which, if small enough and if the alphabet is finite, can be precomputed and combined to derive the larger computation. In other words, the matrix is partitioned into submatrices, and all possible computations on submatrices are precomputed. This version is of complexity O(mn/min(m, log n)), assuming that m <= n. Unfortunately, the performance gain only shows up in the case of very long strings. (The example provided by Masek and Paterson does not pay off until the strings are longer than 262,419 characters.)
Molecular biologists have taken these general-case algorithms and modified them for specific circumstances, as in the FASTA family of algorithms by Lipman and Pearson. In these special cases, complexity has been lowered to roughly O(m). But given databases now more than 30 million characters in length, even better methods must be found. Current research focuses on developing parallel algorithms that can take advantage of architectures such as that from Thinking Machines Corp.
Eventually, we will be able to know, in megabytes of sub-microscopic detail, exactly how one rose differs from a rose by another name.
References
Tyler, Elizabeth, Martha Horton, and Philip Krause. "A Review of Algorithms for Molecular Sequence Comparison." Computers and Biomedical Research (vol. 24, 1991).
Sankoff, David and Joseph B. Kruskal (eds.). Time Warps, String Edits, and Macromolecules: The Theory and Practice of Sequence Comparison. Reading, Mass.: Addison-Wesley, 1983.
Miclet, Laurent. Structural Methods in Pattern Recognition. New York, N.Y.: Springer-Verlag, 1986.
Levenshtein, V.I. "Binary Codes Capable of Correcting Deletions, Insertions, and Reversals." Doklady Akademii Nauk SSSR 163(4):845-848, 1965. (Russian reference from Sankoff and Kruskal.)
_FINDING STRING DISTANCES_ by Ray Valdes [LISTING ONE]
/***********************************************************************
   LEVDIST.C -- Computing the Levenshtein distance (string-to-string edit)
   by Ray Valdes, DDJ April 92
***********************************************************************/
#include <stdio.h>
#include <string.h>

#define TRUE  1
#define FALSE 0
#define private static
#define public /**/
typedef int bool;

private bool verbose_mode = TRUE;

typedef enum { MATCH, INS, DEL, SUB } opcode;
typedef struct
{
    int   cost;
    char *name;
    int   delta_row;
    int   delta_col;
} operation;

#define COST(op)  (optable[(int)op].cost)   /* for convenience */
#define OPSTR(op) (optable[(int)op].name)   /* for convenience */

private operation optable[] =   /* costs defined on a per-op basis */
{ /*--cost, name, delta_row, delta_col--------------------------------*/
    { 0, "EQU", -1, -1 },  /* a match or no-op backtracks to NorthWest */
    { 1, "INS",  0, -1 },  /* insert op backtracks to the West         */
    { 1, "DEL", -1,  0 },  /* delete op backtracks to the North        */
    { 1, "SUB", -1, -1 },  /* substitution op backtracks to NorthWest  */
};

typedef struct { int distance; opcode op; } matrix_cell;

#define NUM_ROWS      64
#define NUM_COLS      NUM_ROWS
#define SIZEOF_STRING NUM_ROWS

private char        A[SIZEOF_STRING];
private char        B[SIZEOF_STRING];
private matrix_cell M[NUM_ROWS][NUM_COLS];  /* this is The Matrix */

#define DIST(i,j) (M[(i)][(j)].distance)    /* for convenience */

/****************************************************************/
private void say_hello         (void);
private bool get_strings       (void);
private void initialize_matrix (void);
private void calculate_cell    (int row, int col);
private void print_cell        (int row, int col);
private void calculate_matrix  (int num_rows, int num_cols);
private void backtrack_matrix  (int num_rows, int num_cols);
/****************************************************************/
public int main(int argc, char **argv)
{
    say_hello();
    while (get_strings())
    {
        initialize_matrix();
        calculate_matrix(strlen(A), strlen(B));
        backtrack_matrix(strlen(A), strlen(B));
    }
    return 0;
}
/****************************************************************/
private void say_hello(void)
{
    if (verbose_mode)
        printf("\nLevenshtein distance, V1.0");
}
/****************************************************************/
private bool get_strings(void)
{
    char buffer[SIZEOF_STRING * 3];   /* arbitrarily big buffer */

    printf("\nEnter first string > ");
    if (fgets(buffer, sizeof(buffer), stdin) == NULL)  /* fgets, not the */
        return FALSE;                                  /* unsafe gets()  */
    buffer[strcspn(buffer, "\n")] = '\0';     /* strip trailing newline  */
    if (buffer[0] == '\0')
        return FALSE;
    strcpy(A, buffer);

    printf("\nEnter second string > ");
    if (fgets(buffer, sizeof(buffer), stdin) == NULL)
        return FALSE;
    buffer[strcspn(buffer, "\n")] = '\0';
    if (buffer[0] == '\0')
        return FALSE;
    strcpy(B, buffer);
    return TRUE;
}
/****************************************************************/
private void initialize_matrix(void)
{
    int row, col;

    for (row = 0, col = 0; col < NUM_COLS; col++)  /* initialize first row */
    {
        M[row][col].distance = col * COST(INS);    /* cost of col inserts  */
        M[row][col].op       = INS;
    }
    for (row = 0, col = 0; row < NUM_ROWS; row++)  /* initialize first col */
    {
        M[row][col].distance = row * COST(DEL);    /* cost of row deletes  */
        M[row][col].op       = DEL;
    }
}
/****************************************************************/
private void calculate_cell(int row, int col)
{
    int dNorthWest = DIST(row-1, col-1);
    int dWest      = DIST(row,   col-1);
    int dNorth     = DIST(row-1, col  );

    if (dWest < dNorth)
    {
        if (dWest < dNorthWest)
        {
            M[row][col].op       = INS;
            M[row][col].distance = dWest + COST(INS);
        }
        else    /* dNorthWest <= dWest < dNorth */
        {
            opcode op = (A[row-1] == B[col-1]) ? MATCH : SUB;
            M[row][col].op       = op;
            M[row][col].distance = dNorthWest + COST(op);
        }
    }
    else        /* dNorth <= dWest */
    {
        if (dNorth < dNorthWest)
        {
            M[row][col].op       = DEL;
            M[row][col].distance = dNorth + COST(DEL);
        }
        else    /* dNorthWest <= dNorth <= dWest */
        {
            opcode op = (A[row-1] == B[col-1]) ? MATCH : SUB;
            M[row][col].op       = op;
            M[row][col].distance = dNorthWest + COST(op);
        }
    }
}
/****************************************************************/
private void calculate_matrix(int num_rows, int num_cols)
{
    int row, col;

    if (verbose_mode)   /* print the column headings and row 0 */
    {
        printf("\n\\\n \\COL |");
        for (col = 0; col < num_cols+1; col++)
            printf("____%d____|", col);   /* was missing the col argument */
        printf("\nROW \\ |     |");
        for (col = 1; col < num_cols+1; col++)
            printf("  %c  |", B[col-1]);
        printf("\n 0    \\|");
        for (row = 0, col = 0; col < num_cols+1; col++)
            print_cell(row, col);
    }
    for (row = 1; row < num_rows+1; row++)
    {
        if (verbose_mode)
        {
            printf("\n% 2d %c |", row, A[row-1]);
            print_cell(row, 0);
        }
        for (col = 1; col < num_cols+1; col++)
        {
            calculate_cell(row, col);
            if (verbose_mode)
                print_cell(row, col);
        }
    }
}
/****************************************************************/
private void print_cell(int row, int col)
{
    printf(" %d %s ", DIST(row, col), OPSTR(M[row][col].op));
}
/****************************************************************/
private void backtrack_matrix(int num_rows, int num_cols)
{
    int dx, dy;
    int i, j;
    int row = num_rows;
    int col = num_cols;

    printf("\n\nLevenshtein distance is %d.\n", DIST(row, col));
    printf("\nBacktrace: %s", A);
    while (row > 0 || col > 0)
    {
        if ((M[row][col].op != MATCH) && verbose_mode)
        {
            /* print the partially edited string at this stage */
            printf("\nD(%d,%d)=%d ", row, col, DIST(row, col));
            printf("%s --> ", OPSTR(M[row][col].op));
            for (i = 1; i < row - (M[row][col].op == DEL ? 1 : 0); i++)
                printf("%c", A[i-1]);
            for (j = col - (M[row][col].op == INS ? 1 : 0); j < num_cols+1; j++)
                printf("%c", B[j-1]);
        }
        dy = optable[(int)(M[row][col].op)].delta_row;   /* step back to  */
        dx = optable[(int)(M[row][col].op)].delta_col;   /* the           */
        row += dy;                                       /* predecessor   */
        col += dx;                                       /* cell          */
    }
}
Copyright © 1992, Dr. Dobb's Journal