### Theoretical and Empirical Comparisons of Approximate String Matching Algorithms

**Authors:**

Chang, William I.

Lampe, Jordan
**Technical Report Identifier:** CSD-91-653
**October 1991**

CSD-91-653.pdf

**Abstract:** We study in depth a model of non-exact pattern matching based on edit distance, which is the minimum number of substitutions, insertions, and deletions needed to transform one string of symbols to another. More precisely, the *k* differences approximate string matching problem specifies a text string of length *n*, a pattern string of length *m*, the number *k* of differences (substitutions, insertions, deletions) allowed in a match, and asks for all locations in the text where a match occurs. We have carefully implemented and analyzed various *O*(*kn*) algorithms based on dynamic programming (DP), paying particular attention to dependence on *b* the alphabet size. An empirical observation on the average values of the DP tabulation makes apparent each algorithm's dependence on *b*. A new algorithm is presented that computes much fewer entries of the DP table. In practice, its speedup over the previous fastest algorithm is 2.5X for binary alphabet; 4X for four-letter alphabet; 10X for twenty-letter alphabet. We give a probabilistic analysis of the DP table in order to prove that the expected running time of our algorithm (as well as an earlier "cut off" algorithm due to Ukkonen) is *O*(*kn*) for random text. Furthermore, we give a heuristic argument that our algorithm is *O*(*kn*/(the square root of *b* - 1)) on the average, when alphabet size is taken into consideration.