UC BERKELEY
EECS technical reports
TECHNICAL REPORTS


CSD-91-653.pdf
Oskicat catalog record
Conditions of Use

Archive Home Page

Theoretical and Empirical Comparisons of Approximate String Matching Algorithms

Authors:
Chang, William I.
Lampe, Jordan
Technical Report Identifier: CSD-91-653
October 1991
CSD-91-653.pdf

Abstract: We study in depth a model of non-exact pattern matching based on edit distance, which is the minimum number of substitutions, insertions, and deletions needed to transform one string of symbols to another. More precisely, the k differences approximate string matching problem specifies a text string of length n, a pattern string of length m, the number k of differences (substitutions, insertions, deletions) allowed in a match, and asks for all locations in the text where a match occurs. We have carefully implemented and analyzed various O(kn) algorithms based on dynamic programming (DP), paying particular attention to dependence on b the alphabet size. An empirical observation on the average values of the DP tabulation makes apparent each algorithm's dependence on b. A new algorithm is presented that computes much fewer entries of the DP table. In practice, its speedup over the previous fastest algorithm is 2.5X for binary alphabet; 4X for four-letter alphabet; 10X for twenty-letter alphabet. We give a probabilistic analysis of the DP table in order to prove that the expected running time of our algorithm (as well as an earlier "cut off" algorithm due to Ukkonen) is O(kn) for random text. Furthermore, we give a heuristic argument that our algorithm is O(kn/(the square root of b - 1)) on the average, when alphabet size is taken into consideration.