Algorithms on Strings, Trees, and Sequences
Computer Science and Computational Biology
Dan Gusfield
University of California, Davis
CAMBRIDGE UNIVERSITY PRESS
Contents
Preface
2 Exact Matching: Classical Comparison-Based Methods
2.1 Introduction
2.2 The Boyer-Moore Algorithm
2.3 The Knuth-Morris-Pratt algorithm
2.4 Real-time string matching
2.5 Exercises
3 Exact Matching: A Deeper Look at Classical Methods
3.1 A Boyer-Moore variant with a "simple" linear time bound
3.2 Cole's linear worst-case bound for Boyer-Moore
3.3 The original preprocessing for Knuth-Morris-Pratt
3.4 Exact matching with a set of patterns
3.5 Three applications of exact set matching
3.6 Regular expression pattern matching
3.7 Exercises
12 Refining Core String Edits and Alignments
12.1 Computing alignments in only linear space
12.2 Faster algorithms when the number of differences are bounded
12.3 Exclusion methods: fast expected running time
12.4 Yet more suffix trees and more hybrid dynamic programming
12.5 A faster (combinatorial) algorithm for longest common subsequence
12.6 Convex gap weights
12.7 The Four-Russians speedup
12.8 Exercises
13 Extending the Core Problems
13.1 Parametric sequence alignment
13.2 Computing suboptimal alignments
13.3 Chaining diverse local alignments
13.4 Exercises
14 Multiple String Comparison: The Holy Grail
14.1 Why multiple string comparison?
14.2 Three "big-picture" biological uses for multiple string comparison
14.3 Family and superfamily representation
II Suffix Trees and Their Uses

5 Introduction to Suffix Trees
5.1 A short history
5.2 Basic definitions
5.3 A motivating example
5.4 A naive algorithm to build a suffix tree
6 Linear-Time Construction of Suffix Trees
6.1 Ukkonen's linear-time suffix tree algorithm
6.2 Weiner's linear-time suffix tree algorithm
6.3 McCreight's suffix tree algorithm
6.4 Generalized suffix tree for a set of strings
6.5 Practical implementation issues
6.6 Exercises
7 First Applications of Suffix Trees
7.1 APL1: Exact string matching
7.2 APL2: Suffix trees and the exact set matching problem
7.3 APL3: The substring problem for a database of patterns
7.4 APL4: Longest common substring of two strings
7.5 APL5: Recognizing DNA contamination
7.6 APL6: Common substrings of more than two strings
7.7 APL7: Building a smaller directed graph for exact matching
7.8 APL8: A reverse role for suffix trees, and major space reduction
7.9 APL9: Space-efficient longest common substring algorithm
7.10 APL10: All-pairs suffix-prefix matching
7.11 Introduction to repetitive structures in molecular strings
7.12 APL11: Finding all maximal repetitive structures in linear time
7.13 APL12: Circular string linearization
7.14 APL13: Suffix arrays - more space reduction
7.15 APL14: Suffix trees in genome-scale projects
7.16 APL15: A Boyer-Moore approach to exact set matching
7.17 APL16: Ziv-Lempel data compression
7.18 APL17: Minimum length encoding of DNA
7.19 Additional applications
7.20 Exercises
19 Models of Genome-Level Mutations
19.1 Introduction
19.2 Genome rearrangements with inversions
19.3 Signed inversions
19.4 Exercises

Epilogue - where next?
Bibliography
Glossary
Index
14.4 Multiple sequence comparison for structural inference
14.5 Introduction to computing multiple string alignments
14.6 Multiple alignment with the sum-of-pairs (SP) objective function
14.7 Multiple alignment with consensus objective functions
14.8 Multiple alignment to a (phylogenetic) tree
14.9 Comments on bounded-error approximations
14.10 Common multiple alignment methods
14.11 Exercises
Preface
biochemical phenomena to interactions between defined sequences . . . [449]

and

The ultimate rationale behind all purposeful structures and behavior of living things is embodied in the sequence of residues of nascent polypeptide chains . . . In a real sense it is at this level of organization that the secret of life (if there is one) is to be found. [330]

So without worrying much about the more difficult chemical and biological aspects of DNA and protein, our computer science group¹ was empowered to consider a variety of biologically important problems defined primarily on sequences, or (more in the computer science vernacular) on strings: reconstructing long strings of DNA from overlapping string fragments; determining physical and genetic maps from probe data under various experimental protocols; storing, retrieving, and comparing DNA strings; comparing two or more strings for similarities; searching databases for related strings and substrings; defining and exploring different notions of string relationships; looking for new or ill-defined patterns occurring frequently in DNA; looking for structural patterns in DNA and
protein; determining secondary (two-dimensional) structure of RNA; finding conserved, but faint, patterns in many DNA and protein sequences; and more. We organized our efforts into two high-level tasks. First, we needed to learn the relevant biology, laboratory protocols, and existing algorithmic methods used by biologists. Second, we sought to canvass the computer science literature for ideas and algorithms that weren't already used by biologists, but which might plausibly be of use either in current problems or in problems that we could anticipate arising when vast quantities of sequenced DNA or protein become available.

¹ The other long-term members were William Chang, Gene Lawler, Dalit Naor, and Frank Olken.

Our problem

None of us was an expert on string algorithms. At that point I had a textbook knowledge of Knuth-Morris-Pratt and a deep confusion about Boyer-Moore (under what circumstances it was a linear-time algorithm and how to do strong preprocessing in linear time). I understood the use of dynamic programming to compute edit distance, but otherwise had little exposure to specific string algorithms in biology. My general background was in combinatorial optimization, although I had a prior interest in algorithms for building evolutionary trees and had studied some genetics and molecular biology in order to pursue that interest. What we needed then, but didn't have, was a comprehensive, cohesive text on string algorithms to guide our education. There were at that time several computer science texts containing a chapter or two on strings, usually devoted to a rigorous treatment of Knuth-Morris-Pratt and a cursory treatment of Boyer-Moore, and possibly an elementary discussion of matching with errors. There were also some good survey papers that had a somewhat wider scope but didn't treat their topics in much depth. There were several texts and edited volumes from the biological side on uses of computers and algorithms for sequence analysis. Some of these were wonderful in exposing the potential benefits and the pitfalls of using computers in biology, but they generally lacked algorithmic rigor and covered a narrow range of techniques. Finally, there was the seminal text Time Warps, String Edits, and Macromolecules: The Theory and Practice of Sequence Comparison, edited by D. Sankoff and J. Kruskal, which served as a bridge between algorithms and biology and contained many applications of dynamic programming. However, it too was much narrower than our focus and was a bit dated. Moreover, most of the available sources from either community focused on string matching, the problem of searching for an exact or "nearly exact" copy of a pattern in a given text. Matching problems are central, but as detailed in this book, they constitute only a part of the many important computational problems defined on strings. Thus, we recognized that summer a need for a rigorous and fundamental treatment of the general topic of algorithms that operate on strings, along with a rigorous treatment of specific string algorithms of greatest current and potential import in computational biology. This book is an attempt to provide such a dual, and integrated, treatment.
ence, although it was an active area for statisticians and mathematicians (notably Michael Waterman and David Sankoff, who have largely framed the field). Early on, seminal papers on computational issues in biology (such as the one by Buneman [83]) did not appear in mainstream computer science venues but in obscure places such as conferences on computational archeology [226]. But seventeen years later, computational biology is hot, and many computer scientists are now entering the (now more hectic, more competitive) field [280]. What should they learn?

The problem is that the emerging field of computational molecular biology is not well defined and its definition is made more difficult by rapid changes in molecular biology itself. Still, algorithms that operate on molecular sequence data (strings) are at the heart of computational molecular biology. The big-picture question in computational molecular biology is how to "do" as much "real biology" as possible by exploiting molecular sequence data (DNA, RNA, and protein). Getting sequence data is relatively cheap and fast (and getting more so) compared to more traditional laboratory investigations. The use of sequence data is already central in several subareas of molecular biology and the full impact of having extensive sequence data is yet to be seen. Hence, algorithms that operate on strings will continue to be the area of closest intersection and interaction between computer science and molecular biology.

Certainly then, computer scientists need to learn the string techniques that have been most successfully applied. But that is not enough. Computer scientists need to learn fundamental ideas and techniques that will endure long after today's central motivating applications are forgotten. They need to study methods that prepare them to frame and tackle future problems and applications. Significant contributions to computational biology might be made by extending or adapting algorithms from computer science, even when the original algorithm has no clear utility in biology. This is illustrated by several recent sublinear-time approximate matching methods for database searching that rely on an interplay between exact matching methods from computer science and dynamic programming methods already utilized in molecular biology.

Therefore, the computer scientist who wants to enter the general field of computational molecular biology, and who learns string algorithms with that end in mind, should receive a training in string algorithms that is much broader than a tour through techniques of known present application. Molecular biology and computer science are changing much too rapidly for that kind of narrow approach. Moreover, theoretical computer scientists try to develop effective algorithms somewhat differently than other algorithmists. We rely more heavily on correctness proofs, worst-case analysis, lower bound arguments, randomized algorithm analysis, and bounded approximation results (among other techniques) to guide the development of practical, effective algorithms. Our "relative advantage" partly lies in the mastery and use of those skills. So even if I were to write a book for computer scientists who only want to do computational biology, I would still choose to include a broad range of algorithmic techniques from pure computer science.
In this book, I cover a wide spectrum of string techniques, well beyond those of established utility; however, I have selected from the many possible illustrations those techniques that seem to have the greatest potential application in future molecular biology. Potential application, particularly of ideas rather than of concrete methods, and to anticipated rather than to existing problems, is a matter of judgment and speculation. No doubt, some of the material contained in this book will never find direct application in biology, while other material will find uses in surprising ways. Certain string algorithms that were generally deemed to be irrelevant to biology just a few years ago have become adopted by practicing biologists in both large-scale projects and in narrower technical problems. Techniques previously dismissed because they originally addressed (exact) string problems where perfect data were assumed have been incorporated as components of more robust techniques that handle imperfect data.
rithm will make those important methods more available and widely understood. I connect theoretical results from computer science on sublinear-time algorithms with widely used methods for biological database search. In the discussion of multiple sequence alignment I bring together the three major objective functions that have been proposed for multiple alignment and show a continuity between approximation algorithms for those three multiple alignment problems. Similarly, the chapter on evolutionary tree construction exposes the commonality of several distinct problems and solutions in a way that is not well known. Throughout the book, I discuss many computational problems concerning repeated substrings (a very widespread phenomenon in DNA). I consider several different ways to define repeated substrings and use each specific definition to explore computational problems and algorithms on repeated substrings.

In the book I try to explain in complete detail, and at a reasonable pace, many complex methods that have previously been written exclusively for the specialist in string algorithms. I avoid detailed code, as I find it rarely serves to explain interesting ideas,³ and I provide over 400 exercises to both reinforce the material of the book and to develop additional topics.

³ However, many of the algorithms in the book have been coded in C and are available at
In summary
This book is a general, rigorous text on deterministic algorithms that operate on strings, trees, and sequences. It covers the full spectrum of string algorithms from classical computer science to modern molecular biology and, when appropriate, connects those two fields. It is the book I wished I had available when I began learning about string algorithms.
Acknowledgments
I would like to thank The Department of Energy Human Genome Program, The Lawrence Berkeley Laboratory, The National Science Foundation, The Program in Math and Molecular Biology, and The DIMACS Center for Discrete Mathematics and Computer Science special year on computational biology, for support of my work and the work of my students and postdoctoral researchers.

Individually, I owe a great debt of appreciation to William Chang, John Kececioglu, Jim Knight, Gene Lawler, Dalit Naor, Frank Olken, R. Ravi, Paul Stelling, and Lusheng Wang.

I would also like to thank the following people for the help they have given me along the way: Stephen Altschul, David Axelrod, Doug Brutlag, Archie Cobbs, Richard Cole, Russ Doolittle, Martin Farach, Jane Gitschier, George Hartzell, Paul Horton, Robert Irving, Sorin Istrail, Tao Jiang, Dick Karp, Dina Kravets, Gad Landau, Udi Manber, Marci McClure, Kevin Murphy, Gene Myers, John Nguyen, Mike Paterson, William Pearson, Pavel Pevzner, Fred Roberts, Hershel Safer, Baruch Schieber, Ron Shamir, Jay Snoddy, Elizabeth Sweedyk, Sylvia Spengler, Martin Tompa, Esko Ukkonen, Martin Vingron, Tandy Warnow, and Mike Waterman.
PART I
Exact Matching: The Fundamental String Problem
for other applications. Users of Melvyl, the on-line catalog of the University of California library system, often experience long, frustrating delays even for fairly simple matching requests. Even grepping through a large directory can demonstrate that exact matching is not yet trivial. Recently we used GCG (a very popular interface to search DNA and protein databanks) to search Genbank (the major U.S. DNA database) for a thirty-character string, which is a small string in typical uses of Genbank. The search took over four hours (on a local machine using a local copy of the database) to find that the string was not there.² And Genbank today is only a fraction of the size it will be when the various genome programs go into full production mode, cranking out massive quantities of sequenced DNA. Certainly there are faster, common database searching programs (for example, BLAST), and there are faster machines one can use (for example, an e-mail server is available for exact and inexact database matching running on a 4,000 processor MasPar computer). But the point is that the exact matching problem is not so effectively and universally solved that it needs no further attention. It will remain a problem of interest as the sizes of the databases grow and also because exact matching will continue to be a subtask needed for more complex searches that will be devised. Many of these will be illustrated in this book.

But perhaps the most important reason to study exact matching in detail is to understand the various ideas developed for it. Even assuming that the exact matching problem itself is sufficiently solved, the entire field of string algorithms remains vital and open, and the education one gets from studying exact matching may be crucial for solving less understood problems. That education takes three forms: specific algorithms, general algorithmic styles, and analysis and proof techniques. All three are covered in this book, but style and proof technique get the major emphasis.
Overview of Part I
In Chapter 1 we present naive solutions to the exact matching problem and develop the fundamental tools needed to obtain more efficient methods. Although the classical solutions to the problem will not be presented until Chapter 2, we will show at the end of Chapter 1 that the use of fundamental tools alone gives a simple linear-time algorithm for exact matching. Chapter 2 develops several classical methods for exact matching, using the fundamental tools developed in Chapter 1. Chapter 3 looks more deeply at those methods and extensions of them. Chapter 4 moves in a very different direction, exploring methods for exact matching based on arithmetic-like operations rather than character comparisons. Although exact matching is the focus of Part I, some aspects of inexact matching and the use of wild cards are also discussed. The exact matching problem will be discussed again in Part II, where it (and extensions) will be solved using suffix trees.
Definition S[i..j] is the substring of S that begins at position i and ends at position j of S. In particular, S[1..i] is the prefix of string S that ends at position i, and S[i..|S|] is the suffix of string S that begins at position i, where |S| denotes the number of characters in string S.

Definition S[i..j] is the empty string if i > j.
For example, california is a string, lifo is a substring, cal is a prefix, and ornia is a suffix.
Definition A proper prefix, suffix, or substring of S is, respectively, a prefix, suffix, or substring that is not the entire string S, nor the empty string.

Definition For any string S, S(i) denotes the ith character of S.
We will usually use the symbol S to refer to an arbitrary fixed string that has no additional assumed features or roles. However, when a string is known to play the role of a pattern or the role of a text, we will refer to the string as P or T, respectively. We will use lower case Greek characters (for example, α, β, γ) to refer to variable strings and use lower case roman characters to refer to single variable characters.
Definition When comparing two characters, we say that the characters match if they are equal; otherwise we say they mismatch.

Terminology confusion
The words "string" and " w o r d are often used synonymously in the computer science literature, but for clarity in this book we will never use " word when "string" is meant. (However, we do use "word" when its colloquial English meaning is intended.) More confusing, the words "string" and "sequence" are often used synonymously, particularly in the biological literature. This can be the source of much confusion because "substrings" and "subsequences" are very different objects and because algorithms for substring problems are usually very different than algorithms for the analogous subsequence problems. The characters in a substring of S must occur contiguously in S, whereas characters in a subsequence might be interspersed with characters not in the subsequence. Worse, in the biological literature one often sees the word "sequence" used in place of "subsequence". Therefore, for clarity, in this book we will always maintain a distinction between "subsequence" and "substring" and never use "sequence" for "subsequence". We will generally use "string" when pure computer science issues are discussed and use "sequence" or "string" interchangeably in the context of biological applications. Of course, we will also use "sequence" when its standard mathematical meaning is intended. The first two parts of this book primarily concern problems on strings and substrings. Problems on subsequences are considered in Parts IIIand IV.
smarter method was assumed to know that character a did not occur again until position 5,¹ and the even smarter method was assumed to know that the pattern abx was repeated again starting at position 5. This assumed knowledge is obtained in the preprocessing stage.

For the exact matching problem, all of the algorithms mentioned in the previous section preprocess pattern P. (The opposite approach of preprocessing text T is used in other algorithms, such as those based on suffix trees. Those methods will be explained later in the book.) These preprocessing methods, as originally developed, are similar in spirit but often quite different in detail and conceptual difficulty. In this book we take a different approach and do not initially explain the originally developed preprocessing methods. Rather, we highlight the similarity of the preprocessing tasks needed for several different matching algorithms, by first defining a fundamental preprocessing of P that is independent of any particular matching algorithm. Then we show how each specific matching algorithm uses the information computed by the fundamental preprocessing of P. The result is a simpler, more uniform exposition of the preprocessing needed by several classical matching methods and a simple linear-time algorithm for exact matching based only on this preprocessing (discussed in Section 1.5). This approach to linear-time pattern matching was developed in [202].
Definition Given a string S and a position i > 1, let Z_i(S) denote the length of the longest substring of S that starts at i and matches a prefix of S.

In other words, Z_i(S) is the length of the longest prefix of S[i..|S|] that matches a prefix of S. For example, when S = aabcaabxaaz, then

Z_5 = 3 (aabc...aabx...),
Z_6 = 1 (aa...ab...),
Z_7 = 0, Z_8 = 0,
Z_9 = 2 (aab...aaz).
When S is clear by context, we will use Z_i in place of Z_i(S).

To introduce the next concept, consider the boxes drawn in Figure 1.2. Each box starts at some position j > 1 such that Z_j is greater than zero. The length of the box starting at j is meant to represent Z_j. Therefore, each box in the figure represents a maximal-length substring of S that matches a prefix of S and that does not start at position one. Each such box is called a Z-box. More formally, we have:

Definition For any position i > 1 where Z_i is greater than zero, the Z-box at i is defined as the interval starting at i and ending at position i + Z_i - 1.

Definition For every i > 1, r_i is the right-most endpoint of the Z-boxes that begin at or before position i. Another way to state this is: r_i is the largest value of j + Z_j - 1 over all 1 < j ≤ i such that Z_j > 0. (See Figure 1.2.)

We use the term l_i for the value of j specified in the above definition. That is, l_i is the position of the left end of the Z-box that ends at r_i. In case there is more than one Z-box ending at r_i, then l_i can be chosen to be the left end of any of those Z-boxes. As an example, suppose S = aabaabcaxaabaabcy; then Z_10 = 7, r_15 = 16, and l_15 = 10.

The linear-time computation of Z values from S is the fundamental preprocessing task that we will use in all the classical linear-time matching algorithms that preprocess P. But before detailing those uses, we show how to do the fundamental preprocessing in linear time.
Figure 1.2: Each solid box represents a substring of S that matches a prefix of S and that starts between positions 2 and i. Each box is called a Z-box. We use r_i to denote the right-most end of any Z-box that begins at or to the left of position i, and α to denote the substring in the Z-box ending at r_i. Then l_i denotes the left end of α. The copy of α that occurs as a prefix of S is also shown in the figure.
    1234567890123
T:  xabxyabxyabxz
P:   abxyabxz

P:       abxyabxz
Figure 1.1 : The first scenario illustrates pure naive matching, and the next two illustrate smarter shifts. A caret beneath a character indicates a match and a star indicates a mismatch made by the algorithm.
comparisons of the naive algorithm will be mismatches. This smarter algorithm skips over the next three shift/compares, immediately moving the left end of P to align with position 6 of T, thus saving three comparisons. How can a smarter algorithm do this? After the ninth comparison, the algorithm knows that the first seven characters of P match characters 2 through 8 of T. If it also knows that the first character of P (namely a) does not occur again in P until position 5 of P, it has enough information to conclude that character a does not occur again in T until position 6 of T. Hence it has enough information to conclude that there can be no matches between P and T until the left end of P is aligned with position 6 of T. Reasoning of this sort is the key to shifting by more than one character. In addition to shifting by larger amounts, we will see that certain aligned characters do not need to be compared. An even smarter algorithm knows the next occurrence in P of the first three characters of P (namely abx) begins at position 5. Then since the first seven characters of P were found to match characters 2 through 8 of T, this smarter algorithm has enough information to conclude that when the left end of P is aligned with position 6 of T, the next three comparisons must be matches. This smarter algorithm avoids making those three comparisons. Instead, after the left end of P is moved to align with position 6 of T, the algorithm compares character 4 of P against character 9 of T. This smarter algorithm therefore saves a total of six comparisons over the naive algorithm. The above example illustrates the kinds of ideas that allow some comparisons to be skipped, although it should still be unclear how an algorithm can efficiently implement these ideas. Efficient implementations have been devised for a number of algorithms such as the Knuth-Morris-Pratt algorithm, a real-time extension of it, the Boyer-Moore algorithm, and the Apostolico-Giancarlo version of it. All of these algorithms have been implemented to run in linear time (O(n + m) time). The details will be discussed in the next two chapters.
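As a point of reference, a direct Python rendering of the naive method (our sketch, not the book's code) makes the O(nm) worst-case behavior plain; the smarter methods described above differ only in how far the alignment is advanced after each attempt:

    # Naive exact matching: try every alignment of P against T.
    # Worst case O(n*m) character comparisons for |P| = n, |T| = m.
    def naive_match(P, T):
        n, m = len(P), len(T)
        occurrences = []
        for s in range(m - n + 1):         # align left end of P with T[s]
            i = 0
            while i < n and P[i] == T[s + i]:
                i += 1
            if i == n:
                occurrences.append(s + 1)  # report 1-based start position
        return occurrences

    print(naive_match("abxyabxz", "xabxyabxyabxz"))   # -> [6]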
The Z algorithm. Given Z_i for all 1 < i ≤ k - 1 and the current values of r and l, Z_k and the updated r and l are computed as follows:

Begin

1. If k > r, then find Z_k by explicitly comparing the characters starting at position k to the characters starting at position 1 of S, until a mismatch is found. The length of the match is Z_k. If Z_k > 0, then set r to k + Z_k - 1 and set l to k.

2. If k ≤ r, then position k is contained in a Z-box, and hence S(k) is contained in substring S[l..r] (call it α) such that l > 1 and α matches a prefix of S. Therefore, character S(k) also appears in position k' = k - l + 1 of S. By the same reasoning, substring S[k..r] (call it β) must match substring S[k'..Z_l]. It follows that the substring beginning at position k must match a prefix of S of length at least the minimum of Z_k' and |β| (which is r - k + 1). See Figure 1.3.

We consider two subcases based on the value of that minimum.

2a. If Z_k' < |β|, then Z_k = Z_k' and r, l remain unchanged (see Figure 1.4).

2b. If Z_k' ≥ |β|, then the entire substring S[k..r] must be a prefix of S and Z_k ≥ |β| = r - k + 1. However, Z_k might be strictly larger than |β|, so compare the characters starting at position r + 1 of S to the characters starting at position |β| + 1 of S until a mismatch occurs. Say the mismatch occurs at character q ≥ r + 1. Then Z_k is set to q - k, r is set to q - 1, and l is set to k (see Figure 1.5).

End
Theorem 1.4.1. Using Algorithm Z, value Z_k is correctly computed and r and l are correctly updated.

PROOF In Case 1, Z_k is set correctly since it is computed by explicit comparisons. Also (since k > r in Case 1), before Z_k is computed, no Z-box has been found that starts
between positions 2 and k - 1 and that ends at or after position k. Therefore, when Z_k > 0 in Case 1, the algorithm does find a new Z-box ending at or after k, and it is correct to change r to k + Z_k - 1. Hence the algorithm works correctly in Case 1.

In Case 2a, the substring beginning at position k can match a prefix of S only for length Z_k' < |β|. If not, then the next character to the right, character k + Z_k', must match character 1 + Z_k'. But character k + Z_k' matches character k' + Z_k' (since Z_k' < |β|), so character k' + Z_k' must match character 1 + Z_k'. However, that would be a contradiction to the definition of Z_k', for it would establish a substring longer than Z_k' that starts at k' and matches a prefix of S. Hence Z_k = Z_k' in this case. Further, k + Z_k - 1 < r, so r and l remain correctly unchanged.

In Case 2b, β must be a prefix of S (as argued in the body of the algorithm), and since any extension of this match is explicitly verified by comparing characters beyond r to characters beyond the prefix β, the full extent of the match is correctly computed. Hence Z_k is correctly obtained in this case. Furthermore, since k + Z_k - 1 ≥ r, the algorithm correctly changes r and l. □

Corollary 1.4.1. Repeating Algorithm Z for each position i > 2 correctly yields all the Z_i values.

Theorem 1.4.2. All the Z_i(S) values are computed by the algorithm in O(|S|) time.

PROOF The time is proportional to the number of iterations, |S|, plus the number of character comparisons. Each comparison results in either a match or a mismatch, so we next bound the number of matches and mismatches that can occur. Each iteration that performs any character comparisons at all ends the first time it finds a mismatch; hence there are at most |S| mismatches during the entire algorithm. To bound the number of matches, note first that r_k ≥ r_{k-1} for every iteration k. Now, let k be an iteration where q > 0 matches occur. Then r_k is set to at least r_{k-1} + q. Finally, r_k ≤ |S|, so the total number of matches that occur during any execution of the algorithm is at most |S|. □
for the n characters in P and also maintain the current l and r. Those values are sufficient to compute (but not store) the Z value of each character in T and hence to identify and output any position i where Z_i = n.

There is another characteristic of this method worth introducing here: The method is considered an alphabet-independent linear-time method. That is, we never had to assume that the alphabet size was finite or that we knew the alphabet ahead of time - a character comparison only determines whether the two characters match or mismatch; it needs no further information about the alphabet. We will see that this characteristic is also true of the Knuth-Morris-Pratt and Boyer-Moore algorithms, but not of the Aho-Corasick algorithm or methods based on suffix trees.
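For completeness, here is how the matching itself can be phrased, as a Python sketch of our own that reuses z_values from the earlier sketch. It assumes a separator character '$' occurring in neither P nor T; because of the separator, no Z value in the concatenation can exceed n, which is why only the Z values for the n characters of P ever need to be stored, as discussed above:

    def z_match(P, T):
        n = len(P)
        Z = z_values(P + "$" + T)
        # A position in T starts an occurrence of P exactly when the
        # Z value at the corresponding position of the concatenation is n.
        return [i - n - 1 for i in range(n + 1, len(Z)) if Z[i] == n]

    print(z_match("abxyabxz", "xabxyabxyabxz"))   # -> [5] (0-based start)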
1.6. Exercises
The first four exercises use the fact that fundamental preprocessing can be done in linear time and that all occurrences of P in T can be found in linear time.
1. Use the existence of a linear-time exact matching algorithm to solve the following problem in linear time. Given two strings α and β, determine if α is a circular (or cyclic) rotation of β; that is, if α and β have the same length and α consists of a suffix of β followed by a prefix of β. For example, defabc is a circular rotation of abcdef. This is a classic problem with a very elegant solution.
2. Similar to Exercise 1, give a linear-time algorithm to determine whether a linear string α is a substring of a circular string β. A circular string of length n is a string in which character n is considered to precede character 1 (see Figure 1.6). Another way to think about this
3. Suffix-prefix matching. Give an algorithm that takes in two strings α and β, of lengths n and m, and finds the longest suffix of α that exactly matches a prefix of β. The algorithm should run in O(n + m) time.
4. Tandem arrays. A substring α contained in string S is called a tandem array of β (called the base) if α consists of more than one consecutive copy of β. For example, if S = xyzabcabcabcabcpq, then α = abcabcabcabc is a tandem array of β = abc. Note that S also contains a tandem array of abcabc (i.e., a tandem array with a longer base). A maximal tandem array is a tandem array that cannot be extended either left or right. Given the base β, a tandem array of β in S can be described by two numbers (s, k), giving its starting location in S and the number of times β is repeated. A tandem array is an example of a repeated substring (see Section 7.11.1).
Suppose S has length n. Give an example to show that two maximal tandem arrays of a given base β can overlap. Now give an O(n)-time algorithm that takes S and β as input, finds every maximal tandem array of β, and outputs the pair (s, k) for each occurrence. Since maximal tandem arrays of a given base can overlap, a naive algorithm would establish only an O(n²)-time bound.
5. If the Z algorithm finds that Z_2 = q > 0, all the values Z_3, ..., Z_{q+1}, Z_{q+2} can then be obtained immediately without additional character comparisons and without executing the main body of Algorithm Z. Flesh out and justify the details of this claim.
6. In Case 2b of the Z algorithm, when Z_k' ≥ |β| the algorithm does explicit comparisons until it finds a mismatch. This is a reasonable way to organize the algorithm, but in fact Case 2b can be refined so as to eliminate an unneeded character comparison. Argue that when Z_k' > |β| then Z_k = |β| and hence no character comparisons are needed. Therefore, explicit character comparisons are needed only in the case that Z_k' = |β|.
7. If Case 2b of the Z algorithm is split into two cases, one for Z_k' > |β| and one for Z_k' = |β|, would this result in an overall speedup of the algorithm? You must consider all operations, not just character comparisons.
8. Baker [43] introduced the following matching problem and applied it to a problem of software maintenance: "The application is to track down duplication in a large software system. We want to find not only exact matches between sections of code, but parameterized matches, where a parameterized match between two sections of code means that one section can be transformed into the other by replacing the parameter names (e.g., identifiers and constants) of one section by the parameter names of the other via a one-to-one function".
Now we present the formal definition. Let Σ and Π be two alphabets containing no symbols in common. Each symbol in Σ is called a token and each symbol in Π is called a parameter. A string can consist of any combination of tokens and parameters from Σ and Π. For example, if Σ is the upper case English alphabet and Π is the lower case alphabet, then XYabCaCXZddW is a legal string over Σ and Π. Two strings S1 and S2 are said to p-match if and only if
a. Each token in S1 (or S2) is opposite a matching token in S2 (or S1).

b. Each parameter in S1 (or S2) is opposite a parameter in S2 (or S1).

c. For any parameter x, if one occurrence of x in S1 (S2) is opposite a parameter y in S2 (S1), then every occurrence of x in S1 (S2) must be opposite an occurrence of y in S2 (S1). In other words, the alignment of parameters in S1 and S2 defines a one-one correspondence between parameter names in S1 and parameter names in S2.

For example, S1 = XYabCaCXZddbW p-matches S2 = XYdxCdCXZccxW. Notice that parameter a in S1 maps to parameter d in S2, while parameter d in S1 maps to c in S2. This does not violate the definition of p-matching. In Baker's application, a token represents a part of the program that cannot be changed,
Figure 1.6: A circular string β. The linear string derived from it is accatggc.

problem is the following. Let β̂ be the linear string obtained from β starting at character 1 and ending at character n. Then α is a substring of circular string β if and only if α is a substring of some circular rotation of β̂.

A digression on circular strings in DNA

The above two problems are mostly exercises in using the existence of a linear-time exact matching algorithm, and we don't know any critical biological problems that they address. However, we want to point out that circular DNA is common and important. Bacterial and mitochondrial DNA is typically circular, both in its genomic DNA and in additional small double-stranded circular DNA molecules called plasmids, and even some true eukaryotes (higher organisms whose cells contain a nucleus) such as yeast contain plasmid DNA in addition to their nuclear DNA. Consequently, tools for handling circular strings may someday be of use in those organisms. Viral DNA is not always circular, but even when it is linear some virus genomes exhibit circular properties. For example, in some viral populations the linear order of the DNA in one individual will be a circular rotation of the order in another individual [450]. Nucleotide mutations, in addition to rotations, occur rapidly in viruses, and a plausible problem is to determine if the DNA of two individual viruses have mutated away from each other only by a circular rotation, rather than additional mutations.

It is very interesting to note that the problems addressed in the exercises are actually "solved" in nature. Consider the special case of Exercise 2 when string α has length n. Then the problem becomes: Is α a circular rotation of β? This problem is solved in linear time as in Exercise 1. Precisely this matching problem arises and is "solved" in E. coli replication under the certain experimental conditions described in [475]. In that experiment, an enzyme (RecA) and ATP molecules (for energy) are added to E. coli containing a single strand of one of its plasmids, called string β, and a double-stranded linear DNA molecule, one strand of which is called string α. If α is a circular rotation of β then the strand opposite to α (which has the DNA sequence complementary to α) hybridizes with β creating a proper double-stranded plasmid, leaving α as a single strand. This transfer of DNA may be a step in the replication of the plasmid. Thus the problem of determining whether α is a circular rotation of β is solved by this natural system.
Other experiments in [475] can be described as substring matching problems relating to circular and linear DNA in E. coli. Interestingly, these natural systems solve their matching problems faster than can be explained by kinetic analysis, and the molecular mechanisms used for such rapid matching remain undetermined. These experiments demonstrate the role of enzyme RecA in E. coli replication, but do not suggest immediate important computational problems. They do, however, provide indirect motivation for developing computational tools for handling circular strings as well as linear strings. Several other uses of circular strings will be discussed in Sections 7.13 and 16.17 of the book.
nations of the DNA string and the fewest number of indexing steps (when using the codons to look up amino acids in a table holding the genetic code). Clearly, the three translations can be done with 3n examinations of characters in the DNA and 3n indexing steps in the genetic code table. Find a method that does the three translations in at most n character examinations and n indexing steps.
Hint: If you are acquainted with this terminology, the notion of a finite-state transducer may be helpful, although it is not necessary.
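For illustration only, here is a naive Python sketch (ours) of the straightforward 3n-examination translation that the exercise asks you to improve. CODON_TABLE is a toy fragment of the real genetic code, just large enough for the example string used in Exercise 10:

    CODON_TABLE = {"atg": "Met", "gac": "Asp", "gga": "Gly",
                   "tgg": "Trp", "acg": "Thr", "cgg": "Arg"}

    def three_frame_translations(dna):
        # Examines roughly 3n characters; the exercise asks for about n.
        translations = []
        for frame in range(3):
            codons = [dna[i:i + 3] for i in range(frame, len(dna) - 2, 3)]
            translations.append([CODON_TABLE.get(c, "?") for c in codons])
        return translations

    # [['Met', 'Asp', 'Gly'], ['Trp', 'Thr'], ['Gly', 'Arg']]
    print(three_frame_translations("atggacgga"))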
11. Let T be a text string of length m and let S be a multiset of n characters. The problem is to find all substrings in T of length n that are formed by the characters of S. For example, let S = {a, a, b, c} and T = abahgcabah. Then caba is a substring of T formed from the characters of S.
Give a solution to this problem that runs in O(m) time. The method should also be able to state, for each position i, the length of the longest substring in T starting at i that can be formed from S.
Fantasy protein sequencing. The above problem may become useful in sequencing protein from a particular organism after a large amount of the genome of that organism has been sequenced. This is most easily explained in prokaryotes, where the DNA is not interrupted by introns. In prokaryotes, the amino acid sequence for a given protein is encoded in a contiguous segment of DNA - one DNA codon for each amino acid in the protein. So assume we have the protein molecule but do not know its sequence or the location of the gene that codes for the protein. Presently, chemically determining the amino acid sequence of a protein is very slow, expensive, and somewhat unreliable. However, finding the multiset of amino acids that make up the protein is relatively easy. Now suppose that the whole DNA sequence for the genome of the organism is known. One can use that long DNA sequence to determine the amino acid sequence of a protein of interest. First, translate each codon in the DNA sequence into the amino acid alphabet (this may have to be done three times to get the proper frame) to form the string T; then chemically determine the multiset S of amino acids in the protein; then find all substrings in T of length |S| formed from the amino acids in S. Any such substrings are candidates for the amino acid sequence of the protein, although it is unlikely that there will be more than one candidate. The match also locates the gene for the protein in the long DNA string.
12. Consider the two-dimensional variant of the preceding problem. The input consists of a two-dimensional text (say a filled-in crossword puzzle) and a multiset of characters. The problem is to find a connected two-dimensional substructure in the text that matches all the characters in the multiset. How can this be done? A simpler problem is to restrict the structure to be rectangular.
13. As mentioned in Exercise 10, there are organisms (some viruses for example) containing intervals of DNA encoding not just a single protein, but three viable proteins, each read in a different reading frame. So, if each protein contains n amino acids, then the DNA string encoding those three proteins is only n + 2 nucleotides (characters) long. That is a very compact encoding.
(Challenging problem?) Give an algorithm for the following problem: The input is a protein string S1 (over the amino acid alphabet) of length n and another protein string S2 of length m > n. Determine if there is a string specifying a DNA encoding for S2 that contains a substring specifying a DNA encoding of S1. Allow the encoding of S1 to begin at any point in the DNA string for S2 (i.e., in any reading frame of that string). The problem is difficult because of the degeneracy of the genetic code and the ability to use any reading frame.
whereas a parameter represents a program's variable, which can be renamed as long as all occurrences of the variable are renamed consistently. Thus if S1 and S2 p-match, then the variable names in S1 could be changed to the corresponding variable names in S2, making the two programs identical. If these two programs were part of a larger program, then they could both be replaced by a call to a single subroutine.

The most basic p-match problem is: Given a text T and a pattern P, each a string over Σ and Π, find all substrings of T that p-match P. Of course, one would like to find all those occurrences in O(|P| + |T|) time. Let function Z^p_i for a string S be the length of the longest string starting at position i in S that p-matches a prefix of S[1..i]. Show how to modify algorithm Z to compute all the Z^p_i values in O(|S|) time (the implementation details are slightly more involved than for function Z_i, but not too difficult). Then show how to use the modified algorithm Z to find all substrings of T that p-match P, in O(|P| + |T|) time.
In [43] and [239], more involved versions of the p-match problem are solved by more complex methods.
The following three problems can be solved without the Z algorithm or other fancy tools. They only require thought.
9. You are given two strings of n characters each and an additional parameter k. In each string there are n - k + 1 substrings of length k, and so there are Θ(n²) pairs of substrings, where one substring is from one string and one is from the other. For a pair of substrings, we define the match-count as the number of opposing characters that match when the two substrings of length k are aligned. The problem is to compute the match-count for each of the Θ(n²) pairs of substrings from the two strings. Clearly, the problem can be solved with O(kn²) operations (character comparisons plus arithmetic operations). But by better organizing the computations, the time can be reduced to O(n²) operations. (From Paul Horton.)
10. A DNA molecule can be thought of as a string over an alphabet of four characters {a, t, c, g} (nucleotides), while a protein can be thought of as a string over an alphabet of twenty characters (amino acids). A gene, which is physically embedded in a DNA molecule, typically encodes the amino acid sequence for a particular protein. This is done as follows. Starting at a particular point in the DNA string, every three consecutive DNA characters encode a single amino acid character in the protein string. That is, three DNA nucleotides specify one amino acid. Such a coding triple is called a codon, and the full association of codons to amino acids is called the genetic code. For example, the codon ttt codes for the amino acid Phenylalanine (abbreviated in the single character amino acid alphabet as F), and the codon gtt codes for the amino acid Valine (abbreviated as V). Since there are 4³ = 64 possible triples but only twenty amino acids, there is a possibility that two or more triples form codons for the same amino acid and that some triples do not form codons. In fact, this is the case. For example, the amino acid Leucine is coded for by six different codons.
Problem: Suppose one is given a DNA string of n nucleotides, but you don't know the correct "reading frame". That is, you don't know if the correct decomposition of the string into codons begins with the first, second, or third nucleotide of the string. Each such "frameshift" potentially translates into a different amino acid string. (There are actually known genes where each of the three reading frames not only specifies a string in the amino acid alphabet, but each specifies a functional, yet different, protein.) The task is to produce, for each of the three reading frames, the associated amino acid string. For example, consider the string atggacgga. The first reading frame has three complete codons, atg, gac, and gga, which in the genetic code specify the amino acids Met, Asp, and Gly. The second reading frame has two complete codons, tgg and acg, coding for amino acids Trp and Thr. The third reading frame has two complete codons, gga and cgg, coding for amino acids Gly and Arg. The goal is to produce the three translations, using the fewest number of character exami-
2.1. Introduction

This chapter develops a number of classical comparison-based matching algorithms for the exact matching problem. With suitable extensions, all of these algorithms can be implemented to run in linear worst-case time, and all achieve this performance by preprocessing pattern P. (Methods that preprocess T will be considered in Part II of the book.) The original preprocessing methods for these various algorithms are related in spirit but are quite different in conceptual difficulty. Some of the original preprocessing methods are quite difficult.¹ This chapter does not follow the original preprocessing methods but instead exploits fundamental preprocessing, developed in the previous chapter, to implement the needed preprocessing for each specific matching algorithm. Also, in contrast to previous expositions, we emphasize the Boyer-Moore method over the Knuth-Morris-Pratt method, since Boyer-Moore is the practical method of choice for exact matching. Knuth-Morris-Pratt is nonetheless completely developed, partly for historical reasons, but mostly because it generalizes to problems such as real-time string matching and matching against a set of patterns more easily than Boyer-Moore does. These two topics will be described in this chapter and the next.

¹ Sedgewick [401] writes "Both the Knuth-Morris-Pratt and the Boyer-Moore algorithms require some complicated preprocessing on the pattern that is difficult to understand and has limited the extent to which they are used". In agreement with Sedgewick, I still do not understand the original Boyer-Moore preprocessing method for the strong good suffix rule.

2.2. The Boyer-Moore Algorithm

As an example, consider the following alignment of P and T:

T: xpbctbxabpqxctbpq
P:   tpabxab
To check whether P occurs in T at this position, the Boyer-Moore algorithm starts at the right end of P, first comparing T(9) with P(7). Finding a match, it then compares T(8) with P(6), etc., moving right to left until it finds a mismatch when comparing T(5) with P(3). At that point P is shifted right relative to T (the amount for the shift will be discussed below) and the comparisons begin again at the right end of P. Clearly, if P is shifted right by one place after each mismatch, or after an occurrence of P is found, then the worst-case running time of this approach is O(nm), just as in the naive algorithm. So at this point it isn't clear why comparing characters from right to left is any better than checking from left to right. However, with two additional ideas (the bad character and the good suffix rules), shifts of more than one position often occur, and in typical situations large shifts are common. We next examine these two ideas.
Definition For each character x of the alphabet, let R(x) be the position of the right-most occurrence of character x in P; R(x) is defined to be zero if x does not occur in P.

It is easy to preprocess P in O(n) time to collect the R(x) values, and we leave that as an exercise. Note that this preprocessing does not require the fundamental preprocessing discussed in Chapter 1 (that will be needed for the more complex shift rule, the good suffix rule).

We use the R values in the following way, called the bad character shift rule: Suppose for a particular alignment of P against T, the right-most n - i characters of P match their counterparts in T, but the next character to the left, P(i), mismatches with its counterpart, say in position k of T. The bad character rule says that P should be shifted right by max[1, i - R(T(k))] places. That is, if the right-most occurrence in P of character T(k) is in position j < i (including the possibility that j = 0), then shift P so that character j of P is below character k of T. Otherwise, shift P by one position. The point of this shift rule is to shift P by more than one character when possible. In the above example, T(5) = t mismatches with P(3) and R(t) = 1, so P can be shifted right by two positions. After the shift, the comparison of P and T begins again at the right end of P.
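The R table and the resulting shift are simple enough to state in a few lines of Python (our sketch; the book leaves this as an exercise):

    # R maps each character to the 1-based position of its right-most
    # occurrence in P; characters absent from P are treated as position 0.
    def bad_character_table(P):
        R = {}
        for j, x in enumerate(P, start=1):   # later occurrences overwrite earlier
            R[x] = j
        return R

    def bad_character_shift(R, i, text_char):
        # Shift when P(i) mismatches the text character text_char.
        return max(1, i - R.get(text_char, 0))

    R = bad_character_table("tpabxab")
    print(bad_character_shift(R, 3, "t"))    # -> 2, as in the example above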
Figure 2.1: Good suffix shift rule, where character x of T mismatches with character y of P. Characters y and z of P are guaranteed to be distinct by the good suffix rule, so z has a chance of matching x.
good suffix rule. The original preprocessing method [278] for the strong good suffix rule is generally considered quite difficult and somewhat mysterious (although a weaker version of it is easy to understand). In fact, the preprocessing for the strong rule was given incorrectly in [278] and corrected, without much explanation, in [384]. Code based on [384] is given without real explanation in the text by Baase [32], but there are no published sources that try to fully explain the method. Pascal code for strong preprocessing, based on an outline by Richard Cole [107], is shown in Exercise 24 at the end of this chapter.

In contrast, the fundamental preprocessing of P discussed in Chapter 1 makes the needed preprocessing very simple. That is the approach we take here. The strong good suffix rule is:
Suppose for a given alignment of P and T, a substring t of T matches a suffix of P, but a mismatch occurs at the next comparison to the left. Then find, if it exists, the right-most copy t' of t in P such that t' is not a suffix of P and the character to the left of t' in P differs from the character to the left of t in P. Shift P to the right so that substring t' in P is below substring t in T (see Figure 2.1). If t' does not exist, then shift the left end of P past the left end of t in T by the least amount so that a prefix of the shifted pattern matches a suffix of t in T. If no such shift is possible, then shift P by n places to the right. If an occurrence of P is found, then shift P by the least amount so that a proper prefix of the shifted P matches a suffix of the occurrence of P in T. If no such shift is possible, then shift P by n places, that is, shift P past t in T.
For example, consider the alignment below:

P: qcabdabdab
   1234567890

When the mismatch occurs at position 8 of P and position 10 of T, t = ab and t' occurs in P starting at position 3. Hence P is shifted right by six places, resulting in the following alignment:

P:       qcabdabdab

Note that the extended bad character rule would have shifted P by only one place in this example.
Theorem 2.2.2. L(i) is the largest index j less than n such that N_j(P) ≥ |P[i..n]| (which is n - i + 1). L'(i) is the largest index j less than n such that N_j(P) = |P[i..n]| = (n - i + 1).
Given Theorem 2.2.2, it follows immediately that all the L'(i) values can be accumulated in linear time from the N values using the following algorithm:
Z-based Boyer-Moore

for i := 1 to n do L'(i) := 0;
for j := 1 to n - 1 do
  begin
    i := n - N_j(P) + 1;
    L'(i) := j;
  end;
L(i) marks the right end-position of the right-most substring of P that matches P[i..n] and is not a suffix of P[1..n]. Therefore, that substring begins at position L(i) - n + i, which we will denote by j. We will prove that L(i) = max[L(i - 1), L'(i)] by considering what character j - 1 is. First, if j = 1 then character j - 1 doesn't exist, so L(i - 1) = 0 and L(i) = L'(i). So suppose that j > 1. If character j - 1 equals character i - 1 then L(i) = L(i - 1). If character j - 1 does not equal character i - 1 then L(i) = L'(i). Thus, in all cases, L(i) must either be L'(i) or L(i - 1). However, L(i) must certainly be greater than or equal to both L'(i) and L(i - 1). In summary, L(i) must either be L'(i) or L(i - 1), and yet it must be greater than or equal to both of them; hence L(i) must be the maximum of L'(i) and L(i - 1).
Final preprocessing detail

The preprocessing stage must also prepare for the case when L'(i) = 0 or when an occurrence of P is found. The following definition and theorem accomplish that.
Definition Let l'(i) denote the length of the largest suffix of P[i..n] that is also a prefix of P, if one exists. If none exists, then let l'(i) be zero.
Theorem 2.2.4. l'(i) equals the largest j ≤ n - i + 1 such that N_j(P) = j.
We leave the proof, as well as the problem of how to accumulate the l'(i) values in linear time, as a simple exercise (Exercise 9 of this chapter).
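Putting Theorem 2.2.2 and Theorem 2.2.4 together, the whole good suffix preprocessing reduces to one pass over the N values. The Python sketch below is ours; it reuses z_values from the Chapter 1 sketch (N_j(P) is just a Z value of the reversed pattern), and the lists are padded so that indices match the text's 1-based positions:

    def good_suffix_tables(P):
        n = len(P)
        Zrev = z_values(P[::-1])
        N = [0] * (n + 1)            # N[j] for j = 1..n; N[n] = n trivially
        for j in range(1, n):
            N[j] = Zrev[n - j]
        N[n] = n
        Lp = [0] * (n + 2)           # L'(i)
        for j in range(1, n):        # j = n excluded: t' must not be a suffix
            Lp[n - N[j] + 1] = j
        lp = [0] * (n + 2)           # l'(i) = largest j <= n-i+1 with N[j] = j
        longest = 0
        for i in range(n, 0, -1):
            j = n - i + 1
            if N[j] == j:
                longest = j
            lp[i] = longest
        return N, Lp, lp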
Theorem 2.2.1. The use of the good suffix rule never shifts P past an occurrence in T.
PROOF Suppose the right end of P is aligned with character k of T before the shift, and suppose that the good suffix rule shifts P so its right end aligns with character k' > k. Any occurrence of P ending at a position l strictly between k and k' would immediately violate the selection rule for k', since it would imply either that a closer copy of t occurs in P or that a longer prefix of P matches a suffix of t. □
The original published Boyer-Moore algorithm [75] uses a simpler, weaker, version of the good suffix rule. That version just requires that the shifted P agree with the t and does not specify that the next characters to the left of those occurrences of t be different. An explicit statement of the weaker rule can be obtained by deleting the italics phrase in the first paragraph of the statement of the strong good suffix rule. In the previous example, the weaker shift rule shifts P by three places rather than six. When we need to distinguish the two rules, we will call the simpler rule the weak good suffix rule and the rule stated above the strong good suffix rule. For the purpose of proving that the search part of Boyer-Moore runs in linear worst-case time, the weak rule is not sufficient, and in this book the strong version is assumed unless stated otherwise.
L(i) gives the right end-position of the right-most copy of P[i..n] that is not a suffix of P, whereas L'(i) gives the right end-position of the right-most copy of P[i..n] that is not a suffix of P, with the stronger, added condition that its preceding character is unequal to P(i - 1). So, in the strong-shift version of the Boyer-Moore algorithm, if character i - 1 of P is involved in a mismatch and L'(i) > 0, then P is shifted right by n - L'(i) positions. The result is that if the right end of P was aligned with position k of T before the shift, then position L'(i) is now aligned with position k.

During the preprocessing stage of the Boyer-Moore algorithm, L'(i) (and L(i), if desired) will be computed for each position i in P. This is done in O(n) time via the following definition and theorem.

Definition For string P, N_j(P) is the length of the longest suffix of the substring P[1..j] that is also a suffix of the full string P.
Boyer-Moore method has a worst-case running time of O(m)provided that the pattern does not appear in the text. This was first proved by Knuth, Moms, and Pratt [278], and an alternate proof was given by Guibas and Odlyzko [196]. Both of these proofs were quite difficult and established worst-case time bounds no better than 5m comparisons. Later, Richard Cole gave a much simpler proof [I081 establishing a bound of 4m comparisons and also gave a difficult proof establishing a tight bound of 3m comparisons. We will present Cole's proof of 4m comparisons in Section 3.2. When the pattern does appear in the text then the original Boyer-Moore method runs in O(nm)worst-case time. However, several simple modifications to the method correct this prcblem, yielding an O(m)time bound in all cases. The first of these modifications was due to Galil[168]. After discussing Cole's proof, in Section 3.2, for the case that P doesn't occur in T, we use a variant of Galil's idea to achieve the linear time bound in all cases. At the other extreme, if we only use the bad character shift rule, then the worst-case running time is O(nm),but assuming randomly generated strings, the expected running time is sublinear. Moreover, in typical string matching applications involving natural language text, a sublinear running time is almost always observed in practice. We won't discuss random string analysis in this book but refer the reader to [I 841. Although Cole's proof for the linear worst case is vastly simpler than earlier proofs, and is important in order to complete the full story of Boyer-Moore, it is not trivial. However, a fairly simple extension of the Boyer-Moore algorithm, due to Apostolico and Giancarlo [26], gives a "Boyer-Moore-like" algorithm that allows a fairly direct proof of a 2m worst-case bound on the number of comparisons. The Apostolico-Giancarlo variant of Boyer-Moore is discussed in Section 3.1.
We will present several solutions to that set problem including the Aho-Corasick method in Section 3.4. For those reasons, and for its historical role in the field, we fully develop the Knuth-Morris-Pratt method here.
The Boyer-Moore algorithm

{Preprocessing stage}
  Given the pattern P,
  compute L'(i) and l'(i) for each position i of P,
  and compute R(x) for each character x in the alphabet.

{Search stage}
  k := n;
  while k ≤ m do
    begin
      i := n;
      h := k;
      while i > 0 and P(i) = T(h) do
        begin
          i := i − 1;
          h := h − 1;
        end;
      if i = 0 then
        begin
          report an occurrence of P in T ending at position k;
          k := k + n − l'(2);
        end
      else
        shift P (increase k) by the maximum amount determined by the
        (extended) bad character rule and the good suffix rule;
    end;
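As one concrete rendering of the search stage in Python (an illustration under our own conventions, not the book's code): here P and T are 0-based strings, Lp[i] = L'(i) and lp[i] = l'(i) are 1-based lists with index 0 unused, and R is a dictionary giving the right-most position of each character in P (the simple bad character rule); all three tables are assumed to come from the preprocessing stage.

    def boyer_moore_search(P, T, Lp, lp, R):
        # Returns the 1-based starting positions of occurrences of P in T.
        n, m = len(P), len(T)
        occurrences = []
        k = n
        while k <= m:
            i, h = n, k
            while i > 0 and P[i - 1] == T[h - 1]:   # right-to-left comparisons
                i -= 1
                h -= 1
            if i == 0:
                occurrences.append(k - n + 1)
                k += (n - lp[2]) if n > 1 else 1    # k := k + n - l'(2)
            else:
                # Mismatch at P(i): the matched suffix is P[i+1..n], so the
                # strong good suffix shift uses L'(i+1) (or l'(i+1) if zero).
                if i == n:
                    gs = 1
                elif Lp[i + 1] > 0:
                    gs = n - Lp[i + 1]
                else:
                    gs = n - lp[i + 1]
                bc = max(1, i - R.get(T[h - 1], 0)) # simple bad character rule
                k += max(gs, bc)
        return occurrences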
Note that although we have always talked about "shifting P", and given rules to determine by how much P should be "shifted", there is no shifting in the actual implementation. Rather, the index k is increased to the point where the right end of P would be "shifted". Hence, each act of shifting P takes constant time. We will later show, in Section 3.2, that by using the strong good suffix rule alone, the
Figure 2.2: Assumed missed occurrence used in correctness proof for Knuth-Morris-Pratt. The figure shows P before the shift, P after the shift, and the assumed "missed occurrence of P" aligned against T.
Theorem 2.3.1. After a mismatch at position i + 1 of P and a shift of i − sp'_i places to the right, the left-most sp'_i characters of P are guaranteed to match their counterparts in T.
Theorem 2.3.1 partially establishes the correctness of the Knuth-Morris-Pratt algorithm, but to fully prove correctness we have to show that the shift rule never shifts too far. That is, using the shift rule no occurrence of P will ever be overlooked.
Theorem 2.3.2. For any alignment of P with T, if characters 1 through i of P match the opposing characters of T but character i + 1 mismatches T(k), then P can be shifted by i − sp'_i places to the right without passing any occurrence of P in T.
PROOF Suppose not, so that there is an occurrence of P starting strictly to the left of the shifted P (see Figure 2.2), and let α and β be the substrings shown in the figure. In particular, β is the prefix of P of length sp'_i, shown relative to the shifted position of P. The unshifted P matches T up through position i of P and position k − 1 of T, and all characters in the (assumed) missed occurrence of P match their counterparts in T. Both of these matched regions contain the substrings α and β, so the unshifted P and the assumed occurrence of P match on the entire substring αβ. Hence αβ is a suffix of P[1..i] that matches a proper prefix of P. Now let l = |αβ| + 1 so that position l in the "missed occurrence" of P is opposite position k in T. Character P(l) cannot be equal to P(i + 1) since P(l) is assumed to match T(k) and P(i + 1) does not match T(k). Thus αβ is a proper suffix of P[1..i] that matches a prefix of P, and the next character is unequal to P(i + 1). But |α| > 0 due to the assumption that an occurrence of P starts strictly before the shifted P, so |αβ| > |β| = sp'_i, contradicting the definition of sp'_i. Hence the theorem is proved. □
Theorem 2.3.2 says that the Knuth-Morris-Pratt shift rule does not miss any occurrence of P in T, and so the Knuth-Morris-Pratt algorithm will correctly find all occurrences of P in T. The time analysis is equally simple.
Definition For each position i in pattern P, define sp_i(P) to be the length of the longest proper suffix of P[1..i] that matches a prefix of P.
Stated differently, sp_i(P) is the length of the longest proper substring of P[1..i] that ends at i and that matches a prefix of P. When the string is clear by context we will use sp_i in place of the full notation. For example, if P = abcaeabcabd, then sp_2 = sp_3 = 0, sp_4 = 1, sp_8 = 3, and sp_10 = 2. Note that by definition, sp_1 = 0 for any string. An optimized version of the Knuth-Morris-Pratt algorithm uses the following values.
Definition For each position i in pattern P, define sp'_i(P) to be the length of the longest proper suffix of P[1..i] that matches a prefix of P, with the added condition that characters P(i + 1) and P(sp'_i + 1) are unequal.
Clearly, sp'_i(P) ≤ sp_i(P) for all positions i and any string P. As an example, if P = bbccaebbcabd, then sp_8 = 2 because string bb occurs both as a proper prefix of P[1..8] and as a suffix of P[1..8]. However, both copies of the string are followed by the same character c, and so sp'_8 < 2. In fact, sp'_8 = 1 since the single character b occurs as both the first and last character of P[1..8] and is followed by character b in position 2 and by character c in position 9.
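Both definitions can be checked directly with a small brute-force routine (quadratic, for illustration only; the function name and 1-based output lists are ours). For i = n the added condition is vacuous, so sp'_n = sp_n:

    def sp_values(P):
        n = len(P)
        sp = [0] * (n + 1)    # sp[i] = sp_i (1-based; index 0 unused)
        spp = [0] * (n + 1)   # spp[i] = sp'_i
        for i in range(1, n + 1):
            for l in range(1, i):                 # proper suffix lengths
                if P[i - l:i] == P[:l]:
                    sp[i] = max(sp[i], l)
                    if i == n or P[l] != P[i]:    # P(l+1) != P(i+1)
                        spp[i] = max(spp[i], l)
        return sp, spp

    # sp, spp = sp_values("abcaeabcabd")   # sp[4] == 1, sp[8] == 3, sp[10] == 2
    # sp, spp = sp_values("bbccaebbcabd")  # sp[8] == 2 but spp[8] == 1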
The Knuth-Morris-Pratt shift rule

We will describe the algorithm in terms of the sp' values, and leave it to the reader to modify the algorithm if only the weaker sp values are used.¹ The Knuth-Morris-Pratt algorithm aligns P with T and then compares the aligned characters from left to right, as the naive algorithm does.
For any alignment of P and T, if the first mismatch (comparing from left to right) occurs in position i + 1 of P and position k of T, then shift P to the right (relative to T) so that P[1..sp'_i] aligns with T[k − sp'_i .. k − 1]. In other words, shift P exactly i + 1 − (sp'_i + 1) = i − sp'_i places to the right, so that character sp'_i + 1 of P will align with character k of T. In the case that an occurrence of P has been found (no mismatch), shift P by n − sp'_n places.
The shift rule guarantees that the prefix P[1..sp'_i] of the shifted P matches its opposing substring in T. The next comparison is then made between characters T(k) and P(sp'_i + 1).
The use of the stronger shift rule based on sp'_i guarantees that the same mismatch will not occur again in the new alignment, but it does not guarantee that T(k) = P(sp'_i + 1). In the above example, where P = abcxabcde and sp'_7 = 3, if character 8 of P mismatches then P will be shifted by 7 − 3 = 4 places. This is true even without knowing T or how P is positioned with T.

The advantage of the shift rule is twofold. First, it often shifts P by more than just a single character. Second, after a shift, the left-most sp'_i characters of P are guaranteed to match their counterparts in T. Thus, to determine whether the newly shifted P matches its counterpart in T, the algorithm can start comparing P and T at position sp'_i + 1 of P (and position k of T). For example, suppose P = abcxabcde as above, T = xyabcxabcxadcdqfeg, and the left end of P is aligned with character 3 of T. Then P and T will match for 7 characters but mismatch on character 8 of P, and P will be shifted
¹ The reader should be alerted that traditionally the Knuth-Morris-Pratt algorithm has been described in terms of failure functions, which are related to the sp'_i values. Failure functions will be explicitly defined in Section 2.3.3.
Definition For each position i from 1 to n + 1, define the failure function F'(i) to be sp'_{i−1} + 1 (and define F(i) to be sp_{i−1} + 1), where sp'_0 and sp_0 are defined to be zero.
We will only use the (stronger) failure function F'(i) in this discussion but will refer to F(i) later. After a mismatch in position i + 1 > 1 of P, the Knuth-Morris-Pratt algorithm "shifts" P so that the next comparison is between the character in position c of T and the character in position sp'_i + 1 of P. But sp'_i + 1 = F'(i + 1), so a general "shift" can be implemented in constant time by just setting p to F'(i + 1). Two special cases remain. When the mismatch occurs in position 1 of P, then p is set to F'(1) = 1 and c is incremented by one. When an occurrence of P is found, then P is shifted right by n − sp'_n places. This is implemented by setting F'(n + 1) to sp'_n + 1. Putting all the pieces together gives the full Knuth-Morris-Pratt algorithm.
Knuth-Morris-Pratt algorithm
begin
  Preprocess P to find F'(k) = sp'_{k−1} + 1 for k from 1 to n + 1.
  c := 1;
  p := 1;
  while c + (n − p) ≤ m do
    begin
      while P(p) = T(c) and p ≤ n do
        begin
          p := p + 1;
          c := c + 1;
        end;
      if p = n + 1 then
        report an occurrence of P starting at position c − n of T;
      if p = 1 then
        c := c + 1;
      p := F'(p);
    end;
end.
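A direct Python transcription of this pseudocode (0-based strings but 1-based counters c and p, as above; the sp'_i values may come from any preprocessing method, e.g., the brute-force sketch earlier, and occurrences are reported by their 1-based starting positions):

    def kmp_search(P, T, spp):
        # spp[i] = sp'_i for i = 0..n (spp[0] = 0); F'(k) = sp'_{k-1} + 1.
        n, m = len(P), len(T)
        F = [0] + [spp[k - 1] + 1 for k in range(1, n + 2)]   # F[k] = F'(k)
        occurrences = []
        c = p = 1
        while c + (n - p) <= m:
            while p <= n and P[p - 1] == T[c - 1]:
                p += 1
                c += 1
            if p == n + 1:
                occurrences.append(c - n)   # occurrence starts at c - n
            if p == 1:
                c += 1
            p = F[p]
        return occurrences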
Theorem 2.3.3. In the Knuth-Morris-Pratt method, the number of character comparisons is at most 2m.
PROOF Divide the algorithm into compare/shift phases, where a single phase consists of the comparisons done between successive shifts. After any shift, the comparisons in the phase go left to right and start either with the last character of T compared in the previous phase or with the character to its right. Since P is never shifted left, in any phase at most one comparison involves a character of T that was previously compared. Thus, the total number of character comparisons is bounded by m + s, where s is the number of shifts done in the algorithm. But s < m since after m shifts the right end of P is certainly to the right of the right end of T, so the number of comparisons done is bounded by 2m. □
Theorem 2.3.4. For any i > 1, sp'_i(P) = Z_j = i − j + 1, where j > 1 is the smallest position that maps to i. If there is no such j then sp'_i(P) = 0. For any i > 1, sp_i(P) = i − j + 1, where j is the smallest position in the range 1 < j ≤ i that maps to i or beyond. If there is no such j, then sp_i(P) = 0.
PROOF If sp'_i(P) is greater than zero, then there is a proper suffix α of P[1..i] that matches a prefix of P, such that P(i + 1) does not match P(|α| + 1). Therefore, letting j denote the start of α, Z_j = |α| = sp'_i(P) and j maps to i. Hence, if there is no j in the range 1 < j ≤ i that maps to i, then sp'_i(P) must be zero.

Now suppose sp'_i(P) > 0 and let j be as defined above. We claim that j is the smallest position in the range 2 to i that maps to i. Suppose not, and let j* be a position in the range 1 < j* < j that maps to i. Then P[j*..i] would be a proper suffix of P[1..i] that matches a prefix (call it β) of P. Moreover, by the definition of mapping, P(i + 1) ≠ P(|β| + 1), so sp'_i(P) ≥ |β| > |α|, contradicting the assumption that sp'_i(P) = |α|.

The proofs of the claims for sp_i(P) are similar and are left as exercises. □
Given Theorem 2.3.4, all the sp' and sp values can be computed in linear time using the Z_i values as follows:
Z-based Knuth-Morris-Pratt
for i := 1 to n do
  sp'_i := 0;
for j := n downto 2 do
  begin
    i := j + Z_j(P) − 1;
    sp'_i := Z_j;
  end;
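In Python, pairing this with a standard Z-algorithm gives linear-time preprocessing. The recovery of the weaker sp values at the end uses the fact that if sp_i exceeded sp'_i, the matched suffix could be extended by one more character, so sp_i = max(sp_{i+1} − 1, sp'_i). Lists are 1-based and the names are ours:

    def z_values(S):
        # Standard linear-time Z-algorithm (0-based).
        n = len(S)
        Z = [0] * n
        Z[0] = n
        l = r = 0
        for k in range(1, n):
            if k < r:
                Z[k] = min(r - k, Z[k - l])
            while k + Z[k] < n and S[Z[k]] == S[k + Z[k]]:
                Z[k] += 1
            if k + Z[k] > r:
                l, r = k, k + Z[k]
        return Z

    def sp_from_z(P):
        n = len(P)
        Z = z_values(P)
        spp = [0] * (n + 1)             # spp[i] = sp'_i
        for j in range(n, 1, -1):       # j = n downto 2; smallest j wins
            if Z[j - 1] > 0:            # Z_j(P) is Z[j-1] in 0-based terms
                spp[j + Z[j - 1] - 1] = Z[j - 1]
        sp = [0] * (n + 1)              # sp[i] = sp_i
        sp[n] = spp[n]
        for i in range(n - 1, 1, -1):
            sp[i] = max(sp[i + 1] - 1, spp[i])
        return sp, spp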
shift rule, the method becomes real time because it still never reexamines a position in T involved in a match (a feature inherited from the Knuth-Morris-Pratt algorithm), and it now also never reexamines a position involved in a mismatch. So, the search stage of this algorithm never examines a character in T more than once. It follows that the search is done in real time. Below we show how to find all the sp'_(i,x) values in linear time. Together, this gives an algorithm that does linear preprocessing of P and real-time search of T. It is easy to establish that the algorithm finds all occurrences of P in T, and we leave that as an exercise.
Note that the linear time (and space) bound for this method requires that the alphabet Σ be finite. This allows us to do |Σ| comparisons in constant time. If the size of the alphabet is explicitly included in the time and space bounds, then the preprocessing time and space needed for the algorithm is O(|Σ|n).
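A naive construction of the sp'_(i,x) table makes the O(|Σ|n) space explicit (quadratic time, for illustration only; the function name is ours, and the alphabet passed in must contain every character of P):

    def sp_x_values(P, alphabet):
        # table[(i, x)] = sp'_(i,x): the longest proper suffix of P[1..i]
        # matching a prefix of P whose next character in P is x.
        n = len(P)
        table = {(i, x): 0 for i in range(1, n + 1) for x in alphabet}
        for i in range(1, n + 1):
            for l in range(1, i):
                if P[i - l:i] == P[:l]:        # suffix of length l = prefix
                    x = P[l]                   # character P(l + 1)
                    table[(i, x)] = max(table[(i, x)], l)
        return table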
2.5. Exercises
1. In "typical" applications of exact matching, such as when searching for an English word in a book, the simple bad character rule seems to be as effective as the extended bad character rule. Give a "hand-waving" explanation for this. 2. When searching for a single word or a small phrase in a large English text, brute force (the naive algorithm) is reported [ I 841 to run faster than most other methods. Give a handwaving explanation for this. In general terms, how would you expect this observation to hold up with smaller alphabets (say in DNA with an alphabet size of four), as the size of the pattern grows, and when the text has many long sections of similar but not exact substrings?
3. "Common sense" and the O(nm) worst-case time bound of the Boyer-Moore algorithm (using only the bad character rule) both would suggest that empirical running times increase with increasing pattern length (assuming a fixed text). But when searching in actual English
case an occurrence of P in T has been found) or until a mismatch occurs at some positions i + 1 of P and k of T. In the latter case, if sp'_i > 0, then P is shifted right by i − sp'_i positions, guaranteeing that the prefix P[1..sp'_i] of the shifted pattern matches its opposing substring in T. No explicit comparison of those substrings is needed, and the next comparison is between characters T(k) and P(sp'_i + 1). Although the shift based on sp'_i guarantees that P(i + 1) differs from P(sp'_i + 1), it does not guarantee that T(k) = P(sp'_i + 1). Hence T(k) might be compared several times (perhaps Ω(|P|) times) with differing characters in P. For that reason, the Knuth-Morris-Pratt method is not a real-time method.

To be real time, a method must do at most a constant amount of work between the time it first examines any position in T and the time it last examines that position. In the Knuth-Morris-Pratt method, if a position of T is involved in a match, it is never examined again (this is easy to verify) but, as indicated above, this is not true when the position is involved in a mismatch. Note that the definition of real time only concerns the search stage of the algorithm. Preprocessing of P need not be real time. Note also that if the search stage is real time it certainly is also linear time.

The utility of a real-time matcher is twofold. First, in certain applications, such as when the characters of the text are being sent to a small memory machine, one might need to guarantee that each character can be fully processed before the next one is due to arrive. If the processing time for each character is constant, independent of the length of the string, then such a guarantee may be possible. Second, in this particular real-time matcher, the shifts of P may be longer but never shorter than in the original Knuth-Morris-Pratt algorithm. Hence, the real-time matcher may run faster in certain problem instances. Admittedly, arguments in favor of real-time matching algorithms over linear-time methods are somewhat tortured, and real-time matching is more a theoretical issue than a practical one. Still, it seems worthwhile to spend a little time discussing real-time matching.
Definition Let x denote a character of the alphabet. For each position i in pattern P, define sp'_(i,x)(P) to be the length of the longest proper suffix of P[1..i] that matches a prefix of P, with the added condition that character P(sp'_(i,x) + 1) is x.
Knowing the sp'_(i,x) values for each character x in the alphabet allows a shift rule that converts the Knuth-Morris-Pratt method into a real-time algorithm. Suppose P is compared against a substring of T and a mismatch occurs at characters T(k) = x and P(i + 1). Then P should be shifted right by i − sp'_(i,x) places. This shift guarantees that the prefix P[1..sp'_(i,x)] matches the opposing substring in T and that T(k) matches the next character in P. Hence, the comparison between T(k) and P(sp'_(i,x) + 1) can be skipped. The next needed comparison is between characters P(sp'_(i,x) + 2) and T(k + 1). With this
In Section 2.3.2, we showed that one can compute all the sp values knowing only the Z values for string S (i.e., not knowing S itself). In the next five exercises we establish the converse, creating a linear-time algorithm to compute all the Z values from the sp values alone. The first exercise suggests a natural method to accomplish this, and the following exercise exposes a hole in that method. The final three exercises develop a correct linear-time algorithm, detailed in [202]. We say that sp_i maps to k if k = i − sp_i + 1.
16. Suppose there is a position i such that sp_i maps to k, and let i be the largest such position. Prove that Z_k = i − k + 1 = sp_i and that r_k = i.

17. Given the answer to the previous exercise, it is natural to conjecture that Z_k always equals sp_i, where i is the largest position such that sp_i maps to k. Show that this is not true. Give an example using at least three distinct characters.
Stated another way, give an example to show that Z_k can be greater than zero even when there is no position i such that sp_i maps to k.
18. Recall that r_{k−1} is known at the start of iteration k of the Z algorithm (when Z_k is computed), but r_k is known only at the end of iteration k. Suppose, however, that r_k is known (somehow) at the start of iteration k. Show how the Z algorithm can then be modified to compute Z_k using no character comparisons. Hence this modified algorithm need not even know the string S.
19. Prove that if Z_k is greater than zero, then r_k equals the largest position i such that k ≥ i − sp_i. Conclude that r_k can be deduced from the sp values for every position k where Z_k is not zero.

20. Combine the answers to the previous two exercises to create a linear-time algorithm that computes all the Z values for a string S given only the sp values for S and not the string S itself.
4. Evaluate empirically the utility of the extended bad character rule compared to the original bad character rule. Perform the evaluation in combination with different choices for the two good-suffix rules. How much more is the average shift using the extended rule? Does the extra shift pay for the extra computation needed to implement it?

5. Evaluate empirically, using different assumptions about the sizes of P and T, the number of occurrences of P in T, and the size of the alphabet, the following idea for speeding up the Boyer-Moore method. Suppose that a phase ends with a mismatch and that the good suffix rule shifts P farther than the extended bad character rule. Let x and y denote the mismatching characters in T and P respectively, and let z denote the character in the shifted P below x. By the suffix rule, z will not be y, but there is no guarantee that it will be x. So rather than starting comparisons from the right of the shifted P, as the Boyer-Moore method would do, why not first compare x and z? If they are equal then a right-to-left comparison is begun from the right end of P, but if they are unequal then we apply the extended bad character rule from z in P. This will shift P again. At that point we must begin a right-to-left comparison of P against T.
6. The idea of the bad character rule in the Boyer-Moore algorithm can be generalized so that instead of examining characters in P from right to left, the algorithm compares characters in P in the order of how unlikely they are to be in T (most unlikely first). That is, it looks first at those characters in P that are least likely to be in T. Upon mismatching, the bad character rule or extended bad character rule is used as before. Evaluate the utility of this approach, either empirically on real data or by analysis assuming random strings.
7. Construct an example where fewer comparisons are made when the bad character rule is used alone, instead of combining it with the good suffix rule.
8. Evaluate empirically the effectiveness of the strong good suffix shift for Boyer-Moore versus the weak shift rule.

9. Give a proof of Theorem 2.2.4. Then show how to accumulate all the l'(i) values in linear time.
10. If we use the weak good suffix rule in Boyer-Moore that shifts the closest copy of t under the matched suffix t, but doesn't require the next character to be different, then the preprocessing for Boyer-Moore can be based directly on sp_i values rather than on Z values. Explain this.

11. Prove that the Knuth-Morris-Pratt shift rules (either based on sp or sp') do not miss any occurrences of P in T.

12. It is possible to incorporate the bad character shift rule from the Boyer-Moore method into the Knuth-Morris-Pratt method or into the naive matching method itself. Show how to do that. Then evaluate how effective that rule is and explain why it is more effective when used in the Boyer-Moore algorithm.

13. Recall the definition of l_i on page 8. It is natural to conjecture that sp_i = i − l_i for any index i, where i ≥ l_i. Show by example that this conjecture is incorrect.

14. Prove the claims in Theorem 2.3.4 concerning sp_i(P).

15. Is it true that given only the sp values for a given string P, the sp' values are completely determined? Are the sp values determined from the sp' values alone?
      if (p[k] = p[j]) then
        begin {3}
          kmp_shift[k] := j - k;
          j := j - 1;
        end {3}
      else
        kmp_shift[k] := j - k + 1;
    end; {2}
  {stage 2}
  j := j + 1;
  j_old := 1;
  while (j <= m) do
    begin {2}
      for i := j_old to j - 1 do
        if (gs_shift[i] > j - 1) then gs_shift[i] := j - 1;
      j_old := j;
      j := j + kmp_shift[j];
    end; {2}
end; {1}

begin {main}
  writeln('input a string on a single line');
  readstring(p, m);
  gsshift(p, matchshift, m);
  writeln('the value in cell i is the number of positions to shift');
  writeln('after a mismatch occurring in position i of the pattern');
  for i := 1 to m do
    write(matchshift[i] : 3);
  writeln;
end. {main}
25. Prove that the shift rule used by the real-time string matcher does not miss any occurrences of P in T.
26. Prove Theorem 2.4.1.
27. In this chapter, we showed how to use Z values to compute both the sp'_i and sp_i values used in Knuth-Morris-Pratt and the sp'_(i,x) values needed for its real-time extension. Instead of using Z values for the sp'_(i,x) values, show how to obtain these values from the sp_i and/or sp'_i values in linear [O(n|Σ|)] time, where n is the length of P and |Σ| is the size of the alphabet.
28. Although we don't know how to simply convert the Boyer-Moore algorithm to be a real-time method the way Knuth-Morris-Pratt was converted, we can make similar changes to the strong shift rule to make the Boyer-Moore shift more effective. That is, when a mismatch occurs between P(i) and T(h) we can look for the right-most copy in P of P[i+1..n] (other than P[i+1..n] itself) such that the preceding character is T(h). Show how to modify
  for k := m - 1 downto 1 do
    begin {2}
      go_on := true;
      while (p[j] <> p[k]) and go_on do
        begin {3}
          if (gs_shift[j] > j - k) then gs_shift[j] := j - k;
          if (j < m) then j := j + kmp_shift[j + 1]
          else go_on := false;
        end; {3}
Figure 2.3: The pattern P = aqra labels two subpaths of paths starting at the root. Those paths start at the root, but the subpaths containing aqra do not. There is also another subpath in the tree labeled aqra (it starts above the character z), but it violates the requirement that it be a subpath of a path starting at the root. Note that an edge label is displayed from the top of the edge down towards the bottom of the edge. Thus in the figure, there is an edge labeled "qra", not "arq".
the Boyer-Moore preprocessing so that the needed information is collected in linear time, assuming a fixed size alphabet.

29. Suppose we are given a tree where each edge is labeled with one or more characters, and we are given a pattern P. The label of a subpath in the tree is the concatenation of the labels on the edges in the subpath. The problem is to find all subpaths of paths starting at the root that are labeled with pattern P. Note that although the subpath must be part of a path directed from the root, the subpath itself need not start at the root (see Figure 2.3). Give an algorithm for this problem that runs in time proportional to the total number of characters on the edges of the tree plus the length of P.
4. If M(h) > N_i and N_i < i, then P matches T from the right end of P down to character i − N_i + 1 of P, but the next pair of characters mismatch [i.e., P(i − N_i) ≠ T(h − N_i)]. Hence P matches T for j − h + N_i characters and mismatches at position i − N_i of P. M(j) must be set to a value less than or equal to j − h + N_i. Set M(j) to j − h. Shift P by the Boyer-Moore rules based on a mismatch at position i − N_i of P (this ends the phase).

5. If M(h) = N_i and 0 < N_i < i, then P and T must match for at least M(h) characters to the left, but the left end of P has not yet been reached, so set i to i − M(h) and set h to h − M(h) and repeat the phase algorithm.
The following definitions and lemma will be helpful in bounding the work done by the algorithm.
Definition If j is a position where M(j) is greater than zero, then the interval [j − M(j) + 1 .. j] is called a covered interval defined by j.

Definition Let j' < j and suppose covered intervals are defined for both j and j'. We say that the covered intervals for j and j' cross if j − M(j) + 1 ≤ j' and j' − M(j') + 1 < j − M(j) + 1 (see Figure 3.2).
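In code the crossing test is a one-line check; a small Python sketch (the function name is ours):

    def intervals_cross(j, Mj, jp, Mjp):
        # Covered intervals [j - Mj + 1 .. j] and [jp - Mjp + 1 .. jp], with
        # jp < j: they cross when j's interval starts inside jp's interval but
        # jp's interval starts strictly to the left (neither contains the other).
        return j - Mj + 1 <= jp and jp - Mjp + 1 < j - Mj + 1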
Lemma 3.1.1. No covered intervals computed by the algorithm ever cross each other. Moreover, if the algorithm examines a position h of T in a covered interval, then h is at the right end of that interval.
Figure 3.1: Substring α has length N_i and substring β has length M(h) > N_i. The two strings must match from their right ends for N_i characters, but mismatch at the next character.
at position j. As the algorithm proceeds, a value for M(j) is set for every position j in T that is aligned with the right end of P; M(j) is undefined for all other positions in T.

The second modification exploits the vectors N and M to speed up the Boyer-Moore algorithm by inferring certain matches and mismatches. To get the idea, suppose the Boyer-Moore algorithm is about to compare characters P(i) and T(h), and suppose it knows that M(h) > N_i (see Figure 3.1). That means that an N_i-length substring of P ends at position i and matches a suffix of P, while an M(h)-length substring of T ends at position h and matches a suffix of P. So the N_i-length suffixes of those two substrings must match, and we can conclude that the next N_i comparisons (from P(i) and T(h) moving leftward) in the Boyer-Moore algorithm would be matches. Further, if N_i = i, then an occurrence of P in T has been found, and if N_i < i, then we can be sure that the next comparison (after the N_i matches) would be a mismatch. Hence in simulating Boyer-Moore, if M(h) > N_i we can avoid at least N_i explicit comparisons. Of course, it is not always the case that M(h) > N_i, but all the cases are similar and are detailed below.
Phase algorithm

1. If M(h) is undefined or M(h) = N_i = 0, then compare T(h) and P(i) as follows:

   If T(h) = P(i) and i = 1, then report an occurrence of P ending at position j of T, set M(j) = n, and shift as in the Boyer-Moore algorithm (ending this phase).

   If T(h) = P(i) and i > 1, then set h to h − 1 and i to i − 1 and repeat the phase algorithm.

   If T(h) ≠ P(i), then set M(j) = j − h and shift P according to the Boyer-Moore rules based on a mismatch occurring in position i of P (this ends the phase).
2. If M(h) < N_i, then P matches its counterparts in T from position n down to position i − M(h) + 1 of P. By the definition of M(h), P might match more of T to the left, so set i to i − M(h), set h to h − M(h), and repeat the phase algorithm.

3. If M(h) ≥ N_i and N_i = i > 0, then declare that an occurrence of P has been found in T ending at position j. M(j) must be set to a value less than or equal to n. Set M(j) to j − h, and shift according to the Boyer-Moore rules based on finding an occurrence of P ending at j (this ends the phase).
if the comparison involving T(h) is a match then, at the end of the phase, M(j) is set at least as large as j − h + 1. That means that all characters in T that matched a character of P during that phase are contained in the covered interval [j − M(j) + 1 .. j]. Now the algorithm only examines the right end of an interval, and if h is the right end of an interval then M(h) is defined and greater than 0, so the algorithm never compares a character of T in a covered interval. Consequently, no character of T will ever be compared again after it is first in a match. Hence the algorithm finds at most m matches, and the total number of character comparisons is bounded by 2m.

To bound the amount of additional work, we focus on the number of accesses of M during execution of the five cases since the amount of additional work is proportional to the number of such accesses. A character comparison is done whenever Case 1 applies. Whenever Case 3 or 4 applies, P is immediately shifted. Hence Cases 1, 3, and 4 can apply at most O(m) times since there are at most O(m) shifts and compares. However, it is possible that Case 2 or Case 5 can apply without an immediate shift or immediate character comparison. That is, Case 2 or 5 could apply repeatedly before a comparison or shift is done. For example, Case 5 would apply twice in a row (without a shift or character comparison) if N_i = M(h) > 0 and N_{i−N_i} = M(h − M(h)). But whenever Case 2 or 5 applies, then j > h and M(j) will certainly get set to j − h + 1 or more at the end of that phase. So position h will be in the strict interior of the covered interval defined by j. Therefore, h will never be examined again, and M(h) will never be accessed again. The effect is that Cases 2 and 5 can apply at most once for any position in T, so the number of accesses made when these cases apply is also O(m). □
Figure 3.2: a. Diagram showing covered intervals that do not cross, although one interval can contain another. b. Two covered intervals that do cross.
PROOF The proof is by induction on the number of intervals created. Certainly the claim is true until the first interval is created, and that interval does not cross itself. Now assume that no intervals cross and consider the phase where the right end of P is aligned with position j of T. Since h = j at the start of the phase, and j is to the right of any interval, h begins outside any interval. We consider how h could first be set to a position inside an interval, other than the right end of the interval. Case 1 is never executed when h is at the right end of an interval (since then M(h) is defined and greater than zero), and after any execution of Case 1, either the phase ends or h is decremented by one place. So an execution of Case 1 cannot cause h to move beyond the right-most character of a covered interval. This is also true for Cases 3 and 4 since the phase ends after either of those cases. So if h is ever moved into an interval in a position other than its right end, that move must follow an execution of Case 2 or 5. An execution of Case 2 or 5 moves h from the right end of some interval I = [k..h] to position k − 1, one place to the left of I. Now suppose that k − 1 is in some interval I' but is not at its right end, and that this is the first time in the phase that h (presently k − 1) is in an interval in a position other than its right end. That means that the right end of I cannot be to the left of the right end of I' (for then position k − 1 would have been strictly inside I'), and the right ends of I and I' cannot be equal (since M(h) has at most one value for any h). But these conditions imply that I and I' cross, which is assumed to be untrue. Hence, if no intervals cross at the start of the phase, then in that phase only the right end of any covered interval is examined. A new covered interval gets created in the phase only after the execution of Case 1, 3, or 4. In any of these cases, the interval [h + 1 .. j] is created after the algorithm examines position h. In Case 1, h is not in any interval, and in Cases 3 and 4, h is the right end of an interval, so in all cases h + 1 is either not in a covered interval or is at the left end of an interval. Since j is to the right of any interval, and h + 1 is either not in an interval or is the left end of one, the new interval [h + 1 .. j] does not cross any existing interval. The previously existing intervals have not changed, so there are no crossing intervals at the end of the phase, and the induction is complete. □
Theorem 3.1.2. The modified Apostolico-Giancarlo algorithm does at most 2m character comparisons and at most O(m) additional work.
PROOF Every phase ends if a comparison finds a mismatch, and every phase, except the last, is followed by a nonzero shift of P. Thus the algorithm can find at most m mismatches. To bound the matches, observe that characters are explicitly compared only in Case 1, and
be periodic. For example, abababab is periodic with period abab and also with shorter period ab. An alternate definition of a semiperiodic string is sometimes useful.
Definition A string α is prefix semiperiodic with period γ if α consists of one or more copies of string γ followed by a nonempty prefix (possibly the entire string) of γ.
We use the term "prefix semiperiodic" to distinguish this definition from the definition given for "semiperiodic", but the following lemma (whose proof is simple and is left as an exercise) shows that these two definitions are really alternate reflections of the same structure.
Lemma 3.2.2. A string α is semiperiodic with period β if and only if it is prefix semiperiodic with a period of the same length as β.
For example, the string abaabaabaabaabaab is semiperiodic with period aab and is prefix semiperiodic with period aba. The following useful lemma is easy to verify, and its proof is typical of the style of thinking used in dealing with overlapping matches.
Lemma 3.2.3. Suppose pattern P occurs in text T starting at positions p and p' > p, where p' − p ≤ ⌊n/2⌋. Then P is semiperiodic with a period of length p' − p.
The following lemma, called the GCD Lemma, is a very powerful statement about periods of strings. We won't need the lemma in our discussion of Cole's proof, but it is natural to state it here. We will prove it and use it in Section 16.17.5.
Lemma 3.2.4. Suppose string α is semiperiodic with both a period of length p and a period of length q, and |α| ≥ p + q. Then α is semiperiodic with a period whose length is the greatest common divisor of p and q.
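The two notions used by these lemmas, semiperiodic and periodic (stated in full in the passage below), can be made concrete with two small Python predicates (the function names are ours):

    def is_periodic(a, b):
        # a consists of two or more complete copies of b.
        q, r = divmod(len(a), len(b))
        return q >= 2 and r == 0 and a == b * q

    def is_semiperiodic(a, b):
        # a is a nonempty suffix of b followed by one or more copies of b.
        if len(a) <= len(b):
            return False
        s = len(a) % len(b) or len(b)   # length of the leading suffix of b
        return a == b[len(b) - s:] + b * ((len(a) - s) // len(b))

    # is_semiperiodic("bcabcabc", "abc") and is_periodic("abcabc", "abc")
    # are both True, matching the examples discussed in this section.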
Starting from the right end of P̄, mark off substrings of length s_i until fewer than s_i characters remain on the left (see Figure 3.4). There will be at least three full substrings since |P̄| = |t_i| + 1 > 3s_i. Phase i ends by shifting P right by s_i positions. Consider how P̄ aligns with T before and after that shift (see Figure 3.5). By definition of s_i and α, α is the part of the shifted P to the right of the original P̄. By the good suffix rule, the portion
characters is denoted t_i, and the mismatch occurs just to the left of t_i. The pattern is then shifted right by an amount determined by the good suffix rule.
3.2.1. Cole's proof when the pattern does not occur in the text
Definition Let s_i denote the amount by which P is shifted right at the end of phase i.

Assume that P does not occur in T, so the compare part of every phase ends with a mismatch. In each compare/shift phase, we divide the comparisons into those that compare a character of T that has previously been compared (in a previous phase) and those comparisons that compare a character of T for the first time in the execution of the algorithm. Let g_i be the number of comparisons in phase i of the first type (comparisons involving a previously examined character of T), and let g'_i be the number of comparisons in phase i of the second type. Then, over the entire algorithm, the number of comparisons is Σ_{i=1}^q (g_i + g'_i), where q is the number of phases, and our goal is to show that this sum is O(m). Certainly, Σ_{i=1}^q g'_i ≤ m since a character can be compared for the first time only once. We will show that for any phase i, s_i ≥ g_i/3. Then since Σ_{i=1}^q s_i ≤ m (because the total length of all the shifts is at most m) it will follow that Σ_{i=1}^q g_i ≤ 3m. Hence the total number of comparisons done by the algorithm is Σ_{i=1}^q (g_i + g'_i) ≤ 4m.
An initial lemma

We start with the following definition and a lemma that is valuable in its own right.

Definition For any string β, β^i denotes the string obtained by concatenating together i copies of β.
Lemma 3.2.1. Let γ and δ be two nonempty strings such that γδ = δγ. Then δ = ρ^i and γ = ρ^j for some string ρ and positive integers i and j.
This lemma says that if a string is the same before and after a circular shift (so that it can be written both as γδ and δγ, for some strings γ and δ) then γ and δ can both be written as concatenations of some single string ρ. For example, let δ = abab and γ = ababab, so δγ = ababababab = γδ. Then ρ = ab, δ = ρ^2, and γ = ρ^3.

PROOF The proof is by induction on |δ| + |γ|. For the basis, if |δ| + |γ| = 2, it must be that δ = γ = ρ and i = j = 1. Now consider larger lengths. If |δ| = |γ|, then again δ = γ = ρ and i = j = 1. So suppose |δ| < |γ|. Since δγ = γδ and |δ| < |γ|, δ must be a prefix of γ, so γ = δδ' for some string δ'. Substituting this into δγ = γδ gives δδδ' = δδ'δ. Deleting the left copy of δ from both sides gives δδ' = δ'δ. However, |δ| + |δ'| = |γ| < |δ| + |γ|, and so by induction, δ = ρ^i and δ' = ρ^l. Thus, γ = δδ' = ρ^k, where k = i + l. □
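A tiny Python check of the lemma (our sketch): when γδ = δγ, the common block ρ can be taken to be the prefix of length gcd(|γ|, |δ|).

    from math import gcd

    def common_root(gamma, delta):
        # If gamma + delta == delta + gamma, both strings are powers of one
        # string rho; the prefix of length gcd(|gamma|, |delta|) works.
        assert gamma + delta == delta + gamma
        rho = gamma[:gcd(len(gamma), len(delta))]
        assert gamma == rho * (len(gamma) // len(rho))
        assert delta == rho * (len(delta) // len(rho))
        return rho

    # The example above: common_root("ababab", "abab") returns "ab".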
Definition A string α is semiperiodic with period β if α consists of a nonempty suffix of a string β (possibly the entire β) followed by one or more copies of β. String α is called periodic with period β if α consists of two or more complete copies of β. We say that string α is periodic if it is periodic with some period β.

For example, bcabcabc is semiperiodic with period abc, but it is not periodic. String abcabc is periodic with period abc. Note that a periodic string is by definition also semiperiodic. Note also that a string cannot have itself as a period although a period may itself
Figure 3.7: The case when the right end of P is aligned with a right end of β. A mismatch must occur between T(k') and P(k).
concreteness, say that its right end is q|β| places to the left of the right end of t_i, where q ≥ 1 (see Figure 3.7). We will first deduce how phase h must have ended, and then we'll use that to prove the lemma.

Let k' be the position in T just to the left of t_i (so T(k') is involved in the mismatch ending phase i), and let k be the position in P opposite T(k') in phase h. We claim that, in phase h, the comparison of P and T will find matches until the left end of t_i but then mismatch when comparing T(k') and P(k). The reason is the following: Strings P̄ and t_i are semiperiodic with period β, and in phase h the right end of P is aligned with the right end of some β. So in phase h, P and T will certainly match until the left end of string t_i. Now P̄ is semiperiodic with β, and in phase h, the right end of P is exactly q|β| places to the left of the right end of t_i. Therefore, letting P(l) denote the character of P compared with T(k') in phase i, P(l) = P(l + |β|) = ··· = P(l + q|β|) = P(k). But in phase i the mismatch occurs when comparing T(k') with P(l), so P(k) = P(l) ≠ T(k'). Hence, if in phase h the right end of P is aligned with the right end of a β, then phase h must have ended with a mismatch between T(k') and P(k). This fact will be used below to prove the lemma.¹

Now we consider the possible shifts of P done in phase h. We will show that every possible shift leads to a contradiction, so no shifts are possible and the assumed alignment of P and T in phase h is not possible, proving the lemma. Since h < i, the right end of P will not be shifted in phase h past the right end of t_i; consequently, after the phase-h shift a character of P is opposite character T(k') (the character of T that will mismatch in phase i). Consider where the right end of P is after the phase-h shift. There are two cases to consider:

1. Either the right end of P is opposite the right end of another full copy of β (in t_i), or
2. The right end of P is in the interior of a full copy of β.
Case 1 If the phase-h shift aligns the right end of P with the right end of a full copy of β, then the character opposite T(k') would be P(k − r|β|) for some r ≥ 1. But since P is
¹ Later we will analyze the Boyer-Moore algorithm when P is in T. For that purpose we note here that when phase h is assumed to end by finding an occurrence of P, then the proof of Lemma 3.2.6 is complete at this point, having established a contradiction. That is, on the assumption that the right end of P is aligned with the right end of a β in phase h, we proved that phase h ends with a mismatch, which would contradict the assumption that h ends by finding an occurrence of P in T. So even if phase h ends by finding an occurrence of P, the right end of P could not be aligned with the right end of a β block in phase h.
Figure 3.3: String α has length s_i; string P̄ has length |t_i| + 1. The mismatch occurs just to the left of P̄.

Figure 3.4: Starting from the right, substrings of length |α| = s_i are marked off in P̄.

Figure 3.5: The arrows show the string equalities described in the proof.
of the shifted P below t_i must match the portion of the unshifted P̄ below t_i, so the second marked-off substring from the right end of the shifted P̄ must be the same as the first substring of the unshifted P̄. Hence they must both be copies of string α. But the second substring is the same in both copies of P̄, so continuing this reasoning we see that all the s_i-length marked substrings are copies of α and the left-most substring is a suffix of α (if it is not a complete copy of α). Hence P̄ is semiperiodic with period α. The right-most |t_i| characters of P match t_i, and so t_i is also semiperiodic with period α. Then since α = β^l, P̄ and t_i must also be semiperiodic with period β.

Recall that we want to bound g_i, the number of characters compared in the i-th phase that have been previously compared in earlier phases. All but one of the characters compared in phase i are contained in t_i, and a character in t_i could have previously been examined only during a phase where P overlaps t_i. So to bound g_i, we closely examine in what ways P could have overlapped t_i during earlier phases.
Lemma 3.2.6. If |t_i| + 1 > 3s_i, then in any phase h < i, the right end of P could not have been aligned opposite the right end of any full copy of β in substring t_i of T.
PROOF By Lemma 3.2.5, t_i is semiperiodic with period β. Figure 3.6 shows string t_i as a concatenation of copies of string β. In phase h, the right end of P cannot be aligned with the right end of t_i since that is the alignment of P and T in phase i > h, and P must have moved right between phases h and i. So suppose, for contradiction, that in phase h the right end of P is aligned with the right end of some other full copy of β in t_i. For
so that the two characters of P aligned with T(k'') before and after the shift are unequal. We claim these conditions hold when the right end of P is aligned with the right end of β'. Consider that alignment. Since P is semiperiodic with period β, that alignment of P and T would match at least until the left end of t_i and so would match at position k'' of T. Therefore, the two characters of P aligned with T(k'') before and after the shift cannot be equal. Thus if the end of P were aligned with the end of β' then all the characters of T that matched in phase h would again match, and the characters of P aligned with T(k'') before and after the shift would be different. Hence the good suffix rule would not shift the right end of P past the right end of β'. Therefore, if the right end of P is aligned in the interior of β' in phase h, it must also be in the interior of β' in phase h + 1. But h was arbitrary, so the phase-(h + 1) shift would also not move the right end of P past β'. So if the right end of P is in the interior of β' in phase h, it remains there forever. This is impossible since in phase i > h the right end of P is aligned with the right end of t_i, which is to the right of β'. Hence the right end of P is not in the interior of β', and the lemma is proved. □
Note again that Lemma 3.2.8 holds even if phase h is assumed to end by finding an occurrence of P in T. That is, the proof only needs the assumption that phase i ends with a mismatch, not that phase h does. In fact, when phase h finds an occurrence of P in T, then the proof of the lemma only needs the reasoning contained in the first two paragraphs of the above proof.
Theorem 3.2.1. Assuming P does not occur in T, s_i ≥ g_i/3 in every phase i.
PROOF This is trivially true if s_i ≥ (|t_i| + 1)/3, so assume |t_i| + 1 > 3s_i. By Lemma 3.2.8, in any phase h < i, the right end of P is opposite either one of the left-most |β| − 1 characters of t_i or one of the right-most |β| characters of t_i (excluding the extreme right character). By Lemma 3.2.7, at most |β| comparisons are made in phase h < i. Hence the only characters compared in phase i that could possibly have been compared before phase i are the left-most |β| − 1 characters of t_i, the right-most 2|β| characters of t_i, or the character just to the left of t_i. So g_i ≤ 3|β| ≤ 3s_i when |t_i| + 1 > 3s_i. In both cases then, s_i ≥ g_i/3. □
Theorem 3.2.2. [108] Assuming that P does not occur in T, the worst-case number of comparisons made by the Boyer-Moore algorithm is at most 4m.
PROOF As noted before, Σ_{i=1}^q g'_i ≤ m and Σ_{i=1}^q s_i ≤ m, so the total number of comparisons done by the algorithm is Σ_{i=1}^q (g_i + g'_i) ≤ (Σ_{i=1}^q 3s_i) + m ≤ 4m. □
3.2.2. The case when the pattern does occur in the text
Consider P consisting of n copies of a single character and T consisting of m copies of the same character. Then P occurs in T starting at every position in T except the last n − 1 positions, and the number of comparisons done by the Boyer-Moore algorithm is O(mn). The O(m) time bound proved in the previous section breaks down because it was derived by showing that g_i ≤ 3s_i, and that required the assumption that phase i ends with a mismatch. So when P does occur in T (and phases do not necessarily end with mismatches), we must modify the Boyer-Moore algorithm in order to recover the linear running time. Galil [168] gave the first such modification. Below we present a version of his idea. The approach comes from the following observation: Suppose in phase i that the right end of P is positioned with character k of T, and that P is compared with T down
Figure 3.8: Case when the right end of P is aligned with a character in the interior of a β. Then t_i would have a smaller period than β, contradicting the definition of β.
semiperiodic with period β, P(k) must be equal to P(k − r|β|), contradicting the good suffix rule.
Case 2 Suppose the phase-h shift aligns P so that its right end aligns with some character in the interior of a full copy of β. That means that, in this alignment, the right end of some β string in P is opposite a character in the interior of a β in t_i. Moreover, by the good suffix rule, the characters in the shifted P below that β agree with it (see Figure 3.8). Let γδ be the string in the shifted P positioned opposite that β in t_i, where γ is the string through the end of the β in P and δ is the remainder. Since γδ = β, γ is a suffix of β and δ is a prefix of β with |γ| + |δ| = |β|; thus also δγ = β, so γδ = δγ. By Lemma 3.2.1, however, β = ρ^t for t > 1, which contradicts the assumption that β is the smallest string such that α = β^l for some l.

Starting with the assumption that in phase h the right end of P is aligned with the right end of a full copy of β, we reached the conclusion that no shift in phase h is possible. Hence the assumption is wrong and the lemma is proved. □
Since P is not aligned with the end of any β in phase h, if P matches t_i in T for |β| or more characters then the right-most |β| characters of P would match a string consisting of a suffix (γ) of β followed by a prefix (δ) of β. So we would again have β = γδ = δγ, and by Lemma 3.2.1, this again would lead to a contradiction to the selection of β.

Note again that this lemma holds even if phase h is assumed to find an occurrence of P. That is, nowhere in the proof is it assumed that phase h ends with a mismatch, only that phase i does. This observation will be used later.
Lemma 3.2.8. If |t_i| + 1 > 3s_i, then in phase h < i, if the right end of P is aligned with a character in t_i, it can only be aligned with one of the left-most |β| − 1 characters of t_i or one of the right-most |β| characters of t_i.
PROOF Suppose in phase h that the right end of P is aligned with a character of t_i other than one of the left-most |β| − 1 characters or the right-most |β| characters. For concreteness, say that the right end of P is aligned with a character in copy β' of string β. Since β' is not the left-most copy of β, the right end of P is at least |β| characters to the right of the left end of t_i, and so by Lemma 3.2.7 a mismatch would occur in phase h before the left end of t_i is reached. Say that mismatch occurs at position k'' of T. After that mismatch, P is shifted right by some amount determined by the good suffix rule. By Lemma 3.2.6, the phase-h shift cannot move the right end of P to the right end of β', and we will show that the shift will also not move the end of P past the right end of β'. Recall that the good suffix rule shifts P (when possible) by the smallest amount so that all the characters of T that matched in phase h again match with the shifted P and
all comparisons in phases that end with a mismatch have already been accounted for (in the accounting for phases not in Q) and are ignored here.

Let k' > k > i be a phase in which an occurrence of P is found overlapping the earlier run but is not part of that run. As an example of such an overlap, suppose P = axaaxa and T contains the substring axaaxaaxaaxaxaaxa. Then a run begins at the start of the substring and ends with its twelfth character, and an overlapping occurrence of P (not part of the run) begins with that character. Even with the Galil rule, characters in the run will be examined again in phase k', and since phase k' does not end with a mismatch those comparisons must still be counted.

In phase k', if the left end of the new occurrence of P in T starts at a left end of a copy of β in the run, then contiguous copies of β continue past the right end of the run. But then no mismatch would have been possible in phase k since the pattern in phase k is aligned exactly |β| places to the right of its position in phase k − 1 (where an occurrence of P was found). So in phase k', the left end of the new P in T must start with an interior character of some copy of β. But then if P overlaps with the run by more than |β| characters, Lemma 3.2.1 implies that β is periodic, contradicting the selection of β. So P can overlap the run only by part of the run's right-most copy of β. Further, since phase k' ends by finding an occurrence of P, the pattern is shifted right by s_{k'} = |β| positions. Thus any phase that finds an occurrence of P overlapping an earlier run next shifts P by a number of positions larger than the length of the overlap (and hence the number of comparisons). It follows then that over the entire algorithm the total number of such additional comparisons in overlapping regions is O(m). All comparisons are accounted for and hence Σ_{i∈Q} d_i = O(m), finishing the proof of the theorem. □
Theorem 3.2.4. When both shift rules are used together, the worst-case running time of the modified Boyer-Moore algorithm remains O(m).
PROOF
In the analysis using only the suffix rule we focused on the comparisons done in an arbitrary phase i. In phase i the right end of P was aligned with some character of T. However, we never made any assumptions about how P came to be positioned there. Rather, given an arbitrary placement of P in a phase ending with a mismatch, we deduced bounds on how many characters compared in that phase could have been compared in earlier phases. Hence all of the lemmas and analyses remain correct if P is arbitrarily picked up and moved some distance to the right at any time during the algorithm. The (extended) bad character rule only moves P to the right, so all lemmas and analyses showing the O(m) bound remain correct even with its use.
to character s of T. (We don't specify whether the phase ends by finding a mismatch or by finding an occurrence of P in T.) If the phase-i shift moves P so that its left end is to the right of character s of T, then in phase i + 1 a prefix of P definitely matches the characters of T up to T(k). Thus, in phase i + 1, if the right-to-left comparisons get down to position k of T, the algorithm can conclude that an occurrence of P has been found even without explicitly comparing characters to the left of T(k + 1). It is easy to implement this modification to the algorithm, and we assume in the rest of this section that the Boyer-Moore algorithm includes this rule, which we call the Galil rule.
Theorem 3.2.3. Using the Galil rule, the Boyer-Moore algorithm never does more than O(m) comparisons, no matter how many occurrences of P there are in T.
PROOF Partition the phases into those that do find an occurrence of P and those that do not. Let Q be the set of phases of the first type and let d_i be the number of comparisons done in phase i if i ∈ Q. Then Σ_{i∈Q} d_i + Σ_{i∉Q} (|t_i| + 1) is a bound on the total number of comparisons done in the algorithm.

The quantity Σ_{i∉Q} (|t_i| + 1) is again O(m). To see this, recall that the lemmas of the previous section, which proved that g_i ≤ 3s_i, only needed the assumption that phase i ends with a mismatch and that h < i. In particular, the analysis of how P of phase h < i is aligned with P of phase i did not need the assumption that phase h ends with a mismatch. Those proofs cover both the case that h ends with a mismatch and the case that h ends by finding an occurrence of P. Hence it again holds that g_i ≤ 3s_i if phase i ends with a mismatch, even though earlier phases might end with a match.

For phases in Q, we again ignore the case that s_i ≥ (n + 1)/3 ≥ (d_i + 1)/3, since the total number of comparisons done in such phases must be bounded by Σ 3s_i ≤ 3m. So suppose phase i ends by finding an occurrence of P in T and then shifts by less than n/3. By a proof essentially the same as for Lemma 3.2.5 it follows that P is semiperiodic; let β denote the shortest period of P. Hence the shift in phase i moves P right by exactly |β| positions, and using the Galil rule in the Boyer-Moore algorithm, no character of T compared in phase i + 1 will have ever been compared previously. Repeating this reasoning, if phase i + 1 ends by finding an occurrence of P then P will again shift by exactly |β| places and no comparisons in phase i + 2 will examine a character of T compared in any earlier phase. This cycle of shifting P by exactly |β| positions and then identifying another occurrence of P by examining only |β| new characters of T may be repeated many times. Such a succession of overlapping occurrences of P then consists of a concatenation of copies of β (each copy of P starts exactly |β| places to the right of the previous occurrence) and is called a run. Using the Galil rule, it follows immediately that in any single run the number of comparisons used to identify the occurrences of P contained in that run is exactly the length of the run. Therefore, over the entire algorithm the number of comparisons used to find those occurrences is O(m).

If no additional comparisons were possible with characters in a run, then the analysis would be complete. However, additional examinations are possible and we have to account for them. A run ends in some phase k > i when a mismatch is found (or when the algorithm terminates). It is possible that characters of T in the run could be examined again in phases after k. A phase that reexamines characters of the run either ends with a mismatch or ends by finding an occurrence of P that overlaps the earlier run but is not part of it. However,
Figure 3.11: β̃ must be a suffix of α.
sp_k + 1, then β̃ would be a prefix of P that is longer than α. But β̃ is also a proper suffix of P[1..k] (because β̃x is a proper suffix of P[1..k+1]). Those two facts would contradict the definition of sp_k (and the selection of α). Hence sp_{k+1} ≤ sp_k + 1. Now clearly, sp_{k+1} = sp_k + 1 if the character to the right of α is x, since αx would then be a prefix of P that also occurs as a proper suffix of P[1..k+1]. Conversely, if sp_{k+1} = sp_k + 1 then the character after α must be x. □
Lemma 3.3.1 identifies the largest "candidate" value for sp_{k+1} and suggests how to initially look for that value (and for string β). We should first check the character P(sp_k + 1), just to the right of α. If it equals x, then we conclude that β̃ equals α, β is αx, and sp_{k+1} equals sp_k + 1. But what do we do if the two characters are not equal?
(**) β̃ is the longest proper prefix of P[1..sp_k] that matches a suffix of P[1..k] and that is followed by character x in position |β̃| + 1 of P.
3.3. The original preprocessing for Knuth-Morris-Pratt

3.3.1. The method does not use fundamental preprocessing
In Section 1.3 we showed how to compute all the sp_i values from Z_i values obtained during fundamental preprocessing of P. The use of Z_i values was conceptually simple and allowed a uniform treatment of various preprocessing problems. However, the classical preprocessing method given in Knuth-Morris-Pratt [278] is not based on fundamental preprocessing. The approach taken there is very well known and is used or extended in several additional methods (such as the Aho-Corasick method that is discussed next). For those reasons, a serious student of string algorithms should also understand the classical algorithm for Knuth-Morris-Pratt preprocessing.

The preprocessing algorithm computes sp_i(P) for each position i from i = 2 to i = n (sp_1 is zero). To explain the method, we focus on how to compute sp_{k+1} assuming that sp_i is known for each i ≤ k. The situation is shown in Figure 3.9, where string α is the prefix of P of length sp_k. That is, α is the longest string that occurs both as a proper prefix of P and as a substring of P ending at position k. For clarity, let α' refer to the copy of α that ends at position k. Let x denote character k + 1 of P, and let β = β̃x denote the prefix of P of length sp_{k+1} (i.e., the prefix that the algorithm will next try to compute). Finding sp_{k+1} is equivalent to finding string β̃. And clearly,
*)
B is the longest proper prefix of P [ l..k] that matches a suffix of P [ 1..k] and that
+ 1 of P. See Figure 3.10.
Lemma 3.3.1. For any k , spk+l 5 spk 1 . Further; s p k + ~ = spk 1 if and only i f the character after a is X . That is, spk+! = spk 1 if and only if P(spk 1 ) = P ( k 1).
+ +
Let B = j!?x denote the prefix of P of length spk+l. That is, = g x i s the longest proper suffix of P [ 1..k 11 that is a prefix of P. If spk+~is strictly greater than
PROOF
"Pk
Figure 3.9: The situation after finding spk.
k k+l
Figure 3.1 0: spk,, is found by finding
6.
51
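As a quick concrete check (our example, not the book's): for P = abaabab, the sp values are sp_1, ..., sp_7 = 0, 0, 1, 1, 2, 3, 2. Each value obeys Lemma 3.3.1. For instance, sp_6 = sp_5 + 1 = 3 because P(sp_5 + 1) = P(3) = a = P(6), while sp_7 drops to 2 because P(sp_6 + 1) = P(4) = a differs from P(7) = b.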
See the example in Figure 3.12.

Figure 3.12: "Bouncing ball" cartoon of original Knuth-Morris-Pratt preprocessing. The arrows show the successive assignments to the variable v.

The entire set of sp values are found as follows:

Algorithm SP(P)
  sp_1 = 0;
  For k := 1 to n − 1 do
  begin
    x := P(k + 1);
    v := sp_k;
    While P(v + 1) ≠ x and v ≠ 0 do
      v := sp_v;
    end {while};
    If P(v + 1) = x then
      sp_{k+1} := v + 1
    else
      sp_{k+1} := 0;
  end;

Theorem 3.3.1. Algorithm SP finds all the sp_i(P) values in O(n) time, where n is the length of P.

PROOF Note first that the algorithm consists of two nested loops, a for loop and a while loop. The for loop executes exactly n − 1 times, incrementing the value of k each time. The while loop executes a variable number of times each time it is entered. The work of the algorithm is proportional to the number of times the value of v is assigned. We consider the places where the value of v is assigned and focus on how the value of v changes over the execution of the algorithm. The value of v is assigned once each time the for statement is reached; it is assigned a variable number of times inside the while loop, each time this loop is reached. Hence the number of times v is assigned is n − 1 plus the number of times it is assigned inside the while loop. How many times that can be is the key question. Each assignment of v inside the while loop must decrease the value of v, and each of the n − 1 times v is assigned at the for statement, its value either increases by one or it remains unchanged (at zero). The value of v is initially zero, so the total amount that the value of v can increase (at the for statement) over the entire algorithm is at most n − 1. But since the value of v starts at zero and is never negative, the total amount that the value of v can decrease over the entire algorithm must also be bounded by n − 1, the total amount it can increase. Hence v can be assigned in the while loop at most n − 1 times, and hence the total number of times that the value of v can be assigned is at most 2(n − 1) = O(n), and the theorem is proved.

The sp'_i values can then be derived from the sp_i values in O(n) time. (For this purpose, the nonexistent character P(n + 1) is taken to differ from every character of P.)

Algorithm SP'(P)
  sp'_1 = 0;
  For i := 2 to n do
  begin
    v := sp_i;
    If P(v + 1) ≠ P(i + 1) then
      sp'_i := v
    else
      sp'_i := sp'_v;
  end;

Theorem 3.3.2. Algorithm SP'(P) correctly computes all the sp'_i values in O(n) time.

PROOF The proof is by induction on the value of i. Since sp_1 = 0 and sp'_i ≤ sp_i for all i, then sp'_1 = 0, and the algorithm is correct for i = 1. Now suppose that the value of sp'_i set by the algorithm is correct for all i < k and consider i = k. If P(sp_k + 1) ≠ P(k + 1) then clearly sp'_k is equal to sp_k, since the sp_k-length prefix of P[1..k] satisfies all the needed requirements. Hence in this case, the algorithm correctly sets sp'_k. If P(sp_k + 1) = P(k + 1), then sp'_k < sp_k and, since P[1..sp_k] is a suffix of P[1..k], sp'_k can be expressed as the length of the longest proper prefix of P[1..sp_k] that also occurs as a suffix of P[1..sp_k], with the condition that P(k + 1) ≠ P(sp'_k + 1). But since P(k + 1) = P(sp_k + 1), that condition can be rewritten as P(sp_k + 1) ≠ P(sp'_k + 1). By the induction hypothesis, that value has already been correctly computed as sp'_{sp_k}. So when P(sp_k + 1) = P(k + 1) the algorithm correctly sets sp'_k to sp'_{sp_k}. Because the algorithm only does constant work per position, the total time for the algorithm is O(n).

It is interesting to compare the classical method for computing sp and sp' and the method based on fundamental preprocessing (i.e., on Z values). In the classical method the (weaker) sp values are computed first and then the more desirable sp' values are derived from them, whereas the order is just the opposite in the method based on fundamental preprocessing.
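The two algorithms translate almost line for line into code. The following Python sketch is our illustration, not the book's code; it uses 0-based indexing, so sp[k] here is sp_{k+1} above.

def sp_values(P):
    """Classical Knuth-Morris-Pratt preprocessing (Algorithm SP).

    Returns sp, where sp[k] is the length of the longest proper suffix
    of P[0..k] that is also a prefix of P.
    """
    n = len(P)
    sp = [0] * n
    for k in range(n - 1):              # compute sp[k+1] from sp[0..k]
        x = P[k + 1]
        v = sp[k]
        # Follow the chain sp_v, sp_{sp_v}, ... until a prefix that can
        # be extended by x is found, or the chain bottoms out at zero.
        while v > 0 and P[v] != x:
            v = sp[v - 1]
        sp[k + 1] = v + 1 if P[v] == x else 0
    return sp

def sp_prime_values(P, sp):
    """Derive the stronger sp' values from sp (Algorithm SP')."""
    n = len(P)
    spp = [0] * n
    for i in range(1, n):
        v = sp[i]
        # Keep sp_i only if the next characters differ (or i is the last
        # position, where the nonexistent next character counts as a mismatch).
        if i + 1 >= n or P[v] != P[i + 1]:
            spp[i] = v
        else:
            spp[i] = spp[v - 1] if v > 0 else 0
    return spp

For example, sp_values("abaabab") returns [0, 0, 1, 1, 2, 3, 2], matching the worked values above.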
3.4. Exact matching with a set of patterns

Let P = {P_1, P_2, ..., P_z} be a set of patterns and let n denote the sum of their lengths. There is a more recent exposition of the Aho-Corasick method in [8], where the algorithm is used just as an "acceptor", deciding whether or not there is an occurrence in T of at least one pattern from P. Because we will want to explicitly find all occurrences, that version of the algorithm is too limited to use here.

Definition The keyword tree for set P is a rooted directed tree K satisfying three conditions: 1. each edge is labeled with exactly one character; 2. any two edges out of the same node have distinct labels; and 3. every pattern P_i in P maps to some node v of K such that the characters on the path from the root of K to v exactly spell out P_i, and every leaf of K is mapped to by some pattern in P.

For example, Figure 3.13 shows the keyword tree for the set of patterns {potato, poetry, pottery, science, school}. Clearly, every node in the keyword tree corresponds to a prefix of one of the patterns in P, and every prefix of a pattern maps to a distinct node in the tree. Assuming a fixed-size alphabet, it is easy to construct the keyword tree for P in O(n) time. Define K_i to be the (partial) keyword tree that encodes patterns P_1, ..., P_i of P.

Figure 3.13: Keyword tree for the set of patterns {potato, poetry, pottery, science, school}.

Tree K_1 just consists of a single path of |P_1| edges out of root r. Each edge on this path is labeled with a character of P_1 and when read from the root, these characters spell out P_1. The number 1 is written at the node at the end of this path. To create K_2 from K_1, first find the longest path from root r that matches the characters of P_2 in order. That is, find the longest prefix of P_2 that matches the characters on some path from r. That path either ends by exhausting P_2 or it ends at some node v in the tree where no further match is possible. In the first case, P_2 already occurs in the tree, and so we write the number 2 at the node where the path ends. In the second case, we create a new path out of v, labeled by the remaining (unmatched) characters of P_2, and write number 2 at the end of that path. An example of these two possibilities is shown in Figure 3.14.

Figure 3.14: Pattern P_1 is the string pat. a. The insertion of pattern P_2 when P_2 is pa. b. The insertion when P_2 is party.

In either of the above two cases, K_2 will have at most one branching node (a node with more than one child), and the characters on the two edges out of the branching node will be distinct. We will see that the latter property holds inductively for any tree K_i. That is, at any branching node v in K_i, all edges out of v have distinct labels. In general, to create K_{i+1} from K_i, start at the root of K_i and follow, as far as possible, the (unique) path in K_i that matches the characters in P_{i+1} in order. This path is unique because, at any branching node v of K_i, the characters on the edges out of v are distinct. If pattern P_{i+1} is exhausted (fully matched), then number the node where the match ends with the number i + 1. If a node v is reached where no further match is possible but P_{i+1} is not fully matched, then create a new path out of v labeled with the remaining unmatched part of P_{i+1} and number the endpoint of that path with the number i + 1. During the insertion of P_{i+1}, the work done at any node is bounded by a constant, since the alphabet is finite and no two edges out of a node are labeled with the same character. Hence for any i, it takes O(|P_{i+1}|) time to insert pattern P_{i+1} into K_i, and so the time to construct the entire keyword tree is O(n).
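The insertion procedure translates directly into code using nested dictionaries for nodes. This sketch is our illustration, not the book's:

def build_keyword_tree(patterns):
    """Build a keyword tree (trie) for a list of patterns.

    Each node is a dict mapping a character to a child node; the
    reserved key "out" records the (1-based) pattern numbers that
    end at the node.
    """
    root = {}
    for number, pattern in enumerate(patterns, start=1):
        node = root
        # Follow the unique existing path as far as it matches,
        # then extend with new edges for the unmatched remainder.
        for ch in pattern:
            node = node.setdefault(ch, {})
        node.setdefault("out", []).append(number)
    return root

# Example: the tree for {potato, poetry, pottery, science, school}.
tree = build_keyword_tree(["potato", "poetry", "pottery", "science", "school"])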
To search for occurrences of patterns of P starting at position l of T, follow from the root of K the unique path whose edge labels match T(l), T(l + 1), ... as far as possible. Numbered nodes along that path indicate patterns in P that start at position l. For a fixed l, the traversal of a path of K takes time proportional to the minimum of m and n, so by successively incrementing l from 1 to m and traversing K for each l, the exact set matching problem can be solved in O(nm) time. We will reduce this to O(n + m + k) time below, where k is the number of occurrences.

The dictionary problem

Without any further embellishments, this simple keyword tree algorithm efficiently solves a special case of set matching, called the dictionary problem. In the dictionary problem, a set of strings (forming a dictionary) is initially known and preprocessed. Then a sequence of individual strings will be presented; for each one, the task is to find if the presented string is contained in the dictionary. The utility of a keyword tree is clear in this context. The strings in the dictionary are encoded into a keyword tree K, and when an individual string is presented, a walk from the root of K determines if the string is in the dictionary. In this special case of exact set matching, the problem is to determine if the text T (an individual presented string) completely matches some string in P.

We now return to the general set matching problem of determining which strings in P are contained in text T.

Definition For any node v of K, define lp(v) to be the length of the longest proper suffix of string L(v) that is a prefix of some pattern in P. (Here L(v) denotes the string of characters on the path from the root of K to v.)

For example, consider the set of patterns P = {potato, tattoo, theater, other} and its keyword tree shown in Figure 3.16. Let v be the node labeled with the string potat. Since tat is a prefix of tattoo, and it is the longest proper suffix of potat that is a prefix of any pattern in P, lp(v) = 3.

Lemma 3.4.1. Let α be the lp(v)-length suffix of string L(v). Then there is a unique node in the keyword tree that is labeled by string α.

PROOF K encodes all the patterns in P and, by definition, the lp(v)-length suffix of L(v) is a prefix of some pattern in P. So there must be a path from the root in K that spells out string α. By the construction of K, no two paths spell out the same string, so this path is unique and the lemma is proved.

Definition For a node v of K let n_v be the unique node in K labeled with the suffix of L(v) of length lp(v). When lp(v) = 0 then n_v is the root of K.

Definition We call the ordered pair (v, n_v) a failure link.

Figure 3.16 shows the keyword tree for P = {potato, tattoo, theater, other}. Failure links are shown as pointers from every node v to node n_v where lp(v) > 0. The other failure links point to the root and are not shown.

To understand the use of the function v → n_v, suppose we have traversed the tree to node v but cannot continue (i.e., character T(c) does not occur on any edge out of v). We know that string L(v) occurs in T starting at position l and ending at position c − 1. By the definition of the function v → n_v, it is guaranteed that string L(n_v) matches string T[c − lp(v)..c − 1]. That is, the algorithm could traverse K from the root to node n_v and be sure to match all the characters on this path with the characters in T starting from position c − lp(v). So when lp(v) > 0, l can be increased to c − lp(v), c can be left unchanged, and there is no need to actually make the comparisons on the path from the root to node n_v. Instead, the comparisons should begin at node n_v, comparing character c of T against the characters on the edges out of n_v.

For example, consider the text T = xxpotattooxx and the keyword tree shown in Figure 3.16. When l = 3, the text matches the string potat but mismatches at the next character. At this point c = 8, and the failure link from the node v labeled potat points to the node n_v labeled tat, with lp(v) = 3. So l is incremented to 5 = 8 − 3, and the next comparison is between character T(8) and character t on the edge below tat.

With this algorithm, when no further matches are possible, l may increase by more than one, avoiding the reexamination of characters of T to the left of c, and yet we may be sure that every occurrence of a pattern in P that begins at character c − lp(v) of T will be correctly detected. Of course (just as in Knuth-Morris-Pratt), we have to argue that there are no occurrences of patterns of P starting strictly between the old l and c − lp(v) in T, and thus l can be incremented to c − lp(v) without missing any occurrences. With the given assumption that no pattern in P is a proper substring of another one, that argument is almost identical to the proof of Theorem 2.3.2 in the analysis of Knuth-Morris-Pratt, and it is left as an exercise. When lp(v) = 0, then l is increased to c and the comparisons begin at the root of K. The only case remaining is when the mismatch occurs at the root. In this case, c must be incremented by 1 and comparisons again begin at the root.

Therefore, the use of the function v → n_v certainly accelerates the naive search for patterns of P. But does it improve the worst-case running time? By the same sort of argument used to analyze the search time (not the preprocessing time) of Knuth-Morris-Pratt (Theorem 2.3.3), it is easily established that the search time for Aho-Corasick is O(m). We leave this as an exercise. However, we have yet to show how to precompute the function v → n_v in linear time.
Figure 3.17: Keyword tree used to compute the failure function for node v.

Suppose v' is the parent of v in K and x is the character on the edge (v', v). Just as in the classic preprocessing for Knuth-Morris-Pratt, L(n_v) must be a suffix of L(n_{v'}) (not necessarily proper) followed by character x. So the first thing to check is whether there is an edge (n_{v'}, w') out of node n_{v'} labeled with character x. If that edge does exist, then n_v is node w' and we are done. If there is no such edge out of n_{v'} labeled with character x, then L(n_v) is a proper suffix of L(n_{v'}) followed by x. So we examine n_{n_{v'}} next to see if there is an edge out of it labeled with character x. (Node n_{n_{v'}} is known because n_{v'} is fewer edges from the root.) Continuing in this way, with exactly the same justification as in the classic preprocessing for Knuth-Morris-Pratt, we arrive at the following algorithm for computing n_v for a node v:

Algorithm n_v
  v' is the parent of v in K; x is the character on the edge (v', v);
  w := n_{v'};
  While there is no edge out of w labeled x and w ≠ r do
    w := n_w;
  end {while};
  If there is an edge (w, w') out of w labeled x then
    n_v := w';
  else
    n_v := r;

Note the importance of the assumption that n_u is already known for every node u that is k or fewer characters from r. To find n_v for every node v, repeatedly apply the above algorithm to the nodes in K in a breadth-first manner starting at the root.

Theorem 3.4.1. Let n be the total length of all the patterns in P. The total time used by Algorithm n_v when applied to all nodes in K is O(n).

PROOF The argument is a direct generalization of the argument used to analyze time in the classic preprocessing for Knuth-Morris-Pratt. Consider a single pattern P in P of length t and its path in K. We will analyze the time used in the algorithm to find the failure links for the nodes on this path, as if the path shares no nodes with paths for any other pattern in P. That analysis will overcount the actual amount of work done by the algorithm, but it will still establish a linear time bound.

The key is to see how lp(v) varies as the algorithm is executed on each successive node v down the path for P. When v is one edge from the root, then lp(v) is zero. Now let v be an arbitrary node on the path for P and let v' be the parent of v. Clearly, lp(v) ≤ lp(v') + 1, so over all executions of Algorithm n_v for nodes on the path for P, lp() is increased by a total of at most t. Now consider how lp() can decrease. During the computation of n_v for any node v, w starts at n_{v'} (and so has initial node depth equal to lp(v')). However, during the computation of n_v, the node depth of w decreases every time an assignment to w is made (inside the while loop). When n_v is finally set, lp(v) equals the current depth of w, so if w is assigned k times, then lp(v) ≤ lp(v') − k and lp() decreases by at least k. Now lp() is never negative, and during all the computations along path P, lp() can be increased by a total of at most t. It follows that over all the computations done for nodes on the path for P, the number of assignments made inside the while loop is at most t. The total time used is proportional to the number of assignments inside the loop, and hence all failure links on the path for P are set in O(t) time.

Repeating this analysis for every pattern in P yields the result that all the failure links are established in time proportional to the sum of the pattern lengths in P (i.e., in O(n) total time).

Lemma 3.4.2. Suppose in a keyword tree K there is a directed path of failure links (possibly empty) from a node v to a node that is numbered with pattern i. Then pattern P_i must occur in T ending at position c (the current character) whenever node v is reached during the search phase of the Aho-Corasick algorithm.

For example, Figure 3.18 shows the keyword tree for P = {potato, pot, tatter, at} along with some of the failure links. Those links form a directed path from the node v labeled potat to the numbered node labeled at. If the traversal of K reaches v then, by the lemma, pattern at occurs in T ending at the current position c. Conversely,

Lemma 3.4.3. Suppose a node v has been reached during the algorithm. Then pattern P_i occurs in T ending at position c only if v is numbered i or there is a directed path of failure links from v to the node numbered i.

Figure 3.18: Keyword tree showing a directed path from potat to at through tat.

The full search algorithm can now be stated:

Algorithm full AC search
  l := 1;
  c := 1;
  w := root;
  repeat
    While there is an edge (w, w') labeled T(c)
    begin
      if w' is numbered by pattern i, or there is a directed path of failure links from w' to a node numbered with i, then
        report that P_i occurs in T ending at position c;
      w := w' and c := c + 1;
    end;
    w := n_w and l := c − lp(w);
  until c > m;

Implementation

Lemmas 3.4.2 and 3.4.3 specify at a high level how to find all occurrences of the patterns in the text, but specific implementation details are still needed. The goal is to be able to build the keyword tree, determine the function v → n_v, and execute the full AC search algorithm all in O(n + m + k) time. To do this we add an additional pointer, called the output link, to each node of K. The output link (if there is one) at a node v points to that numbered node (a node associated with the end of a pattern in P) other than v that is reachable from v by the fewest failure links. The output links can be determined in O(n) time during the running of the preprocessing algorithm n_v. When the n_v value is determined, the possible output link from node v is determined as follows: If n_v is a numbered node then the output link from v points to n_v; if n_v is not numbered but has an output link to a node w, then the output link from v points to w; otherwise v has no output link. In this way, an output link points only to a numbered node, and the path of output links from any node v passes through all the numbered nodes reachable from v via a path of failure links. For example, in Figure 3.18 the nodes for tat and potat will have their output links set to the node for at. The work of adding output links adds only constant time per node, so the overall time for algorithm n_v remains O(n).

With the output links, all occurrences in T of patterns of P can be detected in O(m + k) time. As before, whenever a numbered node is encountered during the full AC search, an occurrence is detected and reported. But additionally, whenever a node v is encountered that has an output link from it, the algorithm must traverse the path of output links from v, reporting an occurrence ending at position c of T for each link in the path. When that path traversal reaches a node with no output link, it returns along the path to node v and continues executing the full AC search algorithm. Since no character comparisons are done during any output link traversal, over both the construction and search phases of the algorithm the number of character comparisons is still bounded by O(n + m). Further, even though the number of traversals of output links can exceed that linear bound, each traversal of an output link leads to the discovery of a pattern occurrence, so the total time for the algorithm is O(n + m + k), where k is the total number of occurrences. In summary we have,

Theorem 3.4.2. If P is a set of patterns with total length n and T is a text of total length m, then one can find all occurrences in T of patterns from P in O(n) preprocessing time plus O(m + k) search time, where k is the number of occurrences. This is true even without assuming that the patterns in P are substring free.
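All of the pieces, keyword tree, failure links computed breadth-first by Algorithm n_v, and output links, fit in a short program. The following Python sketch is our illustration, not the book's code; node layout and names are ours.

from collections import deque

def build_ac(patterns):
    """Aho-Corasick preprocessing: keyword tree, failure links, output links.

    Each node is a dict: 'go' maps characters to children, 'fail' is the
    failure link n_v, 'num' is the pattern number ending here (if any),
    and 'out' is the output link to the nearest numbered node reachable
    by failure links.
    """
    def new_node():
        return {'go': {}, 'fail': None, 'num': None, 'out': None}
    root = new_node()
    for i, pat in enumerate(patterns):
        node = root
        for ch in pat:
            node = node['go'].setdefault(ch, new_node())
        node['num'] = i
    queue = deque()
    for child in root['go'].values():
        child['fail'] = root
        queue.append(child)
    while queue:                      # breadth-first, so n_{v'} is known before v
        v = queue.popleft()
        f = v['fail']
        # Output link: nearest numbered node on the failure-link path.
        v['out'] = f if f['num'] is not None else f['out']
        for x, w in v['go'].items():
            # Follow failure links until an edge labeled x exists (or root).
            f = v['fail']
            while x not in f['go'] and f is not root:
                f = f['fail']
            nxt = f['go'].get(x)
            w['fail'] = nxt if nxt is not None else root
            queue.append(w)
    return root

def ac_search(text, patterns):
    """Return (0-based end position, pattern number) for every occurrence."""
    root = build_ac(patterns)
    hits, w = [], root
    for c, ch in enumerate(text):
        while ch not in w['go'] and w is not root:
            w = w['fail']
        w = w['go'].get(ch, root)
        v = w
        while v is not None and v is not root:
            if v['num'] is not None:
                hits.append((c, v['num']))
            v = v['out']             # walk output links for contained patterns
    return hits

print(ac_search("xxpotattooxx", ["potato", "tattoo", "theater", "other"]))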
In a later chapter (Section 6.5) we will discuss further implementation issues that affect the practical performance of both the Aho-Corasick method and suffix tree methods.

3.5. Three applications of exact set matching

3.5.1. Matching against a DNA or protein library of known patterns

There are a number of applications in molecular biology where a relatively stable library of interesting or distinguishing DNA or protein substrings has been constructed. The sequence-tagged sites (STSs) and expressed sequence tags (ESTs) provide our first important illustration.

Sequence-tagged-sites

The concept of a sequence-tagged-site (STS) is one of the most useful by-products that has come out of the Human Genome Project [111, 234, 399]. Without going into full biological detail, an STS is intuitively a DNA string of length 200-300 nucleotides whose right and left ends, of length 20-30 nucleotides each, occur only once in the entire genome [111, 317]. Thus each STS occurs uniquely in the DNA of interest. Although this definition is not quite correct, it is adequate for our purposes. An early goal of the Human Genome Project was to select and map (locate on the genome) a set of STSs such that any substring in the genome of length 100,000 or more contains at least one of those STSs. A more refined goal is to make a map containing ESTs (expressed sequence tags), which are STSs that come from genes rather than parts of intergene DNA. ESTs are obtained from mRNA and cDNA (see Section 11.3.3 for more detail on cDNA) and typically reflect the protein coding parts of a gene sequence.

With an STS map, one can locate on the map any sufficiently long string of anonymous but sequenced DNA - the problem is just one of finding which STSs are contained in the anonymous DNA. Thus with STSs, map location of anonymous sequenced DNA becomes a string problem, an exact set matching problem. The STSs or the ESTs provide a computer-based set of indices to which new DNA sequences can be referenced. Presently, hundreds of thousands of STSs and tens of thousands of ESTs have been found and placed in computer databases [234]. Note that the total length of all the STSs and ESTs is very large compared to the typical size of an anonymous piece of DNA. Consequently, the keyword tree and the Aho-Corasick method (with a search time proportional to the length of the anonymous DNA) are of direct use in this problem, for they allow very rapid identification of STSs or ESTs that occur in newly sequenced DNA. Of course, there may be some errors in either the STS map or in the newly sequenced DNA, causing trouble for this approach (see Section 16.5 for a discussion of STS maps). But in this application, the number of errors should be a small percentage of the length of the STS, and that will allow more sophisticated exact (and inexact) matching methods to succeed. We will describe some of these in Sections 7.8.3, 9.4, and 12.2 of the book.

A related application comes from the "BAC-PAC" proposal [442] for sequencing the human genome (see page 418). In that method, 600,000 strings (patterns) of length 500 would first be obtained and entered into the computer. Thousands of times thereafter, one would look for occurrences of any of these 600,000 patterns in text strings of length 150,000. Note that the total length of the patterns is 300 million characters, which is two thousand times as large as the typical text to be searched.
3.5.2. Exact matching with wild cards

where CYS is the amino acid cysteine and HIS is the amino acid histidine. Another important transcription factor is the Leucine Zipper, which consists of four to seven leucines, each separated by six wild card amino acids. If the number of permitted wild cards is unbounded, it is not known if the problem can be solved in linear time. However, if the number of wild cards is bounded by a fixed constant (independent of the size of P) then the following method, based on exact set pattern matching, runs in linear time:

Exact matching with wild cards
0. Let C be a vector of length |T| initialized to all zeros.
1. Let P = {P_1, P_2, ..., P_k} be the (multi-)set of maximal substrings of P that do not contain any wild cards. Let l_1, l_2, ..., l_k be the starting positions in P of each of these substrings.
(For example, if P = ab##c#ab## then P = {ab, c, ab} and l_1 = 1, l_2 = 5, l_3 = 7.)
2. Using the Aho-Corasick algorithm (or the suffix tree approach to be discussed later), find for each string P_i in P all starting positions of P_i in text T. For each starting location j of P_i in T, increment the count in cell j − l_i + 1 of C by one.
(For example, if the second copy of string ab is found in T starting at position 18, then cell 12 of C is incremented by one.)
3. Scan vector C for any cell with value k. There is an occurrence of P in T starting at position p if and only if C(p) = k.

Correctness and complexity of the method

Correctness Clearly, there is an occurrence of P in T starting at position p if and only if, for each i, subpattern P_i ∈ P occurs at position j = p + l_i − 1 of T. The above method uses this idea in reverse. If pattern P_i ∈ P is found to occur starting at position j of T, and pattern P_i starts at position l_i in P, then this provides one "witness" that P occurs in T starting at position p = j − l_i + 1. Hence P occurs in T starting at p if and only if similar witnesses for position p are found for each of the k strings in P. The algorithm counts, at position p, the number of witnesses that observe an occurrence of P beginning at p. This correctly determines whether P occurs starting at p because each string in P can cause at most one increment to cell p of C.

Complexity The time used by the Aho-Corasick algorithm to build the keyword tree for P is O(n). The time to search for occurrences in T of patterns from P is O(m + z), where |T| = m and z is the number of occurrences. We treat each pattern in P as being distinct even if there are multiple copies of it in P. Then whenever an occurrence of a pattern from P is found in T, exactly one cell in C is incremented; furthermore, a cell can be incremented to at most k. Hence z must be bounded by km, and the algorithm runs in O(km) time. Although the number of character comparisons used is just O(m), km need not be O(m), and hence the number of times C is incremented may grow faster than O(m), leading to a nonlinear O(km) time bound. But if k is assumed to be bounded (independent of |P|), then the method does run in linear time. In summary,

Theorem 3.5.1. If the number of wild cards in pattern P is bounded by a constant, then the exact matching problem with wild cards in the pattern can be solved in O(n + m) time.

Later, in Section 9.3, we will return to the problem of wild cards when they occur in either the pattern, text, or both.
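The counting scheme translates directly into code. In this sketch (ours, with a naive scan standing in for Aho-Corasick so that it stays self-contained) the vector C is indexed from 0:

def wildcard_match(P, T, wild="#"):
    """Return the 0-based start positions where P (with wild cards) occurs in T."""
    # Step 1: maximal wild-card-free substrings of P with their start offsets.
    pieces, i = [], 0
    while i < len(P):
        if P[i] != wild:
            j = i
            while j < len(P) and P[j] != wild:
                j += 1
            pieces.append((i, P[i:j]))       # (0-based l_i - 1, substring)
            i = j
        else:
            i += 1
    k = len(pieces)
    C = [0] * len(T)                         # step 0: witness counts
    # Step 2: each occurrence of a piece at j witnesses a start at j - l.
    for l, piece in pieces:
        for j in range(len(T) - len(piece) + 1):
            if T[j:j + len(piece)] == piece:
                start = j - l
                if 0 <= start <= len(T) - len(P):
                    C[start] += 1
    # Step 3: P occurs at p iff all k pieces witnessed p.
    return [p for p in range(len(T) - len(P) + 1) if C[p] == k]

# Example from the text: P = ab##c#ab## has pieces ab, c, ab at l = 1, 5, 7.
print(wildcard_match("ab##c#ab##", "xxabxacxabyyzz"))  # [2]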
more complex techniques of the type we will examine in Part III of the book. So for now, we view two-dimensional exact matching as an illustration of how exact set matching can be used in more complex settings and as an introduction to more realistic two-dimensional problems. The method presented follows the basic approach given in [44] and [66]. Many additional methods have since been presented that improve on those papers in various ways. However, because the problem as stated is somewhat unrealistic, we will not discuss the newer, more complex methods. For a sophisticated treatment of two-dimensional matching see [22] and [169].

Let m be the total number of points in T, let n be the number of points in P, and let n' be the number of rows in P. Just as in exact string matching, we want to find the smaller picture in the larger one in O(n + m) time, where O(nm) is the time for the obvious approach. Assume for now that each of the rows of P is distinct; later we will relax this assumption.

The method is divided into two phases. In the first phase, search for all occurrences of each of the rows of P among the rows of T. To do this, add an end-of-row marker (some character not in the alphabet) to each row of T and concatenate these rows together to form a single text string T' of length O(m). Then, treating each row of P as a separate pattern, use the Aho-Corasick algorithm to search for all occurrences in T' of any row of P. Since P is rectangular, all rows have the same width, and so no row is a proper substring of another and we can use the simpler version of Aho-Corasick discussed in Section 3.4.2. Hence the first phase identifies all occurrences of complete rows of P in complete rows of T and takes O(n + m) time. Whenever an occurrence of row i of P is found starting at position (p, q) of T, write the number i in position (p, q) of another array M with the same dimensions as T. Because each row of P is assumed to be distinct and because P is rectangular, at most one number will be written in any cell of M.

In the second phase, scan each column of M, looking for an occurrence of the string 1, 2, ..., n' in consecutive cells in a single column. For example, if this string is found in column 6, starting at row 12 and ending at row n' + 11, then P occurs in T when its upper left corner is at position (6, 12). Phase two can be implemented in O(n' + m) = O(n + m) time by applying any linear-time exact matching algorithm to each column of M.

This gives an O(n + m) time solution to the two-dimensional exact set matching problem. Note the similarity between this solution and the solution to the exact matching problem with wild cards discussed in the previous section. A distinction will be discussed in the exercises.

Now suppose that the rows of P are not all distinct. Then, first find all identical rows and give them a common label (this is easily done during the construction of the keyword tree for the row patterns). For example, if rows 3, 6, and 10 are the same, then we might give them all the label of 3. We do a similar thing for any other rows that are identical. Then, in phase one, only look for occurrences of row 3, and not rows 6 and 10. This ensures that a cell of M will have at most one number written in it during phase one. In phase two, don't look for the string 1, 2, 3, ..., n' in the columns of M, but rather for a string where 3 replaces 6 and 10, etc. It is easy to verify that this approach is correct and that it takes just O(n + m) time. In summary,

Theorem 3.5.2. If T and P are rectangular pictures with m and n cells, respectively, then all exact occurrences of P in T can be found in O(n + m) time, improving upon the naive method, which takes O(nm) time.
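A compact sketch of the two-phase method, with a naive row scan standing in for the Aho-Corasick search of phase one (the function name and the stand-in are ours, not the book's):

def two_d_match(P, T):
    """Upper-left corners (row, col) of occurrences of picture P in picture T.

    P and T are lists of equal-length strings. Phase one records in M the
    label of any row of P starting at each cell of T; identical rows of P
    share one label. Phase two scans columns of M for P's label sequence.
    """
    nrows, width = len(P), len(P[0])
    label = {}
    labels = [label.setdefault(row, i) for i, row in enumerate(P)]
    M = [[None] * len(row) for row in T]
    for r, trow in enumerate(T):                 # phase one
        for c in range(len(trow) - width + 1):
            piece = trow[c:c + width]
            if piece in label:
                M[r][c] = label[piece]
    hits = []
    for r in range(len(T) - nrows + 1):          # phase two: column scan
        for c in range(len(T[0]) - width + 1):
            if all(M[r + i][c] == labels[i] for i in range(nrows)):
                hits.append((r, c))
    return hits

# Example: a 2x2 picture inside a 3x4 picture.
print(two_d_match(["ab", "cd"], ["xabx", "xcdx", "xxxx"]))  # [(0, 1)]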
3.6. Regular expression pattern matching

Every string specified by this regular expression has ten positions, which are separated by a dash. Each capital letter specifies a single amino acid, and a group of amino acids enclosed by brackets indicates that exactly one of those amino acids must be chosen. A small x indicates that any one of the twenty amino acids from the protein alphabet can be chosen for that position. This regular expression describes 192,000 amino acid strings, but only a few of these actually appear in any known proteins. For example, ENLSSEDEEL is specified by the regular expression and is found in human granin proteins.

(Note that in the context of regular expressions, the meaning of the word "pattern" is different from its previous and general meaning in this book.)

Definition A single character from Σ is a regular expression. The symbol ε is a regular expression. A regular expression followed by another regular expression is a regular expression. Two regular expressions separated by the symbol "+" form a regular expression. A regular expression enclosed in parentheses is a regular expression. A regular expression enclosed in parentheses and followed by the symbol "*" is a regular expression. The symbol "*" is called the Kleene closure.

These recursive rules are simple to follow, but may need some explanation. The symbol ε represents the empty string (i.e., the string of length zero). If R is a parenthesized regular expression, then R* means that the expression R can be repeated any number of times (including zero times). The inclusion of parentheses as part of a regular expression (outside of Σ) is not standard, but is closer to the way that regular expressions are actually specified in many applications. Note that the example given above in PROSITE format does not conform to the present definition but can easily be converted to do so. As an example, let Σ be the alphabet of lowercase English characters. Then R = (a + c + r)ykk(p + q)*vdt(l + z + ε)(p + q) is a regular expression over Σ, and S = aykkpqppvdtpq is a string specified by R. To specify S, the subexpression (p + q) of R was repeated four times, and the empty string ε was the choice specified by the subexpression (l + z + ε).

It is very useful to represent a regular expression R by a directed graph G(R) (usually called a nondeterministic finite state automaton). An example is shown in Figure 3.19. The graph has a start node s and a termination node t, and each edge is labeled with a single symbol from Σ ∪ {ε}. Each s to t path in G(R) specifies a string by concatenating the characters of Σ that label the edges of the path. The set of strings specified by all such paths is exactly the set of strings specified by the regular expression R. The rules for constructing G(R) from R are simple and are left as an exercise. It is easy to show that if a regular expression R has n symbols, then G(R) can be constructed using at most 2n edges. The details are left as an exercise and can be found in [10] and [8].

Figure 3.19: The directed graph G(R) for a regular expression R.

Definition A substring T' of string T matches the regular expression R if there is an s to t path in G(R) that specifies T'.

Searching for matches

To search for a substring in T that matches the regular expression R, we first consider the simpler problem of determining whether some (unspecified) prefix of T matches R. Let N(0) be the set of nodes consisting of node s plus all nodes of G(R) that are reachable from node s by traversing edges labeled ε. In general, a node v is in set N(i), for i > 0, if v can be reached from some node in N(i − 1) by traversing an edge labeled T(i) followed by zero or more edges labeled ε. This gives a constructive rule for finding set N(i) from set N(i − 1) and character T(i). It easily follows by induction on i that a node v is in N(i) if and only if there is a path in G(R) from s that ends at v and generates the string T[1..i]. Therefore, prefix T[1..i] matches R if and only if N(i) contains node t.

Given the above discussion, to find all prefixes of T that match R, compute the sets N(i) for i from 0 to m, the length of T. If G(R) contains e edges, then the time for this algorithm is O(me), where m is the length of the text string T. The reason is that each iteration i [finding N(i) from N(i − 1) and character T(i)] can be implemented to run in O(e) time (see Exercise 29). To search for a nonprefix substring of T that matches R, simply search for a prefix of T that matches the regular expression Σ*R, where Σ* represents any number of repetitions (including zero) of any character in Σ. With this detail, we now have the following:

Theorem 3.6.1. If T is of length m, and the regular expression R contains n symbols, then it is possible to determine whether T contains a substring matching R in O(nm) time.
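The computation of the sets N(i) is easy to simulate once a graph G(R) is given. The sketch below is our illustration; the tiny example graph encodes R = ab* by hand, since the construction of G(R) is left as an exercise. Edges with label None play the role of ε edges.

def eps_closure(nodes, edges):
    """All nodes reachable from `nodes` via ε edges. `edges` maps a node
    to a list of (label, node) pairs; label None represents ε."""
    stack, seen = list(nodes), set(nodes)
    while stack:
        v = stack.pop()
        for label, w in edges.get(v, []):
            if label is None and w not in seen:
                seen.add(w)
                stack.append(w)
    return seen

def matching_prefixes(edges, s, t, T):
    """Yield every i such that prefix T[1..i] matches the regular expression
    whose graph G(R) has start s, end t, and edge list `edges`.
    Computes N(0), N(1), ..., N(m) exactly as described above."""
    N = eps_closure({s}, edges)
    if t in N:
        yield 0
    for i, ch in enumerate(T, start=1):
        step = {w for v in N for (label, w) in edges.get(v, []) if label == ch}
        N = eps_closure(step, edges)
        if t in N:
            yield i

# A hypothetical graph for R = ab*: s -a-> u, u -ε-> t, u -b-> u.
edges = {"s": [("a", "u")], "u": [(None, "t"), ("b", "u")]}
print(list(matching_prefixes(edges, "s", "t", "abbbx")))  # [1, 2, 3, 4]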
3.7. Exercises
1. Evaluate empirically the speed of the Boyer-Moore method against the Apostolico-Giancarlo method under different assumptions about the text and the pattern. These assumptions should include the size of the alphabet, the "randomness" of the text or pattern, the level of periodicity of the text or pattern, etc.

2. In the Apostolico-Giancarlo method, array M is of size m, which may be large. Show how to modify the method so that it runs in the same time, but in place of M uses an array of size n.

3. In the Apostolico-Giancarlo method, it may be better to compare the characters first and then examine M and N if the two characters match. Evaluate this idea both theoretically and empirically.

4. In the Apostolico-Giancarlo method, M(j) is set to be a number less than or equal to the length of the (right-to-left) match of P and T starting at position j of T. Find examples where the algorithm sets the value to be strictly less than the length of the match. Now, since the algorithm learns the exact location of the mismatch in all cases, M(j) could always be set to the full length of the match, and this would seem to be a good thing to do. Argue that this change would result in a correct simulation of Boyer-Moore. Then explain why this was not done in the algorithm. Hint: It's the time bound.

5. Prove Lemma 3.2.2 showing the equivalence of the two definitions of semiperiodic strings.

6. For each of the n prefixes of P, we want to know whether the prefix P[1..i] is a periodic string. That is, for each i we want to know the largest k > 1 (if there is one) such that P[1..i] can be written as α^k for some string α. Of course, we also want to know the period. Show how to determine this for all n prefixes in time linear in the length of P. Hint: Z-algorithm.

7. Solve the same problem as above but modified to determine whether each prefix is semiperiodic and with what period. Again, the time should be linear.

8. By being more careful in the bookkeeping, establish the constant in the O(m) bound from Cole's linear-time analysis of the Boyer-Moore algorithm.

9. Show where Cole's worst-case bound breaks down if only the weak Boyer-Moore shift rule is used. Can the argument be fixed, or is the linear time bound simply untrue when only the weak rule is used? Consider the example of T = abababababababababab and P = xaaaaaaaaa without also using the bad character rule.

10. Similar to what was done in Section 1.5, show that applying the classical Knuth-Morris-Pratt preprocessing method to the string P$T gives a linear-time method to find all occurrences of P in T. In fact, the search part of the Knuth-Morris-Pratt algorithm (after the preprocessing of P is finished) can be viewed as a slightly optimized version of the Knuth-Morris-Pratt preprocessing algorithm applied to the T part of P$T. Make this precise, and quantify the utility of the optimization.

11. Using the assumption that P is substring free (i.e., that no pattern P_i ∈ P is a substring of another pattern P_j ∈ P), complete the correctness proof of the Aho-Corasick algorithm. That is, prove that if no further matches are possible at a node v, then l can be set to c − lp(v) and the comparisons resumed at node n_v without missing any occurrences in T of patterns from P.

12. Prove that the search phase of the Aho-Corasick algorithm runs in O(m) time if no pattern in P is a proper substring of another, and otherwise in O(m + k) time, where k is the total number of occurrences.

13. The Aho-Corasick algorithm can have the same problem that the Knuth-Morris-Pratt algorithm
14. Give an example showing that k, the number of occurrences in T of patterns in set P, can grow faster than O(n + m). Be sure you account for the input size n. Try to make the growth as large as possible.

15. Prove Lemmas 3.4.2 and 3.4.3, which relate to the case of patterns that are not substring free.

16. The time analysis in the proof of Theorem 3.4.1 separately considers the path in K for each pattern P in P. This results in an overcount of the time actually used by the algorithm. Perform the analysis more carefully to relate the running time of the algorithm to the number of nodes in K.

17. Discuss the problem (and solution if you see one) of using the Aho-Corasick algorithm when a. wild cards are permitted in the text but not in the pattern, and b. wild cards are permitted in both the text and pattern.

18. Since the nonlinear time behavior of the wild card algorithm is due to duplicate copies of strings in P, and such duplicates can be found and removed in linear time, it is tempting to "fix up" the method by first removing duplicates from P. That approach is similar to what is done in the two-dimensional string matching problem when identical rows were first found and given a single label. Consider this approach and try to use it to obtain a linear-time method for the wild card problem. Does it work, and if not what are the problems?

19. Show how to modify the wild card method by replacing array C (which is of length m > n) by a list of length n, while keeping the same running time.

20. In the wild card problem we first assumed that no pattern in P is a substring of another one, and then we extended the algorithm to the case when that assumption does not hold. Could we instead simply reduce the case when substrings of patterns are allowed to the case when they are not? For example, perhaps we just add a new symbol to the end of each string in P that appears nowhere else in the patterns. Does it work? Consider both correctness and complexity issues.

21. Suppose that the wild card can match any length substring, rather than just a single character. What can you say about exact matching with these kinds of wild cards in the pattern, in the text, or in both?

22. Another approach to handling wild cards in the pattern is to modify the Knuth-Morris-Pratt or Boyer-Moore algorithms, that is, to develop shift rules and preprocessing methods that can handle wild cards in the pattern. Does this approach seem promising? Try it, and discuss the problems (and solutions if you see them).

23. Give a complete proof of the correctness and O(n + m) time bound for the two-dimensional matching method described in the text (Section 3.5.3).

24. Suppose in the two-dimensional matching problem that Knuth-Morris-Pratt is used once for each pattern in P, rather than Aho-Corasick being used. What time bound would result?

25. Show how to extend the two-dimensional matching method to the case when the bottom of the rectangular pattern is not parallel to the bottom of the large picture, but the orientation of the two bottoms is known. What happens if the pattern is not rectangular?

26. Perhaps we can omit phase two of the two-dimensional matching method as follows: Keep a counter at each cell of the large picture. When we find that row i of the small picture occurs in row j of the large picture starting at position (i', j), increment the counter for cell
(i', j − i + 1). Then declare that P occurs in T with upper left corner in any cell whose counter becomes n' (the number of rows of P). Does this work? Hint: No. Why not? Can you fix it and make it run in O(n + m) time?

27. Suppose we have q > 1 small (distinct) rectangular pictures and we want to find all occurrences of any of the q small pictures in a larger rectangular picture. Let n be the total number of points in all the small pictures and m be the number of points in the large picture. Discuss how to solve this problem efficiently. As a simplification, suppose all the small pictures have the same width. Then show that O(n + m) time suffices.

28. Show how to construct the required directed graph G(R) from a regular expression R. The construction should have the property that if R contains n symbols, then G(R) contains at most O(n) edges.

29. Since the directed graph G(R) contains O(n) edges when R contains n symbols, |N(i)| = O(n) for any i. This suggests that the set N(i) can be naively found from N(i − 1) and T(i) in O(ne) time. However, the time stated in the text for this task is O(e). Explain how this reduction of time is achieved. Explain why the improvement is trivial if G(R) contains no ε edges.

30. Explain the importance, or the utility, of ε edges in the graph G(R). If R does not contain the closure symbol "*", can ε edges always be avoided? Biological strings are always finite, hence "*" can always be avoided. Explain how this simplifies the searching algorithm.

31. Wild cards can clearly be encoded into a regular expression, as defined in the text. However, it may be more efficient to modify the definition of a regular expression to explicitly include the wild card symbol. Develop that idea and explain how wild cards can be efficiently handled by an extension of the regular expression pattern matching algorithm.

32. PROSITE patterns often specify the number of times that a substring can repeat as a finite range of numbers. For example, CD(2,4) indicates that CD can repeat either two, three, or four times. The formal definition of a regular expression does not include such concise range specifications, but finite range specifications can be expressed in a regular expression. Explain how. How much do those specifications increase the length of the expression over the length of the more concise PROSITE expression? Show how such range specifications are reflected in the directed graph for the regular expression (ε edges are permitted). Show that one can still search for a substring of T that matches the regular expression in O(me) time, where m is the length of T and e is the number of edges in the graph.

33. Theorem 3.6.1 states the time bound for determining if T contains a substring that matches a regular expression R. Extend the discussion and the theorem to cover the task of explicitly finding and outputting all such matches. State the time bound as the sum of a term that is independent of the number of matches plus a term that depends on that number.
4

Seminumerical String Matching
Define M to be an n by m + 1 binary matrix whose entry M(i, j) is 1 exactly when the first i characters of P match the i characters of T ending at position j. In other words, M(i, j) is 1 if and only if P[1..i] exactly matches T[j − i + 1..j]. For example, if T = california and P = for, then M(1, 5) = M(2, 6) = M(3, 7) = 1, whereas M(i, j) = 0 for all other combinations of i, j. Essentially, the entries with value 1 in row i of M show all the places in T where a copy of P[1..i] ends, and column j of M shows all the prefixes of P that end at position j of T. Clearly, M(n, j) = 1 if and only if an occurrence of P ends at position j of T; hence computing the last row of M solves the exact matching problem.

For the algorithm to compute M, it first constructs an n-length binary vector U(x) for each character x of the alphabet. U(x) is set to 1 for the positions in P where character x appears. For example, if P = abacdeab then U(a) = 10100010.

Definition Define Bit-Shift(j − 1) as the vector derived by shifting the vector for column j − 1 down by one position and setting the first bit to 1. The bit previously in position n disappears. In other words, Bit-Shift(j − 1) consists of a 1 followed by the first n − 1 bits of column j − 1.

For example, Figure 4.1 shows a column j − 1 before and after the bit-shift.

Figure 4.1: A column j − 1 before and after the Bit-Shift.

The jth column of M is then obtained from column j − 1 by M(j) = Bit-Shift(j − 1) AND U(T(j)), with the zero column of M initialized to all zeros. For example, suppose the seventh column of M is 1 in rows one and three (and zero elsewhere), because prefixes of P of lengths one and three end at position seven of T. The eighth character of T is character a, whose U vector has 1s marking the positions of a in P. ANDing U(a) with the Bit-Shift of column seven yields the correct eighth column of M, and one more such step with the ninth character of T yields the correct ninth column of M.

To see in general why the Shift-And method produces the correct array entries, observe that for any i > 1 the array entry for cell (i, j) should be 1 if and only if the first i − 1 characters of P match the characters of T ending at character j − 1 and character P(i) matches character T(j). The first condition is true when the array entry for cell (i − 1, j − 1) is 1, and the second condition is true when the ith bit of the U vector for character T(j) is 1. By first shifting column j − 1, the algorithm ANDs together entry (i − 1, j − 1) of column j − 1 with entry i of the vector U(T(j)). Hence the algorithm computes the correct entries for array M.
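Because each column is just an n-bit vector, the whole method fits in a few lines when a Python integer plays the role of a column. The sketch below is our illustration, not the book's code; it stores row i of M in bit i − 1 of the integer, so the Bit-Shift becomes a left shift with a new low 1 bit.

def shift_and(P, T):
    """Shift-And exact matching; returns 1-based end positions of P in T."""
    n = len(P)
    U = {}
    for i, ch in enumerate(P):        # U[x] has bit i-1 set iff P(i) = x
        U[ch] = U.get(ch, 0) | (1 << i)
    col, ends = 0, []
    for j, ch in enumerate(T, start=1):
        col = ((col << 1) | 1) & U.get(ch, 0)   # Bit-Shift, then AND with U(T(j))
        if col & (1 << (n - 1)):                # row n set: occurrence ends at j
            ends.append(j)
    return ends

print(shift_and("for", "california"))  # [7]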
agrep extends the Shift-And method to allow mismatches. For a fixed number l, define M^l to be an n by m + 1 binary matrix whose entry M^l(i, j) is 1 if and only if at least i − l of the first i characters of P match the i characters of T ending at position j. That is, M^l(i, j) is the natural extension of the definition of M(i, j) to allow up to l mismatches. Therefore, M^0 is the array M used in the Shift-And method. If M^k(n, j) = 1 then there is an occurrence of P in T ending at position j that contains at most k mismatches. We let M^l(j) denote the jth column of M^l.

In agrep, the user chooses a value of k and then the arrays M, M^1, M^2, ..., M^k are computed. The efficiency of the method depends on the size of k - the larger k is, the slower the method. For many applications, a value of k as small as 3 or 4 is sufficient, and the method is extremely fast.

The zero column of each array is again initialized to all zeros. Then the jth column of M^l is computed by:

M^l(j) = M^{l−1}(j) OR [Bit-Shift(M^l(j − 1)) AND U(T(j))] OR Bit-Shift(M^{l−1}(j − 1)).

Intuitively, this just says that the first i characters of P will match a substring of T ending at position j, with at most l mismatches, if and only if one of the following three conditions holds:

The first i characters of P match a substring of T ending at j, with at most l − 1 mismatches.

The first i − 1 characters of P match a substring of T ending at j − 1, with at most l mismatches, and the next pair of characters in P and T are equal.

The first i − 1 characters of P match a substring of T ending at j − 1, with at most l − 1 mismatches.

It is simple to establish that these recurrences are correct, and over the entire algorithm the number of bit operations is O(knm). As in the Shift-And method, the practical efficiency comes from the fact that the vectors are bit vectors (again of length n) and the operations are very simple - shifting by one position and ANDing bit vectors. Thus when the pattern is relatively small, so that a column of any M^l fits into a few words, and k is also small, agrep is extremely fast.
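The recurrence layers naturally on top of the same bit-vector representation. This Python sketch is our illustration (agrep itself is a separate program); it assumes the recurrence exactly as displayed above.

def agrep_search(P, T, k):
    """Shift-And with up to k mismatches; returns (end_position, smallest l)."""
    n = len(P)
    U = {}
    for i, ch in enumerate(P):
        U[ch] = U.get(ch, 0) | (1 << i)
    cols = [0] * (k + 1)                        # cols[l] is the current column of M^l
    hits = []
    for j, ch in enumerate(T, start=1):
        prev = cols[:]                          # columns j-1 for every l
        u = U.get(ch, 0)
        cols[0] = ((prev[0] << 1) | 1) & u
        for l in range(1, k + 1):
            match = ((prev[l] << 1) | 1) & u    # extend with an equal pair
            mismatch = (prev[l - 1] << 1) | 1   # extend, spending one mismatch
            cols[l] = cols[l - 1] | match | mismatch
        for l in range(k + 1):
            if cols[l] & (1 << (n - 1)):
                hits.append((j, l))
                break
    return hits

print(agrep_search("abc", "abdabc", 1))  # [(3, 1), (6, 0)]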
Definition The matrix MC is an n by m + 1 integer-valued matrix, where entry MC(i, j) is the number of characters of P[1..i] that match T[j − i + 1..j].

A simple algorithm to compute matrix MC generalizes the Shift-And method, replacing the AND operation with the increment-by-one operation. The zero column of MC starts with all zeros, but each MC(i, j) entry now is set to MC(i − 1, j − 1) if P(i) ≠ T(j), and otherwise it is set to MC(i − 1, j − 1) + 1. Any entry with value n in the last row again indicates an occurrence of P in T, but values less than n count the exact number of characters that match for each of the different alignments of P with T. This extension uses O(nm) additions and comparisons, although each addition operation is particularly simple, just incrementing by one.

If we want to compute the entire MC array then O(nm) time is necessary, but the most important information is contained in the last row of MC. For each position j > n in T, the last row indicates the number of characters that match when the right end of P is aligned with character j of T. The problem of finding the last row of MC is called the match-count problem. Match-counts are useful in several problems to be discussed later.
Definition For two strings α and β, V(α, β, i) is the number of characters that match when the start of α is positioned opposite position i of β, and V(α, β) is the vector holding these values over all i.

Clearly, when α = P and β = T the vector V(α, β) contains the information needed for the last row of MC. But it contains more information because we allow the left end of α to be to the left of the left end of β, and we also allow the right end of α to be to the right of the right end of β. Negative numbers specify positions to the left of the left end of β, and positive numbers specify the other positions. For example, when α is aligned with β as follows,

  position: -2 -1  1  2  3  4  5  6  7  8  9
  β:               a  c  c  c  t  g  t  c  c
  α:         a  a  c  t  g  c  c  g

then the left end of α is aligned with position −2 of β. Index i ranges from −n + 1 to m. Notice that when i > m − n, the right end of α is right of the right end of β.

For any fixed i, V(α, β, i) can be directly computed in O(n) time (for any i, just directly count the number of resulting matches and mismatches), so V(α, β) can be computed in O(nm) total time. We now show how to compute V(α, β) in O(m log m) total time by using the Fast Fourier Transform. For most problems of interest, log m << n, so this technique yields a large speedup. Further, there is specialized hardware for FFT that is very fast, suggesting a way to solve these problems quickly with hardware. The solution will work for any alphabet, but it is easiest to explain it on a small alphabet. For concreteness we use the four-letter alphabet a, t, c, g of DNA.

Definition Define V_a(α, β, i) to be the number of matches of character a that occur when the start of string α is positioned opposite position i of string β. V_a(α, β) is the (n + m)-length vector holding these values.

Similar definitions apply for the other three characters. With these definitions,

  V(α, β, i) = V_a(α, β, i) + V_t(α, β, i) + V_c(α, β, i) + V_g(α, β, i)

and

  V(α, β) = V_a(α, β) + V_t(α, β) + V_c(α, β) + V_g(α, β).

The problem then becomes how to compute V_a(α, β, i) for each i. Convert the two strings into binary strings α_a and β_a, respectively, where every occurrence of character a becomes a 1 and all other characters become 0s. For example, let α be acaacggaggtat and β be accacgaag. Then the binary strings α_a and β_a are 1011000100010 and 100100110. To compute V_a(α, β, i), position β_a to start at position i of α_a and count the number of columns where both bits are equal to 1. For example, if i = 3 then we get

  α_a: 1 0 1 1 0 0 0 1 0 0 0 1 0
  β_a:     1 0 0 1 0 0 1 1 0

and V_a(α, β, 3) = 1.

Another way to view this is to consider each space opposite a bit to be a 0 (so both binary strings are the same length), do a bitwise AND operation with the strings, and then add the resulting bits. To formalize this idea, pad the right end of β_a (the longer string) with n additional zeros and pad the right end of α_a with m additional zeros. The two resulting strings then each have length n + m. Also, for convenience, renumber the indices of both strings to run from 0 to n + m − 1. Then

  V_a(α, β, i) = Σ_{j=0}^{n+m−1} α_a(j) × β_a(i + j),

where the indices in the expression are taken modulo n + m. The extra zeros are there to handle the cases when the left end of α is to the left of the left end of β and, conversely, when the right end of α is to the right of the right end of β. Enough zeros were padded so that when the right end of α is right of the right end of β, the corresponding bits in the padded α_a are all opposite zeros. Hence no "illegitimate wraparound" of α and β is possible, and V_a(α, β, i) is correctly computed.

So far, all we have done is to recode the match-count problem, and this recoding doesn't suggest a way to compute V_a(α, β) more efficiently than before the binary coding and padding. This is where correlation and the FFT come in.
Cyclic correlation

Definition Let X and Y be two z-length vectors with real-number components indexed from 0 to z − 1. The cyclic correlation of X and Y is a z-length real vector W(i) = Σ_{j=0}^{z−1} X(j) × Y(i + j), where the indices in the expression are taken modulo z.

Clearly, the problem of computing vector V_a(α, β) is exactly the problem of computing the cyclic correlation of the padded strings α_a and β_a. In detail, X = α_a, Y = β_a, z = n + m, and W = V_a(α, β). Now an algorithm based only on the definition of cyclic correlation would require O(z^2) operations, so again no progress is apparent. But cyclic correlation is a classic problem known to be solvable in O(z log z) time using the Fast Fourier Transform. (The FFT is more often associated with the convolution problem for two vectors, but cyclic correlation and convolution are very similar. In fact, cyclic correlation is solved by reversing one of the input vectors and then computing the convolution of the two resulting vectors.) The FFT method, and its use in the solution of the cyclic correlation problem, is beyond the scope of this book, but the key is that it solves the cyclic correlation problem in O(z log z) arithmetic operations, for two vectors each of length z. Hence it solves the match-count problem using only O(m log m) arithmetic operations. This is surprisingly efficient and a definite improvement over the O(nm) bound given by the generalized Shift-And approach. However, the FFT requires operations over complex numbers, and so each arithmetic step is more involved (and perhaps more costly) than in the more direct Shift-And method. (A related approach [58] attempts to solve the match-count problem in O(m log m) integer (noncomplex) operations by implementing the FFT over a finite field. In practice, this approach is probably superior to the approach based on complex numbers, although in terms of pure complexity theory the claimed O(m log m) bound is not completely kosher because it uses a precomputed table of numbers that is only adequate for values of m up to a certain size.)
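Under the reduction above, the whole match-count computation is a few FFT calls. The sketch below is our illustration of the technique, not code from the book; it uses NumPy's FFT and the standard correlation identity W = IFFT(conj(FFT(X)) · FFT(Y)), and reports only the shifts where P lies entirely inside T.

import numpy as np

def match_count(P, T, alphabet="atcg"):
    """counts[i] = number of characters of P matching T when P starts at
    0-based position i of T, computed via cyclic correlation and the FFT."""
    n, m, z = len(P), len(T), len(P) + len(T)
    total = np.zeros(z)
    for x in alphabet:
        a = np.array([1.0 if ch == x else 0.0 for ch in P] + [0.0] * m)
        b = np.array([1.0 if ch == x else 0.0 for ch in T] + [0.0] * n)
        # cyclic correlation of the padded binary strings
        total += np.fft.ifft(np.conj(np.fft.fft(a)) * np.fft.fft(b)).real
    return [int(round(total[i])) for i in range(m - n + 1)]

print(match_count("aca", "acatcaca"))  # [3, 0, 1, 2, 0, 3]: full matches at shifts 0 and 5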
+ i - I). +
That is, consider P to be an n-bit binary number. Similarly, consider T: to be an n-bit binary number. For example, if P = 0101 then n = 4 and H ( P ) = Z3 x 0 + 2' x 1 2' x 0 2 O x 1 = 5; if T = 101~01010, n = 4, and r = 2, then H(T,) = 6. Clearly, if there is an occurrence of P starting at position r of T then H ( P ) = H(T,). However, the converse is also true, so
if and only if
Cyclic correlation Definition Let X and Y be two z-length vectors with real number components indexed from 0 to z - 1. The cyclic correlation of X and Y is an z-length real vector W(i) = ~ i ~ X(j) i - "x Y(i + j ) , where the indices in the expression are taken modulo z.
Clearly, the problem of computing vector V,(ar, B) is exactly the problem of computing the cyclic correlation of padded strings @a and In detail, X = a,, Y = Pa, z = n m , and W = V,(a, p). Now an algorithm based only on the definition of cyclic correlation would require 0(z2) operations, so again no progress is apparent. But cyclic correlation is a classic problem known to be solvable in O(z log z) time using the Fast Fourier Transform. (The FFT is more often associated with the convalutian problem for two vectors, but cyclic correlation and convolution are very similar. In fact, cyclic correlation is solved by reversing one of the input vectors and then computing the convolution of the two resulting vectors.) The FFT method, and its use in the solution of the cyclic correlation problem, is beyond the scope of this book, but the key is that it solves the cyclic correlation problem in O(z log z) arithmetic operations, for two vectors each of length 2 . Hence it solves the match-count problem using only O ( m log m ) arithmetic operations. This is surprisingly efficient and a definite improvement over the O(nm) bound given by the generalized Shift-And approach. However, the FFT requires operations over complex numbers and so each arithmetic step is more involved (and perhaps more costly) than in the more direct Shift-And method.'
A related approach [58] attempts to solve the match-count problem in O(m log m) integer (noncomplex) operations by implementing the FFT over a finite field. In practice, this approach is probably superior to the approach based on complex numbers, although in terms of pure complexity theory the claimed O(m log m) bound is not completely kosher, because it uses a precomputed table of numbers that is only adequate for values of m up to a certain size.
For example, if P = 101111 and p = 7, then H(P) = 47 and H_p(P) = 47 mod 7 = 5. Moreover, this can be computed as follows:

H_p(P) = [[[[(1 × 2 mod 7 + 0) × 2 mod 7 + 1] × 2 mod 7 + 1] × 2 mod 7 + 1] × 2 mod 7 + 1] mod 7 = 5.

Similarly, H_p(T_r) can be computed from H_p(T_{r−1}), using the fact that

H(T_r) = 2 × H(T_{r−1}) − 2^n × T(r − 1) + T(r + n − 1),

so that

H_p(T_r) = [2 × H_p(T_{r−1}) − (2^n mod p) × T(r − 1) + T(r + n − 1)] mod p.

Therefore, each successive power of two taken mod p and each successive value H_p(T_r) can be computed in constant time.
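In code, the two computations just described are one-liners; this minimal Python sketch (ours) checks the running example and implements the constant-time update (cf. Lemma 4.4.1 for why reducing at every step is safe):

    def h_mod(bits, p):
        # Horner's rule, reducing mod p at every step; no intermediate
        # value ever exceeds 2p.
        h = 0
        for b in bits:
            h = (2 * h + b) % p
        return h

    assert h_mod([1, 0, 1, 1, 1, 1], 7) == 5   # P = 101111: H(P) = 47, 47 mod 7 = 5

    def h_next(h_prev, old_bit, new_bit, pow2n_mod_p, p):
        # H_p(T_r) from H_p(T_{r-1}) in constant time, per the recurrence
        # above; pow2n_mod_p is 2^n mod p, maintained incrementally.
        return (2 * h_prev - pow2n_mod_p * old_bit + new_bit) % p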
The goal will be to choose a modulus p small enough that the arithmetic is kept efficient, yet large enough that the probability of a false match between P and T is kept small. The key comes from choosing p to be a prime number in the proper range and exploiting properties of prime numbers. We will state the needed properties of prime numbers without proof.
Definition For a positive integer u, π(u) is the number of primes that are less than or equal to u.
The following theorem is a variant of the famous prime number theorem.
Theorem 4.4.2. u/ln(u) ≤ π(u) ≤ 1.26 u/ln(u).
The proof, which we leave to the reader, is an immediate consequence of the fact that every integer can be written in a unique way as the sum of positive powers of two. Theorem 4.4.1 converts the exact match problem into a numerical problem, comparing the two numbers H(P) and H(T_r) rather than directly comparing characters. But unless the pattern is fairly small, the computation of H(P) and H(T_r) will not be efficient. The problem is that the required powers of two used in the definition of H(P) and H(T_r) grow large too rapidly. (From the standpoint of complexity theory, the use of such large numbers violates the unit-time random access machine (RAM) model. In that model, the largest allowed numbers must be represented in O[log(n + m)] bits, but the number 2^n requires n bits. Thus the required numbers are exponentially too large.) Even worse, when the alphabet is not binary but say has t characters, then numbers as large as t^n are needed. In 1987 R. Karp and M. Rabin [266] published a method (devised almost ten years earlier), called the randomized fingerprint method, that preserves the spirit of the above numerical approach, but that is extremely efficient as well, using numbers that satisfy the RAM model. It is a randomized method where the only if part of Theorem 4.4.1 continues to hold, but the if part does not. Instead, the if part will hold with high probability. This is explained in detail in the next section.
Definition For a positive integer p, H_p(P) is defined as H(P) mod p. That is, H_p(P) is the remainder of H(P) after division by p. Similarly, H_p(T_r) is defined as H(T_r) mod p. The numbers H_p(P) and H_p(T_r) are called fingerprints of P and T_r.
Already, the utility of using fingerprints should be apparent. By reducing H(P) and H(T_r) modulo a number p, every fingerprint remains in the range 0 to p − 1, so the size of a fingerprint does not violate the RAM model. But if H(P) and H(T_r) must be computed before they can be reduced modulo p, then we have the same problem of intermediate numbers that are too large. Fortunately, modular arithmetic allows one to reduce at any time (i.e., one can never reduce too much), so that the following generalization of Horner's rule holds:
Lemma 4.4.1. H_p(P) = [[. . . (([P(1) × 2 mod p + P(2)] × 2 mod p + P(3)) × 2 mod p + P(4)) . . .] mod p + P(n)] mod p, and no number ever exceeds 2p during the computation of H_p(P).
One can compute H(T_{r+1}) from H(T_r) more efficiently than by following the definition directly (and we will need that later on), but the time to do the updates is not the issue here.
Given the fact that each H_p(T_r) can be computed in constant time from H_p(T_{r−1}), the fingerprint algorithm runs in O(m) time, excluding any time used to explicitly check a declared match. It may, however, be reasonable not to bother explicitly checking declared matches, depending on the probability of an error. We will return to the issue of checking later. For now, to fully analyze the probability of error, we have to answer the question of what I should be.
How to choose I

The utility of the fingerprint method depends on finding a good value for I. As I increases, the probability of a false match between P and T decreases, but the allowed size of p increases, increasing the effort needed to compute H_p(P) and H_p(T_r). Is there a good balance? There are several good ways to choose I, depending on n and m. One choice is to take I = nm^2. With that choice the largest number used in the algorithm requires at most 4(log n + log m) bits, satisfying the RAM model requirement that the numbers be kept small as a function of the size of the input. But, what of the probability of a false match?

Corollary 4.4.2. When I = nm^2, the probability of a false match is at most 2.53/m.
PROOF
By Theorem 4.4.3 and the prime number theorem (Theorem 4.4.2), the probability of a false match is bounded by

π(nm)/π(nm^2) ≤ [1.26 nm/ln(nm)] × [ln(nm^2)/(nm^2)] = 1.26 ln(nm^2)/(m ln(nm)) ≤ 2.53/m,

since ln(nm^2) ≤ 2 ln(nm).
A small example from [266] illustrates this bound. Take n = 250 and m = 4000; hence I = nm^2 = 4 × 10^9 < 2^32. Then the probability of a false match is at most 2.53/4000 < 10^−3. Thus, with just a 32-bit fingerprint, for any P and T the probability that even a single one of the algorithm's declarations is wrong is bounded by 0.001. Alternately, if I = n^2 m then the probability of a false match is O(1/n), and since it takes O(n) time to determine whether a match is false or real, the expected verification time would be constant. The result would be an O(m) expected time method that never has a false match.
Extensions
If one prime is good, why not use several? Why not pick k primes p_1, p_2, . . . , p_k randomly and compute k fingerprints? For any position r, there can be an occurrence of P starting at r only if H_{p_i}(P) = H_{p_i}(T_r) for every one of the k selected primes. We now define a false match between P and T to mean that there is an r such that P does not occur in T starting at r, but H_{p_i}(P) = H_{p_i}(T_r) for each of the k primes. What now is the probability of a false match between P and T? One bound is fairly immediate and intuitive.
Lemma 4.4.2. If u ≥ 29, then the product of all the primes that are less than or equal to u is greater than 2^u [383].
For example, for u = 29 the prime numbers less than or equal to 29 are 2, 3, 5, 7, 11, 13, 17, 19, 23, and 29. Their product is 6,469,693,230, whereas 2^29 is 536,870,912.
Corollary 4.4.1. If u ≥ 29 and x is any number less than or equal to 2^u, then x has fewer than π(u) (distinct) prime divisors.
PROOF Suppose x does have k ≥ π(u) distinct prime divisors q_1, q_2, . . . , q_k. Then 2^u ≥ x ≥ q_1 q_2 · · · q_k (the first inequality is from the statement of the corollary, and the second from the fact that some primes in the factorization of x may be repeated). But q_1 q_2 · · · q_k is at least as large as the product of the smallest k primes, which is at least as large as the product of the first π(u) primes (by the assumption that k ≥ π(u)), that is, the product of all the primes less than or equal to u. However, that product is greater than 2^u (by Lemma 4.4.2). So the assumption that k ≥ π(u) leads to the contradiction that 2^u > 2^u, and the corollary is proved.
The central theorem

Now we are ready for the central theorem of the Karp-Rabin approach.

Theorem 4.4.3. Let P and T be any strings such that nm ≥ 29, where n and m are the lengths of P and T, respectively. Let I be any positive integer. If p is a randomly chosen prime number less than or equal to I, then the probability of a false match between P and T is less than or equal to π(nm)/π(I).
PROOF
Let R be the set of positions in T where P does not begin. That is, s ∈ R if and only if P does not occur in T beginning at s. For each s ∈ R, H(P) ≠ H(T_s). Now consider the product ∏_{s ∈ R} |H(P) − H(T_s)|. That product must be at most 2^{nm}, since for any s, |H(P) − H(T_s)| ≤ 2^n (recall that we have assumed a binary alphabet) and R contains at most m positions. Applying Corollary 4.4.1, ∏_{s ∈ R} |H(P) − H(T_s)| has at most π(nm) distinct prime divisors. Now suppose a false match between P and T occurs at some position r of T. That means that H(P) mod p = H(T_r) mod p and that p evenly divides H(P) − H(T_r). Trivially then, p evenly divides ∏_{s ∈ R} |H(P) − H(T_s)|, and so p is one of the prime divisors of that product. Thus, if p allows a false match to occur between P and T, then p must be one of a set of at most π(nm) numbers. But p was chosen randomly from a set of π(I) numbers, so the probability that p is a prime that allows a false match between P and T is at most π(nm)/π(I).
Notice that Theorem 4.4.3 holds for any choice of pattern P and text T such that nm ≥ 29. The probability in the theorem is not taken over choices of P and T but rather over choices of prime p. Thus, this theorem does not make any (questionable) assumptions about P or T being random or generated by a Markov process, etc. It works for any P and T! Moreover, the theorem doesn't just bound the probability that a false match occurs at a fixed position r; it bounds the probability that there is even a single such position r in T. It is also notable that the analysis in the proof of the theorem feels "weak". That is, it only develops a very weak property of a prime p that allows a false match, namely being one of at most π(nm) numbers that divide ∏_{s ∈ R} |H(P) − H(T_s)|. This suggests that the true probability of a false match occurring between P and T is much less than the bound established in the theorem. Theorem 4.4.3 leads to the following random fingerprint algorithm for finding all occurrences of P in T.
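The page break swallows the algorithm's statement at this point in the text, but its shape follows directly from the discussion: pick a random prime p ≤ I, compare H_p(P) with each H_p(T_r) using the constant-time update, and declare matches where the fingerprints agree. The following self-contained Python sketch is ours, not the book's; the trial-division prime picker is only adequate for a sketch, and the optional explicit checking is the verification discussed later.

    import random

    def karp_rabin(P, T, I=None, verify=True):
        # Randomized fingerprint matching of binary pattern P in binary text T.
        # With verify=True, every declared match is checked directly, so no
        # false match is ever reported (at a cost of O(n) per check).
        n, m = len(P), len(T)
        I = I or n * m * m                       # the choice I = n * m^2
        p = _random_prime(I)
        pow2n = pow(2, n, p)                     # 2^n mod p
        hP = hT = 0
        for i in range(n):                       # Horner's rule mod p
            hP = (2 * hP + int(P[i])) % p
            hT = (2 * hT + int(T[i])) % p
        hits = []
        for r in range(m - n + 1):               # 0-based start positions
            if hP == hT and (not verify or T[r:r + n] == P):
                hits.append(r + 1)               # report 1-based, as in the text
            if r + n < m:                        # constant-time fingerprint update
                hT = (2 * hT - pow2n * int(T[r]) + int(T[r + n])) % p
        return hits

    def _random_prime(I):
        # A random prime <= I by rejection sampling with trial division --
        # fine for a sketch, not for production use.
        while True:
            q = random.randint(2, max(2, I))
            if all(q % d for d in range(2, int(q ** 0.5) + 1)):
                return q

    print(karp_rabin("101", "10110101"))          # [1, 4, 6]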
allows numerous false matches (a demon seed). Theorem 4.4.3 says nothing about how bad a particular prime can be. But by picking a new prime after each error is detected, we can apply Corollary 4.4.2 to each prime, establishing
Theorem 4.4.6. If a new prime is randomly chosen after the detection of an error, then for any pattern and text the probability of t errors is at most (2.53/m)^t.
This probability falls so rapidly that one is effectively protected against a long series of errors on any particular problem instance. For additional probabilistic analysis of the Karp-Rabin method, see [182].
Theorem 4.4.4. When k primes are chosen randomly between 1 and I and k fingerprints are used, the probability of a false match between P and T is at most [π(nm)/π(I)]^k.
PROOF We saw in the proof of Theorem 4.4.3 that if p is a prime that allows H_p(P) = H_p(T_r) at some position r where P does not occur, then p is in a set of at most π(nm) integers. When k fingerprints are used, a false match can occur only if each of the k primes is in that set, and since the primes are chosen randomly (independently), the bound from Theorem 4.4.3 holds for each of the primes. So the probability that all the primes are in the set is bounded by [π(nm)/π(I)]^k, and the theorem is proved.
As an example, if k = 4 and n, m, and I are as in the previous example, then the probability of a false match between P and T is at most 10^−12. Thus, the probability of a false match is reduced dramatically, from 10^−3 to 10^−12, while the computational effort of using four primes only increases by four times. For typical values of n and m, a small choice of k will assure that the probability of an error due to a false match is less than the probability of error due to a hardware malfunction.
Theorem 4.4.5. When k primes are chosen randomly between 1 and I and k fingerprints are used, the probability of a false match between P and T is at most m[π(n)/π(I)]^k.
PROOF Suppose that a false match occurs at some fixed position r. That means that each prime p_i must evenly divide |H(P) − H(T_r)|. Since |H(P) − H(T_r)| ≤ 2^n, there are at most π(n) primes that divide it. So each p_i was chosen randomly from a set of π(I) primes and by chance is part of a subset of π(n) primes. The probability of this happening at that fixed r is therefore [π(n)/π(I)]^k. Since there are m possible choices for r, the probability of a false match between P and T (i.e., the probability that there is such an r) is at most m[π(n)/π(I)]^k, and the theorem is proved.

Assuming, as before, that I = nm^2, a little arithmetic (which we leave to the reader) shows
Corollary 4.4.3. When k primes are chosen randomly and used in the fingerprint algorithm, the probability of a false match between P and T is at most (1.26)^k m^−(2k−1) (1 + 0.6 ln m)^k.
Applying this to the running example of n = 250, m = 4000, and k = 4 reduces the probability of a false match to at most 2 × 10^−22. We mention one further refinement discussed in [266]. Returning to the case where only a single prime is used, suppose the algorithm explicitly checks that P occurs in T when H_p(P) = H_p(T_r), and it finds that P does not occur there. Then one may be better off by picking a new prime to use for the continuation of the computation. This makes intuitive sense. Theorem 4.4.3 randomizes over the choice of primes and bounds the probability that a randomly picked prime will allow a false match anywhere in T. But once the prime has been shown to allow a false match, it is no longer random. It may well be a prime that
specifications be efficiently handled with the Shift-And method or agrep? The answer partly depends on the number of such specifications that appear in the expression.
8. (Open problem) Devise a purely comparison-based method to compute match-counts in O(m log m) time. Perhaps one can examine the FFT method in detail to see if complex arithmetic can be replaced with character comparisons in the case of computing match-counts.
11. Complete the details and analysis to convert the Karp-Rabin method from a Monte Carlo-style randomized algorithm to a Las Vegas-style randomized algorithm.
12. There are improvements possible in the method to check for false matches in the Karp-Rabin method. For example, the method can find in O(m) time all those runs containing no false matches. Explain how. Also, at some point, the method needs to explicitly check for P at only l_1, and not l_2. Explain when and why.
more than two consecutive runs. It follows that the total time for the method, over all runs, is O(m). With the ability to check for false matches in O(m) time, the Karp-Rabin algorithm can be converted from a method with a small probability of error that runs in O(m) worst-case time, to one that makes no error, but runs in O(m) expected time (a conversion from a Monte Carlo algorithm to a Las Vegas algorithm). To achieve this, simply (re)run and (re)check the Karp-Rabin algorithm until no false matches are detected, as in the sketch below. We leave the details as an exercise.
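A minimal sketch of that conversion, assuming some routine run_once that runs the one-pass fingerprint algorithm with a fresh random prime and reports whether its O(m)-time check found any false match (the routine's name and interface are our placeholders):

    def las_vegas_match(P, T, run_once):
        # Re-run the Monte Carlo method until a run with no false match is
        # observed; the answer returned is then certainly correct, and the
        # expected number of repetitions is close to one.
        while True:
            declared, had_false_match = run_once(P, T)
            if not had_false_match:
                return declared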
4.5. Exercises
1. Evaluate empirically the sizes of P and T.
2. Extend the agrep method to solve the problem of finding an "occurrence" of a pattern P
inside a text T, when a small number of insertions and deletions of characters, as well as mismatches, are allowed. That is, characters can be inserted into P and characters can be deleted from P.
3. Adapt Shift-And and agrep to handle a set of patterns. Can you do better than just handling each pattern in the set independently?
4. Prove the correctness of the agrep method.
5. Show how to efficiently handle wild cards (both in the pattern and the text) in the Shift-And approach. Do the same for agrep. Show that the efficiency of neither method is affected by the number of wild cards in the strings.
6. Extend the Shift-And method to efficiently handle regular expressions that do not use the Kleene closure. Do the same for agrep. Explain the utility of these extensions to collections of biosequence patterns such as those in PROSITE.
7. We mentioned in Exercise 32 of Chapter 3 that PROSITE patterns often specify a range for the number of times that a subpattern repeats. Ranges of this type can be easily handled by the O(nm) regular expression pattern matching method of Section 3.6. Can such range
PART II
A suffix tree is a data structure that exposes the internal structure of a string in a deeper way than does the fundamental preprocessing discussed in Section 1.3. Suffix trees can be used to solve the exact matching problem in linear time (achieving the same worst-case bound that the Knuth-Morris-Pratt and the Boyer-Moore algorithms achieve), but their real virtue comes from their use in linear-time solutions to many string problems more complex than exact matching. Moreover (as we will detail in Chapter 9), suffix trees provide a bridge between exact matching problems, the focus of Part I, and inexact matching problems that are the focus of Part III.

The classic application for suffix trees is the substring problem. One is first given a text T of length m. After O(m), or linear, preprocessing time, one must be prepared to take in any unknown string S of length n and in O(n) time either find an occurrence of S in T or determine that S is not contained in T. That is, the allowed preprocessing takes time proportional to the length of the text, but thereafter, the search for S must be done in time proportional to the length of S, independent of the length of T. These bounds are achieved with the use of a suffix tree. The suffix tree for the text is built in O(m) time during a preprocessing stage; thereafter, whenever a string of length O(n) is input, the algorithm searches for it in O(n) time using that suffix tree.

The O(m) preprocessing and O(n) search result for the substring problem is very surprising and extremely useful. In typical applications, a long sequence of requested strings will be input after the suffix tree is built, so the linear time bound for each search is important. That bound is not achievable by the Knuth-Morris-Pratt or Boyer-Moore methods - those methods would preprocess each requested string on input, and then take O(m) (worst-case) time to search for the string in the text. Because m may be huge compared to n, those algorithms would be impractical on any but trivial-sized texts.

Often the text is a fixed set of strings, for example, a collection of STSs or ESTs (see Sections 3.5.1 and 7.10), so that the substring problem is to determine whether the input string is a substring of any of the fixed strings. Suffix trees work nicely to efficiently solve this problem as well. Superficially, this case of multiple text strings resembles the dictionary problem discussed in the context of the Aho-Corasick algorithm. Thus it is natural to expect that the Aho-Corasick algorithm could be applied. However, the Aho-Corasick method does not solve the substring problem in the desired time bounds, because it will only determine if the new string is a full string in the dictionary, not whether it is a substring of a string in the dictionary.

After presenting the algorithms, several applications and extensions will be discussed in Chapter 7. Then a remarkable result, the constant-time least common ancestor method, will be presented in Chapter 8. That method greatly amplifies the utility of suffix trees, as will be illustrated by additional applications in Chapter 9. Some of those applications provide a bridge to inexact matching; more applications of suffix trees will be discussed in Part III, where the focus is on inexact matching.
Figure 5.1: Suffix tree for string xabxac. The node labels u and w on the two interior nodes will be used later.
of another suffix of S, then no suffix tree obeying the above definition is possible, since the path for the first suffix would not end at a leaf. For example, if the last character of xabxac is removed, creating string xabxa, then suffix xa is a prefix of suffix xabxa, so the path spelling out xa would not end at a leaf. To avoid this problem, we assume (as was true in Figure 5.1) that the last character of S appears nowhere else in S. Then, no suffix of the resulting string can be a prefix of any other suffix. To achieve this in practice, we can add a character to the end of S that is not in the alphabet that string S is taken from. In this book we use $ for the "termination" character. When it is important to emphasize the fact that this termination character has been added, we will write it explicitly as in S$. Much of the time, however, this reminder will not be necessary and, unless explicitly stated otherwise, every string S is assumed to be extended with the termination symbol $, even if the symbol is not explicitly shown.

A suffix tree is related to the keyword tree (without backpointers) considered in Section 3.4. Given string S, if set P is defined to be the m suffixes of S, then the suffix tree for S can be obtained from the keyword tree for P by merging any path of nonbranching nodes into a single edge. The simple algorithm given in Section 3.4 for building keyword trees could be used to construct a suffix tree for S in O(m^2) time, rather than the O(m) bound we will establish.
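The naive quadratic-time construction is short enough to sketch directly. The following Python sketch is ours, not the book's: it inserts each suffix of S$ in turn, splitting an edge wherever a new suffix diverges mid-label, which is the keyword-tree idea above with nonbranching paths kept merged from the start.

    class Node:
        def __init__(self):
            self.children = {}   # first character -> (edge_label, child Node)
            self.leaf = None     # suffix number at a leaf (1-based), else None

    def naive_suffix_tree(S):
        # O(m^2)-time construction; the lone "$" suffix also gets a leaf
        # (numbered m + 1), which can be ignored.
        S = S + "$"
        root = Node()
        for i in range(len(S)):              # insert suffix S[i:], numbered i + 1
            v, suf = root, S[i:]
            while True:
                c = suf[0]
                if c not in v.children:      # no edge starts with c: new leaf edge
                    leaf = Node()
                    leaf.leaf = i + 1
                    v.children[c] = (suf, leaf)
                    break
                label, child = v.children[c]
                k = 0                         # length of the common prefix
                while k < len(label) and k < len(suf) and label[k] == suf[k]:
                    k += 1
                if k == len(label):           # edge fully matched: continue below it
                    v, suf = child, suf[k:]
                else:                         # diverge mid-edge: split the edge
                    mid = Node()
                    mid.children[label[k]] = (label[k:], child)
                    leaf = Node()
                    leaf.leaf = i + 1
                    mid.children[suf[k]] = (suf[k:], leaf)
                    v.children[c] = (label[:k], mid)
                    break
        return root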
Definition The label of a path from the root that ends at a node is the concatenation, in order, of the substrings labeling the edges of that path. The path-label of a node is the label of the path from the root of T to that node.

Definition For any node v in a suffix tree, the string-depth of v is the number of characters in v's label.

Definition A path that ends in the middle of an edge (u, v) splits the label on (u, v) at a designated point. Define the label of such a path as the label of u concatenated with the characters on edge (u, v) down to the designated split point.
For example, in Figure 5.1 string xa labels the internal node w (so node w has path-label xa), string a labels node u, and string xabx labels a path that ends inside edge (w, 1), that is, inside the leaf edge touching leaf 1.
5.3. A motivating example
Before diving into the details of the methods to construct suffix trees, let's look at how a suffix tree for a string is used to solve the exact match problem: Given a pattern P of
Definition A suffix tree T for an m-character string S is a rooted directed tree with exactly m leaves numbered 1 to m. Each internal node, other than the root, has at least two children and each edge is labeled with a nonempty substring of S. No two edges out of a node can have edge-labels beginning with the same character. The key feature of the suffix tree is that for any leaf i, the concatenation of the edge-labels on the path from the root to leaf i exactly spells out the suffix of S that starts at position i. That is, it spells out S[i..m].
For example, the suffix tree for the string xabxac is shown in Figure 5.1. The path from the root to the leaf numbered 1 spells out the full string S = xabxac, while the path to the leaf numbered 5 spells out the suffix ac, which starts in position 5 of S. As stated above, the definition of a suffix tree for S does not guarantee that a suffix tree for any string S actually exists. The problem is that if one suffix of S matches a prefix
is proportional to the number of edges traversed, so the time for the traversal is O(k), even though the total string-depth of those O(k) edges may be arbitrarily larger than k. If only a single occurrence of P is required, and the preprocessing is extended a bit, then the search time can be reduced from O(n + k) to O(n) time. The idea is to write at each node one number (say the smallest) of a leaf in its subtree. This can be achieved in O(m) time in the preprocessing stage by a depth-first traversal of T. The details are straightforward and are left to the reader. Then, in the search stage, the number written on the node at or below the end of the match gives one starting position of P in T. In Section 7.2.1 we will again consider the relative advantages of methods that preprocess the text versus methods that preprocess the pattern(s). Later, in Section 7.8, we will also show how to use a suffix tree to solve the exact matching problem using O(n) preprocessing and O(m) search time, achieving the same bounds as in the algorithms presented in Part I.
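Against the naive tree sketched earlier, the search just described is only a few lines; this Python sketch (ours) follows P from the root and then collects the leaf numbers below the match point by a depth-first traversal.

    def find_occurrences(root, P):
        # Match P along the unique path from the root; on success, every leaf
        # number in the subtree below the match point is a starting position.
        v, rest = root, P
        while rest:
            if rest[0] not in v.children:
                return []                       # no edge continues the match
            label, child = v.children[rest[0]]
            k = min(len(label), len(rest))
            if label[:k] != rest[:k]:
                return []                       # mismatch in the middle of an edge
            v, rest = child, rest[k:]
        out, stack = [], [v]                    # DFS to collect leaf numbers
        while stack:
            u = stack.pop()
            if u.leaf is not None:
                out.append(u.leaf)
            stack.extend(c for _, c in u.children.values())
        return sorted(out)

    # find_occurrences(naive_suffix_tree("awyawxawxz"), "aw") -> [1, 4, 7]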
Figure 5.2: Three occurrences of aw in awyawxawxz. Their starting positions number the leaves in the subtree of the node with path-label aw.
length n and a text T of length m, find all occurrences of P in T in O(n + m) time. We have already seen several solutions to this problem. Suffix trees provide another approach: Build a suffix tree T for text T in O(m) time. Then, match the characters of P along the unique path in T until either P is exhausted or no more matches are possible. In the latter case, P does not appear anywhere in T. In the former case, every leaf in the subtree below the point of the last match is numbered with a starting location of P in T, and every starting location of P in T numbers such a leaf.

The key to understanding the former case (when all of P matches a path in T) is to note that P occurs in T starting at position j if and only if P occurs as a prefix of T[j..m]. But that happens if and only if string P labels an initial part of the path from the root to leaf j. It is the initial path that will be followed by the matching algorithm. The matching path is unique because no two edges out of a common node can have edge-labels beginning with the same character. And, because we have assumed a finite alphabet, the work at each node takes constant time and so the time to match P to a path is proportional to the length of P.

For example, Figure 5.2 shows a fragment of the suffix tree for string T = awyawxawxz. Pattern P = aw appears three times in T, starting at locations 1, 4, and 7. Pattern P matches a path down to the point shown by an arrow, and as required, the leaves below that point are numbered 1, 4, and 7.

If P fully matches some path in the tree, the algorithm can find all the starting positions of P in T by traversing the subtree below the end of the matching path, collecting position numbers written at the leaves. All occurrences of P in T can therefore be found in O(n + m) time. This is the same overall time bound achieved by several algorithms considered in Part I, but the distribution of work is different. Those earlier algorithms spend O(n) time for preprocessing P and then O(m) time for the search. In contrast, the suffix tree approach spends O(m) preprocessing time and then O(n + k) search time, where k is the number of occurrences of P in T. To collect the k starting positions of P, traverse the subtree at the end of the matching path using any linear-time traversal (depth-first say), and note the leaf numbers encountered. Since every internal node has at least two children, the number of leaves encountered
Figure 6.1: Suffix tree for string xabxa$.
string S$ if and only if at least one of the suffixes of S is a prefix of another suffix. The terminal symbol $ was added to the end of S precisely to avoid this situation. However, if S ends with a character that appears nowhere else in S, then the implicit suffix tree of S will have a leaf for each suffix and will hence be a true suffix tree.

As an example, consider the suffix tree for string xabxa$ shown in Figure 6.1. Suffix xa is a prefix of suffix xabxa, and similarly the string a is a prefix of abxa. Therefore, in the suffix tree for xabxa$ the edges leading to leaves 4 and 5 are labeled only with $. Removing these edges creates two nodes with only one child each, and these are then removed as well. The resulting implicit suffix tree for xabxa is shown in Figure 6.2. As another example, Figure 5.1 on page 91 shows a tree built for the string xabxac. Since character c appears only at the end of the string, the tree in that figure is both a suffix tree and an implicit suffix tree for the string.

Even though an implicit suffix tree may not have a leaf for each suffix, it does encode all the suffixes of S - each suffix is spelled out by the characters on some path from the root of the implicit suffix tree. However, if the path does not end at a leaf, there will be no marker to indicate the path's end. Thus implicit suffix trees, on their own, are somewhat less informative than true suffix trees. We will use them just as a tool in Ukkonen's algorithm to finally obtain the true suffix tree for S.
6
Linear-Time Construction of Suffix Trees
We will present two methods for constructing suffix trees in detail, Ukkonen's method and Weiner's method. Weiner was the first to show that suffix trees can be built in linear time, and his method is presented both for its historical importance and for some different technical ideas that it contains. However, Ukkonen's method is equally fast and uses far less space (i.e., memory) in practice than Weiner's method. Hence Ukkonen's method is the one of choice for most problems requiring the construction of a suffix tree. We also believe that Ukkonen's method is easier to understand. Therefore, it will be presented first. A reader who wishes to study only one method is advised to concentrate on it. However, our development of Weiner's method does not depend on understanding Ukkonen's algorithm, and the two algorithms can be read independently (with one small shared section noted in the description of Weiner's method).
Definition We denote the implicit suffix tree of the string S[1..i] by I_i, for i from 1 to m.
The implicit suffix tree for any string S will have fewer leaves than the suffix tree for
Figure 6.3: Implicit suffix tree for string axabx before the sixth character, b, is added.
"
Figure 6.4: Extended implicit suffix tree after the addition of character b.
As an example, consider the implicit suffix tree for S = axabx shown in Figure 6.3. The first four suffixes end at leaves, but the single character suffix x ends inside an edge. When a sixth character b is added to the string, the first four suffixes get extended by applications of Rule 1, the fifth suffix gets extended by rule 2, and the sixth by rule 3. The result is shown in Figure 6.4.
Ukkonen's algorithm by first presenting an O(m^3)-time method to build all trees I_i and then optimizing its implementation to obtain the claimed time bound.
Rule 1 In the current tree, path β ends at a leaf. That is, the path from the root labeled β extends to the end of some leaf edge. To update the tree, character S(i + 1) is added to the end of the label on that leaf edge.
Rule 2 No path from the end of string β starts with character S(i + 1), but at least one labeled path continues from the end of β. In this case, a new leaf edge starting from the end of β must be created and labeled with character S(i + 1). A new node will also have to be created there if β ends inside an edge. The leaf at the end of the new leaf edge is given the number j.
Rule 3 Some path from the end of string β starts with character S(i + 1). In this case the string βS(i + 1) is already in the current tree, so (remembering that in an implicit suffix tree the end of a suffix need not be explicitly marked) we do nothing.
Following Corollary 6.1.1, all internal nodes in the changing tree will have suffix links from them, except for the most recently added internal node, which will receive its suffix link by the end of the next extension. We now show how suffix links are used to speed up the implementation.
bound. However, taken together, they do achieve a linear worst-case time. The most important element of the acceleration is the use of suffix links.
We will sometimes refer to a suffix link from v to s(v) as the pair (v, s(v)). For example, in Figure 6.1 (on page 95) let v be the node with path-label xa and let s(v) be the node whose path-label is the single character a. In this case, α is just a single character long. As a special case, if α is empty, then the suffix link from an internal node with path-label xα goes to the root node. The root node itself is not considered internal and has no suffix link from it. Although the definition of suffix links does not imply that every internal node of an implicit suffix tree has a suffix link from it, it will, in fact, have one. We actually establish something stronger in the following lemmas and corollaries.
Lemma 6.1.1. If a new internal node v with path-label xα is added to the current tree in extension j of some phase i + 1, then either the path labeled α already ends at an internal node of the current tree or an internal node at the end of string α will be created (by the extension rules) in extension j + 1 in the same phase i + 1.
PROOF A new internal node v is created in extension j (of phase i + 1) only when extension rule 2 applies. That means that in extension j, the path labeled xα continued with some character other than S(i + 1), say c. Thus, in extension j + 1, there is a path labeled α in the tree and it certainly has a continuation with character c (although possibly with other characters as well). There are then two cases to consider: Either the path labeled α continues only with character c or it continues with some additional character. When α is continued only by c, extension rule 2 will create a node s(v) at the end of path α. When α is continued with two different characters, then there must already be a node s(v) at the end of path α. The lemma is proved in either case.
Corollary 6.1.1. In Ukkonen's algorithm, any newly created internal node will have a suffix link from it by the end of the next extension.
PROOF The proof is inductive and is true for tree I_1 since I_1 contains no internal nodes. Suppose the claim is true through the end of phase i, and consider a single phase i + 1. By Lemma 6.1.1, when a new node v is created in extension j, the correct node s(v) ending the suffix link from v will be found or created in extension j + 1. No new internal node gets created in the last extension of a phase (the extension handling the single character S(i + 1)), so all suffix links from internal nodes created in phase i + 1 are known by the end of the phase, and tree I_{i+1} has all its suffix links.
Corollary 6.1.1 is similar to Theorem 6.2.5, which will be discussed during the treatment of Weiner's algorithm, and states an important fact about implicit suffix trees and ultimately about suffix trees. For emphasis, we restate the corollary in slightly different language.
Corollary 6.1.2. In any implicit suffix tree I_i, if internal node v has path-label xα, then there is a node s(v) of I_i with path-label α.
algorithm to O(m^2). This trick will also be central in other algorithms to build and use suffix trees.
Trick number 1: skip/count trick

In Step 2 of extension j + 1 the algorithm walks down from node s(v) along a path labeled γ. Recall that there surely must be such a γ path from s(v). Directly implemented, this walk along γ takes time proportional to |γ|, the number of characters on that path. But a simple trick, called the skip/count trick, will reduce the traversal time to something proportional to the number of nodes on the path. It will then follow that the time for all the down-walks in a phase is at most O(m).
Trick 1 Let g denote the length of γ, and recall that no two labels of edges out of s(v) can start with the same character, so the first character of γ must appear as the first character on exactly one edge out of s(v). Let g′ denote the number of characters on that edge. If g′ is less than g, then the algorithm does not need to look at any more of the characters on that edge; it simply skips to the node at the end of the edge. There it sets g to g − g′, sets a variable h to g′ + 1, and looks over the outgoing edges to find the correct next edge (whose first character matches character h of γ). In general, when the algorithm identifies the next edge on the path it compares the current value of g to the number of characters g′ on that edge. When g is at least as large as g′, the algorithm skips to the node at the end of the edge, sets g to g − g′, sets h to h + g′, and finds the edge whose first character is character h of γ and repeats. When an edge is reached where g is smaller than or equal to g′, then the algorithm skips to character g on the edge and quits, assured that the γ path from s(v) ends on that edge exactly g characters down its label. (See Figure 6.6.)

Assuming simple and obvious implementation details (such as knowing the number of characters on each edge, and being able, in constant time, to extract from S the character at any given position) the effect of using the skip/count trick is to move from one node to the next node on the γ path in constant time. The total time to traverse the path is then proportional to the number of nodes on it rather than the number of characters on it. This is a useful heuristic, but what does it buy in terms of worst-case bounds? The next lemma leads immediately to the answer.
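A sketch of the walk itself, in Python (ours): here a node is modelled as a dict mapping a first character to a triple (start, end, child) giving the edge's label S[start..end] and the node below it - our stand-in for whatever edge-label encoding is used, in the spirit of the index-pair scheme of Figure 6.8.

    def skip_count_walk(node, gamma):
        # Follow the path spelling `gamma` down from `node`, doing O(1) work
        # per node skipped rather than per character; assumes the gamma path
        # exists, so no characters need be compared, only edge lengths.
        # Returns (child, offset): the path ends `offset` characters down the
        # edge entering `child` (offset equal to the edge length means the
        # path ends exactly at `child`).
        g, h = len(gamma), 0                 # g = characters of gamma remaining
        if g == 0:
            return node, 0
        while True:
            start, end, child = node[gamma[h]]   # unique edge starting with gamma[h]
            g_edge = end - start + 1             # g' = number of characters on edge
            if g_edge >= g:
                return child, g                  # ends g characters down this edge
            node, h, g = child, h + g_edge, g - g_edge   # skip the whole edge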
Definition Define the node-depth of a node v to be the number of nodes on the path from the root to v.
Lemma 6.1.2. Let (v, s(v)) be any suffix link traversed during Ukkonen's algorithm. At that moment, the node-depth of v is at most one greater than the node-depth of s(v).
PROOF When suffix link (v, s(v)) is traversed, any internal ancestor of v, which has path-label xβ say, has a suffix link to a node with path-label β. But xβ is a prefix of the path to v, so β is a prefix of the path to s(v), and it follows that the suffix link from any internal ancestor of v goes to an ancestor of s(v). Moreover, if β is nonempty then the node labeled by β is an internal node. And, because the node-depths of any two ancestors of v must differ, each ancestor of v has a suffix link to a distinct ancestor of s(v). It follows that the node-depth of s(v) is at least one (for the root) plus the number of internal ancestors of v that have path-labels more than one character long. The only extra ancestor that v can have (without a corresponding ancestor for s(v)) is an internal ancestor whose path-label
Again, we are assuming a constant-sized alphabet.
Figure 6.5: Extension j > 1 in phase i + 1. Walk up at most one edge (labeled γ) from the end of the path labeled S[j − 1..i] to node v; then follow the suffix link to s(v); then walk down the path specifying substring γ; then apply the appropriate extension rule to insert suffix S[j..i + 1].
End. Assuming the algorithm keeps a pointer to the current full string S[1..i], the first extension of phase i + 1 need not do any up or down walking. Furthermore, the first extension of phase i + 1 always applies suffix extension rule 1.
What has been achieved so far? The use of suffix links is clearly a practical improvement over walking from the root in each extension, as done in the naive algorithm. But does their use improve the worst-case running time? The answer is that as described, the use of suffix links does not yet improve the time bound. However, here we introduce a trick that will reduce the worst-case time for the
Figure 6.7: For every node v on the path xα, the corresponding node s(v) is on the path α. However, the node-depth of s(v) can be one less than the node-depth of v, it can be equal, or it can be greater. For example, the node labeled xab has node-depth two, whereas the node-depth of ab is one. The node-depth of the node labeled xabcdefg is four, whereas the node-depth of abcdefg is five.
Corollary 6.1.3. Ukkonen's algorithm can be implemented with suffix links to run in O(m^2) time.
Note that the O(m^2) time bound for the algorithm was obtained by multiplying the O(m) time bound on a single phase by m (since there are m phases). This crude multiplication was necessary because the time analysis was directed to only a single phase. What is needed are some changes to the implementation allowing a time analysis that crosses phase boundaries. That will be done shortly.

At this point the reader may be a bit weary because we seem to have made no progress, since we started with a naive O(m^2) method. Why all the work just to come back to the same time bound? The answer is that although we have made no progress on the time bound, we have made great conceptual progress so that with only a few more easy details, the time will fall to O(m). In particular, we will need one simple implementation detail and two more little tricks.
Figure 6.6: The skip/count trick. In phase i + 1, substring γ has length ten. There is a copy of substring γ out of node s(v); it is found three characters down the last edge, after four node skips are executed.
has length one (it has label x). Therefore, v can have node-depth at most one more than s(v). (See Figure 6.7.)
Definition As the algorithm proceeds, the current node-depth of the algorithm is the node-depth of the node most recently visited by the algorithm.
Theorem 6.1.1. Using the skip/count trick, any phase of Ukkonen's algorithm takes O(m) time.
PROOF
There are i + 1 ≤ m extensions in phase i + 1. In a single extension the algorithm walks up at most one edge to find a node with a suffix link, traverses one suffix link, walks down some number of nodes, applies the suffix extension rules, and maybe adds a suffix link. We have already established that all the operations other than the down-walking take constant time per extension, so we only need to analyze the time for the down-walks. We do this by examining how the current node-depth can change over the phase.

The up-walk in any extension decreases the current node-depth by at most one (since it moves up at most one node), each suffix link traversal decreases the node-depth by at most another one (by Lemma 6.1.2), and each edge traversed in a down-walk moves to a node of greater node-depth. Thus over the entire phase the current node-depth is decremented at most 2m times, and since no node can have depth greater than m, the total possible increment to current node-depth is bounded by 3m over the entire phase. It follows that over the entire phase, the total number of edge traversals during down-walks is bounded by 3m. Using the skip/count trick, the time per down-edge traversal is constant, so the total time in a phase for all the down-walking is O(m), and the theorem is proved.
Observation 1: Rule 3 is a show stopper In any phase, if suffix extension rule 3 applies in extension j, it will also apply in all further extensions (j + 1 to i + 1) until the end of the phase. The reason is that when rule 3 applies, the path labeled S[j..i] in the current tree must continue with character S(i + 1), and so the path labeled S[j + 1..i] does also, and rule 3 again applies in extensions j + 1, j + 2, . . . , i + 1.

When extension rule 3 applies, no work needs to be done since the suffix of interest is already in the tree. Moreover, a new suffix link needs to be added to the tree only after an extension in which extension rule 2 applies. These facts and Observation 1 lead to the following implementation trick.
Trick 2 End any phase i + 1 the first time that extension rule 3 applies. If this happens in extension j, then there is no need to explicitly find the end of any string S[k..i] for k > j.

The extensions in phase i + 1 that are "done" after the first execution of rule 3 are said to be done implicitly. This is in contrast to any extension j where the end of S[j..i] is explicitly found. An extension of that kind is called an explicit extension. Trick 2 is clearly a good heuristic to reduce work, but it's not clear if it leads to a better worst-case time bound. For that we need one more observation and trick.

Observation 2: Once a leaf, always a leaf That is, if at some point in Ukkonen's algorithm a leaf is created and labeled j (for the suffix starting at position j of S), then that leaf will remain a leaf in all successive trees created during the algorithm. This is true because the algorithm has no mechanism for extending a leaf edge beyond its current leaf. In more detail, once there is a leaf labeled j, extension rule 1 will always apply to extension j in any successive phase. So once a leaf, always a leaf.

Now leaf 1 is created in phase 1, so in any phase i there is an initial sequence of consecutive extensions (starting with extension 1) where extension rule 1 or 2 applies. Let j_i denote the last extension in this sequence. Since any application of rule 2 creates a new leaf, it follows from Observation 2 that j_i ≤ j_{i+1}. That is, the initial sequence of extensions where rule 1 or 2 applies cannot shrink in successive phases. This suggests an implementation trick that in phase i + 1 avoids all explicit extensions 1 through j_i. Instead, only constant time will be required to do those extensions implicitly.

To describe the trick, recall that the label on any edge in an implicit suffix tree (or a suffix tree) can be represented by two indices p, q specifying the substring S[p..q]. Recall also that for any leaf edge of I_i, index q is equal to i, and in phase i + 1 index q gets incremented to i + 1, reflecting the addition of character S(i + 1) to the end of each suffix.

Trick 3 In phase i + 1, when a leaf edge is first created and would normally be labeled with substring S[p..i + 1], instead of writing indices (p, i + 1) on the edge, write (p, e), where e is a symbol denoting "the current end". Symbol e is a global index that is set to i + 1 once in each phase. In phase i + 1, since the algorithm knows that rule 1 will apply in extensions 1 through j_i at least, it need do no additional explicit work to implement
Figure 6.8: The left tree is a fragment of the suffix tree for string S = abcdefabcuvw, with the edge-labels written explicitly. The right tree shows the edge-labels compressed. Note that the edge with label 2, 3 could also have been labeled 8, 9.
each is labeled with a complete suffix, requiring 26 × 27/2 characters in all. For strings longer than the alphabet size, some characters will repeat, but still one can construct strings of arbitrary length m so that the resulting edge-labels have more than O(m) characters in total. Thus, an O(m)-time algorithm for building suffix trees requires some alternate scheme to represent the edge-labels.
Edge-label compression
A simple, alternate scheme exists for edge labeling. Instead of explicitly writing a substring on an edge of the tree, only write a pair of indices on the edge, specifying beginning and end positions of that substring in S (see Figure 6.8). Since the algorithm has a copy of string S, it can locate any particular character in S in constant time given its position in the string. Therefore, we may describe any particular suffix tree algorithm as if edge-labels were explicit, and yet implement that algorithm with only a constant number of symbols written on any edge (the index pair indicating the beginning and ending positions of a substring).

For example, in Ukkonen's algorithm when matching along an edge, the algorithm uses the index pair written on an edge to retrieve the needed characters from S and then performs the comparisons on those characters. The extension rules are also easily implemented with this labeling scheme. When extension rule 2 applies in a phase i + 1, label the newly created edge with the index pair (i + 1, i + 1), and when extension rule 1 applies (on a leaf edge), change the index pair on that leaf edge from (p, q) to (p, q + 1). It is easy to see inductively that q had to be i and hence the new label (p, i + 1) represents the correct new substring for that leaf edge.

By using an index pair to specify an edge-label, only two numbers are written on any edge, and since the number of edges is at most 2m − 1, the suffix tree uses only O(m) symbols and requires only O(m) space. This makes it more plausible that the tree can actually be built in O(m) time. Although the fully implemented algorithm will not explicitly write a substring on an edge, we will still find it convenient to talk about "the substring or label on an edge or path" as if the explicit substring was written there.
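The scheme reduces to a few lines; this Python sketch (ours) recovers an explicit label from an index pair and shows the index-pair forms of extension rules 1 and 2, with the duplicate labeling of Figure 6.8 as a check (1-based, inclusive indices).

    def edge_text(S, p, q):
        # Explicit label of an edge stored as the index pair (p, q).
        return S[p - 1:q]

    def rule2_new_leaf_edge(i):
        # Rule 2 in phase i+1: the new leaf edge gets the pair (i+1, i+1).
        return (i + 1, i + 1)

    def rule1_extend_leaf_edge(p, q):
        # Rule 1: (p, q) becomes (p, q+1); inductively q was i, so the new
        # pair (p, i+1) names the correct longer substring.
        return (p, q + 1)

    S = "abcdefabcuvw"
    assert edge_text(S, 2, 3) == edge_text(S, 8, 9) == "bc"   # cf. Figure 6.8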
We make the standard RAM model assumption that a number with up to log m bits can be read, written, or compared in constant time.
Since there are only m phases, and the index of the current explicit extension never decreases and is bounded by m, the algorithm therefore executes only 2m explicit extensions. As established earlier, the time for an explicit extension is a constant plus some time proportional to the number of node skips it does during the down-walk in that extension.

To bound the total number of node skips done during all the down-walks, we consider (similar to the proof of Theorem 6.1.1) how the current node-depth changes during successive extensions, even extensions in different phases. The key is that the first explicit extension in any phase (after phase 1) begins with extension j*, which was the last explicit extension in the previous phase. Therefore, the current node-depth does not change between the end of one extension and the beginning of the next. But (as detailed in the proof of Theorem 6.1.1), in each explicit extension the current node-depth is first reduced by at most two (up-walking one edge and traversing one suffix link), and thereafter the down-walk in that extension increases the current node-depth by one at each node skip. Since the maximum node-depth is m, and there are only 2m explicit extensions, it follows (as in the proof of Theorem 6.1.1) that the maximum number of node skips done during all the down-walking (and not just in a single phase) is bounded by O(m). All work has been accounted for, and the theorem is proved.
Theorem 6.1.3. Ukkonen's algorithm builds a true suffix tree for S, along with all its suffix links, in O(m) time.
Definition For any position i, Suff_i denotes the suffix S[i..m] of S. For example, Suff_1 is the entire string S, and Suff_m is the single character S(m).
Definition Define T_i to be the tree that has m − i + 2 leaves numbered i through m + 1 such that the path from the root to any leaf j (i ≤ j ≤ m + 1) has label Suff_j$. That is, T_i is a tree encoding all and only the suffixes of string S[i..m]$, so it is a suffix tree of string S[i..m]$.
Weiner's algorithm constructs trees from T_{m+1} down to T_1 (i.e., in decreasing order of i). We will first implement the method in a straightforward inefficient way. This will
Figure 6.9: Cartoon of a possible execution of Ukkonen's algorithm. Each line represents a phase of the algorithm, and each number represents an explicit extension executed by the algorithm. In this cartoon there are four phases and seventeen explicit extensions. In any two consecutive phases, there is at most one index where the same explicit extension is executed in both phases.
those j_i extensions. Instead, it only does constant work to increment variable e, and then does explicit work for (some) extensions starting with extension j_i + 1.
End Step 3 correctly sets j_{i+1} because the initial sequence of extensions where extension rule 1 or 2 applies must end at the point where rule 3 first applies. The key feature of algorithm SPA is that phase i + 2 will begin computing explicit extensions with extension j*, where j* was the last explicit extension computed in phase i + 1. Therefore, two consecutive phases share at most one index (j*) where an explicit extension is executed (see Figure 6.9). Moreover, phase i + 1 ends knowing where string S[j*..i + 1] ends, so the repeated extension of j* in phase i + 2 can execute the suffix extension rule for j* without any up-walking, suffix link traversals, or node skipping. That means the first explicit extension in any phase only takes constant time. It is now easy to prove the main result.
Theorem 6.1.2. Using suffix links and implementation tricks 1, 2, and 3, Ukkonen's algorithm builds implicit suffix trees I_1 through I_m in O(m) total time.
PROOF
The time for all the implicit extensions in any phase is constant and so is O(m) over the entire algorithm. As the algorithm executes explicit extensions, consider the index j* corresponding to the explicit extension the algorithm is currently executing. Over the entire execution of the algorithm, j* never decreases, but it does remain the same between two successive phases.
Since a copy of string Head(i) begins at some position between i + 1 and m, Head(i) is also a prefix of Suff_k for some k > i. It follows that Head(i) is the longest prefix (possibly empty) of Suff_i that is a label on some path from the root in tree T_{i+1}. The above straightforward algorithm to build T_i from T_{i+1} can be described as follows:
2. If there is no node at the end of Head(i) then create one, and let w denote the node (created or not) at the end of Head(i). If w is created at this point, splitting an existing edge, then split its existing edge-label so that w has node-label Head(i). Then, create a new leaf numbered i and a new edge (w, i) labeled with the remaining characters of Suff_i$. That is, the new edge-label should be the last m − i + 1 − |Head(i)| characters of Suff_i, followed by the termination symbol $.
Figure 6.10: A step in the naive Weiner algorithm. The full string tat is added to the suffix tree for at. The edge labeled with the single character $ is omitted, since such an edge is part of every suffix tree.
serve to introduce and illustrate important definitions and facts. Then we will speed up the straightforward construction to obtain Weiner's linear-time algorithm.
Definition For any position i, Head(i) denotes the longest prefix of S[i..m] that matches a substring of S[i + 1..m]$.
Note that Head(i) could be the empty string. In fact, Head(m) is always the empty string because S[i + 1..m] is the empty string when i + 1 is greater than m, and character S(m) ≠ $.
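As a direct transcription of the definition, one can compute Head(i) naively in quadratic time; this Python sketch (ours) is only meant to pin down the definition, not to be efficient (1-based i; $ is assumed absent from S).

    def head(S, i):
        # Longest prefix of S[i..m] that matches a substring of S[i+1..m]$.
        suffix = S[i - 1:]          # S[i..m], 1-based i
        rest = S[i:] + "$"          # S[i+1..m]$
        for length in range(len(suffix), 0, -1):
            if suffix[:length] in rest:
                return suffix[:length]
        return ""                   # Head(i) may be empty

    assert head("tat", 1) == "t"    # cf. Figure 6.10: adding tat to the tree for at
    assert head("tat", 3) == ""     # Head(m) is always empty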
Suff_i and Suff_k both begin with string Head(i) = S(i)β and differ after that. For concreteness, say Suff_i begins S(i)βa and Suff_k begins S(i)βb. But then Suff_{i+1} begins βa and Suff_{k+1} begins βb. Both i + 1 and k + 1 are greater than or equal to i + 1 and less than or equal to m + 1, so both suffixes are represented in tree T_{i+1}. Therefore, in tree T_{i+1} there must be a path from the root labeled β (possibly the empty string) that extends in two ways, one continuing with character a and the other with character b. Hence there is a node u in T_{i+1} with path-label β, and I_u(S(i)) = 1 since there is a path (namely, an initial part of the path to leaf k) labeled S(i)β in T_{i+1}. Further, node u must be on the path to leaf i + 1 since β is a prefix of Suff_{i+1}. Now I_v(S(i)) = 1 and v has path-label α, so Head(i) must begin with S(i)α. That means that α is a prefix of β, and so node u, with path-label β, must either be v or below v on the path to leaf i + 1. However, if u ≠ v then u would be a node below v on the path to leaf i + 1 with I_u(S(i)) = 1. This contradicts the choice of node v, so v = u, α = β, and the theorem is proved. That is, Head(i) is exactly the string S(i)α.
Note that in Theorem 6.2.1 and its proof we only assume that node v exists. No assumption about v' was made. This will be useful in one of the degenerate cases examined later.
Theorem 6.2.2. Assume both v and v′ have been found and L_{v′}(S(i)) points to node v″. If l_i = 0 then Head(i) ends at v″; otherwise it ends after exactly l_i characters on a single edge out of v″.
PROOF
Since v′ is on the path to leaf i + 1 and L_{v′}(S(i)) points to node v″, the path from the root labeled Head(i) must include v″. By Theorem 6.2.1, Head(i) = S(i)α, so Head(i) must end exactly l_i characters below v″. Thus, when l_i = 0, Head(i) ends at v″. But when l_i > 0, there must be an edge e = (v″, z) out of v″ whose label begins with character c (the first of the l_i characters on the path from v′ to v) in T_{i+1}. Can Head(i) extend down to node z (i.e., to a node below v″)? Node z must be a branching node, for if it were a leaf then some suffix Suff_k, for k > i, would be a prefix of Suff_i, which is not possible. Let z have path-label S(i)γ. If Head(i) extends down to branching node z, then there must be two substrings starting at or after position i + 1 of S that both begin with string γ. Therefore, there would be a node z′ with path-label γ in T_{i+1}. Node z′ would then be below v′ on the path to leaf i + 1, contradicting the selection of v′. So Head(i) must not reach z and must end in the interior of edge e. In particular, it ends exactly l_i characters from v″ on edge e.
Thus when l_i = 0, we know Head(i) ends at v″, and when l_i > 0, we find Head(i) from v″ by examining the edges out of v″ to identify the unique edge e whose first character is c. Then Head(i) ends exactly l_i characters down e from v″. Tree T_i is then constructed by subdividing edge e, creating a node w at this point, and adding a new edge from w to leaf i labeled with the remainder of Suff_i. The search for the correct edge out of v″ takes only constant time since the alphabet is fixed.

In summary, when v and v′ exist, the above method correctly creates T_i from T_{i+1}, although we must still discuss how to update the vectors. Also, it may not yet be clear at this point why this method is more efficient than the naive algorithm for finding Head(i). That will come later. Let us first examine how the algorithm handles the degenerate cases when v and v′ do not both exist.
For any (single) character x and any node u, I_u(x) = 1 in T_{i+1} if and only if there is a path from the root of T_{i+1} labeled xα, where α is the path-label of node u. The path labeled xα need not end at a node. For any character x, L_u(x) in T_{i+1} points to (internal) node ū in T_{i+1} if and only if ū has path-label xα, where u has path-label α. Otherwise L_u(x) is null. For example, in the tree in Figure 5.1 (page 91) consider the two internal nodes u and w with path-labels a and xa respectively. Then I_u(x) = 1 for the specific character x, and L_u(x) = w. Also, I_u(b) = 1, but L_u(b) is null. Clearly, for any node u and any character x, L_u(x) is nonnull only if I_u(x) = 1, but the converse is not true. It is also immediate that if I_u(x) = 1 then I_v(x) = 1 for every ancestor node v of u.

Tree T_m has only one nonleaf node, namely the root r. In this tree we set I_r(S(m)) to one, set I_r(x) to zero for every other character x, and set all the link entries for the root to null. Hence the above properties hold for T_m. The algorithm will maintain the vectors as the tree changes, and we will prove inductively that the above properties hold for each tree.
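As an illustration, one plausible way to hold these vectors is sketched below: a set of characters for I_u and a dictionary of links for L_u at each node. The class and function names are ours, not the text's; for a fixed alphabet both structures are constant size per node.

    class WeinerNode:
        def __init__(self):
            self.children = {}   # first character of edge-label -> child node
            self.I = set()       # characters x with I_u(x) = 1
            self.L = {}          # characters x mapped to the node L_u(x)

    def indicator(u, x):
        # is there a path from the root labeled x followed by u's path-label?
        return x in u.I

    def link(u, x):
        # the node with path-label x followed by u's path-label, or None
        return u.L.get(x)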
Theorem 6.2.1. Assume that node v has been found by the algorithm and that it has path-label α. Then the string Head(i) is exactly S(i)α.
PROOF Head(i) is the longest prefix of Suff_i that is also a prefix of Suff_k for some k > i. Since v was found with I_v(S(i)) = 1, there is a path in T_{i+1} that begins with S(i), so Head(i) is at least one character long. Therefore, we can express Head(i) as S(i)β, for some (possibly empty) string β.
on this path. If l_i = 0 then Head(i) ends at v''. Otherwise, search for the edge e out of v'' whose first character is c. Head(i) ends exactly l_i characters below v'' on edge e.

4. If a node already exists at the end of Head(i), then let w denote that node; otherwise, create a node w at the end of Head(i). Create a new leaf numbered i; create a new edge (w, i) labeled with the remaining substring of Suff_i (i.e., the last m - i + 1 - |Head(i)| characters of Suff_i), followed by the termination character $. Tree T_i has now been created.
Correctness
It should be clear from the proofs of Theorems 6.2.1 and 6.2.2 and the discussion of the degenerate cases that the algorithm correctly creates tree T_i from T_{i+1}, although before it can create T_{i-1}, it must update the I and L vectors.
Theorem 6.2.3. When a new node w is created in the interior of an edge (v'', z), the indicator vector for w should be copied from the indicator vector for z.

PROOF

It is immediate that if I_z(x) = 1 then I_w(x) must also be 1 in T_i. But can it happen that I_w(x) should be 1 and yet I_z(x) is set to 0 at the moment that w is created? We will see that it cannot. Let node z have path-label γ; of course node w has path-label Head(i), a prefix of γ. The fact that there are no nodes between v'' and z in T_{i+1} means that every suffix from Suff_{i+1} down to Suff_m that begins with string Head(i) must actually begin with the longer
Case 1: I_r(S(i)) = 0. In this case the walk ends at the root and no node v was found. It follows that character S(i) does not appear in any position greater than i, for if it did appear, then some suffix in that range would begin with S(i), some path from the root would begin with S(i), and I_r(S(i)) would have been 1. So when I_r(S(i)) = 0, Head(i) is the empty string and ends at the root.

Case 2: I_v(S(i)) = 1 for some v (possibly the root), but v' does not exist. In this case the walk ends at the root with L_r(S(i)) null. Let t_i be the number of characters from the root to v. From Theorem 6.2.1, Head(i) ends exactly t_i + 1 characters from the root. Since v exists, there is some edge e = (r, z) whose edge-label begins with character S(i). This is true whether t_i = 0 or t_i > 0. If t_i = 0 then Head(i) ends after the first character, S(i), on edge e. Similarly, if t_i > 0 then Head(i) ends exactly t_i + 1 characters from the root on edge e. For suppose Head(i) extends all the way to some child z (or beyond). Then exactly as in the proof of Theorem 6.2.2, z must be a branching node and there must be a node z' below the root on the path to leaf i+1 such that L_{z'}(S(i)) is nonnull, which would be a contradiction. So when t_i > 0, Head(i) ends exactly t_i + 1 characters from the root on the edge e out of the root. This edge can be found from the root in constant time since its first character is S(i).

In either of these degenerate cases (as in the good case), Head(i) is found in constant time after the walk reaches the root. After the end of Head(i) is found and w is created or found, the algorithm proceeds exactly as in the good case. Note that degenerate Case 2 is very similar to the "good" case when both v and v' were found, but differs in a small detail because Head(i) is found t_i + 1 characters down on e rather than t_i characters down (the natural analogue of the good case).
The current node-depth can increase by one each time a new node is created and each time a link pointer is traversed. Hence the total number of increases in the current node-depth is at most 2m. It follows that the current node-depth can also only decrease at most 2m times, since the current node-depth starts at zero and is never negative. The current node-depth decreases at each move up the walk, so the total number of nodes visited during all the upward walks is at most 2m. The time for the algorithm is proportional to the total number of nodes visited during upward walks, so the theorem is proved.
Theorem 6.2.5. If v is a node in the suffix tree labeled by the string xα, where x is a single character, then there is a node in the tree labeled with the string α.
This fact was also established as Corollary 6.1.2 during the discussion of Ukkonen's algorithm.
McCreight's algorithm at the high level

McCreight's algorithm builds the suffix tree T for the m-length string S by inserting the suffixes in order, one at a time, starting from suffix one (i.e., the complete string S). (This is opposite to the order used in Weiner's algorithm, and it is superficially different from Ukkonen's algorithm.) It builds the tree encoding all the suffixes of S starting at positions 1 through i+1 from the tree encoding all the suffixes of S starting at positions 1 through i. The naive construction method is immediate and runs in O(m²) time. Using suffix links and the skip/count trick, that time can be reduced to O(m). We leave this to the interested reader to work out.
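Below is a minimal sketch of that naive method in McCreight's insertion order (longest suffix first). The dictionary representation and names are illustrative, and a real implementation would store (start, end) index pairs on edges rather than label strings.

    def naive_mccreight(s):
        s += "$"                                 # unique termination character
        root = {}                                # node: first char -> (label, child)
        for i in range(len(s)):                  # insert Suff_1, Suff_2, ... in order
            node, suf = root, s[i:]
            while True:
                if suf[0] not in node:
                    node[suf[0]] = (suf, i)      # new leaf recording start i
                    break
                label, child = node[suf[0]]
                j = 0                            # match suf along this edge
                while j < len(label) and label[j] == suf[j]:
                    j += 1
                if j == len(label):              # edge exhausted: continue below
                    node, suf = child, suf[j:]
                else:                            # mismatch inside the edge: split
                    node[suf[0]] = (label[:j],
                                    {label[j]: (label[j:], child),
                                     suf[j]: (suf[j:], i)})
                    break
        return root

Each insertion walks down from the root, so the total work is O(m²); suffix links and the skip/count trick remove the need to restart at the root for every suffix.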
The space requirements for Ukkonen's and McCreight's algorithms are determined by the need to represent and move around the tree quickly. We will be much more precise about space and practical implementation issues in Section 6.5.
string γ. Hence in T_{i+1} there can be a path labeled xHead(i) only if there is also a path labeled xγ, and this holds for any character x. Therefore, if there is a path in T_i labeled xHead(i) (the requirement for I_w(x) to be 1) but no path xγ, then the hypothesized string xHead(i) must begin at character i of S. That means that Suff_{i+1} must begin with the string Head(i). But since w has path-label Head(i), leaf i+1 must be below w in T_i and so must be below z in T_{i+1}. That is, z is on the root-to-leaf-(i+1) path. However, to construct T_i from T_{i+1} the algorithm starts at leaf i+1 and walks toward the root, and when it finds node v or reaches the root, the indicator entry for x = S(i) has been set to 1 at every node on the path from leaf i+1. The walk finishes before node w is created, and so it cannot be that I_z(x) = 0 at the time when w is created. So if path xHead(i) exists in T_i, then I_z(x) = 1 at the moment w is created, and the theorem is proved.
Lemma 6.2.1. When the algorithm traverses a link pointer from a node v' to a node v'' in T_{i+1}, the node-depth of v'' is at most one more than the node-depth of v'.
PROOF
Let u be a nonroot node in T_{i+1} on the path from the root to v'', and suppose u has path-label S(i)α for some nonempty string α. All nodes on the root-to-v'' path are of this type, except for the single node (if it exists) with path-label S(i). Now S(i)α is a prefix of Suff_i and of Suff_k for some k > i, and this string extends differently in the two cases. Since v' is on the path from the root to leaf i+1, α is a prefix of Suff_{i+1}, and there must be a node (possibly the root) with path-label α on the path to v' in T_{i+1}. Hence the path to v' has a node corresponding to every node on the path to v'', except the node (if it exists) with path-label S(i). Hence the depth of v'' is at most one more than the depth of v', although it could be less. □
Theorem 6.2.4. Assuming a finite alphabet, Weiner's algorithm constructs the suffix tree for a string of length m in O(m) time.
Figure 6.11: Generalized suffix tree for strings S1 = xabxa and S2 = babxba. The first number at a leaf indicates the string; the second number indicates the starting position of the suffix in that string.
in theory by suffix trees, where the typical string size is in the hundreds of thousands, or even millions, and/or where the alphabet size is in the hundreds. For those problems, a "linear" time and space bound is not sufficient assurance of practicality. For large trees, paging can also be a serious problem because the trees do not have nice locality properties. Indeed, by design, suffix links allow an algorithm to move quickly from one part of the tree to a distant part of the tree. This is great for worst-case time bounds, but it is horrible for paging if the tree isn't entirely in memory. Consequently, implementing suffix trees to reduce practical space use can be a serious concern. The comments made here for suffix trees apply as well to keyword trees used in the Aho-Corasick method.

The main design issues in all three algorithms are how to represent and search the branches out of the nodes of the tree and how to represent the indicator and link vectors in Weiner's algorithm. A practical design must balance the constraints of space against the need for speed, both in building the tree and in using it afterwards. We will discuss representing tree edges, since the vector issues for Weiner's algorithm are identical.

There are four basic choices possible to represent branches. The simplest is to use an array of size Θ(|Σ|) at each nonleaf node v. The array at v is indexed by single characters of the alphabet; the cell indexed by character x has a pointer to a child of v if there is an edge out of v whose edge-label begins with character x, and is otherwise null. If there is such an edge, then the cell should also hold the two indices representing its edge-label. This array allows constant-time random accesses and updates and, although simple to program, it can use an impractical amount of space as |Σ| and m get large.

An alternative to the array is to use a linked list at node v of characters that appear at the beginning of edge-labels out of v. When a new edge from v is added to the tree, a new character (the first character on the new edge-label) is added to the list. Traversals from node v are implemented by sequentially searching the list for the appropriate character. Since the list is searched sequentially, it costs no more to keep it in sorted order. This somewhat reduces the average time to search for a given character and thus speeds up (in practice) the construction of the tree. The key point is that it allows a faster termination of a search for a character that is not in the list. Keeping the list in sorted order will be particularly useful in some of the applications of suffix trees to be discussed later.
A very different approach to limiting space, based on changing the suffix tree into a different data structure called a suffix array, will be discussed in Section 7.14.
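To make the first two options concrete, here is a small sketch contrasting an array indexed by character rank with a sorted list searched sequentially; the alphabet, class names, and methods are all illustrative, not from the text.

    import bisect

    SIGMA = "acgt"                              # an example fixed alphabet

    class ArrayNode:
        """Theta(|Sigma|) space at every nonleaf node; constant-time access."""
        def __init__(self):
            self.child = [None] * len(SIGMA)    # indexed by character rank
        def get(self, x):
            return self.child[SIGMA.index(x)]
        def put(self, x, node):
            self.child[SIGMA.index(x)] = node

    class SortedListNode:
        """Space proportional to the number of children; sequential search
        that can stop early because the list is kept in sorted order."""
        def __init__(self):
            self.edges = []                     # sorted (first character, child)
        def get(self, x):
            for c, node in self.edges:
                if c == x:
                    return node
                if c > x:                       # sorted order: x cannot appear later
                    break
            return None
        def put(self, x, node):
            # first characters are distinct among a node's edges, so the
            # tuple comparison never falls through to comparing nodes
            bisect.insort(self.edges, (x, node))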
many in molecular biology, space is more of a constraint than is time), the size of the suffix tree for a string may dictate using the solution that builds the smaller suffix tree. So despite the added conceptual burden, we will discuss such space-reducing alternatives in some detail throughout the book.
6.5.1. Alphabet independence: all linears are equal, but some are more equal than others
The key implementation problems discussed above are all related to multiple edges (or links) at nodes. These are influenced by the size of the alphabet Σ: the larger the alphabet, the larger the problem. For that reason, some people prefer to explicitly reflect the alphabet size in the time and space bounds of keyword and suffix tree algorithms. Those people usually refer to the construction time for keyword or suffix trees as O(m log |Σ|), where m is the size of all the patterns in a keyword tree or the size of the string in a suffix tree. More completely, the Aho-Corasick, Weiner, Ukkonen, and McCreight algorithms all either require Θ(m|Σ|) space, or the O(m) time bound should be replaced with the minimum of O(m log m) and O(m log |Σ|). Similarly, searching for a pattern P using a suffix tree can be done with O(|P|) comparisons only if we use Θ(m|Σ|) space; otherwise we must allow the minimum of O(|P| log m) and O(|P| log |Σ|) comparisons during a search for P.

In contrast, the exact matching method using Z values has worst-case space and comparison requirements that are alphabet independent: the worst-case number of comparisons (either characters or numbers) used to compute Z values is uninfluenced by the size of the alphabet. Moreover, when two characters are compared, the method only checks whether the characters are equal or unequal, not whether one character precedes the other in some ordering. Hence no prior knowledge about the alphabet need be assumed. These properties are also true of the Knuth-Morris-Pratt and the Boyer-Moore algorithms. The alphabet independence of these algorithms makes their linear time and space bounds superior, in some people's view, to the linear time and space bounds of keyword and suffix tree algorithms: "All linears are equal but some are more equal than others".

Alphabet-independent algorithms have also been developed for a number of problems other than exact matching. Two-dimensional exact matching is one such example. The method presented in Section 3.5.3 for two-dimensional matching is based on keyword trees and hence is not alphabet independent. Nevertheless, alphabet-independent solutions for that problem have been developed. Generally, alphabet-independent methods are more complex than their coarser counterparts. In this book we will not consider alphabet independence much further, although we will discuss other approaches to reducing space that can be employed if large alphabets cause excessive space use.
6.6. Exercises
1. Construct an infinite family of strings over a fixed alphabet, where the total length of the edge-labels on their suffix trees grows faster than O(m) (m is the length of the string). That is, show that linear-time suffix tree algorithms would be impossible if edge-labels were written explicitly on the edges.
2. In the text, we first introduced Ukkonen's algorithm at a high level and noted that it could be implemented in O(m³) time. That time was then reduced to O(m²) with the use of suffix links and the skip/count trick. An alternative way to reduce the O(m³) time to O(m²) (without suffix links or skip/count) is to keep a pointer to the end of each suffix of S[1..i].
Keeping a linked list at node v works well if the number of children of v is small, but in the worst case it adds time |Σ| to every node operation. The O(m) worst-case time bounds are preserved since |Σ| is assumed to be fixed, but if the number of children of v is large then little space is saved over the array while noticeably degrading performance.

A third choice, a compromise between space and speed, is to implement the list at node v as some sort of balanced tree [10]. Additions and searches then take O(log k) time and O(k) space, where k is the number of children of v. Due to the space and programming overhead of these methods, this alternative makes sense only when k is fairly large.

The final choice is some sort of hashing scheme. Again, the challenge is to find a scheme balancing space with speed, but for large trees and alphabets hashing is very attractive, at least for some of the nodes. And, using perfect hashing techniques [167], the linear worst-case time bound can even be preserved.

When m and Σ are large enough to make implementation difficult, the best design is probably a mixture of the above choices. Nodes near the root of the tree tend to have the most children (the root has a child for every distinct character appearing in S), and so arrays are a sensible choice at those nodes. In addition, if the tree is dense for several levels below the root, then those levels can be condensed and eliminated from the explicit tree. For example, there are 20⁵ possible amino acid substrings of length five. Every one of these substrings exists in some known protein sequence already in the databases. Therefore, when implementing a suffix tree for the protein database, one can replace the first five levels of the tree with a five-dimensional array (indexed by substrings of length five), where an entry of the array points to the place in the remaining tree that extends the five-tuple. The same idea has been applied [320] to depth seven for DNA data. Nodes in the suffix tree toward the leaves tend to have few children, and lists there are attractive. At the extreme, if w is a leaf and v is its parent, then information about w may be brought up to v, removing the need for explicit representation of the edge (v, w) or the node w. Depending on the other implementation choices, this can lead to a large savings in space since roughly half the nodes in a suffix tree are leaves. A suffix tree whose leaves are deleted in this way is called a position tree. In a position tree, there is a one-to-one correspondence between leaves of the tree and substrings that are uniquely occurring in S. For nodes in the middle of a suffix tree, hashing or balanced trees may be the best choice.

Fortunately, most large suffix trees are used in applications where S is fixed (a dictionary or database) for some time and the suffix tree will be used repeatedly. In those applications, one has the time and motivation to experiment with different implementation choices. For a more in-depth look at suffix tree implementation issues, and other suggested variants of suffix trees, see [23]. Whatever implementation is selected, it is clear that a suffix tree for a string will take considerably more space than the representation of the string itself. Later in the book we will discuss several problems involving two (or more) strings P and T, where two O(|P| + |T|) time solutions exist, one using a suffix tree for P and one using a suffix tree for T.
We will also have examples where equally time-efficient solutions exist, but where one uses a generalized suffix tree for two or more strings and the other uses just a suffix tree for the smaller string. In asymptotic worst-case time and space, neither approach is superior to the other, and usually the approach that builds the larger tree is conceptually simpler. However, when space is a serious practical concern (and in many problems, including
Although we have built suffix trees for DNA and amino acid strings more than one million characters long that can be completely contained in the main memory of a moderate-size workstation.
14. Suppose one must dynamically maintain a suffix tree for a string that is growing or contracting. Discuss how to do this efficiently if the string is growing (contracting) on the left end, and how to do it if the string is growing (contracting) on the right end.
Can either Weiner's algorithm or Ukkonen's algorithm efficiently handle both changes to the right and to the left ends of the string? What would be wrong in reversing the string so that a change on the left end is "simulated" by a change on the right end?
15. Consider the previous problem where the changes are in the interior of the string. If you cannot find an efficient solution to updating the suffix tree, explain what the technical issues are and why this seems like a difficult problem.
16. Consider a generalized suffix tree built for a set of k strings. Additional strings may be added to the set, or entire strings may be deleted from the set. This is the common case for maintaining a generalized suffix tree for biological sequence data [320]. Discuss the problem of maintaining the generalized suffix tree in this dynamic setting. Explain why this problem has a much easier solution than when arbitrary substrings represented in the suffix tree are deleted.
3. The relationship between the suffix tree for a string S and for the reverse string S' is not obvious. However, there is a significant relationship between the two trees. Find it, state it, and prove it.
Hint: Suffix links help.
4. Can Ukkonen's algorithm be implemented in linear time without using suffix links? The idea is to maintain, for each index i, a pointer to the node in the current implicit suffix tree that is closest to the end of suffix i.
5. In Trick 3 of Ukkonen's algorithm, the symbol "e" is used as the second index on the label of every leaf edge, and in phase i+1 the global variable e is set to i+1. An alternative to using "e" is to set the second index on any leaf edge to m (the total length of S) at the point that the leaf edge is created. In that way, no work is required to update that second index. Explain in detail why this is correct, and discuss any disadvantages there may be in this approach, compared to using the symbol "e".
6. Ukkonen's algorithm builds all the implicit suffix trees I_1 through I_m in order and on-line, all in O(m) time. Thus it can be called a linear-time on-line algorithm to construct implicit suffix trees.
(Open question) Find an on-line algorithm running in O(m) total time that creates all the true suffix trees. Since the time taken to explicitly store these trees is O(m²), such an algorithm would (like Ukkonen's algorithm) update each tree without saving it.
7. Ukkonen's algorithm builds all the implicit suffix trees in O(m) time. This sequence of implicit suffix trees may expose more information about S than does the single final suffix tree for S. Find a problem that can be solved more efficiently with the sequence of implicit suffix trees than with the single suffix tree. Note that the algorithm cannot save the implicit suffix trees and hence the problem will have to be solved in parallel with the construction of the implicit suffix trees.
8. The naive Weiner algorithm for constructing the suffix tree of S (Section 6.2.1) can be described in terms of the Aho-Corasick algorithm of Section 3.4: Given string S of length m, append $ and let P be the set of patterns consisting of the m+1 suffixes of string S$. Then build a keyword tree for set P using the Aho-Corasick algorithm. Removing the backlinks gives the suffix tree for S. The time for this construction is O(m²). Yet, in our discussion of Aho-Corasick, that method was considered as a linear time method. Resolve this apparent contradiction.
9. Make explicit the relationship between link pointers in Weiner's algorithm and suffix links in Ukkonen's algorithm.

10. The time analyses of Ukkonen's algorithm and of Weiner's algorithm both rely on watching how the current node-depth changes, and the arguments are almost perfectly symmetric. Examine these two algorithms and arguments closely to make explicit the similarities and differences in the analysis. Is there some higher-level analysis that might establish the time bounds of both the algorithms at once?
11. Empirically evaluate different implementation choices for representing the branches out of the nodes and the vectors needed in Weiner's algorithm. Pay particular attention to the effect of alphabet size and string length, and consider both time and space issues in building the suffix tree and in using it afterwards.
12. By using implementation tricks similar to those used in Ukkonen's algorithm (particularly, suffix links and skip/count), give a linear-time implementation for McCreight's algorithm.

13. Flesh out the relationship between McCreight's algorithm and Ukkonen's algorithm, when they both are implemented in linear time.
search can be done in O(m) time whenever a text T is specified. Can suffix trees be used in this scenario to achieve the same time bounds? Although it is not obvious, the answer is "yes". This reverse use of suffix trees will be discussed along with a more general problem in Section 7.8. Thus for the exact matching problem (single pattern), suffix trees can be used to achieve the same time and space bounds as Knuth-Morris-Pratt and Boyer-Moore when the pattern is known first or when the pattern and text are known together, but they achieve vastly superior performance in the important case that the text is known first and held fixed, while the patterns vary.
7.2. APL2: Suffix trees and the exact set matching problem
Section 3.4 discussed the exact set matching problem, the problem of finding all occurrences from a set of strings P in a text T, where the set is input all at once. There we developed a linear-time solution due to Aho and Corasick. Recall that set P is of total length n and that text T is of length m. The Aho-Corasick method finds all occurrences in T of any pattern from P in O(n + m + k) time, where k is the number of occurrences. This same time bound is easily achieved using a suffix tree for T. In fact, we saw in the previous section that when T is first known and fixed and the pattern P varies, all occurrences of any specific P (of length n) in T can be found in O(n + k_P) time, where k_P is the number of occurrences of P. Thus the exact set matching problem is actually a simpler case because the set P is input at the same time the text is known. To solve it, we build a suffix tree for T in O(m) time and then use this tree to successively search for all occurrences of each pattern in P. The total time needed in this approach is O(n + m + k).
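A minimal sketch of this approach follows: a naive quadratic-time builder (standing in for the linear-time constructions of Chapter 6) preprocesses T once, and each pattern is then matched along a unique path from the root, with the leaf numbers below the stopping point giving its occurrences. All names are illustrative, not from the text.

    def insert_suffix(node, suf, i):
        while True:
            if suf[0] not in node:
                node[suf[0]] = (suf, i)              # leaf: store the start index i
                return
            label, child = node[suf[0]]
            j = 0
            while j < len(label) and label[j] == suf[j]:
                j += 1
            if j == len(label):                      # edge exhausted: descend
                node, suf = child, suf[j:]
            else:                                    # split the edge at the mismatch
                node[suf[0]] = (label[:j], {label[j]: (label[j:], child),
                                            suf[j]: (suf[j:], i)})
                return

    def build_suffix_tree(t):
        t += "$"                                     # unique termination character
        root = {}
        for i in range(len(t)):
            insert_suffix(root, t[i:], i)
        return root

    def occurrences(root, p):
        node = root
        while p:                                     # walk p down a unique path
            if p[0] not in node:
                return []
            label, child = node[p[0]]
            k = min(len(label), len(p))
            if label[:k] != p[:k]:
                return []
            node, p = child, p[k:]
        if not isinstance(node, dict):               # ended inside a leaf edge
            return [node]
        out, stack = [], [node]                      # collect all leaves below
        while stack:
            for label, child in stack.pop().values():
                if isinstance(child, dict):
                    stack.append(child)
                else:
                    out.append(child)
        return sorted(out)

    tree = build_suffix_tree("xabxac")               # preprocess the text once
    for pat in ["xa", "ab", "ba"]:
        print(pat, occurrences(tree, pat))           # [0, 3] / [1] / []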
7.2.1. Comparing suffix trees and keyword trees for exact set matching
Here we compare the relative advantages of keyword trees versus suffix trees for the exact set matching problem. Although the asymptotic time and space bounds for the two methods are the same when both the set P and the string T are specified together, one method may be preferable to the other depending on the relative sizes of P and T and on which string can be preprocessed. The Aho-Corasick method uses a keyword tree of size O(n), built in O(n) time, and then carries out the search in O(m) time. In contrast, the suffix tree is of size O(m), takes O(m) time to build, and is used to search in O(n) time. The constant terms for the space bounds and for the search times depend on the specific way the trees are represented (see Section 6.5), but they are certainly large enough to affect practical performance.

In the case that the set of patterns is larger than the text, the suffix tree approach uses less space but takes more time to search. (As discussed in Section 3.5.1, there are applications in molecular biology where the pattern library is much larger than the typical texts presented after the library is fixed.) When the total size of the patterns is smaller than the text, the Aho-Corasick method uses less space than a suffix tree, but the suffix tree uses less search time. Hence, there is a time/space trade-off and neither method is uniformly superior to the other in time and space. Determining the relative advantages of Aho-Corasick versus suffix trees when the text is fixed and the set of patterns varies is left to the reader.

There is one way that suffix trees are better, or more robust, than keyword trees for the exact set matching problem (in addition to other problems). We will show in Section 7.8 how to use a suffix tree to solve the exact set matching problem in exactly the same time
7
First Applications of Suffix Trees
We will see many applications of suffix trees throughout the book. Most of these applications allow surprisingly efficient, linear-time solutions to complex string problems. Some of the most impressive applications need an additional tool, the constant-time lowest common ancestor algorithm, and so are deferred until that algorithm has been discussed (in Chapter 8). Other applications arise in the context of specific problems that will be discussed in detail later. But there are many applications we can now discuss that illustrate the power and utility of suffix trees. In this chapter and in the exercises at its end, several of these applications will be explored.

Perhaps the best way to appreciate the power of suffix trees is for the reader to spend some time trying to solve the problems discussed below, without using suffix trees. Without this effort or without some historical perspective, the availability of suffix trees may make certain of the problems appear trivial, even though linear-time algorithms for those problems were unknown before the advent of suffix trees. The longest common substring problem discussed in Section 7.4 is one clear example, where Knuth had conjectured that a linear-time algorithm would not be possible [24, 278], but where such an algorithm is immediate with the use of suffix trees. Another classic example is the longest prefix repeat problem discussed in the exercises, where a linear-time solution using suffix trees is easy, but where the best prior method ran in O(n log n) time.
Theorem 7.4.1. The longest common substring of two strings can be found in linear time using a generalized suffix tree.
Although the longest common substring problem looks trivial now, given our knowledge of suffix trees, it is very interesting to note that in 1970 Don Knuth conjectured that a linear-time algorithm for this problem would be impossible [24, 278]. We will return to this problem in Section 7.9, giving a more space-efficient solution. Now recall the problem of identifying human remains mentioned in Section 7.3. That problem reduced to finding the longest substring in one fixed string that is also in some string in a database of strings. A solution to that problem is an immediate extension of the longest common substring problem and is left to the reader.
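For concreteness, here is a small sketch of the method behind Theorem 7.4.1: build a generalized suffix tree for S1#S2$ (with the same naive insertion used in the earlier sketch, for brevity) and report the path-label of the deepest node whose subtree contains suffixes of both strings. The separator characters and all names are illustrative, not from the text.

    def insert_suffix(node, suf, i):
        # same naive insertion as in the earlier sketch
        while True:
            if suf[0] not in node:
                node[suf[0]] = (suf, i)          # leaf: store the start index i
                return
            label, child = node[suf[0]]
            j = 0
            while j < len(label) and label[j] == suf[j]:
                j += 1
            if j == len(label):
                node, suf = child, suf[j:]
            else:
                node[suf[0]] = (label[:j], {label[j]: (label[j:], child),
                                            suf[j]: (suf[j:], i)})
                return

    def deepest_common(node, path, sep, best):
        """Return flags telling whether suffixes of S1 and of S2 occur below
        node; remember the deepest (string-depth) node flagged with both."""
        has1 = has2 = False
        for label, child in node.values():
            if isinstance(child, dict):
                a, b = deepest_common(child, path + label, sep, best)
            else:
                # leaf: child is a start index; starts at or before the '#'
                # belong to S1, later starts to S2 (the '#' leaf itself hangs
                # off the root, so this classification is harmless)
                a, b = child <= sep, child > sep
            has1, has2 = has1 or a, has2 or b
        if has1 and has2 and len(path) > len(best[0]):
            best[0] = path
        return has1, has2

    def longest_common_substring(s1, s2):
        s = s1 + "#" + s2 + "$"
        root = {}
        for i in range(len(s)):
            insert_suffix(root, s[i:], i)
        best = [""]
        deepest_common(root, "", len(s1), best)
        return best[0]

    print(longest_common_substring("superiorcalifornialives", "sealiver"))  # alive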
and space bounds as for the Aho-Corasick method: O(n) for preprocessing and O(m) for search. This is the reverse of the bounds shown above for suffix trees. The time/space trade-off remains, but a suffix tree can be used for either of the chosen time/space combinations, whereas no such choice is available for a keyword tree.
given length l. These substrings are candidates for unwanted pieces of S2 that have contaminated the desired DNA string.

This problem can easily be solved in linear time by extending the approach discussed above for the longest common substring of two strings. Build a generalized suffix tree for S1 and S2. Then mark each internal node that has in its subtree a leaf representing a suffix of S1 and also a leaf representing a suffix of S2. Finally, report all marked nodes that have string-depth of l or greater. If v is such a marked node, then the path-label of v is a suspicious string that may be contaminating the desired DNA string. If there are no marked nodes with string-depth above the threshold l, then one can have greater confidence (but not certainty) that the DNA has not been contaminated by the known contaminants.

More generally, one has an entire set of known DNA strings that might contaminate a desired DNA string. The problem now is to determine if the DNA string in hand has any sufficiently long substrings (say length l or more) from the known set of possible contaminants. The approach in this case is to build a generalized suffix tree for the set P of possible contaminants together with S1, and then mark every internal node that has a leaf in its subtree representing a suffix from S1 and a leaf representing a suffix from a pattern in P. All marked nodes of string-depth l or more identify suspicious substrings.

Generalized suffix trees can be built in time proportional to the total length of the strings in the tree, and all the other marking and searching tasks described above can be performed in linear time by standard tree traversal methods. Hence suffix trees can be used to solve the contamination problem in linear time. In contrast, it is not clear if the Aho-Corasick algorithm can solve the problem in linear time, since that algorithm is designed to search for occurrences of full patterns from P in S1, rather than for substrings of patterns. As in the longest common substring problem, there is a more space-efficient solution to the contamination problem, based on the material in Section 7.8. We leave this to the reader.
by a fragment (substring) of a vector (DNA string) used to incorporate the desired DNA in a host organism, or the contamination is from the DNA of the host itself (for example bacteria or yeast). Contamination can also come from very small amounts of undesired foreign DNA that gets physically mixed into the desired DNA and then amplified by PCR (the polymerase chain reaction) used to make copies of the desired DNA. Without going into these and other specific ways that contamination occurs, we refer to the general phenomenon as DNA contamination.

Contamination is an extremely serious problem, and there have been embarrassing occurrences of large-scale DNA sequencing efforts where the use of highly contaminated clone libraries resulted in a huge amount of wasted sequencing. Similarly, the announcement a few years ago that DNA had been successfully extracted from dinosaur bone is now viewed as premature at best. The "extracted" DNA sequences were shown, through DNA database searching, to be more similar to mammal DNA (particularly human) [2] than to bird and crocodilian DNA, suggesting that much of the DNA in hand was from human contamination and not from dinosaurs. Dr. S. Blair Hedges, one of the critics of the dinosaur claims, stated: "In looking for dinosaur DNA we all sometimes find material that at first looks like dinosaur genes but later turns out to be human contamination, so we move on to other things. But this one was published." [50]

These embarrassments might have been avoided if the sequences had been examined early for signs of likely contaminants, before large-scale analysis was performed or results published. Russell Doolittle [129] writes "... On a less happy note, more than a few studies have been curtailed when a preliminary search of the sequence revealed it to be a common contaminant ... used in purification. As a rule, then, the experimentalist should search early and often". Clearly, it is important to know whether the DNA of interest has been contaminated. Besides the general issue of the accuracy of the sequence finally obtained, contamination can greatly complicate the task of shotgun sequence assembly (discussed in Sections 16.14 and 16.15), in which short strings of sequenced DNA are assembled into long strings by looking for overlapping substrings.

Often, the DNA sequences from many of the possible contaminants are known. These include cloning vectors, PCR primers, the complete genomic sequence of the host organism (yeast, for example), and other DNA sources being worked with in the laboratory. (The dinosaur story doesn't quite fit here because there isn't yet a substantial transcript of human DNA.) A good illustration comes from the study of the nematode C. elegans, one of the key model organisms of molecular biology. In discussing the need to use YACs (Yeast Artificial Chromosomes) to sequence the C. elegans genome, the contamination problem and its potential solution are stated as follows:
The main difficulty is the unavoidable contamination of purified YACs by substantial amounts of DNA from the yeast host, leading to much wasted time in sequencing and assembling irrelevant yeast sequences. However, this difficulty should be eliminated (using). . . the complete (yeast) sequence. . . It will then become possible to discard instantly all sequencing reads that are recognizable as yeast DNA and focus exclusively on C. elegans DNA. [225]
DNA contamination problem Given a string S1 (the newly isolated and sequenced string of DNA) and a known string S2 (the combined sources of possible contamination), find all substrings of S2 that occur in S1 and that are longer than some
Figure 7.1: Suffix tree for string xyxaxaxa without suffix links shown.
set of strings, find substrings "common" to a large number of those strings. The word "common" here means "occurring with equality". A more difficult problem is to find "similar" substrings in many given strings, where "similar" allows a small number of differences. Problems of this type will be discussed in Part III.
Definition For each k between 2 and K, we define l(k) to be the length of the longest substring common to at least k of the strings.
We want to compute a table of K - 1 entries, where entry k gives l(k) and also points to one of the common substrings of that length. For example, consider the set of strings {sandollar, sandlot, handler, grand, pantry}. Then the l(k) values (without pointers to the strings) are:

    k    l(k)    one substring
    2    4       sand
    3    3       and
    4    3       and
    5    2       an

Surprisingly, the problem can be solved in linear, O(n), time [236]. It really is amazing that so much information about the contents and substructure of the strings can be extracted in time proportional to the time needed just to read in the strings. The linear-time algorithm will be fully discussed in Chapter 9, after the constant-time lowest common ancestor method has been discussed.

To prepare for the O(n) result, we show here how to solve the problem in O(Kn) time. That time bound is also nontrivial but is achieved by a generalization of the longest common substring method for two strings. First, build a generalized suffix tree T for the K strings. Each leaf of the tree represents a suffix from one of the K strings and is marked with one of K unique string identifiers, 1 to K, to indicate which string the suffix is from. Each of the K strings is given a distinct termination symbol, so that identical suffixes appearing in more than one string end at distinct leaves in the generalized suffix tree. Hence, each leaf in T has only one string identifier.
Definition For every internal node v of T, define C(v) to be the number of distinct string identifiers that appear at the leaves in the subtree of v.
Once the C(v) numbers are known, and the string-depth of every node is known, the desired l(k) values can be easily accumulated with a linear-time traversal of the tree. That traversal builds a vector V where, for each value of k from 2 to K, V(k) holds the string-depth (and location if desired) of the deepest (string-depth) node v encountered with C(v) = k. (When encountering a node v with C(v) = k, compare the string-depth of v to the current value of V(k), and if v's depth is greater than V(k), change V(k) to the depth of v.) Essentially, V(k) reports the length of the longest string that occurs in exactly k of the strings. Therefore, V(k) ≤ l(k). To find l(k), simply scan V from largest to smallest index, writing into each position the maximum V(k) value seen. That is, if V(k) is empty or V(k) < V(k+1), then set V(k) to V(k+1). The resulting vector holds the desired l(k) values.
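The final scan is small enough to show directly; the sketch below assumes V has already been filled in by the tree traversal just described, with the example values taken from the {sandollar, sandlot, handler, grand, pantry} table above. Names are illustrative.

    def finalize_lk(V, K):
        """V maps k -> deepest string-depth seen with C(v) = k (missing = 0);
        scanning from the largest index down turns V into the l(k) table."""
        lk, best = {}, 0
        for k in range(K, 1, -1):          # scan from largest to smallest index
            best = max(best, V.get(k, 0))  # max of V(k'), over all k' >= k
            lk[k] = best
        return lk

    print(finalize_lk({2: 4, 3: 3, 4: 3, 5: 2}, 5))
    # {5: 2, 4: 3, 3: 3, 2: 4}, matching the l(k) table given earlier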
By the same reasoning, if there is a path of suffix links from p to q going through a node v, then the number of leaves in the subtree of v must be at least as large as the number in the subtree of p and no larger than the number in the subtree of q. It follows that if p and q have the same number of leaves in their subtrees, then all the subtrees below nodes on the path have the same number of leaves, and all these subtrees are isomorphic to each other.

For the converse side, suppose that the subtrees of p and q are isomorphic. Clearly then they have the same number of leaves. We will show that there is a directed path of suffix links between p and q. Let α be the path-label of p and β be the path-label of q, and assume that |β| ≤ |α|. Since β ≠ α, if β is a suffix of α it must be a proper suffix. And, if β is a proper suffix of α, then by the properties of suffix links, there is a directed path of suffix links from p to q, and the theorem would be proved. So we will prove, by contradiction, that β must be a suffix of α.

Suppose β is not a suffix of α. Consider any occurrence of α in T and let γ be the suffix of T just to the right of that occurrence of α. That means that αγ is a suffix of T and there is a path labeled γ running from node p to a leaf in the suffix tree. Now since β is not a suffix of α, no suffix of T that starts just after an occurrence of β can have length |γ|, and therefore there is no path of length |γ| from q to a leaf. But that implies that the subtrees rooted at p and at q are not isomorphic, which is a contradiction.
Definition Let Q be the set of all pairs (p, q) such that (a) there exists a suffix link from p to q in T, and (b) p and q have the same number of leaves in their respective subtrees.
begin
    Identify the set Q of pairs (p, q) such that there is a suffix link from p to q and the number of leaves in their respective subtrees is equal.
    While there is a pair (p, q) in Q and both p and q are in the current DAG,
        Merge node p into q.
end

The "correctness" of the resulting DAG is stated formally in the following theorem.
Theorem 7.7.2. Let T be the suffix tree for an input string S, and let D be the DAG resulting from running the compaction algorithm on T. Any directed path in D from the root enumerates a substring of S, and every substring of S is enumerated by some such path. Therefore, the problem of determining whether a string is a substring of S can be solved in linear time using D instead of T.
DAG D can be used to determine whether a pattern occurs in a text, but the graph seems to lose the location(s) where the pattern begins. It is possible, however, to add simple (linear-space) information to the graph so that the locations of all the occurrences can also be recovered when the graph is traversed. We address this issue in Exercise 10. It may be surprising that, in the algorithm, pairs are merged in arbitrary order. We leave the correctness of this, a necessary part of the proof of Theorem 7.7.2, as an exercise. As a practical matter it makes sense to merge top-down, never merging two nodes that have ancestors in the suffix tree that can be merged.
Figure 7.2: A directed acyclic graph used to recognize substrings of xyxaxaxa.
be used to solve the exact matching problem in the same way a suffix tree is used. The algorithm matches characters of the pattern against a unique path from the root of the graph; the pattern occurs somewhere in the text if and only if all the characters of the pattern are matched along the path. However, the leaf numbers reachable from the end of the path may no longer give the exact starting positions of the occurrences. This issue will be addressed in Exercise 10.

Since the graph is a DAG after the first merge, the algorithm must know how to merge nodes in a DAG as well as in a tree. The general merge operation for both trees and DAGs is stated in the following way: A merge of node p into node q means that all edges out of p are removed, that the edges into p are directed to q but have their original respective edge-labels, and that any part of the graph that is now unreachable from the root is removed. Although the merges generally occur in a DAG, the criteria used to determine which nodes to merge remain tied to the original suffix tree: node p can be merged into q if the edge-labeled subtree of p is isomorphic to the edge-labeled subtree of q in the suffix tree. Moreover, p can be merged into q, or q into p, only if the two subtrees are isomorphic. So the key algorithmic issue is how to find isomorphic subtrees in the suffix tree. There are general algorithms for subtree isomorphism, but suffix trees have additional structure making isomorphism detection much simpler.
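A toy sketch of the merge operation itself is given below, on a DAG stored as a dict from node id to its labeled out-edges (leaves have empty dicts). Deciding which pairs (p, q) may be merged is the suffix-tree isomorphism question discussed next and is assumed to have been settled already; all names are illustrative.

    def merge(graph, root, p, q):
        """Merge node p into node q: remove p's out-edges, redirect edges
        into p toward q (keeping their labels), prune unreachable nodes."""
        del graph[p]                              # removes p and its out-edges
        for edges in graph.values():
            for label, child in list(edges.items()):
                if child == p:
                    edges[label] = q              # same label, new target
        reachable, stack = set(), [root]          # prune what the root can't reach
        while stack:
            u = stack.pop()
            if u not in reachable:
                reachable.add(u)
                stack.extend(graph[u].values())
        for u in list(graph):
            if u not in reachable:
                del graph[u]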
Theorem 7.7.1. In a suffix tree T, the edge-labeled subtree below a node p is isomorphic to the subtree below a node q if and only if there is a directed path of suffix links from one node to the other node, and the number of leaves in the two subtrees is equal.
PROOF
First suppose p has a direct suffix link to q and those two nodes have the same number of leaves in their subtrees. Since there is a suffix link from p to q, node p has path-label xα while q has path-label α. For every leaf numbered i in the subtree of p there is a leaf numbered i+1 in the subtree of q, since the suffix of T starting at i begins with xα only if the suffix of T starting at i+1 begins with α. Therefore, for every (labeled) path from p to a leaf in its subtree, there is an identical path (with the same labeled edges) from q to a leaf in its subtree. Now the numbers of leaves in the subtrees of p and q are assumed to be equal, so every path out of q is identical to some path out of p, and hence the two subtrees are isomorphic.
Thus the problem of finding the matching statistics is a generalization of the exact matching problem.
Matching statistics can be used to reduce the size of the suffix tree needed in solutions to problems more complex than exact matching. This use of matching statistics will probably be more important than their use to duplicate the preprocessing/search bounds of Knuth-Morris-Pratt and Aho-Corasick. The first example of space reduction using matching statistics will be given in Section 7.9. Matching statistics are also used in a variety of other applications described in the book. One advertisement we give here is to say that matching statistics are central to a fast approximate matching method designed for rapid database searching. This will be detailed in Section 12.3.3. Thus matching statistics provide one bridge between exact matching methods and problems of approximate string matching.
7.8. APL8: A reverse role for suffix trees, and major space reduction
We have previously shown how suffix trees can be used to solve the exact matching problem with O(m) preprocessing time and space (building a suffix tree of size O(m) for the text T) and O(n + k) search time (where n is the length of the pattern and k is the number of occurrences). We have also seen how suffix trees are used to solve the exact set matching problem in the same time and space bounds (n is now the total size of all the patterns in the set). In contrast, the Knuth-Morris-Pratt (or Boyer-Moore) method preprocesses the pattern in O(n) time and space, and then searches in O(m) time. The Aho-Corasick method achieves similar bounds for the set matching problem.

Asymptotically, the suffix tree methods that preprocess the text are as efficient as the methods that preprocess the pattern: both run in O(n + m) time and use O(n + m) space (they have to represent the strings). However, the practical constants on the time and space bounds for suffix trees often make their use unattractive compared to the other methods. Moreover, the situation sometimes arises that the pattern(s) will be given first and held fixed while the text varies. In those cases it is clearly superior to preprocess the pattern(s). So the question arises of whether we can solve those problems by building a suffix tree for the pattern(s), not the text. This is the reverse of the normal use of suffix trees. In Sections 5.3 and 7.2.1 we mentioned that such a reverse role was possible, thereby using suffix trees to achieve exactly the same time and space bounds (preprocessing versus search time and space) as in the Knuth-Morris-Pratt or Aho-Corasick methods. To explain this, we will develop a result due to Chang and Lawler [94], who solved a somewhat more general problem, called the matching statistics problem.
node v with the leaf number of one of the leaves in its subtree. This takes time linear in the size of T. Then, when using T to find each ms(i), if the search stops at a node u, the desired p(i) is the suffix number written at u; otherwise (when the search stops on an edge (u, v)), p(i) is the suffix number written at node v.
Back to STSs
Recall the discussion of STSs in Section 3.5.1. There it was mentioned that, because of errors, exact matching may not be an appropriate way to find STSs in new sequences. But since the number of sequencing errors is generally small, we can expect long regions of agreement between a new DNA sequence and any STS it (ideally) contains. Those regions of agreement should allow the correct identification of the STSs it contains. Using a (precomputed) generalized suffix tree for the STSs (which play the role of P), compute matching statistics for the new DNA sequence (which is T) and the set of STSs. Generally, the pointer p ( i ) will point to the appropriate STS in the suffix tree. We leave it to the reader to flesh out the details. Note that when given a new sequence, the time for the computation is just proportional to the length of the new sequence.
Definition Given two strings Si and Sj, any suffix of Si that matches a prefix of Sj is called a suffix-prefix match of Si, Sj.

Given a collection of strings S = S1, S2, ..., Sk of total length m, the all-pairs suffix-prefix problem is the problem of finding, for each ordered pair Si, Sj in S, the longest suffix-prefix match of Si, Sj.
no further matches are possible. In either case, ms(i+1) is the string-depth of the ending position. Note that the character comparisons done after reaching the end of the β path begin either with the same character in T that ended the search for ms(i) or with the next character in T, depending on whether that search ended with a mismatch or at a leaf. There is one special case that can arise in computing ms(i+1). If ms(i) = 1 or ms(i) = 0 (so that the algorithm is at the root), and T(i+1) is not in P, then ms(i+1) = 0.
Theorem 7.8.1. Using only a suffix tree for P and a copy of T, all the m matching statistics can be found in O(m) time.
PROOF
The search for any ms(i+1) begins by backing up at most one edge from position b to a node v and traversing one suffix link to node s(v). From s(v) a β path is traversed in time proportional to the number of nodes on it, and then a certain number of additional character comparisons are done. The backup and link traversals take constant time per i and so take O(m) time over the entire algorithm.

To bound the total time to traverse the various β paths, recall the notion of current node-depth from the time analysis of Ukkonen's algorithm (page 102). There it was proved that a link traversal reduces the current depth by at most one (Lemma 6.1.2), and since each backup reduces the current depth by one, the total decrements to current depth cannot exceed 2m. But since current depth cannot exceed m or become negative, the total increments to current depth are bounded by 3m. Therefore, the total time used for all the β traversals is at most 3m, since the current depth is increased at each step of any β traversal.

It only remains to consider the total time used in all the character comparisons done in the "after-β" traversals. The key there is that the after-β character comparisons needed to compute ms(i+1), for i ≥ 1, begin with the character in T that ended the computation for ms(i) or with the next character in T. Hence the after-β comparisons performed when computing ms(i) and ms(i+1) share at most one character in common. It follows that at most 2m comparisons in total are performed during all the after-β comparisons. That takes care of all the work done in finding all the matching statistics, and the theorem is proved.
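The linear-time method above requires the suffix tree for P and its suffix links; as a compact reference for what is being computed, here is the matching-statistics definition evaluated directly. This is a plainly naive method, far slower than the O(m) bound just proved, and useful mainly for checking a fast implementation. The names and the 0-based indexing are ours.

    def matching_statistics_naive(T, P):
        """ms[i] = length of the longest substring of T starting at i that
        matches some substring of P (0-based here; the text is 1-based)."""
        subs = {P[a:b] for a in range(len(P)) for b in range(a + 1, len(P) + 1)}
        ms = []
        for i in range(len(T)):
            k = 0
            while i + k < len(T) and T[i:i + k + 1] in subs:
                k += 1
            ms.append(k)
        return ms

    print(matching_statistics_naive("abcxab", "xabc"))   # [3, 2, 1, 3, 2, 1]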
Definition We call an edge a terminal edge if it is labeled only with a string termination symbol. Clearly, every terminal edge has a leaf at one end, but not all edges touching leaves are terminal edges.
The main data structure used to solve the all-pairs suffix-prefix problem is the generalized suffix tree T(S) for the k strings in set S. As T(S) is constructed, the algorithm also builds a list L(v) for each internal node v. List L(v) contains the index i if and only if v is incident with a terminal edge whose leaf is labeled by a suffix of string Si. That is, L(v) holds index i if and only if the path-label of v is a complete suffix of string Si. For example, consider the generalized suffix tree shown in Figure 6.11 (page 117). The node with path-label ba has an L list consisting of the single index 2, the node with path-label a has a list consisting of indices 1 and 2, and the node with path-label xa has a list consisting of index 1. All the other lists in this example are empty. Clearly, the lists can be constructed in linear time during (or after) the construction of T(S).

Now consider a fixed string Sj, and focus on the path from the root of T(S) to the leaf j representing the entire string Sj. The key observation is the following: If v is a node on this path and i is in L(v), then the path-label of v is a suffix of Si that matches a prefix of Sj. So for each index i, the deepest node v on the path to leaf j such that i ∈ L(v) identifies the longest match between a suffix of Si and a prefix of Sj. The path-label of v is the longest suffix-prefix match of (Si, Sj). It is easy to see that by one traversal from the root to leaf j we can find the deepest nodes for all 1 ≤ i ≤ k (i ≠ j).

Following the above observation, the algorithm efficiently collects the needed suffix-prefix matches by traversing T(S) in a depth-first manner. As it does, it maintains k stacks, one for each string. During the depth-first traversal, when a node v is reached in a forward edge traversal, push v onto the ith stack for each i ∈ L(v). When a leaf j (representing the entire string Sj) is reached, scan the k stacks and record for each index i the current top of the ith stack. It is not difficult to see that the top of stack i contains the node v that defines the longest suffix-prefix match of (Si, Sj). If the ith stack is empty, then there is no overlap between a suffix of string Si and a prefix of string Sj. When the depth-first traversal backs up past a node v, we pop the top of any stack whose index is in L(v).
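The stack-based traversal needs the generalized suffix tree; as a plain specification of its output, here is the direct quadratic computation of the same table. The names are illustrative, and this naive version is useful mainly for testing against a tree-based implementation.

    def longest_suffix_prefix(si, sj):
        """Length of the longest suffix of si that is a prefix of sj."""
        for n in range(min(len(si), len(sj)), 0, -1):
            if si[-n:] == sj[:n]:
                return n
        return 0

    def all_pairs_suffix_prefix(strings):
        return {(i, j): longest_suffix_prefix(strings[i], strings[j])
                for i in range(len(strings))
                for j in range(len(strings)) if i != j}

    print(all_pairs_suffix_prefix(["cat", "atlas", "ascat"]))
    # {(0, 1): 2, (0, 2): 0, (1, 0): 0, (1, 2): 2, (2, 0): 3, (2, 1): 2}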
Theorem 7.10.1. All the k² longest suffix-prefix matches are found in O(m + k²) time by the algorithm. Since m is the size of the input and k² is the size of the output, the algorithm is time optimal.

PROOF

The total number of indices in all the lists L(v) is O(m). The number of edges in T(S) is also O(m). Each push or pop of a stack is associated with a leaf of T(S), and each leaf is associated with at most one pop and one push; hence traversing T(S) and updating the stacks takes O(m) time. Recording each of the O(k²) answers is done in constant time per answer.
Extensions
We note two extensions. Let k' ≤ k² be the number of ordered pairs of strings that have a nonzero length suffix-prefix match. By using double links, we can maintain a linked list of
In the following discussion of repetitive structures in DNA and protein, we divide the structures into three types: local, small-scale repeated strings whose function or origin is at least partially understood; simple repeats, both local and interspersed, whose function is less clear; and more complex interspersed repeated strings whose function is even more in doubt.
Definition A complemented palindrome is a DNA or RNA string that becomes a palindrome if each character in one half of the string is changed to its complement character (in DNA, A-T are complements and C-G are complements; in RNA, A-U and C-G are complements). For example, AGCTCGCGAGCT is a complemented palindrome.
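The definition can be checked mechanically: a DNA string is a complemented palindrome exactly when it equals its own reverse complement, as in this small sketch (the names are ours).

    COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}

    def is_complemented_palindrome(s):
        return s == "".join(COMPLEMENT[c] for c in reversed(s))

    print(is_complemented_palindrome("AGCTCGCGAGCT"))   # True: the text's example
    print(is_complemented_palindrome("AGGT"))           # False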
Small-scale local repeats whose function or origin is partially understood include: complemented palindromes in both DNA and RNA, which act to regulate DNA transcription (the two parts of the complemented palindrome fold and pair to form a "hairpin loop"); nested complemented palindromes in tRNA (transfer RNA) that allow the molecule to fold up into a cloverleaf structure by complementary base pairing; tandem arrays of repeated RNA that flank retroviruses (viruses whose primary genetic material is RNA) and facilitate the incorporation of viral DNA (produced from the RNA sequence by reverse transcription) into the host's DNA; single copy inverted repeats that flank transposable (movable) DNA in various organisms and that facilitate that movement or the inversion of the DNA orientation; short repeated substrings (both palindromic and nonpalindromic) in DNA that may help the chromosome fold into a more compact structure; repeated substrings at the ends of viral DNA (in a linear state) that allow the concatenation of many copies of the viral DNA (a molecule of this type is called a concatemer); copies of genes that code for important RNAs (rRNAs and tRNAs) that must be produced in large number; clustered genes that code for important proteins (such as histone) that regulate chromosome structure and must be made in large number; families of genes that code for similar proteins (hemoglobins and myoglobins for example); similar genes that probably arose through duplication and subsequent mutation (including pseudogenes that have mutated
' The use of the word "palindrome" in molecular biology does not conform to the normal Englishdictionwy definition
of the word. The easiest translation of the molecular biologist's "palindrome" to normal English is: "complemented palindrome". A more molecular view is that a palindrome is a segment of double-stranded DNA or RNA such that both strands read the same when both are read in the same direction, say in the 5' to 3' direction. Alternately, a palindrome is a segment of double-stranded DNA that is symmetric (with respect to reflection) around both the horizontal axis and the midpoint of the segment. (See Figure 7.3).Since the two strands are complementary, each strand defines a complemented palindrome in the sense deli ned above. The term "mirror repeat" is sometimes used in the molecular biology literature to refer to a "palindrome" as defined by the dictionary.
138
In the following discussion of repetitive structures in DNA and protein, we divide the structures into three types: local, small-scale repeated strings whose function or origin is at least partially understood; simple repeats, both local and interspersed, whose function is less clear; and more complex interspersed repeated strings whose function is even more in doubt.

Definition A complemented palindrome is a DNA or RNA string that becomes a palindrome if each character in one half of the string is changed to its complement character (in DNA, A-T are complements and C-G are complements; in RNA, A-U and C-G are complements). For example, AGCTCGCGAGCT is a complemented palindrome.

The use of the word "palindrome" in molecular biology does not conform to the normal English dictionary definition of the word. The easiest translation of the molecular biologist's "palindrome" to normal English is: "complemented palindrome". A more molecular view is that a palindrome is a segment of double-stranded DNA or RNA such that both strands read the same when both are read in the same direction, say in the 5' to 3' direction. Alternately, a palindrome is a segment of double-stranded DNA that is symmetric (with respect to reflection) around both the horizontal axis and the midpoint of the segment. (See Figure 7.3.) Since the two strands are complementary, each strand defines a complemented palindrome in the sense defined above. The term "mirror repeat" is sometimes used in the molecular biology literature to refer to a "palindrome" as defined by the dictionary.

Small-scale local repeats whose function or origin is partially understood include: complemented palindromes in both DNA and RNA, which act to regulate DNA transcription (the two parts of the complemented palindrome fold and pair to form a "hairpin loop"); nested complemented palindromes in tRNA (transfer RNA) that allow the molecule to fold up into a cloverleaf structure by complementary base pairing; tandem arrays of repeated RNA that flank retroviruses (viruses whose primary genetic material is RNA) and facilitate the incorporation of viral DNA (produced from the RNA sequence by reverse transcription) into the host's DNA; single copy inverted repeats that flank transposable (movable) DNA in various organisms and that facilitate that movement or the inversion of the DNA orientation; short repeated substrings (both palindromic and nonpalindromic) in DNA that may help the chromosome fold into a more compact structure; repeated substrings at the ends of viral DNA (in a linear state) that allow the concatenation of many copies of the viral DNA (a molecule of this type is called a concatemer); copies of genes that code for important RNAs (rRNAs and tRNAs) that must be produced in large number; clustered genes that code for important proteins (such as histone) that regulate chromosome structure and must be made in large number; families of genes that code for similar proteins (hemoglobins and myoglobins for example); similar genes that probably arose through duplication and subsequent mutation (including pseudogenes that have mutated
to the point that they no longer function); common exons of eukaryotic DNA that may be basic building blocks of many genes; and common functional or structural subunits in protein (motifs and domains).

Restriction enzyme cutting sites illustrate another type of small-scale, structured, repeating substring of great importance to molecular biology. A restriction enzyme is an enzyme that recognizes a specific substring in the DNA of both prokaryotes and eukaryotes and cuts (or cleaves) the DNA every place where that pattern occurs (exactly where it cuts inside the pattern varies with the pattern). There are hundreds of known restriction enzymes, and their use has been absolutely critical in almost all aspects of modern molecular biology and recombinant DNA technology. For example, the surprising discovery that eukaryotic DNA contains introns (DNA substrings that interrupt the DNA of protein coding regions), for which Nobel prizes were awarded in 1993, was closely coupled with the discovery and use of restriction enzymes in the late 1970s. Restriction enzyme cutting sites are interesting examples of repeats because they tend to be complemented palindromic substrings. For example, the restriction enzyme EcoRI recognizes the complemented palindrome GAATTC and cuts between the G and the adjoining A (the substring TTC when reversed and complemented is GAA). Other restriction enzymes recognize separated (or interrupted) complemented palindromes. For example, restriction enzyme BglI recognizes GCCNNNNNGGC, where N stands for any nucleotide. The enzyme cuts between the last two Ns. The complemented palindromic structure has been postulated to allow the two halves of the complemented palindrome (separated or not) to fold and form complementary pairs. This folding then apparently facilitates either recognition or cutting by the enzyme. Because of the palindromic structure of restriction enzyme cutting sites, people have scanned DNA databases looking for common repeats of this form in order to find additional candidates for unknown restriction enzyme cutting sites.

Simple repeats that are less well understood often arise as tandem arrays (consecutive repeated strings, also called "direct repeats") of repeated DNA. For example, the string TTAGGG appears at the ends of every human chromosome in arrays containing one to two thousand copies [332]. Some tandem arrays may originate and continue to grow by a postulated mechanism of unequal crossing over in meiosis, although there is serious opposition to that theory. With unequal crossing over in meiosis, the likelihood that more copies will be added in a single meiosis increases as the number of existing copies increases. A number of genetic diseases (Fragile X syndrome, Huntington's disease, Kennedy's disease, myotonic dystrophy, ataxia) are now understood to be caused by increasing numbers of tandem DNA repeats of a string three bases long. These triplet repeats somehow interfere with the proper production of particular proteins. Moreover, the number of triples in the repeat increases with successive generations, which appears to explain why the disease increases in severity with each generation. Other long tandem arrays consisting of short strings are very common and are widely distributed in the genomes of mammals. These repeats are called satellite DNA (further subdivided into micro and mini-satellite DNA), and their existence has been heavily exploited in genetic mapping and forensics. Highly dispersed tandem arrays of length-two strings are common.

In addition to tri-nucleotide repeats, other mini-satellite repeats also play a role in human genetic diseases [286].

Repetitive DNA that is interspersed throughout mammalian genomes, and whose function and origin is less clear, is generally divided into SINEs (short interspersed nuclear sequences) and LINEs (long interspersed nuclear sequences). The classic example of a SINE is the Alu family. The Alu repeats occur about 300,000 times in the human genome
and account for as much as 5% of the DNA of human and other mammalian genomes. Alu repeats are substrings of length around 300 nucleotides and occur as nearly (but not exactly) identical copies widely dispersed throughout the genome. Moreover, the interior of an Alu string itself consists of repeated substrings of length around 40, and the Alu sequence is often flanked on either side by tandem repeats of length 7-10. Those right and left flanking sequences are usually complemented palindromic copies of each other. So the Alu repeats wonderfully illustrate various kinds of phenomena that occur in repetitive DNA. For an introduction to Alu repeats see [254].

One of the most fascinating discoveries in molecular genetics is a phenomenon called genomic (or gametic) imprinting, whereby a particular allele of a gene is expressed only when it is inherited from one specific parent [48, 227, 391]. Sometimes the required parent is the mother and sometimes the father. The allele will be unexpressed, or expressed differently, if inherited from the "incorrect" parent. This is in contradiction to the classic Mendelian rule of equivalence - that chromosomes (other than the Y chromosome) have no memory of the parent they originated from, and that the same allele inherited from either parent will have the same effect on the child. In mice and humans, sixteen imprinted gene alleles have been found to date [48]. Five of these require inheritance from the mother, and the rest from the father. The DNA sequences of these sixteen imprinted genes all share the common feature that

They contain, or are closely associated with, a region rich in direct repeats. These repeats range in size from 25 to 120 bp, are unique to the respective imprinted regions, but have no obvious homology to each other or to highly repetitive mammalian sequences. The direct repeats may be an important feature of gametic imprinting, as they have been found in all imprinted genes analyzed to date, and are also evolutionarily conserved. [48]

Thus, direct repeats seem to be important in genetic imprinting, but like many other examples of repetitive DNA, the function and origin of these repeats remains a mystery.

A detail not contained in this quote is that the direct (tandem) repeats in the genes studied [48] have a total length of about 1,500 bases.
The existence of highly repetitive DNA, such as Alus, makes certain kinds of large-scale DNA sequencing more difficult (see Sections 16.11 and 16.16), but their existence can also facilitate certain cloning, mapping, and searching efforts. For example, one general approach to low-resolution physical mapping (finding on a true physical scale where features of interest are located in the genome) or to finding genes causing diseases involves inserting pieces of human DNA that may contain a feature of interest into the hamster genome. This technique is called somatic cell hybridization. Each resulting hybrid-hamster cell incorporates different parts of the human DNA, and these hybrid cells can be tested to identify a specific cell containing the human feature of interest. In this cell, one then has to identify the parts of the hamster's hybrid genome that are human. But what is a distinguishing feature between human and hamster DNA? One approach exploits the Alu sequences. Alu sequences specific to human DNA are so common in the human genome that most fragments of human DNA longer than 20,000 bases will contain an Alu sequence [317]. Therefore, the fragments of human DNA in the hybrid can be identified by probing the hybrid for fragments of Alu. The same idea is used to isolate human oncogenes (modified growth-promoting genes that facilitate certain cancers) from human tumors. Fragments of human DNA from the tumor are first transferred to mouse cells. Cells that receive the fragment of human DNA containing the oncogene become transformed and replicate faster than cells that do not. This isolates the human DNA fragment containing the oncogene from the other human fragments, but then the human DNA has to be separated from the mouse DNA. The proximity of the oncogene to an Alu sequence is again used to identify the human part of the hybrid genome [471]. A related technique, again using proximity to Alu sequences, is described in [403].

Algorithmic problems on repeated structures

We consider specific problems concerning repeated structures in strings in several sections of the book. Admittedly, not every repetitive string problem that we will discuss is perfectly motivated by a biological problem or phenomenon known today. A recurring objection is that the first repetitive string problems we consider concern exact repeats (although with complementation and inversion allowed), whereas most cases of repetitive DNA involve nearly identical copies. Some techniques for handling inexact palindromes (complemented or not) and inexact repeats will be considered in Sections 9.5 and 9.6. Techniques that handle more liberal errors will be considered later in the book. Another objection is that simple techniques suffice for small-length repeats. For example, if one seeks repeating DNA of length ten, it makes sense to first build a table of all the 4^10 possible strings and then scan the target DNA with a length-ten template, hashing substring locations into the precomputed table. Despite these objections, the fit of the computational problems we will discuss to biological phenomena is good enough to motivate sophisticated techniques for handling exact or nearly exact repetitions. Those techniques pass the "plausibility" test in that they, or the ideas that underlie them, may be of future use in computational biology. In this light, we now consider problems concerning exactly repeated substrings in a single string.

In a sense, the longest common substring problem and the k-common substring problem (Sections 7.6 and 9.7) also concern repetitive substrings. However, the repeats in those problems occur across distinct strings, rather than inside the same string. That distinction is critical, both in the definition of the problems and for the techniques used to solve them.
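The table-and-template idea for short repeats mentioned above is easy to realize; in this sketch a Python dictionary stands in for the precomputed table of all 4^10 possible strings.

    # Sketch: repeats of a fixed small length by template hashing. A
    # dictionary stands in for the precomputed table of 4^10 strings.
    from collections import defaultdict

    def repeated_templates(dna, length=10):
        """Map each length-'length' substring occurring more than once
        to its (1-based) starting positions."""
        table = defaultdict(list)
        for i in range(len(dna) - length + 1):
            table[dna[i:i + length]].append(i + 1)
        return {s: p for s, p in table.items() if len(p) > 1}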
Definition A maximal pair (or a maximal repeated pair) in a string S is a pair of identical substrings alpha and beta in S such that the character to the immediate left (right) of alpha is different from the character to the immediate left (right) of beta. That is, extending alpha and beta in either direction would destroy the equality of the two strings.

Definition A maximal pair is represented by the triple (p1, p2, n'), where p1 and p2 give the starting positions of the two substrings and n' gives their length. For a string S, we define R(S) to be the set of all triples describing maximal pairs in S.

For example, consider the string S = xabcyiiizabcqabcyrxar, where there are three occurrences of the substring abc. The first and second occurrences of abc form a maximal pair (2, 10, 3), and the second and third occurrences also form a maximal pair (10, 14, 3), whereas the first and third occurrences of abc do not form a maximal pair. The two occurrences of the string abcy also form a maximal pair (2, 14, 4). Note that the definition allows the two substrings in a maximal pair to overlap each other. For example, cxxaxxaxxb contains a maximal pair whose substring is xxaxx. Generally, we also want to permit a prefix or a suffix of S to be part of a maximal pair. For example, the two occurrences of xa in xabcyiiizabcqabcyrxar should be considered as a maximal pair. To model this case, simply add a character to the start of S and one to the end of S that appear nowhere else in S. From this point on, we will assume that has been done. It may sometimes be of interest to explicitly find and output the full set R(S). However, in some situations R(S) may be too large to be of use, and a more restricted reflection of the maximal pairs may be sufficient or even preferred.
Definition Define a maximal repeat alpha as a substring of S that occurs in a maximal pair in S. That is, alpha is a maximal repeat in S if there is a triple (p1, p2, |alpha|) in R(S) and alpha occurs in S starting at positions p1 and p2. Let R'(S) denote the set of maximal repeats in S.

For example, with S as above, both strings abc and abcy are maximal repeats. Note that no matter how many times a string participates in a maximal pair in S, it is represented only once in R'(S). Hence |R'(S)| is less than or equal to |R(S)| and is generally much smaller. The output is more modest, and yet it gives a good reflection of the maximal pairs. In some applications, the definition of a maximal repeat does not properly model the desired notion of a repetitive structure. For example, in S = aabxayaab, substring a is
a maximal repeat but so is aab, which is a superstring of string a, although not every occurrence of a is contained in that superstring. It may not always be desirable to report a as a repetitive structure, since the larger substring aab that sometimes contains a may be more informative.

Definition A supermaximal repeat is a maximal repeat that never occurs as a substring of any other maximal repeat.

Maximal pairs, maximal repeats, and supermaximal repeats are only three possible ways to define exact repetitive structures of interest. Other models of exact repeats are given in the exercises. Problems related to palindromes and tandem repeats are considered in several sections throughout the book. Inexact repeats will be considered in Sections 9.5 and 9.6.1. Certain kinds of repeats are elegantly represented in graphical form in a device called a landscape [104]. An efficient program to construct the landscape, based essentially on suffix trees, is also described in that paper. In the next sections we detail how to efficiently find all maximal pairs, maximal repeats, and supermaximal repeats.

Lemma 7.12.1. Let T be the suffix tree for string S. If a string alpha is a maximal repeat in S, then alpha is the path-label of a node v in T.

PROOF If alpha is a maximal repeat then there must be at least two copies of alpha in S where the character to the right of the first copy differs from the character to the right of the second copy. Hence alpha is the path-label of a node v in T.

The key point in Lemma 7.12.1 is that path alpha must end at a node of T. This leads immediately to the following surprising fact:

Theorem 7.12.1. There can be at most n maximal repeats in any string of length n.

PROOF Since T has n leaves, and each internal node other than the root must have at least two children, T can have at most n internal nodes. Lemma 7.12.1 then implies the theorem.

Theorem 7.12.1 would be a trivial fact if at most one substring starting at any position i could be part of a maximal pair. But that is not true. For example, in the string S = xabcyiiizabcqabcyrxar considered earlier, both copies of substring abcy participate in maximal pairs, while each copy of abc also participates in maximal pairs. So now we know that to find maximal repeats we only need to consider strings that end at nodes in the suffix tree T. But which specific nodes correspond to maximal repeats?

Definition For each position i in string S, character S(i - 1) is called the left character of i. The left character of a leaf of T is the left character of the suffix position represented by that leaf.

Definition A node v of T is called left diverse if at least two leaves in v's subtree have different left characters. By definition, a leaf cannot be left diverse.
Note that being left diverse is a property that propagates upward. If a node v is left diverse, so are all of its ancestors in the tree.

Theorem 7.12.2. The string alpha labeling the path to a node v of T is a maximal repeat if and only if v is left diverse.

PROOF Suppose first that v is left diverse. That means there are substrings x alpha and y alpha in S, where x and y represent different characters. Let the first substring be followed by character p. If the second substring is followed by any character but p, then alpha is a maximal repeat and the theorem is proved. So suppose that the two occurrences are x alpha p and y alpha p. But since v is a (branching) node, there must also be a substring alpha q in S for some character q that is different from p. If this occurrence of alpha q is preceded by character x, then it participates in a maximal pair with string y alpha p, and if it is preceded by y, then it participates in a maximal pair with x alpha p. Either way, the occurrence of alpha q cannot be preceded by both x and y, so alpha must be part of a maximal pair and hence alpha must be a maximal repeat.

Conversely, if alpha is a maximal repeat then it participates in a maximal pair, and there must be occurrences of alpha that have distinct left characters. Hence v must be left diverse.

The maximal repeats can be compactly represented

Since the property of being left diverse propagates upward in T, Theorem 7.12.2 implies that the maximal repeats of S are represented by some initial portion of the suffix tree for S. In detail, a node is called a "frontier" node in T if it is left diverse but none of its children are left diverse. The subtree of T from the root down to the frontier nodes precisely represents the maximal repeats, in that every path from the root to a node at or above the frontier defines a maximal repeat. Conversely, every maximal repeat is defined by one such path. This subtree, whose leaves are the frontier nodes in T, is a compact representation of the set of all maximal repeats of S. Note that the total length of the maximal repeats could be as large as O(n^2), but since the representation is a subtree of T it has O(n) total size (including the symbols used to represent edge labels). So if the left diverse nodes can be found in O(n) time, then a tree representation for the set of maximal repeats can be constructed in O(n) time, even though the total length of those maximal repeats could be Omega(n^2). We now describe an algorithm to find the left diverse nodes in T.

Finding left diverse nodes in linear time

For each node v of T, the algorithm either records that v is left diverse or it records the character, denoted x, that is the left character of every leaf in v's subtree. The algorithm starts by recording the left character of each leaf of the suffix tree T for S. Then it processes the nodes in T bottom up. To process a node v, it examines the children of v. If any child of v has been identified as being left diverse, then it records that v is left diverse. If none of v's children are left diverse, then it examines the characters recorded at v's children. If these recorded characters are all equal, say x, then it records character x at node v. However, if they are not all x, then it records that v is left diverse. The time to check if all children of v have the same recorded character is proportional to the number of v's children. Hence the total time for the algorithm is O(n). To form the final representation of the set of maximal repeats, simply delete all nodes from T that are not left diverse.

This kind of tree is sometimes referred to as a compact trie, but we will not use that terminology.
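The bottom-up marking just described takes only a few lines to write out. A minimal sketch, again over a hypothetical Node representation of the suffix tree in which each leaf carries its left character:

    # Sketch: bottom-up marking of left diverse nodes in O(n) time. The
    # suffix tree is assumed prebuilt; Node is hypothetical, and each
    # leaf carries the left character of the suffix it represents.
    class Node:
        def __init__(self, left_char=None):
            self.children = []           # empty at the leaves
            self.left_char = left_char   # set at the leaves only
            self.left_diverse = False

    def mark_left_diverse(v):
        """Return the one left character shared by all leaves below v,
        or None if v is left diverse; marks nodes along the way."""
        if not v.children:               # a leaf is never left diverse
            return v.left_char
        chars = {mark_left_diverse(w) for w in v.children}
        if None in chars or len(chars) > 1:
            v.left_diverse = True        # a diverse child, or two characters
            return None
        return chars.pop()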
In summary, we have

Theorem 7.12.3. All the maximal repeats in S can be found in O(n) time, and a tree representation for them can be constructed from suffix tree T in O(n) time as well.

Definition A substring alpha of S is a near-supermaximal repeat if alpha is a maximal repeat in S that occurs at least once in a location where it is not contained in another maximal repeat. Such an occurrence of alpha is said to witness the near-supermaximality of alpha.

For example, in the string aabxayaabxab, substring a is a maximal repeat but not a supermaximal or a near-supermaximal repeat, whereas in aabxayaab, substring a is again not supermaximal, but it is near-supermaximal. The second occurrence of a witnesses that fact. With this terminology, a supermaximal repeat alpha is a maximal repeat in which every occurrence of alpha is a witness to its near-supermaximality. Note that it is not true that the set of near-supermaximal repeats is the set of maximal repeats that are not supermaximal repeats.

The suffix tree T for S will be used to locate the near-supermaximal and the supermaximal repeats. Let v be a node corresponding to a maximal repeat alpha, and let w (possibly a leaf) be one of v's children. The leaves in the subtree of T rooted at w identify the locations of some (but not all) of the occurrences of substring alpha in S. Let L(w) denote those occurrences. Do any of those occurrences of alpha witness the near-supermaximality of alpha?

Lemma 7.12.2. If node w is an internal node in T, then none of the occurrences of alpha specified by L(w) witness the near-supermaximality of alpha.

PROOF Let gamma be the substring labeling edge (v, w). Every index in L(w) specifies an occurrence of alpha gamma. But w is internal, so |L(w)| > 1 and alpha gamma is the prefix of a maximal repeat. Therefore, all the occurrences of alpha specified by L(w) are contained in a maximal repeat that begins alpha gamma, and w cannot witness the near-supermaximality of alpha.

Thus no occurrence of alpha in L(w) can witness the near-supermaximality of alpha unless w is a leaf. If w is a leaf, then w specifies a single particular occurrence of substring beta = alpha gamma. We now consider that case.

Lemma 7.12.3. Suppose w is a leaf, and let i be the (single) occurrence of alpha represented by leaf w. Let x be the left character of leaf w. Then the occurrence of alpha at position i witnesses the near-supermaximality of alpha if and only if x is the left character of no other leaf below v.

PROOF If there is another occurrence of alpha with a preceding character x, then x alpha occurs twice and so is either a maximal repeat or is contained in one. In that case, the occurrence of alpha at i is contained in a maximal repeat. If there is no other occurrence of alpha with a preceding x, then x alpha occurs only once in S. Now let gamma be the first character on the edge from v to w. Since w is a leaf, alpha gamma occurs only once in S. Therefore, the occurrence of alpha starting at i, which is preceded
by x and succeeded by gamma, is not contained in a maximal repeat, and so witnesses the near-supermaximality of alpha. In summary, we can state

Theorem 7.12.4. A left diverse internal node v represents a near-supermaximal repeat alpha if and only if one of v's children is a leaf (specifying position i, say) and its left character, S(i - 1), is the left character of no other leaf below v. A left diverse internal node v represents a supermaximal repeat alpha if and only if all of v's children are leaves, and each has a distinct left character.

Therefore, all supermaximal and near-supermaximal repeats can be identified in linear time. Moreover, we can define the degree of near-supermaximality of alpha as the fraction of occurrences of alpha that witness its near-supermaximality. That degree of each near-supermaximal repeat can also be computed in linear time.
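Both tests of Theorem 7.12.4 translate directly into code. A sketch under the same hypothetical Node representation as above (children list; leaves carry left_char); dividing the number of witnesses by the number of leaves below v gives the degree of near-supermaximality:

    # Sketch: the two tests of Theorem 7.12.4, over the hypothetical
    # Node form used earlier (children list; leaves carry left_char).
    def is_supermaximal(v):
        """All of v's children are leaves with distinct left characters."""
        if any(w.children for w in v.children):
            return False
        chars = [w.left_char for w in v.children]
        return len(chars) == len(set(chars))

    def witnesses(v):
        """Leaf children of v whose left character labels no other leaf
        below v; each witnesses the near-supermaximality of v's repeat."""
        def leaves(u):
            if not u.children:
                return [u]
            return [x for w in u.children for x in leaves(w)]
        count = {}
        for leaf in leaves(v):
            count[leaf.left_char] = count.get(leaf.left_char, 0) + 1
        return [w for w in v.children
                if not w.children and count[w.left_char] == 1]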
maximal pairs, then the algorithm can be modified to run in O(n) time. If only maximal pairs of a certain minimum length are requested (this would be the typical case in many applications), then the algorithm can be modified to run in O(n + k_min) time, where k_min is the number of maximal pairs of length at least the required minimum. Simply stop the bottom-up traversal at any node whose string-depth falls below that minimum. In summary, we have the following theorem:

Theorem 7.12.5. All the maximal pairs can be found in O(n + k) time, where k is the number of maximal pairs. If there are only k_min maximal pairs of length above a given threshold, then all those can be found in O(n + k_min) time.
linearization of the circular string. If l = 0 or l = n + 1, then cut the circular string between character n and character 1. Each leaf in the subtree of this point gives a cutting point yielding the same linear string. The correctness of this solution is easy to establish and is left as an exercise. This method runs in linear time and is therefore time optimal. A different linear-time method with a smaller constant was given by Shiloach [404].
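For illustration only, the linearization can also be computed naively by taking the lexically smallest rotation directly; this quadratic sketch is not the linear-time tree method just described.

    # Sketch: naive circular string linearization - simply take the
    # lexically smallest rotation (roughly O(n^2 log n) via sorting,
    # unlike the O(n) suffix-tree method described above).
    def linearize(circular):
        n = len(circular)
        doubled = circular + circular
        rotations = [doubled[i:i + n] for i in range(n)]
        best = min(range(n), key=lambda i: rotations[i])
        return best + 1, rotations[best]  # 1-based cut point, linear string

    # linearize("cba") == (3, "acb")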
That is, the suffix starting at position Pos(1) of T is the lexically smallest suffix, and in general suffix Pos(i) of T is lexically smaller than suffix Pos(i + 1). As usual, we will affix a terminal symbol $ to the end of T, but now we interpret it to be lexically less than any other character in the alphabet. This is in contrast to its interpretation in the previous section. As an example of a suffix array, if T is mississippi, then the suffix array Pos is 11, 8, 5, 2, 1, 10, 9, 7, 4, 6, 3. Figure 7.4 lists the eleven suffixes in lexicographic order.
i
ippi
issippi
ississippi
mississippi
pi
ppi
sippi
sissippi
ssippi
ssissippi

Figure 7.4: The eleven suffixes of mississippi listed in lexicographic order. The starting positions of those suffixes define the suffix array Pos.

Notice that the suffix array holds only integers and hence contains no information about the alphabet used in string T. Therefore, the space required by suffix arrays is modest: for a string of length m, the array can be stored in exactly m computer words, assuming a word size of at least log m bits. When augmented with an additional 2m values (called Lcp values and defined later), the suffix array can be used to find all the occurrences in T of a pattern P in O(n + log2 m) single-character comparison and bookkeeping operations. Moreover, this bound is independent of the alphabet size. Since for most problems of interest log2 m is O(n), the substring problem is solved by using suffix arrays as efficiently as by using suffix trees.

Since no two edges out of v have labels beginning with the same character, there is a strict lexical ordering of the edges out of v. This ordering implies that the path from the root of T following the lexically smallest edge out of each encountered node leads to a leaf of T representing the lexically smallest suffix of T. More generally, a depth-first traversal of T that traverses the edges out of each node v in their lexical order will encounter the leaves of T in the lexical order of the suffixes they represent. Suffix array Pos is therefore just the ordered list of suffix numbers encountered at the leaves of T during the lexical depth-first search. The suffix tree for T is constructed in linear time, and the traversal also takes only linear time, so we have the following:

Theorem 7.14.1. The suffix array Pos for a string T of length m can be constructed in O(m) time.
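For illustration, Pos can also be built by directly sorting the suffixes (simple, though not the linear-time construction of Theorem 7.14.1); the mississippi example reproduces Figure 7.4.

    # Sketch: building Pos by directly sorting the suffixes (simple but
    # not the linear-time construction of Theorem 7.14.1).
    def suffix_array(T):
        return sorted(range(1, len(T) + 1), key=lambda i: T[i - 1:])

    print(suffix_array("mississippi"))
    # [11, 8, 5, 2, 1, 10, 9, 7, 4, 6, 3], matching Figure 7.4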
Figure 7.5: The lexical depth-first traversal of the suffix tree visits the leaves in order 5, 2, 6, 3, 4, 1.

For example, the suffix tree for T = tartar is shown in Figure 7.5. The lexical depth-first traversal visits the nodes in the order 5, 2, 6, 3, 4, 1, defining the values of array Pos. As an implementation detail, if the branches out of each node of the tree are organized in a sorted linked list (as discussed in Section 6.5, page 116), then the overhead to do a lexical depth-first search is the same as for any depth-first search. Every time the search must choose an edge out of a node v to traverse, it simply picks the next edge on v's linked list.

Theorem 7.14.2. By using binary search on array Pos, all the occurrences of P in T can be found in O(n log m) time.
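A sketch of that binary search (the plain version, with no mlr or Lcp acceleration; each comparison may examine up to n characters, giving the O(n log m) bound):

    # Sketch: the plain binary search of Theorem 7.14.2, with no mlr or
    # Lcp acceleration; each comparison may examine up to n characters.
    def find_occurrences(T, pos, P):
        """All starting positions of P in T, given suffix array pos
        (1-based positions, as in the text)."""
        m, n = len(pos), len(P)

        def first(after):
            # least index whose suffix's first n characters compare
            # >= P (or > P, when 'after' is True)
            lo, hi = 0, m
            while lo < hi:
                mid = (lo + hi) // 2
                s = T[pos[mid] - 1 : pos[mid] - 1 + n]
                if s < P or (after and s == P):
                    lo = mid + 1
                else:
                    hi = mid
            return lo

        return [pos[i] for i in range(first(False), first(True))]

    # find_occurrences("mississippi",
    #                  [11, 8, 5, 2, 1, 10, 9, 7, 4, 6, 3], "ssi") == [6, 3]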
Of course, the true behavior of the algorithm depends on how many long prefixes of P occur in T. If very few long prefixes of P occur in T, then it will rarely happen that a specific lexical comparison actually takes O(n) time and generally the O(n log m) bound is quite pessimistic. In "random" strings (even on large alphabets) this method should run in O(n + log m) expected time. In cases where many long prefixes of P do occur in T, the method can be improved with the two tricks described in the next two subsections.
7.14.4. A super-accelerant

Call an examination of a character in P redundant if that character has been examined before. The goal of the acceleration is to reduce the number of redundant character examinations to at most one per iteration of the binary search - hence O(log m) in all. The desired time bound, O(n + log m), follows immediately. The use of mlr alone does not achieve this goal. Since mlr is the minimum of l and r, whenever l != r all characters in P from mlr + 1 to the maximum of l and r will have already been examined. Thus any comparisons of those characters will be redundant. What is needed is a way to begin comparisons at the maximum of l and r.

Definition Lcp(i, j) is the length of the longest common prefix of the suffixes specified in positions i and j of Pos. That is, Lcp(i, j) is the length of the longest prefix common to suffix Pos(i) and suffix Pos(j). The term Lcp stands for longest common prefix.

For example, when T = mississippi, suffix Pos(3) is issippi, suffix Pos(4) is ississippi, and so Lcp(3,4) is four (see Figure 7.4). To speed up the search, the algorithm uses Lcp(L, M) and Lcp(M, R) for each triple (L, M, R) that arises during the execution of the binary search. For now, we assume that these values can be obtained in constant time when needed and show how they help the search. Later we will show how to compute the particular Lcp values needed by the binary search during the preprocessing of T.

How to use Lcp values

Simplest case In any iteration of the binary search, if l = r, then compare P to suffix Pos(M) starting from position mlr + 1 = l + 1 = r + 1, as before.

General case When l != r, assume without loss of generality that l > r. Then there are three subcases:
Figure 7.6: Subcase 1 of the super-accelerant. Pattern P is abcdemn, shown vertically running upwards from the first character. The suffixes Pos(L), Pos(M), and Pos(R) are also shown vertically. In this case, Lcp(L, M) > l and l > r. Any starting location of P in T must occur in Pos to the right of M, since P agrees with suffix Pos(M) only up to character l.

If Lcp(L, M) > l, then the common prefix of suffix Pos(L) and suffix Pos(M) is longer than the common prefix of P and Pos(L). Therefore, P agrees with suffix Pos(M) up through character l. In other words, characters l + 1 of suffix Pos(L) and suffix Pos(M) are identical and lexically less than character l + 1 of P (the last fact follows since P is lexically greater than suffix Pos(L)). Hence all (if any) starting locations of P in T must occur to the right of position M in Pos. So in any iteration of the binary search where this case occurs, no examinations of P are needed; L just gets changed to M, and l and r remain unchanged. (See Figure 7.6.)

If Lcp(L, M) < l, then the common prefix of suffix Pos(L) and Pos(M) is smaller than the common prefix of suffix Pos(L) and P. Therefore, P agrees with suffix Pos(M) up through character Lcp(L, M). Character Lcp(L, M) + 1 of P and suffix Pos(L) are identical and lexically less than character Lcp(L, M) + 1 of suffix Pos(M). Hence all (if any) starting locations of P in T must occur to the left of position M in Pos. So in any iteration of the binary search where this case occurs, no examinations of P are needed; r is changed to Lcp(L, M), l remains unchanged, and R is changed to M.

If Lcp(L, M) = l, then P agrees with suffix Pos(M) up to character l. The algorithm then lexically compares P to suffix Pos(M) starting from position l + 1. In the usual manner, the outcome of that lexical comparison determines which of L or R change, along with the corresponding change of l or r.

Theorem 7.14.3. Using the Lcp values, the search algorithm does at most O(n + log m) comparisons and runs in that time.

PROOF First, by simple case analysis it is easy to verify that neither l nor r ever decreases during the binary search. Also, every iteration of the binary search either terminates the search, examines no characters of P, or ends after the first mismatch occurs in that iteration. In the two cases (l = r or Lcp(L, M) = l > r) where the algorithm examines a character during the iteration, the comparisons start with character max(l, r) + 1 of P. Suppose there are k characters of P examined in that iteration. Then there are k - 1 matches during the iteration, and at the end of the iteration max(l, r) increases by k - 1 (either l or r is changed to that value). Hence at the start of any iteration, character max(l, r) + 1 of P may have already been examined, but the characters beyond it have not been. That means at most one redundant comparison per iteration is done. Thus no more than log2 m redundant comparisons are done overall. There are at most n nonredundant comparisons of characters of P, giving a total bound of n + log m comparisons. All the other work in the algorithm can clearly be done in time proportional to these comparisons.
Figure 7.7: Binary tree B representing all the possible search intervals in any execution of binary search in a list of length m = 8.

Essentially, the node labels specify the endpoints (L, R) of all the possible search intervals that could arise in the binary search of an ordered list of length m. Since B is a binary tree with m leaves, B has 2m - 1 nodes in total. So there are only O(m) Lcp values that need be precomputed. It is therefore plausible that those values can be accumulated during the O(m)-time preprocessing of T; but how exactly? In the next lemma we show that the Lcp values at the leaves of B are easy to accumulate during the lexical depth-first traversal of T.

Lemma 7.14.1. In the depth-first traversal of T, consider the internal nodes visited between the visits to leaf Pos(i) and leaf Pos(i + 1), that is, between the ith leaf visited and the next leaf visited. From among those internal nodes, let v denote the one that is closest to the root. Then Lcp(i, i + 1) equals the string-depth of node v.

For example, consider again the suffix tree shown in Figure 7.5 (page 151). Lcp(5,6) is the string-depth of the parent of leaves 4 and 1. That string-depth is 3, since the parent of 4 and 1 is labeled with the string tar. The values of Lcp(i, i + 1) are 2, 0, 1, 0, 3 for i from 1 to 5. The hardest part of Lemma 7.14.1 involves parsing it. Once done, the proof is immediate from properties of suffix trees, and it is left to the reader.

If we assume that the string-depths of the nodes are known (these can be accumulated in linear time), then by the lemma, the values Lcp(i, i + 1) for i from 1 to m - 1 are easily accumulated in O(m) time. The rest of the Lcp values are easy to accumulate because of the following lemma:

Lemma 7.14.2. For any pair of positions i, j, where j is greater than i + 1, Lcp(i, j) is the smallest value of Lcp(k, k + 1), where k ranges from i to j - 1.

PROOF Suffix Pos(i) and suffix Pos(j) of T have a common prefix of length Lcp(i, j). By the properties of lexical ordering, for every k between i and j, suffix Pos(k) must also have that common prefix. Therefore, Lcp(k, k + 1) >= Lcp(i, j) for every k between i and j - 1. Now by transitivity, Lcp(i, i + 2) must be at least as large as the minimum of Lcp(i, i + 1) and Lcp(i + 1, i + 2). Extending this observation, Lcp(i, j) must be at least as large as the smallest Lcp(k, k + 1) for k from i to j - 1. Combined with the observation in the first paragraph, the lemma is proved.

Given Lemma 7.14.2, the remaining Lcp values for B can be found by working up from the leaves, setting the Lcp value at any node v to the minimum of the Lcp values of its two children. This clearly takes just O(m) time. In summary, the O(n + log m)-time string and substring matching algorithm using a suffix array must precompute the 2m - 1 Lcp values associated with the nodes of binary tree B. The leaf values can be accumulated during the linear-time, lexical, depth-first traversal of T used to construct the suffix array. The remaining values are computed from the leaf values in linear time by a bottom-up traversal of B, resulting in the following:

Theorem 7.14.4. All the needed Lcp values can be accumulated in O(m) time, and all occurrences of P in T can be found using a suffix array in O(n + log m) time.
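For illustration, the adjacent Lcp values can also be computed by direct character comparison (unlike the O(m) accumulation of Lemma 7.14.1), after which Lemma 7.14.2 gives every other value as a minimum; the same minimum rule fills in the 2m - 1 nodes of tree B bottom up.

    # Sketch: Lcp values for a suffix array. Adjacent values are computed
    # here by direct comparison (not the O(m) accumulation of Lemma
    # 7.14.1); Lemma 7.14.2 then gives any other value as a minimum.
    def adjacent_lcps(T, pos):
        """lcps[i] = Lcp(i, i+1), for 1 <= i <= m - 1."""
        def lcp(a, b):                      # a, b are 1-based positions in T
            k = 0
            while a + k <= len(T) and b + k <= len(T) and \
                  T[a + k - 1] == T[b + k - 1]:
                k += 1
            return k
        return {i: lcp(pos[i - 1], pos[i]) for i in range(1, len(pos))}

    def Lcp(lcps, i, j):
        """Lemma 7.14.2: the minimum of Lcp(k, k+1) for i <= k <= j - 1."""
        return min(lcps[k] for k in range(i, j))

    # For T = tartar, pos = [5, 2, 6, 3, 4, 1], the adjacent values are
    # 2, 0, 1, 0, 3, matching the example above.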
consists of characters from a finite alphabet (representing the known patterns of interest) alternating with integers giving the distances between such sites. The alphabet is huge because the range of integers is huge, and since distances are often known with high precision, the numbers are not rounded off. Moreover, the variety of known patterns of interest is itself large (see [435]). It often happens that a DNA substring is obtained and studied without knowing where that DNA is located in the genome or whether that substring has been previously researched. If both the new and the previously studied DNA are fully sequenced and put in a database, then the issue of previous work or locations would be solved by exact string matching. But most DNA substrings that are studied are not fully sequenced - maps are easier and cheaper to obtain than sequences. Consequently, the following matching problem on maps arises and translates to a matching problem on strings with large alphabets: Given an established (restriction enzyme) map for a large DNA string and a map from a smaller string, determine if the smaller string is a substring of the larger one. Since each map is represented as an alternating string of characters and integers, the underlying alphabet is huge. This provides one motivation for using suffix arrays for matching or substring searching in place of suffix trees. Of course, the problems become more difficult in the presence of errors, when the integers in the strings may not be exact, or when sites are missing or spuriously added. That problem, called map alignment, is discussed in Section 16.10.

Arabidopsis thaliana An Arabidopsis thaliana genome project at the Michigan State University and the University of Minnesota is initially creating an EST map of the Arabidopsis genome (see Section 3.5.1 for a discussion of ESTs and Chapter 16 for a discussion of mapping). In that project generalized suffix trees are used in several ways [63, 64, 65]. First, each sequenced fragment is checked to catch any contamination by known vector sequences. The vector sequences are kept in a generalized suffix tree, as discussed in Section 7.5. Second, each new sequenced fragment is checked against fragments already sequenced to find duplicate sequences or regions of high similarity. The fragment sequences are kept in an expanding generalized suffix tree for this purpose. Since the project will sequence about 36,000 fragments, each of length about 400 bases, the efficiency of the searches for duplicates and for contamination is important. Third, suffix trees are used in the search for biologically significant patterns in the obtained Arabidopsis sequences. Patterns of interest are often represented as regular expressions, and generalized suffix trees are used to accelerate regular expression pattern matching, where a small number of errors in a match are allowed. An approach that permits

Arabidopsis thaliana is the "fruit fly" of plant genetics, i.e., the classic model organism in studying the molecular biology of plants. Its size is about 100 million base pairs.
suffix trees to speed up regular expression pattern matching (with errors) is discussed in Section 12.4.

Yeast Suffix trees are also the central data structure in genome-scale analysis of Saccharomyces cerevisiae (brewer's yeast), done at the Max-Planck Institute [320]. Suffix trees are "particularly suitable for finding substring patterns in sequence databases" [320]. So in that project, highly optimized suffix trees called hashed position trees are used to solve problems of "clustering sequence data into evolutionary related protein families, structure prediction, and fragment assembly" [320]. (See Section 16.15 for a discussion of fragment assembly.)

Borrelia burgdorferi Borrelia burgdorferi is the bacterium causing Lyme disease. Its genome is about one million bases long, and is currently being sequenced at the Brookhaven National Laboratory using a directed sequencing approach to fill in gaps after an initial shotgun sequencing phase (see Section 16.14). Chen and Skiena [100] developed methods based on suffix trees and suffix arrays to solve the fragment assembly problem for this project. In fragment assembly, one major bottleneck is overlap detection, which requires solving a variant of the suffix-prefix matching problem (allowing some errors) for all pairs of strings in a large set (see Section 16.15.1). The Borrelia work [100] consisted of 4,612 fragments (strings) totaling 2,032,740 bases. Using suffix trees and suffix arrays, the needed overlaps were computed in about fifteen minutes. To compare the speed and accuracy of the suffix tree methods to pure dynamic programming methods for overlap detection (discussed in Sections 11.6.4 and 16.15.1), Chen and Skiena closely examined cosmid-sized data. The test established that the suffix tree approach gives a 1,000 times speedup over the (slightly) more accurate dynamic programming approach, finding 99% of the significant overlaps found by using dynamic programming.

Efficiency is critical

In all three projects, the efficiency of building, maintaining, and searching the suffix trees is extremely important, and the implementation details of Section 6.5 are crucial. However, because the suffix trees are very large (approaching 20 million characters in the case of the Arabidopsis project) additional implementation effort is needed, particularly in organizing the suffix tree on disk, so that the number of disk accesses is reduced. All three projects have deeply explored that issue and have found somewhat different solutions. See [320], [100], and [63] for details.
time. However, a synthesis of the Boyer-Moore and Aho-Corasick algorithms due to Commentz-Walter [109] solves the exact set matching problem in the spirit of the Boyer-Moore algorithm. Its shift rules allow many characters of T to go unexamined. We will not describe the Commentz-Walter algorithm but instead use suffix trees to achieve the same result more simply. For simplicity of exposition, we will first describe a solution that uses two trees - a simple keyword tree (without back pointers) together with a suffix tree. The difficult work is done by the suffix tree. After understanding the ideas, we implement the method using only the suffix tree.

Definition Let P^r denote the reverse of a pattern P, and let P^r denote the set of strings obtained by reversing every pattern P from an input set P.

As usual, the algorithm preprocesses the set of patterns and then uses the result of the preprocessing to accelerate the search. The following exposition interleaves the descriptions of the search method and the preprocessing that supports the search.
patterns and m is the size of the text. The more efficient algorithm will increase i by more than one whenever possible, using rules that are analogous to the bad character and good suffix rules of Boyer-Moore. Of course, no shift can be greater than the length of the shortest pattern in P, for such a shift could miss occurrences of that pattern in T.

Figure 7.9: Shift when the bad character rule is applied.
The preprocessing needed to implement the bad character rule is simple and is left to the reader. The generalization of the bad character rule to set matching is easy but, unlike the case of a single pattern, use of the bad character rule alone may not be very effective. As the number of patterns grows, the typical size of i1 - i is likely to decrease, particularly if the alphabet is small. This is because some pattern is likely to have character T(j - 1) close to, but left of, the point where the previous matches end. As noted earlier, in some applications in molecular biology the total length of the patterns in P is larger than the size of T, making the bad character rule almost useless. A bad character rule analogous to the simpler, unextended bad character rule for a single pattern would be even less useful. Therefore, in the set matching case, a rule analogous to the good suffix rule is crucial in making a Boyer-Moore approach effective.
Figure 7.11: The shift when the weak good suffix rule is applied. In this figure, pattern P_3 determines the amount of the shift.

from the end of P, then i should be increased by exactly r positions, that is, i should be set to i + r. (See Figure 7.10 and Figure 7.11.) We will solve the problem of finding i2, if it exists, using a suffix tree obtained by preprocessing set P^r. The key involves using the suffix tree to search for a pattern P^r in P^r containing a copy of alpha^r starting closest to its left end but not occurring as a prefix of P^r. If that occurrence of alpha^r starts at position z of pattern P^r, then an occurrence of alpha ends r = z - 1 positions from the end of P. During the preprocessing phase, build a generalized suffix tree T^r for the set of patterns P^r. Recall that in a generalized suffix tree each leaf is associated with both a pattern P^r in P^r and a number z specifying the starting position of a suffix of P^r.

Definition For each internal node v of T^r, z_v denotes the smallest number z greater than 1 (if any) such that z is a suffix position number written at a leaf in the subtree of v. If no such leaf exists, then z_v is undefined.

With this suffix tree T^r, determine the number z_v for each internal node v. These two preprocessing tasks are easily accomplished in linear time by standard methods and are left to the reader. As an example of the preprocessing, consider the set P = {wxa, xaqq, qxax} and the generalized suffix tree for P^r shown in Figure 7.12. The first number on each leaf refers to a string in P^r, and the second number refers to a suffix starting position in that string. The number z_v is the first (or only) number written at every internal node (the second number will be introduced later).
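A sketch of the z_v computation, once more over a hypothetical Node representation of the generalized suffix tree T^r in which each leaf stores its suffix starting position z; one depth-first pass suffices, matching the linear-time claim above.

    # Sketch: computing z_v for every internal node of T^r in linear time
    # by one DFS. Node is again hypothetical; each leaf stores the
    # starting position z of the suffix it represents.
    class Node:
        def __init__(self, z=None):
            self.children = []   # empty at the leaves
            self.z = z           # suffix starting position (leaves only)
            self.z_v = None      # smallest z > 1 below an internal node

    def compute_z(v):
        """Return the smallest suffix position greater than 1 in v's
        subtree, or None if there is none; fills in z_v on the way."""
        if not v.children:
            return v.z if v.z > 1 else None
        best = None
        for w in v.children:
            c = compute_z(w)
            if c is not None and (best is None or c < best):
                best = c
        v.z_v = best
        return best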
Figure 7.12: Generalized suffix tree T^r for the set P = {wxa, xaqq, qxax}.

We can now describe how T^r is used during the search to determine the value i2, if it exists. After matching alpha along a path in K^r, traverse the path labeled alpha^r from the root of T^r. That path exists because alpha is a suffix of some pattern in P (that is what the search in K^r determined), so alpha^r is a prefix of some pattern in P^r. Let v be the first node at or below the end of that path in T^r. If z_v is defined, then i2 can be obtained from it: The leaf defining z_v (i.e., the leaf where z = z_v) is associated with a string P^r in P^r that contains a copy of alpha^r starting to the right of position one. Over all such occurrences of alpha^r in the strings of P^r, P^r contains the copy of alpha^r starting closest to its left end. That means that P contains a copy of alpha that is not a suffix of P, and over all such occurrences of alpha, P contains the copy of alpha ending closest to its right end. P is then the string in P that should be used to set i2. Moreover, alpha ends in P exactly z_v - 1 characters from the end of P. Hence, as argued above, i should be increased by z_v - 1 positions. In summary, we have

Theorem 7.16.1. If the first node v at or below the end of the alpha^r path in T^r has a defined value z_v, then i2 equals i + z_v - 1.

Using suffix tree T^r, the determination of i2 takes O(|alpha|) time, only doubling the time of the search used to find alpha. However, with proper preprocessing, the search used to find i2 can be eliminated. The details will be given below in Section 7.16.5. Now we turn to the computation of i3. This is again easy assuming the proper preprocessing of P. Again we use the generalized suffix tree T^r for P^r. To get the idea of the method, let P in P be any pattern such that a suffix of alpha is a prefix of P. That means that a prefix of alpha^r is a suffix of P^r. Now consider the path labeled alpha^r in T^r. Since some suffix of alpha is a prefix of P, some initial portion of the alpha^r path in T^r describes a suffix of P^r. There thus must be a leaf edge (u, z) branching off that path, where leaf z is associated with pattern P^r and the label of edge (u, z) is just the terminal character $. Conversely, let (u, z) be any edge branching off the alpha^r path and labeled with the single symbol $. Then the pattern P associated with z must have a prefix matching a suffix of alpha. These observations lead to the following preprocessing and search methods. In the preprocessing phase, when T^r is built, identify every edge (u, z) in T^r that is labeled only by the terminal character $. (The number z is used both as a leaf name and as the starting position of the suffix associated with that leaf.) For each such node u, set a variable d_u to z. For example, in Figure 7.12, d_u is the second number written at each node u (d_u is not defined for the root node). In the search phase, after matching a string alpha in T, the value of i3 (if needed) can be found as follows:

Theorem 7.16.2. The value of i3 should be set to i + d_u - 1, where d_u is the smallest d value at a node on the alpha^r path in T^r. If no node on that path has a d value defined, then i3 is undefined.
The proof is immediate and is left to the reader. Clearly, i3 can be found during the traversal of the alpha^r path in T^r used to search for i2. If neither i2 nor i3 exists, then i should be increased by the length of the smallest pattern in P.

5. For each node v in T^r, set d'_v equal to the smallest value of d_u for any ancestor u of v (including v itself).

6. Remove the subtree rooted at any unmarked node (including leaves) of T^r. (Nodes were marked in step 2.)

End.

The above preprocessing tasks are easily accomplished in linear time by standard tree traversal methods.

Using L in the search phase

Let L denote the tree at the end of the preprocessing. Tree L is essentially the familiar keyword tree K^r but is more compacted: Any path of nodes with only one descendant has been replaced with a single edge. Hence, for any i, the test to see if a pattern of P ends at position i can be executed using tree L rather than K^r. Moreover, unlike K^r, each node v in L now has associated with it the values needed to compute i2 and i3 in constant time. In detail, after the algorithm matches a string alpha in T by following the path alpha^r in L, the algorithm checks the first node v at or beneath the end of the path in L. If z_v is defined there, then i2 exists and equals i + z_v - 1. Next the algorithm checks the first node v at or above the end of the matched path. If d'_v is defined there, then i3 exists and equals i + d'_v - 1. The search phase will not miss any occurrence of a pattern if either the good suffix rule or the bad character rule is used by itself. However, the two rules can be combined to increment i by the largest amount specified by either of the two rules. Figure 7.13 shows tree L corresponding to the tree T^r shown in Figure 7.12.
Figure 7.13: Tree L corresponding to tree T^r for the set P = {wxa, xaqq, qxax}.

To see how L is used during the search, let T be qxaxtqqpst. The shifts of the patterns of P are shown in Figure 7.14.

Figure 7.14: The first comparisons start at position 3 of T and match ax. The value of z_v is equal to two, so a shift of one position occurs. String qxax matches; z_v is undefined, but d'_v is defined and equals 4, so a shift of three is made. The string qq matches, followed by a mismatch; z_v is undefined, but d'_v is defined to be four, so a shift of three is made, after which no further matches are found and the algorithm halts.

Other, more common ways to study the relatedness or similarity of two strings are extensively discussed in Part III.

Definition For any position i in a string S of length m, define the substring Prior_i to be the longest prefix of S[i..m] that also occurs as a substring of S[1..i - 1].

For example, if S = abaxcabaxabz, then Prior_7 is bax.

Definition For any position i in S, define l_i as the length of Prior_i. For l_i > 0, define s_i as the starting position of the left-most copy of Prior_i.

In the above example, l_7 = 3 and s_7 = 2.
Figure 7.13: Tree L corresponding to tree T f for the set P = (wxa, xaqq. qxax)
1234567890 qxaxtqqps t
wxa
xaqq qxax
Figure 7.14: The first comparisons start at position 3 of T and match ax. The value of z , is equal to two, so a shift of one position occurs. String qxax matches; z , is undefined, but d: is defined and equais 4, so a shift of three is made. The string qq matches, followed by a mismatch; z, is undefined, but d: is defined to be four, so a shift of three is made, after which no further matches are found and the algorithm halts.
To see how L: is used during the search, let T be qxmtqqps. The shifts of the P are shown in Figure 7.14.
Definition For any position i in a string S of length m, define the substring Prior_i to be the longest prefix of S[i..m] that also occurs as a substring of S[1..i − 1].
For example, if S = abaxcabaxabz then Prior_7 is bax.
Definition For any position i in S, define l_i as the length of Prior_i. For l_i > 0, define s_i as the starting position of the left-most copy of Prior_i.
In the above example, l_7 = 3 and s_7 = 2.
Other, more common ways to study the relatedness or similarity of two strings are extensively discussed in Part III.
term in the compressed string is the first character of S, we conclude by induction that the decompression algorithm can obtain the original string S.
Theorem 7.17.1. Compression algorithm 1 can be implemented to run in linear time as a one-pass, on-line algorithm to compress any input string S.
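The linear-time implementation behind Theorem 7.17.1 computes each (s_i, l_i) with a suffix tree. Purely as a sketch of compression algorithm 1 and its decompressor, the following Python version substitutes a naive quadratic-time search for the suffix-tree traversal; positions are reported 1-based as in the text.

def compress(S):
    # Compression algorithm 1: emit (s_i, l_i) when l_i > 0, else S(i).
    n, i, out = len(S), 0, []          # i is the 0-based version of i
    while i < n:
        l, s = 0, -1
        # Prior_i: longest prefix of S[i..n] occurring wholly to the left
        for length in range(1, i + 1):
            if i + length > n:
                break
            pos = S[:i].find(S[i:i + length])
            if pos < 0:
                break
            l, s = length, pos
        if l > 0:
            out.append((s + 1, l))     # 1-based s_i
            i += l
        else:
            out.append(S[i])
            i += 1
    return out

def decompress(terms):
    # Any pair points strictly left, so its target is already decompressed.
    S = []
    for t in terms:
        if isinstance(t, tuple):
            s, l = t
            S.extend(S[s - 1:s - 1 + l])
        else:
            S.append(t)
    return "".join(S)

For S = abacabaxabz, compress(S) yields ['a', 'b', (1, 1), 'c', (1, 3), 'x', (1, 2), 'z'], exactly the representation ab(1,1)c(1,3)x(1,2)z given above, and decompress inverts it.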
6. Discuss the relative advantages of the Aho-Corasick method versus the use of suffix trees
for the exact set matching problem, where the text is fixed and the set of patterns is varied over time. Consider preprocessing, search time, and space use. Consider both the cases when the text is larger than the set of patterns and vice versa.
7. In what way does the suffix tree more deeply expose the structure of a string compared to the Aho-Corasick keyword tree or the preprocessing done for the Knuth-Morris-Pratt or Boyer-Moore methods? That is, the sp values give some information about a string, but the suffix tree gives much more information about the structure of the string. Make this precise. Answer the same question about suffix trees and Z values.
8. Give an algorithm to take in a set of k strings and to find the longest common substring of each of the (k choose 2) pairs of strings. Assume each string is of length n. Since the longest common substring of any pair can be found in O(n) time, O(k²n) time clearly suffices. Now suppose that the string lengths are different but sum to m. Show how to find all the longest common substrings in time O(km). Now try for O(m + k²) (I don't know how to achieve this last bound).
9. The problem of finding substrings common to a set of distinct strings was discussed separately from the problem of finding substrings common to a single string, and the first problem seems much harder to solve than the second. Why can't the first problem just be reduced to the second by concatenating the strings in the set to form one large string?
10. By modifying the compaction algorithm and adding a little extra (linear space) information to the resulting DAG, it is possible to use the DAG to determine not only whether a pattern occurs in the text, but to find all the occurrences of the pattern. We illustrate the idea when there is only a single merge of nodes p and q. Assume that p has larger string depth than q and that u is the parent of p before the merge. During the merge, remove the subtree of p and put a displacement number of −1 on the new u to pq edge. Now suppose we search for a pattern P in the text and determine that P is in the text. Let i be a leaf below the path labeled P (i.e., below the termination point of the search). If the search traversed the u to pq edge, then P occurs starting at position i − 1; otherwise it occurs starting at position i.
Generalize this idea and work out the details for any number of node merges.
11. In some applications it is desirable to know the number of times an input string P occurs in a larger string S. After the obvious linear-time preprocessing, queries of this sort can be answered in O(|P|) time using a suffix tree. Show how to preprocess the DAG in linear time so that these queries can be answered in O(|P|) time using a DAG.
12. Prove the correctness of the compaction algorithm for suffix trees.
13. Let S^r be the reverse of the string S. Is there a relationship between the number of nodes in the DAG for S and the DAG for S^r? Prove it. Find the relationship between the DAG for S and the DAG for S^r (this relationship is a bit more direct than for suffix trees).
14. In Theorem 7.7.1 we gave an easily computed condition to determine when two subtrees of a suffix tree for string S are isomorphic. An alternative condition that is less useful for efficient computation is as follows: Let α be the substring labeling a node p and β be the substring labeling a node q in the suffix tree for S. The subtrees of p and q are isomorphic if and only if the set of positions in S where occurrences of α end equals the set of positions in S where occurrences of β end. Prove the correctness of this alternative condition for subtree isomorphism.
15. Does Theorem 7.7.1 still hold for a generalized suffix tree (for more than a single string)? If not, can it be easily extended to hold?
16. The DAG D for a string S can be converted to a finite-state machine by expanding each edge with more than one character on its label into a series of edges labeled by one character
Definition For any position i in string S, let ZL(i) denote the length of the longest substring beginning at i that appears somewhere in the string S[1..i].
Definition Given a DNA string S partitioned into exons and introns, the exon-average ZL value is the average ZL(i) taken over every position i in the exons of S. Similarly, the intron-average ZL value is the average ZL(i) taken over positions in introns of S.
It should be intuitive at this point that the exon-average ZL value and the intron-average ZL value can be computed in O(n) time, by using suffix trees to compute all the ZL(i) values. The technique resembles the way matching statistics are computed, but is more involved since the substring starting at i must also appear to the left of position i. The main empirical result of [146] is that the exon-average ZL value is lower than the intron-average ZL value by a statistically significant amount. That result is contrary to the expectation stated above that biologically significant substrings (exons in this case) should be more compressible than more random substrings (which introns are believed to be). Hence, the full biological significance of string compressibility remains an open question.
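The following is a direct, quadratic-time transcription of the ZL definitions; it is meant only to pin the definitions down, not to achieve the O(n) bound, which requires the suffix-tree computation mentioned above.

def ZL(S, i):
    # i is 1-based; returns the length of the longest substring starting
    # at position i that appears (entirely) within S[1..i].
    window = S[:i]
    best, length = 0, 1
    while i - 1 + length <= len(S) and S[i - 1:i - 1 + length] in window:
        best = length
        length += 1
    return best

def average_ZL(S, positions):
    # e.g., positions = all exon positions for the exon-average ZL value
    return sum(ZL(S, i) for i in positions) / len(positions)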
7.20. Exercises
1. Given a set S of k strings, we want to find every string in S that is a substring of some other string in S. Assuming that the total length of all the strings is n, give an O(n)-time algorithm to solve this problem. This result will be needed in algorithms for the shortest superstring problem (Section 16.17).
2. For a string S of length n, show how to compute the N(i), L(i), L'(i) and sp_i values (discussed in Sections 2.2.4 and 2.3.2) in O(n) time directly from a suffix tree for S.
3. We can define the suffix tree in terms of the keyword tree used in the Aho-Corasick (AC) algorithm. The input to the AC algorithm is a set of patterns P , and the AC tree is a compact representation of those patterns. For a single string S we can think of the n suffixes of S as a set of patterns. Then one can build a suffix tree for S by first constructing the AC tree for those n patterns, and then compressing, into a single edge, any maximal path through nodes with only a single child. If we take this approach, what is the relationship between the failure links used in the keyword tree and the suffix links used in Ukkonen's algorithm? Why aren't suffix trees built in this way?
4. A suffix tree for a string S can be viewed as a keyword tree, where the strings in the keyword tree are the suffixes of S. In this way, a suffix tree is useful in efficiently building a keyword tree when the strings for the tree are only implicitly specified. Now consider the following implicitly specified set of strings: Given two strings S_1 and S_2, let D be the set of all substrings of S_1 that are not contained in S_2. Assuming the two strings are of length n, show how to construct a keyword tree for set D in O(n) time. Next, build a keyword tree for D together with the set of substrings of S_2 that are not in S_1.
5. Suppose one has built a generalized suffix tree for a string S along with its suffix links (or link pointers). Show how to efficiently convert the suffix tree into an Aho-Corasick keyword tree.
compute matching statistics ms(j) for each position j in P. Number ms(j) is defined as the length of the longest substring starting at position j in P that matches some substring in T. We could proceed as before, but that would require a suffix tree for the long string T. Show how to find all the matching statistics for both T and P in O(|T|) time, using only a suffix tree for P.
22. In our discussion of matching statistics, we used the suffix links created by Ukkonen's algorithm. Suffix links can also be obtained by reversing the link pointers of Weiner's algorithm, but suppose that the tree cannot be modified. Can the matching statistics be computed in linear time using the tree and link pointers as given by Weiner's algorithm?
23. In Section 7.8 we discussed the reverse use of a suffix tree to solve the exact pattern matching problem: Find all occurrences of pattern P in text T. The solution there computed the matching statistic ms(i) for each position i in the text. Here is a modification of that method that solves the exact matching problem but does not compute the matching statistics: Follow the details of the matching statistic algorithm but never examine new characters in the text unless you are on the path from the root to the leaf labeled 1. That is, in each iteration, do not proceed below the string αy in the suffix tree until you are on the path that leads to leaf 1. When not on this path, the algorithm just follows suffix links and performs skip/count operations until it gets back on the desired path.
Prove that this modification correctly solves the exact matching problem in linear time. What advantages or disadvantages are there to this modified method compared to computing the matching statistics?
24. There is a simple practical improvement to the previous method. Let v be a point on the path to leaf 1 where some search ended, and let v' be the node on that path that was next entered by the algorithm (after some number of iterations that visit nodes off that path). Then, create a direct shortcut link from v to v'. The point is that if any future iteration ends at v, then the shortcut link can be taken to avoid the longer indirect route back to v'.
Prove that this improvement works (i.e., that the exact matching problem is correctly solved in this way). What is the relationship of these shortcut links to the failure function used in the Knuth-Morris-Pratt method? When the suffix tree encodes more than a single pattern, what is the relationship of these shortcut links to the backpointers used by the Aho-Corasick method?
25. We might modify the previous method even further: In each iteration, only follow the suffix link (to the end of α) and do not do any skip/count operations or character comparisons unless you are on the path to leaf 1. At that point, do all the needed skip/count computations to skip past any part of the text that has already been examined.
Fill in the details of this idea and establish whether it correctly solves the exact matching problem in linear time.
26. Recall the discussion of STSs in Section 7.8.3, page 135. Show in more detail how matching statistics can be used to identify any STSs that a string contains, assuming there is a "modest" number of errors in either the STS strings or the new string.
27. Given a set of k strings of length n each, find the longest common prefix for each pair of strings. The total time should be O(kn + p), where p is the number of pairs of strings having a common prefix of length greater than zero. (This can be solved using the lowest common ancestor algorithm discussed later, but a simpler method is possible.)
28. For any pair of strings, we can compute the length of the longest prefix common to the pair in time linear in their total length. This is a simple use of a suffix tree. Now suppose we are given k strings of total length n and want to compute the minimum length of all the pairwise longest common prefixes over all of the (k choose 2) pairs of strings, that is, the smallest
each. This finite-state machine will recognize substrings of S, but it will not necessarily be the smallest such finite-state machine. Give an example of this. We now consider how to build the smallest finite-state machine to recognize substrings of S. Again start with a suffix tree for S, merge isomorphic subtrees, and then expand each edge that is labeled with more than a single character. However, the merge operation must be done more carefully than before. Moreover, we imagine there is a suffix link from each leaf i to each leaf i + 1, for i < n. Then, there is a path of suffix links connecting all the leaves, and each leaf has zero leaves beneath it. Hence, all the leaves will get merged. Recall that Q is the set of all pairs (p, q) such that there exists a suffix link from p to q in T, where p and q have the same number of leaves in their respective subtrees. Suppose (p, q) is in Q. Let v be the parent of p, let γ be the label of the edge (v, p) into p, and let δ be the label of the edge into q. Explain why |γ| ≥ |δ|. Since every edge of the DAG will ultimately be expanded into a number of edges equal to the length of its edge-label, we want to make each edge-label as small as possible. Clearly, δ is a suffix of γ, and we will exploit this fact to better merge edge-labels. During a merge of p into q, remove all out edges from p as before, but the edge from v is not necessarily directed to q. Rather, if |δ| > 1, then the δ edge is split into two edges by the introduction of a new node u. The first of these edges is labeled with the first character of δ and the second one, edge (u, q), is labeled with the remaining characters of δ. Then the edge from v is directed to u rather than to q. Edge (v, u) is labeled with the first |γ| − |δ| + 1 characters of γ.
Using this modified merge, clean up the description of the entire compaction process and prove that the resulting DAG recognizes substrings of S. The finite-state machine for S is created by expanding each edge of this DAG labeled by more than a single character. Each node in the DAG is now a state in the finite-state machine.
17. Show that the finite-state machine created above has the fewest number of states of any finite-state machine that recognizes substrings of S. The key to this proof is that a deterministic finite-state machine has the fewest number of states if no state in it is equivalent to any other state. Two states are equivalent if, starting from them, exactly the same set of strings are accepted. See [228].
18. Suppose you already have the Aho-Corasick keyword tree (with backlinks). Can you use it to compute matching statistics in linear time, or if not, in some "reasonable" nonlinear time bound? Can it be used to solve the longest common substring problem in a reasonable time bound? If not, what is the difficulty?
19. In Section 7.16 we discussed how to use a suffix tree to search for all occurrences of a set of patterns in a given text. If the length of all the patterns is n and the length of the text is m, that method takes O(n + m) time and O(m) space. Another view of this is that the solution takes O(m) preprocessing time and O(n) search time. In contrast, the Aho-Corasick method solves the problem in the same total time bound but in O(n) space. Also, it needs O(n) preprocessing time and O(m) search time.
Because there is no definite relationship between n and m, sometimes one method will use less space or preprocessing time than the other. By using a generalized suffix tree for the set of patterns and the reverse role for suffix trees discussed in Section 7.8, it is possible to solve the problem with a suffix tree, obtaining exactly the same time and space bounds obtained by the Aho-Corasick method. Show in detail how this is done.
20. Using the reverse role for suffix trees discussed in Section 7.8, show how to solve the general DNA contamination problem of Section 7.5 using only a suffix tree for S_1, rather than a generalized suffix tree for S_1 together with all the possible contaminants.
21. In Section 7.8.1 we used a suffix tree for the small string P to compute the matching statistics ms(i) for each position i in the long text string T. Now suppose we also want to
efficiently find all interesting substrings in the database. If the database has total length m, then the method should take time O(m) plus time proportional to the number of interesting substrings.
35. (Smallest k-repeat) Given a string S and a number k, we want to find the smallest substring of S that occurs in S exactly k times. Show how to solve this problem in linear time.
36. Theorem 7.12.1, which states that there can be at most n maximal repeats in a string of length n, was established by connecting maximal repeats with suffix trees. It seems there should be a direct, simple argument to establish this bound. Try to give such an argument. Recall that it is not true that at most one maximal repeat begins at any position in S.
37. Given two strings S_1 and S_2 we want to find all maximal common pairs of S_1 and S_2. A common substring C is maximal if the addition to C of any character on either the right or left of C results in a string that is not common to both S_1 and S_2. For example, if A = aayxpt and B = aqyxpw then the string yxp is a maximal common substring, whereas yx is not. A maximal common pair is a triple (p_1, p_2, n'), where p_1 and p_2 are positions in S_1 and S_2, respectively, and n' is the length of a maximal common substring starting at those positions. This is a generalization of the maximal pair in a single string.
Letting m denote the total length of S_1 and S_2, give an O(m + k)-time solution to this problem, where k is the number of triples output. Give an O(m)-time method just to count the number of maximal common pairs and an O(m + l)-time algorithm to find one copy of each maximal common substring, where l is the total length of those strings. This is a generalization of the maximal repeat problem for a single string.
38. Another, equally efficient, but less concise way to identify supermaximal repeats is as follows: A maximal repeat in S represented by the left-diverse node v in the suffix tree for S is a supermaximal repeat if and only if no proper descendant of v is left diverse and no node in v's subtree (including v) is reachable via a path of suffix links from a left-diverse node other than v. Prove this. Show how to use the above claim to find all supermaximal repeats in linear time.
39. In biological applications, we are often not only interested in repeated substrings but in occurrences of substrings where one substring is an inverted copy of the other, a complemented copy, or (almost always) both. Show how to adapt all the definitions and techniques developed for repeats (maximal repeats, maximal pairs, supermaximal repeats, near-supermaximal repeats, common substrings) to handle inversion and complementation, in the same time bounds.
40. Give a linear-time algorithm that takes in a string S and finds the longest maximal pair in which the two copies do not overlap. That is, if the two copies begin at positions p_1 < p_2 and are of length n', then p_1 + n' ≤ p_2.
41. Techniques for handling repeats in DNA are not only motivated by repetitive structures that occur in the DNA itself but also by repeats that occur in data collected from the DNA. The paper by Leung et al. [298] gives one example. In that paper they discuss a problem of analyzing DNA sequences from E. coli, where the data come from more than 1,000 independently sequenced fragments stored in an E. coli database. Since the sequences were contributed by independent sequencing efforts, some fragments contained others, some of the fragments overlapped others, and many intervals of the E. coli genome were yet unsequenced. Consequently, before the desired analysis was begun, the authors wanted to "clean up" the data at hand, finding redundantly sequenced regions of the E. coli genome and packaging all the available sequences into a few contigs, i.e., strings that contain all the substrings in the database (these contigs may or may not be the shortest possible).
Using the techniques discussed for finding repeats, suffix-prefix overlaps, and so on, how
29. Verify that the all-pairs suffix-prefix matching problem discussed in Section 7.10 can be solved in O(km) time using any linear-time string matching method. That is, the O(km) time bound does not require a suffix tree. Explain why the bound does not involve a term for k².
30. Consider again the all-pairs suffix-prefix matching problem. It is possible to solve the problem in the same time bound without an explicit tree traversal. First, build a generalized suffix tree T(S) for the set of k strings S (as before), and set up a vector V of length k. Then successively initialize vector V to contain all zeros, and match each string in the set through the tree. The match using any string S_j ends at the leaf labeled with suffix 1 of string S_j. During this walk for S_j, if a node v is encountered containing index i in its list L(v), then write the string-depth of node v into position i of vector V. When the walk reaches the leaf for suffix 1 of S_j, V(i), for each i, specifies the length of the longest suffix of S_i that matches a prefix of S_j.
Establish the worst-case time analysis of this method. Compare any advantages or disadvantages (in practical space and/or time) of this method compared to the tree traversal method discussed in Section 7.10. Then propose modifications to the tree traversal method that maintain all of its advantages and also correct for its disadvantages.
31. A substring α is called a prefix repeat of string S if α is a prefix of S and has the form ββ for some string β. Give a linear-time algorithm to find the longest prefix repeat of an input string S. This problem was one of Weiner's motivations for developing suffix trees.
Very frequently in the sequence analysis literature, methods aimed at finding interesting features in a biological sequence begin by cataloging certain substrings of a long string. These methods almost always pick a fixed-length window, and then find all the distinct strings of that fixed length. The result of this window or q-gram approach is of course influenced by the choice of the window length. In the following three exercises, we show how suffix trees avoid this problem, providing a natural and more effective extension of the window approach. See also Exercise 26 of Chapter 14.
32. There are about m²/2 substrings of a string T whose length is m. Some of those substrings are identical and so occur more than once in the string. Since there are Θ(m²) substrings, we cannot count the number of times each appears in T in O(m) time. However, using a suffix tree we can get an implicit representation of these numbers in O(m) time. In particular, when any string P of length n is specified, the implicit representation should allow us to compute the frequency of P in T in O(n) time. Show how to construct the implicit frequency representation and how to use it.
33. Show how to count the number of distinct substrings of a string T in O(m) time, where the length of T is m. Show how to enumerate one copy of each distinct substring in time proportional to the length of all those strings.
34. One way to hunt for "interesting" sequences in a DNA sequence database is to look for substrings in the database that appear much more often than they would be predicted to appear by chance alone. This is done today and will become even more attractive when huge amounts of anonymous DNA sequences are available.
Assuming one has a statistical model to determine how likely any particular substring would occur by chance, and a threshold above which a substring is "interesting", show how to
45. Prove the correctness of the method presented in Section 7.13 for the circular string linearization problem.
46. Consider in detail whether a suffix array can be used to efficiently solve the more complex string problems considered in this chapter. The goal is to maintain the space-efficient properties of the suffix array while achieving the time-efficient properties of the suffix tree. Therefore, it would be cheating to first use the suffix array for a string to construct a suffix tree for that string.
47. Give the details of the preprocessing needed to implement the bad character rule in the Boyer-Moore approach to exact set matching.
48. In Section 7.16.3, we used a suffix tree to implement a weak good-suffix rule for a Boyer-Moore set matching algorithm. With that implementation, the increment of index i was determined in constant time after any test, independent even of the alphabet size. Extend the suffix tree approach to implement a strong good-suffix rule, where again the increment to i can be found in constant time. Can you remove the dependence on the alphabet in this case?
49. Prove Theorem 7.16.2.
50. In the Ziv-Lempel algorithm, when computing (s_i, l_i) for some position i, why should the traversal end at point p if the string-depth of p plus c_p equals i? What would be the problem with letting the match extend past character i?
51. Try to give some explanation for why the Ziv-Lempel algorithm outputs the extra character compared to compression algorithm 1.
52. Show how to compute all the n values ZL(i), defined in Section 7.18, in O(n) time. One solution is related to the computation of matching statistics (Section 7.8.1).
53. Successive refinement methods
Successive refinement is a general algorithmic technique that has been used for a number of string problems [114, 199, 265]. In the next several exercises, we introduce the ideas, connect successive refinement to suffix trees, and apply successive refinement to particular string problems. Let S be a string of length n. The relation E_k is defined on pairs of suffixes of S. We say i E_k j if and only if suffix i and suffix j of S agree for at least their first k characters. Note that E_k is an equivalence relation and so it partitions the elements into equivalence classes. Also, since S has n characters, every class in E_n is a singleton. Verify the following two facts:
Fact 1 For any i ≠ j, i E_{k+1} j if and only if i E_k j and i+1 E_k j+1.
Fact 2 Every E_{k+1} class is a subset of an E_k class and so the E_{k+1} partition is a refinement of the E_k partition.
We use a labeled tree T, called the refinement tree, to represent the successive refinements of the classes of E_k as k increases from 0 to n. The root of T represents class E_0 and contains all the n suffixes of S. Each child of the root represents a class of E_1 and contains the elements in that class. In general, each node at level l represents a class of E_l and its children represent all the E_{l+1} classes that refine it. What is the relationship of T to the keyword tree (Section 3.4) constructed from the set of n suffixes of S? Now modify T as follows. If node v represents the same set of suffixes as its parent node v', contract v and v' to a single node. In the new refinement tree, T', each nonleaf node has at least two children. What is the relationship of T' to the suffix tree for string S? Show how to convert a suffix tree for S into tree T' in O(n²) time.
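Before tackling the construction, it may help to compute a few E_k partitions directly from the definition. The following naive sketch groups suffix positions by their first k characters (a terminal symbol such as $ can be appended so that short suffixes fall into singleton classes); classes are returned in the lexical order of the strings that define them.

def Ek_classes(S, k):
    # i E_k j iff suffixes i and j agree on their first k characters.
    from collections import defaultdict
    classes = defaultdict(list)
    for i in range(1, len(S) + 1):
        classes[S[i - 1:i - 1 + k]].append(i)   # key: first k characters
    return [classes[key] for key in sorted(classes)]

For example, Ek_classes("mississippi$", 2) returns [[12], [11], [8], [2, 5], [1], [10], [9], [4, 7], [3, 6]], the lexically ordered E_2 classes used in Exercise 54 below.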
42. k-cover problem. Given two input strings S_1 and S_2 and a parameter k, a k-cover C is a set of substrings of S_1, each of length k or greater, such that S_2 can be expressed as the concatenation of the substrings of C in some order. Note that the substrings contained in C may overlap in S_1, but not in S_2. That is, S_2 is a permutation of substrings of S_1 that are each of length k or greater. Give a linear-time algorithm to find a k-cover from two strings S_1 and S_2, or determine that no such cover exists.
If there is no k-cover, then find a set of substrings of S_1, each of length k or greater, that cover the most characters of S_2. Or, find the largest k' < k (if any) such that there is a k'-cover. Give linear-time algorithms for these problems. Consider now the problem of finding nonoverlapping substrings in S_1, each of length k or greater, to cover S_2, or cover it as much as possible. This is a harder problem. Grapple with it as best you can.
43. Exon shuffling. In eukaryotic organisms, a gene is composed of alternating exons, whose concatenation specifies a single protein, and introns, whose function is unclear. Similar exons are often seen in a variety of genes. Proteins are often built in a modular form, being composed of distinct domains (units that have distinct functions or distinct folds that are independent of the rest of the protein), and the same domains are seen in many different proteins, although in different orders and combinations. It is natural to wonder if exons correspond to individual protein domains, and there is some evidence to support this view. Hence modular protein construction may be reflected in the DNA by modular gene construction based on the reuse and reordering of stock exons. It is estimated that all proteins sequenced to date are made up of just a few thousand exons [468]. This phenomenon of reusing exons is called exon shuffling, and proteins created via exon shuffling are called mosaic proteins. These facts suggest the following general search problem.
The problem: Given anonymous, but sequenced, strings of DNA from protein-coding regions where the exons and introns are not known, try to identify the exons by finding common regions (ideally, identical substrings) in two or more DNA strings. Clearly, many of the techniques discussed in this chapter concerning common or repeated substrings could be applied, although they would have to be tried out on real data to test their utility or limitations. No elegant analytical result should be expected. In addition to methods for repeats and common substrings, does the k-cover problem seem of use in studying exon shuffling? That question will surely require an empirical, rather than theoretical answer. Although it may not give an elegant worst-case result, it may be helpful to first find all the maximal common substrings of length k or more.
44. Prove Lemma 7.14.1.
of the (singleton) classes describes a permutation of the integers 1 to n. Prove that this permutation is the suffix array for string S. Conclude that the reverse refinement method creates a suffix array in O(n log n) time. What is the space advantage of this method over the O(n)-time method detailed in Section 7.14.1?
56. Primitive tandem arrays
Recall that a string α is called a tandem array if α is periodic (see Section 3.2.1), i.e., it can be written as β^l for some l ≥ 2. When l = 2, the tandem array can also be called a tandem repeat. A tandem array α = β^l contained in a string S is called maximal if there are no additional copies of β before or after α. Maximal tandem arrays were initially defined in Exercise 4 in Chapter 1 (page 13) and the importance of tandem arrays and repeats was discussed in Section 7.11.1. We are interested in identifying the maximal tandem arrays contained in a string. As discussed before, it is often best to focus on a structured subset of the strings of interest in order to limit the size of the output and to identify the most informative members. We focus here on a subset of the maximal tandem arrays that succinctly and implicitly encode all the maximal tandem arrays. (In Sections 9.5, 9.6, and 9.6.1 we will discuss efficient methods to find all the tandem repeats in a string, where the repeats are allowed to contain some errors.) We use the pair (β, l) to describe the tandem array β^l. Now consider the tandem array α = abababababababab. It can be described by the pair (abababab, 2), or by (abab, 4), or by (ab, 8). Which description is best? Since the first two pairs can be deduced from the last, we choose the last pair. This "choice" will now be precisely defined. A string β is said to be primitive if β is not periodic. For example, the string ab is primitive, whereas abab is not. The pair (ab, 8) is the preferred description of abababababababab because string ab is primitive. The preference for primitive strings extends naturally to the description of maximal tandem arrays that occur as substrings in larger strings. Given a string S, we use the triple (i, β, l) to mean that a tandem array (β, l) occurs in S starting at position i. A triple (i, β, l) is called a pm-triple if β is primitive and β^l is a maximal tandem array.
For example, the maximal tandem arrays in mississippi described by the pm-triples are (2,iss,2), (3,s,2), (3,ssi,2), (6,s,2) and (9,p,2). Note that two or more pm-triples can have the same first number, since two different maximal tandem arrays can begin at the same position. For example, the two maximal tandem arrays ss and ssissi both begin at position three of mississippi. The pm-triples succinctly encode all the tandem arrays in a given string S. Crochemore [114] (with different terminology) used a successive refinement method to find all the pm-triples in O(n log n) time. This implies the very nontrivial fact that in any string of length n there can be only O(n log n) pm-triples. The method in [114] finds the E_k partition for each k. The following lemma is central:
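Since primitiveness and maximality are easy to check directly, a brute-force enumeration of pm-triples is straightforward (though far from the O(n log n) bound); the sketch below is ours, written only to make the definitions concrete.

def pm_triples(S):
    # Enumerate all pm-triples (i, beta, l) of S by brute force; i is the
    # 1-based starting position, as in the text.
    def is_primitive(b):
        m = len(b)
        return not any(m % d == 0 and b == b[:d] * (m // d)
                       for d in range(1, m))
    n, triples = len(S), []
    for i in range(n):                        # 0-based candidate start
        for m in range(1, (n - i) // 2 + 1):  # m = |beta|
            beta = S[i:i + m]
            if not is_primitive(beta):
                continue
            l = 1                             # count consecutive copies
            while S[i + l * m:i + (l + 1) * m] == beta:
                l += 1
            if l < 2:
                continue                      # not a tandem array
            if i >= m and S[i - m:i] == beta:
                continue                      # extends left: not maximal
            triples.append((i + 1, beta, l))  # right-maximality follows
    return triples                            # from the while loop ending

pm_triples("mississippi") returns [(2, 'iss', 2), (3, 's', 2), (3, 'ssi', 2), (6, 's', 2), (9, 'p', 2)], the five pm-triples listed above.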
Lemma 7.20.1. There is a tandem repeat of a k-length substring β starting at position i of S if and only if the numbers i and i + k are both contained in a single class of E_k and no numbers between i and i + k are in that class.
Prove Lemma 7.20.1. One direction is easy. The other direction is harder and it may be useful to use Lemma 3.2.1 (page 40).
57. Lemma 7.20.1 makes it easy to identify pm-triples. Assume that the indices in each class of E_k are sorted in increasing order. Lemma 7.20.1 implies that (i, β, l) is a pm-triple, where β is a k-length substring, if and only if some single class of E_k contains a maximal series of numbers i, i + k, i + 2k, ..., i + jk, such that each consecutive pair of numbers differs by k. Explain this in detail.
54. Several string algorithms use successive refinement without explicitly finding or representing all the classes in the refinement tree. Instead, they construct only some of the classes or only compute the tree implicitly. The advantage is reduced use of space in practice or an algorithm that is better suited for parallel computation [116]. The original suffix array construction method [308] is such an algorithm. In that algorithm, the suffix array is obtained as a byproduct of a successive refinement computation where the E_k partitions are computed only for values of k that are a power of two. We develop that method here. First we need an extension of Fact 1:
Fact 3 For any i ≠ j, i E_{2k} j if and only if i E_k j and i + k E_k j + k.
Each class of E_k, for any k, holds the starting locations of a k-length substring of S. The algorithm in [308] constructs a suffix array for S using the reverse refinement approach, with the added detail that the classes of E_k are kept in the lexical order of the strings associated with the classes.
More specifically, to obtain the E_2 partition of S = mississippi$, process the classes of E_1 in order, from the lexically smallest to the lexically largest class. Processing the first class, {12}, results in the creation of the E_2 class {11}. The second E_1 class {2,5,8,11} marks indices 1, 4, 7, and 10, and hence it creates the three E_2 classes {1}, {4,7} and {10}. Class {9,10} of E_1 creates the two classes {8} and {9}. Class {3,4,6,7} of E_1 creates classes {2,5} and {3,6} of E_2. Each class of E_2 holds the starting locations of identical substrings of length one or two. These classes, lexically ordered by the substrings they represent, are: {12}, {11}, {8}, {2,5}, {1}, {10}, {9}, {4,7}, {3,6}. The classes of E_4, in lexical order, are: {12}, {11}, {8}, {2,5}, {1}, {10}, {9}, {7}, {4}, {6}, {3}. Note that 2 and 5 remain in the same E_4 class because 4 and 7 were in the same E_2 class. The E_2 classes {4,7} and {3,6} are each refined in E_4. Explain why. Although the general idea of reverse refinement should now be clear, efficient implementation requires a number of additional details. Give complete implementation details and analysis, proving that the E_2k classes can be obtained from the E_k classes in O(n) time. Be sure to detail how the classes are kept in lexical order. Assume n is a power of two. Note that the algorithm can stop as soon as every class is a singleton, and this must happen within log_2 n iterations. When the algorithm ends, the order
The above problems can be generalized in many different directions and solved in essentially the same way. One particular generalization is the exact matching version of the primer selection problem. (In Section 12.2.5 we will consider a version of this problem that allows errors.) The primer selection problem arises frequently in molecular biology. One such situation is in "chromosome walking", a technique used in some DNA sequencing methods or gene location problems. Chromosome walking was used extensively in the location of the Cystic Fibrosis gene on human chromosome 7. We discuss here only the DNA sequencing application. In DNA sequencing, the goal is to determine the complete nucleotide sequence of a long string of DNA. To understand the application you have to know two things about existing sequencing technology. First, current common laboratory methods can only accurately sequence a small number of nucleotides, from 300 to 500, from one end of a longer string. Second, it is possible to replicate substrings of a DNA string starting at almost any point as long as you know a small number of the nucleotides, say nine, to the left of that point. This replication is done using a technology called polymerase chain reaction (PCR), which has had a tremendous impact on experimental molecular biology. Knowing as few as nine nucleotides allows one to synthesize a string that is complementary to those nine nucleotides. This complementary string can be used to create a "primer", which finds its way to the point in the long string containing the complement of the primer. It then hybridizes with the longer string at that point. This creates the conditions that allow the replication of part of the original string to the right of the primer site. (Usually PCR is done with two primers, one for each end, but here only one "variable" primer is used. The other primer is fixed and can be ignored in this discussion.) The above two facts suggest a method to sequence a long string of DNA, assuming we know the first nine nucleotides at the very start of the string. After sequencing the first 300 (say) nucleotides, synthesize a primer complementary to the last nine nucleotides just sequenced. Then replicate a string containing the next 300 nucleotides, sequence that substring and continue. Hence the longer string gets sequenced by successively sequencing 300 nucleotides at a time, using the end of each sequenced substring to create the primer that initiates sequencing of the next substring. Compared to the shotgun sequencing method (to be discussed in Section 16.14), this directed method requires much less sequencing overall, but because it is an inherently sequential process it takes longer to sequence a long DNA string. (In the Cystic Fibrosis case another idea, called gene jumping, was used to partially parallelize this sequential process, but chromosome walking is generally laboriously sequential.) There is a common problem with the above chromosome walking approach. What happens if the string consisting of the last nine nucleotides appears in another place in the larger string? Then the primer may not hybridize in the correct position and any sequence determined from that point would be incorrect. Since we know the sequence to the left of our current point, we can check the known sequence to see if a string complementary to the primer exists to the left. If it does, then we want to find a nine-length substring near the end of the last determined sequence that does not appear anywhere earlier.
That substring can then be used to form the primer. The result will be that the next substring sequenced will resequence some known nucleotides and so sequence somewhat fewer than 300 new nucleotides. Problem: Formalize this primer selection problem and show how to solve it efficiently using suffix trees. More generally, for each position i in string α find the shortest substring that begins at i and that appears nowhere else in α or S.
By using Fact 1 in place of Fact 3, and by modifying the reverse refinement method developed in Exercises 54 and 55, show how to compute all the E_k partitions for all k (not just the powers of two) in O(n²) time. Give implementation details to maintain the indices of each class sorted in increasing order. Next, extend that method, using Lemma 7.20.1, to obtain an O(n²)-time algorithm to find all the pm-triples in a string S.
58. To find all the pm-triples in O(n log n) time, Crochemore [114] used one additional idea. To introduce the idea, suppose all the E_k classes except one, C say, have been used as refiners to create E_{k+1} from E_k. Let p and q be two indices that are together in some E_k class. We claim that if p and q are not together in the same E_{k+1} class, then one of them (at least) has already been placed in its proper E_{k+1} class. The reason is that by Fact 1, p + 1 and q + 1 cannot both be in the same E_k class. So by the time C is used as a refiner, either p or q has been marked and moved by an E_k class already used as a refiner.
Now suppose that each E_k class is held in a linked list and that when a refiner identifies a number, p say, then p is removed from its current linked list and placed in the linked list for the appropriate E_{k+1} class. With that detail, if the algorithm has used all the E_k classes except C as refiners, then all the E_{k+1} classes are correctly represented by the newly created linked lists plus what remains of the original linked lists for E_k. Explain this in detail. Conclude that one E_k class need not be used as a refiner. Being able to skip one class while refining E_k is certainly desirable, but it isn't enough to produce the stated bound. To do that we have to repeat the idea on a larger scale.
Theorem 7.20.1. When refining E_k to create E_{k+1}, suppose that for every k ≥ 1, exactly one (arbitrary) child of each E_{k−1} class is skipped (i.e., not used as a refiner). Then the resulting linked lists correctly identify the E_{k+1} classes.
Prove Theorem 7.20.1. Note that Theorem 7.20.1 allows complete freedom in choosing which child of an E_{k−1} class to skip. This leads to the following:
Theorem 7.20.2. If, for every k ≥ 1, the largest child of each E_{k−1} class is skipped, then the total size of all the classes used as refiners is at most n log_2 n.
Prove Theorem 7.20.2. Now provide all the implementation details to find all the pm-triples in S in O(n log n) time.
59. Above, we established the bound of O(n log n) pm-triples as a byproduct of the algorithm to find them. But a direct, nonalgorithmic proof is possible, still using the idea of successive refinement and Lemma 7.20.1. In fact, the bound of 3n log_2 n is fairly easy to obtain in this way. Do it.
60. Folklore has it that for any position i in S, if there are two pm-triples, (i, β, l) and (i, β', l'), and if |β'| > |β|, then |β'| ≥ 2|β|. That would limit the number of pm-triples with the same first number to log_2 n, and the O(n log n) bound would be immediate. Show by example that the folklore belief is false.
61. Primer selection problem. Let S be a set of strings over some finite alphabet Σ. Give an algorithm (using a generalized suffix tree) to find a shortest string over Σ that is a substring in none of the strings of S. The algorithm should run in time proportional to the sum of the lengths of the strings in S. A more useful version of the problem is to find the shortest string that is longer than a certain minimum length and is not a substring of any string of S. Often, a string α is given along with the set S. Now the problem becomes one of finding a shortest substring of α (if any) that does not appear as a substring of any string in S. More generally, for every i, compute the shortest substring (if any) that begins at position i of α and does not appear as a substring of any string in S.
8
Constant-Time Lowest Common Ancestor Retrieval
8.1. Introduction
We now begin the discussion of an amazing result that greatly extends the usefulness of suffix trees (in addition to many other applications).
Definition In a rooted tree T, a node u is an ancestor of a node v if u is on the unique path from the root to v. With this definition a node is an ancestor of itself. A proper ancestor of v refers to an ancestor that is not v.
Definition In a rooted tree T, the lowest common ancestor (lca) of two nodes x and y is the deepest node in T that is an ancestor of both x and y.
For example, in Figure 8.1 the lca of nodes 6 and 10 is node 5 while the lca of 6 and 3
is 1. The amazing result is that after a linear amount of preprocessing of a rooted tree, any two nodes can then be specified and their lowest common ancestor found in constant time. That is, a rooted tree with n nodes is first preprocessed in O(n) time, and thereafter any lowest common ancestor query takes only constant time to solve, independent of n. Without preprocessing, the best worst-case time bound for a single query is O(n), so this is a most surprising and useful result. The lca result was first obtained by Harel and Tarjan [214] and later simplified by Schieber and Vishkin [393]. The exposition here is based on the latter approach.
62. In the primer selection problem, the goal of avoiding incorrect hybridizations to the right of the sequenced part of the string is more difficult since we don't yet know the sequence. Still, there are some known sequences that should be avoided. As discussed in Section 7.11.1, eukaryotic DNA frequently contains regions of repeated substrings, and the most commonly occurring substrings are known. On the problem that repeated substrings cause for chromosome walking, R. Weinberg writes: They were like quicksand; anyone treading on them would be sucked in and then propelled, like Alice in Wonderland, through some vast subterranean tunnel system, only to resurface somewhere else in the genome, miles away from the starting site. The genome was riddled with these sinkholes, called "repeated sequences." They were guaranteed to slow any chromosomal walk to a crawl.
So a more general primer problem is the following: Given a substring α of 300 nucleotides (the last substring sequenced), a string β of known sequence (the part of the long string to the left of α whose sequence is known), and a set S of strings (the common parts of known repetitive DNA strings), find the furthest right substring in α of length nine that is not a substring of β or any string in set S. If there is no such string, then we might seek a string of length larger than nine that does not appear in β or S. However, a primer much larger than nine nucleotides long may falsely hybridize for other reasons. So one must balance the constraints of keeping the primer length in a certain range, making it unique, and placing it as far right as possible.
Problem: Formalize this version of the primer selection problem and show how to apply suffix trees to it.
Probe selection
A variant of the primer selection problem is the hybridization probe selection problem. In DNA fingerprinting and mapping (discussed in Chapter 16) there is frequent need to see which oligomers (short pieces of DNA) hybridize to some target piece of DNA. The purpose of the hybridization is not to create a primer for PCR but to extract some information about the target DNA. In such mapping and fingerprinting efforts, contamination of the target DNA by vector DNA is common, in which case the oligo probe may hybridize with the vector DNA instead of the target DNA. One approach to this problem is to use specifically designed oligomers whose sequences are rarely in the genome of the vector, but are frequently found in the cloned DNA of interest. This is precisely the primer (or probe) selection problem.
In some ways, the probe selection problem is a better fit than the primer problem to the exact matching techniques discussed in this chapter. This is because when designing probes for mapping, it is desirable and feasible to design probes so that even a single mismatch will destroy the hybridization. Such stringent probes can be created under certain conditions [134, 177].
R. Weinberg, Racing to the Beginning of the Road: The Search for the Origin of Cancer. Harmony Books, 1996.
Figure 8.2: A binary tree with four leaves. The path numbers are written both in binary and in base ten.
that encode paths to them. The notation B will refer to this complete binary tree, and T will refer to an arbitrary tree. Suppose that B is a rooted complete binary tree with p leaves (n = 2p − 1 nodes in total), so that every internal node has exactly two children and the number of edges on the path from the root to any leaf in B is d = log_2 p. That is, the tree is complete and all leaves are at the same depth from the root. Each node v of B is assigned a d + 1 bit number, called its path number, that encodes the unique path from the root to v. Counting from the left-most bit, the ith bit of the path number for v corresponds to the ith edge on the path from the root to v: A 0 for the ith bit from the left indicates that the ith edge on the path goes to a left child, and a 1 indicates a right child. The bits that describe the path are called path bits. Each path number is then padded out to d + 1 bits by adding a 1 to the right of the path bits followed by as many additional 0s as needed to make d + 1 bits. Thus for example, if d = 6, the node with path bits 0010 is named by the 7-bit number 0010100. The root node for d = 6 would be 1000000. In fact, the root node always has a number with left bit 1 followed by d 0s. (See Figure 8.2 for an additional example.) We will refer to nodes in B by their path numbers. As the tree in Figure 8.2 suggests, path numbers have another well-known description - that of inorder numbers. That is, when the nodes of B are numbered by an inorder traversal (recursively number the left child, number the root, and then recursively number the right child), the resulting node numbers are exactly the path numbers discussed above. We leave the proof of this for the reader (it has little significance in our exposition). The path number concept is preferred since it explicitly relates the number of a node to the description of the path to it from the root.
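As a quick illustration of the padding rule, the following sketch computes a path number from a string of path bits; the function name is ours.

def path_number(path_bits, d):
    # path_bits: '0'/'1' choices of left/right edges from the root
    # ('' for the root itself); pad with a 1 and trailing 0s to d+1 bits.
    return int(path_bits + '1' + '0' * (d - len(path_bits)), 2)

For d = 6, path_number('0010', 6) returns 0b0010100 as in the example above, and path_number('', 6) returns 0b1000000, the root. For the four-leaf tree of Figure 8.2 (d = 2), the leaves get 1, 3, 5, 7 and the internal nodes 2, 6 with root 4, which is exactly the inorder numbering.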
Given two nodes i and j, we want to find lca(i, j) in B (remembering that both i and j are path numbers). First, when lca(i, j) is either i or j (i.e., one of these two nodes is an ancestor of the other), then this can be detected by a very simple constant-time algorithm, discussed in Exercise 3. So assume that lca(i, j) is neither i nor j. The algorithm begins by taking the exclusive or (XOR) of the binary number for i and the binary number for j, denoting the result by x_ij. The XOR of two bits is 1 if and only if the two bits are different, and the XOR of two d + 1 bit numbers is obtained by independently taking the XOR of
Note that normally when discussing binary numbers, the bits are numbered from right (least significant) to left (most significant). This is opposite the left-to-right ordering used for strings and for path numbers.
However, although the method is easy to program, it is not trivial to understand at first and has been described as based on "bit magic". Nonetheless, the result has been so heavily applied in many diverse string methods, and its use is so critical in those methods, that a detailed discussion of the result is worthwhile. We hope the following exposition is a significant step toward making the method more widely understood.
standard depth-first numbering (preorder numbering) of nodes (see Figure 8.1). With this numbering scheme, the nodes in the subtree of any node v in T have consecutive depth-first numbers, beginning with the number for v. That is, if there are q nodes in the subtree rooted at v, and v gets numbered k, then the numbers given to the other nodes in the subtree are k + 1 through k + q − 1. For convenience, from this point on the nodes in T will be referred to by their depth-first numbers. That is, when we refer to node v, v is both a node and a number. Be careful not to confuse depth-first numbers used for the general tree T with path numbers used only for the binary tree B.
Definition For any number k, h(k) denotes the position (counting from the right) of the least-significant 1-bit in the binary representation of k.
For example, h(8) = 4 since 8 in binary is 1000, and h(5) = 1 since 5 in binary is 101. Another way to think of this is that h(k) is one plus the number of consecutive zeros at the right end of k.
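In most languages h(k) is one or two machine instructions; in Python, for example, the standard two's-complement trick isolates the low-order 1-bit (a convenience of the sketch, not part of the formal model):

def h(k):
    # (k & -k) keeps only the least-significant 1-bit of k; its bit length
    # is then the position of that bit counting from the right.
    return (k & -k).bit_length()

so h(8) == 4 and h(5) == 1, as in the examples above.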
Definition In a complete binary tree the height of a node is the number of nodes on the path from it to a leaf. The height of a leaf is one.
The following lemma states a crucial fact that is easy to prove by induction on the height of the nodes.
Lemma 8.5.1. For any node k (node with path number k) in B, h(k) equals the height of node k in B.
For example, node 8 (binary 1000) is at height 4, and the path from it to a leaf has four nodes (three edges).
Definition For a node v of T, let I(v) be a node w in the subtree of v (including v itself) such that h(w) is maximum over all nodes in that subtree. That is, over all the nodes in the subtree of v, I(v) is a node (depth-first number) whose binary representation has the largest number of consecutive zeros at its right end. Figure 8.4 shows the node numbers from Figure 8.1 in binary and base 10. Then I(1), I(5), and I(8) are all 8, I(2) and I(4) are both 4, and I(v) = v for every other node in the figure.
Figure 8.4: Node numbers given in four-bit binary, to illustrate the definition of I(v).
Figure 8.3: A binary tree with four leaves. The path numbers are in binary, and the position of the least-significant 1-bit is given in base ten.
each bit of the two numbers. For example, the XOR of 00101 and 10011 is 10110. Since i and j are O(log n) bits long, XOR is a constant-time operation in our model. The algorithm next finds the most significant (left-most) 1-bit in x_ij. If the left-most 1-bit in the XOR of i and j is in position k (counting from the left), then the left-most k − 1 bits of i and j are the same, and the paths to i and j must agree for the first k − 1 edges and then diverge. It follows that the path number for lca(i, j) consists of the left-most k − 1 bits of i (or j) followed by a 1-bit followed by d + 1 − k zeros. For example, in Figure 8.2, the XOR of 101 and 111 (nodes 5 and 7) is 010, so their respective paths share one edge - the right edge out of the root. The XOR of 010 and 101 (nodes 2 and 5) is 111, so the paths to 2 and 5 have no agreement, and hence 100, the root, is their lowest common ancestor. Therefore, to find lca(i, j), the algorithm must XOR two numbers, find the left-most 1-bit in the result (say at position k), shift i right by d + 1 − k places, set the right-most bit to a 1, and shift it back left by d + 1 − k places. By assumption, each of these operations can be done in constant time, and hence the lowest common ancestor of i and j can be found in constant time in B. In summary, we have
Theorem 8.4.1. In a complete binary tree, after linear-time preprocessing to name nodes by their path numbers, any lowest common ancestor query can be answered in constant time.
This simple case of a complete binary tree is very special, but it is presented both to develop intuition and because complete binary trees are used in the description of the general case. Moreover, by actually using complete binary trees, a very elegant and relatively simple algorithm can answer lca queries in constant time, if O(n log n) time is allowed for preprocessing T and O(n log n) space is available after the preprocessing. That method is explored in Exercise 12. The lca algorithm we will present for general trees builds on the case of a complete binary tree. The idea (conceptually) is to map the nodes of a general tree T to the nodes of a complete binary tree B in such a way that lca retrievals on B will help to quickly solve lca queries on T. We first describe the general lca algorithm assuming that the T to B mapping is explicitly used, and then we explain how explicit mapping can be avoided.
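The constant-time retrieval in B is short enough to state in full. The following sketch implements exactly the steps described above (XOR, locate the left-most differing bit, keep the common prefix, set the next bit to 1, zero the rest); like the text, it assumes neither node is an ancestor of the other.

def lca_B(i, j):
    # i and j are path numbers in the complete binary tree B. The shift
    # amount equals d + 1 - k in the notation of the text.
    shift = (i ^ j).bit_length() - 1
    return ((i >> shift) | 1) << shift

For the tree of Figure 8.2, lca_B(5, 7) returns 6 and lca_B(2, 5) returns 4, matching the two XOR examples worked out above.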
Figure 8.7: A node v in B is numbered if there is a node in T that maps to v.
Definition A run in T is a maximal subset of nodes of T, all of which have the same I value.
That is, two nodes u and v are in the same run if and only if I(u) = I(v). Figure 8.6 shows a partition of the nodes of T into runs. Algorithmically, we can set I(v) for all nodes using a linear-time bottom-up traversal of T as follows: For every leaf v, I(v) = v. For every internal node v, I(v) = v if h(v) is greater than h(I(v')) for every child v' of v. Otherwise, I(v) is set to the I(v') value of the child v' whose h(I(v')) value is the maximum over all children of v. (A sketch of this traversal in code appears after Lemma 8.6.1.) The result is that each run forms an upward path of nodes in T. And, since the h(I()) values never decrease along any upward path in T, it follows that
Lemma 8.6.1. For any node v, node I(v) is the deepest node in the run containing node v.
These facts are illustrated in Figure 8.6.
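As promised above, here is a minimal C sketch of the bottom-up computation of the I values. The node representation (a dfs number, a child array, and a slot for I) is hypothetical; any rooted-tree representation with child pointers would do.

    /* h(k): position of the least-significant 1-bit of k, counting from 1.
       __builtin_ctz stands in for the RAM model's table lookup. */
    static int h(unsigned k) { return __builtin_ctz(k) + 1; }

    struct node {
        unsigned dfs;            /* depth-first search number of this node */
        unsigned I;              /* I(v), filled in by set_I */
        int nchild;
        struct node **child;
    };

    /* Bottom-up computation of I(v): a node keeps its own dfs number
       unless some child's subtree holds a number with more trailing
       zeros (a greater h value). By Lemma 8.5.2 there are no ties. */
    void set_I(struct node *v)
    {
        v->I = v->dfs;
        for (int c = 0; c < v->nchild; c++) {
            set_I(v->child[c]);
            if (h(v->child[c]->I) > h(v->I))
                v->I = v->child[c]->I;
        }
    }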
Definition Define the head of a run to be the node of the run closest to the root.
For example, in Figure 8.6 node 1 (0001) is the head of a run of length three, node 2 (0010) is the head of a run of length two, and every remaining node (not in either of those two runs) is the head of a run consisting only of itself.
Clearly, if node v is an ancestor of a node w then h(I(v)) ≥ h(I(w)). Another way to say this is that the h(I(v)) values never decrease along any upward path in T. This fact will be important in several of the proofs below. In the tree in Figure 8.4, node I(v) is uniquely determined for each node v. That is, for each node v there is exactly one w in v's subtree such that h(w) is maximum. This is no accident, and it will be important in the lca algorithm. We now prove this fact.
Lemma 8.5.2. For any node v in T, there is a unique node w in the subtree of v such that h(w) is maximum over all nodes in v's subtree.
PROOF Suppose not, and let u and w be two nodes in the subtree of v such that h(u) = h(w) ≥ h(q) for every node q in that subtree. Assume h(u) = i. By adding zeros to the left ends if needed, we can consider the two numbers u and w to have the same number of bits, say l. Since u ≠ w, those two numbers must differ in some bit to the left of bit i (since by assumption bit i is 1 in both u and w, and all bits to the right of bit i are zero in both). Assume u > w, and let k be the left-most position where such a difference between u and w occurs. Consider the number N composed of the left-most l - k bits of u followed by a 1 in bit k followed by k - 1 zeros (see Figure 8.5). Then N is strictly less than u and greater than w. Hence N must be the depth-first number given to some node in the subtree of v, because the depth-first numbers given to nodes below v form a consecutive interval. But h(N) = k > i = h(u), contradicting the fact that h(u) ≥ h(q) for all nodes q in the subtree of v. Hence the assumption that h(u) = h(w) leads to a contradiction, and the lemma is proved.
What is this crazy mapping doing? In the end, the programming details of this mapping (preprocessing) are very simple, and will become simpler in Section 8.9. The mapping only requires standard linear-time traversals of tree T (a minor programming exercise in a sophomore-level course). However, for most readers, what exactly the mapping accomplishes is quite unintuitive, because it is a many-one mapping. Certainly, ancestry relations in T are not perfectly preserved by the mapping into B [indeed, how could they be when the depth of T can be n while the depth of B is bounded by O(log n)], but much ancestry information is preserved, as shown in the next key lemma. Recall that a node is defined to be an ancestor of itself.

Lemma 8.7.1. If z is an ancestor of x in T then I(z) is an ancestor of I(x) in B. Stated differently, if z is an ancestor of x in T then either z and x are on the same run in T or node I(z) is a proper ancestor of node I(x) in B.
Figures 8.6 and 8.7 illustrate the claim in the lemma.
PROOF OF LEMMA 8.7.1 The proof is trivial if I(z) = I(x), so assume that they are unequal. Since z is an ancestor of x in T, h(I(z)) ≥ h(I(x)) by the definition of I, but equality is only possible if I(z) = I(x). So h(I(z)) > h(I(x)). Now h(I(z)) and h(I(x)) are the respective heights of nodes I(z) and I(x) in B, so I(z) is at a height greater than the height of I(x) in B. Let h(I(z)) be i. We claim that I(z) and I(x) are identical in all bits to the left of bit i (recall that bits of a binary number are numbered from the right). If not, then let k > i be the left-most bit where I(z) and I(x) differ. Without loss of generality, assume that I(z) has bit 1 and I(x) has bit 0 in position k. Since k is the point of left-most difference, the bits to the left of position k are equal in the two numbers, implying that I(z) > I(x). Now z is an ancestor of x in T, so nodes I(z) and I(x) are both in the subtree of z in T. Furthermore, since I(z) and I(x) are depth-first numbers of nodes in the subtree of z in T, every number between I(x) and I(z) occurs as a depth-first number of some node in the subtree of z. In particular, let N be the number consisting of the bits to the left of position k in I(z) (or I(x)) followed by 1 followed by all 0s. (Figure 8.5 helps illustrate the situation, although z plays the role of u and x plays the role of w, and bit i in I(z) is unknown.) Then I(x) < N < I(z); therefore N is also a node in the subtree of z. But k > i, so h(N) > h(I(z)), contradicting the definition of I. It follows that I(z) and I(x) must be identical in the bits to the left of bit i. Now bit i is the right-most 1-bit in I(z), so the bits to the left of bit i describe the complete path in B to node I(z). Those identical bits to the left of bit i also form the initial part of the description of the path in B to node I(x), since I(x) has a 1-bit to the right of bit i. So those bits are in the path descriptions of both I(z) and I(x), meaning that the path to node I(x) in B must go through node I(z). Therefore, node I(z) is an ancestor of node I(x) in B, and the lemma is proved.
Having described the preprocessing of 7 and developed some of the properties of the tree map, we can now describe the way that lca queries are answered.
8.8. Answering an lca query in constant time

Let x and y be two nodes in T and let z be the lca of x and y in T. Suppose we know the height in B of the node that z is mapped to. That is, we know h(I(z)). Below we show, with only that limited information about z, how z can be found in constant time.
Definition The tree map is the mapping of nodes of T to nodes of a complete binary tree B with depth d = ⌈log n⌉ - 1. In particular, node v of T maps to node I(v) of B (recall that nodes of B are named by their path numbers).
The tree map is well defined because I(v) is a d + 1 bit number, and each node of B is named by a distinct d + 1 bit number. Every node in a run of T maps to the same node in B, but not all nodes in B generally have nodes in T mapping to them. Figure 8.7 shows tree B for tree T from Figure 8.6. A node v in B is numbered if there is a node in T that maps to v.
1. Do a depth-first traversal of T to assign depth-first search numbers to the nodes. During the traversal compute h(v) for each node v. For each node, set a pointer to its parent node in T.

2. Using the bottom-up algorithm described earlier, compute I(v) for each v. For each number k such that I(v) = k for some node v, set L(k) to point to the head (or leader) of the run containing node k. {Note that after this step, the head of the run containing an arbitrary node v can be retrieved in constant time: Compute I(v) and then look up L(I(v)).} {This can easily be done while computing the I values. Node v is identified as the head of its run if the I value of v's parent is not I(v).}

3. Let B be a complete binary tree with node-depth d = ⌈log n⌉ - 1. Map each node v in T to node I(v) in B. (This mapping will be useful because it preserves enough (although not all) of the ancestry relations from T.)
{The above three steps form the core of the preprocessing, but there is also one more technical step. For each node v in T, we want to encode some information about where in B the ancestors of v get mapped. That information is collected in the next step. Remember that h(I(q)) is the height in B of node I(q) and so it is the height in B of the node that q gets mapped to.}

4. For each node v in T, create an O(log n) bit number A_v. Bit A_v(i) is set to 1 if and only if node v has some ancestor in T that maps to height i in B, i.e., if and only if v has an ancestor u such that h(I(u)) = i.

End.

This ends the description of the preprocessing of T and the mapping of T to B. To test your understanding of A_v, verify that the number of bits set to 1 in A_v is the number of distinct runs encountered on the path from the root to v. Setting the A numbers is easy by a linear-time traversal of T after all the I values are known: If v' is the parent of v then A_v is obtained by first copying A_v' and then setting bit A_v(i) to 1 if h(I(v)) = i (this last step will be redundant if v and v' are on the same run, but it is always correct). As an example, consider node 3 (0011) in Figure 8.6. A_3 = 1101 (13) since 3 (0011) maps to height 1, 2 (0010) maps to height 3, and 1 (0001) maps to height 4 in B.
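In code, the A numbers can be set by one top-down pass. The sketch below assumes the node struct of the earlier sketch gains an unsigned field A; calling set_A(root, 0) then fills in every A_v in linear time.

    /* Top-down propagation of the A numbers: a node inherits its
       parent's A value and records the height its own run maps to.
       Heights are 1-based, so height i is stored at machine bit i-1. */
    void set_A(struct node *v, unsigned parentA)
    {
        v->A = parentA | (1u << (h(v->I) - 1));
        for (int c = 0; c < v->nchild; c++)
            set_A(v->child[c], v->A);
    }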
Theorem 8.8.2. Let j be the smallest position greater than or equal to i such that both A_x and A_y have 1-bits in position j. Then node I(z) is at height j in B, or in other words, h(I(z)) = j.
PROOF Suppose I(z) is at height k in B. We will show that k = j. Since z is an ancestor of both x and y, both A_x and A_y have a 1-bit in position k. Furthermore, since I(z) is an ancestor of both I(x) and I(y) in B (by Lemma 8.7.1), k ≥ i, and it follows (by the selection of j) that k ≥ j. This also establishes that a position j ≥ i exists where both A_x and A_y have 1-bits. A_x has a 1-bit in position j and j ≥ i, so x has an ancestor x' in T such that I(x') is an ancestor of I(x) in B and I(x') is at height j ≥ i, the height of b in B. It follows that I(x') is an ancestor of b. Similarly, there is an ancestor y' of y in T such that I(y') is at height j and is an ancestor of b in B. But if I(x') and I(y') are at the same height (j) and both are ancestors of the single node b, then it must be that I(x') = I(y'), meaning that x' and y' are on the same run. Being on the same run, either x' is an ancestor in T of y' or vice versa. Say, without loss of generality, that x' is an ancestor of y' in T. Then x' is a common ancestor of x and y, and x' is an ancestor of z in T. Hence x' must map to the same height or higher than z in B. That is, j ≥ k. But k ≥ j was already established, so k = j as claimed, and the theorem is proved.
All the pieces for lca retrieval in T have now been described, and each takes only constant time. In summary, the lowest common ancestor z of any two nodes x and y in T (assuming z is neither x nor y) can be found in constant time by the following method:

1. Find node b, the lowest common ancestor of I(x) and I(y) in the complete binary tree B, using the constant-time method of Section 8.4.

2. Set i = h(b), and find the smallest position j ≥ i such that both A_x and A_y have 1-bits in position j. By Theorem 8.8.2, j = h(I(z)).

3. Using j and A_x, find node x̄, the closest ancestor of x on the same run as z, as detailed in the proof of Theorem 8.8.1.

4. Similarly, using j and A_y, find node ȳ.

5. If x̄ < ȳ then set z to x̄; otherwise set z to ȳ.
Theorem 8.8.1. Let z denote the lca of x and y in T. If we know h(I(z)), then we can find z in T in constant time.
PROOF Consider the run containing z in T. The path up T from x to z enters that run at some node x̄ (possibly z) and then continues along that run until it reaches z. Similarly, the path up from y to z enters the run at some node ȳ and continues along that run until z. It follows that z is either x̄ or ȳ. In fact, z is the higher of those two nodes, and so by the numbering scheme, z = x̄ if and only if x̄ < ȳ. For example, in Figure 8.6 when x = 9 (1001) and y = 6 (0110), then x̄ = 8 (1000) and ȳ = z = 5 (0101).

Given the above discussion, the approach to finding z from h(I(z)) is to use h(I(z)) to find x̄ and ȳ, since those nodes determine z. We will explain how to find x̄. Let h(I(z)) = j, so the height in B of I(z) is j. By Lemma 8.7.1, node x (which is in the subtree of z in T) maps to a node I(x) in the subtree of node I(z) in B, so if h(I(x)) = j then x must be on the same run as z (i.e., x = x̄), and we are finished. Conversely, if x = x̄, then h(I(x)) must be j. So assume from here on that x ≠ x̄. Let w (which is possibly x) denote the node in T on the z-to-x path just below (off) the run containing z. Since x is not x̄, x is not on the same run as z, and w exists. From h(I(z)) (which is assumed to be known) and A_x (which was computed during the preprocessing), we will deduce h(I(w)) and then I(w), w, and x̄.

Since w is in the subtree of z in T and is not on the same run as z, w maps to a node in B with height strictly less than the height of I(z) (this follows from Lemma 8.7.1). In fact, by Lemma 8.7.1, among all nodes on the path from x to z that are not on z's run, w maps to a node of greatest height in B. Thus, h(I(w)) (which is the height in B that w maps to) must be the largest position less than j such that A_x has a 1-bit in that position. That is, we can find h(I(w)) (even though we don't know w) by finding the most significant 1-bit of A_x in a position less than j. This can be done in constant time on the assumed machine (starting with all bits set to 1, shift right by d - j + 1 positions, AND this number together with A_x, and then find the left-most 1-bit in the resulting number). Let h(I(w)) = k. We will now find I(w). Either w is x or w is a proper ancestor of x in T, so either I(w) = I(x) or node I(w) is a proper ancestor of node I(x) in B. Moreover, by the path-encoding nature of the path numbers in B, numbers I(x) and I(w) are identical in bits to the left of k, and I(w) has a 1 in bit k and all 0s to the right. So I(w) can be obtained from I(x) (which we know) and k (which we obtained as above from h(I(z)) and A_x). Moreover, I(w) can be found from I(x) and h(I(w)) using constant-time bit operations. Given I(w) we can find w because w = L(I(w)). That is, w was just off the z run, so it must be the head of the run that it is on, and each node in T points to the head of its run. From w we find its parent x̄ in constant time.
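The bit manipulations in this proof are easy to misread, so here is a minimal C sketch of the step that recovers x̄ from j = h(I(z)). The arrays L (run heads indexed by I value) and parent (indexed by dfs number), like the helper msb, are assumptions of this sketch standing in for the preprocessing results.

    /* msb(k): 1-based position of the left-most 1-bit of k. */
    static int msb(unsigned k) { return 32 - __builtin_clz(k); }

    /* Given x, I(x), A_x, and j = h(I(z)), return xbar, the closest
       ancestor of x on z's run. */
    int find_xbar(int x, unsigned Ix, unsigned Ax, int j,
                  const int *L, const int *parent)
    {
        if (h(Ix) == j)                              /* x is already on z's run */
            return x;
        unsigned below = Ax & ((1u << (j - 1)) - 1); /* A_x bits at positions < j */
        int k = msb(below);                          /* k = h(I(w)), w just off the run */
        /* I(w): bits of I(x) left of k, then a 1 in bit k, then zeros */
        unsigned Iw = ((Ix >> k) << k) | (1u << (k - 1));
        int w = L[Iw];                               /* w is the head of its run */
        return parent[w];                            /* xbar is w's parent */
    }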
In summary, assuming we know h(I(z)), we can find node x̄, which is the closest ancestor of x in T that is on the same run as z. Similarly, we find ȳ. Then z is either x̄ or ȳ; in fact, z is the node among those two with minimum depth-first number in T. Of course, we must now explain how to find j = h(I(z)).
8.11. Exercises
1. Using depth-first traversal, show how to construct the path numbers for the nodes of B in time proportional to n, the number of nodes in B. Be careful to observe the constraints of the RAM model.
2. Prove that the path numbers in
3. The lca algorithm for a complete binary tree was detailed in the case that lca(i,j) was neither i nor j. In the case that lca(i,j) is one of i or j, a very simple constant-time algorithm can determine lca(i,j). The idea is first to number the nodes of the binary tree B by a depth-first numbering, and to note for each node v the number of nodes in the subtree of v (including v). Let l(v) be the dfs number given to node v, and let s(v) be the number of nodes in the subtree of v. Then node i is an ancestor of node j if and only if l(i) ≤ l(j) and l(j) < l(i) + s(i).
Prove that this is correct, and fill in the details to show that the needed preprocessing can be done in O(n) time. Show that the method extends to any tree, not just complete binary trees.
4. In the special case of a complete binary tree B, there is an alternative way to handle the situation when lca(i,j) is i or j. Using h(i) and h(j) we can determine which of the nodes i and j is higher in the tree (say i) and how many edges are on the path from the root to node i. Then we take the XOR of the binary for i and for j and find the left-most 1-bit as before, say in position k (counting from the left). Node i is an ancestor of j if and only if k is larger than the number of edges on the path to node i. Fill in the details of this argument and prove it is correct.

5. Explain why in the lca algorithm for B, it was necessary to assume that lca(i,j) was neither i nor j. What would go wrong in that algorithm if the issue were ignored and that case was not checked explicitly?
6. Prove that the height of any node k in B is h(k).
7. Write a C program for both the preprocessing and the lca retrieval. Test the program on large trees and time the results.
8. Give an explicit O(n)-time RAM algorithm for building the table containing the right-most 1-bit in every log₂ n bit number. Remember that the entry for binary number i must be in the ith position in the table. Give details for building tables for AND, OR, and NOT for log n bit numbers in O(n) time.

9. It may be more reasonable to assume that the RAM can shift a word left and right in constant time than to assume that it can multiply and divide in constant time. Show how to solve the lca problem in constant time with linear preprocessing under those assumptions.
10. In the proof of Theorem 8.8.1 we showed how to deduce I(w) from h(I(w)) in constant time. Can we use the same technique to deduce I(z) from h(I(z))? If so, why doesn't the method do that rather than involving nodes w, x̄, and ȳ?
11. The constant-time lca algorithm is somewhat difficult to understand and the reader might wonder whether a simpler idea works. We know how to find the lca in constant time in a complete binary tree after O(n) preprocessing time. Now suppose we drop the assumption that the binary tree is complete. So T is now a binary tree, but not necessarily complete. Letting d again denote the depth of T, we can again compute d + 1 length path numbers that encode the paths to the nodes, and again these path numbers allow easy construction of the lowest common ancestor. Thus it might seem that even in incomplete binary trees, one can easily find the lca in this simple way without the need for the full lca algorithm. Either give the details for this or explain why it fails to find the lca in constant time.
used in steps 3, 4, or 5 to obtain z from h(I(z)). However, it is used in step 1 to find node b from I(x) and I(y). But all we really need from b is h(b) (step 2), and that can be gotten from the right-most common 1-bit of I(x) and I(y). So the mapping from T to B is only conceptual, merely used for purposes of exposition. In summary, after the preprocessing on T, when given nodes x and y, the algorithm finds i = h(b) (without first finding b) from the right-most common 1-bit in I(x) and I(y). Then it finds j = h(I(z)) from i and A_x and A_y, and from j it finds z = lca(x,y). Although the logic behind this method has been difficult to convey, a program for these operations is very easy to write.
Step 2 For an arbitrary internal node v in B, let B_v denote the subtree of B rooted at v, and let L_v = n_1, n_2, ..., n_t be an ordered list containing the elements of L written at the leaves of B_v, in the same left-to-right order as they appear in B. Create two lists, Pmin(v) and Smin(v), for each internal node v. Each list will have size equal to the number of leaves in v's subtree. The kth entry of list Pmin(v) is the smallest number among n_1, n_2, ..., n_k. That is, the kth entry of Pmin(v) is the smallest number in the prefix of list L_v ending at position k. Similarly, the kth entry of list Smin(v) is the smallest number in the suffix of L_v starting at position k. This is the end of the preprocessing, and exercises follow.
b. Prove that the total size of all the Pmin and Smin lists is O(m log m), and show how they can be constructed in that time bound.
After the O(m log m) preprocessing, the smallest number in any interval I can be found in constant time. Here's how. Let interval I in L have endpoints l and r, and recall that these correspond to leaves of B. To find the smallest number in I, first find lca(l,r), say node v. Let v' and v'' be the left and right children of v in B, respectively. The smallest number in I can be found using one lookup in list Smin(v'), one lookup in Pmin(v''), and one additional comparison.
c. Give complete details for how the smallest number in I is found, and fully explain why only constant time is used.

13. By refining the method developed in Exercise 12, the O(m log m) preprocessing bound (time and space) can be reduced to only O(m log log m) while still maintaining constant retrieval time for any lca query. (It takes a pretty big value of m before the difference between O(m) and O(m log log m) is appreciable!) The idea is to divide list L into blocks each of size log m and then separately preprocess each block as in Exercise 12. Also, compute the minimum number in each block, put these numbers in an ordered list Lmin, and preprocess Lmin as in Exercise 12.
a. Show that the above preprocessing takes O(m log log m) time and space.
Now we sketch how the retrieval is done in this faster method. Given an interval I with starting and ending positions l and r, one finds the smallest number in I as follows: If l and r are in the same block, then proceed as in Exercise 12. If they are in adjacent blocks, then find the minimum number from l to the end of l's block, find the minimum number from the start of r's block to r, and take the minimum of those two numbers. If l and r are in nonadjacent blocks, then do the above and also use Lmin to find the minimum number in all the blocks strictly between the block containing l and the block containing r. The smallest number in I is the minimum of those three numbers.
b. Give a detailed description of the retrieval method and justify that it takes only constant time.
14. Can the above improvement from O(m log m) preprocessing time to O(m log log m) preprocessing time be extended to reduce the preprocessing time to O(m log log log m)? Can the improvements be continued for an arbitrary number of logarithms?
12. A simpler (but slower) lca algorithm. In Section 8.4.1 we mentioned that if O(n log n) preprocessing time is allowed, and O(n log n) space can be allocated during both the preprocessing and the retrieval phases, then a (conceptually) simpler constant-time lca retrieval method is possible. In many applications, O(n log n) is an acceptable bound, which is not much worse than the O(n) bound we obtained in the text. Here we sketch the idea of the O(n log n) method. Your problem is to flesh out the details and prove correctness.
First we reduce the general lca problem to a problem of finding the smallest number in an interval of a fixed list of numbers.

The reduction of lca to a list problem

Step 1 Execute a depth-first traversal of tree T to label the nodes in depth-first order and to build a multilist L of the nodes in the order that they are visited. (For any node v other than the root, the number of times v is in L equals the degree of v.) The only property of the depth-first numbering we need is that the number given to any node is smaller than the number given to any of its proper descendants. From this point on, we refer to a node only by its dfs number. For example, the list for the tree in Figure 8.1 (page 182) is
Notice that if T has n nodes, then L has O(n) entries.

Step 2 The lca of any two nodes x and y can be obtained as follows: Find any occurrences of x and y in L; this defines an interval I in L between those occurrences of x and y. Then in L find the smallest number in interval I; that number is lca(x,y). For example, if x is 6 and y is 9, then one interval I that they define is {6,5,7,5,8,9}, implying that node 5 is lca(6,9). This is the end of the reduction. Now the first exercise.
a. Ignoring time complexity, prove that in general the lca of two nodes can be obtained as described in the two steps above.
Now we continue to describe the method. More exercises will follow. With the above reduction, each lca query becomes the problem of finding the smallest number in an interval I of a fixed list L of O(n) numbers. Let m denote the exact size of L. To be able to solve each lca query in constant time, we first do an O(m log m)-time preprocessing of list L. For convenience, assume that m is a power of 2.

Preprocessing of L

Step 1 Build a complete binary tree B with m leaves and number the leaves in left-to-right order (as given by an inorder traversal). Then for i from 1 to m, record the ith element of L at leaf i.
Figure 9.1: The longest common extension for pair (i, j) has length eight. The matching substring is abcdefgh.
With the ability to solve lowest common ancestor queries in constant time, suffix trees can be used to solve many additional string problems. Many of those applications move from the domain of exact matching to the domain of inexact, or approximate, matching (matching with some errors permitted). This chapter illustrates that point with several examples.
Longest common extension problem Two strings S1 and S2 of total length n are first specified in a preprocessing phase. Later, a long sequence of index pairs is specified. For each specified index pair (i, j), we must find the length of the longest substring of S1 starting at position i that matches a substring of S2 starting at position j. That is, we must find the length of the longest prefix of suffix i of S1 that matches a prefix of suffix j of S2 (see Figure 9.1).
Of course, any time an index pair is specified, the longest common extension can be found by direct search in time proportional to the length of the match. But the goal is to compute each extension in constant time, independent of the length of the match. Moreover, it would be cheating to allow more than linear time to preprocess S1 and S2. To appreciate the power of suffix trees combined with constant-time lca queries, the reader should again try first to devise a solution to the longest common extension problem without those two tools.
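The suffix-tree solution reduces each query to one lca retrieval: build a generalized suffix tree for S1 and S2, record the string-depth of every node, and then the extension length for (i, j) is the string-depth of the lca of leaf i of S1 and leaf j of S2. The sketch below assumes such an infrastructure is already in place; the names leaf1, leaf2, string_depth, and lca are placeholders for that assumed preprocessing, not a real library.

    /* Longest common extension query, assuming a generalized suffix
       tree for S1 and S2 preprocessed for constant-time lca.
       leaf1[i] is the leaf for suffix i of S1, leaf2[j] the leaf for
       suffix j of S2, and string_depth[] holds each node's string-depth. */
    int lce(int i, int j, const int *leaf1, const int *leaf2,
            const int *string_depth)
    {
        int v = lca(leaf1[i], leaf2[j]);   /* the constant-time retrieval */
        return string_depth[v];            /* length of the common prefix */
    }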
although in the biological literature the distinction between separated and nonseparated palindromes is sometimes blurred. The problem of finding all separated palindromes is really one of finding all inverted repeats (see Section 7.12) and hence is more complex than finding palindromes. However, if there is a fixed bound on the permitted distance of the separation, then all the separated palindromes can again be found in linear time. This is an immediate application of the longest common extension problem, the details of which are left to the reader. Another variant of the palindrome problem, called the k-mismatch palindrome problem, will be considered below, after we discuss matching with a fixed number of mismatches.
End.

The space needed by this method is O(n + m), since it uses a suffix tree for the two strings. However, as detailed in Theorem 9.1.1, only a suffix tree for P plus the matching statistics for T are needed (although we must still store the original strings). Since m > n we have
Theorem 9.3.1. The exact matching problem with k wild cards distributed in the two strings can be solved in O(km) time and O(m) space.
Palindrome problem: Given a string S of length n, the palindrome problem is to locate all maximal palindromes in S.
1. In linear time, create the reverse string Sr from S and preprocess the two strings so that any longest common extension query can be solved in constant time.

2. For each q from 1 to n - 1, solve the longest common extension query for the index pair (q + 1, n - q + 1) in S and Sr, respectively. If the extension has nonzero length k, then there is a maximal palindrome of radius k centered at q.
The method takes O(n) time since the suffix tree can be built and preprocessed in that time, and each of the O(n) extension queries is solved in constant time. In summary, we have
Theorem 9.2.1. All the maximal even-length palindromes in a string can be identified in linear time.
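The following C sketch makes the two steps concrete. For self-containedness it uses a direct character-by-character extension in place of the constant-time suffix-tree query, so as written it is not linear time; swapping in the suffix-tree lce restores the O(n) bound. Positions are 1-based to match the text.

    #include <stdlib.h>

    /* Naive longest common extension of A (from position i) and B
       (from position j), both of length n, 1-based positions; this
       stands in for the constant-time suffix-tree query. */
    static int lce_naive(const char *A, const char *B, int n, int i, int j)
    {
        int k = 0;
        while (i + k <= n && j + k <= n && A[i + k - 1] == B[j + k - 1])
            k++;
        return k;
    }

    /* Report every maximal even-length palindrome of S (length n) as a
       pair (center q, radius k): S[q-k+1 .. q+k] reads the same in
       both directions. */
    void maximal_palindromes(const char *S, int n,
                             void (*report)(int q, int k))
    {
        char *R = malloc(n);                    /* reverse of S */
        for (int i = 0; i < n; i++) R[i] = S[n - 1 - i];
        for (int q = 1; q <= n - 1; q++) {
            int k = lce_naive(S, R, n, q + 1, n - q + 1);
            if (k > 0) report(q, k);
        }
        free(R);
    }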
4. If count ≤ k, then increment count by one, set j to j + l + 1, set i' to i' + l + 1, and go to step 2. If count = k + 1, then a k-mismatch of P does not occur starting at i; stop.

End.

Note that the space required for this solution is just O(n + m), and that the method can be implemented using a suffix tree for the small string P alone. We should note a different practical approach to the k-mismatch problem, based on suffix trees, that is in use in biological database search [320]. The idea is to generate every string P' that can be derived from P by changing up to k characters of P, and then to search for P' in a suffix tree for T. Using a suffix tree, the search for P' takes time just proportional to the length of P' (and can be implemented to be extremely fast), so this approach can be a winner when k and the size of the alphabet are relatively small.
Definition A k-mismatch palindrome is a substring that becomes a palindrome after k or fewer characters are changed. For example, axabbcca is a 2-mismatch palindrome.
With this definition, a palindrome is just a 0-mismatch palindrome. It is now an easy exercise to detail an O(kn)-time method to find all k-mismatch palindromes in a string of length n. We leave that to the reader, and we move on to the more difficult problem of finding tandem repeats.
Definition A tandem repeat is a string of the form ββ, where β is a nonempty substring.
Each tandem repeat is specified by a starting position of the repeat and the length of the substring β. This definition does not require that β be of maximal length. For example, in the string xababababy there are a total of six tandem repeats. Two of these begin at position two: abab and abababab. In the first case, β is ab, and in the second case, β is abab. Using longest common extension queries, it is immediate that all tandem repeats can be found in O(n²) time - just guess a start position i and a middle position j for the tandem and do a longest common extension query from i and j. If the extension from i reaches j or beyond, then there is a tandem repeat of length 2(j - i) starting at position i. There are O(n²) choices for i and j, yielding the O(n²) time bound.
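A minimal C sketch of this quadratic enumeration, reusing the lce_naive helper from the palindrome sketch (here comparing S against itself); as before, the suffix-tree query would make each test constant time, but the number of (i, j) pairs keeps the bound at O(n²).

    /* Report all tandem repeats of S (1-based positions): for each
       start i and middle j, a repeat of length 2(j - i) exists exactly
       when the common extension of positions i and j has length at
       least j - i. */
    void all_tandem_repeats(const char *S, int n,
                            void (*report)(int start, int len))
    {
        for (int i = 1; i <= n; i++)
            for (int j = i + 1; j <= n; j++)
                if (lce_naive(S, S, n, i, j) >= j - i)
                    report(i, 2 * (j - i));
    }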
Definition A k-mismatch tandem repeat is a substring that becomes a tandem repeat after k or fewer characters are changed. For example, axabaybb is a 2-mismatch tandem repeat.
Again, all k-mismatch tandem repeats can be found in O(kn²) time, and the details are left to the reader. Below we will present a method that solves this problem in O(kn log(n/k)) time. To summarize, what we have so far is
Theorem 9.5.1. All the tandem repeats in S in which the two copies differ by at most k mismatches can be found in O(kn²) time. Typically, k is a fixed number, and the time bound is reported as O(n²).
Definition Given a pattern P, a text T, and a fixed number k that is independent of the lengths of P and T, a k-mismatch of P is a |P|-length substring of T that matches at least |P| - k characters of P. That is, it matches P with at most k mismatches.
Note that the definition of a k-mismatch does not allow any insertions or deletions of characters, just matches and mismatches. Later, in Section 12.2, we will discuss bounded error problems that also allow insertions and deletions of characters. The k-mismatch problem is to find all k-mismatches of P in T. For example, if P = bend, T = abentbananaend, and k = 2, then T contains three k-mismatches of P: P matches substring bent with one mismatch, substring bana with two mismatches, and substring aend with one mismatch. Applications in molecular biology for the k-mismatch problem, along with the more general k-differences problem, will be discussed in Section 12.2.2. The k-mismatch problem is a special case of the match-count problem considered in Section 4.3, and the approaches discussed there apply. But because k is a fixed number unrelated to the lengths of P and T, faster solutions have been obtained. In particular, Landau and Vishkin [287] and Myers [341] were the first to show an O(km)-time solution, where P and T have lengths n and m > n, respectively. The value of k can never be more than n, but the motivation for the O(km) result comes from applications where k is expected to be very small compared to n.
k-mismatch check
Begin
1. Set j to 1, set i' to i, and set count to 0.
2. Compute the length l of the longest common extension starting at positions j of P and i' of T.

3. If j + l = n + 1, then a k-mismatch of P occurs in T starting at i (in fact, only count mismatches occur); stop.
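Together with step 4 of the check (which appears just before the k-mismatch palindrome definition above), these steps examine position i with at most k + 1 extension queries. A C sketch follows, again substituting a naive extension for the constant-time one; the guard is written so that at most k mismatches are ever consumed before a report, and a valid start position i ≤ m - n + 1 is assumed.

    /* Naive common extension of P[j..] (length n) and T[ip..] (length m);
       stands in for the constant-time query. 1-based positions. */
    static int lce_pt(const char *P, int n, const char *T, int m,
                      int j, int ip)
    {
        int l = 0;
        while (j + l <= n && ip + l <= m && P[j + l - 1] == T[ip + l - 1])
            l++;
        return l;
    }

    /* k-mismatch check at text position i: does P occur at i with at
       most k mismatches? Uses at most k + 1 extension queries. */
    int k_mismatch_at(const char *P, int n, const char *T, int m,
                      int i, int k)
    {
        int j = 1, ip = i, count = 0;            /* step 1 */
        for (;;) {
            int l = lce_pt(P, n, T, m, j, ip);   /* step 2 */
            if (j + l == n + 1)                  /* step 3: all of P matched */
                return 1;                        /* only count mismatches occurred */
            if (count == k)                      /* budget exhausted; stop */
                return 0;
            count++;                             /* consume the mismatch */
            j  += l + 1;
            ip += l + 1;
        }
    }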
Figure 9.2: Any position between A and B inclusive is a starting point of a tandem repeat of length 2l. As detailed in step 4, if l1 and l2 are both at least one, then a subinterval of these starting points specifies tandem repeats whose first copy spans h.
3. Compute the longest common extension in the reverse direction from positions h - 1 and q - 1. Let l2 denote the length of that extension.

4. There is a tandem repeat of length 2l whose first copy spans position h if and only if l1 + l2 ≥ l and both l1 and l2 are at least one. Moreover, if there is such a tandem repeat of length 2l, then it can begin at any position from Max(h - l2, h - l + 1) to Min(h + l1 - l, h) inclusive. The second copy of the repeat begins l places to the right. Output each of these starting positions along with the length 2l. (See Figure 9.2.)

End.
To solve an instance of subproblem 3 (finding all tandem repeats whose first copy spans position h), just run the above algorithm for each l from 1 to h.
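A C sketch of the length test and the enumeration for subproblem 3, under the same substitution of naive extensions for constant-time ones; lce_fwd matches left-to-right from two positions and lce_bwd right-to-left, both hypothetical helpers shown here in naive form.

    /* Naive forward extension: longest common prefix of S[i..] and S[j..],
       1-based positions, i < j. */
    static int lce_fwd(const char *S, int n, int i, int j)
    {
        int t = 0;
        while (j + t <= n && S[i + t - 1] == S[j + t - 1]) t++;
        return t;
    }

    /* Naive backward extension: longest common suffix of S[..i] and S[..j],
       1-based positions, i < j. */
    static int lce_bwd(const char *S, int i, int j)
    {
        int t = 0;
        while (i - t >= 1 && S[i - t - 1] == S[j - t - 1]) t++;
        return t;
    }

    /* Subproblem 3: report all tandem repeats of length 2l whose first
       copy spans position h, for every feasible l. */
    void spanning_tandems(const char *S, int n, int h,
                          void (*report)(int start, int len))
    {
        for (int l = 1; l <= h && h + l <= n; l++) {
            int q  = h + l;                           /* step 1 */
            int l1 = lce_fwd(S, n, h, q);             /* step 2 */
            int l2 = (h >= 2) ? lce_bwd(S, h - 1, q - 1) : 0;  /* step 3 */
            if (l1 >= 1 && l2 >= 1 && l1 + l2 >= l) { /* step 4 */
                int lo = h - l2 > h - l + 1 ? h - l2 : h - l + 1;
                int hi = h + l1 - l < h ? h + l1 - l : h;
                for (int s = lo; s <= hi; s++)
                    report(s, 2 * l);
            }
        }
    }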
Lemma 9.6.1. The above method correctly solves subproblem 3 for a fixed h. That is, it finds all tandem repeats whose first copy spans position h. Further, for fixed h, its running time is O(n/2) + z_h, where z_h is the number of such tandem repeats.
PROOF
Assume first that there is a tandem repeat whose first copy spans position h, and it has some length, say 2l. That means that position q = h + l in the second copy corresponds to position h in the first copy. Hence some substring starting at h must match a substring starting at q, in order to provide the suffix of each copy. This substring can have length at most l1. Similarly, there must be a substring ending at h - 1 that matches a substring ending at q - 1, providing the prefix of each copy. That substring can have length at most l2. Since all characters between h and q are contained in one of the two copies, l1 + l2 must be at least l. Conversely, by essentially the same reasoning, if l1 + l2 ≥ l and both l1 and l2 are at least one, then one can specify a tandem repeat of length 2l whose first copy spans h. The necessary and sufficient condition for the existence of such a tandem is therefore proved. The converse proof that all starting positions fall in the stated range involves similar reasoning and is left to the reader.

For the time analysis, note first that for a fixed choice of h, the method takes constant time per choice of l to execute the common extension queries, and so it takes O(n/2) time for all those queries. For any fixed l, the method takes constant time per tandem that it reports, and it never reports the same tandem twice since it reports a different starting point for each repeat of length 2l. Since each repeat is reported as a starting point and a length, it follows that over all choices of l, the algorithm never reports any tandem repeat twice. Hence the time spent to report tandem repeats is proportional to z_h, the number of tandem repeats whose first copy spans position h.
Theorem 9.6.1. Every tandem repeat in S is found by the execution of subproblems 1 through 4 and is reported exactly once. The time for the algorithm is O(n log n + z), where z is the total number of tandem repeats in S.
1. Find all tandem repeats contained entirely in the first half of S (up to position h).

2. Find all tandem repeats contained entirely in the second half of S (after position h).

3. Find all tandem repeats where the first copy spans (contains) position h of S.

4. Find all tandem repeats where the second copy spans position h of S.
Clearly, no tandem repeat will be found in more than one of these four subproblems. The first two subproblems are solved by recursively applying the Landau-Schmidt solution. The second two problems are symmetric to each other, so we consider only the third subproblem. An algorithm for that subproblem therefore determines the algorithm for finding all tandem repeats.
Algorithm for problem 3

We want to find all the tandem repeats where the first copy spans (but does not necessarily begin at) position h. The idea of the algorithm is this: For any fixed number l, one can test in constant time whether there is a tandem repeat of length exactly 2l such that the first copy spans position h. Applying this test for all feasible values of l means that in O(n) time we can find all the lengths of tandem repeats whose first copy spans position h. Moreover, for each such length we can enumerate all the starting points of these tandem repeats, in time proportional to the number of them. Here is how to test a number l.
Begin

1. Let q = h + l.

2. Compute the longest common extension (in the forward direction) from positions h and q. Let l1 denote the length of that extension.
and tandem repeat problems to allow for string complementation and bounded-distance separation between copies.
We show below that all the correction factors for all internal nodes can be computed in O(n) total time. That then gives an O(n)-time solution to the k-common substring problem.
PROOF
That all tandem repeats are found is immediate from the fact that every tandem is of a form considered by one of the subproblems 1 through 4. To show that no tandem repeat is reported twice, recall that for h = n/2, no tandem is of the form considered by more than one of the four subproblems. This holds recursively for subproblems 1 and 2. Further, in the proof of Lemma 9.6.1 we established that no execution of subproblem 3 (and also 4) reports the same tandem twice. Hence, over the entire execution of the four subproblems, no tandem repeat is reported twice. It also follows that the total time used to output the tandem repeats is O(z). To finish the analysis, we consider the time taken by the extension queries. This time is proportional to the number of extension queries executed. Let T(n) denote the number of extension queries executed for a string of length n. Then T(n) = 2T(n/2) + 2n, and T(n) = O(n log n) as claimed.
Theorem 9.6.2. All k-mismatch tandem repeats in a string of length n can be found in O(kn log n + z) time.
The bound can be sharpened to O(kn log(n/k) + z) by the observation that any l < k need not be tested in subproblems 3 and 4. We leave the details as an exercise. We also leave it to the reader to adapt the solutions for the k-mismatch palindrome
5. For each identifier i, compute the lca of each consecutive pair of leaves in L_i, and increment h(w) by one each time that w is the computed lca.

6. With a bottom-up traversal of T, compute, for each node v, S(v) and U(v) = Σ_i [n_i(v) - 1] = Σ[h(w) : w is in the subtree of v].

7. Set C(v) = S(v) - U(v) for each node v.

8. Accumulate the table of l(k) values as detailed in Section 7.6.
End.
Theorem 9.7.1. Let S be a set of K strings of total length n, and let l(k) denote the length of the longest substring that appears in at least k distinct strings of S. A table of all l(k) values, for k from 2 to K, can be built in O(n) time.
That so much information about the substrings of S can be obtained in time proportional to the time needed just to read the strings is very impressive. It would be a good challenge to try to obtain this result without the use of suffix trees (or a similar data structure).
9.8. Exercises
1. Prove Theorem 9.1.1.

2. Fill in all the details and prove the correctness of the space-efficient method solving the longest common extension problem.

3. Give the details for finding all odd-length maximal palindromes in a string in linear time.

4. Show how to solve all the palindrome problems in linear time using just a suffix tree for the string S rather than for both S and Sr.

5. Give the details for searching for complemented palindromes in a linear string.
Figure 9.3: The boxed leaves have identifier i . The circled internal nodes are the lowest common ancestors of the four adjacent pairs of leaves from list Li.
If x and y are any two leaves in L_i(v), then the lca of x and y is a node in the subtree of v. So if we compute the lca for each consecutive pair of leaves in L_i(v), then all of the n_i(v) - 1 computed lcas will be found in the subtree of v. Further, if x and y are not both in the subtree of v, then the lca of x and y will not be a node in v's subtree. This leads to the following lemma and method.
Lemma 9.7.2. If we compute the lca for each consecutive pair of leaves in L_i, then for any node v, exactly n_i(v) - 1 of the computed lcas will lie in the subtree of v.
Lemma 9.7.2 is illustrated in Figure 9.3. Given the lemma, we can compute n_i(v) - 1 for each node v as follows: Compute the lca of each consecutive pair of leaves in L_i, and accumulate for each node w a count of the number of times that w is the computed lca. Let h(w) denote that count for node w. Then for any node v, n_i(v) - 1 is exactly Σ[h(w) : w is in the subtree of v]. A standard O(n)-time bottom-up traversal of T can therefore be used to find n_i(v) - 1 for each node v. To find U(v), we don't want n_i(v) - 1 but rather Σ_i [n_i(v) - 1]. However, the algorithm must not do a separate bottom-up traversal for each identifier, since then the time bound would be O(Kn). Instead, the algorithm should defer the bottom-up traversal until each list L_i has been processed, and it should let h(w) count the total number of times that w is the computed lca over all of the lists. Only then is a single bottom-up traversal of T done. At that point, U(v) = Σ_i [n_i(v) - 1] = Σ[h(w) : w is in the subtree of v]. We can now summarize the entire O(n) method for solving the k-common substring problem.
PART III
6. Recall that a plasmid is a circular DNA molecule common in bacteria (and elsewhere). Some bacterial plasmids contain relatively long complemented palindromes (whose function is somewhat in question). Give a linear-time algorithm to find all maximal complemented palindromes in a circular string.
7. Show how to find all the k-mismatch palindromes in a string of length n in O(kn) time.

8. Tandem repeats. In the recursive method discussed in Section 9.6 (page 202) for finding the tandem repeats (no mismatches), problem 3 is solved with a linear number of constant-time common extension queries, exploiting suffix trees and lowest common ancestor computations. An earlier, equally efficient, solution to problem 3 was developed by Main and Lorenz [307], without using suffix trees.
The idea is that the problem can be solved in an amortized linear-time bound without suffix trees. In an instance of problem 3, h is held fixed while q = h + l - 1 varies over all appropriate values of l. Each forward common extension query is a problem of finding the length of the longest substring beginning at position q that matches a prefix of S[h..n]. All those lengths must be found in linear time. But that objective can be achieved by computing Z values (again from Chapter 1) for the appropriate substring of S. Flesh out the details of this approach and prove the linear amortized time bound. Now show how the backward common extensions can also be solved in linear time by computing Z values on the appropriately constructed substring of S. This substring is a bit less direct than the one used for forward extensions.
9. Complete the details for the O(kn log n + z)-time algorithm for the k-mismatch tandem repeat problem. Consider both correctness and time.
11. Try to modify the Main and Lorenz method for finding all the tandem repeats (without errors) to solve the k-mismatch tandem repeat problem in O(kn log n + z) time. If you are not successful, explain what the difficulties are and how the use of suffix trees and common ancestors solves these problems.
12. The tandem repeat method detailed in Section 9.6 finds all tandem repeats even if they are not maximal. For example, it finds six tandem repeats in the string xababababy, even though the left-most tandem repeat abab is contained in the longer tandem repeat abababab. Depending on the application, that output may not be desirable. Give a definition of maximality that would reduce the size of the output and try to give efficient algorithms for the different definitions.
13. Consider the following situation: A long string S is given and remains fixed. Then a sequence of shorter strings S1, S2, ..., Sk is given. After each string Si is given (but before S_{i+1} is known), a number of longest common extension queries will be asked about Si and S. Let r denote the total number of queries and n denote the total length of all the short strings. How can these on-line queries be answered efficiently? The most direct approach is to build a generalized suffix tree for both S and Si when Si is presented, preprocess it (do a depth-first traversal assigning dfs numbers, setting I() values, etc.) for the constant-time lca algorithm, and then answer the queries for Si. But that would take O(k|S| + n + r) time. The k|S| term comes from two sources: the time to build the k generalized suffix trees and the time to preprocess each of them for lca queries.
Reduce that k|S| term from both sources to |S|, obtaining an overall bound of O(|S| + n + r). Reducing the time for building all the generalized suffix trees is easy. Reducing the time for the lca preprocessing takes a bit more thought.
At this point we shift from the general area of exact matching and exact pattern discovery to the general area of inexact, approximate matching and sequence alignment. "Approximate" means that some errors, of various types detailed later, are acceptable in valid matches. "Alignment" will be given a precise meaning later, but generally means lining up characters of strings, allowing mismatches as well as matches, and allowing characters of one string to be placed opposite spaces made in opposing strings. We also shift from problems primarily concerning substrings to problems concerning subsequences. A subsequence differs from a substring in that the characters in a substring must be contiguous, whereas the characters in a subsequence embedded in a string need not be.¹ For example, the string xyz is a subsequence, but not a substring, in axayaz. The shift from substrings to subsequences is a natural corollary of the shift from exact to inexact matching. This shift of focus to inexact matching and subsequence comparison is accompanied by a shift in technique. Most of the methods we will discuss in Part III, and many of the methods in Part IV, rely on the tool of dynamic programming, a tool that was not needed in Parts I and II.
Much of computational biology concerns sequence alignments

The area of approximate matching and sequence comparison is central in computational molecular biology both because of the presence of errors in molecular data and because of active mutational processes that (sub)sequence comparison methods seek to model and reveal. This will be elaborated in the next chapter and illustrated throughout the book. On the technical side, sequence alignment has become the central tool for sequence comparison in molecular biology. Henikoff and Henikoff [222] write:
Among the most useful computer-based tools in modern biology are those that involve sequence alignments of proteins, since these alignments often provide important insights into gene and protein function. There are several different types of alignments: global alignments of pairs of proteins related by common ancestry throughout their lengths, local alignments involving related segments of proteins, multiple alignments of members of protein families, and alignments made during data base searches to detect homologies.

This statement provides a framework for much of Part III. We will examine in detail the four types of alignments (and several variants) mentioned above. We will also show how those different alignment models address different kinds of problems in biology. We begin, in Chapter 10, with a more detailed statement of why sequence comparison has become central to current molecular biology. But we won't forget the role of exact matching.
¹ It is a common and confusing practice in the biological literature to refer to a substring as a subsequence. But techniques and results for substring problems can be very different from techniques and results for the analogous subsequence problems, so it is important to maintain a clear distinction. In this book we will never use the term "subsequence" when "substring" is intended.
And fruit flies aren't special. The following is from a book review on DNA repair [424]:
Throughout the present work we see the insights gained through our ability to look for sequence homologies by comparison of the DNA of different species. Studies on yeast are remarkable predictors of the human system!
So "redundancy", and "similarity" are central phenomena in biology. But similarity has its limits - humans and flies do differ in some respects. These differences make conserved similarities even more significant, which in turn makes comparison and analogy very powerful tools in biology. Lesk [297] writes:
It is characteristic of biological systems that objects that we observe to have a certain form arose by evolution from related objects with similar but not identical form. They must, therefore, be robust, in that they retain the freedom to tolerate some variation. We can take advantage of this robustness in our analysis: By identifying and comparing related objects, we can distinguish variable and conserved features, and thereby determine what is crucial to structure and function.
The important "related objects" to compare include much more than sequence data, because biological universality occurs at many levels of detail. However, it is usually easier to acquire and examine sequences than it is to examine fine details of genetics or cellular biochemistry or morphology. For example, there are vastly more protein sequences known (deduced from underlying DNA sequences) than there are known three-dimensional protein structures. And it isn't just a matter of convenience that makes sequences important. Rather, the biological sequences encode and reflect the more complex common molecular structures and mechanisms that appear as features at the cellular or biochemical levels. Moreover, "nowhere in the biological world is the Darwinian notion of 'descent with modification' more apparent than in the sequences of genes and gene products" [130]. Hence a tractable, though partly heuristic, way to search for functional or structural universality in biological systems is to search for similarity and conservation at the sequence level. The power of this approach is made clear in the following quotes:
Today, the most powerful method for inferring the biological function of a gene (or the protein that it encodes) is by sequence similarity searching on protein and DNA sequence databases. With the development of rapid methods for sequence comparison, both with heuristic algorithms and powerful parallel computers, discoveries based solely on sequence homology have become routine. [360]

Determining function for a sequence is a matter of tremendous complexity, requiring biological experiments of the highest order of creativity. Nevertheless, with only DNA sequence it is possible to execute a computer-based algorithm comparing the sequence to a database of previously characterized genes. In about 50% of the cases, such a mechanical comparison will indicate a sufficient degree of similarity to suggest a putative enzymatic or structural function that might be possessed by the unknown gene. [91]
Thus large-scale sequence comparison, usually organized as database search, is a very powerful tool for biological inference in modern molecular biology. And that tool is almost universally used by molecular biologists. It is now standard practice, whenever a new gene is cloned and sequenced, to translate its DNA sequence into an amino acid sequence and then search for similarities between it and members of the protein databases. No one today would even think of publishing the sequence of a newly cloned gene without doing such database searches.
Sequence comparison, particularly when combined with the systematic collection, curation, and search of databases containing biomolecular sequences, has become essential in modern molecular biology. Commenting on the (then) near-completion of the effort to sequence the entire yeast genome (now finished), Stephen Oliver says

In a short time it will be hard to realize how we managed without the sequence data. Biology will never be the same again. [478]

One fact explains the importance of molecular sequence data and sequence comparison in biology.

The first fact of biological sequence analysis

The first fact of biological sequence analysis: In biomolecular sequences (DNA, RNA, or amino acid sequences), high sequence similarity usually implies significant functional or structural similarity.

Evolution reuses, builds on, duplicates, and modifies "successful" structures (proteins, exons, DNA regulatory sequences, morphological features, enzymatic pathways, etc.). Life is based on a repertoire of structured and interrelated molecular building blocks that are shared and passed around. The same and related molecular structures and mechanisms show up repeatedly in the genome of a single species and across a very wide spectrum of divergent species. "Duplication with modification" [127, 128, 129, 130] is the central paradigm of protein evolution, wherein new proteins and/or new biological functions are fashioned from earlier ones. Doolittle emphasizes this point as follows:

The vast majority of extant proteins are the result of a continuous series of genetic duplications and subsequent modifications. As a result, redundancy is a built-in characteristic of protein sequences, and we should not be surprised that so many new sequences resemble already known sequences. [129]

He adds that
. . . all of biology is based on an enormous redundancy . . . [130]
The following quotes reinforce this view and suggest the utility of the "enormous redundancy" in the practice of molecular biology. The first quote is from Eric Wieschaus, cowinner of the 1995 Nobel prize in medicine for work on the genetics of Drosophila development. The quote is taken from an Associated Press article of October 9, 1995. Describing the work done years earlier, Wieschaus says

We didn't know it at the time, but we found out everything in life is so similar, that the same genes that work in flies are the ones that work in humans.
11.1. Introduction
In this chapter we consider the inexact matching and alignment problems that form the core of the field of inexact matching and others that illustrate the most general techniques. Some of those problems and techniques will be further refined and extended in the next chapters. We start with a detailed examination of the most classic inexact matching problem solved by dynamic programming, the edit distance problem. The motivation for inexact matching (and, more generally, sequence comparison) in molecular biology will be a recurring theme explored throughout the rest of the book. We will discuss many specific examples of how string comparison and inexact matching are used in current molecular biology. However, to begin, we concentrate on the purely formal and technical aspects of defining and computing inexact matching.
In general, given the two input strings S1 and S2, and given an edit transcript for S1 and S2, the transformation is accomplished by successively applying the specified operation in the transcript to the next character(s) in the appropriate string(s). In particular, let next1 and
The final quote reflects the potential total impact on biology of the first fact and its exploitation in the form of sequence database searching. It is from an article [179] by Walter Gilbert, Nobel prize winner for the coinvention of a practical DNA sequencing method. Gilbert writes:
The new paradigm, now emerging, is that all the 'genes' will be known (in the sense of being resident in databases available electronically), and that the starting point of biological investigation will be theoretical. An individual scientist will begin with a theoretical conjecture, only then turning to experiment to follow or test that hypothesis.
Already, hundreds (if not thousands) of journal publications appear each year that report biological research where sequence comparison and/or database search is an integral part of the work. Many such examples that support and illustrate the first fact are distributed throughout the book. In particular, several in-depth examples are concentrated in Chapters 14 and 15, where multiple string comparison and database search are discussed. But before discussing those examples, we must first develop, in the next several chapters, the techniques used for approximate matching and (sub)sequence comparison.
Caveat The first fact of biological sequence analysis is extremely powerful, and its importance will be further illustrated throughout the book. However, there is not a one-to-one correspondence between sequence and structure or sequence and function, because the converse of the first fact is not true. That is, high sequence similarity usually implies significant structural or functional similarity (the first fact), but structural or functional similarity does not necessarily imply sequence similarity. On the topic of protein structure, F. Cohen [106] writes ". . . similar sequences yield similar structures, but quite distinct sequences can produce remarkably similar structures". This converse issue is discussed in greater depth in Chapter 14, which focuses on multiple sequence comparison.
Another example of an alignment is shown on page 215 where vintner and writers are aligned with each other below their edit transcript. That example also suggests a duality between alignment and edit transcript that will be developed below.
That is, D(i, j) denotes the minimum number of edit operations needed to transform the first i characters of S1 into the first j characters of S2. Using this notation, if S1 has n letters and S2 has m letters, then the edit distance of S1 and S2 is precisely the value D(n, m). We will compute D(n, m) by solving the more general problem of computing D(i, j) for all combinations of i and j, where i ranges from zero to n and j ranges from zero to m. This is the standard dynamic programming approach used in a vast number of computational problems. The dynamic programming approach has three essential components: the recurrence relation, the tabular computation, and the traceback. We will explain each one in turn.
next2 be pointers into S1 and S2. Both pointers begin with value one. The edit transcript is read and applied left to right. When symbol "I" is encountered, character next2 of S2 is inserted before character next1 in S1, and pointer next2 is incremented by one character. When "D" is encountered, character next1 is deleted from S1 and next1 is incremented by one character. When either symbol "R" or "M" is encountered, character next1 in S1 is replaced or matched by character next2 from S2, and then both pointers are incremented by one.
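To make this transcript-application procedure concrete, here is a minimal Python sketch of it (the function name, the 0-based indexing, and the representation of a transcript as a string over I, D, R, and M are our own illustrative choices, not notation from the text):

    def apply_transcript(transcript, s1, s2):
        """Transform s1 into s2 by applying an edit transcript
        (a string over 'I', 'D', 'R', 'M') left to right."""
        result = list(s1)
        next1 = 0  # pointer into the evolving first string (0-based)
        next2 = 0  # pointer into s2 (0-based)
        for op in transcript:
            if op == 'I':
                # insert character next2 of s2 before position next1
                result.insert(next1, s2[next2])
                next1 += 1
                next2 += 1
            elif op == 'D':
                # delete the character at position next1
                del result[next1]
            else:  # 'R' or 'M': replace or match, advance both pointers
                result[next1] = s2[next2]
                next1 += 1
                next2 += 1
        return ''.join(result)

    # One optimal transcript for the vintner/writers example:
    assert apply_transcript('RIMDMDMMI', 'vintner', 'writers') == 'writers'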
Definition The edit distance between two strings is defined as the minimum number of edit operations (insertions, deletions, and substitutions) needed to transform the first string into the second. For emphasis, note that matches are not counted.
Edit distance is sometimes referred to as Levenshtein distance in recognition of the paper [299] by V. Levenshtein where edit distance was probably first discussed. We will sometimes refer to an edit transcript that uses the minimum number of edit operations as an optimal transcript. Note that there may be more than one optimal transcript. These will be called "cooptimal" transcripts when we want to emphasize the fact that there is more than one optimal transcript. The edit distance problem is to compute the edit distance between two given strings, along with an optimal edit transcript that describes the transformation. The definition of edit distance implies that all operations are done to one string only. But edit distance is sometimes thought of as the minimum number of operations done on either of the two strings to transform both of them into a common third string. This view is equivalent to the above definition, since an insertion in one string can be viewed as a deletion in the other and vice versa.
Definition A (global) alignment of two strings S1 and S2 is obtained by first inserting chosen spaces (or dashes), either into or at the ends of S1 and S2, and then placing the two resulting strings one above the other so that every character or space in either string is opposite a unique character or a unique space in the other string.
The term "global" emphasizes the fact that for each string, the entire string is involved in the alignment. This will be contrasted with local alignment to be discussed later. Notice that our use of the word "alignment" is now much more precise than its use in Parts 1 and IT. There, alignment was used in the colloquial sense to indicate how one string is placed relative to the other, and spaces were not then allowed in either string. As an example of a global alignment, consider the alignment of the strings qacdbd and qawxb shown below:
q a c - d b d
q a w x - b -
In this alignment, character c is mismatched with w, both of the d's and the x are opposite spaces, and all other characters match their counterparts in the opposite string.
Since the last transcript symbol must either be I, D, R, or M, we have covered all cases and established the lemma. □

Now we look at the other side.
Lemma 11.3.2. D(i, j) ≤ min[D(i - 1, j) + 1, D(i, j - 1) + 1, D(i - 1, j - 1) + t(i, j)].
PROOF
The reasoning is very similar to that used in the previous lemma, but it achieves a somewhat different goal. The objective here is to demonstrate constructively the existence of transformations achieving each of the three values specified in the inequality. Then, since all three values are feasible, their minimum is certainly feasible. First, it is possible to transform S1[1..i] into S2[1..j] with exactly D(i, j - 1) + 1 edit operations. Simply transform S1[1..i] to S2[1..j - 1] with the minimum number of edit operations, and then use one more to insert character S2(j) at the end. By definition, the number of edit operations in that particular way to transform S1 to S2 is exactly D(i, j - 1) + 1. Second, it is possible to transform S1[1..i] to S2[1..j] with exactly D(i - 1, j) + 1 edit operations. Transform S1[1..i - 1] to S2[1..j] with the fewest operations, and then delete character S1(i). The number of edit operations in that particular transformation is exactly D(i - 1, j) + 1. Third, it is possible to do the transformation with exactly D(i - 1, j - 1) + t(i, j) edit operations, using the same argument. □
Lemmas 11.3.1 and 11.3.2 immediately imply the correctness of the general recurrence relation for D(i, j).
Theorem 11.3.1. When both i and j are strictly positive, D(i, j) = min[D(i - 1, j) + 1, D(i, j - 1) + 1, D(i - 1, j - 1) + t(i, j)].
PROOF
Lemma 11.3.1 says that D(i, j) must be equal to one of the three values D(i - 1, j) + 1, D(i, j - 1) + 1, or D(i - 1, j - 1) + t(i, j). Lemma 11.3.2 says that D(i, j) must be less than or equal to the smallest of those three values. It follows that D(i, j) must therefore be equal to the smallest of those three values, and we have proven the theorem. □
This completes the first component of the dynamic programming method for edit distance, the recurrence relation.
The base conditions are D(i, 0) = i and D(0, j) = j. The base condition D(i, 0) = i is clearly correct (that is, it gives the number required by the definition of D(i, 0)) because the only way to transform the first i characters of S1 to zero characters of S2 is to delete all the i characters of S1. Similarly, the condition D(0, j) = j is correct because j characters must be inserted to convert zero characters of S1 to j characters of S2. The recurrence relation for D(i, j) when both i and j are strictly positive is

D(i, j) = min[D(i - 1, j) + 1, D(i, j - 1) + 1, D(i - 1, j - 1) + t(i, j)],
where t(i, j) is defined to have value 1 if S1(i) ≠ S2(j), and t(i, j) has value 0 if S1(i) = S2(j).
Lemma 11.3.1. The value of D(i, j) must be D(i, j - 1) + 1, D(i - 1, j) + 1, or D(i - 1, j - 1) + t(i, j). There are no other possibilities.
PROOF
Consider an edit transcript for the transformation of S1[1..i] to S2[1..j] using the minimum number of edit operations, and focus on the last symbol in that transcript. That last symbol must either be I, D, R, or M. If the last symbol is an I, then the last edit operation is the insertion of character S2(j) onto the end of the (transformed) first string. It follows that the symbols in the transcript before that I must specify the minimum number of edit operations to transform S1[1..i] to S2[1..j - 1] (if they didn't, then the specified transformation of S1[1..i] to S2[1..j] would use more than the minimum number of operations). By definition, that latter transformation takes D(i, j - 1) edit operations. Hence if the last symbol in the transcript is I, then D(i, j) = D(i, j - 1) + 1. Similarly, if the last symbol in the transcript is a D, then the last edit operation is the deletion of S1(i), and the symbols in the transcript to the left of that D must specify the minimum number of edit operations to transform S1[1..i - 1] to S2[1..j]. By definition, that latter transformation takes D(i - 1, j) edit operations. So if the last symbol in the transcript is D, then D(i, j) = D(i - 1, j) + 1. If the last symbol in the transcript is an R, then the last edit operation replaces S1(i) with S2(j), and the symbols to the left of R specify the minimum number of edit operations to transform S1[1..i - 1] to S2[1..j - 1]. In that case D(i, j) = D(i - 1, j - 1) + 1. Finally, and by similar reasoning, if the last symbol in the transcript is an M, then S1(i) = S2(j) and D(i, j) = D(i - 1, j - 1). Using the variable t(i, j) introduced earlier [i.e., that t(i, j) = 0 if S1(i) = S2(j); otherwise t(i, j) = 1] we can combine these last two cases as one: If the last transcript symbol is R or M, then D(i, j) = D(i - 1, j - 1) + t(i, j).
Figure 11.2: Edit distances are filled in one row at a time, and in each row they are filled in from left to right. The example shows the edit distances D(i, j) through column 3 of row 4. The next value to be computed is D(4, 4), where an asterisk appears. The value for cell (4, 4) is 3, since S1(4) = S2(4) = t and D(3, 3) = 3.
The reader should be able to establish that the table could also be filled in columnwise instead of rowwise, after row zero and column zero have been computed. That is, column one could be filled in first, followed by column two, etc. Similarly, it is possible to fill in the table by filling in successive anti-diagonals. We leave the details as an exercise.
Figure 11.1: Table to be used to compute the edit distance between vintner and writers. The values in row zero and column zero are already included. They are given directly by the base conditions.
Bottom-up computation

In the bottom-up approach, we first compute D(i, j) for the smallest possible values for i and j, and then compute values of D(i, j) for increasing values of i and j. Typically, this bottom-up computation is organized with a dynamic programming table of size (n + 1) × (m + 1). The table holds the values of D(i, j) for all the choices of i and j (see Figure 11.1). Note that string S1 corresponds to the vertical axis of the table, while string S2 corresponds to the horizontal axis. Because the ranges of i and j begin at zero, the table has a zero row and a zero column. The values in row zero and column zero are filled in directly from the base conditions for D(i, j). After that, the remaining n × m subtable is filled in one row at a time, in order of increasing i. Within each row, the cells are filled in order of increasing j. To see how to fill in the subtable, note that by the general recurrence relation for D(i, j), all the values needed for the computation of D(1, 1) are known once D(0, 0), D(1, 0), and D(0, 1) have been computed. Hence D(1, 1) can be computed after the zero row and zero column have been filled in. Then, again by the recurrence relations, after D(1, 1) has been computed, all the values needed for the computation of D(1, 2) are known. Following this idea, we see that the values for row one can be computed in order of increasing index j. After that, all the values needed to compute the values in row two are known, and that row can be filled in, in order of increasing j. By extension, the entire table can be filled in one row at a time, in order of increasing i, and in each row the values can be computed in order of increasing j (see Figure 11.2).
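As an illustration of the tabular computation, the following Python sketch (function and variable names are ours) fills in the table row by row exactly as described and returns D(n, m):

    def edit_distance(s1, s2):
        """Fill the (n+1) x (m+1) table for D(i, j) one row at a time,
        left to right within each row, and return D(n, m)."""
        n, m = len(s1), len(s2)
        D = [[0] * (m + 1) for _ in range(n + 1)]
        for i in range(n + 1):      # base condition D(i, 0) = i
            D[i][0] = i
        for j in range(m + 1):      # base condition D(0, j) = j
            D[0][j] = j
        for i in range(1, n + 1):
            for j in range(1, m + 1):
                t = 0 if s1[i - 1] == s2[j - 1] else 1   # t(i, j)
                D[i][j] = min(D[i - 1][j] + 1,           # delete S1(i)
                              D[i][j - 1] + 1,           # insert S2(j)
                              D[i - 1][j - 1] + t)       # replace or match
        return D[n][m]

    print(edit_distance('vintner', 'writers'))   # prints 5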
Time analysis

How much work is done by this approach? When computing the value for a specific cell (i, j), only cells (i - 1, j - 1), (i, j - 1), and (i - 1, j) are examined, along with the two characters S1(i) and S2(j). Hence, to fill in one cell takes a constant number of cell examinations, arithmetic operations, and comparisons. There are O(nm) cells in the table, so we obtain the following theorem.

Theorem 11.3.2. The dynamic programming table for computing the edit distance between a string of length n and a string of length m can be filled in with O(nm) work. Hence, using dynamic programming, the edit distance D(n, m) can be computed in O(nm) time.
Figure 11.4: Edit graph for the strings CAN and ANN. The weight on each edge is one, except for the three zero-weight edges marked in the figure.
Conversely, any optimal edit transcript is specified by such a path. Moreover, since a path describes only one transcript, the correspondence between paths and optimal transcripts is one-to-one. The theorem can be proven by essentially the same reasoning that established the correctness of the recurrence relations for D(i, j), and this is left to the reader. An alternative way to find the optimal edit transcript(s), without using pointers, is discussed in Exercise 9. Once the pointers have been established, all the cooptimal edit transcripts can be enumerated in O(n + m) time per transcript. That is the focus of Exercise 12.
In the case of the edit distance problem, the edit graph contains a directed edge from each node (i, j) to each of the nodes (i, j + 1), (i + 1, j), and (i + 1, j + 1), provided those nodes exist. The weight on the first two of these edges is one; the weight on the third (diagonal) edge is t(i + 1, j + 1). Figure 11.4 shows the edit graph for strings CAN and ANN. The central property of an edit graph is that any shortest path (one whose total weight is minimum) from start node (0, 0) to destination node (n, m) specifies an edit transcript with the minimum number of edit operations. Equivalently, any shortest path specifies a global alignment of minimum total weight. Moreover, the following theorem and corollary can be stated.
Theorem 11.4.1. An edit transcript for S1, S2 has the minimum number of edit operations if and only if it corresponds to a shortest path from (0, 0) to (n, m) in the edit graph.

Corollary 11.4.1. The set of all shortest paths from (0, 0) to (n, m) in the edit graph
exactly specifies the set of all optimal edit transcripts of S1 to S2. Equivalently, it specifies all the optimal (minimum weight) alignments of S1 and S2.

Viewing dynamic programming as a shortest path problem is often useful because there
Figure 11.3: The complete dynamic programming table with pointers included. The arrow ← in cell (i, j) points to cell (i, j - 1), the arrow ↑ points to cell (i - 1, j), and the arrow ↖ points to cell (i - 1, j - 1).
If there is more than one pointer from cell (n, m), then a path from (n, m) to (0, 0) can start with either of those pointers. Each of them is on a path from (n, m) to (0, 0). This property is repeated from any cell encountered. Hence a traceback path from (n, m) to (0, 0) can start simply by following any pointer out of (n, m); it can then be extended by following any pointer out of any cell encountered. Moreover, every cell except (0, 0) has a pointer out of it, so no path from (n, m) can get stuck. Since any path of pointers from (n, m) to (0, 0) specifies an optimal edit transcript or alignment, we have the following:
Theorem 11.3.3. Once the dynamic programming table with pointers has been computed, an optimal edit transcript can be found in O(n + m) time.
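The traceback can be sketched in Python as follows (names ours). Rather than storing explicit pointer symbols, this version re-derives a valid pointer out of each cell from the table values, which encode the same information; it then emits the transcript symbols in reverse order:

    def optimal_transcript(s1, s2):
        """Fill the edit distance table, then walk from (n, m) back to
        (0, 0), choosing any one valid pointer at each step."""
        n, m = len(s1), len(s2)
        D = [[0] * (m + 1) for _ in range(n + 1)]
        for i in range(n + 1):
            D[i][0] = i
        for j in range(m + 1):
            D[0][j] = j
        for i in range(1, n + 1):
            for j in range(1, m + 1):
                t = 0 if s1[i - 1] == s2[j - 1] else 1
                D[i][j] = min(D[i - 1][j] + 1, D[i][j - 1] + 1,
                              D[i - 1][j - 1] + t)
        ops = []
        i, j = n, m
        while i > 0 or j > 0:
            if i > 0 and j > 0 and \
               D[i][j] == D[i - 1][j - 1] + (0 if s1[i - 1] == s2[j - 1] else 1):
                ops.append('M' if s1[i - 1] == s2[j - 1] else 'R')
                i, j = i - 1, j - 1
            elif i > 0 and D[i][j] == D[i - 1][j] + 1:
                ops.append('D')
                i -= 1
            else:
                ops.append('I')
                j -= 1
        return ''.join(reversed(ops))   # the walk itself takes O(n + m) time

    print(optimal_transcript('vintner', 'writers'))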
We have now completely described the three crucial components of the general dynamic programming paradigm, as illustrated by the edit distance problem. We will later consider ways to increase the speed of the solution and decrease its needed space.
Theorem 11.3.4. Any path from (n, m) to (0, 0) following pointers established during the computation of D(i, j) specifies an edit transcript with the minimum number of edit operations.
The operation-weight edit distance problem can also be represented and solved as a shortest path problem on a weighted edit graph, where the edge weights correspond in the natural way to the weights of the edit operations. The details are straightforward and are thus left to the reader.
In a pure computer science or mathematical discussion of alphabet-weight edit distance, we would prefer to use the general term "weight matrix" for the matrix holding the alphabet-dependent substitution scores. However, molecular biologists use the terms "amino acid substitution matrix" or "nucleotide substitution matrix" for those matrices, and they use the term "weight matrix" for a very different object (see Section 14.3.1). Therefore, to maintain generality, and yet to keep in some harmony with the molecular biology literature, we will use the general term "scoring matrix".
are many tools for investigating and compactly representing shortest paths in graphs. This view will be exploited in Section 13.2 when suboptimal solutions are discussed.
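As a small illustration of the shortest-path view, the following Python sketch (names ours) builds the edit graph implicitly and runs Dijkstra's algorithm from (0, 0) to (n, m); the distance returned equals D(n, m):

    import heapq

    def edit_graph_distance(s1, s2):
        """Shortest path in the edit graph: from node (i, j), a horizontal
        edge (insertion) and a vertical edge (deletion) have weight one,
        and the diagonal edge has weight t(i+1, j+1)."""
        n, m = len(s1), len(s2)
        dist = {(0, 0): 0}
        heap = [(0, 0, 0)]
        while heap:
            d, i, j = heapq.heappop(heap)
            if (i, j) == (n, m):
                return d
            if d > dist.get((i, j), float('inf')):
                continue   # stale heap entry
            steps = []
            if j < m:
                steps.append((i, j + 1, 1))
            if i < n:
                steps.append((i + 1, j, 1))
            if i < n and j < m:
                steps.append((i + 1, j + 1, 0 if s1[i] == s2[j] else 1))
            for ni, nj, w in steps:
                if d + w < dist.get((ni, nj), float('inf')):
                    dist[(ni, nj)] = d + w
                    heapq.heappush(heap, (d + w, ni, nj))

    print(edit_graph_distance('CAN', 'ANN'))   # 2, as in Figure 11.4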
Definition With arbitrary operation weights, the operation-weight edit distance problem is to find an edit transcript that transforms string S1 into S2 with the minimum total operation weight.
In these terms, the edit distance problem we have considered so far is just the problem of finding the minimum operation-weight edit transcript when d = 1, r = 1, and e = 0. But, for example, if each mismatch has a weight of 2, each space has a weight of 4, and each match a weight of 1, then the alignment
v i n t n e r -
w r i t - e r s
has a total weight of 17 and is an optimal alignment. Because the objective function is to minimize total weight and because a substitution can be achieved by a deletion followed by an insertion, if substitutions are to be allowed then a substitution weight should be less than the sum of the weights for a deletion plus an insertion.
The terms "weight" or "cost" are heavily used in the computer science literature, while the term "score" is used in the biological literature. We will use these terms more or less interchangeably in discussing algorithms, but the term "score" will be used when talking about specific biological applications,
Definition V(i, j) is defined as the value of the optimal alignment of prefixes S1[1..i] and S2[1..j].
Recall that a dash ("-") is used to represent a space inserted into a string. The base conditions are

V(i, 0) = ∑_{1 ≤ k ≤ i} s(S1(k), -)

and

V(0, j) = ∑_{1 ≤ k ≤ j} s(-, S2(k)).

When i and j are both strictly positive, the general recurrence is

V(i, j) = max[V(i - 1, j - 1) + s(S1(i), S2(j)), V(i - 1, j) + s(S1(i), -), V(i, j - 1) + s(-, S2(j))].
The correctness of this recurrence is established by arguments similar to those used for edit distance. In particular, in any alignment A, there are three possibilities: characters S1(i) and S2(j) are in the same position (opposite each other), S1(i) is in a position after S2(j), or S1(i) is in a position before S2(j). The correctness of the recurrence is based on that case analysis. Details are left to the reader. If S1 and S2 are of length n and m, respectively, then the value of their optimal alignment is given by V(n, m). That value, and the entire dynamic programming table, can be obtained in O(nm) time, since only three comparisons and arithmetic operations are needed per cell. By leaving pointers while filling in the table, as was done with edit distance, an optimal alignment can be constructed by following any path of pointers from cell (n, m) to cell (0, 0). So the optimal (global) alignment problem can be solved in O(nm) time, the same time as for edit distance.
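A Python sketch of this computation (names ours), parameterized by a caller-supplied scoring function s(x, y) in which '-' stands for a space:

    def alignment_value(s1, s2, s):
        """Compute V(n, m), the optimal global alignment value of s1
        and s2 under the pairwise scoring function s."""
        n, m = len(s1), len(s2)
        V = [[0] * (m + 1) for _ in range(n + 1)]
        for i in range(1, n + 1):            # V(i, 0): S1 prefix vs spaces
            V[i][0] = V[i - 1][0] + s(s1[i - 1], '-')
        for j in range(1, m + 1):            # V(0, j): spaces vs S2 prefix
            V[0][j] = V[0][j - 1] + s('-', s2[j - 1])
        for i in range(1, n + 1):
            for j in range(1, m + 1):
                V[i][j] = max(V[i - 1][j - 1] + s(s1[i - 1], s2[j - 1]),
                              V[i - 1][j] + s(s1[i - 1], '-'),
                              V[i][j - 1] + s('-', s2[j - 1]))
        return V[n][m]

    # Example scoring (ours): +1 per match, -1 per mismatch or space.
    print(alignment_value('qacdbd', 'qawxb', lambda x, y: 1 if x == y else -1))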
Definition In a string S, a subsequence is defined as a subset of the characters of S arranged in their original "relative" order. More formally, a subsequence of a string S of length n is specified by a list of indices i1 < i2 < i3 < . . . < ik, for some k ≤ n. The subsequence specified by this list of indices is the string S(i1)S(i2)S(i3) . . . S(ik).
To emphasize again, a subsequence need not consist of contiguous characters in S, whereas the characters of a substring must be contiguous. Of course, a substring satisfies the definition for a subsequence. For example, "its" is a subsequence of "winters" but not a substring, whereas "inter" is both a substring and a subsequence.
Definition Given two strings S1 and S2, a common subsequence is a subsequence that appears both in S1 and S2. The longest common subsequence problem is to find a longest common subsequence (lcs) of S1 and S2.
The distinction between subsequence and substring is often lost in the biological literature. But algorithms for substrings are usually quite different in spirit and efficiency than algorithms for subsequences, so the distinction is an important one.
similarity, the language of alignment is usually more convenient than the language of edit transcript. We now begin to develop a precise definition of similarity.
Definition Let Σ be the alphabet used for strings S1 and S2, and let Σ' be Σ with the added character "-" denoting a space. Then, for any two characters x, y in Σ', s(x, y) denotes the value (or score) obtained by aligning character x against character y.

Definition For a given alignment A of S1 and S2, let S1' and S2' denote the strings after the chosen insertion of spaces, and let l denote the (equal) length of the two strings S1' and S2' in A. The value of alignment A is defined as ∑_{i=1}^{l} s(S1'(i), S2'(i)).
That is, every position i in A specifies a pair of opposing characters in the alphabet Σ', and the value of A is obtained by summing the value contributed by each pair. For example, let Σ = {a, b, c, d} and let the pairwise scores be defined in the following matrix:
has a total value of 0 + 1 - 2 + 0 + 3 + 3 - 1 = 4. In string similarity problems, scoring matrices usually set s(x, y) to be greater than or equal to zero if characters x, y of Σ' match and less than zero if they mismatch. With such a scoring scheme, one seeks an alignment with as large a value as possible. That alignment will emphasize matches (or similarities) between the two strings while penalizing mismatches or inserted spaces. Of course, the meaningfulness of the resulting alignment may depend heavily on the scoring scheme used and how match scores compare to mismatch and space scores. Numerous character-pair scoring matrices have been suggested for proteins and for DNA [81, 122, 127, 222, 252, 400], and no single scheme is right for all applications. We will return to this issue in Sections 13.1, 15.7, and 15.10.
Definition Given a pairwise scoring matrix over the alphabet Σ', the similarity of two strings S1 and S2 is defined as the value of the alignment A of S1 and S2 that maximizes total alignment value. This is also called the optimal alignment value of S1 and S2.
String similarity is clearly related to alphabet-weight edit distance, and depending on the specific scoring matrix involved, one can often transform one problem into the other. An important difference between similarity and weighted edit distance will become clear in Section 11.7, after we discuss local alignment.
One example where end spaces should be free is in shotgun sequence assembly (see Sections 16.14 and 16.15). In this problem, one has a large set of partially overlapping substrings that come from many copies of one original but unknown string; the problem is to use comparisons of pairs of substrings to infer the correct original string. Two random substrings from the set are unlikely to be neighbors in the original string, and this is reflected by a low end-space free alignment score for those two substrings. But if two substrings do overlap in the original string, then a "good-sized" suffix of one should align to a "good-sized" prefix of the other with only a small number of spaces and mismatches (reflecting a small percentage of sequencing errors). This overlap is detected by an end-space free weighted alignment with high score. Similarly, the case when one substring contains another can be detected in this way. The procedure for deducing candidate neighbor pairs is thus to compute the end-space free alignment between every pair of substrings; those pairs with high scores are then the best candidates. We will return to shotgun sequencing and extend this discussion in Part IV, Section 16.14. To implement free end spaces in computing similarity, use the recurrences for global alignment (where all spaces count) detailed on page 227, but change the base conditions to V(i, 0) = V(0, j) = 0, for every i and j. That takes care of any spaces on the left end of the alignment. Then fill in the table as in the case of global alignment. However, unlike global alignment, the value of the optimal alignment is not necessarily found in cell (n, m). Rather, the value of the optimal alignment with free ends is the maximum value over all cells in row n or column m. Cells in row n correspond to alignments where the last character of string S1 contributes to the value of the alignment, but characters of S2 to its right do not. Those characters are opposite end spaces, which are free. Cells in column m have a similar characterization. Clearly, optimal alignment with free end spaces is solved in O(nm) time, the same time as for global alignment.
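A minimal Python variant of the previous global-alignment sketch with free end spaces (names and the example scoring are ours): the base conditions become zero, and the answer is the best value in the last row or last column:

    def end_space_free_value(s1, s2, s):
        """Optimal alignment value when end spaces are free."""
        n, m = len(s1), len(s2)
        V = [[0] * (m + 1) for _ in range(n + 1)]   # V(i, 0) = V(0, j) = 0
        for i in range(1, n + 1):
            for j in range(1, m + 1):
                V[i][j] = max(V[i - 1][j - 1] + s(s1[i - 1], s2[j - 1]),
                              V[i - 1][j] + s(s1[i - 1], '-'),
                              V[i][j - 1] + s('-', s2[j - 1]))
        return max(max(V[n]), max(row[m] for row in V))

    s = lambda x, y: 2 if x == y else -1
    print(end_space_free_value('xxxabc', 'abcyyy', s))  # 6: suffix abc overlaps prefix abc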
The problem of determining if there is an approximate occurrence of P in T is an important and natural generalization of the exact matching problem. It can be solved as follows: Use the same recurrences (given on page 227) as for global alignment between P and T and change only the base condition for V(0, j ) to V(0, j) = 0 for all j. Then fill in the table (leaving the standard backpointers). Using this variant of global alignment, the following theorem can be proved.
Theorem 11.6.2. There is an approximate occurrence of P in T ending at position j of T if and only if V(n, j) ≥ δ. Moreover, T[k..j] is an approximate occurrence of P in T if and only if V(n, j) ≥ δ and there is a path of backpointers from cell (n, j) to cell (0, k).
Clearly, the table can be filled in using O(nm) time, but if all approximate occurrences of P in T are to be explicitly output, then O(nm) time may not be sufficient. A sensible compromise is to identify every position j in T such that V(n, j) ≥ δ, and then for each such j, explicitly output only the shortest approximate occurrence of P that ends at position j. That substring T' is found by traversing the backpointers from (n, j) until a
The lcs problem is important in its own right, and we will discuss some of its uses and some ideas for improving its computation in Section 12.5. For now we show that it can be modeled and solved as an optimal alignment problem.
Theorem 11.6.1. With a scoring scheme that scores a one for each match and a zero for each mismatch or space, the matched characters in an alignment of maximum value form a longest common subsequence.
The proof is immediate and is left to the reader. It follows that the longest common subsequence of strings of lengths n and m, respectively, can be computed in O(nm) time. At this point we see the first of many differences between substring and subsequence problems and why it is important to clearly distinguish between them. In Section 7.4 we established that the longest common substring could be found in O(n + m) time, whereas here the bound established for finding the longest common subsequence is O(n × m) (although this bound can be reduced somewhat). This is typical: substring and subsequence problems are generally solved by different methods and have different time and space complexities.
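In the spirit of Theorem 11.6.1, a short Python sketch (names ours) computing the length of a longest common subsequence via the alignment recurrence with match score one and mismatch/space score zero:

    def lcs_length(s1, s2):
        """Length of a longest common subsequence of s1 and s2."""
        n, m = len(s1), len(s2)
        V = [[0] * (m + 1) for _ in range(n + 1)]
        for i in range(1, n + 1):
            for j in range(1, m + 1):
                match = 1 if s1[i - 1] == s2[j - 1] else 0
                V[i][j] = max(V[i - 1][j - 1] + match,
                              V[i - 1][j], V[i][j - 1])
        return V[n][m]

    print(lcs_length('winters', 'printers'))   # 6, e.g. "inters"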
the two spaces at the left end of the alignment are free, as is the single space at the right end. Making end spaces free in the objective function encourages one string to align in the interior of the other, or the suffix of one string to align with a prefix of the other. This is desirable when one believes that those kinds of alignments reflect the "true" relationship of the two strings. Without a mechanism to encourage such alignments, the optimal alignment might have quite a different shape and not capture the desired relationship.
strings may be related. When comparing protein sequences, local alignment is also critical because proteins from very different families are often made up of the same structural or functional subunits (motifs or domains), and local alignment is appropriate in searching for these (unknown) subunits. Similarly, different proteins are often made from related motifs that form the inner core of the protein, but the motifs are separated by outside surface looping regions that can be quite different in different proteins. A very interesting example of conserved domains comes from the proteins encoded by homeobox genes. Homeobox genes [319, 381] show up in a wide variety of species, from fruit flies to frogs to humans. These genes regulate high-level embryonic development, and a single mutation in these genes can transform one body part into another (one of the original mutation experiments causes fruit fly antennae to develop as legs, but it doesn't seem to bother the fly very much). The protein sequences that these genes encode are very different in each species, except in one region called the homeodomain. The homeodomain consists of about sixty amino acids that form the part of the regulatory protein that binds to DNA. Oddly, homeodomains made by certain insect and mammalian genes are particularly similar, showing about 50 to 95% identity in alignments without spaces. Protein-to-DNA binding is central in how those proteins regulate embryo development and cell differentiation. So the amino acid sequence in the most biologically critical part of those proteins is highly conserved, whereas the other parts of the protein sequences show very little similarity. In cases such as these, local alignment is certainly a more appropriate way to compare protein sequences than is global alignment. Local alignment in protein is additionally important because particular isolated characters of related proteins may be more highly conserved than the rest of the protein (for example, the amino acids at the active site of an enzyme or the amino acids in the hydrophobic core of a globular protein are the most highly conserved). Local alignment will more likely detect these conserved characters than will global alignment. A good example is the family of serine proteases, where a few isolated, conserved amino acids characterize the family. Another example comes from the Helix-Turn-Helix motif, which occurs frequently in proteins that regulate DNA transcription by binding to DNA. The tenth position of the Helix-Turn-Helix motif is very frequently occupied by the amino acid glycine, but the rest of the motif is more variable. The following quote from C. Chothia [101] further emphasizes the biological importance of protein domains and hence of local string comparison.
Extant proteins have been produced from the original set not just by point mutations, insertions and deletions but also by combinations of genes to give chimeric proteins. This is particularly true of the very large proteins produced in the recent stages of evolution. Many of these are built of different combinations of protein domains that have been selected from a relatively small repertoire.
Doolittle [129] summarizes the point: "The underlying message is that one must be alert to regions of similarity even when they occur embedded in an overall background of dissimilarity." Thus, the dominant viewpoint today is that local alignment is the most appropriate type of alignment for comparing proteins from different protein families. However, it has also been pointed out [359, 360] that one often sees extensive global similarity in pairs of protein strings that are first recognized as being related by strong local similarity. There are also suggestions [316] that in some situations global alignment is more effective than local alignment in exposing important biological commonalities.
cell in row zero is reached, breaking ties by choosing a vertical pointer over a diagonal one and a diagonal one over a horizontal one.
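A Python sketch of this approximate-matching variant (names ours): global-alignment recurrences between P and T, base condition V(0, j) = 0 so an occurrence may start anywhere in T, and a scan of row n against the threshold δ:

    def approximate_occurrence_ends(P, T, s, delta):
        """Return the positions j in T (1-based) at which an approximate
        occurrence of P ends, i.e., where V(n, j) >= delta."""
        n, m = len(P), len(T)
        V = [[0] * (m + 1) for _ in range(n + 1)]
        for i in range(1, n + 1):          # spaces against P still count
            V[i][0] = V[i - 1][0] + s(P[i - 1], '-')
        for i in range(1, n + 1):
            for j in range(1, m + 1):
                V[i][j] = max(V[i - 1][j - 1] + s(P[i - 1], T[j - 1]),
                              V[i - 1][j] + s(P[i - 1], '-'),
                              V[i][j - 1] + s('-', T[j - 1]))
        return [j for j in range(1, m + 1) if V[n][j] >= delta]

    s = lambda x, y: 1 if x == y else -1
    print(approximate_occurrence_ends('abc', 'zzabxabczz', s, 2))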
Local alignment problem Given two strings S1 and S2, find substrings α and β of S1 and S2, respectively, whose similarity (optimal global alignment value) is maximum over all pairs of substrings from S1 and S2. We use v* to denote the value of an optimal solution to the local alignment problem.
For example, consider the strings S1 = pqraxabcsrvq and S2 = xyaxbacsll. If we give each match a value of 2, each mismatch a value of -2, and each space a value of -1, then the two substrings α = axabcs and β = axbacs of S1 and S2, respectively, have the following optimal (global) alignment
a x a b - c s
a x - b a c s
which has a value of 8. Furthermore, over all choices of pairs of substrings, one from each of the two strings, those two substrings have maximum similarity (for the chosen scoring scheme). Hence, for that scoring scheme, the optimal local alignment of S1 and S2 has value 8 and is defined by substrings axabcs and axbacs. It should be clear why local alignment is defined in terms of similarity, which maximizes an objective function, rather than in terms of edit distance, which minimizes an objective. When one seeks a pair of substrings to minimize distance, the optimal pairs would be exactly matching substrings under most natural scoring schemes. But the matching substrings might be just a single character long and would not identify a region of high similarity. A formulation such as local alignment, where matches contribute positively and mismatches and spaces contribute negatively, is more likely to find more meaningful regions of high similarity.
Why local alignment?

Global alignment of protein sequences is often meaningful when the two strings are members of the same protein family. For example, the protein cytochrome c has almost the same length in most organisms that produce it, and one expects to see a relationship between two cytochromes from any two different species over the entire length of the two strings. The same is true of proteins in the globin family, such as myoglobin and hemoglobin. In these cases, global alignment is meaningful. When trying to deduce evolutionary history by examining protein sequence similarities and differences, one usually compares proteins in the same sequence family, and so global alignment is typically meaningful and effective in those applications. However, in many biological applications, local similarity (local alignment) is far more meaningful than global similarity (global alignment). This is particularly true when long stretches of anonymous DNA are compared, since only some internal sections of those
Theorem 11.7.2. If i', j' is an index pair maximizing v(i, j) over all i, j pairs, then a pair of substrings solving the local suffix alignment problem for i', j' also solves the local alignment problem.
Thus a solution to the local suffix alignment problem solves the local alignment problem. We now turn our attention to the problem of finding max[v(i, j) : i ≤ n, j ≤ m] and a pair of strings whose alignment has maximum value.
Theorem 11.7.3. For i > 0 and j > 0, the proper recurrence for v(i, j) is

v(i, j) = max[0, v(i - 1, j - 1) + s(S1(i), S2(j)), v(i - 1, j) + s(S1(i), -), v(i, j - 1) + s(-, S2(j))].
PROOF
The argument is similar to the justifications of previous recurrence relations. Let α and β be the substrings of S1 and S2 whose global alignment establishes the optimal local alignment. Since α and β are permitted to be empty suffixes of S1[1..i] and S2[1..j], it is correct to include 0 as a candidate value for v(i, j). However, if the optimal α is not empty, then character S1(i) must either be aligned with a space or with character S2(j). Similarly, if the optimal β is not empty, then S2(j) is aligned with a space or with S1(i). So we justify the recurrence based on the way characters S1(i) and S2(j) may be aligned in the optimal local suffix alignment for i, j. If S1(i) is aligned with S2(j) in the optimal local i, j suffix alignment, then those two characters contribute s(S1(i), S2(j)) to v(i, j), and the remainder of v(i, j) is determined by the local suffix alignment for indices i - 1, j - 1. That local suffix alignment must be optimal and so has value v(i - 1, j - 1). Therefore, if S1(i) and S2(j) are aligned with each other, v(i, j) = v(i - 1, j - 1) + s(S1(i), S2(j)). If S1(i) is aligned with a space, then by similar reasoning v(i, j) = v(i - 1, j) + s(S1(i), -), and if S2(j) is aligned with a space then v(i, j) = v(i, j - 1) + s(-, S2(j)). Since all cases are exhausted, we have proven that v(i, j) must either be zero or be equal to one of the three other terms in the recurrence. On the other hand, for each of the four terms in the recurrence, there is a way to choose suffixes of S1[1..i] and S2[1..j] so that an alignment of those two suffixes has the value given by the associated term. Hence the optimal suffix alignment value is at least the maximum of the four terms in the recurrence. Having proved that v(i, j) must be one of the four terms, and that it must be greater than or equal to the maximum of the four terms, it follows that v(i, j) must be equal to the maximum, which proves the theorem. □
The recurrences for local suffix alignment are almost identical to those for global alignment. The only difference is the inclusion of zero in the case of local suffix alignment. This makes intuitive sense. In both global alignment and local suffix alignment of prefixes S1[1..i] and S2[1..j], the end characters of any alignment are specified, but in the case of local suffix alignment, any number of initial characters can be ignored. The zero in the recurrence implements this, acting to "restart" the recurrence. Given Theorem 11.7.2, the method to compute v* is to compute the dynamic programming table for v(i, j) and then find the largest value in any cell in the table, say in cell (i*, j*). As usual, pointers are created while filling in the values of the table. After cell
For example, suppose the objective function counts 2 for each match and -1 for each mismatch or space. If S1 = abcxdex and S2 = xxxcde, then v(3, 4) = 2 (the two c's match), v(4, 5) = 1 (cx aligns with cd), v(5, 5) = 3 (x-d aligns with xcd), and v(6, 6) = 5 (x-de aligns with xcde). Since the definition allows either or both of the suffixes to be empty, v(i, j) is always greater than or equal to zero. The following theorem shows the relationship between the local alignment problem and the local suffix alignment problem. Recall that v* is the value of the optimal local alignment for two strings of length n and m.
Theorem 11.7.1 only specifies the value v*, but its proof makes clear how to find substrings whose alignments have that value. In particular,
11.8. Gaps
Figure 11.5: An alignment with seven spaces distributed into four gaps.
with similarity (global alignment value) of v(i, j). Thus, an easy way to look for a set of highly similar substrings is to find a set of cells in the table with a value above some set threshold. Not all similar substrings will be identified in this way, but this approach is common in practice.
A gap may begin before the start of S, in which case it is bordered on the right by the first character of S, or it may begin after the end of S, in which case it is bordered on the left by the last character of S. Otherwise, a gap must be bordered on both sides by characters of S. A gap may be as small as a single space. As an example of gaps, consider the alignment in Figure 11.5, which has four gaps containing a total of seven spaces. That alignment would be described as having five matches, one mismatch, four gaps, and seven spaces. Notice that the last space in the first string is followed by a space in the second string, but those two spaces are in two gaps and do not form a single gap. By including a term in the objective function that reflects the gaps in the alignment, one has some influence on the distribution of spaces in an alignment and hence on the overall shape of the alignment. In the simplest objective function that includes gaps,
Sometimes in the biology literature the term "space" (as we use it) is not used. Rather, the term "gap" is used both for "space" and for "gap" (as we have defined it here). This can cause much confusion, and in this book the terms "gap" and "space" have distinct meanings.
(i*, j*) is found, the substrings α and β giving the optimal local alignment of S1 and S2 are found by tracing back the pointers from cell (i*, j*) until an entry (i', j') is reached that has value zero. Then the optimal local alignment substrings are α = S1[i'..i*] and β = S2[j'..j*].
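The whole computation can be sketched in Python as follows (names ours): fill v(i, j) with the zero-including recurrence, remember the best cell, and trace back to a cell of value zero:

    def local_alignment(s1, s2, s):
        """Return (v*, alpha, beta): the optimal local alignment value
        and a pair of substrings achieving it."""
        n, m = len(s1), len(s2)
        v = [[0] * (m + 1) for _ in range(n + 1)]   # row/column zero stay 0
        best, bi, bj = 0, 0, 0
        for i in range(1, n + 1):
            for j in range(1, m + 1):
                v[i][j] = max(0,
                              v[i - 1][j - 1] + s(s1[i - 1], s2[j - 1]),
                              v[i - 1][j] + s(s1[i - 1], '-'),
                              v[i][j - 1] + s('-', s2[j - 1]))
                if v[i][j] > best:
                    best, bi, bj = v[i][j], i, j
        i, j = bi, bj                                # trace back to a zero cell
        while v[i][j] > 0:
            if v[i][j] == v[i - 1][j - 1] + s(s1[i - 1], s2[j - 1]):
                i, j = i - 1, j - 1
            elif v[i][j] == v[i - 1][j] + s(s1[i - 1], '-'):
                i -= 1
            else:
                j -= 1
        return best, s1[i:bi], s2[j:bj]

    # Match 2, mismatch -2, space -1, as in the text's example.
    s = lambda x, y: -1 if '-' in (x, y) else (2 if x == y else -2)
    print(local_alignment('pqraxabcsrvq', 'xyaxbacsll', s))  # value 8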
Time analysis
Since it takes only four comparisons and three arithmetic operations per cell to compute v(i, j), it takes only O(nm) time to fill in the entire table. The search for v* and the traceback clearly require only O(nm) time as well, so we have established the following desired theorem:
Theorem 11.7.4. For two strings S1 and S2 of lengths n and m, the local alignment problem can be solved in O(nm) time, the same time as for global alignment.
Recall that the pointers in the dynamic programming table for edit distance, global alignment, and similarity encode all the optimal alignments. Similarly, the pointers in the dynamic programming table for local alignment encode the optimal local alignments as follows.
Theorem 11.7.5. All optimal local alignments of two strings are represented in the dynamic programming table for v(i, j) and can be found by tracing any pointers back from any cell with value v*.
We leave the proof as an exercise.
Figure 11.6: Each of the four rows represents part of the RNA sequence of one strain of the HIV-1 virus. The HIV virus mutates rapidly, so that mutations can be observed and traced. The bottom three rows are from virus strains that have each mutated from an ancestral strain represented in the top row. Each of the bottom sequences is shown aligned to the top sequence. A dark box represents a substring that matches the corresponding substring in the top sequence, while each white space represents a gap resulting from a known sequence deletion. This figure is adapted from one in [123].
shows up as a gap when two proteins are aligned. In some contexts, many biologists consider the proper identification of the major (long) gaps as the essential problem of protein alignment. If the long (major) gaps have been selected correctly, the rest of the alignment - reflecting point mutations - is then relatively easy to obtain. An alignment of two strings is intended to reflect the cost (or likelihood) of mutational events needed to transform one string to another. Since a gap of more than one space can be created by a single mutational event, the alignment model should reflect the true distribution of spaces into gaps, not merely the number of spaces in the alignment. It follows that the model must specify how to weight gaps so as to reflect their biological meaning. In this chapter we will discuss different proposed schemes for weighting gaps, and in later chapters we will discuss additional issues in scoring gaps. First we consider a concrete example illustrating the utility of the gap concept.
each gap contributes a constant weight Wg, independent of how long the gap is. That is, each individual space is free, so that s(x, -) = s(-, x) = 0 for every character x. Using the notation established in Section 11.6 (page 226), we write the value of an alignment containing k gaps as

∑_{i=1}^{l} s(S1'(i), S2'(i)) - kWg,
Changing the value of Wg relative to the other weights in the objective function can change how spaces are distributed in the optimal alignment. A large Wg encourages the alignment to have few gaps, and the aligned portions of the two strings will fall into a few substrings. A smaller Wg allows more fragmented alignments. The influence of Wg on the alignment will be discussed more deeply in Section 13.1.
Certainly, you don't want to set a large penalty for spaces, since that would align all of the cDNA string close together, rather than allowing gaps in the alignment corresponding to the long introns. You would also want a rather high penalty for mismatches. Although there may be a few sequencing errors in the data, so that some mismatches will occur even when the cDNA is properly cut up to match the exons, there should not be a large percentage of mismatches. In summary, you want small penalties for spaces, relatively large penalties for mismatches, and positive values for matches. What kind of alignment would likely result using an objective function that has a low space penalty, high mismatch penalty, positive match value of course, and no term for gaps? Remember that the long string contains more than one gene, that the exons are separated by long introns, and that DNA has an alphabet of only four letters present in roughly equal amounts. Under these conditions, the optimal alignment would probably be the longest common subsequence between the short cDNA string and the long anonymous DNA string. And because the introns are long and DNA has only four characters, that common subsequence would likely match all of the characters in the cDNA. Moreover, because of small but real sequencing errors, the true alignment of the cDNA to its exons would not match all the characters. Hence the longest common subsequence would likely have a higher score than the correct alignment of the cDNA to exons. But the longest common subsequence would fragment the cDNA string over the longer DNA and not give an alignment of the desired form; it would not pick out its exons. Putting a term for gaps in the objective function rectifies the problem. By adding a constant gap weight Wg for each gap in the alignment, and setting Wg appropriately (by experimenting with different values of Wg), the optimal alignment can be induced to cut up the cDNA to match its exons in the longer string. As before, the space penalty is set to zero, the match value is positive, and the mismatch penalty is set high.
Processed pseudogenes
A more difficult version of cDNA matching arises in searching anonymous DNA for processed pseudogenes. A pseudogene is a near copy of a working gene that has mutated sufficiently from the original copy so that it can no longer function. Pseudogenes are very common in eukaryotic organisms and may play an important evolutionary role, providing a ready pool of diverse "near genes". Following the view that new genes are created by the process of duplication with modification of existing genes [127, 128, 130], pseudogenes either represent trial genes that failed or future genes that will function after additional mutations. A pseudogene may be located very far from the gene it corresponds to, even on a different chromosome entirely, but it will usually contain both the introns and the exons derived from its working relative. The problem of finding pseudogenes in anonymous sequenced DNA is therefore related to that of finding repeated substrings in a very long string. A more interesting type of pseudogene, the processed pseudogene, contains only the exon substrings from its originating gene. Like cDNA, the introns have been removed and the exons concatenated. It is thought that a processed pseudogene originates as an mRNA that is retranscribed back into DNA (by the enzyme Reverse Transcriptase) and inserted into the genome at a random location. Now, given a long string of anonymous DNA that might contain both a processed pseudogene and its working ancestor, how could the processed pseudogenes be located?
"his
238
11.8. GAPS
241
The alphabet-weight version of the affine gap weight model again sets s(x, -) = s(-, x) = 0 and has the objective of finding an alignment to maximize

∑_{i=1}^{l} s(S1'(i), S2'(i)) - Wg(# gaps) - Ws(# spaces).
The affine gap weight model is probably the most commonly used gap model in the molecular biology literature, although there is considerable disagreement about what Wg and Ws should be [161] (in addition to questions about Wm and Wms). For aligning amino acid strings, the widely used search program FASTA [359] has chosen the default settings of Wg = 10 and Ws = 2. We will return to the question of the choice of these settings in Section 13.1. It has been suggested [57, 183, 466] that some biological phenomena are better modeled by a gap weight function where each additional space in a gap contributes less to the gap weight than the preceding space (a function with negative second derivative). In other words, a gap weight that is a convex, but not affine, function of its length. An example is the function Wg + log_e q, where q is the length of the gap. Some biologists have suggested that a gap function that initially increases to a maximum value and then decreases to near zero would reflect a combination of different biological phenomena that insert or delete DNA. Finally, the most general gap weight we will consider is the arbitrary gap weight, where the weight of a gap is an arbitrary function w(q) of its length q. The constant, affine, and convex weight models are of course subcases of the arbitrary weight model.
Time bounds for gap choices

As might be expected, the time needed to optimally solve the alignment problem with arbitrary gap weights is greater than for the other models. In the case that w(q) is a totally arbitrary function of gap length, the optimal alignment can be found in O(nm² + n²m) time, where n and m are the lengths of the two strings. In the case that w(q) is convex, we will show that the time can be reduced to O(nm log m) (a further reduction is possible, but the algorithm is much too complex for our interests). In the affine (and hence constant) case the time bound is O(nm), which is the same time bound established for the alignment model without the concept of gaps. In the next sections we will first discuss alignment for arbitrary gap weights and then show how to reduce the running time for the case of affine weight functions. The O(nm log m)-time algorithm for convex weights is more complex than the others and is deferred until Chapter 13.
The problem is similar to cDNA matching but more difficult because one does not have the cDNA in hand. We leave it to the reader to explore the use of repeat finding methods, local alignment, and gap weight selection in tackling this problem.
Caveat The problems of cDNA and pseudogene matching illustrate the utility of including gaps in the alignment objective function and the importance of weighting the gaps appropriately. It should be noted, however, that in practice one can approach these matching problems by a judicious use of local alignment without gaps. The idea is that in computing local alignment, one can find not only the most similar pair of substrings but many other highly similar pairs of substrings (see Sections 13.2.4 and 11.7.3). In the context of cDNA or pseudogene matching, these pairs will likely be the exons, and so the needed match of cDNA to exons can be pieced together from a number of nonoverlapping local alignments. This is the more typical approach in practice.
where s(x, -) = s(-, x) = 0 for every character x, and S1' and S2' represent the strings S1 and S2 after insertion of spaces. A generalization of the constant gap weight model is to add a weight Ws for each space in the gap. In this case, Wg is called the gap initiation weight because it can represent the cost of starting a gap, and Ws is called the gap extension weight because it can represent the cost of extending the gap by one space. Then the operator-weight version of the problem is: Find an alignment to maximize [Wm (# matches) - Wms (# mismatches) - Wg (# gaps) - Ws (# spaces)]. This is called the affine gap weight model because the weight contributed by a single gap of length q is given by the affine function Wg + qWs. The constant gap weight model is simply the affine model with Ws = 0.
The affine gap model is sometimes called the linear weight model, and I prefer that term. However, "affine" has become the dominant term in the biological literature, and "linear" there usually refers to an affine function with Wg = 0.
where G(0, 0) = 0, but G(i, j) is undefined when exactly one of i or j is zero. Note that V(0, 0) = w(0), which will most naturally be assigned to be zero. When end spaces, and hence end gaps, are free, then the optimal alignment value is the maximum value over any cell in row n or column m, and the base cases are V(i, 0) = V(0, j) = 0.
Time analysis

Theorem 11.8.1. Assuming that |S1| = n and |S2| = m, the recurrences can be evaluated in O(nm² + n²m) time.
PROOF
We evaluate the recurrences by the usual approach of filling in an (n + 1) × (m + 1) size table one row at a time, where each row is filled from left to right. For any cell (i, j), the algorithm examines one other cell to evaluate G(i, j), j cells of row i to evaluate E(i, j), and i cells of column j to evaluate F(i, j). Therefore, for any fixed row, m(m + 1)/2 = Θ(m²) cells are examined to evaluate all the E values in that row, and for any fixed column, Θ(n²) cells are examined to evaluate all the F values of that column. The theorem then follows since there are n rows and m columns. □
The increase in running time over the previous case (O(nm) time when gaps are not in the model) is caused by the need to look j cells to the left and i cells above to determine V(i, j). Before gaps were included in the model, V(i, j) depended only on the three cells adjacent to (i, j), and so each V(i, j) value was computed in constant time. We will show next how to reduce the number of cell examinations for the case of affine gap weights; later we will show a more complex reduction for the case of convex gap weights.
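A direct Python transcription of the arbitrary-gap-weight recurrences (names ours; spaces themselves are free, and end gaps are charged), which makes the extra cost visible: every cell scans its whole row for E and its whole column for F:

    def gap_alignment(s1, s2, s, w):
        """Optimal alignment value when a gap of length q costs w(q)."""
        n, m = len(s1), len(s2)
        V = [[0] * (m + 1) for _ in range(n + 1)]
        for i in range(1, n + 1):
            V[i][0] = -w(i)                 # one end gap of length i
        for j in range(1, m + 1):
            V[0][j] = -w(j)                 # one end gap of length j
        for i in range(1, n + 1):
            for j in range(1, m + 1):
                G = V[i - 1][j - 1] + s(s1[i - 1], s2[j - 1])
                E = max(V[i][k] - w(j - k) for k in range(j))   # gap in S1
                F = max(V[l][j] - w(i - l) for l in range(i))   # gap in S2
                V[i][j] = max(E, F, G)
        return V[n][m]

    # Example weights (ours): match +2, mismatch -1, spaces free, w(q) = 5 + q.
    s = lambda x, y: 0 if '-' in (x, y) else (2 if x == y else -1)
    print(gap_alignment('cttcaacc', 'ccaaatcc', s, lambda q: 5 + q))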
Figure 11.8: The recurrences for alignment with gaps are divided into three types of alignments: 1. those that align S1(i) to the left of S2(j), 2. those that align S1(i) to the right of S2(j), and 3. those that align them opposite each other.
1. Alignments of S1[1..i] and S2[1..j] where character S1(i) is aligned to a character strictly to the left of character S2(j). Therefore, the alignment ends with a gap in S1.
2. Alignments of the two prefixes where S1(i) is aligned strictly to the right of S2(j). Therefore, the alignment ends with a gap in S2.
3. Alignments of the two prefixes where characters S1(i) and S2(j) are aligned opposite each other. This includes both the case that S1(i) = S2(j) and that S1(i) ≠ S2(j).
Clearly, these three types of alignments cover all the possibilities.
V(i, j) = max[E(i, j), F(i, j), G(i, j)], where

G(i, j) = V(i - 1, j - 1) + s(S1(i), S2(j)),

E(i, j) = max_{0 ≤ k ≤ j-1} [V(i, k) - w(j - k)],

F(i, j) = max_{0 ≤ l ≤ i-1} [V(l, j) - w(i - l)].

To complete the recurrences, we need to specify the base cases and where the optimal alignment value is found. If all spaces are included in the objective function, even spaces that begin or end an alignment, then the optimal value for the alignment is found in cell (n, m), and the base cases are V(i, 0) = -w(i) and V(0, j) = -w(j).
11.9. Exercises
1. Write down the edit transcript for the alignment example on page 226.
2. The definition given in this book for string transformation and edit distance allows at most one operation per position in each string. But part of the motivation for string transformation and edit distance comes from an attempt to model evolution, where there is no restriction on the number of mutations that could occur at the same position. A deletion followed by an insertion and then a replacement could all happen at the same position. However, even though multiple operations at the same position are allowed, they will not occur in the transformation that uses the fewest number of operations. Prove this.
3. In the discussion of edit distance, all transforming operations were assumed to be done to one string only, and a "hand-waving" argument was given to show that no greater generality is gained by allowing operations on both strings. Explain in detail why there is no loss in generality in restricting operations to one string only.
4. Give the details for how the dynamic programming table for edit distance or alignment can
be filled in columnwise or by successive antidiagonals. The antidiagonal case is useful in the context of practical parallel computation. Explain this.
5. In Section 11.3.3, we described how to create an edit transcript from the traceback path through the dynamic programming table for edit distance. Prove that the edit transcript created in this way is an optimal edit transcript.
6. In Part I we discussed the exact matching problem when don't-care symbols are allowed. Formalize the edit distance problem when don't-care symbols are allowed in both strings, and show how to handle them in the dynamic programming solution.
7. Prove Theorem 11.3.4 showing that the pointers in the dynamic programming table completely capture all the optimal alignments.
8. Show how to use the optimal (global) alignment value to compute the edit distance of two strings and vice versa. Discuss in general the formal relationship between edit distance and string similarity. Under what circumstances are these concepts essentially equivalent, and when are they different?
9. The method discussed in this chapter to construct an optimal alignment left back-pointers while filling in the dynamic programming (DP) table, and then used those pointers to trace back a path from cell (n, m) to cell (0, 0). However, there is an alternate approach that works even if no pointers are available. If given the full DP table without pointers, one can construct an alignment with an algorithm that "works through" the table in a single pass from cell (n, m) to cell (0, 0). Make this precise and show that it can be done as fast as the algorithm that fills in the table.
10. For most kinds of alignments (for example, global alignment without arbitrary gap weights), the traceback using pointers (as detailed in Section 11.3.3) runs in O(n + m) time, which is less than the time needed to fill in the table. Determine which kinds of alignments allow this speedup.
11. Since the traceback paths in a dynamic programming table correspond one-to-one with the optimal alignments, the number of distinct cooptimal alignments can be obtained by computing the number of distinct traceback paths. Give an algorithm to compute this number in O(nm) time.
Hint: Use dynamic programming.
12. As discussed in the previous problem, the cooptimal alignments can be found by enumerating all the traceback paths in the dynamic programming table. Give a backtracking method to find each path, and each cooptimal alignment, in O(n + m) time per path.
13. In a dynamic programming table for edit distance, must the entries along a row be
has already begun or whether a new gap is being started (either opposite character i of S1 or opposite character j of S2). This insight, as usual, is formalized in a set of recurrences.
The recurrences
For the case where end gaps are included in the alignment value, the base case is easily seen to be

V(i, 0) = -Wg - i·Ws and V(0, j) = -Wg - j·Ws,

so that the zero row and zero column of the table for V can be filled in easily. When end gaps are free, then V(i, 0) = V(0, j) = 0. The general recurrences are

V(i, j) = max[E(i, j), F(i, j), G(i, j)],
G(i, j) = V(i-1, j-1) + s(S1(i), S2(j)),
E(i, j) = max[E(i, j-1), V(i, j-1) - Wg] - Ws,
F(i, j) = max[F(i-1, j), V(i-1, j) - Wg] - Ws.

To better understand these recurrences, consider the recurrence for E(i, j). By definition, S1(i) will be aligned to the left of S2(j). The recurrence says that either

1. S1(i) is exactly one place to the left of S2(j), in which case a gap begins in S1 opposite character S2(j), and E(i, j) = V(i, j-1) - Wg - Ws, or
2. S1(i) is to the left of S2(j-1), in which case the same gap in S1 is opposite both S2(j-1) and S2(j), and E(i, j) = E(i, j-1) - Ws.

An explanation for F(i, j) is similar, and G(i, j) is the simple case of aligning S1(i) opposite S2(j). As before, the value of the optimal alignment is found in cell (n, m) if right end spaces contribute to the objective function. Otherwise the value of the optimal alignment is the maximum value in the nth row or mth column.

The reader should be able to verify that these recurrences are correct but might wonder why V(i, j-1) and not G(i, j-1) is used in the recurrence for E(i, j). That is, why is E(i, j) not max[E(i, j-1), G(i, j-1) - Wg] - Ws? This recurrence would be incorrect because it would not consider alignments that have a gap in S2 bordered on the left by character j-1 of S2 and ending opposite character i of S1, followed immediately by a gap in S1. The expanded recurrence E(i, j) = max[E(i, j-1), G(i, j-1) - Wg, V(i, j-1) - Wg] - Ws would allow for all alignments and would be correct, but the inclusion of the middle term (G(i, j-1) - Wg) is redundant because the last term (V(i, j-1) - Wg) includes it.
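The recurrences translate almost line for line into code. Below is a minimal sketch; the scores and the particular weights Wg and Ws are illustrative assumptions, and end gaps are charged as in the base case above.

```python
# Sketch of the affine-gap recurrences above. The scores and the
# weights Wg (gap initiation) and Ws (per space) are illustrative
# assumptions; end gaps are charged, matching the stated base case.

NEG_INF = float("-inf")

def affine_gap_alignment_value(S1, S2, match=2, mismatch=-1, Wg=3, Ws=1):
    n, m = len(S1), len(S2)

    def s(a, b):
        return match if a == b else mismatch

    V = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    E = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    F = [[NEG_INF] * (m + 1) for _ in range(n + 1)]

    V[0][0] = 0
    for i in range(1, n + 1):            # column 0: one gap in S2
        V[i][0] = F[i][0] = -Wg - i * Ws
    for j in range(1, m + 1):            # row 0: one gap in S1
        V[0][j] = E[0][j] = -Wg - j * Ws

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # extend a gap in S1, or open one following a V-alignment
            E[i][j] = max(E[i][j - 1], V[i][j - 1] - Wg) - Ws
            # extend a gap in S2, or open one following a V-alignment
            F[i][j] = max(F[i - 1][j], V[i - 1][j] - Wg) - Ws
            G = V[i - 1][j - 1] + s(S1[i - 1], S2[j - 1])
            V[i][j] = max(E[i][j], F[i][j], G)
    return V[n][m]
```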
Time analysis

Theorem 11.8.2. The optimal alignment with affine gap weights can be computed in O(nm) time, the same time as for optimal alignment without a gap term.
PROOF
Examination of the recurrences shows that for any pair (i, j), each of the terms V(i, j), E(i, j), F(i, j), and G(i, j) is evaluated by a constant number of references to previously computed values, arithmetic operations, and comparisons. Hence O(nm) time suffices to fill in all the (n+1) x (m+1) cells in the dynamic programming table. □
do not contribute to the cost of the alignment. Show how to use the affine gap recurrences developed in the text to solve the end-gap free version of the affine gap model of alignment. Then consider using the alternate recurrences developed in the previous exercise. Both should run in O(nm) time. Is there any advantage to using one over the other of these recurrences?
29. Show how to extend the agrep method of Section 4.2.3 to allow character insertions and deletions.

30. Give a simple algorithm to solve the local alignment problem in O(nm) time if no spaces are allowed in the local alignment.
31. Repeated substrings. Local alignment between two different strings finds pairs of substrings from the two strings that have high similarity. It is also important to find substrings of a single string that have high similarity. Those substrings represent inexact repeated substrings. This suggests that to find inexact repeats in a single string one should locally align the string against itself. But there is a problem with this approach. If we do local alignment of a string against itself, the best substring will be the entire string. Even using all the values in the table, the best path to a cell (i, j) for i ≠ j may be strongly influenced by the main diagonal. There is a simple fix to this problem. Find it. Can your method produce two substrings that overlap? Is that desirable? Later, in Exercise 17 of Chapter 13, we will examine the problem of finding the most similar nonoverlapping substrings in a single string.
32. Tandem repeats. Let P be a pattern of length n and T a text of length m. Let P^m be the concatenation of P with itself m times, so P^m has length mn. We want to compute a local alignment between P^m and T. That will find an interval in T that has the best global alignment (according to standard alignment criteria) with some tandem repeat of P. This problem differs from the problem considered in Exercise 4 of Chapter 1, because errors (mismatches and insertions and deletions) are now allowed. The particular problem arises in studying the secondary structure of proteins that form what is called a coiled-coil [158]. In that context, P represents a motif or domain (a pattern for our purposes) that can repeat in the protein an unknown number of times, and T represents the protein. Local alignment between P^m and T picks out an interval of T that "optimally" consists of tandem repeats of the motif (with errors allowed). If P^m is explicitly created, then standard local alignment will solve the problem in O(nm²) time. But because P^m consists of identical copies of P, an O(nm)-time solution is possible. The method essentially simulates what the dynamic programming algorithm for local alignment would do if it were executed with P^m and T explicitly. Below we outline the method.
The dynamic programming algorithm will fill in an (m + 1) by (n + 1) table V, whose rows are numbered 0 to m and whose columns are numbered 0 to n. Row 0 and column 0 are initialized to all zero entries. Then in each row i, from 1 to m, the algorithm does the following: It executes the standard local alignment recurrences in row i; it sets V(i, 0) to V(i, n); and then it executes the standard local alignment recurrences in row i again. After completely filling in each row, the algorithm selects the cell with largest V value, as in the standard solution to the local alignment problem.
Clearly, this algorithm only takes O(nm) time. Prove that it correctly finds the value of the optimal local alignment between P^m and T. Then give the details of the traceback to construct the optimal local alignment. Discuss why P was (conceptually) expanded to P^m and not a longer or shorter string.
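For concreteness, here is a minimal sketch of the outlined simulation, assuming simple match/mismatch/space scores (illustrative values, not from the text). Each row is filled twice, with the wrap V(i, 0) ← V(i, n) applied before each pass; the assignment is a no-op before the first pass, since the fresh row is all zeros.

```python
# Sketch of the wrap-around simulation above for local alignment of
# T against P^m. The scores (match/mismatch/space) are illustrative.

def tandem_repeat_local_value(P, T, match=2, mismatch=-1, space=-1):
    n, m = len(P), len(T)
    prev = [0] * (n + 1)                 # finished row i - 1
    best = 0
    for i in range(1, m + 1):
        cur = [0] * (n + 1)
        for _ in range(2):               # fill row i twice
            cur[0] = cur[n]              # the wrap: V(i, 0) <- V(i, n)
            for j in range(1, n + 1):
                sc = match if T[i - 1] == P[j - 1] else mismatch
                cur[j] = max(0,
                             prev[j - 1] + sc,      # T(i) opposite P(j)
                             prev[j] + space,       # space in the pattern
                             cur[j - 1] + space)    # space in T
        best = max(best, max(cur))
        prev = cur
    return best
```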
33. a. Given two strings S1 and S2 (of lengths n and m) and a parameter δ, show how to construct the following matrix in O(nm) time: M(i, j) = 1 if and only if there is an alignment of S1 and S2 in which characters S1(i) and S2(j) are aligned with each other and the value of the
nondecreasing? What about down a column or down a diagonal of the table? Now discuss the same questions for optimal global alignment.
14. Give a complete argument that the formula in Theorem 11.6.1 is correct. Then provide the details for how to find the longest common subsequence, not just its length, using the algorithm for weighted edit distance.
15. As shown in the text, the longest common subsequence problem can be solved as an optimal alignment or similarity problem. It can also be solved as an operation-weight edit distance problem. Let u represent the length of the longest common subsequence of two strings of lengths n and m. Using the operation weights of d = 1, r = 2, and e = 0, we claim that D(n, m) = m + n - 2u, or u = (m + n - D(n, m))/2. So, D(n, m) is minimized by maximizing u. Prove this claim and explain in detail how to find a longest common subsequence using a program for operation-weight edit distance.
16. Write recurrences for the longest common subsequence problem that do not use weights. That is, solve the lcs problem more directly, rather than expressing it as a special case of similarity or operation-weighted edit distance.

17. Explain the correctness of the recurrences for similarity given in Section 11.6.1.

18. Explain how to compute edit distance (as opposed to similarity) when end spaces are free.
19. Prove the one-to-one correspondence between shortest paths in the edit graph and minimum-cost edit transcripts.
21. Prove Theorem 11.6.2, and show in detail the correctness of the method presented for finding the shortest approximate occurrence of P in T ending at position j.
22. Explain how to use the dynamic programming table and traceback to find all the optimal solutions (pairs of substrings) to the local alignment problem for two strings S1 and S2.
23. In Section 11.7.3, we mentioned that the dynamic programming table is often used to identify pairs of substrings of high similarity, which may not be optimal solutions to the local alignment problem. Given a similarity threshold t, that method seeks to find pairs of substrings with similarity value t or greater. Give an example showing that the method might miss some qualifying pairs of substrings.
24. Show how to solve the alphabet-weight alignment problem with affine gap weights in O(nm) time.
25. The discussions for alignment with gap weights focused on how to compute the values in the dynamic programming table and did not detail how to construct an optimal alignment. Show how to augment the algorithm so that it constructs an optimal alignment. Try to limit the amount of additional space required.
26. Explain in detail why the recurrence E(i, j) = max[E(i, j-1), G(i, j-1) - Wg, V(i, j-1) - Wg] - Ws is correct for the affine gap model, but is redundant, in that the middle term (G(i, j-1) - Wg) can be removed.
27. The recurrence relations we developed for the affine gap model follow the logic of paying Wg + Ws when a gap is "initiated" and then paying Ws for each additional space used in that gap. An alternative logic is to pay Wg + Ws at the point when the gap is "completed." Write recurrence relations for the affine gap model that follow that logic. The recurrences should compute the alignment in O(nm) time. Recurrences of this type are developed in [166].
28. In the end-gap free version of alignment, spaces and gaps at either end of the alignment
Usually a scoring matrix is used to score matches and mismatches, and an affine (or linear) gap penalty model is also used. Experiments [51, 447] have shown that the success of this approach is very sensitive to the exact choice of the scoring matrix and penalties. Moreover, it has been suggested that the gap penalty must be made higher in the substrings forming the α and β regions than in the rest of the string (for example, see [51] and [296]). That is, no fixed choice for gap penalty and space penalty (gap initiation and gap extension penalties in the vernacular of computational biology) will work. Or at least, having a higher gap penalty in the secondary regions will more likely result in a better alignment. High gap penalties tend to keep the α and β regions unbroken. However, since insertions and deletions do definitely occur in the loops, gaps in the alignment of regions outside the core should be allowed. This leads to the following alignment problem: How do you modify the alignment model and penalty structure to achieve the requirements outlined above? And how do you find the optimal alignment within those new constraints? Technically, this problem is not very hard. However, the application to deducing secondary structure is very important. Orders of magnitude more protein sequence data are available than protein structure data. Much of what is "known" about protein structure is actually obtained by deduction from protein sequence data. Consequently, deducing structure from sequence is a central goal. A multiple alignment version of this structure prediction problem is discussed in the first part of Section 14.10.2.
37. Given two strings S1 and S2 and a text T, you want to find whether there is an occurrence of S1 and S2 interwoven (without spaces) in T. For example, the strings abac and bbc occur interwoven in cabbabccdw. Give an efficient algorithm for this problem. (It may have a relationship to the longest common subsequence problem.)
38. As discussed earlier in the exercises of Chapter 1, bacterial DNA is often organized into circular molecules. This motivates the following problem: Given two linear strings of lengths n and m, there are n circular shifts of the first string and m circular shifts of the second string, and so there are nm pairs of circular shifts. We want to compute the global alignment for each of these nm pairs of strings. Can that be done more efficiently than by solving the alignment problem from scratch for each pair? Consider both worst-case analysis and "typical" running time for "naturally occurring" input. Examine the same problem for local alignment.
39. The stuttering subsequence problem [328]. Let P and T be strings of n and m characters each. Give an O(m)-time algorithm to determine if P occurs as a subsequence of T.
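For the first part, a single left-to-right scan suffices; a minimal sketch:

```python
# Sketch of the O(m) subsequence test: scan T once, advancing a
# pointer into P on each match.

def occurs_as_subsequence(P, T):
    j = 0
    for c in T:
        if j < len(P) and c == P[j]:
            j += 1
    return j == len(P)
```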
Now let P^i denote the string P where each character is repeated i times. For example, if P = abc then P^3 is aaabbbccc. Certainly, for any fixed i, one can test in O(m) time whether P^i occurs as a subsequence of T. Give an algorithm that runs in O(m log m) time to determine the largest i such that P^i is a subsequence of T. Let Maxi(P, T) denote the value of that largest i. Now we will outline an approach to this problem that reduces the running time from O(m log m) to O(m). You will fill in the details. For a string T, let d be the number of distinct characters that occur in T. For string T and character x in T, define odd(x) to be the positions of the odd occurrences of x in T, that is, the positions of the first, third, fifth, etc. occurrence of x in T. Since there are d distinct characters in T, there are d such odd sets. For example, if T = 0120002112022220110001 then odd(1) is 2, 9, 18. Now define half(T) as the subsequence of T that remains after removing all the characters in positions specified by the d odd sets. For example, half(T)
matrix be computed? The motivation for this matrix is essentially the same as for the matrix described in the preceding problem and is used in [443] and [445].

34. Implement the dynamic programming solution for alignment with a gap term in the objective function, and then experiment with the program to find the right weights to solve the cDNA matching problem.

35. The process by which intron-exon boundaries (called splice sites) are found in mRNA is not well understood. The simplest hope - that splice sites are marked by patterns that always occur there and never occur elsewhere - is false. However, it is true that certain short patterns very frequently occur at the splice sites of introns. In particular, most introns start with the dinucleotide GT and end with AG. Modify the dynamic programming recurrences used in the cDNA matching problem to enforce this fact. There are additional pattern features that are known about introns. Search a library to find information about those conserved features - you'll find a lot of interesting things while doing the search.
Figure 11.10: A rough drawing of a cloverleaf structure. Each of the small horizontal or vertical lines inside a stem represents a base pairing of a-u or c-g.
42. Transfer RNA (tRNA) molecules have a distinctive planar secondary structure called the cloverleaf structure. In a cloverleaf, the string is divided into alternating stems and loops (see Figure 11.10). Each stem consists of two parallel substrings that have the property that any pair of opposing characters in the stem must be complements (a with u; c with g). Chemically, each complementary stem pair forms a bond that contributes to the overall stability of the molecule. A c-g bond is stronger than an a-u bond.
Relate this (very superficial) description of tRNA secondary structure to the weighted nested pairing problem discussed above.
43. The true bonding pattern of complementary bases (in the stems) of tRNA molecules mostly conforms to the noncrossing condition in the definition of a nested pairing. However, there are exceptions, so that when the secondary structure of known tRNA molecules is represented by lines through the circle, a few lines may cross. These violations of the noncrossing condition are called pseudoknots.
Consider the problem of finding a maximum cardinality proper pairing where a fixed number of pseudoknots are allowed. Give an efficient algorithm for this problem, where the complexity is a function of the permitted number of crossings.
44. RNA sequence and structure alignment. Because of the nested pairing structure of RNA, it is easy to incorporate some structural considerations when aligning RNA strings. Here we examine alignments of this kind. Let P be an RNA pattern string with a known pairing structure, and let T be a larger RNA text string with a known pairing structure. To represent the pairing structure in P, let Op(i) be the offset (positive or negative) of the mate of the character at position i, if any. For example, if the character at position 17 is mated to the character at position 46, then Op(17) = 29 and Op(46) = -29. If the character at position i has no mate, then Op(i) is zero. The structure of T is similarly represented by an offset vector OT. Then P exactly occurs in T starting at position j if and only if P(i) = T(j + i - 1) and Op(i) = OT(j + i - 1), for each position i in P.
a. Assuming the lengths of P and T are n and m, respectively, give an O(n + m)-time algorithm to find every place that P exactly occurs in T.
b. Now consider a more liberal criterion for deciding that P occurs in T starting at position j. We again require that P(i) = T(j + i - 1) for each position i in P, but now only require that Op(i) = OT(j + i - 1) when Op(i) is not zero.
above is 0021220101. Assuming that the number of distinct symbols, d, is fixed ahead of time, give an O(m)-time algorithm to find half(T). Now argue that the length of half(T) is at most m/2. This will be used later in the time analysis. Now prove that |Maxi(P, T) - 2 Maxi(P, half(T))| ≤ 1. This fact is the critical one in the method. The above facts allow us to find Maxi(P, T) in O(m) time by a divide-and-conquer recursion. Give the details of the method: Specify the termination conditions of the divide and conquer, prove correctness of the method, set up a recurrence relation to analyze the running time, and then solve the relation to obtain an O(m) time bound. Harder problem: What is a realistic application for the stuttering subsequence problem?
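A minimal sketch of half(T) as defined above: one occurrence counter per distinct character gives the O(m) bound. The final comment checks the worked example.

```python
# Sketch of half(T): delete the first, third, fifth, ... occurrence
# of every character, keeping the even-numbered occurrences.

from collections import defaultdict

def half(T):
    count = defaultdict(int)
    kept = []
    for c in T:
        count[c] += 1
        if count[c] % 2 == 0:
            kept.append(c)
    return "".join(kept)

# half("0120002112022220110001") == "0021220101", matching the example.
```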
40. As seen in the previous problem, it is easy to determine if a single pattern P occurs as a subsequence in a text T. This takes O(m) time. Now consider the problem of determining if any pattern in a set of patterns occurs in a text. If n is the total length of all the patterns in the set, then O(nm) time is obtained by solving the problem for each pattern separately. Try for a time bound that is significantly better than O(nm). Recall that the analogous substring set problem can be solved in O(n + m) time by Aho-Corasick or suffix tree methods.

41. The tRNA folding problem. The following is an extremely crude version of a problem that arises in predicting the secondary (planar) structure of transfer RNA molecules. Let S be a string of n characters over the RNA alphabet a, c, u, g. We define a pairing as a set of disjoint pairs of characters in S. A pairing is called proper if it only contains (a, u) pairs or (c, g) pairs. This constraint arises because in RNA a and u are complementary nucleotides, as are c and g. If we draw S as a circular string, we define a nested pairing as a proper pairing where each pair in the pairing is connected by a line inside the circle, and where the lines do not cross each other (see Figure 11.9). The problem is to find a nested pairing of largest cardinality. Often one has the additional constraint that a character may not be in a pair with either of its two immediate neighbors. Show how to solve this version of the tRNA folding problem in O(n³) time using dynamic programming.
Now modify the problem by adding weights to the objective function so that the weight of an a-u pair is different than the weight of a c-g pair. The goal now is to find a nested pairing of maximum total weight. Give an efficient algorithm for this weighted problem.
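One standard way to organize the unweighted version is an interval dynamic program over substrings S[i..j] (a sketch only; the exercise asks for the details). Cutting the circle at any point turns the noncrossing condition into nestedness on the linear string, and the weighted variant follows by replacing the +1 below with the pair's weight.

```python
# Sketch: N[i][j] is the maximum number of proper, noncrossing pairs
# in S[i..j], with no character paired to an immediate neighbor.

PAIRS = {("a", "u"), ("u", "a"), ("c", "g"), ("g", "c")}

def max_nested_pairing(S):
    n = len(S)
    if n == 0:
        return 0
    N = [[0] * n for _ in range(n)]
    for span in range(2, n):                # j - i >= 2 avoids neighbor pairs
        for i in range(n - span):
            j = i + span
            best = N[i][j - 1]              # character j left unpaired
            for k in range(i, j - 1):       # pair j with some k <= j - 2
                if (S[k], S[j]) in PAIRS:
                    left = N[i][k - 1] if k > i else 0
                    best = max(best, left + 1 + N[k + 1][j - 1])
            N[i][j] = best
    return N[0][n - 1]
```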
two adjacent gaps where each is in a different string. For example, the alignment
x x x x a i b d c e y y y y
would never be found by these modified recurrences. There seems to be no modeling justification to prohibit adjacent gaps in opposite strings. In fact, some mutations, such as substring inversions (which are common in DNA), would be best represented in an alignment as adjacent gaps of this type, unless the model of alignment has an explicit notion of inversion (we will consider such a model in Chapter 19). Another example where adjacent spaces would be natural occurs when comparing two mRNA strings that arise from alternative intron splicing. In eukaryotes, genes are often comprised of alternating regions of exons and introns. In the normal mode of transcription, every intron is eventually spliced out, so that the mRNA molecule reflects a concatenation of the exons. But it can also happen, in what is called alternative splicing, that exons can be spliced out as well as introns. Consider then the situation where all the introns plus exon i are spliced out, and the situation where all the introns plus exon i + 1 are spliced out. When these two mRNA strings are compared, the best alignment may very well put exon i against a gap in the second string, and then put exon i + 1 against a gap in the first string. In other words, the informative alignment would have two adjacent gaps in alternate strings. In that case, the recurrences above do not correctly implement the second viewpoint. Write recurrences for arbitrary gap weights that allow adjacent gaps in the two opposite strings and yet prohibit adjacent gaps in a single string.
c. Discuss when the more liberal definition is reasonable and when it may not be.
Figure 12.1: The similarity of the first i characters of S1^r and the first j characters of S2^r equals the similarity of the last i characters of S1 and the last j characters of S2. (The dotted lines denote the substrings being aligned.)
single row of the full table can be found and stored in those same time and space bounds. This ability will be critical in the method to come. As a further refinement of this idea, the space needed can be reduced to one row plus one additional cell (in addition to the space for the strings). Thus m + 1 space is all that is needed. And, if n < m, then space use can be further reduced to n + 1. We leave the details as an exercise.
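A minimal sketch of this evaluation, assuming simple match/mismatch/space scores (illustrative values): only the previous row is kept, so the final row - or any preselected row - is obtained in O(nm) time and O(m) space.

```python
# Sketch of computing one row of the alignment table in O(m) space.

def last_row(A, B, match=1, mismatch=-1, space=-1):
    """Return row |A| of the global alignment table for A vs B."""
    m = len(B)
    prev = [j * space for j in range(m + 1)]     # row 0
    for i in range(1, len(A) + 1):
        cur = [i * space] + [0] * m
        for j in range(1, m + 1):
            sc = match if A[i - 1] == B[j - 1] else mismatch
            cur[j] = max(prev[j - 1] + sc,       # A(i) opposite B(j)
                         prev[j] + space,        # space in B
                         cur[j - 1] + space)     # space in A
        prev = cur
    return prev
```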
Clearly, the table of V^r(i, j) values can be computed in O(nm) time, and any single preselected row of that table can be computed and stored in O(nm) time using only O(m) space. The initial piece of the full alignment is computed in linear space by computing V(n, m) in two parts. The first part uses the original strings; the second part uses the reverse strings. The details of this two-part computation are suggested in the following lemma.
Lemma 12.1.1. V(n, m) = max_{0 ≤ k ≤ m} [V(n/2, k) + V^r(n/2, m - k)].
In this chapter we look at a number of important refinements that have been developed for certain core string edit and alignment problems. These refinements either speed up a dynamic programming solution, reduce its space requirements, or extend its utility.
Figure 12.2: After finding k*, the alignment problem reduces to finding an optimal alignment in section A of the table and another optimal alignment in section B of the table. The total area of subtables A and B is at most cnm/2. The subpath L_{n/2} through cell (n/2, k*) is represented by a dashed path.
path from cell (n/2, k*) to a cell k2 in row n/2 + 1. That path identifies a subpath of an optimal path from (n/2, k*) to (n, m). These two subpaths taken together form the subpath L_{n/2} that is part of an optimal path L from (0, 0) to (n, m). Moreover, that optimal path goes through cell (n/2, k*). Overall, O(nm) time and O(m) space is used to find k*, k1, k2, and L_{n/2}.

To analyze the full method to come, we will express the time needed to fill in a dynamic programming table of size p by q as cpq, for some unspecified constant c, rather than as O(pq). In that view, the n/2 row of the first dynamic programming computation is found in cnm/2 time, as is the n/2 row of the second computation. Thus, a total of cnm time is needed to obtain and store both rows. The key point to note is that with a cnm-time and O(m)-space computation, the algorithm learns k*, k1, k2, and L_{n/2}. This specifies part of an optimal alignment of S1 and S2, and not just the value V(n, m). By Lemma 12.1.1 it learns that there is an optimal alignment of S1 and S2 consisting of an optimal alignment of the first n/2 characters of S1 with the first k* characters of S2, followed by an optimal alignment of the last n/2 characters of S1 with the last m - k* characters of S2. In fact, since the algorithm has also learned the subpath (subalignment) L_{n/2}, the problem of aligning S1 and S2 reduces to two smaller alignment problems, one for the strings S1[1..n/2 - 1] and S2[1..k1], and one for the strings S1[n/2 + 1..n] and S2[k2..m]. We call the first of the two problems the top problem and the second the bottom problem. Note that the top problem is an alignment problem on strings of lengths at most n/2 and k*, while the bottom problem is on strings of lengths at most n/2 and m - k*. In terms of the dynamic programming table, the top problem is computed in section A of the original n by m table shown in Figure 12.2, and the bottom problem is computed in section B of the table. The rest of the table can be ignored. Again, we can determine the values in the middle row of A (or B) in time proportional to the total size of A (or B). Hence the middle row of the top problem can be determined in at most ck*n/2 time, and the middle row of the bottom problem can be determined in at most c(m - k*)n/2 time. These two times add to cnm/2. This leads to the full idea for computing the optimal alignment of S1 and S2.
PROOF
This result is almost obvious, and yet it requires a proof. Recall that S1[1..i] is the prefix of string S1 consisting of the first i characters and that S1^r[1..i] consists of the last i characters of S1, reversed. Similar definitions hold for S2 and S2^r. For any fixed position k' in S2, there is an alignment of S1 and S2 consisting of an alignment of S1[1..n/2] and S2[1..k'] followed by a disjoint alignment of S1[n/2 + 1..n] and S2[k' + 1..m]. By the definitions of V and V^r, the best alignment of the first type has value V(n/2, k') and the best alignment of the second type has value V^r(n/2, m - k'), so the combined alignment has value V(n/2, k') + V^r(n/2, m - k') ≤ max_k[V(n/2, k) + V^r(n/2, m - k)] ≤ V(n, m).

Conversely, consider an optimal alignment of S1 and S2. Let k' be the right-most position in S2 that is aligned with a character at or before position n/2 in S1. Then the optimal alignment of S1 and S2 consists of an alignment of S1[1..n/2] and S2[1..k'] followed by an alignment of S1[n/2 + 1..n] and S2[k' + 1..m]. Let the value of the first alignment be denoted p and the value of the second alignment be denoted q. Then p must be equal to V(n/2, k'), for if p < V(n/2, k') we could replace the alignment of S1[1..n/2] and S2[1..k'] with an alignment of those strings that has value V(n/2, k'). That would create an alignment of S1 and S2 whose value is larger than the claimed optimal. Hence p = V(n/2, k'). By similar reasoning, q = V^r(n/2, m - k'). So V(n, m) = V(n/2, k') + V^r(n/2, m - k') ≤ max_k[V(n/2, k) + V^r(n/2, m - k)]. Having shown both sides of the inequality, we conclude that V(n, m) = max_k[V(n/2, k) + V^r(n/2, m - k)]. □
By Lemma 12.1.1, there is an optimal alignment whose traceback path in the full dynamic programming table (if one had filled in the full n by m table) goes through cell (n/2, k*). Another way to say this is that there is an optimal (longest) path L from node (0, 0) to node (n, m) in the alignment graph that goes through node (n/2, k*). That is the key feature of k*.
Definition Let L_{n/2} be the subpath of L that starts with the last node of L in row n/2 - 1 and ends with the first node of L in row n/2 + 1.
Lemma 12.1.2. A position k* in row n/2 can be found in O(nm) time and O(m) space. Moreover, a subpath L_{n/2} can be found and stored in those time and space bounds.
PROOF
First, execute dynamic programming to compute the optimal alignment of S1 and S2, but stop after iteration n/2 (i.e., after the values in row n/2 have been computed). Moreover, when filling in row n/2, establish and save the normal traceback pointers for the cells in that row. At this point, V(n/2, k) is known for every 0 ≤ k ≤ m. Following the earlier discussion, only O(m) space is needed to obtain the values and pointers in row n/2. Second, begin computing the optimal alignment of S1^r and S2^r but stop after iteration n/2. Save both the values for cells in row n/2 along with the traceback pointers for those cells. Again, O(m) space suffices, and the value V^r(n/2, m - k) is known for every k. Now, for each k, add V(n/2, k) to V^r(n/2, m - k), and let k* be an index k that gives the largest sum. These additions and comparisons take O(m) time. Using the first set of saved pointers, follow any traceback path from cell (n/2, k*) to a cell k1 in row n/2 - 1. This identifies a subpath that is on an optimal path from cell (0, 0) to cell (n/2, k*). Similarly, using the second set of traceback pointers, follow any traceback
most cnm/2^{i-1}. The final dynamic programming pass to describe the optimal alignment takes cnm time. Therefore, we have the following theorem:
Theorem 12.1.1. Using Hirschberg's procedure OPTA, an optimal alignment of two strings of lengths n and m can be found in ∑_{i=1}^{log n} cnm/2^{i-1} ≤ 2cnm time and O(m) space.
For comparison, recall that cnm time is used by the original method of filling in the full n by m dynamic programming table. Hirschberg's method reduces the space use from O(nm) to O(m) while only doubling the worst-case time needed for the computation.
The call that begins the computation is to OPTA(1, n, 1, m). Note that the subpath L_h is output between the two recursive OPTA calls and that the top problem is called before the bottom problem. The effect is that the subpaths are output in order of increasing h value, so that their concatenation describes an optimal path L from (0, 0) to (n, m), and hence an optimal alignment of S1 and S2.
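Using the last_row sketch given earlier in this section, the k*-finding step of Lemma 12.1.2 can be phrased as follows (again a sketch under the same illustrative scoring):

```python
# Sketch of finding k*: row n/2 of V via the prefix of S1, row n/2 of
# V^r via the reversed suffix and reversed S2, then maximize the sum.

def find_k_star(S1, S2):
    n, m = len(S1), len(S2)
    top = last_row(S1[: n // 2], S2)              # V(n/2, k), 0 <= k <= m
    bot = last_row(S1[n // 2:][::-1], S2[::-1])   # V^r(n/2, m - k)
    return max(range(m + 1), key=lambda k: top[k] + bot[m - k])
```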
¹ I recently attended a meeting concerning the Human Genome Project, where numerous examples were presented in talks. I stopped taking notes after the tenth one.
Definition Given strings S1 and S2 and a fixed number k, the k-difference global alignment problem is to find the best global alignment of S1 and S2 containing at most k mismatches and spaces (if one exists).
The k-difference global alignment problem is a special case of edit distance and is useful when S1 and S2 are believed to be fairly similar. It also arises as a subproblem in more complex string processing problems, such as the approximate PCR primer problem considered in Section 12.2.5. The solution to the k-difference global alignment problem will also be used to speed up global alignment when no bound k is specified.
Definition Given strings P and T, the k-difference inexact matching problem is to find all ways (if any) to match P in T using at most k character substitutions, insertions, and deletions. That is, find all occurrences of P in T using at most k mismatches and spaces. (End spaces in T but not P are free.)
The inclusion of spaces, in addition to mismatches, allows a more robust version of the k-mismatch problem discussed in Section 9.4, but it complicates the problem. Unlike our solution to the k-mismatch problem, the k-differences problem seems to require the use of dynamic programming. The approach we take is to speed up the basic O(nm)-time dynamic programming solution, making use of the assumption that only alignments with at most k differences are of interest.
Figure 12.3: The main diagonal and a strip that is k = 2 spaces off the main diagonal on each side.

comparisons with all sequences in SwissProt . . . Sequences belonging to the same species and having more than 98 percent similarity over 33 amino acids were combined.
A similar example is discussed in [399], where roughly 170,000 DNA sequences "were subjected to an optimal alignment procedure to identify sequence pairs with at least 97% identity". In these alignment problems, one can impose a bound on the number of allowed differences. Alignments that exceed that bound are not of interest - the computation only needs to determine whether two sequences are "sufficiently similar" or not. Moreover, because these applications involve a large number of alignments (all database entries against themselves), efficiency of the method is important. Admittedly, not every bounded-difference alignment problem in biology requires a sophisticated algorithm. But applications are so common, the sizes of some of the applications are so large, and the speedups so great, that it seems unproductive to completely dismiss the potential utility to molecular biology of bounded-difference and bounded-mismatch methods. With this motivation, we now discuss specific techniques that efficiently solve bounded-difference alignment problems.
their similarities and differences is a first step in sorting out their history and the constraints on how they can mutate. The history of their mutations is then represented in the form of an evolutionary tree (see Chapter 17). Collections of HIV viruses have been studied in this way. Another good example of molecular epidemiology [348] arises in tracing the history of Hantavirus infections in the southwest United States that appeared during the early 1990s.

The final two examples come from the milestone paper [162] reporting the first complete DNA sequencing of a free-living organism, the bacterium Haemophilus influenzae Rd. The genome of this bacterium consists of 1,830,137 base pairs, and its full sequence was determined by pure shotgun sequencing without initial mapping (see Section 16.14). Before the large-scale sequencing project, many small, disparate pieces of the bacterial genome had been sequenced by different groups, and these sequences were in the DNA databases. One of the ways the sequencers checked the quality of their large-scale sequencing was to compare, when possible, their newly obtained sequence to the previously determined sequences. If they could not match the appropriate new sequences to the old ones with only a small number of differences, then additional steps were taken to assure that the new sequences were correct. Quoting from [162], "The results of such a comparison show that our sequence is 99.67 percent identical overall to those GenBank sequences annotated as H. influenzae Rd". From the standpoint of alignment, the problem discussed above is to determine whether or not the new sequences match the old ones with few differences.

This application illustrates both kinds of bounded-difference alignment problems introduced earlier. When the location in the genome of the database sequence is known, the corresponding string in the full sequence can be extracted for comparison. The resulting comparison problem is then an instance of the k-difference global alignment problem that will be discussed next, in Section 12.2.3. When the genome location of the database sequence P is not known (and this is common), the comparison problem is to find all the places in the full sequence where P occurs with a very small number of allowed differences. That is then an instance of the k-difference inexact matching problem, which will be considered in Section 12.2.4.

The above story of H. influenzae sequencing will be repeated frequently as systematic large-scale DNA sequencing of various organisms becomes more common. Each full sequence will be checked against the shorter sequences for that organism already in the databases. This will be done not only for quality control of the large-scale sequencing, but also to correct entries in the databases, since it is generally believed that large-scale sequencing is more accurate.

The second application from [162] concerns building a nonredundant database of bacterial proteins (NRBP). For a number of reasons (for example, to speed up the search or to better evaluate the statistical significance of matches that are found), it is helpful to reduce the number of entries in a sequence database (in this case, bacterial protein sequences) by culling out, or combining in some way, highly similar, "redundant" sequences. This was done in the work presented in [162], and a "nonredundant" version of GenBank is regularly compiled at the National Center for Biotechnology Information. Fleischmann et al. [162] write:
Redundancy was removed from NRBP at two stages. All DNA coding sequences were extracted from GenBank . . . and sequences from the same species were searched against each other. Sequences having more than 97 percent identity over regions longer than 100 nucleotides were combined. In addition, the sequences were translated and used in protein
Since end spaces in the text T are free, row zero of the dynamic programming table is initialized with all zero entries. That allows a left end of T to be opposite a gap without incurring any penalty.
Definition A d-path in the dynamic programming table is a path that starts in row zero and specifies a total of exactly d mismatches and spaces.

Definition A d-path is farthest-reaching in diagonal i if it is a d-path that ends in diagonal i, and the index of its ending column c (along diagonal i) is greater than or equal to the ending column of any other d-path ending in diagonal i.
Graphically, a d-path is farthest-reaching in diagonal i if no other d-path reaches a cell farther along diagonal i.
that this implies that m - n ≤ k is a necessary condition for there to be any solution.) Therefore, to find any k-difference global alignment, it suffices to fill in the dynamic programming table in a strip consisting of 2k + 1 cells in each row, centered on the main diagonal. When assigning values to cells in that strip, the algorithm follows the established recurrence relations for edit distance except for cells on the upper and lower border of the strip. Any cell on the upper border of the strip ignores the term in the recurrence relation for the cell above it (since it is out of the strip); similarly, any cell on the lower border ignores the term in the recurrence relation for the cell to its left. If m = n, the size of the strip can be reduced by half (Exercise 4). If there is no global alignment of S1 and S2 with k or fewer differences, then the value obtained for cell (n, m) will be greater than k. That value, greater than k, is not necessarily the correct edit distance of S1 and S2, but it will indicate that the correct value for (n, m) is greater than k. Conversely, if there is a global alignment with d ≤ k differences, then the corresponding path is contained inside the strip and so the value in cell (n, m) will be correctly set to d. The total area of the strip is O(kn), which is O(km) because n and m can differ by at most k. In summary, we have
Theorem 12.2.1. There is a global alignment of S1 and S2 with at most k differences if and only if the above algorithm assigns a value of k or less to cell (n, m). Hence the k-difference global alignment problem can be solved in O(km) time and O(km) space.
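A minimal sketch of the strip computation follows. For clarity it allocates the full table and lets out-of-strip cells hold an "infinite" value, which makes the border rule automatic; an implementation honoring the O(km) space bound would store only the 2k + 1 cells of each row.

```python
# Sketch of k-difference global alignment: fill only the strip of
# width 2k + 1 around the main diagonal; cells outside the strip
# keep the value INF, so border cells ignore out-of-strip neighbors.

def k_difference_global(S1, S2, k):
    n, m = len(S1), len(S2)
    if abs(n - m) > k:
        return None                       # no such alignment can exist
    INF = k + 1
    D = [[INF] * (m + 1) for _ in range(n + 1)]
    for j in range(0, min(m, k) + 1):
        D[0][j] = j
    for i in range(1, n + 1):
        if i <= k:
            D[i][0] = i
        for j in range(max(1, i - k), min(m, i + k) + 1):
            D[i][j] = min(D[i - 1][j - 1] + (S1[i - 1] != S2[j - 1]),
                          D[i - 1][j] + 1,   # INF above the strip border
                          D[i][j - 1] + 1)   # INF left of the strip border
    return D[n][m] if D[n][m] <= k else None
```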
Let k' be the largest value of k used in the method. Clearly, k' ≤ 2k*. So the total work in the method is O(k'm + k'm/2 + k'm/4 + ... + m) = O(k'm) = O(k*m).
Figure 12.6: The dashed line shows path R', the farthest-reaching (d - 1)-path ending on diagonal i. The edge M on diagonal i just past the end of R' must correspond to a mismatch between P and T (the characters involved are denoted P(k) and T(k') in the figure).
Theorem 12.2.3. Each of the three paths R1, R2, and R3 is a d-path ending on diagonal i. The farthest-reaching d-path on diagonal i is the path among R1, R2, and R3 that extends the farthest along diagonal i.
PROOF Each of the three paths is an extension of a (d - 1)-path, and each extension adds either one more space or one more mismatch. Hence each is a d-path, and each ends on diagonal i by definition. So the farthest-reaching d-path on diagonal i must either be the farthest-reaching of R1, R2, and R3, or it must reach farther on diagonal i than all three of those paths.

Let R' be the farthest-reaching (d - 1)-path on diagonal i. The edge of the alignment graph along diagonal i that immediately follows R' must correspond to a mismatch; otherwise R' would not be the farthest-reaching (d - 1)-path on i. Let M denote that edge (see Figure 12.6). Let R* denote the farthest-reaching d-path on diagonal i. Since R* ends on diagonal i, there is a point where R* enters diagonal i for the last time and then never leaves diagonal i. If R* enters diagonal i for the last time above edge M, then R* must traverse edge M; otherwise R* would not reach as far as R3. When R* reaches M (which marks the end of R'), it must have exactly (d - 1) differences, for if that portion of R* had fewer than (d - 1) differences in total, then it could traverse M, creating a (d - 1)-path on diagonal i that reached farther on diagonal i than R', contradicting the definition of R'. It follows that if R* enters diagonal i above M, then it has d differences after it traverses M, and so it ends exactly where R3 ends. So if R* is not R3, then R* must enter diagonal i below edge M.

Suppose R* enters diagonal i for the last time below edge M. Then R* must have d differences at that point of entry, for if it had fewer differences then R' would again fail to be the farthest-reaching (d - 1)-path on diagonal i. Now R* enters diagonal i for the last time either from diagonal i - 1 or from diagonal i + 1; say i + 1 (the case of i - 1 is symmetric). So R* traverses a vertical edge from diagonal i + 1 to diagonal i, which adds a space to R*. That means that the point where R* leaves diagonal i + 1 defines a (d - 1)-path on diagonal i + 1. Hence R* leaves diagonal i + 1 at or above the point where the path R1 does. Then R1 and R* each have d spaces or mismatches at the points where they enter diagonal i for the last time, and then they each run along diagonal i until reaching an edge corresponding to a mismatch. It follows that R* cannot reach farther along diagonal i than R1 does. So in this case, R* ends exactly where R1 ends.
Figure 12.5: Path R1 consists of a farthest-reaching (d - 1)-path on diagonal i + 1 (shown with dashes), followed by a vertical edge (dots), which adds the dth difference to the alignment, followed by a maximal path (solid line) on diagonal i that corresponds to (maximal) identical substrings in P and T.
Hybrid dynamic programming: the high-level idea

At the high level, the O(km) method will run in k iterations, each taking O(m) time. In every iteration d ≤ k, the method finds the end of the farthest-reaching d-path on diagonal i, for each i from -n to m. The farthest-reaching d-path on diagonal i is found from the farthest-reaching (d - 1)-paths on diagonals i - 1, i, and i + 1. This will be explained in detail below. Any farthest-reaching d-path that reaches row n specifies the end location (in T) of an occurrence of P with exactly d differences. We will implement each iteration in
O(n + m) time, yielding the desired O(km)-time bound. Space will be similarly bounded.
Details
To begin, when d = 0, the farthest-reaching 0-path ending on diagonal i corresponds to the longest common extension of T[i..m] and P[1..n], since a 0-path allows no mismatches or spaces. Therefore, the farthest-reaching 0-path ending on diagonal i can be found in constant time, as detailed in Section 9.1. For d > 0, the farthest-reaching d-path on diagonal i can be found by considering the following three particular paths that end on diagonal i.
Path R1 consists of the farthest-reaching (d - 1)-path on diagonal i + 1, followed by a vertical edge (a space in text T) to diagonal i, followed by the maximal extension along diagonal i that corresponds to identical substrings in P and T (see Figure 12.5). Since R1 begins with a (d - 1)-path and adds one more space for the vertical edge, R1 is a d-path.

Path R2 consists of the farthest-reaching (d - 1)-path on diagonal i - 1, followed by a horizontal edge (a space in pattern P) to diagonal i, followed by the maximal extension along diagonal i that corresponds to identical substrings in P and T. Path R2 is a d-path.

Path R3 consists of the farthest-reaching (d - 1)-path on diagonal i, followed by a diagonal edge corresponding to a mismatch between a character of P and a character of T, followed by a maximal extension along diagonal i that corresponds to identical substrings from P and T. Path R3 is a d-path. (See Figure 12.6.)
Each of the paths R1, R2, and R3 ends with a maximal extension corresponding to identical substrings of P and T. In the case of R1 (or R2), the starting positions of the two substrings are given by the last entry point of R1 (or R2) into diagonal i. In the case of R3, the starting position is the position just past the last mismatch on R3.
Theorem 12.2.4. All locations in T where pattern P occurs with at most k differences can be found in O(km) time and O(km) space. Moreover, the actual alignment of P and T for each of these locations can be reconstructed in O(km) total time.

Sometimes this k-differences result is reported in a somewhat simpler but less useful form, requiring less space. If one is only interested in the end locations in T where P inexactly matches with at most k differences, then the O(km) space bound can be reduced to O(n + m). The idea is that the ends of the farthest-reaching (d - 1)-paths in each diagonal are not needed after iteration d and can be discarded. Thus only O(n + m) space is needed to solve the simpler problem.
Theorem 12.2.5. In O(km) time and O(n + m) space, the algorithm can find all the end locations in T where P matches T with at most k differences.
The case that R* enters diagonal i for the last time from diagonal i - 1 is symmetric, and R* then ends exactly where R2 ends. In each case we have shown that R*, the assumed farthest-reaching d-path on diagonal i, ends at the ending point of either R1, R2, or R3. Hence the farthest-reaching d-path on diagonal i is the farthest-reaching of R1, R2, and R3. □

Theorem 12.2.3 is the key to the O(km)-time method.
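A minimal sketch of the resulting iteration follows. The diagonal extensions below are direct character comparisons, so the sketch does not achieve O(km); the stated bound requires replacing the while loop with the constant-time longest-common-extension queries of Section 9.1.

```python
# Sketch of the farthest-reaching d-path computation. L[i] holds the
# deepest row (in P) reached on diagonal i; diagonal i contains the
# cells (row, row + i). UNSET marks diagonals not yet reachable.

def k_difference_matches(P, T, k):
    n, m = len(P), len(T)
    UNSET = -10 ** 9

    def extend(r, i):
        # maximal run of matches down diagonal i from row r
        while r < n and r + i < m and P[r] == T[r + i]:
            r += 1
        return r

    diagonals = range(-n, m + 1)
    L = {i: (extend(0, i) if i >= 0 else UNSET) for i in diagonals}
    ends = {n + i for i in diagonals if L[i] == n}
    for d in range(1, k + 1):
        prev = dict(L)
        for i in diagonals:
            r = max(prev.get(i - 1, UNSET),      # R2: horizontal edge, space in P
                    prev.get(i, UNSET) + 1,      # R3: mismatch edge
                    prev.get(i + 1, UNSET) + 1)  # R1: vertical edge, space in T
            r = min(r, n, m - i)                 # stay inside the table
            L[i] = extend(r, i) if r >= 0 else UNSET
            if L[i] == n:
                ends.add(n + i)                  # occurrence ends at T position n + i
    return sorted(ends)
```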
Implementation and time analysis

For each value of d and each diagonal i, we record the column in diagonal i where the farthest-reaching d-path ends. Since d ranges from 0 to k and there are only O(n + m) diagonals, all of these values can be stored in O(km) space. In iteration d, the algorithm only needs to retrieve the values computed in iteration (d - 1). The entire set of stored values can be used to reconstruct any alignment of P in T with at most k differences. We leave the details of that reconstruction as an exercise.

Now we proceed with the time analysis. For each d and each i, the ends of three particular (d - 1)-paths must be retrieved. For a fixed d and i, this takes constant time, so these retrievals take O(km) time over the entire algorithm. There are also O(km) path extensions, each along a diagonal, that must be computed. But each path extension corresponds to a maximal identical substring in P and T starting at particular known positions in P and T. Hence each path extension requires finding the longest substring starting at a given location in T that matches a substring starting at a given location of P. In other words, each path extension requires a longest common extension computation. In Section 9.1 on page 196 we showed that any longest common extension computation can be done in constant time, after linear preprocessing of the strings. Hence the O(km) extensions can all be computed in O(n + m + km) = O(km) total time. Furthermore, as shown in Section 9.1.2, these extensions can be implemented using only a copy of the two strings and a suffix tree for the smaller of the two strings. In summary, we have
explained and analyzed in full detail. Two other methods (Wu-Manber [482] and Pevzner-Waterman [373]) will also be mentioned. These methods do not completely achieve the goal of provable linear and sublinear expected running times for all practical ranges of errors (and this remains a superb open problem), but they do achieve the goal when the error rate k/n is "modest". Let σ be the size of the alphabet used in P and T. As usual, n is the length of P and m is the length of T. For the general discussion, an occurrence of P in T with at most k errors (mismatches or differences, depending on the particular problem) will be called an approximate occurrence of P. The high-level outline of most of the methods is the following:
Partition approach to approximate matching

a. Partition T or P into consecutive regions of a given length r (to be specified later).

b. Search phase: Using various exact matching methods, search T to find length-r intervals of T (or regions, if T was partitioned) that could be contained in an approximate occurrence of P. These are called surviving intervals. The nonsurviving intervals are definitely not contained in any approximate occurrence of P, and the goal of this phase is to eliminate as many intervals as possible.

c. Check phase: For each surviving interval R of T, use some approximate matching method to explicitly check if there is an approximate occurrence of P in a larger interval around R.
The methods differ primarily in the choice of r, in the choice of string to partition, and in the exact matching methods used in the search phase. The methods also differ in the definition of a region but are not generally affected by the specific choice of checking algorithm. The point of the partition approach is to exclude a large amount of T, using only (sub)linear expected time in the search phase, so that only (sub)linear expected time is needed to check the few surviving intervals. A balance is needed between searching and checking, because a reduction in the time used in one phase causes an increase in the time used in the other phase.
Lemma 12.3.1. Suppose P matches a substring T' of T with at most k differences. Then T' must contain at least one interval of length r that exactly matches one of the r-length regions of the partition of P.
PROOF
In the alignment of P to T', each region of P aligns to some part of T' (see Figure 12.7), defining k + 1 subalignments. If each of those k + 1 subalignments were to contain at least one error (mismatch or space), then there would be more than k differences in total, a contradiction. Therefore, one of the first k + 1 regions of P must be aligned to an interval of T' without any errors. □
Note that the lemma also holds even for the k-mismatch problem (i.e., when no space
a specified p, the k-difference primer problem can be solved for a small range of choices for k and still be expected to pick out useful primer candidates.

How to solve the k-difference primer problem

We follow the approach introduced in [243]. The method examines each position j in α separately. For any position j, the k-difference primer problem becomes: Find the shortest prefix of string α[j..n] (if it exists) that has edit distance at least k from every substring in β. The problem for a fixed j is essentially the "reverse" of the k-difference inexact matching problem. In the k-difference inexact matching problem we want to find the substrings of T that P matches with at most k differences. But now, we want to reject any prefix of α[j..n] that matches a substring of β with less than k differences. The viewpoint is reversed, but the same machinery works. The solution is to run the k-differences algorithm with string α[j..n] playing the role of P and β playing the role of T. The algorithm computes the farthest-reaching d-paths, for each d up to k - 1, in each diagonal. If row n is reached by any d-path for d ≤ k - 1, then the entire string α[j..n] matches a substring of β with less than k differences, so no acceptable primer can start at j. But if none of the farthest-reaching (k - 1)-paths reach row n, then there is an acceptable primer starting at position j. In detail, if none of the farthest-reaching (k - 1)-paths reaches row r < n, then the substring γ = α[j..r] has edit distance at least k from every substring in β. Moreover, if r is the smallest row with that property, then α[j..r] is the shortest substring starting at j that has edit distance at least k from every substring in β. The above algorithm is applied to each potential starting position j in α, yielding the following theorem:

Theorem 12.2.6. If α has length n and β has length m, then the k-difference primer selection problem can be solved in O(knm) total time.
takes only O(kn) worst-case time. If no spaces are allowed in the alignment of P to T' (only matches and mismatches), then the simpler O(kn)-time approach based on longest common extension (Section 9.1) can be used, or, if attention is paid to exactly where in P any match is found, then O(n) time suffices for each check.
mn²(k + 1)/σ^r < cm,

for some constant c. To simplify the analysis, replace k by n - 1 and solve for r in

mn³/σ^r = cm,

so σ^r = n³/c and r = log_σ n³ - log_σ c. But r = ⌊n/(k + 1)⌋, so
Figure 12.7: The first k + 1 regions of P.
insertions are allowed). Lemma 12.3.1 leads to the following approximate matching algorithm:
Algorithm BYP
a. Let P be the set of k + 1 substrings of P taken from the first k + 1 regions of P's partition.

b. Build a keyword tree (Section 3.4) for the set of "patterns" P.

c. Using the Aho-Corasick algorithm (Section 3.4), find I, the set of all starting locations in T where any pattern in P occurs exactly.

d. For each index i ∈ I, use an approximate matching algorithm (usually based on dynamic programming) to locate the end points of all approximate occurrences of P in the substring T[i - n - k..i + n + k] (i.e., in an appropriate-length interval around i).
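A minimal sketch of steps a through c, with Python's built-in string find standing in for the keyword tree and Aho-Corasick search (an illustrative substitution with different asymptotics). Each surviving index would then be checked as in step d.

```python
# Sketch of the BYP search phase: collect all exact occurrences in T
# of the first k + 1 regions of P's partition.

def byp_candidates(P, T, k):
    n = len(P)
    r = n // (k + 1)                  # region length; assumes n > k
    assert r > 0, "pattern must be longer than k"
    hits = set()
    for q in range(k + 1):
        region = P[q * r:(q + 1) * r]
        j = T.find(region)
        while j != -1:
            hits.add(j)               # surviving location, to be checked
            j = T.find(region, j + 1)
    return sorted(hits)

# Step d would run an approximate matcher (e.g., the k-difference
# method above) on T[max(0, i - n - k) : i + n + k] for each i here.
```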
By Lemma 12.3.1, it is easy to establish that the algorithm correctly finds all approximate occurrences of P in T. The point is that the interval around each i is "large enough" to align with any approximate occurrence of P that spans i, and there can be no approximate occurrence of P outside such an interval. A formal proof is left as an exercise. Now we focus on specific implementation details and time analysis. Building the keyword tree takes O(n) time, and the Aho-Corasick algorithm takes O(m) (worst-case) time (Section 3.4). So steps b and c take O(n + m) time. There are a number of alternate implementations for steps b and c. One is to build a suffix tree for T, and then use it to find every occurrence in T of a pattern in 𝒫 (see Section 7.1). However, that would be very space intensive. A space-efficient version of this approach is to construct a generalized suffix tree for 𝒫 only, and then match T to it (in the way that matching statistics are computed in Section 7.8.1). Both approaches take O(n + m) worst-case time, but are no faster in expected time because every character in T is examined. A faster approach in practice is to use the Boyer-Moore set matching method based on suffix trees, which was developed in Section 7.16. That algorithm will skip over parts of T, and hence it breaks the O(m) bottleneck. A different variation was developed by Wu and Manber [482], who implement steps b and c using the Shift-And method (Section 4.2) on a set of patterns. Another approach, found in the paper of Pevzner and Waterman [373] and elsewhere, uses hashing to identify long exact matching substrings of P and T. Of course, one can use suffix trees to find long common substrings, and one could develop a Karp-Rabin type method as well. Hashing, or approaches based on suffix trees, that look directly for long common substrings between P and T, seem a bit more robust than BYP because there is no string partition involved. But the only stated time bounds in [373] are the same as those for BYP.

In the checking phase, step d, the algorithm executes some approximate matching algorithm between P and an interval of T of length O(n), for each index in I. Naively, each of these checks can be done in O(n²) time by dynamic programming (global alignment). Even this time bound will be adequate to establish an expected O(m) overall running time for the range of error rates that will be detailed below. Alternately, the Landau-Vishkin method (Section 12.2) based on suffix trees could be used, so that each check takes only O(kn) worst-case time.
The CL search is executed on the 2m/n regions of T. For any region R, let j* denote its left end, and let j' be the last value of j in the search (i.e., the value of j when count reaches k or when j - j* exceeds n/2). Thus, in R, matching statistics are computed for an interval of length j' - j* ≤ n/2. With the matching statistics algorithm in Section 7.8.1, the time used to compute those matching statistics is O(j' - j*). Now the expected value of j' - j* is less than or equal to k times the expected value of ms(i), for any i. Let E(M) denote the expected value of a matching statistic, and let e denote the expected number of regions that survive the search phase. Then the expected time for the search phase is O(2mk E(M)/n), and the expected time for the checking phase is O(kne). In the following analysis, we assume that P is a random string where each character is chosen uniformly from an alphabet of size σ.
Lemma 12.3.3. E(M), the expected value of a matching statistic, is O(log_σ n).
PROOF For a fixed length d, there are roughly n substrings of length d in P, and there are σ^d distinct strings of length d that can be constructed. So, for any specific string α of length d, the probability that α is found somewhere in P is less than n/σ^d. This is true for any d, but vacuously true until σ^d = n (i.e., when d = log_σ n). Let X be the random variable that has value log_σ n if ms(i) ≤ log_σ n; otherwise it has value ms(i). Then E(M) ≤ E(X) ≤ log_σ n + Σ_{d > log_σ n} n/σ^d = log_σ n + O(1) = O(log_σ n). ∎
Corollary 12.3.1. The expected time that CL spends in the search phase is O(2mk log_σ n/n), which is sublinear in m for k < n/log_σ n.
The analysis for e, the expected number of surviving regions, is too difficult to present here. It is shown in [94] that when k = O(n/log_σ n), then e = m/n⁴, so the expected time that CL spends in the checking phase is O(kne) = O(km/n³) = o(m). The search phase of CL is so effective in excluding regions of T that the checking phase has very small expected running time.
Figure 12.8: Each full region in T has length n/2. This ensures that no matter how P is aligned with T, P spans one full region.
Figure 12.9: Blowup of one region in T aligned with one copy of P. Each black box shows a mismatch between a character in P and its counterpart in T.
The CL search phase finds regions of T that do not match any substring of P with at most k mismatches. These regions are excluded, and then an interval around each surviving region is checked using an approximate matching method, as in BYP. The search phase of CL relies heavily on the matching statistics discussed in Section 7.8.1. Recall that the value of matching statistic ms(i) is the length of the longest substring starting at position i of T that matches a substring somewhere (at an unspecified location) in P. Recall also that, for any string S, all the matching statistics for the positions in S can be computed in O(|S|) total time. This is true even when S is a substring of a larger string T. Now let T' be the substring of one of the regions of T's partition that matches a substring P' of P with at most k mismatches (see Figure 12.9). The alignment of P' and T' can be divided into at most k + 1 intervals where no mismatches occur, alternating with intervals containing only mismatches. Let i be the starting position of any one of those matching intervals, and let l be its length. Then clearly, ms(i) ≥ l. The CL search phase exploits this observation. It executes the following algorithm for each region R in the partition of T:
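A hedged Python reconstruction of the per-region search, assembled from this description and the time analysis: starting at the left end j* of region R, repeatedly jump j to j + ms(j) + 1, charging one mismatch per jump, and stop when k + 1 jumps have been made or the jumps have covered n/2 positions; R survives only if n/2 positions are covered within k + 1 jumps. The matching statistics are computed naively here (Section 7.8.1 computes them all in linear time from a suffix tree of P), and the names are ours.

def matching_statistic(T, i, P):
    """ms(i): length of the longest substring starting at T[i] that
    appears somewhere in P (naive; see Section 7.8.1 for linear time)."""
    length = 0
    while i + length < len(T) and T[i:i + length + 1] in P:
        length += 1
    return length

def cl_region_survives(T, j_star, P, k):
    n = len(P)
    j, count = j_star, 0
    while count < k + 1 and j - j_star <= n // 2:
        j += matching_statistic(T, j, P) + 1   # one mismatch per jump
        count += 1
    return j - j_star > n // 2   # covered the region within k+1 jumps

def cl_surviving_regions(T, P, k):
    step = max(1, len(P) // 2)
    return [j for j in range(0, len(T), step)
            if cl_region_survives(T, j, P, k)]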
If R is a surviving region, then in the checking phase CL executes an approximate matching algorithm for P against a neighborhood of T that starts n/2 positions to the left of R and ends n/2 positions to its right. This neighborhood is of size 3n/2, and so each check can be executed in O(kn) time. The correctness of the CL method comes from the following lemma, and the fact that the neighborhoods are "large enough".
Lemma 12.3.2. When the CL search declares a region R excluded, then there is no occurrence of P in T with at most k mismatches that completely contains region R.
The proof is easy and is left to the reader, as is its use in a formal proof of the correctness of CL. Now we consider the time analysis.
Since the intervals of interest double in length, the time used per interval grows fourfold in each successive iteration. However, the number of surviving matches is expected to fall hyper-exponentially in each successive iteration, more than offsetting the increase in computation time per interval. With this iterative expansion, the effort expended to check any initial surviving match is doled out incrementally throughout the O(log n) iterations, and is not continued for any surviving match past an iteration where it is excluded. We now describe in a bit more detail how the initial surviving matches are found and how they are incrementally extended in successive iterations.
The first iteration

Definition For a string S and a value d, the d-neighborhood of S is the set of all strings that can be derived from S with at most d mismatches, insertions, or deletions.
For example, over the two-letter alphabet {a, b}, if S = aba and d = 1, then the 1-neighborhood of S is {bba, aaa, abb, aaba, abaa, baba, abba, abab, ba, aa, ab}. It is created from S by the operations of mismatch, insertion, and deletion, respectively. The condensed d-neighborhood of S is created from the d-neighborhood of S by removing any string that is a prefix of another string in the d-neighborhood. The condensed 1-neighborhood of S is {bba, aaa, aaba, abaa, baba, abba, abab}.

Recall that pattern P is initially partitioned into subpatterns of length log_σ m (assumed to be an integer). Let 𝒫 be the set of these subpatterns. In the first iteration, the algorithm (conceptually) constructs the condensed d-neighborhood for each subpattern in 𝒫, and then finds all locations of substrings in text T that exactly match one of the substrings in one of the condensed d-neighborhoods. In this way, the method finds all substrings of T that ε-match one of the subpatterns in 𝒫. These ε-matches form the initial surviving matches. In actuality, the tasks of generating the substrings in the condensed d-neighborhoods and of searching for their exact occurrences in T are intertwined, and they require text T to have been preprocessed into some index structure. This structure could be a suffix tree, a suffix array, or a hash table holding short substrings of T. Details are found in [342]. Myers [342] shows that when the length of the subpatterns is O(log_σ m), then the first iteration can be implemented to run in O(km^{p(ε)} log m) expected time. The function p(ε) is complicated, but it is convex (negative second derivative), increasing, and increases more slowly as the alphabet size grows. For DNA it has value less than one for ε ≤ 1/3, and for proteins it has value less than one for ε ≤ 0.56.
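For the toy example just given, both definitions can be checked with a brute-force Python sketch (ours; at realistic subpattern lengths the generation must be intertwined with the text index, as noted above).

def neighborhood(S, d, alphabet):
    """All strings obtainable from S by at most d single-character
    mismatches, insertions, or deletions (S itself excluded)."""
    frontier, out = {S}, set()
    for _ in range(d):
        nxt = set()
        for s in frontier:
            for i in range(len(s) + 1):
                for c in alphabet:
                    if i < len(s) and c != s[i]:
                        nxt.add(s[:i] + c + s[i + 1:])   # mismatch
                    nxt.add(s[:i] + c + s[i:])           # insertion
                if i < len(s):
                    nxt.add(s[:i] + s[i + 1:])           # deletion
        out |= nxt
        frontier = nxt
    out.discard(S)
    return out

def condensed(nbhd):
    """Drop any string that is a prefix of another string in the set."""
    return {s for s in nbhd
            if not any(t != s and t.startswith(s) for t in nbhd)}

print(sorted(condensed(neighborhood("aba", 1, "ab"))))

On S = aba with d = 1 and alphabet {a, b}, this prints exactly the condensed 1-neighborhood listed above.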
Successive iterations

To explain the central idea, let α = α₀α₁, where |α₀| is assumed equal to |α₁|.
Lemma 12.3.4. Suppose α ε-matches β. Then β can be divided into two substrings β₀ and β₁ such that β = β₀β₁, and either α₀ ε-matches β₀ or α₁ ε-matches β₁.
This lemma (used in reverse) is the key to determining how to expand the intervals around the surviving matches in each iteration. For simplicity, assume that n is a power of two and that log_σ m is also a power of two. Let B be a binary tree representing successive divisions of P into two equal-size parts, until each part has length log_σ m (see Figure 12.10). The substrings written at the leaves are the subpatterns used in the first iteration of Myers's algorithm. Iteration i of the algorithm examines substrings of P that label (some) nodes of B i levels above the leaves (counting the leaves as level 1).
The method is too involved to present fully here, but we can introduce some of the ideas it uses to address deficiencies in the other exclusion methods. There are two basic problems with the Baeza-Yates-Perleberg and the Chang-Lawler methods (and the other exclusion methods we have mentioned). First, the exclusion criteria they use permit a large expected number of surviving regions compared to the expected number of true approximate matches. That is, not every initial surviving region is actually contained in an approximate match, and the ratio of expected survivors to expected matches is fairly high (for random patterns and text). Further, the higher the permitted error rate, the more severe the problem. Second, when a surviving region is first located, the methods move directly to full dynamic programming computations (or some other relatively expensive operations) to check for an approximate match in a large interval around the surviving region. Hence the methods are required to do a large amount of computation for a large number of intervals that don't contain any approximate match.

Compared to the other exclusion methods, Myers's method contains two different ideas to make it both more selective (finding fewer initial surviving regions) and less expensive to test the ones that are found. Myers's algorithm begins in a manner similar to the other exclusion methods. It partitions P into short substrings (to be specified later) and then finds all locations in T where these substrings appear with a small number of allowed differences. The details of the search are quite different from the other methods, but the intent (to exclude a large portion of T from further consideration) is the same. Each of these initial alignments of a substring of P that is found (approximately) in T is called a surviving match. A surviving match roughly plays the role of a surviving region in the other exclusion methods, but it specifies two substrings (one in P and one in T) rather than just a single substring, as a surviving region does. Another way to think of a surviving match is as a roughly diagonal subpath in the alignment graph for P and T. Having found the initial surviving matches (or surviving regions), all the other exclusion methods we have mentioned would next check a full interval of length roughly 2n around each surviving region in T to see if it contains an approximate match to P. In contrast, Myers's method will incrementally extend and check a growing interval around each initial surviving match, to create longer surviving matches or to exclude a surviving match from further consideration. This is done in about O(log n) iterations. (Recall that n is the length of the pattern and m is the length of the text.)
Definition For a given error rate ε, a string S ε-matches a substring of T if S matches the substring using at most ε|S| differences (insertions, deletions, and mismatches).
For example, let S = aba and ε = 2/3. Then ac ε-matches S using one mismatch and one deletion operation. In the first iteration, the pattern P is partitioned into consecutive, nonoverlapping subpatterns of length log_σ m (assumed to be an integer), and the algorithm finds all substrings in T that ε-match one of these short subpatterns (discussed in more detail below). The length of these subpatterns is short enough that all the ε-matches can be found in sublinear expected time for a wide range of ε values. These ε-matches are the initial surviving matches. The algorithm next tries to extend each initial surviving match to become an ε-match between substrings (in P and T) that are roughly twice as long as those in the current surviving match. This is done by dynamic programming in an appropriate interval around the surviving match. In each successive iteration, the method applies a more selective and expensive filter, trying to double the length of the ε-match around each surviving match.
Two problems
We assume the existence of a scoring matrix used to compute the value of any alignment, and hence "edit distance" here refers to weighted edit distance. We will discuss two problems in the text and introduce two more related problems in the exercises.

1. The P-against-all problem Given strings P and T, compute the edit distance between P and every substring T' of T.

2. The threshold all-against-all problem Given strings P and T and a threshold d, find every pair of substrings P' of P and T' of T such that the edit distance between P' and T' is less than d.

The threshold all-against-all problem is similar to problems mentioned in Section 12.2.1 concerning the construction of nonredundant sequence databases. However, the threshold all-against-all problem is harder, because it asks for the alignment of all pairs of substrings, not just all pairs of strings.
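As a baseline for the hybrid method developed below, here is the straightforward Θ(nm²)-time dynamic programming solution to the P-against-all problem in Python (ours), with unit edit costs standing in for the scoring matrix assumed in the text.

def p_against_all(P, T):
    """Edit distance (unit costs) between P and every substring T[i:j].
    One table per starting position i; O(n m^2) time overall."""
    n, m = len(P), len(T)
    dist = {}
    for i in range(m + 1):
        col = list(range(n + 1))      # P-prefixes vs. the empty substring
        dist[(i, i)] = n
        for t in range(i + 1, m + 1):
            tc = T[t - 1]
            new = [t - i]
            for p in range(1, n + 1):
                new.append(min(col[p - 1] + (P[p - 1] != tc),  # sub
                               col[p] + 1,                     # space in P
                               new[p - 1] + 1))                # space in T'
            col = new
            dist[(i, t)] = col[n]     # distance from P to T[i:t]
    return dist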
Figure 12.10: Binary tree B defining the successive divisions of P and its partition into regions of length log_σ m (equal to two in this figure).
Suppose at iteration i - 1 that substrings P' and T' in the query and text, respectively, form a surviving match (i.e., are found to align to form an ε-match). Let P'' be the parent of P' in tree B. If P' is a left child of P'', then in iteration i the algorithm tries to ε-match P'' to a substring of T in an interval that extends T' to the right. Conversely, if P' is a right child of P'', then the algorithm tries to ε-match P'' with a substring in an interval that extends T' to its left. By Lemma 12.3.4, if the ε-match of P' to T' is part of an ε-match of P to a substring of T, then P'' will ε-match the appropriate substring of T. Moreover, the specified interval in T that must be compared against P'' is just twice as long as the interval for T'. The end result, as detailed in [342], is that all of the checking, and hence the entire algorithm, runs in O(km^{p(ε)} log m) expected time.
There are several points to emphasize. First, the exposition given above is only intended to be an outline of Myers's method, without any analysis. The full details of the algorithm and analysis are found in [342]; [337] provides an overview, in relation to other exclusion methods. Second, unlike the BYP and CL methods, the error rates that establish sublinear (or linear) running times do not depend on the length of P. In BYP and CL, the permitted error rate decreases as the length of P increases. In Myers's method, the permitted error rate depends only on the alphabet size. Third, although the expected running times for both CL and Myers's method are sublinear (for the proper range of error rates), there is an important difference in the nature of these sublinearities. In the CL method, the sublinearity is due to a multiplicative factor that is less than one. But in Myers's method, the sublinearity is due to an exponent that is less than one. So as a function of m, the CL bound increases linearly (although for any fixed value of m the expected running time is less than m), while the bound for Myers's method increases sublinearly in m. This is an important distinction, since many databases are rapidly increasing in size. However, Myers's method assumes that the text T has already been preprocessed into some index structure, and the time for that preprocessing (while linear in m) is not included in the above time bounds. In contrast, the running times of the BYP and CL methods include all the work needed for those methods. Finally, Myers has shown that in experiments on problems of meaningful size in molecular biology (patterns of length 80 on texts of length 3 million), the k-difference algorithms of Sections 12.2.4 and 12.2.3 run 100 to 500 times slower than his expected sublinear method.
Figure 12.11: A cartoon of the dynamic programming tables for computing the edit distance between P and substring T' (top) and between P and substring T'' (bottom). The two tables share the subtable for P and substring A (shown as a shaded rectangle). This shaded subtable only needs to be computed once.
Figure 12.12: A piece of the suffix tree for T. The traversal from the root to node v is accompanied by the computation of subtable A (from the previous figure). At that point, the last row and column of subtable A are stored at node v. Computing the subtable B corresponds to the traversal from v to the leaf representing substring T'. After the traversal reaches the leaf for T', it backs up to node v, retrieves the row and column stored there, and uses them to compute the subtable C needed to compute the edit distance between P and T''.
This gives the edit distance between P and every substring beginning at position i of T. When the depth-first traversal backs up to a node v, and v has an unvisited child v', the row and column stored at v are retrieved and extended as the traversal follows the new (v, v') edge (see Figure 12.12). It should be clear that this suffix-tree approach does correctly compute the edit distance between P and every substring of T, and it does exploit repeated substrings (small or large) that may occur in T. But how effective is it compared to the Θ(nm²)-time dynamic programming approach?
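The traversal can be sketched in Python with a naive, uncompacted suffix trie standing in for the suffix tree (a toy stand-in without the compacted tree's guarantees, but one that shows how substrings of T sharing a prefix share dynamic programming columns). Names are ours.

def build_suffix_trie(T):
    """Uncompacted suffix trie: each node is a dict child-char -> node."""
    root = {}
    for i in range(len(T)):
        node = root
        for c in T[i:]:
            node = node.setdefault(c, {})
    return root

def p_against_all_trie(P, T):
    """Edit distance between P and every distinct substring T' of T.
    One DP column is extended per trie edge, so a repeated substring
    of T is processed only once."""
    n = len(P)
    results = {}

    def dfs(node, prefix, col):       # col[p] = dist(P[:p], prefix)
        for c, child in node.items():
            new = [col[0] + 1]
            for p in range(1, n + 1):
                new.append(min(col[p - 1] + (P[p - 1] != c),
                               new[p - 1] + 1, col[p] + 1))
            results[prefix + c] = new[n]
            dfs(child, prefix + c, new)

    dfs(build_suffix_trie(T), "", list(range(n + 1)))
    return results

The number of trie edges plays the role of the tree "length" defined below: it is exactly the number of columns the traversal generates.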
This critical distinction has been the source of some confusion in the literature [50], [56].
Recent estimates put the amount of repeated human DNA at 50 to 60%. That is, 50 to 60% of all human DNA is contained in nontrivial-length, structured substrings that show up repeatedly throughout the genome. Similar levels of redundancy appear in many other organisms.
We examined this question empirically for DNA strings up to one million characters, and the lengths of the resulting suffix trees were around m²/10.
An O(C + R)-time method
The method uses a suffix tree 𝒯_P for string P and a suffix tree 𝒯_T for string T. The worst-case time for the method will be shown to be O(C + R), where C is the length of 𝒯_P times the length of 𝒯_T (independent of whatever the output criteria are) and R is the size of the output. (The definition of the length of a suffix tree is found in Section 12.4.1.) That is, the method will compute certain dynamic programming cell values, which will be the same no matter what the output criteria are; then, when a cell value satisfies the particular output criteria, the algorithm will collect the relevant substrings associated with that cell. Hence our description of the method holds for the full all-against-all problem, the threshold version of the problem, or any other version with different reporting criteria. To start, recall that each node in 𝒯_P represents a substring of P and that every substring of P is a prefix of a substring represented by a node of 𝒯_P. In particular, each suffix of P is represented by a leaf of 𝒯_P. The same is true of T and 𝒯_T.

Definition The dynamic programming table for a pair of nodes (u, v), from 𝒯_P and 𝒯_T, respectively, is defined as the dynamic programming table for the edit distance between the string represented by node u and the string represented by node v.
Definition The string-length of an edge label in a suffix tree is the length of the string labeling that edge (even though the label is compactly represented by a constant number of characters). The length of a suffix tree is the sum of the string-lengths for all of its edges.

The length of a suffix tree 𝒯 for a string T of length m can be anywhere between Θ(m) and Θ(m²), depending on how much repetition exists in T. In computational experiments using long substrings of mammalian DNA (length around one million), the string-lengths of the resulting suffix trees have been around m²/10. Now the number of dynamic programming columns that are generated during the depth-first traversal of 𝒯 is exactly the length of 𝒯. Each column takes O(n) time to generate, and so we can state
Lemma 12.4.1. The time used to generate the needed columns in the depth-first traversal is O(n × (length of 𝒯)).

We must also account for the time and space used to write the rows and columns stored at each node of 𝒯. In a suffix tree with m leaves there are O(m) internal nodes, and a single row and column take at most O(m + n) time and space to write. Therefore, the time and space needed for the row and column stores is O(m² + nm) = O(m²). Hence, we have
Theorem 12.4.1. The total time for the suffix-tree approach is O(n × (length of 𝒯) + m²), and the maximum space used is O(m²).
Reducing space
The size of the required output is Θ(m²), since the problem calls for the edit distance between P and each of Θ(m²) substrings of T, making the Θ(m²) term in the time bound acceptable. On the other hand, the space used seems excessive, since the space needed by the dynamic programming solution without using a suffix tree is just O(nm) and can be reduced to O(m). We now modify the suffix-tree approach to also use only O(n + m) space, within the same time bounds as before. First, there is no need to store the current column at each node v. When backing up from a child v' of v, we can use the current column at v' and the string labeling edge (v, v') to recompute the column for node v. This does, however, double the total time for computing the columns. There is also no need to keep the current row n at each node v. Instead, only O(n) space is needed for row entries. The key idea is that the current table is expanded columnwise, so if the string-depth of v is j and the string-depth of v' is j + d, then the row n stored at v and v' would be identical for the first j entries. We leave it as an exercise to work out the details. In summary, we have
Theorem 12.4.2. The hybrid suffix-tree/dynamic-programming approach to the P-against-all problem can be implemented to run in O(n × (length of 𝒯) + m²) time and O(n + m) space.
The above time and space bounds should be compared to the Θ(nm²) time and O(n + m) space bounds that result from a straightforward application of dynamic programming. The effectiveness in practice of this method depends on the length of 𝒯 for realistic strings. It is known that for random strings, the length of 𝒯 is Θ(m²), making the method unattractive. (For random strings, the suffix tree is bushy for string-depths of log_σ m or less, where σ is the size of the alphabet. But beyond that depth, the suffix tree becomes very sparse, since the probability is very low that a substring of length greater than log_σ m occurs more than once in the string.) However, strings with more structured repetitions (as occur in DNA) should give rise to suffix trees with lengths that are small enough to make this method useful.
Figure 12.14: The suffix trees for P and T with nodes numbered by string-depth. Note that these numbers are not the standard suffix position numbers that label the leaves. The ordered list of node pairs begins (1,1), (1,2), (1,3), ... and ends with (6,8).
Details of the algorithm

First, number the nonroot nodes of 𝒯_P according to string-depth, with smaller string-depth first.³ Separately, number the nodes of 𝒯_T according to string-depth. Then form a list L of all pairs of node numbers, one from each tree, in lexicographic order. Hence, pair (u, v) appears before pair (p, q) in the list if and only if u is less than p, or u is equal to p and v is less than q. (See Figure 12.14.) It follows that if u' is the parent of u in 𝒯_P and v' is the parent of v in 𝒯_T, then (u', v') appears before (u, v).

Next, process each pair of nodes (u, v) in the order that it appears in L. Assume again that u' is the parent of u, that v' is the parent of v, and that the labels on the respective edges are α and β. To process a node pair (u, v), retrieve the value in the single lower right cell from the stored part of the (u', v') table; retrieve the column stored with the pair (u, v'); and retrieve the row stored with the pair (u', v). These three pairs of nodes have already been processed, due to the lexicographic ordering of the list. From those retrieved values, and from the substrings α and β, compute the new |α| by |β| subtable completing the (u, v) table. Store with pair (u, v) the last row and column of the newly computed subtable.

Now suppose cell (i, j) is in the new |α| by |β| subtable, and its value satisfies the output criteria. The algorithm must find and output all locations of the two substrings specified by (i, j). As usual, a depth-first traversal to the leaves below u and v will find all the starting positions of those strings. The length of the strings is determined by i and j. Hence, when it is required to output pairs of substrings that satisfy the reporting criteria, the time to collect the pairs is just proportional to the number of them.

Correctness and time analysis
The correctness of the method follows from the fact that, at the highest level of description, the method computes the edit distance for every pair of substrings, one from each string. It does this by generating and examining every cell in the dynamic programming table for every pair of substrings (although it avoids redundant examinations). The only subtle point is that the method generates and examines the cells in each table in an incremental manner to exploit the commonalities between substrings, and hence it avoids regenerating and reexamining any cell that is part of more than one table. Further, when the method finds a cell satisfying the reporting criteria (a function of value and length), it can find all pairs of substrings specified by that cell.
³ Actually, any topological numbering will do, but string-depth has some advantages when heuristic accelerations are added.
Figure 12.13: The dynamic programming table for (u, v) is shown below the suffix trees for P and T. The string on the path to node u is Zα and the string on the path to node v is XYβ. Every cell in the (u, v) table, except any in the lower right rectangle, is also in the (u, v'), (u', v), or (u', v') tables. The new part of the (u, v) table can be computed from the shaded entries and substrings α and β. The shaded entries contain exactly one entry from the (u', v') table; |α| entries from the last column in the (u, v') table; and |β| entries from the last row in the (u', v) table.
The threshold all-against-all problem could be solved (ignoring time) by computing the dynamic programming table for each pair of leaves, one from each tree, and then examining every entry in each of those tables. Hence it certainly would be solved by computing the dynamic programming table for each pair of nodes and then examining each entry in those tables. This is essentially what we will do, but we proceed in a way that avoids redundant computation and examination. The following lemma gives the key observation.
Lemma 12.4.2. Let u' be the parent of node u in 𝒯_P and let α be the string labeling the edge between them. Similarly, let v' be the parent of v in 𝒯_T and let β be the string labeling the edge between them. Then, all but the bottom right |α| × |β| entries in the dynamic programming table for the pair (u, v) appear in one of the tables for (u', v'), (u', v), or (u, v'). Moreover, that bottom right part of the (u, v) table can be obtained from the other three tables in O(|α||β|) time. (See Figure 12.13.)
The proof of this lemma is immediate from the definitions and the edit distance recurrences. The computation for the new part of the (u, v) table produces an |α| by |β| rectangular subtable that forms the lower right section of the (u, v) table. In the algorithm to be developed below, we will store and associate with each node pair (u, v) the last column and the last row of this |α| by |β| subtable. We can now fully describe the algorithm.
For example, if Π = 5, 3, 4, 9, 6, 2, 1, 8, 7, 10, then {3, 4, 6, 8, 10} and {5, 9, 10} are both increasing subsequences in Π. (Recall the distinction between subsequences and substrings.) We are interested in the problem of computing a longest increasing subsequence in Π. The method we develop here will later be used to solve the problem of finding the longest common subsequence of two (or more) strings.
Definition A decreasing subsequence of Π is a subsequence of Π whose numbers are nonincreasing from left to right.
For example, under this definition, (8, 5, 5, 3, 1, 1) is a decreasing subsequence in the sequence 4, 8, 3, 9, 5, 2, 5, 3, 10, 1, 9, 1, 6. Note the asymmetry in the definitions of increasing and decreasing subsequences. The term "decreasing" is slightly misleading. Although "nonincreasing" is more precise, it is too clumsy a term to use in high repetition.
Definition A cover of Π is a set of decreasing subsequences of Π that contain all the numbers of Π.
For example, {5, 3, 2, 1}, {4}, {9, 6}, {8, 7}, {10} is a cover of Π = 5, 3, 4, 9, 6, 2, 1, 8, 7, 10. It consists of five decreasing subsequences, two of which contain only a single number.
Definition The size of the cover is the number of decreasing subsequences in it, and a smallest cover is a cover with minimum size among all covers.
We will develop an O(n log n)-time method that simultaneously constructs a longest increasing subsequence (lis) and a smallest cover of Π. The following lemma is the key.
" \
The locations of the substrings specified by a cell are found, as usual, by a traversal to a subset of leaves in the trees. A formal proof of correctness is left to the reader as an exercise. For the time analysis, recall that the length of 𝒯_P is the sum of the string-lengths of all the edges of 𝒯_P. If P has length n, then the length of 𝒯_P ranges between n and n²/2, depending on how repetitive P is. The length of 𝒯_T is similarly defined and ranges between m and m²/2, where m is the length of T.
Lemma 12.4.3. The time used by the algorithm for all the needed dynamic programming computations and cell examinations is proportional to the product of the length of 𝒯_P and the length of 𝒯_T. Hence that time, defined as C, ranges between nm and n²m²/4.

PROOF In the algorithm, each pair of nodes is processed exactly once. At the point a pair (u, v) is processed, the algorithm spends O(|α||β|) time to compute a subtable and examine it, where α and β are the labels on the edges into u and v, respectively. Each edge-label in 𝒯_P therefore forms exactly one dynamic programming subtable with each edge-label in 𝒯_T. The time to build the subtables for a given edge-label α of 𝒯_P is |α| × (length of 𝒯_T). Summing over all edges in 𝒯_P gives the claimed time bound. ∎
The above lemma counts all the time used in the algorithm except the time used to collect and report pairs of substrings (by their starting position, length, and edit distance). But since the algorithm collects substrings when it sees a cell value that satisfies the reporting criteria, the time devoted to output is just the time needed to traverse the tree to collect output pairs. We have already seen that this time is proportional to the number of pairs collected, R. Hence, we have
Theorem 12.4.3. The complete time for the algorithm is O(C + R).
How effective is the suffix-tree approach?

As in the P-against-all problem, the effectiveness of this method in practice depends on the lengths of 𝒯_P and 𝒯_T. Clearly, the product of those lengths, C, falls as P and T increase in repetitiveness. We have built a suffix tree for DNA strings of total length around one million bases and have observed that the tree length is around one tenth of the maximum possible. In that case, C is around n²m²/100, so all else being equal (which is unrealistic), standard dynamic programming for the all-against-all problem should run about one hundred times slower than the hybrid dynamic programming approach. A vastly larger "all-against-all" computation on amino acid strings was reported in [183]. Although their description is very vague, they essentially used the suffix tree approach described here, computing similarity instead of edit distance. But, rather than a hundred-fold speedup, they claim to have achieved nearly a million-fold speedup over standard dynamic programming.⁴ That level of speedup is not supported by theoretical considerations (recall that for a random string S of length m, a substring of length greater than log_σ m is very unlikely to occur in S more than once). Nor is it supported by the experiments we have done. The explanation may be the incorporation of an early stopping rule, described in [183] only by the vague statement "Time is saved because the matching of patricia⁵ subtrees is aborted when the score falls below a liberally chosen similarity limit". That rule is apparently very effective in reducing running time, but without a clearer description of it we cannot define precisely what specific all-against-all problem was solved.
⁴ They finish a computation in 405 cpu days that they claim would otherwise have taken more than a million cpu years without the use of suffix trees.
⁵ A patricia tree is a variant of a suffix tree.
We will shortly see how to reduce the time needed to find the greedy cover to O(n log n), but we first show that the greedy cover is a smallest cover of Π and that a longest increasing subsequence can easily be extracted from it.
Lemma 12.5.3. There is an increasing subsequence I of Π containing exactly one number from each decreasing subsequence in the greedy cover C. Hence I is the longest possible, and C is the smallest possible.
PROOF Let x be an arbitrary number placed into decreasing subsequence i > 1 (counting from the left) by the greedy algorithm. At the time x was considered, the last number y of subsequence i - 1 must have been smaller than x. Also, since y was placed before x was, y appears before x in Π, and (y, x) forms an increasing subsequence in Π. Since x was arbitrary, the same argument applies to y, and if i - 1 > 1 then there must be a number z in subsequence i - 2 such that z < y and z appears before y in Π. Repeating this argument until the first subsequence is reached, we conclude that there is an increasing subsequence in Π containing one number from each of the first i subsequences in the greedy cover and ending with x. Choosing x to be any number in the last decreasing subsequence proves the lemma. ∎
Algorithmically, a longest increasing subsequence can be extracted from the greedy cover by working right to left: pick any number x in the last decreasing subsequence; then, in the preceding subsequence, find a number that is smaller than x and appears before x in Π (the proof of Lemma 12.5.3 guarantees one exists); make it the new x and repeat until the first subsequence is reached.
Since no number is examined twice during this procedure, a longest increasing subsequence can be found in O(n) time given the greedy cover. An alternate approach is to use pointers. As the greedy cover is being constructed, whenever a number x is added to subsequence i, connect a pointer from x to the number at the current end of subsequence i - 1. After the greedy algorithm finishes, pick any number in the last decreasing subsequence and follow the unique path of pointers starting from it and ending at the first subsequence.
Figure 12.15: The greedy cover of Π = 5, 3, 4, 9, 6, 2, 1, 8, 7, 10. Each decreasing subsequence runs vertically: {5, 3, 2, 1}, {4}, {9, 6}, {8, 7}, {10}.
Lemma 12.5.1. If I is an increasing subsequence of Π with length equal to the size of a cover of Π, call it C, then I is a longest increasing subsequence of Π and C is a smallest cover of Π.

PROOF No increasing subsequence of Π can contain more than one number contained in any decreasing subsequence of Π, since the numbers in an increasing subsequence strictly increase left to right, whereas the numbers in a decreasing subsequence are nonincreasing left to right. Hence no increasing subsequence of Π can have length greater than the size of any cover of Π. Now assume that the length of I is equal to the size of C. This implies that I is a longest increasing subsequence of Π, because no other increasing subsequence can be longer than the size of C. Conversely, C must be a smallest cover of Π, for if there were a smaller cover C' then I would be longer than the size of C', which is impossible. Hence, if the length of I equals the size of C, then I is a longest increasing subsequence and C is a smallest cover. ∎

Lemma 12.5.1 is the basis of a method to find a longest increasing subsequence and a smallest cover of Π. The idea is to decompose Π into a cover C such that there is an increasing subsequence I containing exactly one number from each decreasing subsequence in C. Without concern for efficiency, a cover of Π can be built in the following straightforward way:
Naive cover algorithm

Starting from the left of Π, examine each successive number in Π and place it at the end of the first (left-most) decreasing subsequence that it can extend. If there are no decreasing subsequences it can extend, then start a new (decreasing) subsequence to the right of all the existing decreasing subsequences.
To elaborate, if x denotes the current number from Π being examined, then x extends subsequence i if x is smaller than or equal to the current number at the end of subsequence i, and if x is strictly larger than the last number of each subsequence to the left of i. For example, with Π as before, the first two numbers examined are put into a decreasing subsequence {5, 3}. Then the number 4 is examined, which is in position 3 of Π. Number 4 cannot be placed at the end of the first subsequence because 4 is larger than 3. So 4 begins a new subsequence of its own to the right of the first subsequence. Next, the number 9 is considered, and since it cannot be added to the end of either subsequence {5, 3} or {4}, it begins a third subsequence. Next, 6 is considered; it can be added after 9 but not to the end of either of the two subsequences to the left of 9. The final cover of Π produced by the algorithm is shown in Figure 12.15, where each subsequence runs vertically. Clearly, this algorithm produces a cover of Π, which we call the greedy cover. To see whether a number x can be added to any particular decreasing subsequence, we only have to compare x to the number, say y, currently at the end of the subsequence: x can be added if and only if x ≤ y. Hence if there are k subsequences at the time x is considered, then the time to add x to the correct subsequence is O(k). Since k ≤ n, the greedy cover can be built naively in O(n²) total time.
For example, the list Π(S1, S2) for the above two strings is 6, 3, 2, 4, 1, 6, 3, 2, 5. To understand the importance of Π(S1, S2), we examine what an increasing subsequence in that list means in terms of the original strings.
Theorem 12.5.2. Every increasing subsequence I in Π(S1, S2) specifies an equal-length common subsequence of S1 and S2 and vice versa. Thus a longest common subsequence of S1 and S2 corresponds to a longest increasing subsequence in the list Π(S1, S2).

PROOF First, given an increasing subsequence I of Π(S1, S2), we can create a string S and show that S is a subsequence of both S1 and S2. String S is successively built up during a left-to-right scan of I. During this scan, also construct two lists of indices specifying a subsequence of S1 and a subsequence of S2. In detail, if number j is encountered in I during the scan, and number j is contained in the sublist contributed by character i of S1, then add character S1(i) to the right end of S, add number i to the right end of the first index list, and add j to the right end of the other index list. For example, consider I = 3, 4, 5 in the running example. The number 3 comes from the sublist for character 1 of S1, the number 4 comes from the sublist for character 2, and the number 5 comes from the sublist for character 4. So the string S is abc. That string is a subsequence of S1 found in positions 1, 2, 4 and is a subsequence of S2 found in positions 3, 4, 5. The list Π(S1, S2) contains one sublist for every position in S1, and each such sublist is in decreasing order. So at most one number from any sublist is in I, and any position in S1 contributes at most one character to S. Further, the sublists are arranged left to right corresponding to the order of the characters in S1, so S is certainly a subsequence of S1. The numbers in I strictly increase and correspond to positions in S2, so S is also a subsequence of S2. In summary, we have proven that every increasing subsequence in Π(S1, S2) can be used to create an equal-length common subsequence of S1 and S2. The converse argument, that a common subsequence yields an increasing subsequence, is very similar and is left as an exercise. ∎
Π(S1, S2) is a list of r integers, and the longest increasing subsequence problem can be solved in O(r log l) time on an r-length list when the longest increasing subsequence has length l. If n ≤ m, then l ≤ n, yielding the following theorem:
Theorem 12.5.3. The longest common subsequence problem can be solved in O(r log n) time.
The O(r log n) result for lcs was first obtained by Hunt and Szymanski [238]. Their algorithm is superficially very different from the one above, but in retrospect one can see similar ideas embodied in it. The relationship between the lcs and lis problems was partly identified by Apostolico and Guerra [25, 27] and made explicit by Jacobson and Vo [2U] and independently by Pevzner and Waterman [370]. The lcs method based on lis is an example of what is called sparse dynamic programming, where the input is a relatively sparse set of pairs that are permitted to align. This approach, and in fact the solution technique discussed here, has been very extensively generalized by a number of people and appears in detail in [137] and [138].
That is, the last number from any subsequence i - 1 appears in L before the last number from subsequence i.

Lemma 12.5.4. At any point in the execution of the algorithm, the list L is sorted in increasing order.

PROOF Assume inductively that the lemma holds through iteration k - 1. When examining the kth number in Π, call it x, suppose x is to be placed at the end of subsequence i. Let w be the current number at the end of subsequence i - 1, let y be the current number at the end of subsequence i (if any), and let z be the number at the end of subsequence i + 1 (if it exists). Then w < x ≤ y by the workings of the algorithm, and since y < z by the inductive assumption, x < z also. In summary, w < x < z, so the new list L remains sorted. ∎
Note that L itself need not be (and generally will not be) an increasing subsequence of Π. Although x < z, x appears to the right of z in Π. Despite this, the fact that L is in sorted order means that we can use binary search to implement each iteration of the algorithm building the greedy cover. Each iteration k considers the kth number x in Π and searches the current list L for the left-most number that is greater than or equal to x. Since L is in sorted order, this can be done in O(log n) time by binary search. The list Π has n numbers, so we have

Theorem 12.5.1. The greedy cover can be constructed in O(n log n) time. A longest increasing subsequence and a smallest cover of Π can therefore be found in O(n log n) time. In fact, if p is the length of the lis, then it can be found in O(n log p) time.
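The whole pipeline is short in Python: the greedy cover built with binary search over the sorted list L, back pointers recorded as in the pointer alternative above, and the lis read off by following them (a sketch; names are ours).

import bisect

def lis_via_greedy_cover(pi):
    """O(n log n): ends[i] is the number currently ending decreasing
    subsequence i (the sorted list L); back[j] points to the chosen
    predecessor in subsequence i - 1."""
    ends, end_idx = [], []
    back = [None] * len(pi)
    for j, x in enumerate(pi):
        i = bisect.bisect_left(ends, x)   # left-most subsequence x extends
        if i > 0:
            back[j] = end_idx[i - 1]
        if i == len(ends):
            ends.append(x); end_idx.append(j)    # start a new subsequence
        else:
            ends[i] = x; end_idx[i] = j          # extend subsequence i
    out, j = [], end_idx[-1]        # any number in the last subsequence
    while j is not None:
        out.append(pi[j]); j = back[j]
    return out[::-1]

print(lis_via_greedy_cover([5, 3, 4, 9, 6, 2, 1, 8, 7, 10]))

On the running example this prints [3, 4, 6, 8, 10]: one number from each of the five decreasing subsequences of the greedy cover, as Lemma 12.5.3 promises.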
Definition For each position i in S1, let r(i) be the number of times that character S1(i) appears in S2, and define r = Σ_{i=1}^{n} r(i).
For example, suppose we are using the normal English alphabet; when S1 = abacx and S2 = baabca, then r(1) = 3, r(2) = 2, r(3) = 3, r(4) = 1, and r(5) = 0, so r = 9. Clearly, for any two strings, r will fall in the range 0 to nm. We will solve the lcs problem in O(r log n) time (where n ≤ m), which is inferior to O(nm) when r is large. However, r is often substantially smaller than nm, depending on the alphabet Σ. We will discuss this more fully later.

The reduction

For each alphabet character x that occurs at least once in S1, create a list of the positions where character x occurs in string S2; write this list in decreasing order. Two distinct alphabet characters will have totally disjoint lists. In the above example (S1 = abacx and S2 = baabca) the list for character a is 6, 3, 2 and the list for b is 4, 1. Now create a list called Π(S1, S2) of length r, in which each character instance in S1 is replaced with the associated list for that character. That is, for each position i in S1, insert the list associated with the character S1(i).
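The reduction itself is then a few lines of Python, reusing lis_via_greedy_cover from the sketch above (assumed to be in scope); names are ours.

def lcs_via_lis(S1, S2):
    """Longest common subsequence via the lis reduction."""
    occ = {}                              # char -> decreasing position list
    for pos, c in enumerate(S2, 1):
        occ.setdefault(c, []).insert(0, pos)
    pi = [p for c in S1 for p in occ.get(c, [])]   # the list Pi(S1, S2)
    if not pi:
        return ""
    rising = lis_via_greedy_cover(pi)     # strictly increasing S2 positions
    return "".join(S2[p - 1] for p in rising)

print(lcs_via_lis("abacx", "baabca"))     # prints "bac", one lcs here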
Suppose, for example, that S1 = abacx and S2 = baabca (as above) and S3 = babbac. Then the list for character a is (6,5), (6,2), (3,5), (3,2), (2,5), (2,2). The lists for each character are again concatenated in the order that the characters appear in string S1, forming the sequence of pairs Π(S1, S2, S3). We define an increasing subsequence in Π(S1, S2, S3) to be a subsequence of pairs such that the first numbers in each pair form an increasing subsequence of integers, and the second numbers in each pair also form an increasing subsequence of integers. We can easily modify the greedy cover algorithm to find a longest increasing subsequence of pairs under this definition. This increasing subsequence is used as follows.
Theorem 12.5.4. Every increasing subsequence in Π(S1, S2, S3) specifies an equal-length common subsequence of S1, S2, S3 and vice versa. Therefore, a longest common subsequence of S1, S2, S3 corresponds to a longest increasing subsequence in Π(S1, S2, S3).
The proof of this theorem is similar to the case of two strings and is left as an exercise. Adaptation of the greedy cover algorithm and its time analysis for the case of three strings is also left to the reader. Extension to more than three strings is immediate. The combinatorial approach to computing lcs also has a nice space-efficiency feature that we will explore in the exercises.
Under this weighting model, the cost to initiate a gap is at most 35.03 and declines with increasing evolutionary (PAM) distance between the two sequences. In addition to this initiation weight, the function adds 17.02 log₁₀ q for the actual length, q, of the gap. It is hard to believe that a function this precise could be correct, but the key point is that, for a fixed PAM distance, the proposed gap weight is a convex function of its length.¹ The alignment problem with convex gap weights is more difficult to solve than with affine gap weights, but it is not as difficult as the problem with arbitrary gap weights. In this section we develop a practical algorithm to optimally align two strings of lengths n and m > n, when the gap weights are specified by a convex function of the gap length. The algorithm runs in O(nm log m) time, in contrast to the O(nm)-time bound for affine gap weights and the O(nm²) time for arbitrary gap weights. The speedup for the convex case was established by Miller and Myers [322] and independently by Galil and Giancarlo [170].
¹ Unfortunately, there is no standard agreement on terminology, and some of the papers refer to the model as the "convex" gap weight model, while others call it the "concave" gap model. In this book, a convex function is one with a negative or zero second derivative, and a concave function is one with a positive second derivative.
Constrained lcs

The lcs method based on lis has another advantage over the standard dynamic programming approach. In some applications there are additional constraints imposed on which pairs of positions are permitted to align in the lcs. That is, in addition to the constraint that position i in S1 can align with position j in S2 only if S1(i) = S2(j), some additional constraints may apply. The reduction of lcs to lis can be easily modified to incorporate these additional constraints, and we leave the details to the reader. The effect is to reduce the size of r and consequently to speed up the entire lcs computation. This is another example and variant of sparse dynamic programming.
For convenience, we restate the general recurrences for arbitrary gap weights:

V(i, j) = max[E(i, j), F(i, j), G(i, j)],
G(i, j) = V(i - 1, j - 1) + s(S1(i), S2(j)),
E(i, j) = max_{0 ≤ k ≤ j-1} [V(i, k) - w(j - k)],
F(i, j) = max_{0 ≤ l ≤ i-1} [V(l, j) - w(i - l)].
G(i, j) is undefined when i or j is zero. Even with arbitrary gap weights, the work required by the first and second recurrences is O(m) per row, which is within our desired time bound. It is the recurrences for E(i, j) and F(i, j) that respectively require Θ(m²) time per row and Θ(n²) time per column when the function w is arbitrary. Hence, it is the evaluation of E and F for any given row or column that will be improved in the case where w is convex. We will focus on the computation of E for a single row. The computation of F and the associated time analysis for a single column is symmetric, with one caveat to be discussed later.
Simplifying notation
The value E(i, j) depends on i only through the values V(i, k) for k < j. Hence, in any fixed row, we can drop the reference to the row index i, simplifying the recurrence for E. That is, in any fixed row we define
E(j) = max_{0 ≤ k ≤ j-1} [V(k) - w(j - k)].

Define Cand(k, j) = V(k) - w(j - k); therefore,

E(j) = max_{0 ≤ k ≤ j-1} Cand(k, j).
The term Cand stands for "candidate"; the meaning of this will become clear later.
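In code, the simplified recurrence is a direct O(m²) double loop per row. This naive version (ours, with the row's V values taken as given, although in the full recurrences V(j) itself depends on E(j)) is the computation that the block-partition machinery developed below accelerates to O(m log m).

def e_values_naive(V, w):
    """E(j) = max over 0 <= k <= j-1 of Cand(k, j) = V[k] - w(j - k)."""
    m = len(V) - 1
    E = [None] * (m + 1)
    for j in range(1, m + 1):
        E[j] = max(V[k] - w(j - k) for k in range(j))
    return E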
Figure 12.16: A convex weight function w. For q < q' and any fixed d, the increase w(q + d) - w(q) is at least the increase w(q' + d) - w(q').
However, the solution in the second paper is given in terms of edit distance rather than similarity. Similarity is often more useful than edit distance because it can be used to handle the extremely important case of local comparison. Hence we will discuss convex gap weights in terms of similarity (maximum weighted alignment) and leave it to the reader to derive the analogous algorithms for computing edit distance with convex gap weights. More advanced results on alignment with convex or concave gap weights appear in [136], [138], and [276]. Recall from the discussion of arbitrary gap weights that w(q) is the weight given to a gap of length q. That gap then contributes a penalty of -w(q) to the total weight of the alignment.
Definition Assume that w(q) is a nonnegative function of q. Then w(q) is convex if and only if w(q + 1) - w(q) ≤ w(q) - w(q - 1) for every q.
That is, as a gap length increases, the additional penalty contributed by each additional unit of the gap decreases. It follows that w(q + d) - w(q) ≥ w(q' + d) - w(q') for q < q' and any fixed d (see Figure 12.16). Note that the function w can have regions of both positive and negative slope, although any region of positive slope must be to the left of the region of negative slope. Note also that the definition allows w(q) to become negative for large enough q. At that point, -w(q) becomes positive, which is probably not desirable. Hence, gap weight functions with negative slope must be used with care. The convex gap weight was introduced in [466] with the suggestion that mutational events that insert or delete varying-length blocks of DNA can be more meaningfully modeled by convex gap weights than by affine or constant gap weights. A convex gap penalty allows the modeler more specificity in reflecting the cost or probability of different gap lengths, and yet it can be handled more efficiently than arbitrary gap weights. One particular convex function that is appealing in this context is the log function, although it is not clear which base of the logarithm might be most meaningful. The argument for or against convex gap weights is still open, and the affine gap model remains dominant in practice. Still, even if the convex gap model never becomes popular in molecular biology, it could well find application elsewhere. Furthermore, the algorithm for alignment with convex gaps is of interest in itself, as a representative of a number of related algorithms in the general area of "sparse dynamic programming".
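As a quick worked check (ours, not from the text) that the gap function quoted at the start of this section fits the definition, take w(q) = 35.03 + 17.02 log₁₀ q for q ≥ 1. Then

w(q + 1) - w(q) = 17.02 log₁₀((q + 1)/q),

and since (q + 1)/q decreases as q grows, w(q + 1) - w(q) ≤ w(q) - w(q - 1) for every q ≥ 2, which is exactly the convexity condition above.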
To solve the convex gap weight case, we use the same dynamic programming recurrences developed for arbitrary gap weights (page 242), but we reduce the time needed to evaluate those recurrences.
Figure 12.17: Graphical illustration of the key observation. Winning candidates are shown with a solid curve and losers with a dashed curve. If the candidate from j loses to the candidate from k at cell j', then the candidate from j will lose to the candidate from k at every cell j'' to the right of j'.
Key observation Let j be the current cell. If Cand(j, j') ≤ E(j') for some j' > j, then Cand(j, j'') ≤ E(j'') for every j'' > j'. That is, "one strike and you're out".
Hence the current cell j need not send forward any candidate values to the right of the first cell j' > j where Cand(j, j') is less than or equal to E(j'). This suggests the obvious practical speedup of stopping the loop labeled {Loop 1} in the forward dynamic programming algorithm as soon as j's candidate loses. But this improvement does not lead directly to a better (worst-case) time bound. For that, we will have to use one more trick. But first, we prove the key observation with the following more precise lemma.
Lemma 12.6.1. Let k < j < j' < j'' be any four cells in the same row. If Cand(j, j') ≤ Cand(k, j'), then Cand(j, j'') ≤ Cand(k, j''). See Figure 12.17 for reference.

PROOF Cand(k, j') ≥ Cand(j, j') implies that V(k) - w(j' - k) ≥ V(j) - w(j' - j), so V(k) - V(j) ≥ w(j' - k) - w(j' - j). Trivially, (j' - k) = (j' - j) + (j - k). Similarly, (j'' - k) = (j'' - j) + (j - k). For future use, note that (j' - k) < (j'' - k). Now let q denote (j' - j), let q' denote (j'' - j), and let d denote (j - k). Since j' < j'', q < q'. By convexity, w(q + d) - w(q) ≥ w(q' + d) - w(q') (see Figure 12.16). Translating back, we have w(j' - k) - w(j' - j) ≥ w(j'' - k) - w(j'' - j). Combining this with the result in the first paragraph gives V(k) - V(j) ≥ w(j'' - k) - w(j'' - j), and rewriting gives V(k) - w(j'' - k) ≥ V(j) - w(j'' - j), i.e., Cand(k, j'') ≥ Cand(j, j''), as claimed. ∎
In the forward implementation, we first initialize a variable Ē(j') to Cand(0, j') for each cell j' > 0 in the row. The E values are set left to right in the row, as in backward dynamic programming. However, to set the value of E(j) (for any j > 0) the algorithm merely sets E(j) to the current value of Ē(j), since every cell to the left of j will already have contributed a candidate value to cell j. Then, before setting the value of E(j + 1), the algorithm traverses forward in the row to set Ē(j') (for each j' > j) to be the maximum of the current Ē(j') and Cand(j, j'). To summarize, the forward implementation for a fixed row is:

For j := 1 to m do
begin
E(j) := Ē(j);
V(j) := max[G(j), E(j), F(j)];
{Loop 1} For j' := j + 1 to m do Ē(j') := max[Ē(j'), Cand(j, j')];
end.
Figure 12.19: The three possible ways that the block-partition changes after E(1) is set. The curves with arrows represent the common pointer for the block and leave from the last entry in the block. The Ē values in these three cases are the values before any Ē changes.
Cells 2 through m might all remain in a single block with common pointer b = 0. This happens if and only if Cand(1, 2) ≤ Ē(2).

Cells 2 through m might get divided into two blocks, where the common pointer for the first block is b = 1, and the common pointer for the second is b = 0. This happens (again by Lemma 12.6.1) if and only if, for some k < m, Cand(1, j') > Ē(j') for j' from 2 to k, and Cand(1, j') ≤ Ē(j') for j' from k + 1 to m.

Cells 2 through m might remain in a single block, but now with the common pointer b set to 1. This happens if and only if Cand(1, j') > Ē(j') for j' from 2 to m.
Figure 12.19 illustrates the three possibilities. Therefore, before making any changes to the Ē values, the new partition of the cells from 2 to m can be efficiently computed as follows: The algorithm first compares Ē(2) and Cand(1, 2). If Ē(2) ≥ Cand(1, 2), then all the cells to the right of cell 2 remain in a single block with common b pointer set to zero. However, if Ē(2) < Cand(1, 2), then the algorithm searches for the left-most cell j' > 2 such that Ē(j') ≥ Cand(1, j'). If j' is found, then cells 2 through j' - 1 form a new block with common pointer to cell one, and the remaining cells form another block with common pointer to cell zero. If no j' is found, then all cells 2 through m remain in a single block, but the common pointer is changed to one. Now for the punch line: By Corollary 12.6.1, this search for j' can be done by binary search. Hence only O(log m) comparisons are used in searching for j'. And, since we only record one b pointer per block, at most one pointer update is needed.

Now consider the general case of j > 1. Suppose that E(j) has just been set and that the cells j + 1, ..., m are presently partitioned into r maximal blocks ending at cells p1 < p2 < ... < pr = m. The block ending at pi will be called the ith block. We use bi to denote the common pointer for cells in block i. We assume that the algorithm has a list of the end-of-block positions p1 < p2 < ... < pr and a parallel list of common pointers b1 > b2 > ... > br. After E(j) is set, the new partition of cells j + 1 through m is found in the following way: First, if Ē(j + 1) ≥ Cand(j, j + 1) then, by Lemma 12.6.1, Ē(j') ≥ Cand(j, j') for all j' > j, so the partition of cells greater than j remains unchanged. Otherwise (if Ē(j + 1) < Cand(j, j + 1)), the algorithm successively compares Ē(pi) to Cand(j, pi), for i = 1, 2, ..., until it finds the first index s where j's candidate loses; the blocks before s coalesce into a block whose common pointer is j, and a binary search inside block s finds the exact split point, as the pseudocode below makes precise.
Figure 12.18: Partition of the cells j + 1 through m into maximal blocks of consecutive cells such that all the cells in any block have the same b value. The common b value in any block is less than the common b value in the preceding block.
k < j' that has contributed the best candidate yet seen for cell j'. Pointer b(j') is updated every time the value of E(j') changes. The use of these pointers combined with the next lemma leads ultimately to the desired speedup.
Lemma 12.6.2. Consider the point when j is the current cell, but before j sends forward any candidate values. At that point, b(j') ≥ b(j' + 1) for every cell j' from j + 1 to m - 1.
PROOF For notational simplicity, let b(j') = k and b(j' + 1) = k'. Then, by the selection of k, Cand(k, j') ≥ Cand(k', j'). Now suppose k < k'. Then, by Lemma 12.6.1, Cand(k, j' + 1) ≥ Cand(k', j' + 1), in which case b(j' + 1) should be set to k, not k'. Hence k ≥ k' and the lemma is proved. □
Corollary 12.6.1. At the point that j is the current cell but before j sends forward any candidates, the values of the b pointers form a nonincreasing sequence from left to right. Therefore, cells j, j + 1, j + 2, . . . , m are partitioned into maximal blocks of consecutive cells such that all b pointers in the block have the same value, and the pointer values decline in successive blocks. Definition The partition of cells j through m referred to in Corollary 12.6.1 is called the current block-partition. See Figure 12.18.
Given Corollary 12.6.1, the algorithm doesn't need to explicitly maintain a b pointer for every cell but only record the common b pointer for each block. This fact will next be exploited to achieve the desired speedup.
V(j) := max[G(j), E(j), F(j)]; {As before, we assume that the needed F and G values have been computed.}
{Now see how j's candidates change the block-partition.}
Set j' equal to the first entry on the end-of-block list.
{Look for the first index s in the end-of-block list where j's candidate loses.}
If Cand(b(j'), j + 1) < Cand(j, j + 1) then {j's candidate wins one}
begin
    While the end-of-block list is not empty and Cand(b(j'), j') < Cand(j, j') do
    begin
        remove the first entry on the end-of-block list, and remove the corresponding b-pointer;
        If the end-of-block list is not empty then set j' to the new first entry on the end-of-block list.
    end {while};
    If the end-of-block list is empty then place m at the head of that list;
    Else {when the end-of-block list is not empty}
    begin
        Let p_s denote the first end-of-block entry.
        Using binary search over the cells in block s, find the right-most point p in that block such that Cand(j, p) > Cand(b_s, p).
        Add p to the head of the end-of-block list;
    end;
end;
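A Python rendering of this partition-maintenance step might look as follows. It is a sketch only: the candidate function cand(k, jp) = V(k) - w(jp - k) is passed in, and the partition is represented by parallel lists of end-of-block positions and common b pointers, a representation assumed here rather than prescribed by the text.

    def update_block_partition(ends, bs, j, m, cand):
        # Revise the block-partition after E(j) is set and cell j is current.
        # ends: end-of-block positions p_1 < ... < p_r (with p_r == m)
        # bs:   parallel common b pointers b_1 > ... > b_r
        if j + 1 > m or cand(bs[0], j + 1) >= cand(j, j + 1):
            return  # j loses at j+1, hence everywhere (Lemma 12.6.1)
        # Test successive end-of-block cells; every popped block coalesces
        # into the new block whose common pointer is j.
        last_win = j + 1  # j's candidate is known to win at cell j+1
        while ends and cand(bs[0], ends[0]) < cand(j, ends[0]):
            last_win = ends.pop(0)
            bs.pop(0)
        if not ends:
            ends.insert(0, m)   # all of j+1..m is one block pointing to j
            bs.insert(0, j)
            return
        # Binary search inside the surviving block s for the right-most cell
        # where j's candidate still wins; split the block there.
        lo, hi = last_win, ends[0]  # j wins at lo and loses at hi
        while hi - lo > 1:
            mid = (lo + hi) // 2
            if cand(j, mid) > cand(bs[0], mid):
                lo = mid
            else:
                hi = mid
        ends.insert(0, lo)
        bs.insert(0, j)

Each iteration of the while loop coalesces one block, matching the amortized O(m) bound on block-finding comparisons derived below, and the final binary search uses O(log m) comparisons.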
Time analysis
Every comparison made by the algorithm either is used to compute an E value for the current cell or is involved in maintaining the current block-partition. Hence the total time for the algorithm is proportional to the number of those comparisons. In iteration j, when j is the current cell, the comparisons are divided into those used to find block s and those used in the binary search to split block s. If the algorithm does l ≥ 2 comparisons to find s in iteration j, then at least l - 1 full blocks coalesce into a single block. The binary search then splits at most one block into two. Hence if, in iteration j, the algorithm does l ≥ 2 comparisons to find s, then the total number of blocks decreases by at least l - 2. If it does one or two comparisons, then the total number of blocks increases by at most one. Since the algorithm begins with a single block and there are m iterations, it follows that over the entire algorithm at most O(m) comparisons are done to find every s, excluding the comparisons done during the binary searches. Clearly, the total number of comparisons used in the m binary searches is O(m log m). Hence we have

Theorem 12.6.1. For any fixed row, all the E(j) values can be computed in O(m log m) total time.
Figure 12.20: To update the block-partition, the algorithm successively examines cells p_i to find the first index s where E(p_s) ≥ Cand(j, p_s). In this figure, s is 4. Blocks 1 through s - 1 = 3 coalesce into a single block with some initial part of block s = 4. Blocks to the right of s remain unchanged.
for i from 1 to r, until either the end-of-block list is exhausted or until it finds the first index s with E(p_s) ≥ Cand(j, p_s). In the first case, the cells j + 1, ..., m fall into a single block with common pointer to cell j. In the second case, the blocks s + 1 through r remain unchanged, but all the blocks 1 through s - 1 coalesce with some initial part (possibly all) of block s, forming one block with common pointer to cell j (see Figure 12.20). Note that every comparison but the last one results in two neighboring blocks coalescing into one. Having found block s, the algorithm finds the proper place to split block s by doing binary search over the cells in the block. This is exactly as in the case already discussed for j = 1.
Figure 12.21: A single block with t = 4 drawn inside the full dynamic programming table. The distance values in the part of the block labeled F are determined by the values in the parts labeled A, B, and C together with the substrings of S1 and S2 in D and E. Note that A is the intersection of the first row and column of the block.
Consider the standard dynamic programming approach to computing the edit distance of two strings S1 and S2. The value D(i, j) given to any cell (i, j), when i and j are both greater than 0, is determined by the values in its three neighboring cells, (i - 1, j - 1), (i - 1, j), and (i, j - 1), and by the characters in positions i and j of the two strings. By extension, the values given to the cells in an entire t-block, with upper left-hand corner at position (i, j) say, are determined by the values in the first row and column of the t-block together with the substrings S1[i..i + t - 1] and S2[j..j + t - 1] (see Figure 12.21). Another way to state this observation is the following:
Lemma 12.7.1. The distance values in a t-block starting in position (i, j) are a function of the values in its first row and column and the substrings S1[i..i + t - 1] and S2[j..j + t - 1].
Definition Given Lemma 12.7.1, and using the notation shown in Figure 12.21, we define the block function as the function from the five inputs (A, B, C, D, E) to the output F.
It follows that the values in the last row and column of a t-block are also a function of the inputs (A, B, C, D, E). We call the function from those inputs to the values in the last row and column of a t-block the restricted block function. Notice that the total size of the input and the size of the output of the restricted block function is O(t).
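In code, the restricted block function is just the standard edit distance recurrence run over one t-block. The sketch below is an illustration with hypothetical names: s1_sub[r] and s2_sub[c] are assumed to hold the characters for block row r and block column c (index 0 unused), and first_row and first_col share the corner value A.

    def restricted_block(first_row, first_col, s1_sub, s2_sub):
        # Last row and column of a t-block from its first row and column
        # (inputs A, B, C of Figure 12.21) and the block's substrings (D, E).
        t = len(first_row)
        D = [list(first_row)] + [[0] * t for _ in range(t - 1)]
        for r in range(1, t):
            D[r][0] = first_col[r]   # first_col[0] is the shared corner A
        for r in range(1, t):
            for c in range(1, t):
                diff = 0 if s1_sub[r] == s2_sub[c] else 1
                D[r][c] = min(D[r - 1][c] + 1,        # space in S2
                              D[r][c - 1] + 1,        # space in S1
                              D[r - 1][c - 1] + diff) # match or mismatch
        return D[t - 1], [D[r][t - 1] for r in range(t)]

Both the input and the output here have size O(t), but this direct evaluation costs O(t^2) time; the point of the precomputation discussed below is to replace it with a lookup.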
The case of F values is essentially symmetric

A similar algorithm and analysis are used to compute the F values, except that for F(i, j)
the lists partition column j from cell i through n. There is, however, one point that might cause confusion: Although the analysis for F focuses on the work in a single column and is symmetric to the analysis for E in a single row, the computations of E and F are actually interleaved since, by the recurrences, each V(i, j) value depends on both E(i, j) and F(i, j). Even though both the E values and the F values are computed rowwise (since V is computed rowwise), one row after another, E(i, j) is computed just prior to the computation of E(i, j + 1), while between the computation of F(i, j) and F(i + 1, j), m - 1 other F values will be computed (m - j in row i and j - 1 in row i + 1). So although the analysis treats the work in a column as if it is done in one contiguous time interval, the algorithm actually breaks up the work in any given column. Only O(nm) total time is needed to compute the G values and to compute every V(i, j) once E(i, j) and F(i, j) are known. In summary we have
Theorem 12.6.2. When the gap weight w is a convex function of the gap length, an optimal alignment can be computed in O(nm log m) time, where m ≥ n are the lengths of the two strings.
The rough idea of the Four-Russians method is to partition the dynamic programming table into t-blocks and compute the essential values in the table one t-block at a time, rather than one cell at a time. The goal is to spend only O(t) time per block (rather than Θ(t^2) time), achieving a factor of t speedup over the standard dynamic programming solution. In the exposition given below, the partition will not be exactly achieved, since neighboring t-blocks will overlap somewhat. Still, the rough idea given here does capture the basic flavor and advantage of the method presented below. That method will compute the edit distance in O(n^2/log n) time, for two strings of length n (again assuming a fixed alphabet).
This reflects our general level of ignorance about ethnicities in the then Soviet Union.
In the case of edit distance, the precomputation suggested by the Four-Russians idea is to enumerate all possible inputs to the restricted block function (the proper size of the block will be determined later), compute the resulting output values (a t-length row and a t-length column) for each input, and store the outputs indexed by the inputs. Every time a specific restricted block function must be computed in step 3 of the block edit distance algorithm, the value of the function is then retrieved from the precomputed values and need not be computed. This clearly works to compute the edit distance D(n, n), but is it any faster than the original O(n^2) method? Astute readers should be skeptical, so please suspend disbelief for now.
Accounting detail

Assume first that all the precomputation has been done. What time is needed to execute the block edit distance algorithm? Recall that the sizes of the input and the output of the restricted block function are both O(t). It is not difficult to organize the input-output values of the (precomputed) restricted block function so that the correct output for any specific input can be retrieved in O(t) time. Details are left to the reader. There are O(n^2/t^2) blocks, hence the total time used by the block edit distance algorithm is O(n^2/t). Setting t to Θ(log n), the time is O(n^2/log n). However, in the unit-cost RAM model of computation, each output value can be retrieved in constant time since t = O(log n). In that case, the time for the method is reduced to O(n^2/(log n)^2).

But what about the precomputation time? The key issue involves the number of input choices to the restricted block function. By definition, every cell has an integer from zero to n, so there are (n + 1)^t possible values for any t-length row or column. If the alphabet has size σ, then there are σ^t possible substrings of length t. Hence the number of distinct input combinations to the restricted block function is (n + 1)^{2t} σ^{2t}. For each input, it takes Θ(t^2) time to evaluate the last row and column of the resulting t-block (by running the standard dynamic program). Thus the overall time used in this way to precompute the function outputs for all possible input choices is O((n + 1)^{2t} σ^{2t} t^2). But t must be at least one, so Ω(n^2) time is used in this way. No progress yet! The idea is right, but we need another trick to make it work.
Lemma 12.7.2. In any row, column, or diagonal of the dynamic programming table for edit distance, two adjacent cells can have values that differ by at most one.
PROOF
Certainly, D(i, j) ≤ D(i, j - 1) + 1. Conversely, if the optimal alignment of S1[1..i] and S2[1..j] matches S2(j) to some character of S1, then by simply omitting S2(j) and aligning its mate against a space, the distance increases by at most one. If S2(j) is not matched, then its omission reduces the distance by one. Hence D(i, j - 1) ≤ D(i, j) + 1, and the lemma is proved for adjacent row cells. Similar reasoning holds along a column. In the case of adjacent cells in a diagonal, it is easy to see that D(i, j) ≤ D(i - 1, j - 1) + 1. Conversely, if the optimal alignment of S1[1..i] and S2[1..j] aligns i against j,
Figure 12.22: An edit distance table for n = 9. With t = 4, the table is covered by nine overlapping blocks. The center block is outlined with darker lines for clarity. In general, if n = k(t - 1), then the (n + 1) by (n + 1) table will be covered by k^2 overlapping t-blocks.
1. Cover the (n + 1) by (n + 1) dynamic programming table with t-blocks, where the last column of every t-block is shared with the first column of the t-block to its right (if any), and the last row of every t-block is shared with the first row of the t-block below it (if any). (See Figure 12.22.) In this way, and since n = k(t - 1), the table will consist of k rows and k columns of partially overlapping t-blocks.
2. Initialize the values in the first row and column of the full table according to the base conditions of the recurrence.
3. In a rowwise manner, use the restricted block function to successively determine the values in the last row and last column of each block. By the overlapping nature of the blocks, the values in the last column (or row) of a block are the values in the first column (or row) of the block to its right (or below it).
4. The value in cell (n, n) is the edit distance of S1 and S2.
end.
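As a sketch (not the book's code), the four steps might be realized as follows, calling the restricted_block sketch from earlier directly; a true Four-Russians implementation replaces that call with a lookup of precomputed outputs. The full (n + 1) by (n + 1) table is kept here only for clarity.

    def block_edit_distance(s1, s2, t):
        # Edit distance of s1 and s2, both of length n = k(t-1), computed
        # one t-block at a time (steps 1-4 above).
        n = len(s1)
        assert len(s2) == n and n % (t - 1) == 0
        k = n // (t - 1)
        D = [[0] * (n + 1) for _ in range(n + 1)]
        for c in range(n + 1):
            D[0][c] = c                  # step 2: base conditions
        for r in range(n + 1):
            D[r][0] = r
        for bi in range(k):              # step 3, rowwise over blocks
            for bj in range(k):
                i, j = bi * (t - 1), bj * (t - 1)  # upper-left corner
                first_row = D[i][j:j + t]
                first_col = [D[i + r][j] for r in range(t)]
                s1_sub = ' ' + s1[i:i + t - 1]  # chars for block rows 1..t-1
                s2_sub = ' ' + s2[j:j + t - 1]  # chars for block cols 1..t-1
                last_row, last_col = restricted_block(first_row, first_col,
                                                      s1_sub, s2_sub)
                for c in range(t):       # write the shared boundaries
                    D[i + t - 1][j + c] = last_row[c]
                for r in range(t):
                    D[i + r][j + t - 1] = last_col[r]
        return D[n][n]                   # step 4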
Of course, the heart of the algorithm is step 3, where specific instances of the restricted block function must be computed. Any instance of the restricted block function can be computed in O(t^2) time, but that gains us nothing. So how is the restricted block function computed?
Time analysis
As in the analysis of the block edit distance algorithm, the execution of the Four-Russians edit distance algorithm takes O(n^2/log n) time (or O(n^2/(log n)^2) time in the unit-cost RAM model) by setting t to Θ(log n). So again, the key issue is the time needed to precompute the block offset function. Recall that the first entry of an offset vector must be zero, so there are 3^{t-1} possible offset vectors. There are σ^t ways to specify a substring over an alphabet with σ characters, and so there are 3^{2(t-1)} σ^{2t} ways to specify the input to the offset function. For any specific input choice, the output is computed in O(t^2) time (via dynamic programming), hence the entire precomputation takes O(3^{2t} σ^{2t} t^2) time. Setting t equal to (log_{3σ} n)/2, the precomputation time is just O(n (log n)^2). In summary, we have
Theorem 12.7.2. The edit distance of two strings of length n can be computed in O(n^2/log n) time, or in O(n^2/(log n)^2) time in the unit-cost RAM model.
then D(i - 1, j - 1) ≤ D(i, j) + 1. If the optimal alignment doesn't align i against j, then at least one of the characters, S1(i) or S2(j), must align against a space, and D(i - 1, j - 1) ≤ D(i, j). □

Given Lemma 12.7.2, we can encode the values in a row of a t-block by a t-length vector specifying the value of the first entry in the row and then specifying the difference (offset) of each successive cell value from its left neighbor: A zero indicates equality, a one indicates an increase by one, and a minus one indicates a decrease by one. For example, the row of distances 5, 4, 4, 5 would be encoded by the row of offsets 5, -1, 0, +1. Similarly, we can encode the values in any column by such offset encoding. Since there are only (n + 1) 3^{t-1} distinct vectors of this type, a change to offset encoding is surely a move in the right direction. We can, however, reduce the number of possible vectors even further.
Definition The offset vector is a t-length vector of values from {-1, 0, 1}, where the first entry must be zero.
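As a small illustration (helper names hypothetical), converting between value vectors and offset vectors is immediate:

    def to_offsets(values):
        # [5, 4, 4, 5] -> [0, -1, 0, +1]; the leading value is dropped and
        # the first entry of the offset vector is always zero.
        return [0] + [b - a for a, b in zip(values, values[1:])]

    def from_offsets(first_value, offsets):
        # Rebuild the actual values once the first value is known.
        vals, cur = [], first_value
        for d in offsets:
            cur += d
            vals.append(cur)
        return vals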
The key to making the Four-Russians method efficient is to compute edit distance using only offset vectors rather than actual distance values. Because the number of possible offset vectors is much less than the number of possible vectors of distance values, much less precomputation will be needed. We next show that edit distance can be computed using offset vectors.
Theorem 12.7.1. Consider a t-block with upper left corner in position (i, j). The two offset vectors for the last row and last column of the block can be determined from the two offset vectors for the first row and column of the block and from the substrings S1[i..i + t - 1] and S2[j..j + t - 1]. That is, no D value is needed in the input in order to determine the offset vectors in the last row and column of the block.
PROOF The proof is essentially a close examination of the dynamic programming recurrences for edit distance. Denote the unknown value of D(i, j) by C. Then for any column q in the block, D(i, q) equals C plus the total of the offset values in row i from column j + 1 to column q. Hence even if the algorithm doesn't know the value of C, it can express D(i, q) as C plus an integer that it can determine. Each D(q, j) can be similarly expressed. Let D(i, j + 1) be C + J and let D(i + 1, j) be C + I, where the algorithm can know I and J. Now consider cell (i + 1, j + 1). D(i + 1, j + 1) is equal to D(i, j) = C if character S1(i + 1) matches S2(j + 1). Otherwise, D(i + 1, j + 1) equals the minimum of D(i, j + 1) + 1, D(i + 1, j) + 1, and D(i, j) + 1, i.e., the minimum of C + J + 1, C + I + 1, and C + 1. The algorithm can make this comparison by comparing I and J (which it knows) to the number zero. So the algorithm can correctly express D(i + 1, j + 1) as C, C + I + 1, C + J + 1, or C + 1. Continuing in this way, the algorithm can correctly express each D value in the block as an unknown C plus some integer that it can determine. Since every term involves the same unknown constant C, the offset vectors can be correctly determined by the algorithm. □
Definition The function that determines the two offset vectors for the last row and last column from the two offset vectors for the first row and column of a block, together with the substrings S1[i..i + t - 1] and S2[j..j + t - 1], is called the offset function.
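Theorem 12.7.1 licenses a simple realization of the offset function: since the output offsets are independent of the unknown corner value C, one may set C = 0, decode the input offset vectors into concrete (possibly negative) values, run the restricted block function, and re-encode. A sketch, reusing the earlier hypothetical helpers:

    def offset_function(row_offsets, col_offsets, s1_sub, s2_sub):
        # Offset vectors of a block's last row and column from those of its
        # first row and column plus the block's substrings. The corner C is
        # set to 0; by Theorem 12.7.1 the resulting offsets are correct, even
        # though the intermediate "distance" values may be negative.
        first_row = from_offsets(0, row_offsets)
        first_col = from_offsets(0, col_offsets)
        last_row, last_col = restricted_block(first_row, first_col,
                                              s1_sub, s2_sub)
        return to_offsets(last_row), to_offsets(last_col)

In the precomputation, this function would be evaluated once for each of the 3^{2(t-1)} σ^{2t} possible inputs and its outputs stored for lookup.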
We now have all the pieces of the Four-Russians-type algorithm to compute edit distance. We again assume, for simplicity, that each string has length n = k(t - 1) for some k.
Prove the lemma and then show how to exploit it in the solution to the threshold P-againstall problem. Try to estimate how effective the lemma is in practice. Be sure to consider how the output is efficiently collected when the dynamic programming ends high in the tree, before a leaf is reached.
11. Give a complete proof of the correctness of the all-against-all suffix tree algorithm.
12. Another, faster, alternative to the P-against-all problem is to change the problem slightly as follows: For each position i in T such that there is a substring starting at i with edit distance less than d from P, report only the smallest such substring starting at position i. This is the (P-against-all) starting location problem, and it can be solved by modifying the approach discussed for the threshold P-against-all problem. The starting location problem (actually the equivalent ending location problem) is the subject of a paper by Ukkonen [437]. In that paper, Ukkonen develops three hybrid dynamic programming methods in the same spirit as those presented in this chapter, but with additional technical observations. The main result of that paper was later improved by Cobbs [105].
Detail a solution to the starting location problem, using a hybrid dynamic programming approach.
13. Show that the suffix tree methods and time bounds for the P-against-all and the all-against-all problems extend to the problem of computing similarity instead of edit distance.
14. Let R be a regular expression. Show how to modify the P-against-all method to solve the R-against-all problem. That is, show how to use a suffix tree to efficiently search for a substring in a large text T that matches the regular expression R. (This problem is from [63].)
Now extend the method to allow for a bounded number of errors in the match.
15. Finish the proof of Theorem 12.5.2.

16. Show that in any permutation of the n integers from 1 to n, there is either an increasing subsequence of length at least √n or a decreasing subsequence of length at least √n. Show that, averaged over all the n! permutations, the average length of the longest increasing subsequence is at least √n/2. Show that the lower bound of √n/2 cannot be tight.
17. What do the results from the previous problem imply for the lcs problem?

18. If S is a subsequence of another string S', then S' is said to be a supersequence of S. If two strings S1 and S2 are subsequences of S', then S' is a common supersequence of S1 and S2. That leads to the following natural question: Given two strings S1 and S2, what is the shortest supersequence common to both S1 and S2? This problem is clearly related to the longest common subsequence problem. Develop an explicit relationship between the two problems and the lengths of their solutions. Then develop efficient methods to find a shortest common supersequence of two strings. For additional results on subsequences and supersequences see [240] and [241].
19. Can the results in the previous problem be generalized to the case of more than two strings? For instance, is there a natural relationship between the longest common subsequence and the shortest common supersequence of three strings?
20. Let T be a string whose characters come from an alphabet Σ with σ characters. A subsequence S of T is nondecreasing if each successive character in S is lexically greater than or equal to the preceding character. For example, using the English alphabet, let T = characterstring; then S = aacrst is a nondecreasing subsequence of T. Give an algorithm that finds the longest nondecreasing subsequence of a string T in time O(nσ), where n is the length of T. How does this bound compare to the O(n log n) bound given for the longest increasing subsequence problem over integers?
21. Recall the definition of r given for two strings in Section 12.5.2 on page 290. Extend the
12.8. Exercises
1. Show how to compute the value V(n, m) of the optimal alignment using only min(n, m) + 1 space in addition to the space needed to represent the two input strings.
2. Modify Hirschberg's method to work for alignment with a gap penalty (affine and general) in the objective function. It may be helpful to use both the affine gap recurrences developed in the text and the alternative recurrences that pay for a gap when it terminates. The latter recurrences were developed in exercise 27 of Chapter 11.
3. Hirschberg's method computes one optimal alignment. Try to find ways to modify the method to produce more (all?) optimal alignments, while still achieving substantial space reduction and maintaining a good time bound compared to the O(nm)-time and space method. I believe this is an open area.
4. Show how to reduce the size of the strip needed in the method of Section 12.2.3 when |m - n| < k.

5. Fill in the details of how to find the actual alignments of P in T that occur with at most k differences. The method uses the O(km) values stored during the k differences algorithm. The solution is somewhat simpler if the k differences algorithm also stores a sparse set of pointers recording how each farthest-reaching d-path extends a farthest-reaching (d - 1)-path. These pointers only take O(km) space and are a sparse version of the standard dynamic programming pointers. Fill in the details for this approach as well.
6. The k differences problem is an unweighted (or unit weighted) alignment problem defined in terms of the number of mismatches and spaces. Can the O(km) result be extended to operator- or alphabet-weighted versions of alignment? The answer is: not completely. Explain why not. Then find special cases of weighted alignment, and plausible uses for these cases, where the result does extend.
7. Prove Lemma 12.3.2 from page 274.

8. Prove Lemma 12.3.4 from page 277.
9. Prove Theorem 12.4.2 that concerns space use in the P-against-all problem.
The P-against-all problem was introduced first because it most directly illustrates one general approach to using suffix trees to speed up dynamic programming computations. And it has been proposed that such a massive study of how P relates to substrings of T can be important in certain problems [183]. Nonetheless, for most applications the output of the P-against-all problem is excessive, and a more focused computation is desirable. The threshold P-against-all problem is of this type: Given strings P and T and a threshold d, find every substring T' of T such that the edit distance between P and T' is less than d. Of course, it would be cheating to first solve the P-against-all problem and then filter out the substrings of T whose edit distance to P is d or greater. We want a method whose speed is related to d. The computation should increase in speed as d falls. The idea is to follow the solution to the P-against-all problem, doing a depth-first traversal of suffix tree T, but recognize subtrees that need not be traversed. The following lemma is the key.
Lemma 12.8.1. In the P-against-all problem, suppose that the current path in the suffix tree specifies a substring S of T and that the current dynamic programming column (including the zero row) contains no values below d. Then the column representing any extension of S will also contain no values below d. Hence no columns need be computed for any extensions of S.
method seems more justified. In fact, why not pick a "reasonable" value for t, do the precomputation of the offset function once for that t, and then embed the offset function in an edit distance algorithm to be used for all future edit distance computations? Discuss the merits and demerits of this proposal.

32. The Four-Russians method presented in the text only computes the edit distance. How can it be modified to compute the edit transcript as well?
33. Show how to apply the Four-Russians method to strings of unequal length.
34. What problems arise in trying to extend the Four-Russians method and the improved time bound to the weighted edit distance problem? Are there restrictions on weights (other than equality) that make the extension easier?
35. Following the lines of the previous question, show in detail how the Four-Russians approach can be used to solve the longest common subsequence problem between two strings of length n in O(n^2/log n) time.
definition of r to the longest common subsequence problem for more than two strings, and use r to express the time for finding an lcs in this case.
22. Show how to model and solve the lis problem as a shortest path problem in a directed acyclic graph. Are there any advantages to viewing the problem in this way?
23. Suppose we only want to learn the length of the lcs of two strings S1 and S2. That can be done, as before, in O(r log n) time, but now only using linear space. The key is to keep only the last element in each list of the cover (when computing the lis), and not to generate all of Π(S1, S2) at once, but to generate (in linear space) parts of Π(S1, S2) on the fly. Fill in the details of these ideas and show that the length of the lcs can be computed as quickly as before in only linear space.
Open problem: Extend the above combinatorial ideas to show how to compute the actual lcs of two strings using only linear space, without increasing the needed time. Then extend to more than two strings.
24. (This problem requires a knowledge of systolic arrays.) Show how to implement the longest increasing subsequence algorithm to run in O(n) time on an O(n)-element systolic array (remember that each array element has only constant memory). To make the problem simpler, first consider how to compute the length of the lis, and then work out how to compute the actual increasing subsequence.
25. Work out how to compute the lcs in O(n) time on an O(n)-element systolic array.
26. We have reduced the lcs problem to the lis problem. Show how to do the reduction in the opposite direction.
27. Suppose each character in S1 and S2 is given an individual weight. Give an algorithm to find an increasing subsequence of maximum total weight.
28. Derive an O(nm log m)-time method to compute edit distance for the convex gap weight model.

29. The idea of forward dynamic programming can be used to speed up (in practice) the (global) alignment of two strings, even when gaps are not included in the objective function. We will explain this in terms of computing unweighted edit distance between strings S1 and S2 (of lengths n and m, respectively), but the basic idea works for computing similarity as well. Suppose a cell (i, j) is reached during the (forward) dynamic programming computation of edit distance and the value there is D(i, j). Suppose also that there is a fast way to compute a lower bound, L(i, j), on the distance between substrings S1[i + 1..n] and S2[j + 1..m]. If D(i, j) + L(i, j) is greater than or equal to a known distance between S1 and S2 obtained from some particular alignment, then there is no need to propagate candidate values forward from cell (i, j). The question now is to find efficient methods to compute "effective" values of L(i, j). One simple one is |n - m + j - i|. Explain this. Try it out in practice to see how effective it is. Come up with other simple lower bounds that are much more effective.
Hint: Use the count of the number of times each character appears in each string.
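For concreteness, here is a tiny sketch of the pruning test (hypothetical helper names); it only illustrates the simple bound stated above, not the stronger bounds the exercise asks for.

    def simple_lower_bound(i, j, n, m):
        # Any alignment of S1[i+1..n] with S2[j+1..m] must use at least
        # |(n - i) - (m - j)| spaces, hence at least that many differences.
        # Note abs((n - i) - (m - j)) equals |n - m + j - i|.
        return abs((n - i) - (m - j))

    def can_prune(d_ij, i, j, n, m, known_distance):
        # True if cell (i, j) need not propagate candidate values forward.
        return d_ij + simple_lower_bound(i, j, n, m) >= known_distance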
30. As detailed in the text, the Four-Russians method precomputes the offset function for 3^{2(t-1)} σ^{2t} specifications of input values. However, the problem statement and time bound allow the precomputation of the offset function to be done after strings S1 and S2 are known. Can that observation be used to reduce the running time? An alternative encoding of strings allows the σ^{2t} term to be changed to (t + 2)^t, even in problem settings where S1 and S2 are not known when the precomputation is done. Discover and explain the encoding and how edit distance is computed when using it.
31. Consider the situation when the edit distance must be computed for each pair of strings from a large set of strings. In that situation, the precomputation needed by the Four-Russians