Z algorithm in Python
The Z algorithm is a powerful string-matching algorithm used to find all occurrences of a pattern within a text. It operates efficiently, with a linear time complexity of O(n+m), where n is the length of the text and m is the length of the pattern. This makes it particularly useful for problems involving large texts. In this article, we'll explore the Z algorithm, understand its underlying concepts, and learn how to implement it in Python.
What is the Z Algorithm?
The Z algorithm computes an array, known as the Z-array, for a given string. The Z-array at position i stores the length of the longest substring starting from i that is also a prefix of the string. This information can then be used to efficiently search for a pattern within a text.
Z-array Definition:
Given a string S of length n, the Z-array Z is defined as follows: Z[i] is the length of the longest substring starting from S[i] which is also a prefix of S.
Example:
Consider the string S = "aabcaabxaaaz". The Z-array for S is calculated as follows:
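- Z[0] = 12 by convention (the entire string is trivially a prefix of itself; some implementations, including the one in this article, leave Z[0] as 0 because it is never used)
- Z[1] = 1 (the suffix starting at index 1 is "abcaabxaaaz"; only its first character "a" matches the prefix "aabc...")
- Z[2] = 0 (the suffix starting at index 2 begins with "b", which does not match the first character "a")
- Z[3] = 0 (the suffix starting at index 3 begins with "c", which does not match the first character "a")
- Z[4] = 3 (the suffix starting at index 4 begins with "aabx"; "aab" matches the prefix, then "x" differs from "c")
- and so on.
The Z-array for S is [12, 1, 0, 0, 3, 1, 0, 0, 2, 2, 1, 0].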
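A hand-computed Z-array can be checked against a brute-force version that applies the definition directly. The sketch below is only an illustration: `naive_z` is a hypothetical helper name, it takes quadratic time in the worst case, and it stores n at index 0 by convention.

```python
def naive_z(s):
    """Compute the Z-array directly from the definition (quadratic worst case)."""
    n = len(s)
    z = [0] * n
    for i in range(n):
        length = 0
        # Extend the match between the prefix of s and the suffix starting at i.
        while i + length < n and s[length] == s[i + length]:
            length += 1
        z[i] = length
    return z


print(naive_z("aabcaabxaaaz"))
# [12, 1, 0, 0, 3, 1, 0, 0, 2, 2, 1, 0]
```

Because it follows the definition literally, this version is useful as a reference when testing a faster implementation.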
The Z Algorithm: Step-by-Step
Here's a detailed breakdown of how the Z algorithm works:
- Initialization:
  - Start with the entire string S, and initialize the Z-array Z with zeroes.
  - Set the variables L and R to 0. These variables delimit the current Z-box: a window S[L..R] (both ends inclusive) that matches a prefix of S.
- Iterate through the string: For each position i from 1 to n - 1:
  - Case 1: If i > R, then no Z-box covers i (that is, no previously found substring matching a prefix of S starts before i and extends to i or beyond).
    - Set L = R = i and extend R to the right as long as R < n and S[R] == S[R-L].
    - Set Z[i] = R - L and decrement R so that it again points at the last matching character.
  - Case 2: If i ≤ R, then i falls within a Z-box. Use the previously computed Z-values to determine the value of Z[i]:
    - Sub-case 2a: If Z[i-L] < R - i + 1, the match ends inside the Z-box, so Z[i] = Z[i-L].
    - Sub-case 2b: If Z[i-L] ≥ R - i + 1, the match reaches the edge of the Z-box and may continue past it. Set L = i and extend R as long as R < n and S[R] == S[R-L]. Set Z[i] = R - L and decrement R.
- Output the Z-array: After processing all positions in the string, the Z-array contains the lengths of the longest substrings starting from each position that match the prefix of S.
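To see which case handles each position, the steps above can be traced with a small instrumented variant. This is a sketch for illustration only: `z_with_cases` and the labels "1", "2a", "2b" are hypothetical names, and Z[0] is left as 0.

```python
def z_with_cases(s):
    """Return (z, cases); cases[i] records which branch handled position i."""
    n = len(s)
    z = [0] * n
    cases = [None] * n  # index 0 is never processed by the loop
    l = r = 0
    for i in range(1, n):
        if i > r:
            # Case 1: i lies outside the current Z-box, compare from scratch.
            cases[i] = "1"
            l = r = i
            while r < n and s[r] == s[r - l]:
                r += 1
            z[i] = r - l
            r -= 1
        elif z[i - l] < r - i + 1:
            # Sub-case 2a: the mirrored value fits inside the Z-box.
            cases[i] = "2a"
            z[i] = z[i - l]
        else:
            # Sub-case 2b: the match reaches the box edge, keep extending.
            cases[i] = "2b"
            l = i
            while r < n and s[r] == s[r - l]:
                r += 1
            z[i] = r - l
            r -= 1
    return z, cases


z, cases = z_with_cases("aabcaabxaaaz")
for i in range(1, len(z)):
    print(i, cases[i], z[i])
```

For "aabcaabxaaaz", positions 5 and 6 are resolved by sub-case 2a without any character comparison, positions 9 and 10 fall into sub-case 2b, and every other position goes through Case 1.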
Implementing the Z Algorithm in Python:
To understand the Z algorithm better, let's break down the implementation step by step.
- calculate_z(s):
  - This function computes the Z-array for a given string s.
  - The Z-array is an array where the value at each position i indicates the length of the longest substring starting from s[i] which is also a prefix of s.
- z_algorithm(pattern, text):
  - This function uses the Z Algorithm to search for all occurrences of pattern in text.
  - It concatenates the pattern, a unique delimiter ($), and the text to create a combined string.
  - It then computes the Z-array for the combined string and checks for positions in the Z-array where the Z-value equals the length of the pattern, indicating a match.
Below is the implementation of the above approach:
def calculate_z(s):
    n = len(s)  # Length of the input string
    z = [0] * n  # Initialize Z-array with zeros (z[0] is left as 0)
    l, r = 0, 0  # Left and right boundaries (inclusive) of the current Z-box
    for i in range(1, n):
        # Case 1: i is outside the current Z-box
        if i > r:
            l, r = i, i
            while r < n and s[r] == s[r - l]:
                r += 1
            z[i] = r - l
            r -= 1
        # Case 2: i is inside the current Z-box
        else:
            k = i - l
            # Case 2a: Value does not stretch outside the Z-box
            if z[k] < r - i + 1:
                z[i] = z[k]
            # Case 2b: Value stretches outside the Z-box
            else:
                l = i
                while r < n and s[r] == s[r - l]:
                    r += 1
                z[i] = r - l
                r -= 1
    return z


def z_algorithm(pattern, text):
    # Concatenate pattern, delimiter, and text.
    # The delimiter "$" is assumed not to occur in pattern or text.
    combined = pattern + "$" + text
    # Calculate Z-array for the combined string
    z = calculate_z(combined)
    # Length of the pattern
    pattern_length = len(pattern)
    # List to store the result indices
    result = []
    for i in range(len(z)):
        # If Z-value equals pattern length, pattern is found
        if z[i] == pattern_length:
            # Convert the index in the combined string to an index in text
            result.append(i - pattern_length - 1)
    return result


# Example usage:
pattern = "abc"
text = "ababcabc"
result = z_algorithm(pattern, text)
print("Pattern found at indices:", result)  # Output should be [2, 5]
Output
Pattern found at indices: [2, 5]
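Overlapping occurrences are reported as well, which is easy to confirm by comparing against Python's built-in `str.find`. The sketch below is self-contained and uses a compact formulation of the Z-array that tracks a half-open window [l, r); it is a common alternative to the version above, not the same code. The names `z_function`, `find_all_z`, and `find_all_builtin` are hypothetical, and returning an empty list for an empty pattern is an assumption of this sketch.

```python
def z_function(s):
    """Z-array via a compact variant that tracks a half-open window [l, r)."""
    n = len(s)
    z = [0] * n
    l = r = 0
    for i in range(1, n):
        if i < r:
            # Reuse the mirrored value, capped at the window edge.
            z[i] = min(r - i, z[i - l])
        # Extend the match by direct comparison.
        while i + z[i] < n and s[z[i]] == s[i + z[i]]:
            z[i] += 1
        if i + z[i] > r:
            l, r = i, i + z[i]
    return z


def find_all_z(pattern, text, sep="$"):
    """Start indices of pattern in text; sep must not occur in either string."""
    if not pattern:
        return []  # assumption of this sketch: an empty pattern yields no matches
    m = len(pattern)
    z = z_function(pattern + sep + text)
    return [i - m - 1 for i in range(m + 1, len(z)) if z[i] == m]


def find_all_builtin(pattern, text):
    """Reference result using str.find, advancing one position at a time."""
    hits = []
    start = text.find(pattern)
    while start != -1:
        hits.append(start)
        start = text.find(pattern, start + 1)
    return hits


print(find_all_z("aa", "aaaa"))  # [0, 1, 2]
print(find_all_z("aa", "aaaa") == find_all_builtin("aa", "aaaa"))  # True
```

Scanning only positions after the delimiter avoids checking indices that belong to the pattern itself.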
Time Complexity: O(n + m), where n is the length of the text and m is the length of the pattern. The Z-array is computed over the combined string of length n + m + 1, and the right boundary of the Z-box never moves left, so the total number of character comparisons is linear in that length. A single pass over the Z-array then reports all occurrences of the pattern.
Auxiliary Space: O(n + m), because the combined string and its Z-array each have length n + m + 1.
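The linear bound can be checked empirically by counting character comparisons. The following is a rough sketch rather than a proof: `count_comparisons` is a hypothetical helper that merges Case 1 and Sub-case 2b into one branch, and the 3n ceiling used in the check is a loose bound for this particular formulation.

```python
def count_comparisons(s):
    """Run the Z algorithm on s and return the number of character comparisons."""
    n = len(s)
    z = [0] * n
    comparisons = 0
    l = r = 0
    for i in range(1, n):
        if i <= r and z[i - l] < r - i + 1:
            # Sub-case 2a: answered from the Z-box without comparing characters.
            z[i] = z[i - l]
            continue
        # Case 1 or Sub-case 2b: extend the window by explicit comparisons.
        l = i
        r = max(r, i)
        while r < n:
            comparisons += 1
            if s[r] != s[r - l]:
                break
            r += 1
        z[i] = r - l
        r -= 1
    return comparisons


for n in (1_000, 10_000, 100_000):
    s = "a" * n  # highly repetitive input, a bad case for brute force
    c = count_comparisons(s)
    print(n, c, c <= 3 * n)
```

On inputs like these, a brute-force computation needs on the order of n² comparisons, while the count here grows in proportion to n.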