Aho-Corasick Algorithm in Python
Given an input text and an array of k words, arr[], find all occurrences of every word in the text. Let n be the length of the text and m be the total number of characters in all words, i.e. m = length(arr[0]) + length(arr[1]) + … + length(arr[k-1]).
Examples:
Input:
text = "hello worldhello"
arr = ["hello", "world"]
Output: {'hello': [0, 11], 'world': [6]}
Explanation: In the given text "hello worldhello", the pattern "hello" appears at indices 0 and 11, and the pattern "world" appears at index 6.
Input:
text = "abxabcabcaby"
arr = ["ab", "abc", "aby"]
Output: {'ab': [0, 3, 6, 9], 'abc': [3, 6], 'aby': [9]}
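A brute-force baseline is handy for checking the expected output of examples like these, and later for testing an Aho-Corasick implementation against. The sketch below is not part of the algorithm itself; `naive_find_all` is a hypothetical helper name, and it simply calls `str.find` repeatedly for each word.

```python
def naive_find_all(text, words):
    """Return {word: [start indices]} by scanning the text once per word."""
    result = {word: [] for word in words}
    for word in words:
        start = text.find(word)
        while start != -1:
            result[word].append(start)
            # Advance by one position so overlapping occurrences are not skipped
            start = text.find(word, start + 1)
    return result


print(naive_find_all("hello worldhello", ["hello", "world"]))
# {'hello': [0, 11], 'world': [6]}
print(naive_find_all("abxabcabcaby", ["ab", "abc", "aby"]))
# {'ab': [0, 3, 6, 9], 'abc': [3, 6], 'aby': [9]}
```

This baseline rescans the text for every word, which is the cost Aho-Corasick avoids by matching all words in one pass.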
Aho-Corasick Algorithm:
The Aho-Corasick algorithm is a string-searching algorithm that constructs a finite state machine representing all keywords to be searched for. It’s a kind of dictionary-matching algorithm that locates elements of a finite set of strings (the “dictionary”) within an input text. It matches all strings simultaneously.
Step-by-step explanation of the algorithm:
Build Trie (Keyword Tree):
- Create a root node.
- For each keyword in the given list, add it to the trie.
- If a keyword ends at a node, add it to the output list of that node.
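Below is an implementation of the above idea:
class AhoCorasickNode:
    def __init__(self, char):
        self.char = char     # character on the edge into this node (None for the root)
        self.children = {}   # maps a character to the child node
        self.output = []     # patterns that end at this node
        self.failure = None  # failure link, filled in by build_failure_function


def build_trie(patterns):
    root = AhoCorasickNode(None)  # root node of the trie
    # Iterate over each pattern in the list of patterns
    for pattern in patterns:
        node = root
        # Iterate over each character in the pattern
        for char in pattern:
            # If the character is not in the children of the current node, add a new child node
            if char not in node.children:
                node.children[char] = AhoCorasickNode(char)
            # Move to the child node
            node = node.children[char]
        # Add the pattern to the output of the node where it ends
        node.output.append(pattern)
    return root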
Build Failure Links:
- Use BFS to traverse the trie.
- For each node, set its failure link to the node that spells the longest proper suffix of the string on the path to the current node that is also a prefix of some keyword in the trie. If no such suffix exists, set the failure link to the root node.
Below is an implementation of the above idea:
from collections import deque


def build_failure_function(root):
    queue = deque()
    # Initialize the failure link of the root's children to the root itself
    for node in root.children.values():
        node.failure = root
        queue.append(node)
    # Breadth-first traversal of the trie to compute the failure links
    while queue:
        current_node = queue.popleft()
        # For each child of the current node
        for char, child_node in current_node.children.items():
            queue.append(child_node)
            failure_node = current_node.failure
            # Follow failure links until a node with a matching child is found
            # or we step past the root (the root's failure link is None)
            while failure_node and char not in failure_node.children:
                failure_node = failure_node.failure
            # Update the failure link of the child node
            child_node.failure = failure_node.children[char] if failure_node else root
            # Inherit the patterns that end at the failure node
            child_node.output.extend(child_node.failure.output)
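To see what the failure links look like on a concrete keyword set, the following self-contained sketch builds them for the words "he", "she", "his" and "hers" and prints where each node falls back to. The `Node` class, its `label` attribute (the string spelled from the root) and the `failure_table` helper are assumptions added for illustration only; the link construction follows the same BFS idea as above.

```python
from collections import deque


class Node:
    def __init__(self, label=""):
        self.label = label    # string spelled by the path from the root (illustration only)
        self.children = {}
        self.failure = None   # stays None for the root


def build(patterns):
    root = Node()
    # Insert every pattern into the trie
    for pattern in patterns:
        node = root
        for char in pattern:
            if char not in node.children:
                node.children[char] = Node(node.label + char)
            node = node.children[char]
    # Depth-1 nodes always fall back to the root
    queue = deque()
    for child in root.children.values():
        child.failure = root
        queue.append(child)
    # BFS: a node's link is derived from its parent's already computed link
    while queue:
        current = queue.popleft()
        for char, child in current.children.items():
            queue.append(child)
            fallback = current.failure
            while fallback is not None and char not in fallback.children:
                fallback = fallback.failure
            child.failure = fallback.children[char] if fallback is not None else root
    return root


def failure_table(root):
    """Map each node's label to the label of its failure target."""
    table = {}
    queue = deque(root.children.values())
    while queue:
        node = queue.popleft()
        table[node.label] = node.failure.label
        queue.extend(node.children.values())
    return table


table = failure_table(build(["he", "she", "his", "hers"]))
for label, target in sorted(table.items()):
    print(repr(label), "->", repr(target))
```

In this example the only links that do not point to the root (shown as an empty string) are "sh" to "h", "she" to "he", "his" to "s" and "hers" to "s".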
Search the Text:
- Start at the root node of the trie.
- For each character in the text:
  - If the current node has no child for the character, follow failure links until a node with such a child is found or the root is passed.
  - If a matching child exists, move to it; otherwise restart from the root.
  - For every keyword in the output list of the node reached, record its start position in the text.
Below is an implementation of the above idea:
def search(text, patterns):
    root = build_trie(patterns)
    build_failure_function(root)
    current_node = root
    results = {}  # Maps a start index to the list of patterns found there
    # Iterate over each character in the text
    for i, char in enumerate(text):
        # Follow failure links until a matching child is found or we step past the root
        while current_node and char not in current_node.children:
            current_node = current_node.failure
        # No node can consume this character: restart from the root
        if not current_node:
            current_node = root
            continue
        # A matching child was found, so move to that child
        current_node = current_node.children[char]
        # Record the start index of every pattern that ends at this position
        for pattern in current_node.output:
            start_index = i - len(pattern) + 1
            if start_index not in results:
                results[start_index] = []
            results[start_index].append(pattern)
    return results
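Note that `search` returns a dictionary keyed by start index, while the examples at the top list indices per pattern. A small conversion is sketched below; `group_by_pattern` is a hypothetical helper, and `sample` is a hand-written dictionary in the shape `search` returns rather than actual program output.

```python
def group_by_pattern(results_by_index):
    """Turn {start_index: [patterns]} into {pattern: [start indices]}."""
    grouped = {}
    # Visit indices in increasing order so each pattern's list ends up sorted
    for start_index in sorted(results_by_index):
        for pattern in results_by_index[start_index]:
            grouped.setdefault(pattern, []).append(start_index)
    return grouped


# Hand-written sample for text "hello worldhello" and patterns ["hello", "world"]
sample = {0: ["hello"], 6: ["world"], 11: ["hello"]}
print(group_by_pattern(sample))  # {'hello': [0, 11], 'world': [6]}
```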
Implementation of Aho-Corasick Algorithm in Python:
The Aho-Corasick algorithm finds all occurrences of multiple patterns in a single pass over the text. Here is a complete Python implementation that combines the three steps:
from collections import deque


class TrieNode:
    def __init__(self):
        # Initialize TrieNode attributes
        self.children = {}  # maps a character to the child node
        self.output = []    # keywords that end at this node
        self.fail = None    # failure link (stays None for the root)


def build_automaton(keywords):
    # Initialize root node of the trie
    root = TrieNode()

    # Build trie
    for keyword in keywords:
        node = root
        # Traverse the trie and create nodes for each character
        for char in keyword:
            node = node.children.setdefault(char, TrieNode())
        # Add keyword to the output list of the final node
        node.output.append(keyword)

    # Build failure links using BFS (deque gives O(1) pops from the left)
    queue = deque()
    # Start from root's children
    for node in root.children.values():
        queue.append(node)
        node.fail = root

    # Breadth-first traversal of the trie
    while queue:
        current_node = queue.popleft()
        # Traverse each child node
        for key, next_node in current_node.children.items():
            queue.append(next_node)
            fail_node = current_node.fail
            # Find the longest proper suffix that is also a prefix
            while fail_node and key not in fail_node.children:
                fail_node = fail_node.fail
            # Set failure link of the child node
            next_node.fail = fail_node.children[key] if fail_node else root
            # Add output patterns of the failure node to the child node's output
            next_node.output += next_node.fail.output

    return root


def search_text(text, keywords):
    # Build the Aho-Corasick automaton
    root = build_automaton(keywords)

    # Initialize result dictionary
    result = {keyword: [] for keyword in keywords}

    current_node = root
    # Traverse the text
    for i, char in enumerate(text):
        # Follow failure links until a match is found
        while current_node and char not in current_node.children:
            current_node = current_node.fail

        # No node can consume this character: restart from the root
        if not current_node:
            current_node = root
            continue

        # Move to the next node based on current character
        current_node = current_node.children[char]

        # Record matches found at this position
        for keyword in current_node.output:
            result[keyword].append(i - len(keyword) + 1)

    return result


# Example 1
text1 = "hello worldhello"
arr1 = ["hello", "world"]
result1 = search_text(text1, arr1)
print(result1)

# Example 2
text2 = "abxabcabcaby"
arr2 = ["ab", "abc", "aby"]
result2 = search_text(text2, arr2)
print(result2)
Output
{'hello': [0, 11], 'world': [6]}
{'ab': [0, 3, 6, 9], 'abc': [3, 6], 'aby': [9]}
Time Complexity:
- Building the automaton: O(m+k)
- Searching the text: O(n+z), where z is the total number of occurrences of all keywords in the text.
Auxiliary Space: O(m+k)
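The z term matters because overlapping keywords can produce more matches than there are characters in the text. The sketch below counts occurrences with a regular-expression lookahead purely to illustrate this; `count_occurrences` is a hypothetical helper and is not how the automaton itself finds matches.

```python
import re


def count_occurrences(text, patterns):
    """Count all (possibly overlapping) occurrences of each pattern in text."""
    # A zero-width lookahead matches at every start position without consuming
    # characters, so overlapping occurrences are all counted.
    return {p: len(re.findall(f"(?={re.escape(p)})", text)) for p in patterns}


text = "aaaa"
counts = count_occurrences(text, ["a", "aa", "aaa"])
z = sum(counts.values())
print(counts)        # {'a': 4, 'aa': 3, 'aaa': 2}
print(len(text), z)  # 4 9
```

Here n = 4 but z = 9, so the time spent reporting matches exceeds the time spent scanning the text.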