The complete guide to string processing algorithms used in competitive programming, software engineering interviews, and real-world applications like search engines, spell checkers, DNA sequence analysis, and text editors. From pattern matching to suffix structures.
KMP finds all occurrences of a pattern in a text in O(n + m) time, where n is the text length and m is the pattern length. The key insight is the failure function (also called the prefix function): when a mismatch occurs, we already know some characters in the text match the beginning of the pattern, so we never re-examine characters we have already matched.
lps[i] = length of the longest proper prefix of pattern[0..i] that is also a suffix of pattern[0..i].
Example: pattern = "ABACABAB"
lps = [0, 0, 1, 0, 1, 2, 3, 2]
At index 6, "ABACABA" has prefix "ABA" matching suffix "ABA", so lps[6] = 3.
Naive string matching is O(n*m). KMP guarantees linear time because we never backtrack in the text: the text pointer only moves forward, and the pattern pointer can fall back (via failure links) at most as many times as it has previously advanced, so the total work is O(n + m).
Python
def compute_lps(pattern: str) -> list[int]:
"""Build the longest proper prefix-suffix array."""
m = len(pattern)
lps = [0] * m
length = 0 # length of previous longest prefix suffix
i = 1
while i < m:
if pattern[i] == pattern[length]:
length += 1
lps[i] = length
i += 1
else:
if length != 0:
length = lps[length - 1] # fall back, don't increment i
else:
lps[i] = 0
i += 1
return lps
def kmp_search(text: str, pattern: str) -> list[int]:
"""Return all starting indices where pattern occurs in text."""
n, m = len(text), len(pattern)
if m == 0:
return []
lps = compute_lps(pattern)
matches = []
i = 0 # index in text
j = 0 # index in pattern
while i < n:
if text[i] == pattern[j]:
i += 1
j += 1
if j == m:
matches.append(i - j)
j = lps[j - 1] # continue searching for overlapping matches
elif i < n and text[i] != pattern[j]:
if j != 0:
j = lps[j - 1]
else:
i += 1
return matches
# Example
print(kmp_search("ABABDABACDABABCABAB", "ABABCABAB")) # [9]
print(kmp_search("AAAAAA", "AAA")) # [0, 1, 2, 3] (overlapping)
C++
#include <algorithm>
#include <cstring>
#include <string>
#include <vector>
using namespace std;
vector<int> computeLPS(const string& pat) {
int m = pat.size();
vector<int> lps(m, 0);
int len = 0, i = 1;
while (i < m) {
if (pat[i] == pat[len]) {
lps[i++] = ++len;
} else if (len) {
len = lps[len - 1];
} else {
lps[i++] = 0;
}
}
return lps;
}
vector<int> kmpSearch(const string& text, const string& pat) {
int n = text.size(), m = pat.size();
if (m == 0) return {};  // guard: lps[j - 1] below would be out of range
vector<int> lps = computeLPS(pat), res;
for (int i = 0, j = 0; i < n; ) {
if (text[i] == pat[j]) { i++; j++; }
if (j == m) { res.push_back(i - m); j = lps[j - 1]; }
else if (i < n && text[i] != pat[j]) {
j ? j = lps[j - 1] : i++;
}
}
return res;
}
Time: O(n + m) for both preprocessing and search.
Space: O(m) for the LPS array.
The Z-algorithm computes, for each position i in a string, the length of the longest substring starting at i that matches a prefix of the string. This Z-array can be used for pattern matching by concatenating pattern + "$" + text and looking for Z-values equal to the pattern length.
Z[i] = length of the longest substring starting at position i that matches a prefix of the string.
Example: s = "aabxaab"
Z = [-, 1, 0, 0, 3, 1, 0] (Z[0] is undefined/set to 0 by convention)
Z[4] = 3 because "aab" starting at index 4 matches the prefix "aab".
Python
def z_function(s: str) -> list[int]:
"""Compute the Z-array in O(n) time."""
n = len(s)
z = [0] * n
l, r = 0, 0 # [l, r) is the rightmost Z-box
for i in range(1, n):
if i < r:
z[i] = min(r - i, z[i - l])
while i + z[i] < n and s[z[i]] == s[i + z[i]]:
z[i] += 1
if i + z[i] > r:
l, r = i, i + z[i]
return z
def z_search(text: str, pattern: str) -> list[int]:
"""Find all occurrences of pattern in text using Z-algorithm."""
combined = pattern + "$" + text
z = z_function(combined)
m = len(pattern)
return [i - m - 1 for i in range(m + 1, len(combined)) if z[i] == m]
# Example
print(z_search("ABABDABACDABABCABAB", "ABABCABAB")) # [9]
C++
vector<int> zFunction(const string& s) {
int n = s.size();
vector<int> z(n, 0);
for (int i = 1, l = 0, r = 0; i < n; i++) {
if (i < r) z[i] = min(r - i, z[i - l]);
while (i + z[i] < n && s[z[i]] == s[i + z[i]]) z[i]++;
if (i + z[i] > r) { l = i; r = i + z[i]; }
}
return z;
}
vector<int> zSearch(const string& text, const string& pat) {
string combined = pat + "$" + text;
vector<int> z = zFunction(combined);
int m = pat.size();
vector<int> res;
for (int i = m + 1; i < (int)combined.size(); i++)
if (z[i] == m) res.push_back(i - m - 1);
return res;
}
Both are O(n + m) for pattern matching. The Z-algorithm is often simpler to implement and reason about. KMP is more natural for streaming data, since it processes one character at a time (see the sketch below). In competitive programming, the Z-algorithm is often preferred for its simplicity.
Time: O(n) to build the Z-array. O(n + m) for pattern search.
Space: O(n + m) for the combined string.
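To make the streaming point concrete, here is a minimal sketch of KMP wrapped as an online matcher that consumes one character at a time. It reuses compute_lps from the KMP section; the StreamingKMP class and its feed method are illustrative names, not a standard API.
Python
class StreamingKMP:
    """Consume text one character at a time; detect pattern completions."""
    def __init__(self, pattern: str):
        self.pattern = pattern
        self.lps = compute_lps(pattern)  # reuse the helper from the KMP section
        self.j = 0  # length of the pattern prefix currently matched

    def feed(self, ch: str) -> bool:
        """Feed one character; return True if an occurrence ends here."""
        while self.j > 0 and ch != self.pattern[self.j]:
            self.j = self.lps[self.j - 1]  # fall back via failure links
        if ch == self.pattern[self.j]:
            self.j += 1
        if self.j == len(self.pattern):
            self.j = self.lps[self.j - 1]  # keep state for overlapping matches
            return True
        return False

matcher = StreamingKMP("ABA")
print([i for i, ch in enumerate("ABABA") if matcher.feed(ch)])  # [2, 4]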
Rabin-Karp uses a rolling hash to compare the pattern hash with substring hashes in O(1) per position. If hashes match, we verify character by character. It is especially useful for multi-pattern matching (searching for many patterns at once).
hash(s[0..m-1]) = (s[0]*p^(m-1) + s[1]*p^(m-2) + ... + s[m-1]*p^0) mod M
To slide the window one position right:
hash(s[1..m]) = (hash(s[0..m-1]) - s[0]*p^(m-1)) * p + s[m]
Common choices: p = 31 or 131, M = 10^9 + 7 or 10^9 + 9
A single hash can collide. For competitive programming, use double hashing (two different mod values) to make collision probability negligible (~10^-18). Always verify matches character by character when correctness matters.
Python
def rabin_karp(text: str, pattern: str) -> list[int]:
"""Rabin-Karp with rolling hash. Returns all match positions."""
n, m = len(text), len(pattern)
if m > n:
return []
BASE = 131
MOD = 10**18 + 9
# Precompute BASE^m mod MOD
power = pow(BASE, m, MOD)
# Compute hash of pattern and first window
pat_hash = 0
win_hash = 0
for i in range(m):
pat_hash = (pat_hash * BASE + ord(pattern[i])) % MOD
win_hash = (win_hash * BASE + ord(text[i])) % MOD
matches = []
for i in range(n - m + 1):
if win_hash == pat_hash:
# Verify to avoid false positives
if text[i:i + m] == pattern:
matches.append(i)
# Slide window
if i + m < n:
win_hash = (win_hash * BASE - ord(text[i]) * power + ord(text[i + m])) % MOD
return matches
# Example
print(rabin_karp("abcabcabc", "abc")) # [0, 3, 6]
C++
vector<int> rabinKarp(const string& text, const string& pat) {
int n = text.size(), m = pat.size();
if (m > n) return {};
const long long BASE = 131, MOD = 1000000007LL;
long long power = 1;
for (int i = 0; i < m; i++) power = power * BASE % MOD;
long long ph = 0, wh = 0;
for (int i = 0; i < m; i++) {
ph = (ph * BASE + pat[i]) % MOD;
wh = (wh * BASE + text[i]) % MOD;
}
vector<int> res;
for (int i = 0; i <= n - m; i++) {
if (wh == ph && text.substr(i, m) == pat)
res.push_back(i);
if (i + m < n)
wh = ((wh * BASE - text[i] * power % MOD + MOD) % MOD + text[i + m]) % MOD;
}
return res;
}
Time: O(n + m) expected, O(n*m) worst case (adversarial inputs that force many hash collisions). Double hashing makes such collisions vanishingly unlikely, though the theoretical worst case remains.
Space: O(1) extra (beyond input).
Edit distance measures the minimum number of single-character operations (insert, delete, replace) to transform one string into another. This is one of the most practically important string algorithms -- it powers spell checkers, autocorrect, fuzzy search, diff tools, DNA sequence alignment, and typing accuracy services.
dp[i][j] = edit distance between s1[0..i-1] and s2[0..j-1]
Base cases: dp[i][0] = i, dp[0][j] = j
Transition:
If s1[i-1] == s2[j-1]: dp[i][j] = dp[i-1][j-1]
Else: dp[i][j] = 1 + min(dp[i-1][j], dp[i][j-1], dp[i-1][j-1])
The three choices are: delete from s1, insert into s1, replace in s1.
Typing services like MonkeyType and TypeRacer use edit distance to measure accuracy. When you type "hte" instead of "the", standard Levenshtein distance counts 2 operations (two substitutions), while Damerau-Levenshtein counts 1 (a single transposition). Services compute:
Accuracy = 1 - (edit_distance / max(len(typed), len(expected)))
Fuzzy matching in search (like fzf, VS Code's Ctrl+P) uses edit distance or weighted variants where certain operations cost less (e.g., transpositions are common typos and may count as 1 operation via Damerau-Levenshtein).
Example: "kitten" -> "sitting": k->s (replace), e->i (replace), insert g at the end = edit distance 3
Full DP table (rows = "kitten", cols = "sitting"):
"" s i t t i n g "" 0 1 2 3 4 5 6 7 k 1 1 2 3 4 5 6 7 i 2 2 1 2 3 4 5 6 t 3 3 2 1 2 3 4 5 t 4 4 3 2 1 2 3 4 e 5 5 4 3 2 2 3 4 n 6 6 5 4 3 3 2 3
Answer: dp[6][7] = 3
Python
def edit_distance(s1: str, s2: str) -> int:
"""Compute minimum edit distance using O(n) space."""
m, n = len(s1), len(s2)
prev = list(range(n + 1))
for i in range(1, m + 1):
curr = [i] + [0] * n
for j in range(1, n + 1):
if s1[i - 1] == s2[j - 1]:
curr[j] = prev[j - 1]
else:
curr[j] = 1 + min(prev[j], curr[j - 1], prev[j - 1])
prev = curr
return prev[n]
def edit_distance_with_ops(s1: str, s2: str) -> tuple[int, list[str]]:
"""Return edit distance and the actual operations to transform s1 -> s2."""
m, n = len(s1), len(s2)
dp = [[0] * (n + 1) for _ in range(m + 1)]
for i in range(m + 1): dp[i][0] = i
for j in range(n + 1): dp[0][j] = j
for i in range(1, m + 1):
for j in range(1, n + 1):
if s1[i - 1] == s2[j - 1]:
dp[i][j] = dp[i - 1][j - 1]
else:
dp[i][j] = 1 + min(dp[i - 1][j], dp[i][j - 1], dp[i - 1][j - 1])
# Backtrack to find operations
ops = []
i, j = m, n
while i > 0 or j > 0:
if i > 0 and j > 0 and s1[i - 1] == s2[j - 1]:
i -= 1; j -= 1
elif i > 0 and j > 0 and dp[i][j] == dp[i - 1][j - 1] + 1:
ops.append(f"Replace '{s1[i-1]}' with '{s2[j-1]}' at pos {i-1}")
i -= 1; j -= 1
elif i > 0 and dp[i][j] == dp[i - 1][j] + 1:
ops.append(f"Delete '{s1[i-1]}' at pos {i-1}")
i -= 1
else:
ops.append(f"Insert '{s2[j-1]}' at pos {i}")
j -= 1
ops.reverse()
return dp[m][n], ops
# Typing accuracy example
def typing_accuracy(expected: str, typed: str) -> float:
"""Calculate typing accuracy as used by typing speed services."""
dist = edit_distance(expected, typed)
max_len = max(len(expected), len(typed))
return 1.0 - dist / max_len if max_len > 0 else 1.0
# Examples
print(edit_distance("kitten", "sitting")) # 3
dist, ops = edit_distance_with_ops("kitten", "sitting")
for op in ops: print(op)
print(f"Typing accuracy: {typing_accuracy('the quick brown', 'teh quikc brown'):.1%}") # 86.7%
C++
int editDistance(const string& s1, const string& s2) {
int m = s1.size(), n = s2.size();
vector<int> prev(n + 1), curr(n + 1);
for (int j = 0; j <= n; j++) prev[j] = j;
for (int i = 1; i <= m; i++) {
curr[0] = i;
for (int j = 1; j <= n; j++) {
if (s1[i-1] == s2[j-1])
curr[j] = prev[j-1];
else
curr[j] = 1 + min({prev[j], curr[j-1], prev[j-1]});
}
swap(prev, curr);
}
return prev[n];
}
Standard Levenshtein counts "ab" -> "ba" as 2 operations (delete + insert or two replaces). Damerau-Levenshtein adds transposition as a single operation, which is more natural for typos. Add this check: if s1[i-1] == s2[j-2] and s1[i-2] == s2[j-1], then also consider dp[i-2][j-2] + 1.
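A minimal sketch of that variant (the restricted Damerau-Levenshtein, also known as optimal string alignment), using the same DP layout as above:
Python
def damerau_levenshtein(s1: str, s2: str) -> int:
    """Restricted Damerau-Levenshtein: adjacent transposition costs 1."""
    m, n = len(s1), len(s2)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1): dp[i][0] = i
    for j in range(n + 1): dp[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if s1[i - 1] == s2[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # delete
                           dp[i][j - 1] + 1,         # insert
                           dp[i - 1][j - 1] + cost)  # replace / match
            if (i > 1 and j > 1 and s1[i - 1] == s2[j - 2]
                    and s1[i - 2] == s2[j - 1]):
                dp[i][j] = min(dp[i][j], dp[i - 2][j - 2] + 1)  # transpose
    return dp[m][n]

print(damerau_levenshtein("hte", "the"))  # 1 (vs. 2 for standard Levenshtein)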
Time: O(m * n)
Space: O(min(m, n)) with rolling array optimization, O(m * n) if you need backtracking.
LCS finds the longest sequence of characters that appear in both strings in the same order (not necessarily contiguous). It is used in diff tools, version control, bioinformatics, and plagiarism detection.
dp[i][j] = LCS length of s1[0..i-1] and s2[0..j-1]
If s1[i-1] == s2[j-1]: dp[i][j] = dp[i-1][j-1] + 1
Else: dp[i][j] = max(dp[i-1][j], dp[i][j-1])
Python
def lcs(s1: str, s2: str) -> str:
"""Return the actual LCS string via DP + backtracking."""
m, n = len(s1), len(s2)
dp = [[0] * (n + 1) for _ in range(m + 1)]
for i in range(1, m + 1):
for j in range(1, n + 1):
if s1[i - 1] == s2[j - 1]:
dp[i][j] = dp[i - 1][j - 1] + 1
else:
dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
# Backtrack to build the LCS
result = []
i, j = m, n
while i > 0 and j > 0:
if s1[i - 1] == s2[j - 1]:
result.append(s1[i - 1])
i -= 1; j -= 1
elif dp[i - 1][j] > dp[i][j - 1]:
i -= 1
else:
j -= 1
return "".join(reversed(result))
# Example
print(lcs("ABCBDAB", "BDCAB")) # "BCAB" (length 4)
C++
string lcs(const string& s1, const string& s2) {
int m = s1.size(), n = s2.size();
vector<vector<int>> dp(m + 1, vector<int>(n + 1, 0));
for (int i = 1; i <= m; i++)
for (int j = 1; j <= n; j++)
dp[i][j] = s1[i-1] == s2[j-1]
? dp[i-1][j-1] + 1
: max(dp[i-1][j], dp[i][j-1]);
string res;
for (int i = m, j = n; i > 0 && j > 0; ) {
if (s1[i-1] == s2[j-1]) { res += s1[--i]; --j; }
else if (dp[i-1][j] > dp[i][j-1]) i--;
else j--;
}
reverse(res.begin(), res.end());
return res;
}
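If you only need the length, a rolling-array version keeps just two rows over the shorter string; a minimal sketch:
Python
def lcs_length(s1: str, s2: str) -> int:
    """LCS length only, using O(min(m, n)) space."""
    if len(s1) < len(s2):
        s1, s2 = s2, s1  # make s2 the shorter string
    n = len(s2)
    prev = [0] * (n + 1)
    for ch in s1:
        curr = [0] * (n + 1)
        for j in range(1, n + 1):
            if ch == s2[j - 1]:
                curr[j] = prev[j - 1] + 1
            else:
                curr[j] = max(prev[j], curr[j - 1])
        prev = curr
    return prev[n]

print(lcs_length("ABCBDAB", "BDCAB"))  # 4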
Time: O(m * n)
Space: O(m * n) for the full table (needed for backtracking), O(min(m,n)) for length only.
A suffix array is a sorted array of all suffixes of a string, represented by their starting indices. Combined with an LCP (Longest Common Prefix) array, it enables powerful operations: finding any substring in O(m log n), counting distinct substrings, and computing longest repeated substrings.
s = "banana$"
All suffixes: banana$, anana$, nana$, ana$, na$, a$, $
Sorted: $, a$, ana$, anana$, banana$, na$, nana$
Suffix Array (indices): [6, 5, 3, 1, 0, 4, 2]
LCP Array: [0, 1, 3, 0, 0, 2] (LCP between consecutive sorted suffixes)
Python
def build_suffix_array(s: str) -> list[int]:
"""Build suffix array in O(n log^2 n) using rank doubling."""
n = len(s)
sa = list(range(n))
rank = [ord(c) for c in s]
tmp = [0] * n
k = 1
while k < n:
def cmp_key(i):
return (rank[i], rank[i + k] if i + k < n else -1)
sa.sort(key=cmp_key)
# Recompute ranks
tmp[sa[0]] = 0
for i in range(1, n):
tmp[sa[i]] = tmp[sa[i - 1]]
if cmp_key(sa[i]) != cmp_key(sa[i - 1]):
tmp[sa[i]] += 1
rank = tmp[:]
if rank[sa[-1]] == n - 1:
break
k *= 2
return sa
def build_lcp(s: str, sa: list[int]) -> list[int]:
"""Build LCP array in O(n) using Kasai's algorithm."""
n = len(s)
rank = [0] * n
for i in range(n):
rank[sa[i]] = i
lcp = [0] * (n - 1)
k = 0
for i in range(n):
if rank[i] == 0:
k = 0
continue
j = sa[rank[i] - 1]
while i + k < n and j + k < n and s[i + k] == s[j + k]:
k += 1
lcp[rank[i] - 1] = k
if k > 0:
k -= 1
return lcp
# Example
s = "banana"
sa = build_suffix_array(s)
lcp = build_lcp(s, sa)
print("SA:", sa) # [5, 3, 1, 0, 4, 2]
print("LCP:", lcp) # [1, 3, 0, 0, 2]
# Number of distinct substrings = n*(n+1)/2 - sum(lcp)
n = len(s)
distinct = n * (n + 1) // 2 - sum(lcp)
print(f"Distinct substrings: {distinct}") # 15
1. Find any substring in O(m log n) via binary search on the suffix array (see the sketch after this list).
2. Count distinct substrings: n*(n+1)/2 - sum(LCP).
3. Longest repeated substring: max value in the LCP array.
4. Longest common substring of two strings: concatenate with separator, build SA + LCP, find max LCP between suffixes from different strings.
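A sketch of application 1, reusing build_suffix_array from above (sa_search is our own helper name). Each comparison slices up to m characters, and suffixes starting with the pattern form a contiguous block in the suffix array, so two binary searches find all occurrences in O(m log n):
Python
def sa_search(s: str, sa: list[int], pattern: str) -> list[int]:
    """All start positions of pattern in s via binary search on the suffix array."""
    n, m = len(sa), len(pattern)
    lo, hi = 0, n
    while lo < hi:  # leftmost suffix whose first m chars are >= pattern
        mid = (lo + hi) // 2
        if s[sa[mid]:sa[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    left = lo
    lo, hi = left, n
    while lo < hi:  # leftmost suffix whose first m chars are > pattern
        mid = (lo + hi) // 2
        if s[sa[mid]:sa[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return sorted(sa[left:lo])

s = "banana"
sa = build_suffix_array(s)
print(sa_search(s, sa, "ana"))  # [1, 3]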
Suffix Array Construction: O(n log^2 n) with the above approach, O(n log n) with radix sort, O(n) with SA-IS.
LCP Array (Kasai): O(n)
Space: O(n)
Aho-Corasick is a multi-pattern string matching algorithm. Given a set of k patterns with total length M and a text of length N, it finds all occurrences of all patterns in O(N + M + Z) time, where Z is the number of matches. It builds a trie of all patterns, then adds failure links (like KMP's failure function, but on a trie) and dictionary suffix links.
Use it when you need to search for many patterns simultaneously: content filtering (banned words), virus signature detection, network intrusion detection, DNA motif finding. Running KMP separately for each pattern would be O(N*k), but Aho-Corasick is O(N + M + Z).
Python
from collections import deque
class AhoCorasick:
def __init__(self):
self.goto = [{}] # goto[state][char] -> next_state
self.fail = [0] # failure links
self.output = [[]] # output[state] = list of pattern indices
def add_pattern(self, pattern: str, idx: int):
"""Insert a pattern into the trie."""
state = 0
for ch in pattern:
if ch not in self.goto[state]:
self.goto[state][ch] = len(self.goto)
self.goto.append({})
self.fail.append(0)
self.output.append([])
state = self.goto[state][ch]
self.output[state].append(idx)
def build(self):
"""Build failure links using BFS (like KMP on a trie)."""
queue = deque()
# Initialize: depth-1 nodes fail to root
for ch, s in self.goto[0].items():
queue.append(s)
while queue:
u = queue.popleft()
for ch, v in self.goto[u].items():
queue.append(v)
# Follow failure links until we find a node with edge ch
f = self.fail[u]
while f and ch not in self.goto[f]:
f = self.fail[f]
self.fail[v] = self.goto[f].get(ch, 0)
if self.fail[v] == v:
self.fail[v] = 0
# Merge output from failure link (dictionary suffix link)
self.output[v] = self.output[v] + self.output[self.fail[v]]
def search(self, text: str) -> list[tuple[int, int]]:
"""Return list of (position, pattern_index) for all matches."""
state = 0
results = []
for i, ch in enumerate(text):
while state and ch not in self.goto[state]:
state = self.fail[state]
state = self.goto[state].get(ch, 0)
for pat_idx in self.output[state]:
results.append((i, pat_idx))
return results
# Example: find banned words
ac = AhoCorasick()
patterns = ["he", "she", "his", "hers"]
for i, p in enumerate(patterns):
ac.add_pattern(p, i)
ac.build()
text = "ahishers"
for pos, idx in ac.search(text):
print(f"Pattern '{patterns[idx]}' found ending at position {pos}")
# Pattern 'his' found ending at position 3
# Pattern 'she' found ending at position 5
# Pattern 'he' found ending at position 5
# Pattern 'hers' found ending at position 7
Build: O(M * A) for an array-based automaton, where M = total pattern length and A = alphabet size (26 for lowercase); the dict-based version above takes O(M) expected time plus the cost of merging output lists.
Search: O(N + Z) where N = text length, Z = number of matches.
Space: O(M * A) for the automaton.
Manacher's algorithm finds all palindromic substrings in O(n) time. It computes, for each center position, the radius of the longest palindrome centered there. The key trick: we use previously computed palindromes to skip redundant comparisons, similar to how Z-algorithm reuses the Z-box.
Transform the string to handle even-length palindromes: "abc" becomes "#a#b#c#" (the implementation below also adds sentinels ^ and $ to avoid bounds checks).
p[i] = radius of longest palindrome centered at i in the transformed string.
Maintain a rightmost palindrome boundary [c, r). For each i:
1. If i < r, initialize p[i] = min(p[2*c - i], r - i) (mirror property).
2. Try to expand the palindrome centered at i.
3. If i + p[i] > r, update c = i, r = i + p[i].
Python
def manachers(s: str) -> list[int]:
"""Return palindrome radii for the transformed string."""
# Transform: "abc" -> "^#a#b#c#$"
t = "^#" + "#".join(s) + "#$"
n = len(t)
p = [0] * n
c = r = 0 # center and right boundary
for i in range(1, n - 1):
mirror = 2 * c - i
if i < r:
p[i] = min(r - i, p[mirror])
# Attempt to expand
while t[i + p[i] + 1] == t[i - p[i] - 1]:
p[i] += 1
# Update center if we expanded past r
if i + p[i] > r:
c, r = i, i + p[i]
return p
def longest_palindrome(s: str) -> str:
"""Find the longest palindromic substring in O(n)."""
p = manachers(s)
max_len = max(p)
center = p.index(max_len)
# Convert back from transformed index to original
start = (center - max_len) // 2
return s[start:start + max_len]
def count_palindromes(s: str) -> int:
"""Count total number of palindromic substrings."""
p = manachers(s)
# Each p[i] in the transformed string gives (p[i]+1)//2
# palindromes centered at that position (for # positions: even-length, for char positions: odd-length)
total = 0
for i in range(1, len(p) - 1):
total += (p[i] + 1) // 2 # integer ceiling of p[i]/2
return total
# Examples
print(longest_palindrome("babad")) # "bab" or "aba"
print(longest_palindrome("cbbd")) # "bb"
print(count_palindromes("aaa")) # 6: "a","a","a","aa","aa","aaa"
C++
string longestPalindrome(const string& s) {
string t = "^#";
for (char c : s) { t += c; t += '#'; }
t += '$';
int n = t.size();
vector<int> p(n, 0);
int c = 0, r = 0;
for (int i = 1; i < n - 1; i++) {
if (i < r) p[i] = min(r - i, p[2*c - i]);
while (t[i + p[i] + 1] == t[i - p[i] - 1]) p[i]++;
if (i + p[i] > r) { c = i; r = i + p[i]; }
}
int maxLen = *max_element(p.begin(), p.end());
int center = find(p.begin(), p.end(), maxLen) - p.begin();
return s.substr((center - maxLen) / 2, maxLen);
}
Time: O(n) -- each successful expansion pushes the right boundary r forward, and r never moves backward.
Space: O(n) for the P array and transformed string.
Polynomial string hashing allows O(1) substring comparison after O(n) preprocessing. This is essential for competitive programming problems involving substring equality checks, counting distinct substrings, or solving problems with binary search on string length.
hash(s) = s[0]*p^(n-1) + s[1]*p^(n-2) + ... + s[n-1]*p^0 (mod M)
Prefix hashes: H[i] = hash(s[0..i-1])
Substring hash: hash(s[l..r]) = (H[r+1] - H[l] * p^(r-l+1)) mod M
Use double hashing (two mod values) to virtually eliminate collisions.
Python
class StringHasher:
"""O(n) preprocessing, O(1) substring hash queries. Double hashing."""
def __init__(self, s: str):
self.n = len(s)
self.MOD1 = 10**9 + 7
self.MOD2 = 10**9 + 9
self.BASE1 = 131
self.BASE2 = 137
# Build prefix hashes and power arrays
self.h1 = [0] * (self.n + 1)
self.h2 = [0] * (self.n + 1)
self.pw1 = [1] * (self.n + 1)
self.pw2 = [1] * (self.n + 1)
for i in range(self.n):
self.h1[i + 1] = (self.h1[i] * self.BASE1 + ord(s[i])) % self.MOD1
self.h2[i + 1] = (self.h2[i] * self.BASE2 + ord(s[i])) % self.MOD2
self.pw1[i + 1] = self.pw1[i] * self.BASE1 % self.MOD1
self.pw2[i + 1] = self.pw2[i] * self.BASE2 % self.MOD2
def query(self, l: int, r: int) -> tuple[int, int]:
"""Return double hash of s[l..r] (inclusive) in O(1)."""
length = r - l + 1
h1 = (self.h1[r + 1] - self.h1[l] * self.pw1[length]) % self.MOD1
h2 = (self.h2[r + 1] - self.h2[l] * self.pw2[length]) % self.MOD2
return (h1, h2)
def equal(self, l1: int, r1: int, l2: int, r2: int) -> bool:
"""Check if s[l1..r1] == s[l2..r2] in O(1)."""
return self.query(l1, r1) == self.query(l2, r2)
# Example: count distinct substrings of length k
def count_distinct_k(s: str, k: int) -> int:
hasher = StringHasher(s)
seen = set()
for i in range(len(s) - k + 1):
seen.add(hasher.query(i, i + k - 1))
return len(seen)
print(count_distinct_k("abcabc", 3)) # 3: "abc", "bca", "cab"
With a single mod of ~10^9, collision probability for n = 10^6 comparisons is about 10^-3 (birthday paradox). With double hashing (two independent mods), the probability drops to ~10^-12. Always use double hashing in contests unless the time limit is extremely tight.
Preprocessing: O(n)
Substring hash query: O(1)
Space: O(n)
A trie stores a set of strings by sharing common prefixes. Each edge represents a character, and each node may mark the end of a word. Tries support insert, search, and prefix queries in O(m) time (m = word length), and are the backbone of autocomplete systems, spell checkers, IP routing tables, and Boggle solvers.
Python
class TrieNode:
__slots__ = ['children', 'is_end', 'count']
def __init__(self):
self.children: dict[str, TrieNode] = {}
self.is_end = False
self.count = 0 # number of words passing through this node
class Trie:
def __init__(self):
self.root = TrieNode()
def insert(self, word: str) -> None:
node = self.root
for ch in word:
if ch not in node.children:
node.children[ch] = TrieNode()
node = node.children[ch]
node.count += 1
node.is_end = True
def search(self, word: str) -> bool:
"""Return True if word is in the trie."""
node = self._find(word)
return node is not None and node.is_end
def starts_with(self, prefix: str) -> bool:
"""Return True if any word starts with prefix."""
return self._find(prefix) is not None
def count_prefix(self, prefix: str) -> int:
"""Return number of words that start with prefix."""
node = self._find(prefix)
return node.count if node else 0
def _find(self, prefix: str):
node = self.root
for ch in prefix:
if ch not in node.children:
return None
node = node.children[ch]
return node
def autocomplete(self, prefix: str, limit: int = 5) -> list[str]:
"""Return up to `limit` words starting with prefix."""
node = self._find(prefix)
if not node:
return []
results = []
def dfs(n, path):
if len(results) >= limit:
return
if n.is_end:
results.append(prefix + "".join(path))
for ch in sorted(n.children):
dfs(n.children[ch], path + [ch])
dfs(node, [])
return results
# Example
trie = Trie()
for word in ["apple", "app", "application", "apt", "bat"]:
trie.insert(word)
print(trie.search("app")) # True
print(trie.starts_with("ap")) # True
print(trie.count_prefix("app")) # 3 (app, apple, application)
print(trie.autocomplete("app")) # ['app', 'apple', 'application']
C++
struct TrieNode {
int children[26];
bool is_end;
int count;
TrieNode() : is_end(false), count(0) {
memset(children, -1, sizeof(children));
}
};
class Trie {
vector<TrieNode> nodes;
public:
Trie() { nodes.emplace_back(); }
void insert(const string& word) {
int cur = 0;
for (char c : word) {
int idx = c - 'a';
if (nodes[cur].children[idx] == -1) {
nodes[cur].children[idx] = nodes.size();
nodes.emplace_back();
}
cur = nodes[cur].children[idx];
nodes[cur].count++;
}
nodes[cur].is_end = true;
}
bool search(const string& word) {
int cur = find(word);
return cur != -1 && nodes[cur].is_end;
}
int countPrefix(const string& prefix) {
int cur = find(prefix);
return cur == -1 ? 0 : nodes[cur].count;
}
private:
int find(const string& s) {
int cur = 0;
for (char c : s) {
int idx = c - 'a';
if (nodes[cur].children[idx] == -1) return -1;
cur = nodes[cur].children[idx];
}
return cur;
}
};
Autocomplete: DFS from the prefix node to enumerate completions.
Spell checker: Trie + edit distance (search within edit distance k from query).
XOR maximum: Binary trie to find the max XOR pair in O(n * 30) (see the sketch after this list).
IP routing: Longest prefix match in routing tables.
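A minimal sketch of the binary-trie XOR trick from the list above, assuming 30-bit non-negative integers (max_xor_pair is our own helper name):
Python
def max_xor_pair(nums: list[int]) -> int:
    """Maximum XOR of any two numbers, via a binary trie over the bits."""
    BITS = 30
    root: dict = {}
    best = 0
    for num in nums:
        # Insert num bit by bit, most significant bit first
        node = root
        for b in range(BITS - 1, -1, -1):
            bit = (num >> b) & 1
            node = node.setdefault(bit, {})
        # Greedily walk the trie, taking the opposite bit whenever possible
        node, cur = root, 0
        for b in range(BITS - 1, -1, -1):
            bit = (num >> b) & 1
            if (1 - bit) in node:
                cur |= 1 << b
                node = node[1 - bit]
            else:
                node = node[bit]
        best = max(best, cur)
    return best

print(max_xor_pair([3, 10, 5, 25, 2, 8]))  # 28 (5 ^ 25)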
Insert/Search/Prefix: O(m) where m is the word/prefix length.
Space: O(total characters * alphabet size) in worst case, but shared prefixes save space.
Many competitive programming and interview problems involve dynamic programming on strings. Here are the most important patterns.
Python
def min_palindrome_cuts(s: str) -> int:
"""Minimum cuts to partition s into palindromes. O(n^2)."""
n = len(s)
# is_pal[i][j] = True if s[i..j] is a palindrome
is_pal = [[False] * n for _ in range(n)]
for i in range(n - 1, -1, -1):
for j in range(i, n):
if s[i] == s[j] and (j - i <= 2 or is_pal[i + 1][j - 1]):
is_pal[i][j] = True
# dp[i] = minimum cuts for s[0..i]
dp = list(range(n)) # worst case: cut after every char
for i in range(n):
if is_pal[0][i]:
dp[i] = 0
else:
for j in range(1, i + 1):
if is_pal[j][i]:
dp[i] = min(dp[i], dp[j - 1] + 1)
return dp[n - 1]
print(min_palindrome_cuts("aab")) # 1 ("aa" | "b")
print(min_palindrome_cuts("abcba")) # 0 (already palindrome)
Python
def is_match(s: str, p: str) -> bool:
"""
Regex matching with '.' (any single char) and '*' (zero or more of preceding).
dp[i][j] = does s[0..i-1] match p[0..j-1]?
"""
m, n = len(s), len(p)
dp = [[False] * (n + 1) for _ in range(m + 1)]
dp[0][0] = True
# Handle patterns like a*, a*b*, a*b*c* that match empty string
for j in range(2, n + 1):
if p[j - 1] == '*':
dp[0][j] = dp[0][j - 2]
for i in range(1, m + 1):
for j in range(1, n + 1):
if p[j - 1] == '*':
# Zero occurrences of preceding element
dp[i][j] = dp[i][j - 2]
# One or more occurrences
if p[j - 2] == '.' or p[j - 2] == s[i - 1]:
dp[i][j] = dp[i][j] or dp[i - 1][j]
elif p[j - 1] == '.' or p[j - 1] == s[i - 1]:
dp[i][j] = dp[i - 1][j - 1]
return dp[m][n]
print(is_match("aab", "c*a*b")) # True
print(is_match("mississippi", "mis*is*ip*.")) # True
Python
def wildcard_match(s: str, p: str) -> bool:
"""
Wildcard matching: '?' matches any single char, '*' matches any sequence.
"""
m, n = len(s), len(p)
dp = [[False] * (n + 1) for _ in range(m + 1)]
dp[0][0] = True
for j in range(1, n + 1):
if p[j - 1] == '*':
dp[0][j] = dp[0][j - 1]
for i in range(1, m + 1):
for j in range(1, n + 1):
if p[j - 1] == '*':
dp[i][j] = dp[i - 1][j] or dp[i][j - 1]
elif p[j - 1] == '?' or p[j - 1] == s[i - 1]:
dp[i][j] = dp[i - 1][j - 1]
return dp[m][n]
print(wildcard_match("adceb", "*a*b")) # True
print(wildcard_match("acdcb", "a*c?b")) # False
Time: O(m * n) where m = string length, n = pattern length.
Space: O(m * n), optimizable to O(n) with rolling array.
These services compare your typed text against a reference using edit distance (Section 4). The core pipeline (a sketch follows the list):
1. Split into words. For each word, compute Levenshtein distance against the expected word.
2. Accuracy = correct_chars / total_chars, where incorrect chars are counted via edit distance.
3. Some services use Damerau-Levenshtein (counting transpositions as 1 edit) since "teh" -> "the" is a common slip.
4. Advanced services also track per-key error rates using character-level diff alignment (similar to LCS backtracking).
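A hedged sketch of steps 1 and 2 (real services differ in the exact formula); it reuses edit_distance from the edit distance section:
Python
def word_level_accuracy(expected: str, typed: str) -> float:
    """Per-word accuracy: 1 - (total edit distance / total expected chars)."""
    exp_words = expected.split()
    typed_words = typed.split()
    total_chars = sum(len(w) for w in exp_words)
    errors = 0
    for i, w in enumerate(exp_words):
        # Missing words count fully against accuracy; extra typed words
        # are ignored in this simplified sketch.
        t = typed_words[i] if i < len(typed_words) else ""
        errors += edit_distance(w, t)  # from the edit distance section
    return max(0.0, 1.0 - errors / total_chars) if total_chars else 1.0

print(f"{word_level_accuracy('the quick brown', 'teh quikc brown'):.1%}")  # 69.2%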
Modern spell checkers combine multiple string algorithms:
1. Trie for dictionary storage and fast lookup.
2. Edit distance to find candidates within distance 1-2 of the misspelled word.
3. BK-trees (metric trees using edit distance) for efficient nearest-neighbor search in large dictionaries.
4. Norvig's spell corrector: generate all strings within edit distance 2, filter by dictionary membership, rank by word frequency. ~80% of misspellings are within edit distance 1.
Pattern matching: KMP/Rabin-Karp to find gene sequences (motifs) in genomes.
Sequence alignment: Edit distance variants (Smith-Waterman for local alignment, Needleman-Wunsch for global; see the sketch after this list) with custom scoring matrices (BLOSUM, PAM).
Repeat finding: Suffix arrays + LCP to find tandem repeats in DNA.
Genome assembly: Overlap graphs built using suffix-prefix matching (Aho-Corasick).
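To illustrate the global-alignment item above, a minimal Needleman-Wunsch sketch with a toy scoring scheme (match +1, mismatch -1, gap -1) rather than a real BLOSUM/PAM matrix:
Python
def needleman_wunsch(a: str, b: str, match: int = 1,
                     mismatch: int = -1, gap: int = -1) -> int:
    """Global alignment score (Needleman-Wunsch) with a simple scoring scheme."""
    m, n = len(a), len(b)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1): dp[i][0] = i * gap
    for j in range(1, n + 1): dp[0][j] = j * gap
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            score = match if a[i - 1] == b[j - 1] else mismatch
            dp[i][j] = max(dp[i - 1][j - 1] + score,  # align a[i-1] with b[j-1]
                           dp[i - 1][j] + gap,        # gap in b
                           dp[i][j - 1] + gap)        # gap in a
    return dp[m][n]

print(needleman_wunsch("GATTACA", "GCATGCU"))  # 0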
Ctrl+F (Find): Uses Boyer-Moore or KMP for exact matching.
Fuzzy search (fzf, VS Code): Substring/subsequence matching + scoring based on match positions, consecutive matches, and word boundaries.
Full-text search (Elasticsearch): Inverted index + n-gram tokenization + edit distance for typo tolerance.
Diff tools (git diff): LCS-based algorithms (Myers diff) to compute minimal edit scripts between files.
Python
from collections import Counter
def spell_corrector(word: str, dictionary: set[str], freq: Counter) -> str:
"""Return most likely correction for a misspelled word."""
if word in dictionary:
return word
def edits1(w):
"""All strings within edit distance 1."""
letters = 'abcdefghijklmnopqrstuvwxyz'
splits = [(w[:i], w[i:]) for i in range(len(w) + 1)]
deletes = [a + b[1:] for a, b in splits if b]
transposes = [a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1]
replaces = [a + c + b[1:] for a, b in splits if b for c in letters]
inserts = [a + c + b for a, b in splits for c in letters]
return set(deletes + transposes + replaces + inserts)
# Try edit distance 1 first, then 2
candidates = {w for w in edits1(word) if w in dictionary}
if not candidates:
candidates = {e2 for e1 in edits1(word)
for e2 in edits1(e1) if e2 in dictionary}
if not candidates:
return word # no correction found
return max(candidates, key=lambda w: freq[w])
# Example usage
dictionary = {"the", "their", "there", "then", "they", "this", "that"}
freq = Counter({"the": 100, "their": 20, "there": 30, "then": 15})
print(spell_corrector("teh", dictionary, freq)) # "the"
print(spell_corrector("thier", dictionary, freq)) # "their"
LC 28: Find the Index of the First Occurrence in a String
LC 214: Shortest Palindrome (KMP trick: reverse + KMP failure function)
LC 459: Repeated Substring Pattern (KMP: check if lps[n-1] divides n)
LC 686: Repeated String Match (Rabin-Karp)
CF 126B: Password (Z-function)
LC 72: Edit Distance
LC 583: Delete Operation for Two Strings (LCS variant)
LC 1143: Longest Common Subsequence
LC 712: Minimum ASCII Delete Sum for Two Strings
LC 97: Interleaving String
LC 115: Distinct Subsequences
LC 5: Longest Palindromic Substring (Manacher's for O(n))
LC 647: Palindromic Substrings (count all)
LC 131: Palindrome Partitioning (backtracking)
LC 132: Palindrome Partitioning II (min cuts DP)
LC 516: Longest Palindromic Subsequence
LC 208: Implement Trie (Prefix Tree)
LC 211: Design Add and Search Words Data Structure
LC 212: Word Search II (trie + backtracking)
LC 421: Maximum XOR of Two Numbers in an Array (binary trie)
LC 745: Prefix and Suffix Search
LC 10: Regular Expression Matching
LC 44: Wildcard Matching
LC 87: Scramble String
LC 1312: Minimum Insertion Steps to Make a String Palindrome
LC 940: Distinct Subsequences II
SPOJ SUBST1: New Distinct Substrings (suffix array + LCP)
CF 271D: Good Substrings (trie)
CF 710F: String Set Queries (Aho-Corasick)
LC 336: Palindrome Pairs (trie or hashing)
POJ 3294: Life Forms (suffix array, longest common substring of multiple strings)
Start with: Edit Distance (LC 72), LCS (LC 1143), Trie (LC 208) -- these appear in interviews constantly.
Then learn: KMP (LC 28, 459), Rabin-Karp, String DP (LC 10, 44).
For competitive programming: Z-algorithm, Suffix Arrays, Aho-Corasick, Manacher's, String Hashing.
Master these and you will be well prepared for almost any string problem you encounter.