The complete guide to string processing algorithms used in competitive programming, software engineering interviews, and real-world applications like search engines, spell checkers, DNA sequence analysis, and text editors. From pattern matching to suffix structures.
KMP finds all occurrences of a pattern in a text in O(n + m) time, where n is the text length and m is the pattern length. The key insight is the failure function (also called the prefix function): when a mismatch occurs, we already know some characters in the text match the beginning of the pattern, so we never re-examine characters we have already matched.
lps[i] = length of the longest proper prefix of pattern[0..i] that is also a suffix of pattern[0..i].
Example: pattern = "ABACABAB"
lps = [0, 0, 1, 0, 1, 2, 3, 2]
At index 6, "ABACABA" has prefix "ABA" matching suffix "ABA", so lps[6] = 3.
Naive string matching is O(n*m). KMP guarantees linear time because we never backtrack in the text: the text pointer only moves forward, and the pattern pointer can fall back (via failure links) at most as many times as it has previously advanced, so the total work is O(n + m).
Python
def compute_lps(pattern: str) -> list[int]:
"""Build the longest proper prefix-suffix array."""
m = len(pattern)
lps = [0] * m
length = 0 # length of previous longest prefix suffix
i = 1
while i < m:
if pattern[i] == pattern[length]:
length += 1
lps[i] = length
i += 1
else:
if length != 0:
length = lps[length - 1] # fall back, don't increment i
else:
lps[i] = 0
i += 1
return lps
def kmp_search(text: str, pattern: str) -> list[int]:
"""Return all starting indices where pattern occurs in text."""
n, m = len(text), len(pattern)
if m == 0:
return []
lps = compute_lps(pattern)
matches = []
i = 0 # index in text
j = 0 # index in pattern
while i < n:
if text[i] == pattern[j]:
i += 1
j += 1
if j == m:
matches.append(i - j)
j = lps[j - 1] # continue searching for overlapping matches
elif i < n and text[i] != pattern[j]:
if j != 0:
j = lps[j - 1]
else:
i += 1
return matches
# Example
print(kmp_search("ABABDABACDABABCABAB", "ABABCABAB")) # [9]
print(kmp_search("AAAAAA", "AAA")) # [0, 1, 2, 3] (overlapping)
C++
#include <algorithm>
#include <cstring>
#include <string>
#include <vector>
using namespace std;
vector<int> computeLPS(const string& pat) {
int m = pat.size();
vector<int> lps(m, 0);
int len = 0, i = 1;
while (i < m) {
if (pat[i] == pat[len]) {
lps[i++] = ++len;
} else if (len) {
len = lps[len - 1];
} else {
lps[i++] = 0;
}
}
return lps;
}
vector<int> kmpSearch(const string& text, const string& pat) {
int n = text.size(), m = pat.size();
if (m == 0) return {};  // guard: lps[j - 1] below would be out of range
vector<int> lps = computeLPS(pat), res;
for (int i = 0, j = 0; i < n; ) {
if (text[i] == pat[j]) { i++; j++; }
if (j == m) { res.push_back(i - m); j = lps[j - 1]; }
else if (i < n && text[i] != pat[j]) {
j ? j = lps[j - 1] : i++;
}
}
return res;
}
Time: O(n + m) for both preprocessing and search.
Space: O(m) for the LPS array.
The Z-algorithm computes, for each position i in a string, the length of the longest substring starting at i that matches a prefix of the string. This Z-array can be used for pattern matching by concatenating pattern + "$" + text and looking for Z-values equal to the pattern length.
Z[i] = length of the longest substring starting at position i that matches a prefix of the string.
Example: s = "aabxaab"
Z = [-, 1, 0, 0, 3, 1, 0] (Z[0] is undefined/set to 0 by convention)
Z[4] = 3 because "aab" starting at index 4 matches the prefix "aab".
Python
def z_function(s: str) -> list[int]:
"""Compute the Z-array in O(n) time."""
n = len(s)
z = [0] * n
l, r = 0, 0 # [l, r) is the rightmost Z-box
for i in range(1, n):
if i < r:
z[i] = min(r - i, z[i - l])
while i + z[i] < n and s[z[i]] == s[i + z[i]]:
z[i] += 1
if i + z[i] > r:
l, r = i, i + z[i]
return z
def z_search(text: str, pattern: str) -> list[int]:
"""Find all occurrences of pattern in text using Z-algorithm."""
combined = pattern + "$" + text
z = z_function(combined)
m = len(pattern)
return [i - m - 1 for i in range(m + 1, len(combined)) if z[i] == m]
# Example
print(z_search("ABABDABACDABABCABAB", "ABABCABAB")) # [9]
C++
vector<int> zFunction(const string& s) {
int n = s.size();
vector<int> z(n, 0);
for (int i = 1, l = 0, r = 0; i < n; i++) {
if (i < r) z[i] = min(r - i, z[i - l]);
while (i + z[i] < n && s[z[i]] == s[i + z[i]]) z[i]++;
if (i + z[i] > r) { l = i; r = i + z[i]; }
}
return z;
}
vector<int> zSearch(const string& text, const string& pat) {
string combined = pat + "$" + text;
vector<int> z = zFunction(combined);
int m = pat.size();
vector<int> res;
for (int i = m + 1; i < (int)combined.size(); i++)
if (z[i] == m) res.push_back(i - m - 1);
return res;
}
Both are O(n + m) for pattern matching. The Z-algorithm is often simpler to implement and reason about. KMP is more natural for streaming data, since it processes one character at a time (see the sketch below). In competitive programming, the Z-algorithm is often preferred for its simplicity.
Time: O(n) to build the Z-array. O(n + m) for pattern search.
Space: O(n + m) for the combined string.
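To make the streaming point concrete, here is a minimal sketch of KMP wrapped as an online matcher that consumes one character at a time. It reuses compute_lps from the KMP section; the StreamingKMP class and its feed method are illustrative names, not a standard API.
Python
class StreamingKMP:
    """Consume text one character at a time; detect pattern completions."""
    def __init__(self, pattern: str):
        self.pattern = pattern
        self.lps = compute_lps(pattern)  # reuse the helper from the KMP section
        self.j = 0  # length of the pattern prefix currently matched

    def feed(self, ch: str) -> bool:
        """Feed one character; return True if an occurrence ends here."""
        while self.j > 0 and ch != self.pattern[self.j]:
            self.j = self.lps[self.j - 1]  # fall back via failure links
        if ch == self.pattern[self.j]:
            self.j += 1
        if self.j == len(self.pattern):
            self.j = self.lps[self.j - 1]  # keep state for overlapping matches
            return True
        return False

matcher = StreamingKMP("ABA")
print([i for i, ch in enumerate("ABABA") if matcher.feed(ch)])  # [2, 4]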
Rabin-Karp uses a rolling hash to compare the pattern hash with substring hashes in O(1) per position. If hashes match, we verify character by character. It is especially useful for multi-pattern matching (searching for many patterns at once).
hash(s[0..m-1]) = (s[0]*p^(m-1) + s[1]*p^(m-2) + ... + s[m-1]*p^0) mod M
To slide the window one position right:
hash(s[1..m]) = (hash(s[0..m-1]) - s[0]*p^(m-1)) * p + s[m]
Common choices: p = 31 or 131, M = 10^9 + 7 or 10^9 + 9
A single hash can collide. For competitive programming, use double hashing (two different mod values) to make collision probability negligible (~10^-18). Always verify matches character by character when correctness matters.
Python
def rabin_karp(text: str, pattern: str) -> list[int]:
"""Rabin-Karp with rolling hash. Returns all match positions."""
n, m = len(text), len(pattern)
if m > n:
return []
BASE = 131
MOD = 10**18 + 9
# Precompute BASE^m mod MOD
power = pow(BASE, m, MOD)
# Compute hash of pattern and first window
pat_hash = 0
win_hash = 0
for i in range(m):
pat_hash = (pat_hash * BASE + ord(pattern[i])) % MOD
win_hash = (win_hash * BASE + ord(text[i])) % MOD
matches = []
for i in range(n - m + 1):
if win_hash == pat_hash:
# Verify to avoid false positives
if text[i:i + m] == pattern:
matches.append(i)
# Slide window
if i + m < n:
win_hash = (win_hash * BASE - ord(text[i]) * power + ord(text[i + m])) % MOD
return matches
# Example
print(rabin_karp("abcabcabc", "abc")) # [0, 3, 6]
C++
vector<int> rabinKarp(const string& text, const string& pat) {
int n = text.size(), m = pat.size();
if (m > n) return {};
const long long BASE = 131, MOD = 1000000007LL;
long long power = 1;
for (int i = 0; i < m; i++) power = power * BASE % MOD;
long long ph = 0, wh = 0;
for (int i = 0; i < m; i++) {
ph = (ph * BASE + pat[i]) % MOD;
wh = (wh * BASE + text[i]) % MOD;
}
vector<int> res;
for (int i = 0; i <= n - m; i++) {
if (wh == ph && text.substr(i, m) == pat)
res.push_back(i);
if (i + m < n)
wh = ((wh * BASE - text[i] * power % MOD + MOD) % MOD + text[i + m]) % MOD;
}
return res;
}
Time: O(n + m) expected, O(n*m) worst case (adversarial inputs that force many hash collisions). Double hashing makes such collisions vanishingly unlikely, though the theoretical worst case remains.
Space: O(1) extra (beyond input).
Edit distance measures the minimum number of single-character operations (insert, delete, replace) to transform one string into another. This is one of the most practically important string algorithms -- it powers spell checkers, autocorrect, fuzzy search, diff tools, DNA sequence alignment, and typing accuracy services.
dp[i][j] = edit distance between s1[0..i-1] and s2[0..j-1]
Base cases: dp[i][0] = i, dp[0][j] = j
Transition:
If s1[i-1] == s2[j-1]: dp[i][j] = dp[i-1][j-1]
Else: dp[i][j] = 1 + min(dp[i-1][j], dp[i][j-1], dp[i-1][j-1])
The three choices are: delete from s1, insert into s1, replace in s1.
Typing services like MonkeyType and TypeRacer use edit distance to measure accuracy. When you type "hte" instead of "the", standard Levenshtein distance counts 2 operations (two substitutions), while Damerau-Levenshtein counts 1 (a single transposition). Services compute:
Accuracy = 1 - (edit_distance / max(len(typed), len(expected)))
Fuzzy matching in search (like fzf, VS Code's Ctrl+P) uses edit distance or weighted variants where certain operations cost less (e.g., transpositions are common typos and may count as 1 operation via Damerau-Levenshtein).
Example: "kitten" -> "sitting": k->s (replace), e->i (replace), insert g at the end = edit distance 3
Full DP table (rows = "kitten", cols = "sitting"):
"" s i t t i n g "" 0 1 2 3 4 5 6 7 k 1 1 2 3 4 5 6 7 i 2 2 1 2 3 4 5 6 t 3 3 2 1 2 3 4 5 t 4 4 3 2 1 2 3 4 e 5 5 4 3 2 2 3 4 n 6 6 5 4 3 3 2 3
Answer: dp[6][7] = 3
Python
def edit_distance(s1: str, s2: str) -> int:
"""Compute minimum edit distance using O(n) space."""
m, n = len(s1), len(s2)
prev = list(range(n + 1))
for i in range(1, m + 1):
curr = [i] + [0] * n
for j in range(1, n + 1):
if s1[i - 1] == s2[j - 1]:
curr[j] = prev[j - 1]
else:
curr[j] = 1 + min(prev[j], curr[j - 1], prev[j - 1])
prev = curr
return prev[n]
def edit_distance_with_ops(s1: str, s2: str) -> tuple[int, list[str]]:
"""Return edit distance and the actual operations to transform s1 -> s2."""
m, n = len(s1), len(s2)
dp = [[0] * (n + 1) for _ in range(m + 1)]
for i in range(m + 1): dp[i][0] = i
for j in range(n + 1): dp[0][j] = j
for i in range(1, m + 1):
for j in range(1, n + 1):
if s1[i - 1] == s2[j - 1]:
dp[i][j] = dp[i - 1][j - 1]
else:
dp[i][j] = 1 + min(dp[i - 1][j], dp[i][j - 1], dp[i - 1][j - 1])
# Backtrack to find operations
ops = []
i, j = m, n
while i > 0 or j > 0:
if i > 0 and j > 0 and s1[i - 1] == s2[j - 1]:
i -= 1; j -= 1
elif i > 0 and j > 0 and dp[i][j] == dp[i - 1][j - 1] + 1:
ops.append(f"Replace '{s1[i-1]}' with '{s2[j-1]}' at pos {i-1}")
i -= 1; j -= 1
elif i > 0 and dp[i][j] == dp[i - 1][j] + 1:
ops.append(f"Delete '{s1[i-1]}' at pos {i-1}")
i -= 1
else:
ops.append(f"Insert '{s2[j-1]}' at pos {i}")
j -= 1
ops.reverse()
return dp[m][n], ops
# Typing accuracy example
def typing_accuracy(expected: str, typed: str) -> float:
"""Calculate typing accuracy as used by typing speed services."""
dist = edit_distance(expected, typed)
max_len = max(len(expected), len(typed))
return 1.0 - dist / max_len if max_len > 0 else 1.0
# Examples
print(edit_distance("kitten", "sitting")) # 3
dist, ops = edit_distance_with_ops("kitten", "sitting")
for op in ops: print(op)
print(f"Typing accuracy: {typing_accuracy('the quick brown', 'teh quikc brown'):.1%}") # 86.7%
C++
int editDistance(const string& s1, const string& s2) {
int m = s1.size(), n = s2.size();
vector<int> prev(n + 1), curr(n + 1);
for (int j = 0; j <= n; j++) prev[j] = j;
for (int i = 1; i <= m; i++) {
curr[0] = i;
for (int j = 1; j <= n; j++) {
if (s1[i-1] == s2[j-1])
curr[j] = prev[j-1];
else
curr[j] = 1 + min({prev[j], curr[j-1], prev[j-1]});
}
swap(prev, curr);
}
return prev[n];
}
Standard Levenshtein counts "ab" -> "ba" as 2 operations (delete + insert or two replaces). Damerau-Levenshtein adds transposition as a single operation, which is more natural for typos. Add this check: if s1[i-1] == s2[j-2] and s1[i-2] == s2[j-1], then also consider dp[i-2][j-2] + 1.
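A minimal sketch of that variant (the restricted Damerau-Levenshtein, also known as optimal string alignment), using the same DP layout as above:
Python
def damerau_levenshtein(s1: str, s2: str) -> int:
    """Restricted Damerau-Levenshtein: adjacent transposition costs 1."""
    m, n = len(s1), len(s2)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1): dp[i][0] = i
    for j in range(n + 1): dp[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if s1[i - 1] == s2[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # delete
                           dp[i][j - 1] + 1,         # insert
                           dp[i - 1][j - 1] + cost)  # replace / match
            if (i > 1 and j > 1 and s1[i - 1] == s2[j - 2]
                    and s1[i - 2] == s2[j - 1]):
                dp[i][j] = min(dp[i][j], dp[i - 2][j - 2] + 1)  # transpose
    return dp[m][n]

print(damerau_levenshtein("hte", "the"))  # 1 (vs. 2 for standard Levenshtein)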
Time: O(m * n)
Space: O(min(m, n)) with rolling array optimization, O(m * n) if you need backtracking.
LCS finds the longest sequence of characters that appear in both strings in the same order (not necessarily contiguous). It is used in diff tools, version control, bioinformatics, and plagiarism detection.
dp[i][j] = LCS length of s1[0..i-1] and s2[0..j-1]
If s1[i-1] == s2[j-1]: dp[i][j] = dp[i-1][j-1] + 1
Else: dp[i][j] = max(dp[i-1][j], dp[i][j-1])
Python
def lcs(s1: str, s2: str) -> str:
"""Return the actual LCS string via DP + backtracking."""
m, n = len(s1), len(s2)
dp = [[0] * (n + 1) for _ in range(m + 1)]
for i in range(1, m + 1):
for j in range(1, n + 1):
if s1[i - 1] == s2[j - 1]:
dp[i][j] = dp[i - 1][j - 1] + 1
else:
dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
# Backtrack to build the LCS
result = []
i, j = m, n
while i > 0 and j > 0:
if s1[i - 1] == s2[j - 1]:
result.append(s1[i - 1])
i -= 1; j -= 1
elif dp[i - 1][j] > dp[i][j - 1]:
i -= 1
else:
j -= 1
return "".join(reversed(result))
# Example
print(lcs("ABCBDAB", "BDCAB")) # "BCAB" (length 4)
C++
string lcs(const string& s1, const string& s2) {
int m = s1.size(), n = s2.size();
vector<vector<int>> dp(m + 1, vector<int>(n + 1, 0));
for (int i = 1; i <= m; i++)
for (int j = 1; j <= n; j++)
dp[i][j] = s1[i-1] == s2[j-1]
? dp[i-1][j-1] + 1
: max(dp[i-1][j], dp[i][j-1]);
string res;
for (int i = m, j = n; i > 0 && j > 0; ) {
if (s1[i-1] == s2[j-1]) { res += s1[--i]; --j; }
else if (dp[i-1][j] > dp[i][j-1]) i--;
else j--;
}
reverse(res.begin(), res.end());
return res;
}
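If you only need the length, a rolling-array version keeps just two rows over the shorter string; a minimal sketch:
Python
def lcs_length(s1: str, s2: str) -> int:
    """LCS length only, using O(min(m, n)) space."""
    if len(s1) < len(s2):
        s1, s2 = s2, s1  # make s2 the shorter string
    n = len(s2)
    prev = [0] * (n + 1)
    for ch in s1:
        curr = [0] * (n + 1)
        for j in range(1, n + 1):
            if ch == s2[j - 1]:
                curr[j] = prev[j - 1] + 1
            else:
                curr[j] = max(prev[j], curr[j - 1])
        prev = curr
    return prev[n]

print(lcs_length("ABCBDAB", "BDCAB"))  # 4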
Time: O(m * n)
Space: O(m * n) for the full table (needed for backtracking), O(min(m,n)) for length only.
A suffix array is a sorted array of all suffixes of a string, represented by their starting indices. Combined with an LCP (Longest Common Prefix) array, it enables powerful operations: finding any substring in O(m log n), counting distinct substrings, and computing longest repeated substrings.
s = "banana$"
All suffixes: banana$, anana$, nana$, ana$, na$, a$, $
Sorted: $, a$, ana$, anana$, banana$, na$, nana$
Suffix Array (indices): [6, 5, 3, 1, 0, 4, 2]
LCP Array: [0, 1, 3, 0, 0, 2] (LCP between consecutive sorted suffixes)
Python
def build_suffix_array(s: str) -> list[int]:
"""Build suffix array in O(n log^2 n) using rank doubling."""
n = len(s)
sa = list(range(n))
rank = [ord(c) for c in s]
tmp = [0] * n
k = 1
while k < n:
def cmp_key(i):
return (rank[i], rank[i + k] if i + k < n else -1)
sa.sort(key=cmp_key)
# Recompute ranks
tmp[sa[0]] = 0
for i in range(1, n):
tmp[sa[i]] = tmp[sa[i - 1]]
if cmp_key(sa[i]) != cmp_key(sa[i - 1]):
tmp[sa[i]] += 1
rank = tmp[:]
if rank[sa[-1]] == n - 1:
break
k *= 2
return sa
def build_lcp(s: str, sa: list[int]) -> list[int]:
"""Build LCP array in O(n) using Kasai's algorithm."""
n = len(s)
rank = [0] * n
for i in range(n):
rank[sa[i]] = i
lcp = [0] * (n - 1)
k = 0
for i in range(n):
if rank[i] == 0:
k = 0
continue
j = sa[rank[i] - 1]
while i + k < n and j + k < n and s[i + k] == s[j + k]:
k += 1
lcp[rank[i] - 1] = k
if k > 0:
k -= 1
return lcp
# Example
s = "banana"
sa = build_suffix_array(s)
lcp = build_lcp(s, sa)
print("SA:", sa) # [5, 3, 1, 0, 4, 2]
print("LCP:", lcp) # [1, 3, 0, 0, 2]
# Number of distinct substrings = n*(n+1)/2 - sum(lcp)
n = len(s)
distinct = n * (n + 1) // 2 - sum(lcp)
print(f"Distinct substrings: {distinct}") # 15
1. Find any substring in O(m log n) via binary search on the suffix array (see the sketch after this list).
2. Count distinct substrings: n*(n+1)/2 - sum(LCP).
3. Longest repeated substring: max value in the LCP array.
4. Longest common substring of two strings: concatenate with separator, build SA + LCP, find max LCP between suffixes from different strings.
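A sketch of application 1, reusing build_suffix_array from above (sa_search is our own helper name). Each comparison slices up to m characters, and suffixes starting with the pattern form a contiguous block in the suffix array, so two binary searches find all occurrences in O(m log n):
Python
def sa_search(s: str, sa: list[int], pattern: str) -> list[int]:
    """All start positions of pattern in s via binary search on the suffix array."""
    n, m = len(sa), len(pattern)
    lo, hi = 0, n
    while lo < hi:  # leftmost suffix whose first m chars are >= pattern
        mid = (lo + hi) // 2
        if s[sa[mid]:sa[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    left = lo
    lo, hi = left, n
    while lo < hi:  # leftmost suffix whose first m chars are > pattern
        mid = (lo + hi) // 2
        if s[sa[mid]:sa[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return sorted(sa[left:lo])

s = "banana"
sa = build_suffix_array(s)
print(sa_search(s, sa, "ana"))  # [1, 3]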
Suffix Array Construction: O(n log^2 n) with the above approach, O(n log n) with radix sort, O(n) with SA-IS.
LCP Array (Kasai): O(n)
Space: O(n)
Aho-Corasick is a multi-pattern string matching algorithm. Given a set of k patterns with total length M and a text of length N, it finds all occurrences of all patterns in O(N + M + Z) time, where Z is the number of matches. It builds a trie of all patterns, then adds failure links (like KMP's failure function, but on a trie) and dictionary suffix links.
Use it when you need to search for many patterns simultaneously: content filtering (banned words), virus signature detection, network intrusion detection, DNA motif finding. Running KMP separately for each pattern would be O(N*k), but Aho-Corasick is O(N + M + Z).
Python
from collections import deque
class AhoCorasick:
def __init__(self):
self.goto = [{}] # goto[state][char] -> next_state
self.fail = [0] # failure links
self.output = [[]] # output[state] = list of pattern indices
def add_pattern(self, pattern: str, idx: int):
"""Insert a pattern into the trie."""
state = 0
for ch in pattern:
if ch not in self.goto[state]:
self.goto[state][ch] = len(self.goto)
self.goto.append({})
self.fail.append(0)
self.output.append([])
state = self.goto[state][ch]
self.output[state].append(idx)
def build(self):
"""Build failure links using BFS (like KMP on a trie)."""
queue = deque()
# Initialize: depth-1 nodes fail to root
for ch, s in self.goto[0].items():
queue.append(s)
while queue:
u = queue.popleft()
for ch, v in self.goto[u].items():
queue.append(v)
# Follow failure links until we find a node with edge ch
f = self.fail[u]
while f and ch not in self.goto[f]:
f = self.fail[f]
self.fail[v] = self.goto[f].get(ch, 0)
if self.fail[v] == v:
self.fail[v] = 0
# Merge output from failure link (dictionary suffix link)
self.output[v] = self.output[v] + self.output[self.fail[v]]
def search(self, text: str) -> list[tuple[int, int]]:
"""Return list of (position, pattern_index) for all matches."""
state = 0
results = []
for i, ch in enumerate(text):
while state and ch not in self.goto[state]:
state = self.fail[state]
state = self.goto[state].get(ch, 0)
for pat_idx in self.output[state]:
results.append((i, pat_idx))
return results
# Example: find banned words
ac = AhoCorasick()
patterns = ["he", "she", "his", "hers"]
for i, p in enumerate(patterns):
ac.add_pattern(p, i)
ac.build()
text = "ahishers"
for pos, idx in ac.search(text):
print(f"Pattern '{patterns[idx]}' found ending at position {pos}")
# Pattern 'his' found ending at position 3
# Pattern 'she' found ending at position 5
# Pattern 'he' found ending at position 5
# Pattern 'hers' found ending at position 7
Build: O(M * A) for an array-based automaton, where M = total pattern length and A = alphabet size (26 for lowercase); the dict-based version above takes O(M) expected time plus the cost of merging output lists.
Search: O(N + Z) where N = text length, Z = number of matches.
Space: O(M * A) for the automaton.
Manacher's algorithm finds all palindromic substrings in O(n) time. It computes, for each center position, the radius of the longest palindrome centered there. The key trick: we use previously computed palindromes to skip redundant comparisons, similar to how Z-algorithm reuses the Z-box.
Transform the string to handle even-length palindromes: "abc" becomes "#a#b#c#" (the implementation below also adds sentinels ^ and $ to avoid bounds checks).
p[i] = radius of longest palindrome centered at i in the transformed string.
Maintain a rightmost palindrome boundary [c, r). For each i:
1. If i < r, initialize p[i] = min(p[2*c - i], r - i) (mirror property).
2. Try to expand the palindrome centered at i.
3. If i + p[i] > r, update c = i, r = i + p[i].
Python
def manachers(s: str) -> list[int]:
"""Return palindrome radii for the transformed string."""
# Transform: "abc" -> "^#a#b#c#$"
t = "^#" + "#".join(s) + "#$"
n = len(t)
p = [0] * n
c = r = 0 # center and right boundary
for i in range(1, n - 1):
mirror = 2 * c - i
if i < r:
p[i] = min(r - i, p[mirror])
# Attempt to expand
while t[i + p[i] + 1] == t[i - p[i] - 1]:
p[i] += 1
# Update center if we expanded past r
if i + p[i] > r:
c, r = i, i + p[i]
return p
def longest_palindrome(s: str) -> str:
"""Find the longest palindromic substring in O(n)."""
p = manachers(s)
max_len = max(p)
center = p.index(max_len)
# Convert back from transformed index to original
start = (center - max_len) // 2
return s[start:start + max_len]
def count_palindromes(s: str) -> int:
"""Count total number of palindromic substrings."""
p = manachers(s)
# Each p[i] in the transformed string gives (p[i]+1)//2
# palindromes centered at that position (for # positions: even-length, for char positions: odd-length)
total = 0
for i in range(1, len(p) - 1):
total += (p[i] + 1) // 2 # integer ceiling of p[i]/2
return total
# Examples
print(longest_palindrome("babad")) # "bab" or "aba"
print(longest_palindrome("cbbd")) # "bb"
print(count_palindromes("aaa")) # 6: "a","a","a","aa","aa","aaa"
C++
string longestPalindrome(const string& s) {
string t = "^#";
for (char c : s) { t += c; t += '#'; }
t += '$';
int n = t.size();
vector<int> p(n, 0);
int c = 0, r = 0;
for (int i = 1; i < n - 1; i++) {
if (i < r) p[i] = min(r - i, p[2*c - i]);
while (t[i + p[i] + 1] == t[i - p[i] - 1]) p[i]++;
if (i + p[i] > r) { c = i; r = i + p[i]; }
}
int maxLen = *max_element(p.begin(), p.end());
int center = find(p.begin(), p.end(), maxLen) - p.begin();
return s.substr((center - maxLen) / 2, maxLen);
}
Time: O(n) -- each successful expansion pushes the right boundary r forward, and r never moves backward.
Space: O(n) for the P array and transformed string.
Polynomial string hashing allows O(1) substring comparison after O(n) preprocessing. This is essential for competitive programming problems involving substring equality checks, counting distinct substrings, or solving problems with binary search on string length.
hash(s) = s[0]*p^(n-1) + s[1]*p^(n-2) + ... + s[n-1]*p^0 (mod M)
Prefix hashes: H[i] = hash(s[0..i-1])
Substring hash: hash(s[l..r]) = (H[r+1] - H[l] * p^(r-l+1)) mod M
Use double hashing (two mod values) to virtually eliminate collisions.
Python
class StringHasher:
"""O(n) preprocessing, O(1) substring hash queries. Double hashing."""
def __init__(self, s: str):
self.n = len(s)
self.MOD1 = 10**9 + 7
self.MOD2 = 10**9 + 9
self.BASE1 = 131
self.BASE2 = 137
# Build prefix hashes and power arrays
self.h1 = [0] * (self.n + 1)
self.h2 = [0] * (self.n + 1)
self.pw1 = [1] * (self.n + 1)
self.pw2 = [1] * (self.n + 1)
for i in range(self.n):
self.h1[i + 1] = (self.h1[i] * self.BASE1 + ord(s[i])) % self.MOD1
self.h2[i + 1] = (self.h2[i] * self.BASE2 + ord(s[i])) % self.MOD2
self.pw1[i + 1] = self.pw1[i] * self.BASE1 % self.MOD1
self.pw2[i + 1] = self.pw2[i] * self.BASE2 % self.MOD2
def query(self, l: int, r: int) -> tuple[int, int]:
"""Return double hash of s[l..r] (inclusive) in O(1)."""
length = r - l + 1
h1 = (self.h1[r + 1] - self.h1[l] * self.pw1[length]) % self.MOD1
h2 = (self.h2[r + 1] - self.h2[l] * self.pw2[length]) % self.MOD2
return (h1, h2)
def equal(self, l1: int, r1: int, l2: int, r2: int) -> bool:
"""Check if s[l1..r1] == s[l2..r2] in O(1)."""
return self.query(l1, r1) == self.query(l2, r2)
# Example: count distinct substrings of length k
def count_distinct_k(s: str, k: int) -> int:
hasher = StringHasher(s)
seen = set()
for i in range(len(s) - k + 1):
seen.add(hasher.query(i, i + k - 1))
return len(seen)
print(count_distinct_k("abcabc", 3)) # 3: "abc", "bca", "cab"
With a single mod of ~10^9, collision probability for n = 10^6 comparisons is about 10^-3 (birthday paradox). With double hashing (two independent mods), the probability drops to ~10^-12. Always use double hashing in contests unless the time limit is extremely tight.
Preprocessing: O(n)
Substring hash query: O(1)
Space: O(n)
A trie stores a set of strings by sharing common prefixes. Each edge represents a character, and each node may mark the end of a word. Tries support insert, search, and prefix queries in O(m) time (m = word length), and are the backbone of autocomplete systems, spell checkers, IP routing tables, and Boggle solvers.
Python
class TrieNode:
__slots__ = ['children', 'is_end', 'count']
def __init__(self):
self.children: dict[str, TrieNode] = {}
self.is_end = False
self.count = 0 # number of words passing through this node
class Trie:
def __init__(self):
self.root = TrieNode()
def insert(self, word: str) -> None:
node = self.root
for ch in word:
if ch not in node.children:
node.children[ch] = TrieNode()
node = node.children[ch]
node.count += 1
node.is_end = True
def search(self, word: str) -> bool:
"""Return True if word is in the trie."""
node = self._find(word)
return node is not None and node.is_end
def starts_with(self, prefix: str) -> bool:
"""Return True if any word starts with prefix."""
return self._find(prefix) is not None
def count_prefix(self, prefix: str) -> int:
"""Return number of words that start with prefix."""
node = self._find(prefix)
return node.count if node else 0
def _find(self, prefix: str):
node = self.root
for ch in prefix:
if ch not in node.children:
return None
node = node.children[ch]
return node
def autocomplete(self, prefix: str, limit: int = 5) -> list[str]:
"""Return up to `limit` words starting with prefix."""
node = self._find(prefix)
if not node:
return []
results = []
def dfs(n, path):
if len(results) >= limit:
return
if n.is_end:
results.append(prefix + "".join(path))
for ch in sorted(n.children):
dfs(n.children[ch], path + [ch])
dfs(node, [])
return results
# Example
trie = Trie()
for word in ["apple", "app", "application", "apt", "bat"]:
trie.insert(word)
print(trie.search("app")) # True
print(trie.starts_with("ap")) # True
print(trie.count_prefix("app")) # 3 (app, apple, application)
print(trie.autocomplete("app")) # ['app', 'apple', 'application']
C++
struct TrieNode {
int children[26];
bool is_end;
int count;
TrieNode() : is_end(false), count(0) {
memset(children, -1, sizeof(children));
}
};
class Trie {
vector<TrieNode> nodes;
public:
Trie() { nodes.emplace_back(); }
void insert(const string& word) {
int cur = 0;
for (char c : word) {
int idx = c - 'a';
if (nodes[cur].children[idx] == -1) {
nodes[cur].children[idx] = nodes.size();
nodes.emplace_back();
}
cur = nodes[cur].children[idx];
nodes[cur].count++;
}
nodes[cur].is_end = true;
}
bool search(const string& word) {
int cur = find(word);
return cur != -1 && nodes[cur].is_end;
}
int countPrefix(const string& prefix) {
int cur = find(prefix);
return cur == -1 ? 0 : nodes[cur].count;
}
private:
int find(const string& s) {
int cur = 0;
for (char c : s) {
int idx = c - 'a';
if (nodes[cur].children[idx] == -1) return -1;
cur = nodes[cur].children[idx];
}
return cur;
}
};
Autocomplete: DFS from the prefix node to enumerate completions.
Spell checker: Trie + edit distance (search within edit distance k from query).
XOR maximum: Binary trie to find the max XOR pair in O(n * 30) (see the sketch after this list).
IP routing: Longest prefix match in routing tables.
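A minimal sketch of the binary-trie XOR trick from the list above, assuming 30-bit non-negative integers (max_xor_pair is our own helper name):
Python
def max_xor_pair(nums: list[int]) -> int:
    """Maximum XOR of any two numbers, via a binary trie over the bits."""
    BITS = 30
    root: dict = {}
    best = 0
    for num in nums:
        # Insert num bit by bit, most significant bit first
        node = root
        for b in range(BITS - 1, -1, -1):
            bit = (num >> b) & 1
            node = node.setdefault(bit, {})
        # Greedily walk the trie, taking the opposite bit whenever possible
        node, cur = root, 0
        for b in range(BITS - 1, -1, -1):
            bit = (num >> b) & 1
            if (1 - bit) in node:
                cur |= 1 << b
                node = node[1 - bit]
            else:
                node = node[bit]
        best = max(best, cur)
    return best

print(max_xor_pair([3, 10, 5, 25, 2, 8]))  # 28 (5 ^ 25)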
Insert/Search/Prefix: O(m) where m is the word/prefix length.
Space: O(total characters * alphabet size) in worst case, but shared prefixes save space.
Many competitive programming and interview problems involve dynamic programming on strings. Here are the most important patterns.
Python
def min_palindrome_cuts(s: str) -> int:
"""Minimum cuts to partition s into palindromes. O(n^2)."""
n = len(s)
# is_pal[i][j] = True if s[i..j] is a palindrome
is_pal = [[False] * n for _ in range(n)]
for i in range(n - 1, -1, -1):
for j in range(i, n):
if s[i] == s[j] and (j - i <= 2 or is_pal[i + 1][j - 1]):
is_pal[i][j] = True
# dp[i] = minimum cuts for s[0..i]
dp = list(range(n)) # worst case: cut after every char
for i in range(n):
if is_pal[0][i]:
dp[i] = 0
else:
for j in range(1, i + 1):
if is_pal[j][i]:
dp[i] = min(dp[i], dp[j - 1] + 1)
return dp[n - 1]
print(min_palindrome_cuts("aab")) # 1 ("aa" | "b")
print(min_palindrome_cuts("abcba")) # 0 (already palindrome)
Python
def is_match(s: str, p: str) -> bool:
"""
Regex matching with '.' (any single char) and '*' (zero or more of preceding).
dp[i][j] = does s[0..i-1] match p[0..j-1]?
"""
m, n = len(s), len(p)
dp = [[False] * (n + 1) for _ in range(m + 1)]
dp[0][0] = True
# Handle patterns like a*, a*b*, a*b*c* that match empty string
for j in range(2, n + 1):
if p[j - 1] == '*':
dp[0][j] = dp[0][j - 2]
for i in range(1, m + 1):
for j in range(1, n + 1):
if p[j - 1] == '*':
# Zero occurrences of preceding element
dp[i][j] = dp[i][j - 2]
# One or more occurrences
if p[j - 2] == '.' or p[j - 2] == s[i - 1]:
dp[i][j] = dp[i][j] or dp[i - 1][j]
elif p[j - 1] == '.' or p[j - 1] == s[i - 1]:
dp[i][j] = dp[i - 1][j - 1]
return dp[m][n]
print(is_match("aab", "c*a*b")) # True
print(is_match("mississippi", "mis*is*ip*.")) # True
Python
def wildcard_match(s: str, p: str) -> bool:
"""
Wildcard matching: '?' matches any single char, '*' matches any sequence.
"""
m, n = len(s), len(p)
dp = [[False] * (n + 1) for _ in range(m + 1)]
dp[0][0] = True
for j in range(1, n + 1):
if p[j - 1] == '*':
dp[0][j] = dp[0][j - 1]
for i in range(1, m + 1):
for j in range(1, n + 1):
if p[j - 1] == '*':
dp[i][j] = dp[i - 1][j] or dp[i][j - 1]
elif p[j - 1] == '?' or p[j - 1] == s[i - 1]:
dp[i][j] = dp[i - 1][j - 1]
return dp[m][n]
print(wildcard_match("adceb", "*a*b")) # True
print(wildcard_match("acdcb", "a*c?b")) # False
Time: O(m * n) where m = string length, n = pattern length.
Space: O(m * n), optimizable to O(n) with rolling array.
These services compare your typed text against a reference using edit distance (Section 4). The core pipeline (a sketch follows the list):
1. Split into words. For each word, compute Levenshtein distance against the expected word.
2. Accuracy = correct_chars / total_chars, where incorrect chars are counted via edit distance.
3. Some services use Damerau-Levenshtein (counting transpositions as 1 edit) since "teh" -> "the" is a common slip.
4. Advanced services also track per-key error rates using character-level diff alignment (similar to LCS backtracking).
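A hedged sketch of steps 1 and 2 (real services differ in the exact formula); it reuses edit_distance from the edit distance section:
Python
def word_level_accuracy(expected: str, typed: str) -> float:
    """Per-word accuracy: 1 - (total edit distance / total expected chars)."""
    exp_words = expected.split()
    typed_words = typed.split()
    total_chars = sum(len(w) for w in exp_words)
    errors = 0
    for i, w in enumerate(exp_words):
        # Missing words count fully against accuracy; extra typed words
        # are ignored in this simplified sketch.
        t = typed_words[i] if i < len(typed_words) else ""
        errors += edit_distance(w, t)  # from the edit distance section
    return max(0.0, 1.0 - errors / total_chars) if total_chars else 1.0

print(f"{word_level_accuracy('the quick brown', 'teh quikc brown'):.1%}")  # 69.2%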
Modern spell checkers combine multiple string algorithms:
1. Trie for dictionary storage and fast lookup.
2. Edit distance to find candidates within distance 1-2 of the misspelled word.
3. BK-trees (metric trees using edit distance) for efficient nearest-neighbor search in large dictionaries.
4. Norvig's spell corrector: generate all strings within edit distance 2, filter by dictionary membership, rank by word frequency. ~80% of misspellings are within edit distance 1.
Pattern matching: KMP/Rabin-Karp to find gene sequences (motifs) in genomes.
Sequence alignment: Edit distance variants (Smith-Waterman for local alignment, Needleman-Wunsch for global; see the sketch after this list) with custom scoring matrices (BLOSUM, PAM).
Repeat finding: Suffix arrays + LCP to find tandem repeats in DNA.
Genome assembly: Overlap graphs built using suffix-prefix matching (Aho-Corasick).
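To illustrate the global-alignment item above, a minimal Needleman-Wunsch sketch with a toy scoring scheme (match +1, mismatch -1, gap -1) rather than a real BLOSUM/PAM matrix:
Python
def needleman_wunsch(a: str, b: str, match: int = 1,
                     mismatch: int = -1, gap: int = -1) -> int:
    """Global alignment score (Needleman-Wunsch) with a simple scoring scheme."""
    m, n = len(a), len(b)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1): dp[i][0] = i * gap
    for j in range(1, n + 1): dp[0][j] = j * gap
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            score = match if a[i - 1] == b[j - 1] else mismatch
            dp[i][j] = max(dp[i - 1][j - 1] + score,  # align a[i-1] with b[j-1]
                           dp[i - 1][j] + gap,        # gap in b
                           dp[i][j - 1] + gap)        # gap in a
    return dp[m][n]

print(needleman_wunsch("GATTACA", "GCATGCU"))  # 0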
Ctrl+F (Find): Uses Boyer-Moore or KMP for exact matching.
Fuzzy search (fzf, VS Code): Substring/subsequence matching + scoring based on match positions, consecutive matches, and word boundaries.
Full-text search (Elasticsearch): Inverted index + n-gram tokenization + edit distance for typo tolerance.
Diff tools (git diff): LCS-based algorithms (Myers diff) to compute minimal edit scripts between files.
Python
from collections import Counter
def spell_corrector(word: str, dictionary: set[str], freq: Counter) -> str:
"""Return most likely correction for a misspelled word."""
if word in dictionary:
return word
def edits1(w):
"""All strings within edit distance 1."""
letters = 'abcdefghijklmnopqrstuvwxyz'
splits = [(w[:i], w[i:]) for i in range(len(w) + 1)]
deletes = [a + b[1:] for a, b in splits if b]
transposes = [a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1]
replaces = [a + c + b[1:] for a, b in splits if b for c in letters]
inserts = [a + c + b for a, b in splits for c in letters]
return set(deletes + transposes + replaces + inserts)
# Try edit distance 1 first, then 2
candidates = {w for w in edits1(word) if w in dictionary}
if not candidates:
candidates = {e2 for e1 in edits1(word)
for e2 in edits1(e1) if e2 in dictionary}
if not candidates:
return word # no correction found
return max(candidates, key=lambda w: freq[w])
# Example usage
dictionary = {"the", "their", "there", "then", "they", "this", "that"}
freq = Counter({"the": 100, "their": 20, "there": 30, "then": 15})
print(spell_corrector("teh", dictionary, freq)) # "the"
print(spell_corrector("thier", dictionary, freq)) # "their"
LC 28: Find the Index of the First Occurrence in a String
LC 214: Shortest Palindrome (KMP trick: reverse + KMP failure function)
LC 459: Repeated Substring Pattern (KMP: check if lps[n-1] divides n)
LC 686: Repeated String Match (Rabin-Karp)
CF 126B: Password (Z-function)
LC 72: Edit Distance
LC 583: Delete Operation for Two Strings (LCS variant)
LC 1143: Longest Common Subsequence
LC 712: Minimum ASCII Delete Sum for Two Strings
LC 97: Interleaving String
LC 115: Distinct Subsequences
LC 5: Longest Palindromic Substring (Manacher's for O(n))
LC 647: Palindromic Substrings (count all)
LC 131: Palindrome Partitioning (backtracking)
LC 132: Palindrome Partitioning II (min cuts DP)
LC 516: Longest Palindromic Subsequence
LC 208: Implement Trie (Prefix Tree)
LC 211: Design Add and Search Words Data Structure
LC 212: Word Search II (trie + backtracking)
LC 421: Maximum XOR of Two Numbers in an Array (binary trie)
LC 745: Prefix and Suffix Search
LC 10: Regular Expression Matching
LC 44: Wildcard Matching
LC 87: Scramble String
LC 1312: Minimum Insertion Steps to Make a String Palindrome
LC 940: Distinct Subsequences II
SPOJ SUBST1: New Distinct Substrings (suffix array + LCP)
CF 271D: Good Substrings (trie)
CF 710F: String Set Queries (Aho-Corasick)
LC 336: Palindrome Pairs (trie or hashing)
POJ 3294: Life Forms (suffix array, longest common substring of multiple strings)
Start with: Edit Distance (LC 72), LCS (LC 1143), Trie (LC 208) -- these appear in interviews constantly.
Then learn: KMP (LC 28, 459), Rabin-Karp, String DP (LC 10, 44).
For competitive programming: Z-algorithm, Suffix Arrays, Aho-Corasick, Manacher's, String Hashing.
Master these and you will be well prepared for almost any string problem you encounter.