Table of Contents

1. DP on Trees
2. Bitmask DP
3. Digit DP
4. Interval DP
5. Knapsack Variants
6. Convex Hull Trick
7. Divide & Conquer Optimization
8. Knuth's Optimization
9. SOS DP (Sum over Subsets)
10. Profile DP
11. Practice Problems

1. DP on Trees

Tree DP exploits the recursive structure of trees. We root the tree and compute values bottom-up (subtree DP), then sometimes push values top-down (rerooting). These techniques solve problems about paths, subtrees, diameters, and optimal placement on trees.

1.1 Subtree DP -- Tree Diameter

The diameter of a tree is the longest path between any two nodes. We compute it in a single DFS: at each node, track the two longest paths descending into children. The diameter through a node is the sum of its two longest child-paths.

Formula

Let depth[v] = longest descending path (in edges) from v, so a leaf has depth 0. At node v with children c1, c2, ..., ck:

depth[v] = 1 + max(depth[ci])

diameter = max over all v of the two largest values of depth[ci] + 1, summed (each +1 counts the edge from v to that child)

Python
import sys
from collections import defaultdict

sys.setrecursionlimit(1 << 20)  # recursive DFS can be O(n) deep on path-like trees

def tree_diameter(n, edges):
    # Build adjacency list
    adj = defaultdict(list)
    for u, v in edges:
        adj[u].append(v)
        adj[v].append(u)

    diameter = 0

    def dfs(node, parent):
        nonlocal diameter
        max1 = max2 = 0  # two longest descending paths
        for child in adj[node]:
            if child == parent:
                continue
            depth = dfs(child, node) + 1
            if depth >= max1:
                max2 = max1
                max1 = depth
            elif depth > max2:
                max2 = depth
        diameter = max(diameter, max1 + max2)
        return max1

    dfs(0, -1)
    return diameter
C++
#include <bits/stdc++.h>
using namespace std;

int diameter;
vector<vector<int>> adj;

int dfs(int node, int parent) {
    int max1 = 0, max2 = 0;
    for (int child : adj[node]) {
        if (child == parent) continue;
        int depth = dfs(child, node) + 1;
        if (depth >= max1) { max2 = max1; max1 = depth; }
        else if (depth > max2) { max2 = depth; }
    }
    diameter = max(diameter, max1 + max2);
    return max1;
}
Complexity

O(n) time and space. A single DFS visits each node exactly once.

1.2 Rerooting Technique

Many tree DP problems ask "compute f(v) for every node v as if v were the root." Naive approach: run DFS from each node = O(n^2). Rerooting does it in O(n) with two passes.

Pass 1 (bottom-up): Root at node 0, compute dp_down[v] = answer when v is root of its subtree.

Pass 2 (top-down): For each node, "reroot" by removing the child's contribution from the parent's answer and combining with the rest of the tree.

Example: Sum of distances to all other nodes

Given a tree, compute for each node v the sum of distances from v to every other node.

Let subtree_size[v] = number of nodes in subtree of v. Let dp_down[v] = sum of distances from v to all nodes in its subtree.

Then dp_down[v] = sum over children c of (dp_down[c] + subtree_size[c]).

For rerooting: ans[child] = ans[parent] + (n - subtree_size[child]) - subtree_size[child] = ans[parent] + n - 2*subtree_size[child].

Python
def sum_of_distances(n, edges):
    adj = [[] for _ in range(n)]
    for u, v in edges:
        adj[u].append(v)
        adj[v].append(u)

    sz = [1] * n
    dp_down = [0] * n

    # Pass 1: bottom-up from root 0
    order = []
    parent = [-1] * n
    visited = [False] * n
    stack = [0]
    visited[0] = True
    while stack:
        node = stack.pop()
        order.append(node)
        for nb in adj[node]:
            if not visited[nb]:
                visited[nb] = True
                parent[nb] = node
                stack.append(nb)

    for node in reversed(order):
        for nb in adj[node]:
            if nb == parent[node]:
                continue
            sz[node] += sz[nb]
            dp_down[node] += dp_down[nb] + sz[nb]

    # Pass 2: top-down rerooting
    ans = [0] * n
    ans[0] = dp_down[0]
    for node in order:
        for nb in adj[node]:
            if nb == parent[node]:
                continue
            ans[nb] = ans[node] + n - 2 * sz[nb]

    return ans
When to Use Rerooting

Rerooting works when the DP combines children's answers using an invertible operation (addition, multiplication, XOR, max with counts). If you can "remove" a child's contribution, you can reroot.

1.3 Subtree DP -- Maximum Path Sum

Given a tree with weighted nodes, find the path (between any two nodes) with maximum sum. This is a generalization of the diameter problem.

C++
#include <bits/stdc++.h>
using namespace std;

long long ans;
vector<vector<int>> adj;
vector<int> val;

long long dfs(int u, int par) {
    long long best = 0;  // best descending path sum (can be 0 = take nothing)
    for (int v : adj[u]) {
        if (v == par) continue;
        long long child_best = dfs(v, u);
        ans = max(ans, best + child_best + val[u]);
        best = max(best, child_best);
    }
    return best + val[u];
}

int main() {
    int n;
    cin >> n;
    adj.resize(n);
    val.resize(n);
    for (int i = 0; i < n; i++) cin >> val[i];
    for (int i = 0; i < n - 1; i++) {
        int u, v; cin >> u >> v; u--; v--;
        adj[u].push_back(v);
        adj[v].push_back(u);
    }
    ans = *max_element(val.begin(), val.end());
    dfs(0, -1);
    cout << ans << "\n";
}

2. Bitmask DP

When the state involves a subset of a small set (n ≤ 20-23), we represent each subset as a bitmask integer. This gives us 2^n states, each encodable in a single int. Classic applications: TSP, assignment problem, Hamiltonian path, and subset enumeration.

Size Constraint

Bitmask DP is feasible only for small n. Memory = O(2^n * n), time = O(2^n * n^2) for TSP. For n=20, 2^20 = ~1M which is fine. For n=25, 2^25 = ~33M which is borderline. Beyond n=25, consider meet-in-middle or other approaches.

2.1 Travelling Salesman Problem (TSP)

Given n cities and distances between every pair, find the shortest route visiting every city exactly once and returning to the start. The classic Held-Karp algorithm solves this in O(2^n * n^2).

Recurrence

dp[mask][i] = minimum cost to visit exactly the cities in mask, ending at city i.

Base: dp[1 << start][start] = 0

Transition: dp[mask | (1 << j)][j] = min(dp[mask][i] + dist[i][j]) for all i in mask where j is not in mask.

Answer: min over all i of dp[(1 << n) - 1][i] + dist[i][start]

Python
def tsp(dist):
    """Held-Karp TSP. dist is n x n matrix. Returns min tour cost."""
    n = len(dist)
    INF = float('inf')
    dp = [[INF] * n for _ in range(1 << n)]
    dp[1][0] = 0  # start at city 0

    for mask in range(1, 1 << n):
        for u in range(n):
            if dp[mask][u] == INF:
                continue
            if not (mask & (1 << u)):
                continue
            for v in range(n):
                if mask & (1 << v):
                    continue
                new_mask = mask | (1 << v)
                dp[new_mask][v] = min(dp[new_mask][v], dp[mask][u] + dist[u][v])

    full = (1 << n) - 1
    return min(dp[full][i] + dist[i][0] for i in range(n))
C++
int tsp(vector<vector<int>>& dist) {
    int n = dist.size();
    const int INF = 1e9;
    vector<vector<int>> dp(1 << n, vector<int>(n, INF));
    dp[1][0] = 0;

    for (int mask = 1; mask < (1 << n); mask++) {
        for (int u = 0; u < n; u++) {
            if (dp[mask][u] == INF || !(mask & (1 << u))) continue;
            for (int v = 0; v < n; v++) {
                if (mask & (1 << v)) continue;
                int nm = mask | (1 << v);
                dp[nm][v] = min(dp[nm][v], dp[mask][u] + dist[u][v]);
            }
        }
    }
    int ans = INF;
    int full = (1 << n) - 1;
    for (int i = 0; i < n; i++)
        ans = min(ans, dp[full][i] + dist[i][0]);
    return ans;
}

2.2 Assignment Problem

Given n workers and n tasks with cost[i][j], assign each worker to exactly one task (and each task to exactly one worker) minimizing total cost. This is equivalent to minimum weight perfect matching in a bipartite graph.

Python
def assignment(cost):
    """Bitmask DP for assignment problem. O(2^n * n)."""
    n = len(cost)
    INF = float('inf')
    dp = [INF] * (1 << n)
    dp[0] = 0

    for mask in range(1 << n):
        if dp[mask] == INF:
            continue
        worker = bin(mask).count('1')  # which worker to assign next
        if worker >= n:
            continue
        for task in range(n):
            if mask & (1 << task):
                continue
            new_mask = mask | (1 << task)
            dp[new_mask] = min(dp[new_mask], dp[mask] + cost[worker][task])

    return dp[(1 << n) - 1]

2.3 Subset Enumeration

A critical bitmask trick: iterating over all submasks of a given mask. Summed over every mask, the total work is O(3^n).

C++
// Enumerate all submasks of mask (including mask itself, excluding 0)
for (int sub = mask; sub > 0; sub = (sub - 1) & mask) {
    // process submask 'sub'
}
// Don't forget to handle sub = 0 separately if needed

// Total work across all masks: sum over mask of 2^popcount(mask) = 3^n
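The same walk in Python, as a small sketch (the helper name `submasks` is ours, not from the original):

```python
def submasks(mask):
    """Yield all nonzero submasks of mask in decreasing order, then 0."""
    sub = mask
    while sub > 0:
        yield sub
        sub = (sub - 1) & mask
    yield 0  # the empty submask, handled explicitly

# Example: submasks of 0b101 are 0b101, 0b100, 0b001, 0b000
print(list(submasks(0b101)))  # [5, 4, 1, 0]
```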
Bit Manipulation Cheatsheet

Check bit i: mask & (1 << i)

Set bit i: mask | (1 << i)

Clear bit i: mask & ~(1 << i)

Toggle bit i: mask ^ (1 << i)

Lowest set bit: mask & (-mask)

Clear lowest set bit: mask & (mask - 1)

Count set bits: __builtin_popcount(mask) in C++, bin(mask).count('1') in Python
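A quick sanity check of the cheatsheet identities in Python (the example mask is arbitrary):

```python
mask = 0b10110

assert mask & (1 << 1)               # check bit 1: set
assert not (mask & (1 << 0))         # check bit 0: clear
assert mask | (1 << 0) == 0b10111    # set bit 0
assert mask & ~(1 << 2) == 0b10010   # clear bit 2
assert mask ^ (1 << 4) == 0b00110    # toggle bit 4
assert mask & (-mask) == 0b00010     # lowest set bit
assert mask & (mask - 1) == 0b10100  # clear lowest set bit
assert bin(mask).count('1') == 3     # count set bits
```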

3. Digit DP

Digit DP counts numbers in a range [L, R] satisfying some property by building the number digit by digit. The key idea is a tight flag: whether the digits chosen so far exactly match the upper bound's prefix (constraining future digits) or are already smaller (free to choose any digit).

General Template

count(R) - count(L-1) gives the answer for [L, R].

State: (position, tight, ...extra state). Position goes from MSB to LSB.

If tight=True, current digit can be 0..digits[pos]. If tight=False, digit can be 0..9.

3.1 Template: Count numbers in [1, N] with digit sum = S

Python
from functools import lru_cache

def count_digit_sum(N, S):
    """Count integers in [1, N] whose digit sum equals S."""
    digits = [int(d) for d in str(N)]
    n = len(digits)

    @lru_cache(maxsize=None)
    def solve(pos, tight, rem, started):
        """
        pos: current digit position (0-indexed from left)
        tight: whether we're still bounded by N
        rem: remaining digit sum needed
        started: whether we've placed a nonzero digit yet
        """
        if rem < 0:
            return 0
        if pos == n:
            return 1 if rem == 0 and started else 0

        limit = digits[pos] if tight else 9
        result = 0
        for d in range(0, limit + 1):
            result += solve(
                pos + 1,
                tight and d == limit,
                rem - d,
                started or d > 0
            )
        return result

    return solve(0, True, S, False)
C++
#include <bits/stdc++.h>
using namespace std;

string num;
int target;
int memo[20][2][200][2];  // supports up to 19 digits and digit sums < 200

int solve(int pos, bool tight, int rem, bool started) {
    if (rem < 0) return 0;
    if (pos == (int)num.size())
        return (rem == 0 && started) ? 1 : 0;

    int& res = memo[pos][tight][rem][started];
    if (res != -1) return res;
    res = 0;

    int limit = tight ? (num[pos] - '0') : 9;
    for (int d = 0; d <= limit; d++) {
        res += solve(pos + 1, tight && (d == limit),
                     rem - d, started || (d > 0));
    }
    return res;
}

int count_digit_sum(long long N, int S) {
    num = to_string(N);
    target = S;
    memset(memo, -1, sizeof memo);
    return solve(0, true, S, false);
}
Digit DP Pattern

Almost every digit DP problem follows this skeleton. The only thing that changes is the "extra state" (digit sum, count of certain digits, modular remainder, etc.) and the base case condition. Master this template and you can solve 90% of digit DP problems.
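As an illustration of swapping the extra state, here is a sketch (not from the original) that replaces the digit-sum state with a modular remainder, counting numbers in [1, N] divisible by k. The answer must equal N // k, which makes the template easy to sanity-check:

```python
from functools import lru_cache

def count_divisible(N, k):
    """Count integers in [1, N] divisible by k via the digit DP template.
    Extra state: the value built so far, mod k (leading zeros keep rem = 0)."""
    digits = [int(d) for d in str(N)]
    n = len(digits)

    @lru_cache(maxsize=None)
    def solve(pos, tight, rem, started):
        if pos == n:
            return 1 if rem == 0 and started else 0
        limit = digits[pos] if tight else 9
        total = 0
        for d in range(limit + 1):
            total += solve(pos + 1, tight and d == limit,
                           (rem * 10 + d) % k, started or d > 0)
        return total

    return solve(0, True, 0, False)

print(count_divisible(1000, 7))  # 142, since 1000 // 7 == 142
```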

Walkthrough: count_digit_sum(123, 5)

We want numbers in [1, 123] with digit sum = 5. The digits array is [1, 2, 3].

At pos=0, tight=True: we can pick d=0 (then free for next digits) or d=1 (stay tight).

If d=0: need remaining sum 5 from 2 digits (00-99). Numbers: 05, 14, 23, 32, 41, 50 = 6 numbers.

If d=1, tight: pos=1, tight=True, rem=4. Can pick d=0..2.

If d=1,0: need 4 from 1 digit (0-9). Only 4. Number: 104.

If d=1,1: need 3 from 1 digit (0-9). Only 3. Number: 113.

If d=1,2, tight: need 2 from 1 digit (0-3). Only 2. Number: 122.

Total: 6 + 3 = 9 numbers.

4. Interval DP

Interval DP operates on contiguous subarrays/substrings. The state is dp[i][j] = optimal answer for the interval [i, j]. We try all ways to split the interval and combine results. Typical time complexity: O(n^3).

4.1 Matrix Chain Multiplication

Given matrices A1(p0 x p1), A2(p1 x p2), ..., An(p_{n-1} x p_n), find the minimum number of scalar multiplications to compute their product.

Recurrence

dp[i][j] = minimum cost to multiply matrices i through j.

dp[i][j] = min over k in [i, j-1] of: dp[i][k] + dp[k+1][j] + p[i-1] * p[k] * p[j] (matrices 1-indexed; the code below uses the equivalent 0-indexed form p[i] * p[k+1] * p[j+1])

Base: dp[i][i] = 0 (single matrix costs nothing).

Python
def matrix_chain(p):
    """p = list of dimensions. n matrices: p[0]xp[1], p[1]xp[2], ..., p[n-1]xp[n]."""
    n = len(p) - 1  # number of matrices
    INF = float('inf')
    dp = [[0] * n for _ in range(n)]

    for length in range(2, n + 1):  # chain length
        for i in range(n - length + 1):
            j = i + length - 1
            dp[i][j] = INF
            for k in range(i, j):
                cost = dp[i][k] + dp[k + 1][j] + p[i] * p[k + 1] * p[j + 1]
                dp[i][j] = min(dp[i][j], cost)

    return dp[0][n - 1]

4.2 Burst Balloons (LeetCode 312)

Given n balloons with values, bursting balloon i gives nums[left]*nums[i]*nums[right]. Find max coins by bursting all balloons. The trick: think of which balloon is burst last in each interval.

Python
def max_coins(nums):
    nums = [1] + nums + [1]  # pad with 1s
    n = len(nums)
    dp = [[0] * n for _ in range(n)]

    for length in range(2, n):  # gap between i and j
        for i in range(n - length):
            j = i + length
            for k in range(i + 1, j):  # k is the LAST balloon burst in (i,j)
                dp[i][j] = max(
                    dp[i][j],
                    dp[i][k] + dp[k][j] + nums[i] * nums[k] * nums[j]
                )

    return dp[0][n - 1]

4.3 Merge Stones

Given n piles of stones, merge exactly k consecutive piles into one each step (cost = sum of merged piles). Find minimum total cost to merge into one pile, or -1 if impossible.

C++
int mergeStones(vector<int>& stones, int K) {
    int n = stones.size();
    if ((n - 1) % (K - 1) != 0) return -1;

    vector<int> prefix(n + 1, 0);
    for (int i = 0; i < n; i++) prefix[i + 1] = prefix[i] + stones[i];

    // dp[i][j] = min cost to merge stones[i..j] as far as possible
    vector<vector<int>> dp(n, vector<int>(n, 0));

    for (int len = K; len <= n; len++) {
        for (int i = 0; i + len <= n; i++) {
            int j = i + len - 1;
            dp[i][j] = 1e9;
            for (int mid = i; mid < j; mid += K - 1) {
                dp[i][j] = min(dp[i][j], dp[i][mid] + dp[mid + 1][j]);
            }
            if ((len - 1) % (K - 1) == 0)
                dp[i][j] += prefix[j + 1] - prefix[i];
        }
    }
    return dp[0][n - 1];
}
Complexity

Matrix chain and burst balloons: O(n^3) time, O(n^2) space. Merge stones with k groups: O(n^3 / k) time.

5. Knapsack Variants

The 0/1 knapsack is foundational DP, but real problems demand variants: unbounded (infinite copies), bounded (limited copies), multi-dimensional (multiple constraints), and meet-in-middle (when n is too large for standard DP but small enough for splitting).

5.1 Unbounded Knapsack

Each item can be used unlimited times. The only change: iterate items in the inner loop or process weights going forward (not backward).

Python
def unbounded_knapsack(W, weights, values):
    """Max value with capacity W, unlimited copies of each item."""
    dp = [0] * (W + 1)
    for w in range(1, W + 1):
        for i in range(len(weights)):
            if weights[i] <= w:
                dp[w] = max(dp[w], dp[w - weights[i]] + values[i])
    return dp[W]
0/1 vs Unbounded

In 0/1 knapsack, iterate capacity right to left (each item used at most once). In unbounded, iterate left to right (reuse allowed). This single direction change is the entire difference.
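For contrast, here is the 0/1 version the note refers to, as a sketch (this section only shows the unbounded code); the only change from `unbounded_knapsack` is the reversed capacity loop:

```python
def knapsack_01(W, weights, values):
    """Max value with capacity W, each item usable at most once."""
    dp = [0] * (W + 1)
    for wt, val in zip(weights, values):
        # Right to left: dp[w - wt] still reflects the state WITHOUT this item
        for w in range(W, wt - 1, -1):
            dp[w] = max(dp[w], dp[w - wt] + val)
    return dp[W]

# Example: capacity 10, items (weight, value) = (5,10), (4,40), (6,30), (3,50)
print(knapsack_01(10, [5, 4, 6, 3], [10, 40, 30, 50]))  # 90 (items of weight 4 and 3)
```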

5.2 Bounded Knapsack (Binary Lifting Trick)

Each item i has a limit c[i]. Naive: treat as c[i] separate items = O(W * sum(c[i])). Binary lifting: decompose c[i] into powers of 2 (1, 2, 4, ..., remainder) to get O(W * n * log(max_c)) items.

C++
int bounded_knapsack(int W, vector<int>& wt, vector<int>& val, vector<int>& cnt) {
    int n = wt.size();
    vector<int> dp(W + 1, 0);

    for (int i = 0; i < n; i++) {
        // Binary decomposition of cnt[i]
        int remaining = cnt[i];
        for (int k = 1; remaining > 0; k *= 2) {
            int batch = min(k, remaining);
            remaining -= batch;
            int bw = wt[i] * batch, bv = val[i] * batch;
            // 0/1 knapsack for this batch (right to left)
            for (int w = W; w >= bw; w--)
                dp[w] = max(dp[w], dp[w - bw] + bv);
        }
    }
    return dp[W];
}

5.3 Multi-Dimensional Knapsack

When there are multiple constraints (weight AND volume, for example), add an extra dimension to the DP table.

Python
def knapsack_2d(W, V, items):
    """items = list of (weight, volume, value). Max value with capacity W and volume V."""
    dp = [[0] * (V + 1) for _ in range(W + 1)]

    for w, v, val in items:
        for cw in range(W, w - 1, -1):
            for cv in range(V, v - 1, -1):
                dp[cw][cv] = max(dp[cw][cv], dp[cw - w][cv - v] + val)

    return dp[W][V]

5.4 Meet-in-the-Middle Knapsack

When n is up to ~40 and W is very large, standard DP is impossible. Split items into two halves, enumerate all 2^(n/2) subsets for each, then combine with binary search. Time: O(2^(n/2) * log(2^(n/2))).

Python
from bisect import bisect_right

def meet_in_middle_knapsack(W, weights, values):
    n = len(weights)
    half = n // 2

    def enumerate_subsets(wts, vals):
        """Return list of (weight, value) for all subsets."""
        m = len(wts)
        result = []
        for mask in range(1 << m):
            tw = tv = 0
            for i in range(m):
                if mask & (1 << i):
                    tw += wts[i]
                    tv += vals[i]
            if tw <= W:
                result.append((tw, tv))
        return result

    left = enumerate_subsets(weights[:half], values[:half])
    right = enumerate_subsets(weights[half:], values[half:])

    # Sort right by weight, build prefix-max of values
    right.sort()
    rw = [r[0] for r in right]
    rv = [r[1] for r in right]
    # Make rv prefix-max: rv[i] = max value among right subsets with weight <= rw[i]
    for i in range(1, len(rv)):
        rv[i] = max(rv[i], rv[i - 1])

    ans = 0
    for lw, lv in left:
        remaining = W - lw
        idx = bisect_right(rw, remaining) - 1
        if idx >= 0:
            ans = max(ans, lv + rv[idx])
        else:
            ans = max(ans, lv)

    return ans
Common Mistakes

1. Forgetting to iterate right-to-left in 0/1 knapsack (accidentally allowing reuse).

2. In meet-in-middle, forgetting to build the prefix-max on the sorted right half.

3. Off-by-one in bounded knapsack binary decomposition (the remainder group).

6. Convex Hull Trick

The Convex Hull Trick (CHT) optimizes DP recurrences of the form: dp[i] = min/max over j < i of (a[j] * x[i] + b[j]), where each previous state j defines a linear function f_j(x) = a[j]*x + b[j], and we query at x = x[i]. If we maintain a convex hull of these lines, each query takes O(log n) or amortized O(1).

When to Use CHT

Your DP recurrence must be expressible as: dp[i] = min/max(m_j * x_i + b_j) + cost(i)

where m_j and b_j depend on j (a previously computed state) and x_i depends only on i.

If slopes m_j are monotone AND query points x_i are monotone, use a simple deque (amortized O(1) per operation). Otherwise, use Li Chao tree (O(log n) per operation).

6.1 Simple CHT with Monotone Slopes and Queries

Python
from collections import deque

class ConvexHullTrick:
    """Maintains lower hull for minimum queries.
    Requirements: lines added in decreasing slope order,
    queries in increasing x order."""

    def __init__(self):
        self.hull = deque()  # list of (slope, intercept)

    def _bad(self, l1, l2, l3):
        """Check if l2 is never optimal between l1 and l3."""
        return (l3[1] - l1[1]) * (l1[0] - l2[0]) <= \
               (l2[1] - l1[1]) * (l1[0] - l3[0])

    def add_line(self, m, b):
        """Add line y = m*x + b. Slopes must be decreasing."""
        line = (m, b)
        while len(self.hull) >= 2 and self._bad(self.hull[-2], self.hull[-1], line):
            self.hull.pop()
        self.hull.append(line)

    def query(self, x):
        """Query minimum y at x. x must be increasing."""
        while len(self.hull) >= 2:
            m1, b1 = self.hull[0]
            m2, b2 = self.hull[1]
            if m1 * x + b1 > m2 * x + b2:
                self.hull.popleft()
            else:
                break
        m, b = self.hull[0]
        return m * x + b
C++
struct Line {
    long long m, b;
    long long eval(long long x) { return m * x + b; }
};

struct CHT {
    deque<Line> hull;

    bool bad(Line l1, Line l2, Line l3) {
        return (__int128)(l3.b - l1.b) * (l1.m - l2.m) <=
               (__int128)(l2.b - l1.b) * (l1.m - l3.m);
    }

    void add_line(long long m, long long b) {
        Line line = {m, b};
        while (hull.size() >= 2 && bad(hull[hull.size()-2], hull.back(), line))
            hull.pop_back();
        hull.push_back(line);
    }

    long long query(long long x) {
        while (hull.size() >= 2 && hull[0].eval(x) >= hull[1].eval(x))
            hull.pop_front();
        return hull.front().eval(x);
    }
};

6.2 Example: Minimize dp[i] = min(dp[j] + a[j]*b[i])

Walkthrough: Fitting a Recurrence to CHT (the pattern behind, e.g., APIO 2014 Sequence)

Suppose dp[i] = min over j < i of (dp[j] + cost(j, i)), where cost(j, i) can be decomposed as m_j * x_i + b_j.

Rewrite: dp[i] = min(m_j * x_i + b_j) where m_j = some function of j's values, b_j = dp[j] + extra, x_i = some function of i.

Add line (m_j, b_j) when computing dp[j]. Query at x_i to compute dp[i]. If slopes decrease and queries increase, the deque approach gives O(n) total.
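A concrete sketch of this pattern on a made-up instance: dp[i] = min over j < i of dp[j] + a[j] * b[i], with slopes a decreasing and query points b increasing so the deque approach applies. The hull logic mirrors the ConvexHullTrick class above, inlined so the snippet runs standalone:

```python
from collections import deque

def min_dp(a, b):
    """dp[i] = min over j < i of dp[j] + a[j]*b[i], dp[0] = 0.
    Assumes a is decreasing and b is increasing (deque CHT preconditions)."""
    n = len(b)
    hull = deque()  # lines (slope, intercept), slopes decreasing

    def bad(l1, l2, l3):
        # True if l2 is never the minimum between l1 and l3
        return (l3[1] - l1[1]) * (l1[0] - l2[0]) <= (l2[1] - l1[1]) * (l1[0] - l3[0])

    def add_line(m, c):
        while len(hull) >= 2 and bad(hull[-2], hull[-1], (m, c)):
            hull.pop()
        hull.append((m, c))

    def query(x):
        while len(hull) >= 2 and \
                hull[0][0] * x + hull[0][1] >= hull[1][0] * x + hull[1][1]:
            hull.popleft()
        return hull[0][0] * x + hull[0][1]

    dp = [0] * n
    add_line(a[0], dp[0])        # state j = 0 becomes a line
    for i in range(1, n):
        dp[i] = query(b[i])      # best dp[j] + a[j] * b[i]
        add_line(a[i], dp[i])    # state i becomes a candidate line
    return dp

print(min_dp([5, 4, 2, 1], [1, 2, 3, 7]))  # [0, 10, 15, 29]
```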

Li Chao Tree

When slopes and queries are NOT monotone, use a Li Chao segment tree. It supports add_line in O(log(MAXX)) and query in O(log(MAXX)). Conceptually simpler than maintaining an explicit hull with binary search.
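A minimal Li Chao tree sketch for minimum queries over an integer domain [lo, hi] (not from the original; one common formulation among several, using a dict for the implicit tree):

```python
import math

class LiChaoTree:
    """Minimum of a set of lines y = m*x + b at integer points in [lo, hi].
    add_line and query are both O(log(hi - lo)); no monotonicity required."""

    def __init__(self, lo, hi):
        self.lo, self.hi = lo, hi
        self.node = {}  # tree index -> (m, b)

    def add_line(self, m, b):
        self._add((m, b), 1, self.lo, self.hi)

    def _add(self, new, idx, lo, hi):
        cur = self.node.get(idx)
        if cur is None:
            self.node[idx] = new
            return
        mid = (lo + hi) // 2
        # Keep whichever line is lower at mid; push the other down
        if new[0] * mid + new[1] < cur[0] * mid + cur[1]:
            self.node[idx], new = new, cur
            cur = self.node[idx]
        if lo == hi:
            return
        # Lines cross at most once, so the loser can win on one side only
        if new[0] * lo + new[1] < cur[0] * lo + cur[1]:
            self._add(new, 2 * idx, lo, mid)
        else:
            self._add(new, 2 * idx + 1, mid + 1, hi)

    def query(self, x):
        res = math.inf
        idx, lo, hi = 1, self.lo, self.hi
        while True:
            cur = self.node.get(idx)
            if cur is not None:
                res = min(res, cur[0] * x + cur[1])
            if lo == hi:
                return res
            mid = (lo + hi) // 2
            if x <= mid:
                idx, lo, hi = 2 * idx, lo, mid
            else:
                idx, lo, hi = 2 * idx + 1, mid + 1, hi

t = LiChaoTree(0, 100)
t.add_line(2, 3)
t.add_line(-1, 10)
print(t.query(0), t.query(10))  # 3 0
```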

7. Divide & Conquer Optimization

This optimization applies to layered recurrences of the form dp[j][i] = min over k < i of (dp[j-1][k] + C(k+1, i)), where j indexes layers (e.g., the number of groups used). Write opt_j(i) for the optimal k in that minimum. The requirement is that opt_j(i) is monotone: opt_j(i) ≤ opt_j(i+1). As i increases, the optimal split point never moves left.

Condition

When the monotonicity holds, compute each layer by divide and conquer over i: solve the middle index mid, record its optimal k, and recurse on [lo, mid-1] and [mid+1, hi] with correspondingly narrowed ranges for k. This reduces O(n^2 * k) to O(n * k * log n).

C++
#include <bits/stdc++.h>
using namespace std;

const long long INF = 1e18;
int N, K;
vector<vector<long long>> dp;

// cost(l, r) = cost of one group covering elements [l, r]
long long cost(int l, int r);

void solve(int layer, int lo, int hi, int opt_lo, int opt_hi) {
    if (lo > hi) return;
    int mid = (lo + hi) / 2;
    long long best = INF;
    int opt = opt_lo;

    for (int k = opt_lo; k <= min(mid - 1, opt_hi); k++) {
        long long val = dp[layer - 1][k] + cost(k + 1, mid);
        if (val < best) {
            best = val;
            opt = k;
        }
    }
    dp[layer][mid] = best;

    solve(layer, lo, mid - 1, opt_lo, opt);
    solve(layer, mid + 1, hi, opt, opt_hi);
}

void compute() {
    dp.assign(K + 1, vector<long long>(N + 1, INF));
    dp[0][0] = 0;

    // Base layer: dp[1][i] = cost(1, i)
    for (int i = 1; i <= N; i++)
        dp[1][i] = cost(1, i);

    for (int j = 2; j <= K; j++)
        solve(j, 1, N, 0, N - 1);
}
Python
import sys
sys.setrecursionlimit(300000)

def dnc_optimize(n, k, cost_fn):
    """
    dp[j][i] = min over m < i of (dp[j-1][m] + cost_fn(m+1, i))
    Returns dp[k][n].
    """
    INF = float('inf')
    prev = [INF] * (n + 1)
    prev[0] = 0
    for i in range(1, n + 1):
        prev[i] = cost_fn(1, i)

    for layer in range(2, k + 1):
        cur = [INF] * (n + 1)

        def rec(lo, hi, opt_lo, opt_hi):
            if lo > hi:
                return
            mid = (lo + hi) // 2
            best = INF
            opt = opt_lo
            for m in range(opt_lo, min(mid, opt_hi + 1)):
                val = prev[m] + cost_fn(m + 1, mid)
                if val < best:
                    best = val
                    opt = m
            cur[mid] = best
            rec(lo, mid - 1, opt_lo, opt)
            rec(mid + 1, hi, opt, opt_hi)

        rec(1, n, 0, n - 1)
        prev = cur

    return prev[n]
How to Verify Monotonicity

Compute the cost function C(l, r). If C satisfies the quadrangle inequality (C(a,c) + C(b,d) ≤ C(a,d) + C(b,c) for a ≤ b ≤ c ≤ d), then the optimal split points are monotone. Common qualifying costs: sum of squares of segment sums, variance-based costs.
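A brute-force checker is a cheap way to vet a candidate cost on small inputs before trusting the optimization (a sketch; C is any callable cost on index pairs l ≤ r):

```python
def satisfies_quadrangle(C, n):
    """Check C(a,c) + C(b,d) <= C(a,d) + C(b,c) for all a <= b <= c <= d < n."""
    for a in range(n):
        for b in range(a, n):
            for c in range(b, n):
                for d in range(c, n):
                    if C(a, c) + C(b, d) > C(a, d) + C(b, c) + 1e-9:
                        return False
    return True

# Sum-of-squares segment cost satisfies the inequality (for nonnegative elements)...
arr = [3, 1, 4, 1, 5, 9, 2, 6]
pre = [0]
for x in arr:
    pre.append(pre[-1] + x)
sq_cost = lambda l, r: (pre[r + 1] - pre[l]) ** 2
print(satisfies_quadrangle(sq_cost, len(arr)))  # True

# ...while a concave cost of segment length does not
sqrt_len = lambda l, r: (r - l) ** 0.5
print(satisfies_quadrangle(sqrt_len, len(arr)))  # False
```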

8. Knuth's Optimization

Knuth's optimization speeds up interval DP from O(n^3) to O(n^2). It applies when the cost function C(i, j) satisfies the quadrangle inequality AND the DP recurrence is: dp[i][j] = min over k in [i, j-1] of (dp[i][k] + dp[k+1][j] + C(i, j)).

Conditions

1. Quadrangle Inequality: C(a,c) + C(b,d) ≤ C(a,d) + C(b,c) for all a ≤ b ≤ c ≤ d.

2. Monotonicity: C(a,c) ≤ C(b,d) for a ≤ b ≤ c ≤ d.

Then opt[i][j-1] ≤ opt[i][j] ≤ opt[i+1][j], limiting the search range for k.

C++
// Knuth's optimization for optimal BST / interval DP
void knuth_dp(int n, vector<vector<long long>>& dp,
              vector<vector<int>>& opt,
              vector<vector<long long>>& C) {
    // Base cases
    for (int i = 0; i < n; i++) {
        dp[i][i] = 0;
        opt[i][i] = i;
    }

    for (int len = 2; len <= n; len++) {
        for (int i = 0; i + len - 1 < n; i++) {
            int j = i + len - 1;
            dp[i][j] = 1e18;

            // Key: search only from opt[i][j-1] to min(opt[i+1][j], j-1)
            // (the min caps the base case opt[j][j] = j, keeping k < j)
            int lo = opt[i][j - 1];
            int hi = min(opt[i + 1][j], j - 1);

            for (int k = lo; k <= hi; k++) {
                long long val = dp[i][k] + dp[k + 1][j] + C[i][j];
                if (val < dp[i][j]) {
                    dp[i][j] = val;
                    opt[i][j] = k;
                }
            }
        }
    }
}
Python
def knuth_dp(n, C):
    """
    Interval DP with Knuth's optimization.
    C[i][j] = cost of interval [i, j].
    Returns dp[0][n-1] = optimal cost.
    """
    INF = float('inf')
    dp = [[0] * n for _ in range(n)]
    opt = [[0] * n for _ in range(n)]

    for i in range(n):
        opt[i][i] = i

    for length in range(2, n + 1):
        for i in range(n - length + 1):
            j = i + length - 1
            dp[i][j] = INF
            lo = opt[i][j - 1]
            hi = min(opt[i + 1][j], j - 1)  # cap at j-1: split k must stay inside [i, j-1]
            for k in range(lo, hi + 1):
                val = dp[i][k] + dp[k + 1][j] + C[i][j]
                if val < dp[i][j]:
                    dp[i][j] = val
                    opt[i][j] = k

    return dp[0][n - 1]
Classic Application: Optimal Binary Search Tree

Given keys with access frequencies f[i], build a BST minimizing weighted search cost. Cost C(i,j) = sum of f[i..j] (prefix sums). This satisfies the quadrangle inequality, so Knuth's optimization applies, giving O(n^2) instead of O(n^3).
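To see the optimization in action, here is a self-contained sketch using the recurrence from the code above with C(i, j) = sum of f[i..j]; the Knuth-bounded search is checked against the unrestricted O(n^3) search, and both must agree:

```python
def interval_dp(freq, use_knuth):
    """dp[i][j] = min over k of dp[i][k] + dp[k+1][j] + sum(freq[i..j])."""
    n = len(freq)
    pre = [0]
    for f in freq:
        pre.append(pre[-1] + f)
    INF = float('inf')
    dp = [[0] * n for _ in range(n)]
    opt = [[0] * n for _ in range(n)]
    for i in range(n):
        opt[i][i] = i
    for length in range(2, n + 1):
        for i in range(n - length + 1):
            j = i + length - 1
            dp[i][j] = INF
            if use_knuth:
                lo, hi = opt[i][j - 1], min(opt[i + 1][j], j - 1)
            else:
                lo, hi = i, j - 1  # full search range
            for k in range(lo, hi + 1):
                val = dp[i][k] + dp[k + 1][j] + pre[j + 1] - pre[i]
                if val < dp[i][j]:
                    dp[i][j] = val
                    opt[i][j] = k
    return dp[0][n - 1]

freq = [4, 2, 6, 3, 1, 5]
print(interval_dp(freq, True), interval_dp(freq, False))  # both print the same cost
```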

9. SOS DP (Sum over Subsets)

SOS DP computes, for every bitmask x: F[x] = sum of A[y] for all y that are submasks of x (y & x == y). The naive approach is O(4^n) (for each mask, enumerate all submasks). SOS DP does it in O(n * 2^n).

Recurrence

Process bits one at a time. Let dp[mask][i] = sum of A[y] over all y that agree with mask on bits i+1..n-1 and are submasks of mask on bits 0..i.

dp[mask][-1] = A[mask]

dp[mask][i] = dp[mask][i-1] + (dp[mask ^ (1<<i)][i-1] if bit i is set in mask, else 0)

F[mask] = dp[mask][n-1]

We can flatten to 1D by processing in-place.

Python
def sos_dp(A, n):
    """
    Given array A of size 2^n, compute F[mask] = sum of A[sub] for all sub submasks of mask.
    Modifies A in-place to become F.
    O(n * 2^n).
    """
    for i in range(n):
        for mask in range(1 << n):
            if mask & (1 << i):
                A[mask] += A[mask ^ (1 << i)]
    return A
C++
void sos_dp(vector<long long>& A, int n) {
    for (int i = 0; i < n; i++) {
        for (int mask = 0; mask < (1 << n); mask++) {
            if (mask & (1 << i))
                A[mask] += A[mask ^ (1 << i)];
        }
    }
}
Walkthrough: n=3, A = [1, 2, 3, 4, 5, 6, 7, 8]

Indices 0..7 in binary: 000, 001, 010, 011, 100, 101, 110, 111.

After processing bit 0: A[001]+=A[000]=3, A[011]+=A[010]=7, A[101]+=A[100]=11, A[111]+=A[110]=15.

After processing bit 1: A[010]+=A[000]=4, A[011]+=A[001]=10, A[110]+=A[100]=12, A[111]+=A[101]=26.

After processing bit 2: A[100]+=A[000]=6, A[101]+=A[001]=14, A[110]+=A[010]=16, A[111]+=A[011]=36.

F[111] = A[000]+A[001]+A[010]+A[011]+A[100]+A[101]+A[110]+A[111] = 1+2+3+4+5+6+7+8 = 36. Correct.

9.1 Superset Sum (Inverse Direction)

To compute G[mask] = sum of A[y] for all y that are supermasks of mask (mask is a submask of y), simply flip the condition:

Python
def superset_sum(A, n):
    """G[mask] = sum of A[sup] for all sup where mask is submask of sup."""
    for i in range(n):
        for mask in range(1 << n):
            if not (mask & (1 << i)):
                A[mask] += A[mask | (1 << i)]
    return A
Applications of SOS DP

1. Count pairs (i, j) where a[i] AND a[j] = a[i] (a[i] is submask of a[j]).

2. Maximum/minimum over all submasks.

3. Subset convolution (combine with inclusion-exclusion and popcount stratification).

4. Competitive programming: CF 165E, ARC 100E, many Div1 D/E problems.
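Application 2 above works with the identical sweep, since max (like sum) combines submask contributions one bit at a time; a sketch:

```python
def max_over_submasks(A, n):
    """F[mask] = max of A[sub] over all submasks sub of mask. O(n * 2^n)."""
    F = A[:]
    for i in range(n):
        for mask in range(1 << n):
            if mask & (1 << i):
                F[mask] = max(F[mask], F[mask ^ (1 << i)])
    return F

A = [1, 8, 3, 4, 5, 6, 7, 2]
print(max_over_submasks(A, 3))  # [1, 8, 3, 8, 5, 8, 7, 8]
```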

10. Profile DP (Broken Profile)

Profile DP handles grid tiling problems: "In how many ways can you tile an m x n grid with dominoes (1x2 tiles)?" The idea is to process the grid column by column (or row by row), with the DP state being the "profile" -- a bitmask describing which cells in the current column boundary are already filled by a domino from the previous column.

State

dp[col][mask] = number of ways to tile columns 0..col such that the boundary between col and col+1 has the given mask (bit i = 1 means row i is already covered by a horizontal domino extending from col to col+1).

For an m x n grid: 2^m states per column, n columns. Total: O(n * 2^m).

10.1 Domino Tiling of m x n Grid

Python
def domino_tiling(m, n):
    """Count ways to tile m x n grid with 1x2 dominoes. m should be small (<= 20)."""
    if (m * n) % 2 != 0:
        return 0  # odd total cells = impossible

    # Ensure m <= n (use m as the small dimension)
    if m > n:
        m, n = n, m

    full = (1 << m) - 1

    def generate_transitions(row, cur_mask, next_mask):
        """
        Try to fill row 'row' of current column.
        cur_mask: which rows are already filled in current column.
        next_mask: which rows will be filled in next column (by horizontal dominos).
        Returns list of valid next_masks.
        """
        if row == m:
            if cur_mask == full:
                yield next_mask
            return

        if cur_mask & (1 << row):
            # Row already filled, skip
            yield from generate_transitions(row + 1, cur_mask, next_mask)
        else:
            # Option 1: place horizontal domino (fills this cell,
            # extends into next column)
            yield from generate_transitions(
                row + 1, cur_mask | (1 << row), next_mask | (1 << row)
            )
            # Option 2: place vertical domino (fills this row and next row)
            if row + 1 < m and not (cur_mask & (1 << (row + 1))):
                yield from generate_transitions(
                    row + 2, cur_mask | (1 << row) | (1 << (row + 1)),
                    next_mask
                )

    # Precompute transitions
    trans = {}
    for mask in range(1 << m):
        trans[mask] = list(generate_transitions(0, mask, 0))

    # DP
    dp = [0] * (1 << m)
    dp[0] = 1  # empty profile before first column

    for col in range(n):
        new_dp = [0] * (1 << m)
        for mask in range(1 << m):
            if dp[mask] == 0:
                continue
            for next_mask in trans[mask]:
                new_dp[next_mask] += dp[mask]
        dp = new_dp

    return dp[0]  # no overhanging dominoes after last column
C++
#include <bits/stdc++.h>
using namespace std;

int M;
vector<vector<int>> trans;

void gen(int row, int cur, int nxt, vector<int>& out) {
    if (row == M) {
        if (cur == (1 << M) - 1) out.push_back(nxt);
        return;
    }
    if (cur & (1 << row)) {
        gen(row + 1, cur, nxt, out);
    } else {
        // horizontal: covers this cell and extends into the next column
        gen(row + 1, cur | (1 << row), nxt | (1 << row), out);
        // vertical: fill two rows in current column
        if (row + 1 < M && !(cur & (1 << (row + 1))))
            gen(row + 2, cur | (1 << row) | (1 << (row + 1)), nxt, out);
    }
}

long long domino_tiling(int m, int n) {
    if (((long long)m * n) % 2) return 0;
    if (m > n) swap(m, n);
    M = m;

    trans.assign(1 << m, vector<int>());
    for (int mask = 0; mask < (1 << m); mask++)
        gen(0, mask, 0, trans[mask]);

    vector<long long> dp(1 << m, 0);
    dp[0] = 1;

    for (int col = 0; col < n; col++) {
        vector<long long> ndp(1 << m, 0);
        for (int mask = 0; mask < (1 << m); mask++) {
            if (!dp[mask]) continue;
            for (int nm : trans[mask])
                ndp[nm] += dp[mask];
        }
        dp = ndp;
    }
    return dp[0];
}
Walkthrough: 2 x 3 Grid

m=2, n=3. Full mask = 11 (binary) = 3.

Transitions from mask=00: can place vertical (fills both rows) -> next=00. Or two horizontals -> next=11.

Transitions from mask=11: both rows filled, nothing to place -> next=00.

Transitions from mask=01: row 0 filled, row 1 empty. Place horizontal on row 1 -> next=10. (No vertical since only one row empty.)

Transitions from mask=10: row 1 filled, row 0 empty. Place horizontal on row 0 -> next=01.

DP: col 0: dp[00]=1 -> ndp[00]+=1 (vertical), ndp[11]+=1 (two horizontal). dp=[1,0,0,1].

col 1: from 00: ndp[00]+=1, ndp[11]+=1. From 11: ndp[00]+=1. dp=[2,0,0,1].

col 2: from 00: ndp[00]+=2, ndp[11]+=2. From 11: ndp[00]+=1. dp=[3,0,0,2].

Answer: dp[0] = 3. The three tilings are: three vertical dominoes; a vertical domino in column 0 followed by two horizontals spanning columns 1-2; and two horizontals spanning columns 0-1 followed by a vertical domino in column 2.

Profile DP Variants

The same technique works for: tiling with L-trominoes, tiling with 1x3 bars, counting Hamiltonian paths in grid graphs, and any grid problem where the "interaction" between adjacent columns is bounded.
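The column-at-a-time DP above can also be written cell by cell, which is the "broken profile" formulation proper: the mask tracks the fill state of the m cells along a moving frontier, and each step handles a single cell with at most three choices. A minimal sketch (the function name is illustrative; I use a dict so only reachable profiles are stored):

```python
def domino_tiling_cellwise(m, n):
    """Count domino tilings of an m x n grid, one cell at a time.
    Bit i of the mask = the cell i positions ahead of the current one
    (in column-major order) is already covered."""
    if (m * n) % 2:
        return 0
    if m > n:
        m, n = n, m  # keep the mask width small
    dp = {0: 1}
    for col in range(n):
        for row in range(m):
            ndp = {}
            for mask, ways in dp.items():
                if mask & 1:
                    # current cell already covered; just advance the frontier
                    k = mask >> 1
                    ndp[k] = ndp.get(k, 0) + ways
                else:
                    # horizontal domino: also covers the cell m ahead
                    # (same row, next column)
                    k = (mask >> 1) | (1 << (m - 1))
                    ndp[k] = ndp.get(k, 0) + ways
                    # vertical domino: also covers the cell just below
                    if row + 1 < m and not (mask & 2):
                        k = (mask | 2) >> 1
                        ndp[k] = ndp.get(k, 0) + ways
            dp = ndp
    # a nonzero final mask would mean a horizontal domino overhangs the grid
    return dp.get(0, 0)
```

This reproduces the walkthrough's answer of 3 for the 2 x 3 grid; the 4 x 4 grid gives 36. The cell-by-cell form is often easier to adapt to the variants listed above, since each step only has to consider how one cell is covered.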

11. Practice Problems

Organized by topic with approximate difficulty. Solve these to master advanced DP.

DP on Trees

Bitmask DP

Digit DP

Interval DP

Knapsack Variants

Convex Hull Trick & Optimizations

SOS DP & Profile DP

Study Order

If you are new to advanced DP, tackle them in this order: (1) Tree DP -- natural extension of basic DP. (2) Bitmask DP -- builds intuition for state compression. (3) Interval DP -- classic pattern, appears everywhere. (4) Knapsack variants -- deepens your understanding of the fundamental DP. (5) Digit DP -- template-based, very learnable. (6) SOS DP + Profile DP -- bitmask mastery. (7) CHT, DnC opt, Knuth's opt -- these are optimization techniques that require solid DP foundations first.

Contest Strategy

In a contest, identifying the DP type is half the battle. Look for these signals: "tree structure" = tree DP; "n <= 20" = bitmask DP; "count numbers in range [L,R] with property" = digit DP; "merge intervals / break string" = interval DP; "minimize cost of partitioning into k groups" = DnC optimization or Knuth's; "subset/superset queries" = SOS DP. Once you identify the type, apply the template and focus on defining the cost function correctly.