Beyond the basics: tree DP, bitmask DP, digit DP, interval DP, knapsack variants, convex hull trick, divide & conquer optimization, Knuth's optimization, SOS DP, and profile DP. Every technique you need to solve the hardest DP problems in competitive programming -- with full code in Python and C++.
Tree DP exploits the recursive structure of trees. We root the tree and compute values bottom-up (subtree DP), then sometimes push values top-down (rerooting). These techniques solve problems about paths, subtrees, diameters, and optimal placement on trees.
The diameter of a tree is the longest path between any two nodes. We compute it in a single DFS: at each node, track the two longest paths descending into children. The diameter through a node is the sum of its two longest child-paths.
Let depth[v] = number of edges on the longest descending path starting at v (0 for a leaf). At node v with children c1, c2, ..., ck:
depth[v] = 1 + max(depth[ci]) (0 if v is a leaf)
diameter = max over all v of the sum of the two largest values of (depth[ci] + 1)
Python
import sys
from collections import defaultdict
def tree_diameter(n, edges):
# Build adjacency list
adj = defaultdict(list)
for u, v in edges:
adj[u].append(v)
adj[v].append(u)
diameter = 0
def dfs(node, parent):
nonlocal diameter
max1 = max2 = 0 # two longest descending paths
for child in adj[node]:
if child == parent:
continue
depth = dfs(child, node) + 1
if depth >= max1:
max2 = max1
max1 = depth
elif depth > max2:
max2 = depth
diameter = max(diameter, max1 + max2)
return max1
dfs(0, -1)
return diameter
C++
#include <bits/stdc++.h>
using namespace std;
int diameter;
vector<vector<int>> adj;
int dfs(int node, int parent) {
int max1 = 0, max2 = 0;
for (int child : adj[node]) {
if (child == parent) continue;
int depth = dfs(child, node) + 1;
if (depth >= max1) { max2 = max1; max1 = depth; }
else if (depth > max2) { max2 = depth; }
}
diameter = max(diameter, max1 + max2);
return max1;
}
O(n) time and space. A single DFS visits each node exactly once.
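A well-known alternative worth keeping in your toolkit, especially when recursion depth is a concern: BFS from any node to find the farthest node u (an endpoint of some diameter), then BFS from u; the largest distance found is the diameter. A minimal sketch (the function name is mine, not from the code above):

```python
from collections import deque

def tree_diameter_bfs(n, edges):
    """Diameter via two BFS passes: the farthest node from an arbitrary
    start is one endpoint of a diameter; BFS again from that endpoint."""
    adj = [[] for _ in range(n)]
    for u, v in edges:
        adj[u].append(v)
        adj[v].append(u)

    def bfs(src):
        dist = [-1] * n
        dist[src] = 0
        q = deque([src])
        while q:
            x = q.popleft()
            for y in adj[x]:
                if dist[y] == -1:
                    dist[y] = dist[x] + 1
                    q.append(y)
        far = max(range(n), key=lambda i: dist[i])
        return far, dist[far]

    u, _ = bfs(0)   # u: endpoint of some diameter
    _, d = bfs(u)   # farthest distance from u = diameter
    return d

# Path 2-1-3-4 has 3 edges:
print(tree_diameter_bfs(5, [(0, 1), (1, 2), (1, 3), (3, 4)]))  # 3
```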
Many tree DP problems ask "compute f(v) for every node v as if v were the root." Naive approach: run DFS from each node = O(n^2). Rerooting does it in O(n) with two passes.
Pass 1 (bottom-up): Root at node 0, compute dp_down[v] = answer when v is root of its subtree.
Pass 2 (top-down): For each node, "reroot" by removing the child's contribution from the parent's answer and combining with the rest of the tree.
Given a tree, compute for each node v the sum of distances from v to every other node.
Let subtree_size[v] = number of nodes in subtree of v. Let dp_down[v] = sum of distances from v to all nodes in its subtree.
Then dp_down[v] = sum over children c of (dp_down[c] + subtree_size[c]).
For rerooting: ans[child] = ans[parent] + (n - subtree_size[child]) - subtree_size[child] = ans[parent] + n - 2*subtree_size[child].
Python
def sum_of_distances(n, edges):
adj = [[] for _ in range(n)]
for u, v in edges:
adj[u].append(v)
adj[v].append(u)
sz = [1] * n
dp_down = [0] * n
# Pass 1: bottom-up from root 0
order = []
parent = [-1] * n
visited = [False] * n
stack = [0]
visited[0] = True
while stack:
node = stack.pop()
order.append(node)
for nb in adj[node]:
if not visited[nb]:
visited[nb] = True
parent[nb] = node
stack.append(nb)
for node in reversed(order):
for nb in adj[node]:
if nb == parent[node]:
continue
sz[node] += sz[nb]
dp_down[node] += dp_down[nb] + sz[nb]
# Pass 2: top-down rerooting
ans = [0] * n
ans[0] = dp_down[0]
for node in order:
for nb in adj[node]:
if nb == parent[node]:
continue
ans[nb] = ans[node] + n - 2 * sz[nb]
return ans
Rerooting works when the DP combines children's answers using an invertible operation (addition, multiplication, XOR, max with counts). If you can "remove" a child's contribution, you can reroot.
Given a tree with weighted nodes, find the path (between any two nodes) with maximum sum. This is a generalization of the diameter problem.
C++
#include <bits/stdc++.h>
using namespace std;
long long ans;
vector<vector<int>> adj;
vector<int> val;
long long dfs(int u, int par) {
long long best = 0; // best descending path sum (can be 0 = take nothing)
for (int v : adj[u]) {
if (v == par) continue;
long long child_best = dfs(v, u);
ans = max(ans, best + child_best + val[u]);
best = max(best, child_best);
}
return best + val[u];
}
int main() {
int n;
cin >> n;
adj.resize(n);
val.resize(n);
for (int i = 0; i < n; i++) cin >> val[i];
for (int i = 0; i < n - 1; i++) {
int u, v; cin >> u >> v; u--; v--;
adj[u].push_back(v);
adj[v].push_back(u);
}
ans = *max_element(val.begin(), val.end());
dfs(0, -1);
cout << ans << "\n";
}
When the state involves a subset of a small set (n ≤ 20-23), we represent each subset as a bitmask integer. This gives us 2^n states, each encodable in a single int. Classic applications: TSP, assignment problem, Hamiltonian path, and subset enumeration.
Bitmask DP is feasible only for small n. Memory = O(2^n * n), time = O(2^n * n^2) for TSP. For n=20, 2^20 = ~1M which is fine. For n=25, 2^25 = ~33M which is borderline. Beyond n=25, consider meet-in-middle or other approaches.
Given n cities and distances between every pair, find the shortest route visiting every city exactly once and returning to the start. The classic Held-Karp algorithm solves this in O(2^n * n^2).
dp[mask][i] = minimum cost to visit exactly the cities in mask, ending at city i.
Base: dp[1 << start][start] = 0
Transition: dp[mask | (1 << j)][j] = min(dp[mask][i] + dist[i][j]) for all i in mask where j is not in mask.
Answer: min over all i of dp[(1 << n) - 1][i] + dist[i][start]
Python
def tsp(dist):
"""Held-Karp TSP. dist is n x n matrix. Returns min tour cost."""
n = len(dist)
INF = float('inf')
dp = [[INF] * n for _ in range(1 << n)]
dp[1][0] = 0 # start at city 0
for mask in range(1, 1 << n):
for u in range(n):
if dp[mask][u] == INF:
continue
if not (mask & (1 << u)):
continue
for v in range(n):
if mask & (1 << v):
continue
new_mask = mask | (1 << v)
dp[new_mask][v] = min(dp[new_mask][v], dp[mask][u] + dist[u][v])
full = (1 << n) - 1
return min(dp[full][i] + dist[i][0] for i in range(n))
C++
int tsp(vector<vector<int>>& dist) {
int n = dist.size();
const int INF = 1e9;
vector<vector<int>> dp(1 << n, vector<int>(n, INF));
dp[1][0] = 0;
for (int mask = 1; mask < (1 << n); mask++) {
for (int u = 0; u < n; u++) {
if (dp[mask][u] == INF || !(mask & (1 << u))) continue;
for (int v = 0; v < n; v++) {
if (mask & (1 << v)) continue;
int nm = mask | (1 << v);
dp[nm][v] = min(dp[nm][v], dp[mask][u] + dist[u][v]);
}
}
}
int ans = INF;
int full = (1 << n) - 1;
for (int i = 0; i < n; i++)
ans = min(ans, dp[full][i] + dist[i][0]);
return ans;
}
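To sanity-check either Held-Karp implementation, compare it against brute force over all permutations on a tiny instance. The 4-city matrix below is a made-up example; its optimal tour cost is 80 (e.g. 0 -> 1 -> 3 -> 2 -> 0):

```python
from itertools import permutations

def tsp_brute(dist):
    """Reference answer: try every tour order starting (and ending) at city 0."""
    n = len(dist)
    best = float('inf')
    for perm in permutations(range(1, n)):
        tour = (0,) + perm + (0,)
        cost = sum(dist[tour[i]][tour[i + 1]] for i in range(n))
        best = min(best, cost)
    return best

dist = [
    [0, 10, 15, 20],
    [10, 0, 35, 25],
    [15, 35, 0, 30],
    [20, 25, 30, 0],
]
print(tsp_brute(dist))  # 80
```

Brute force is O(n!) and only usable up to n ≈ 10, but that is plenty to stress-test a DP against random small matrices.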
Given n workers and n tasks with cost[i][j], assign each worker to exactly one task (and each task to exactly one worker) minimizing total cost. This is equivalent to minimum weight perfect matching in a bipartite graph.
Python
def assignment(cost):
"""Bitmask DP for assignment problem. O(2^n * n)."""
n = len(cost)
INF = float('inf')
dp = [INF] * (1 << n)
dp[0] = 0
for mask in range(1 << n):
if dp[mask] == INF:
continue
worker = bin(mask).count('1') # which worker to assign next
if worker >= n:
continue
for task in range(n):
if mask & (1 << task):
continue
new_mask = mask | (1 << task)
dp[new_mask] = min(dp[new_mask], dp[mask] + cost[worker][task])
return dp[(1 << n) - 1]
A critical bitmask trick: iterating over all submasks of a given mask in O(3^n) total across all masks.
C++
// Enumerate all submasks of mask (including mask itself, excluding 0)
for (int sub = mask; sub > 0; sub = (sub - 1) & mask) {
// process submask 'sub'
}
// Don't forget to handle sub = 0 separately if needed
// Total work across all masks: sum over mask of 2^popcount(mask) = 3^n
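The same loop in Python, with a quick empirical check of the 3^n bound (my generator also yields the empty submask 0, so each mask contributes exactly 2^popcount(mask)):

```python
def submasks(mask):
    """All submasks of mask, including mask itself and 0."""
    sub = mask
    while sub > 0:
        yield sub
        sub = (sub - 1) & mask
    yield 0

n = 4
total = sum(len(list(submasks(m))) for m in range(1 << n))
# sum over mask of 2^popcount(mask) = 3^n
print(total)  # 81 for n = 4
```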
Check bit i: mask & (1 << i)
Set bit i: mask | (1 << i)
Clear bit i: mask & ~(1 << i)
Toggle bit i: mask ^ (1 << i)
Lowest set bit: mask & (-mask)
Clear lowest set bit: mask & (mask - 1)
Count set bits: __builtin_popcount(mask) in C++, bin(mask).count('1') in Python
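All of these identities are easy to verify interactively; here is each one checked in Python on an example mask:

```python
mask = 0b10110

assert mask & (1 << 1)                 # check: bit 1 is set
assert not (mask & (1 << 0))           # check: bit 0 is clear
assert mask | (1 << 0) == 0b10111      # set bit 0
assert mask & ~(1 << 1) == 0b10100     # clear bit 1
assert mask ^ (1 << 2) == 0b10010      # toggle bit 2
assert mask & (-mask) == 0b00010       # lowest set bit
assert mask & (mask - 1) == 0b10100    # clear lowest set bit
assert bin(mask).count('1') == 3       # popcount
print("all bit tricks verified")
```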
Digit DP counts numbers in a range [L, R] satisfying some property by building the number digit by digit. The key idea is a tight flag: whether the digits chosen so far exactly match the upper bound's prefix (constraining future digits) or are already smaller (free to choose any digit).
count(R) - count(L-1) gives the answer for [L, R].
State: (position, tight, ...extra state). Position goes from MSB to LSB.
If tight=True, current digit can be 0..digits[pos]. If tight=False, digit can be 0..9.
Python
from functools import lru_cache
def count_digit_sum(N, S):
"""Count integers in [1, N] whose digit sum equals S."""
digits = [int(d) for d in str(N)]
n = len(digits)
@lru_cache(maxsize=None)
def solve(pos, tight, rem, started):
"""
pos: current digit position (0-indexed from left)
tight: whether we're still bounded by N
rem: remaining digit sum needed
started: whether we've placed a nonzero digit yet
"""
if rem < 0:
return 0
if pos == n:
return 1 if rem == 0 and started else 0
limit = digits[pos] if tight else 9
result = 0
for d in range(0, limit + 1):
result += solve(
pos + 1,
tight and d == limit,
rem - d,
started or d > 0
)
return result
return solve(0, True, S, False)
C++
#include <bits/stdc++.h>
using namespace std;
string num;
int target;
int memo[20][2][200][2];
int solve(int pos, bool tight, int rem, bool started) {
if (rem < 0) return 0;
if (pos == (int)num.size())
return (rem == 0 && started) ? 1 : 0;
int& res = memo[pos][tight][rem][started];
if (res != -1) return res;
res = 0;
int limit = tight ? (num[pos] - '0') : 9;
for (int d = 0; d <= limit; d++) {
res += solve(pos + 1, tight && (d == limit),
rem - d, started || (d > 0));
}
return res;
}
int count_digit_sum(long long N, int S) {
num = to_string(N);
target = S;
memset(memo, -1, sizeof memo);
return solve(0, true, S, false);
}
Almost every digit DP problem follows this skeleton. The only thing that changes is the "extra state" (digit sum, count of certain digits, modular remainder, etc.) and the base case condition. Master this template and you can solve 90% of digit DP problems.
We want numbers in [1, 123] with digit sum = 5. The digits array is [1, 2, 3].
At pos=0, tight=True: we can pick d=0 (then free for next digits) or d=1 (stay tight).
If d=0: need remaining sum 5 from 2 digits (00-99). Numbers: 05, 14, 23, 32, 41, 50 = 6 numbers.
If d=1, tight: pos=1, tight=True, rem=4. Can pick d=0..2.
If d=1,0: need 4 from 1 digit (0-9). Only 4. Number: 104.
If d=1,1: need 3 from 1 digit (0-9). Only 3. Number: 113.
If d=1,2, tight: need 2 from 1 digit (0-3). Only 2. Number: 122.
Total: 6 + 3 = 9 numbers.
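To see how little of the skeleton changes, here is the same template with the extra state swapped from digit sum to remainder mod 3. This is a toy problem (the closed form is just N // 3), which makes it easy to verify:

```python
from functools import lru_cache

def count_divisible_by_3(N):
    """Count integers in [1, N] divisible by 3, via the digit DP skeleton."""
    digits = [int(d) for d in str(N)]
    n = len(digits)

    @lru_cache(maxsize=None)
    def solve(pos, tight, rem, started):
        if pos == n:
            return 1 if rem == 0 and started else 0
        limit = digits[pos] if tight else 9
        total = 0
        for d in range(limit + 1):
            total += solve(pos + 1, tight and d == limit,
                           (rem * 10 + d) % 3,  # extra state: value mod 3
                           started or d > 0)
        return total

    return solve(0, True, 0, False)

print(count_divisible_by_3(100))  # 33, matching 100 // 3
```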
Interval DP operates on contiguous subarrays/substrings. The state is dp[i][j] = optimal answer for the interval [i, j]. We try all ways to split the interval and combine results. Typical time complexity: O(n^3).
Given matrices A1(p0 x p1), A2(p1 x p2), ..., An(p_{n-1} x p_n), find the minimum number of scalar multiplications to compute their product.
dp[i][j] = minimum cost to multiply matrices i through j.
dp[i][j] = min over k in [i, j-1] of: dp[i][k] + dp[k+1][j] + p[i-1] * p[k] * p[j] (matrices 1-indexed here; the code below uses the 0-indexed equivalent p[i] * p[k+1] * p[j+1]).
Base: dp[i][i] = 0 (single matrix costs nothing).
Python
def matrix_chain(p):
"""p = list of dimensions. n matrices: p[0]xp[1], p[1]xp[2], ..., p[n-1]xp[n]."""
n = len(p) - 1 # number of matrices
INF = float('inf')
dp = [[0] * n for _ in range(n)]
for length in range(2, n + 1): # chain length
for i in range(n - length + 1):
j = i + length - 1
dp[i][j] = INF
for k in range(i, j):
cost = dp[i][k] + dp[k + 1][j] + p[i] * p[k + 1] * p[j + 1]
dp[i][j] = min(dp[i][j], cost)
return dp[0][n - 1]
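A hand-checkable case for testing: with p = [10, 30, 5, 60], the order (A1A2)A3 costs 10*30*5 + 10*5*60 = 4500, while A1(A2A3) costs 27000. A self-contained memoized recursion (a reference to check the table version against, not part of the original code) confirms the minimum:

```python
from functools import lru_cache

def matrix_chain_naive(p):
    """Recursive reference solution over split points; fine for small inputs."""
    @lru_cache(maxsize=None)
    def go(i, j):  # min cost to multiply matrices i..j (0-indexed)
        if i == j:
            return 0
        return min(go(i, k) + go(k + 1, j) + p[i] * p[k + 1] * p[j + 1]
                   for k in range(i, j))
    return go(0, len(p) - 2)

print(matrix_chain_naive([10, 30, 5, 60]))  # 4500
```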
Given n balloons with values, bursting balloon i gives nums[left]*nums[i]*nums[right]. Find max coins by bursting all balloons. The trick: think of which balloon is burst last in each interval.
Python
def max_coins(nums):
nums = [1] + nums + [1] # pad with 1s
n = len(nums)
dp = [[0] * n for _ in range(n)]
for length in range(2, n): # gap between i and j
for i in range(n - length):
j = i + length
for k in range(i + 1, j): # k is the LAST balloon burst in (i,j)
dp[i][j] = max(
dp[i][j],
dp[i][k] + dp[k][j] + nums[i] * nums[k] * nums[j]
)
return dp[0][n - 1]
Given n piles of stones, merge exactly k consecutive piles into one each step (cost = sum of merged piles). Find minimum total cost to merge into one pile, or -1 if impossible.
C++
int mergeStones(vector<int>& stones, int K) {
int n = stones.size();
if ((n - 1) % (K - 1) != 0) return -1;
vector<int> prefix(n + 1, 0);
for (int i = 0; i < n; i++) prefix[i + 1] = prefix[i] + stones[i];
// dp[i][j] = min cost to merge stones[i..j] as far as possible
vector<vector<int>> dp(n, vector<int>(n, 0));
for (int len = K; len <= n; len++) {
for (int i = 0; i + len <= n; i++) {
int j = i + len - 1;
dp[i][j] = 1e9;
for (int mid = i; mid < j; mid += K - 1) {
dp[i][j] = min(dp[i][j], dp[i][mid] + dp[mid + 1][j]);
}
if ((len - 1) % (K - 1) == 0)
dp[i][j] += prefix[j + 1] - prefix[i];
}
}
return dp[0][n - 1];
}
Matrix chain and burst balloons: O(n^3) time, O(n^2) space. Merge stones with k groups: O(n^3 / k) time.
The 0/1 knapsack is foundational DP, but real problems demand variants: unbounded (infinite copies), bounded (limited copies), multi-dimensional (multiple constraints), and meet-in-middle (when n is too large for standard DP but small enough for splitting).
Each item can be used unlimited times. The only change: iterate items in the inner loop or process weights going forward (not backward).
Python
def unbounded_knapsack(W, weights, values):
"""Max value with capacity W, unlimited copies of each item."""
dp = [0] * (W + 1)
for w in range(1, W + 1):
for i in range(len(weights)):
if weights[i] <= w:
dp[w] = max(dp[w], dp[w - weights[i]] + values[i])
return dp[W]
In 0/1 knapsack, iterate capacity right to left (each item used at most once). In unbounded, iterate left to right (reuse allowed). This single direction change is the entire difference.
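A minimal demonstration of the direction difference, using one made-up item of weight 2 and value 3 at capacity 4:

```python
W, wt, val = 4, 2, 3

# 0/1: right-to-left, so dp[w - wt] never already includes this item
dp01 = [0] * (W + 1)
for w in range(W, wt - 1, -1):
    dp01[w] = max(dp01[w], dp01[w - wt] + val)

# Unbounded: left-to-right, so dp[w - wt] may already include the item
dpub = [0] * (W + 1)
for w in range(wt, W + 1):
    dpub[w] = max(dpub[w], dpub[w - wt] + val)

print(dp01[W], dpub[W])  # 3 6 -- one copy vs. two copies
```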
Each item i has a limit c[i]. Naive: treat it as c[i] separate 0/1 items, for O(W * sum(c[i])) time. Binary decomposition: split c[i] into power-of-two batches (1, 2, 4, ..., remainder), giving O(n * log(max_c)) virtual items and O(W * n * log(max_c)) total time.
C++
int bounded_knapsack(int W, vector<int>& wt, vector<int>& val, vector<int>& cnt) {
int n = wt.size();
vector<int> dp(W + 1, 0);
for (int i = 0; i < n; i++) {
// Binary decomposition of cnt[i]
int remaining = cnt[i];
for (int k = 1; remaining > 0; k *= 2) {
int batch = min(k, remaining);
remaining -= batch;
int bw = wt[i] * batch, bv = val[i] * batch;
// 0/1 knapsack for this batch (right to left)
for (int w = W; w >= bw; w--)
dp[w] = max(dp[w], dp[w - bw] + bv);
}
}
return dp[W];
}
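The same binary-decomposition idea in Python (item data below is made up; the name mirrors the C++ function):

```python
def bounded_knapsack(W, wt, val, cnt):
    """0/1 passes over power-of-two batches of each item.
    O(W * n * log(max_c)) time."""
    dp = [0] * (W + 1)
    for i in range(len(wt)):
        remaining = cnt[i]
        k = 1
        while remaining > 0:
            batch = min(k, remaining)
            remaining -= batch
            bw, bv = wt[i] * batch, val[i] * batch
            for w in range(W, bw - 1, -1):  # right-to-left: batch used once
                dp[w] = max(dp[w], dp[w - bw] + bv)
            k *= 2
    return dp[W]

# weight 3 / value 5 with up to 2 copies, weight 4 / value 6 with 1 copy:
print(bounded_knapsack(10, [3, 4], [5, 6], [2, 1]))  # 16 (two of the first + one of the second)
```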
When there are multiple constraints (weight AND volume, for example), add an extra dimension to the DP table.
Python
def knapsack_2d(W, V, items):
"""items = list of (weight, volume, value). Max value with capacity W and volume V."""
dp = [[0] * (V + 1) for _ in range(W + 1)]
for w, v, val in items:
for cw in range(W, w - 1, -1):
for cv in range(V, v - 1, -1):
dp[cw][cv] = max(dp[cw][cv], dp[cw - w][cv - v] + val)
return dp[W][V]
When n is up to ~40 and W is very large, standard DP is impossible. Split items into two halves, enumerate all 2^(n/2) subsets for each, then combine with binary search. Time: O(2^(n/2) * log(2^(n/2))).
Python
from bisect import bisect_right
def meet_in_middle_knapsack(W, weights, values):
n = len(weights)
half = n // 2
def enumerate_subsets(wts, vals):
"""Return list of (weight, value) for all subsets."""
m = len(wts)
result = []
for mask in range(1 << m):
tw = tv = 0
for i in range(m):
if mask & (1 << i):
tw += wts[i]
tv += vals[i]
if tw <= W:
result.append((tw, tv))
return result
left = enumerate_subsets(weights[:half], values[:half])
right = enumerate_subsets(weights[half:], values[half:])
# Sort right by weight, build prefix-max of values
right.sort()
rw = [r[0] for r in right]
rv = [r[1] for r in right]
# Make rv prefix-max: rv[i] = max value among right subsets with weight <= rw[i]
for i in range(1, len(rv)):
rv[i] = max(rv[i], rv[i - 1])
ans = 0
for lw, lv in left:
remaining = W - lw
idx = bisect_right(rw, remaining) - 1
if idx >= 0:
ans = max(ans, lv + rv[idx])
else:
ans = max(ans, lv)
return ans
1. Forgetting to iterate right-to-left in 0/1 knapsack (accidentally allowing reuse).
2. In meet-in-middle, forgetting to build the prefix-max on the sorted right half.
3. Off-by-one in bounded knapsack binary decomposition (the remainder group).
The Convex Hull Trick (CHT) optimizes DP recurrences of the form: dp[i] = min/max over j < i of (a[j] * x[i] + b[j]), where each previous state j defines a linear function f_j(x) = a[j]*x + b[j], and we query at x = x[i]. If we maintain a convex hull of these lines, each query takes O(log n) or amortized O(1).
Your DP recurrence must be expressible as: dp[i] = min/max(m_j * x_i + b_j) + cost(i)
where m_j and b_j depend on j (a previously computed state) and x_i depends only on i.
If slopes m_j are monotone AND query points x_i are monotone, use a simple deque (amortized O(1) per operation). Otherwise, use Li Chao tree (O(log n) per operation).
Python
from collections import deque
class ConvexHullTrick:
"""Maintains lower hull for minimum queries.
Requirements: lines added in decreasing slope order,
queries in increasing x order."""
def __init__(self):
self.hull = deque() # list of (slope, intercept)
def _bad(self, l1, l2, l3):
"""Check if l2 is never optimal between l1 and l3."""
return (l3[1] - l1[1]) * (l1[0] - l2[0]) <= \
(l2[1] - l1[1]) * (l1[0] - l3[0])
def add_line(self, m, b):
"""Add line y = m*x + b. Slopes must be decreasing."""
line = (m, b)
while len(self.hull) >= 2 and self._bad(self.hull[-2], self.hull[-1], line):
self.hull.pop()
self.hull.append(line)
def query(self, x):
"""Query minimum y at x. x must be increasing."""
while len(self.hull) >= 2:
m1, b1 = self.hull[0]
m2, b2 = self.hull[1]
if m1 * x + b1 > m2 * x + b2:
self.hull.popleft()
else:
break
m, b = self.hull[0]
return m * x + b
C++
struct Line {
long long m, b;
long long eval(long long x) { return m * x + b; }
};
struct CHT {
deque<Line> hull;
bool bad(Line l1, Line l2, Line l3) {
return (__int128)(l3.b - l1.b) * (l1.m - l2.m) <=
(__int128)(l2.b - l1.b) * (l1.m - l3.m);
}
void add_line(long long m, long long b) {
Line line = {m, b};
while (hull.size() >= 2 && bad(hull[hull.size()-2], hull.back(), line))
hull.pop_back();
hull.push_back(line);
}
long long query(long long x) {
while (hull.size() >= 2 && hull[0].eval(x) >= hull[1].eval(x))
hull.pop_front();
return hull.front().eval(x);
}
};
Suppose dp[i] = min over j < i of (dp[j] + cost(j, i)), where cost(j, i) can be decomposed as m_j * x_i + b_j.
Rewrite: dp[i] = min(m_j * x_i + b_j) where m_j = some function of j's values, b_j = dp[j] + extra, x_i = some function of i.
Add line (m_j, b_j) when computing dp[j]. Query at x_i to compute dp[i]. If slopes decrease and queries increase, the deque approach gives O(n) total.
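An end-to-end sketch with a made-up recurrence dp[i] = min over j < i of (dp[j] + (x[i] - x[j])^2): expanding the square gives dp[i] = x[i]^2 + min_j(-2*x[j] * x[i] + x[j]^2 + dp[j]), so slope m_j = -2*x[j] is decreasing (x sorted increasing) and queries x[i] are increasing, exactly the deque case. Cross-checked against the O(n^2) loop:

```python
from collections import deque

def solve_cht(x):
    """dp[i] = min_{j<i} dp[j] + (x[i]-x[j])^2; x strictly increasing, dp[0] = 0."""
    n = len(x)
    hull = deque()  # (slope, intercept), slopes decreasing

    def bad(l1, l2, l3):
        return (l3[1] - l1[1]) * (l1[0] - l2[0]) <= (l2[1] - l1[1]) * (l1[0] - l3[0])

    def add(m, b):
        while len(hull) >= 2 and bad(hull[-2], hull[-1], (m, b)):
            hull.pop()
        hull.append((m, b))

    def query(q):
        while len(hull) >= 2 and hull[0][0] * q + hull[0][1] >= hull[1][0] * q + hull[1][1]:
            hull.popleft()
        m, b = hull[0]
        return m * q + b

    dp = [0] * n
    add(-2 * x[0], x[0] * x[0] + dp[0])       # line contributed by j = 0
    for i in range(1, n):
        dp[i] = x[i] * x[i] + query(x[i])     # query at x[i]
        add(-2 * x[i], x[i] * x[i] + dp[i])   # line contributed by j = i
    return dp[n - 1]

def solve_naive(x):
    n = len(x)
    dp = [0] * n
    for i in range(1, n):
        dp[i] = min(dp[j] + (x[i] - x[j]) ** 2 for j in range(i))
    return dp[n - 1]

xs = [0, 1, 4, 9, 10]
assert solve_cht(xs) == solve_naive(xs) == 36
```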
When slopes and queries are NOT monotone, use a Li Chao segment tree. It supports add_line in O(log(MAXX)) and query in O(log(MAXX)). Conceptually simpler than maintaining an explicit hull with binary search.
This optimization applies to layered recurrences of the form dp[j][i] = min over k < i of (dp[j-1][k] + C(k+1, i)), whenever the optimal split point opt[j][i] is monotone in i: opt[j][i] ≤ opt[j][i+1]. In other words, as i increases, the optimal k never decreases.
Divide and conquer exploits the monotonicity: compute the middle index's answer by scanning its candidate range, record its optimal k, and recurse on both halves with correspondingly narrowed ranges. This reduces O(n^2 * k) to O(n * k * log n).
C++
#include <bits/stdc++.h>
using namespace std;
const long long INF = 1e18;
int N, K;
vector<vector<long long>> dp;
// cost(l, r) = cost of one group covering elements [l, r]
long long cost(int l, int r);
void solve(int layer, int lo, int hi, int opt_lo, int opt_hi) {
if (lo > hi) return;
int mid = (lo + hi) / 2;
long long best = INF;
int opt = opt_lo;
for (int k = opt_lo; k <= min(mid - 1, opt_hi); k++) {
long long val = dp[layer - 1][k] + cost(k + 1, mid);
if (val < best) {
best = val;
opt = k;
}
}
dp[layer][mid] = best;
solve(layer, lo, mid - 1, opt_lo, opt);
solve(layer, mid + 1, hi, opt, opt_hi);
}
void compute() {
dp.assign(K + 1, vector<long long>(N + 1, INF));
dp[0][0] = 0;
// Base layer: dp[1][i] = cost(1, i)
for (int i = 1; i <= N; i++)
dp[1][i] = cost(1, i);
for (int j = 2; j <= K; j++)
solve(j, 1, N, 0, N - 1);
}
Python
import sys
sys.setrecursionlimit(300000)
def dnc_optimize(n, k, cost_fn):
"""
dp[j][i] = min over m < i of (dp[j-1][m] + cost_fn(m+1, i))
Returns dp[k][n].
"""
INF = float('inf')
prev = [INF] * (n + 1)
prev[0] = 0
for i in range(1, n + 1):
prev[i] = cost_fn(1, i)
for layer in range(2, k + 1):
cur = [INF] * (n + 1)
def rec(lo, hi, opt_lo, opt_hi):
if lo > hi:
return
mid = (lo + hi) // 2
best = INF
opt = opt_lo
for m in range(opt_lo, min(mid, opt_hi + 1)):
val = prev[m] + cost_fn(m + 1, mid)
if val < best:
best = val
opt = m
cur[mid] = best
rec(lo, mid - 1, opt_lo, opt)
rec(mid + 1, hi, opt, opt_hi)
rec(1, n, 0, n - 1)
prev = cur
return prev[n]
How do you verify the monotonicity condition? Check the cost function C(l, r): if it satisfies the quadrangle inequality (C(a,c) + C(b,d) ≤ C(a,d) + C(b,c) for a ≤ b ≤ c ≤ d), then the optimal split points are monotone. Common qualifying costs: sum of squares of segment sums, variance-based costs.
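As a concrete instance (an assumed example problem, not from a specific judge): partition an array of nonnegative numbers into k contiguous groups minimizing the sum of squared group sums. The squared-segment-sum cost satisfies the quadrangle inequality, so DnC optimization applies. The sketch below is self-contained and cross-checked against the unoptimized O(n^2 * k) DP:

```python
def min_sum_sq_partition(a, k):
    """Split a into k contiguous groups minimizing sum of (group sum)^2,
    using divide & conquer optimization."""
    n = len(a)
    pre = [0] * (n + 1)
    for i, v in enumerate(a):
        pre[i + 1] = pre[i] + v

    def cost(l, r):  # cost of 1-indexed segment [l, r]
        s = pre[r] - pre[l - 1]
        return s * s

    INF = float('inf')
    prev = [INF] * (n + 1)
    prev[0] = 0
    for i in range(1, n + 1):
        prev[i] = cost(1, i)
    for _ in range(2, k + 1):
        cur = [INF] * (n + 1)

        def rec(lo, hi, opt_lo, opt_hi):
            if lo > hi:
                return
            mid = (lo + hi) // 2
            best, opt = INF, opt_lo
            for m in range(opt_lo, min(mid, opt_hi + 1)):
                v = prev[m] + cost(m + 1, mid)
                if v < best:
                    best, opt = v, m
            cur[mid] = best
            rec(lo, mid - 1, opt_lo, opt)
            rec(mid + 1, hi, opt, opt_hi)

        rec(1, n, 0, n - 1)
        prev = cur
    return prev[n]

def brute(a, k):
    """Unoptimized reference DP, O(n^2 * k)."""
    n = len(a)
    INF = float('inf')
    dp = [[INF] * (n + 1) for _ in range(k + 1)]
    dp[0][0] = 0
    for j in range(1, k + 1):
        for i in range(1, n + 1):
            for m in range(i):
                s = sum(a[m:i])
                dp[j][i] = min(dp[j][i], dp[j - 1][m] + s * s)
    return dp[k][n]

a = [1, 3, 2, 6, 4, 5]
assert min_sum_sq_partition(a, 3) == brute(a, 3)
```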
Knuth's optimization speeds up interval DP from O(n^3) to O(n^2). It applies when the cost function C(i, j) satisfies the quadrangle inequality AND the DP recurrence is: dp[i][j] = min over i < k < j of (dp[i][k] + dp[k][j] + C(i, j)).
1. Quadrangle Inequality: C(a,c) + C(b,d) ≤ C(a,d) + C(b,c) for all a ≤ b ≤ c ≤ d.
2. Monotonicity: C(a,c) ≤ C(b,d) for a ≤ b ≤ c ≤ d.
Then opt[i][j-1] ≤ opt[i][j] ≤ opt[i+1][j], limiting the search range for k.
C++
// Knuth's optimization for optimal BST / interval DP
void knuth_dp(int n, vector<vector<long long>>& dp,
vector<vector<int>>& opt,
vector<vector<long long>>& C) {
// Base cases
for (int i = 0; i < n; i++) {
dp[i][i] = 0;
opt[i][i] = i;
}
for (int len = 2; len <= n; len++) {
for (int i = 0; i + len - 1 < n; i++) {
int j = i + len - 1;
dp[i][j] = 1e18;
// Key: search only from opt[i][j-1] to opt[i+1][j]
int lo = opt[i][j - 1];
int hi = min(opt[i + 1][j], j - 1); // clamp: dp[k+1][j] needs k <= j-1
for (int k = lo; k <= hi; k++) {
long long val = dp[i][k] + dp[k + 1][j] + C[i][j];
if (val < dp[i][j]) {
dp[i][j] = val;
opt[i][j] = k;
}
}
}
}
}
Python
def knuth_dp(n, C):
"""
Interval DP with Knuth's optimization.
C[i][j] = cost of interval [i, j].
Returns dp[0][n-1] = optimal cost.
"""
INF = float('inf')
dp = [[0] * n for _ in range(n)]
opt = [[0] * n for _ in range(n)]
for i in range(n):
opt[i][i] = i
for length in range(2, n + 1):
for i in range(n - length + 1):
j = i + length - 1
dp[i][j] = INF
            lo = opt[i][j - 1]
            hi = min(opt[i + 1][j], j - 1)  # clamp: dp[k + 1][j] needs k <= j - 1
            for k in range(lo, hi + 1):
                val = dp[i][k] + dp[k + 1][j] + C[i][j]
if val < dp[i][j]:
dp[i][j] = val
opt[i][j] = k
return dp[0][n - 1]
Given keys with access frequencies f[i], build a BST minimizing weighted search cost. Cost C(i,j) = sum of f[i..j] (prefix sums). This satisfies the quadrangle inequality, so Knuth's optimization applies, giving O(n^2) instead of O(n^3).
SOS DP computes, for every bitmask x: F[x] = sum of A[y] for all y that are submasks of x (y & x == y). Naively this is O(4^n) (for each of the 2^n masks, test all 2^n candidate submasks), or O(3^n) with submask enumeration. SOS DP does it in O(n * 2^n).
Process bits one at a time. Let dp[mask][i] = sum of A[y] over all y that agree with mask on bits i+1..n-1 and are submasks of mask on bits 0..i.
dp[mask][-1] = A[mask]
dp[mask][i] = dp[mask][i-1] + (dp[mask ^ (1<<i)][i-1] if bit i is set in mask, else 0)
F[mask] = dp[mask][n-1]
We can flatten to 1D by processing in-place.
Python
def sos_dp(A, n):
"""
Given array A of size 2^n, compute F[mask] = sum of A[sub] for all sub submasks of mask.
Modifies A in-place to become F.
O(n * 2^n).
"""
for i in range(n):
for mask in range(1 << n):
if mask & (1 << i):
A[mask] += A[mask ^ (1 << i)]
return A
C++
void sos_dp(vector<long long>& A, int n) {
for (int i = 0; i < n; i++) {
for (int mask = 0; mask < (1 << n); mask++) {
if (mask & (1 << i))
A[mask] += A[mask ^ (1 << i)];
}
}
}
Example with n = 3 and initial A = [1, 2, 3, 4, 5, 6, 7, 8]. Indices 0..7 in binary: 000, 001, 010, 011, 100, 101, 110, 111.
After processing bit 0: A[001]+=A[000]=3, A[011]+=A[010]=7, A[101]+=A[100]=11, A[111]+=A[110]=15.
After processing bit 1: A[010]+=A[000]=4, A[011]+=A[001]=10, A[110]+=A[100]=12, A[111]+=A[101]=26.
After processing bit 2: A[100]+=A[000]=6, A[101]+=A[001]=14, A[110]+=A[010]=16, A[111]+=A[011]=36.
F[111] = A[000]+A[001]+A[010]+A[011]+A[100]+A[101]+A[110]+A[111] = 1+2+3+4+5+6+7+8 = 36. Correct.
To compute G[mask] = sum of A[y] for all y that are supermasks of mask (mask is a submask of y), simply flip the condition:
Python
def superset_sum(A, n):
"""G[mask] = sum of A[sup] for all sup where mask is submask of sup."""
for i in range(n):
for mask in range(1 << n):
if not (mask & (1 << i)):
A[mask] += A[mask | (1 << i)]
return A
1. Count pairs (i, j) where a[i] AND a[j] = a[i] (a[i] is submask of a[j]).
2. Maximum/minimum over all submasks.
3. Subset convolution (combine with inclusion-exclusion and popcount stratification).
4. Competitive programming: CF 165E, ARC 100E, many Div1 D/E problems.
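A sketch of application 1, using the superset-sum variant: count ordered pairs (i, j) with i ≠ j and a[i] a submask of a[j]. The array below is a made-up example, cross-checked against brute force:

```python
def count_submask_pairs(a, n_bits):
    """Ordered pairs (i, j), i != j, with a[i] & a[j] == a[i]."""
    size = 1 << n_bits
    cnt = [0] * size
    for v in a:
        cnt[v] += 1
    # Superset sum: G[mask] = number of array elements that are supermasks of mask
    G = cnt[:]
    for i in range(n_bits):
        for mask in range(size):
            if not (mask & (1 << i)):
                G[mask] += G[mask | (1 << i)]
    # Each element pairs with every supermask occurrence except itself
    return sum(G[v] - 1 for v in a)

a = [1, 3, 5, 7]
expected = sum(1 for i in range(len(a)) for j in range(len(a))
               if i != j and a[i] & a[j] == a[i])
assert count_submask_pairs(a, 3) == expected  # 5 pairs
```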
Profile DP handles grid tiling problems: "In how many ways can you tile an m x n grid with dominoes (1x2 tiles)?" The idea is to process the grid column by column (or row by row), with the DP state being the "profile" -- a bitmask describing which cells in the current column boundary are already filled by a domino from the previous column.
dp[col][mask] = number of ways to tile columns 0..col such that the boundary between col and col+1 has the given mask (bit i = 1 means row i is already covered by a horizontal domino extending from col to col+1).
For an m x n grid: 2^m states per column, n columns. Total: O(n * 2^m).
Python
def domino_tiling(m, n):
"""Count ways to tile m x n grid with 1x2 dominoes. m should be small (<= 20)."""
if (m * n) % 2 != 0:
return 0 # odd total cells = impossible
# Ensure m <= n (use m as the small dimension)
if m > n:
m, n = n, m
full = (1 << m) - 1
def generate_transitions(row, cur_mask, next_mask):
"""
Try to fill row 'row' of current column.
cur_mask: which rows are already filled in current column.
next_mask: which rows will be filled in next column (by horizontal dominos).
Returns list of valid next_masks.
"""
if row == m:
if cur_mask == full:
yield next_mask
return
if cur_mask & (1 << row):
# Row already filled, skip
yield from generate_transitions(row + 1, cur_mask, next_mask)
else:
            # Option 1: horizontal domino; its left half fills this cell,
            # its right half reaches into the next column
            yield from generate_transitions(
                row + 1, cur_mask | (1 << row), next_mask | (1 << row)
            )
# Option 2: place vertical domino (fills this row and next row)
if row + 1 < m and not (cur_mask & (1 << (row + 1))):
yield from generate_transitions(
row + 2, cur_mask | (1 << row) | (1 << (row + 1)),
next_mask
)
# Precompute transitions
trans = {}
for mask in range(1 << m):
trans[mask] = list(generate_transitions(0, mask, 0))
# DP
dp = [0] * (1 << m)
dp[0] = 1 # empty profile before first column
for col in range(n):
new_dp = [0] * (1 << m)
for mask in range(1 << m):
if dp[mask] == 0:
continue
for next_mask in trans[mask]:
new_dp[next_mask] += dp[mask]
dp = new_dp
return dp[0] # no overhanging dominoes after last column
C++
#include <bits/stdc++.h>
using namespace std;
int M;
vector<vector<int>> trans;
void gen(int row, int cur, int nxt, vector<int>& out) {
if (row == M) {
if (cur == (1 << M) - 1) out.push_back(nxt);
return;
}
if (cur & (1 << row)) {
gen(row + 1, cur, nxt, out);
} else {
// horizontal: left half fills this cell, right half extends to next column
gen(row + 1, cur | (1 << row), nxt | (1 << row), out);
// vertical: fill two rows in current column
if (row + 1 < M && !(cur & (1 << (row + 1))))
gen(row + 2, cur | (1 << row) | (1 << (row + 1)), nxt, out);
}
}
long long domino_tiling(int m, int n) {
if (((long long)m * n) % 2) return 0;
if (m > n) swap(m, n);
M = m;
trans.assign(1 << m, vector<int>());
for (int mask = 0; mask < (1 << m); mask++)
gen(0, mask, 0, trans[mask]);
vector<long long> dp(1 << m, 0);
dp[0] = 1;
for (int col = 0; col < n; col++) {
vector<long long> ndp(1 << m, 0);
for (int mask = 0; mask < (1 << m); mask++) {
if (!dp[mask]) continue;
for (int nm : trans[mask])
ndp[nm] += dp[mask];
}
dp = ndp;
}
return dp[0];
}
m=2, n=3. Full mask = 11 (binary) = 3.
Transitions from mask=00: can place vertical (fills both rows) -> next=00. Or two horizontals -> next=11.
Transitions from mask=11: both rows filled, nothing to place -> next=00.
Transitions from mask=01: row 0 filled, row 1 empty. Place horizontal on row 1 -> next=10. (No vertical since only one row empty.)
Transitions from mask=10: row 1 filled, row 0 empty. Place horizontal on row 0 -> next=01.
DP: col 0: dp[00]=1 -> ndp[00]+=1 (vertical), ndp[11]+=1 (two horizontal). dp=[1,0,0,1].
col 1: from 00: ndp[00]+=1, ndp[11]+=1. From 11: ndp[00]+=1. dp=[2,0,0,1].
col 2: from 00: ndp[00]+=2, ndp[11]+=2. From 11: ndp[00]+=1. dp=[3,0,0,2].
Answer: dp[0] = 3. The three tilings of the 2x3 grid: three vertical dominoes; a vertical domino in column 0 followed by two stacked horizontals in columns 1-2; two stacked horizontals in columns 0-1 followed by a vertical domino in column 2.
The same technique works for: tiling with L-trominoes, tiling with 1x3 bars, counting Hamiltonian paths in grid graphs, and any grid problem where the "interaction" between adjacent columns is bounded.
Organized by topic with approximate difficulty. Solve these to master advanced DP.
If you are new to advanced DP, tackle them in this order: (1) Tree DP -- natural extension of basic DP. (2) Bitmask DP -- builds intuition for state compression. (3) Interval DP -- classic pattern, appears everywhere. (4) Knapsack variants -- deepens your understanding of the fundamental DP. (5) Digit DP -- template-based, very learnable. (6) SOS DP + Profile DP -- bitmask mastery. (7) CHT, DnC opt, Knuth's opt -- these are optimization techniques that require solid DP foundations first.
In a contest, identifying the DP type is half the battle. Look for these signals: "tree structure" = tree DP; "n <= 20" = bitmask DP; "count numbers in range [L,R] with property" = digit DP; "merge intervals / break string" = interval DP; "minimize cost of partitioning into k groups" = DnC optimization or Knuth's; "subset/superset queries" = SOS DP. Once you identify the type, apply the template and focus on defining the cost function correctly.