Table of Contents

1. What is Calculus? (The Big Picture)
2. Limits
3. Derivatives -- The Basics
4. Derivative Rules (Advanced)
5. Applications of Derivatives
6. Integrals -- The Basics
7. Definite Integrals
8. Applications of Integrals
9. Partial Derivatives & Gradients (For ML)
10. Practice Quiz

1. What is Calculus? (The Big Picture)

Calculus sounds intimidating, but at its core it answers two simple questions:

  1. How fast is something changing right now? (Differential calculus)
  2. How much has accumulated over time? (Integral calculus)

That is it. Every formula, every rule, every theorem in calculus is just a tool for answering one of those two questions more precisely.

The Speed vs. Distance Analogy

Imagine you are driving a car. Your speedometer tells you your speed at this exact instant -- that is a derivative. Your odometer tells you the total distance you have traveled -- that is an integral. Calculus is the math that connects these two ideas.

These two operations are inverses of each other -- and that beautiful connection is called the Fundamental Theorem of Calculus.

Why Programmers Need Calculus

You might wonder: "I write code, not physics papers. Why do I need this?" Here is why:

  • Machine learning: training a model is gradient descent, which is derivatives applied to a loss function.
  • Probability and statistics: continuous distributions and expected values are defined with integrals.
  • Graphics and simulation: areas, volumes, and rates of change of continuous shapes and motion.
  • Numerical methods: computers approximate derivatives and integrals constantly; the theory tells you what they are approximating.

Tip

You do not need to be a calculus wizard to be a great programmer. But understanding the core ideas -- especially derivatives and what they mean -- will unlock entire areas of CS that would otherwise feel like black-box magic.

2. Limits

Before you can understand derivatives or integrals, you need limits. A limit answers the question: "What value does f(x) approach as x gets closer and closer to some number?"

Notice the word "approach." We do not care what f(x) actually equals at that point -- we care about what it is heading toward.

Notation

lim(x→a) f(x) = L    "The limit of f(x) as x approaches a equals L"

This means: as x gets closer and closer to a (from both sides), f(x) gets closer and closer to L.

One-Sided Limits

Sometimes a function approaches different values depending on which direction you come from:

Left-hand limit: lim(x→a⁻) f(x) = L
Right-hand limit: lim(x→a⁺) f(x) = L

The full (two-sided) limit exists only if the left-hand and right-hand limits are equal. If they disagree, the limit does not exist.

Limits at Infinity

We can also ask what happens as x grows without bound:

lim(x→∞) f(x) = L    "As x grows infinitely large, f(x) approaches L"
Example

Find: lim (1/x) as x → ∞

As x gets bigger and bigger (10, 100, 1000, ...), 1/x gets smaller and smaller (0.1, 0.01, 0.001, ...).

Answer: The limit is 0.
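You can watch this happen numerically. A small Python sketch (the function name f is just for this example):

```python
# Watch 1/x shrink toward 0 as x grows without bound.
def f(x):
    return 1 / x

values = [f(10 ** k) for k in range(1, 7)]  # x = 10, 100, ..., 1_000_000
print(values)  # each term 10x smaller than the last, heading toward 0
```

The terms never reach 0, but they get arbitrarily close -- which is exactly what the limit statement claims.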

When Limits Do Not Exist

A limit does not exist (DNE) when:

  • The left-hand and right-hand limits disagree (a jump in the function).
  • The function grows without bound near the point (a vertical asymptote).
  • The function oscillates forever without settling on a value (e.g., sin(1/x) as x → 0).

How to Evaluate Limits

Method 1: Direct Substitution

Just plug in the value of a. If you get a real number, that is the limit. This works for most "nice" functions (polynomials, exponentials, trig).

Example -- Direct Substitution

Find: lim (x² + 3x) as x → 2

Plug in x = 2: (2)² + 3(2) = 4 + 6 = 10

Answer: The limit is 10.

Method 2: Factoring

If direct substitution gives 0/0 (an indeterminate form), try factoring and canceling.

Example -- Factoring

Find: lim (x² - 4) / (x - 2) as x → 2

Direct substitution gives: (4 - 4) / (2 - 2) = 0/0. Indeterminate!

Factor the numerator: (x - 2)(x + 2) / (x - 2)

Cancel (x - 2): x + 2

Now substitute: 2 + 2 = 4

Answer: The limit is 4.
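A numeric sanity check of the factored result, approaching x = 2 from both sides (illustrative Python):

```python
# (x² - 4)/(x - 2) is undefined AT x = 2, but we can evaluate it NEAR 2.
def g(x):
    return (x**2 - 4) / (x - 2)

left  = [g(2 - 10**-k) for k in range(1, 6)]   # 1.9, 1.99, 1.999, ...
right = [g(2 + 10**-k) for k in range(1, 6)]   # 2.1, 2.01, 2.001, ...
print(left[-1], right[-1])  # both sides settle near 4
```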

Method 3: Rationalizing

When you have square roots and get 0/0, multiply by the conjugate.

Example -- Rationalizing

Find: lim (√(x+1) - 1) / x as x → 0

Direct substitution: (√1 - 1) / 0 = 0/0. Indeterminate!

Multiply by the conjugate: [(√(x+1) - 1)(√(x+1) + 1)] / [x(√(x+1) + 1)]

Numerator becomes: (x+1) - 1 = x

So we get: x / [x(√(x+1) + 1)] = 1 / (√(x+1) + 1)

Now substitute x = 0: 1 / (√1 + 1) = 1/2

Answer: The limit is 1/2.

Example -- Limit at Infinity

Find: lim (3x² + 2x) / (x² - 5) as x → ∞

Divide every term by x² (the highest power): (3 + 2/x) / (1 - 5/x²)

As x → ∞, 2/x → 0 and 5/x² → 0

So the limit becomes: 3 / 1 = 3

Answer: The limit is 3.
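The same answer shows up if you simply evaluate the ratio at ever-larger x (a quick Python sketch):

```python
# Evaluate (3x² + 2x) / (x² - 5) at x = 10, 100, ..., 1_000_000.
def h(x):
    return (3 * x**2 + 2 * x) / (x**2 - 5)

samples = [h(10.0 ** k) for k in range(1, 7)]
print(samples)  # creeping down toward 3
```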

Tip -- Limits at Infinity for Rational Functions

For a ratio of polynomials as x → ∞: compare the highest degree in the numerator and denominator. Same degree? Limit = ratio of leading coefficients. Numerator higher? Limit is ∞. Denominator higher? Limit is 0.

Warning

0/0 is NOT the answer -- it means the limit is "indeterminate" and you need to do more work (factor, rationalize, or use L'Hopital's rule). Do not confuse 0/0 with "the limit is 0" or "the limit does not exist."

Limit Laws (Precise Rules):

If lim(x→a) f(x) = L and lim(x→a) g(x) = M, then:

• Sum: lim(x→a) [f(x) + g(x)] = L + M
• Difference: lim(x→a) [f(x) - g(x)] = L - M
• Product: lim(x→a) [f(x) · g(x)] = L · M
• Quotient: lim(x→a) [f(x)/g(x)] = L/M (provided M ≠ 0)
• Constant: lim(x→a) [c · f(x)] = c · L
• Power: lim(x→a) [f(x)]ⁿ = Lⁿ

3. Derivatives -- The Basics

A derivative measures the instantaneous rate of change of a function. If f(x) describes your position at time x, then f'(x) tells you your velocity -- how fast you are moving right at that instant.

The Formal Definition

f'(x) = lim(h→0) [f(x + h) - f(x)] / h

This says: take two points on the curve that are very close together (distance h apart), compute the slope of the line between them, and see what happens as h shrinks to zero. The slope of that infinitely-close line is the derivative.

Geometric Meaning

The derivative at a point is the slope of the tangent line to the curve at that point. A positive derivative means the function is going up; a negative derivative means it is going down; a zero derivative means it is flat (possibly a peak or valley).

Notation

You will see derivatives written in several ways. They all mean the same thing:

f'(x) -- "f prime of x" (Lagrange notation)
dy/dx -- "dee y dee x" (Leibniz notation)
df/dx -- "dee f dee x"
Df(x) -- "D of f" (operator notation)
Tip

Leibniz notation (dy/dx) is not a fraction, but it behaves like one in many situations -- which is part of its power. It reminds you that the derivative is a ratio of tiny changes.

Basic Differentiation Rules

You do not need to use the formal limit definition every time. These rules let you find derivatives quickly:

Rule               Formula                    Example
Constant           d/dx(c) = 0                d/dx(7) = 0
Power Rule         d/dx(xⁿ) = n · xⁿ⁻¹        d/dx(x³) = 3x²
Constant Multiple  d/dx(c · f) = c · f'(x)    d/dx(5x²) = 10x
Sum Rule           d/dx(f + g) = f' + g'      d/dx(x² + 3x) = 2x + 3
Difference Rule    d/dx(f - g) = f' - g'      d/dx(x³ - x) = 3x² - 1
Example -- Power Rule

Find the derivative of f(x) = x²

Using the limit definition (to show why the power rule works):

f'(x) = lim [(x+h)² - x²] / h as h → 0

= lim [x² + 2xh + h² - x²] / h

= lim [2xh + h²] / h

= lim (2x + h)

= 2x

This matches the power rule: d/dx(x²) = 2x¹ = 2x.
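The shrinking-h process is easy to watch on a machine. A short Python sketch of the difference quotient (the helper names here are ours, not a library API):

```python
# Approximate f'(x) by the limit-definition quotient with shrinking h.
def f(x):
    return x**2

def diff_quotient(f, x, h):
    return (f(x + h) - f(x)) / h

x = 3.0
approximations = [diff_quotient(f, x, 10**-k) for k in range(1, 6)]
print(approximations)  # approaches 2x = 6 as h shrinks
```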

Example -- Combining Rules

Find f'(x) where f(x) = 4x³ - 2x² + 7x - 5

Apply the rules term by term:

  • d/dx(4x³) = 12x² (power rule + constant multiple)
  • d/dx(-2x²) = -4x
  • d/dx(7x) = 7
  • d/dx(-5) = 0 (constant rule)

f'(x) = 12x² - 4x + 7

Example -- Negative and Fractional Exponents

The power rule works for all exponents, not just positive integers:

  • d/dx(1/x) = d/dx(x⁻¹) = -1 · x⁻² = -1/x²
  • d/dx(√x) = d/dx(x^(1/2)) = (1/2) · x^(-1/2) = 1 / (2√x)
Warning

The power rule only works on xⁿ where n is a constant. It does NOT work on functions like 2ˣ or xˣ -- those require different rules.

4. Derivative Rules (Advanced)

The basic rules handle simple terms, but what about products, quotients, and compositions of functions? That is where these three powerful rules come in.

The Product Rule

(f · g)' = f' · g + f · g' "Derivative of first times second, plus first times derivative of second"
Example -- Product Rule

Find d/dx [x² · sin(x)]

Let f = x² and g = sin(x)

f' = 2x, g' = cos(x)

Product rule: f'g + fg' = 2x · sin(x) + x² · cos(x)

Answer: 2x sin(x) + x² cos(x)

The Quotient Rule

(f / g)' = [f' · g - f · g'] / g² "Low d-high minus high d-low, over the square of what's below"
Example -- Quotient Rule

Find d/dx [(3x + 1) / (x² + 1)]

f = 3x + 1, g = x² + 1

f' = 3, g' = 2x

= [3(x² + 1) - (3x + 1)(2x)] / (x² + 1)²

= [3x² + 3 - 6x² - 2x] / (x² + 1)²

= (-3x² - 2x + 3) / (x² + 1)²

The Chain Rule

The chain rule is arguably the most important rule in all of calculus, especially for machine learning. It tells you how to differentiate compositions of functions -- a function inside another function.

d/dx [f(g(x))] = f'(g(x)) · g'(x) "Derivative of the outer (evaluated at inner) times derivative of the inner"
Example -- Chain Rule

Find d/dx [(3x + 2)³]

Outer function: f(u) = u³, so f'(u) = 3u²

Inner function: g(x) = 3x + 2, so g'(x) = 3

Chain rule: f'(g(x)) · g'(x) = 3(3x + 2)² · 3

= 9(3x + 2)²

Example -- Chain Rule with Trig

Find d/dx [sin(x²)]

Outer: sin(u), derivative = cos(u)

Inner: x², derivative = 2x

= cos(x²) · 2x = 2x cos(x²)
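A finite-difference check of the chain-rule answer (illustrative Python; the central difference (f(x+h) - f(x-h)) / 2h is a standard numeric approximation of the derivative):

```python
import math

# Compare the chain-rule answer 2x·cos(x²) against a numeric derivative.
def f(x):
    return math.sin(x**2)

def chain_rule_derivative(x):
    return 2 * x * math.cos(x**2)

def numeric_derivative(f, x, h=1e-6):
    return (f(x + h) - f(x - h)) / (2 * h)

x = 1.3
print(chain_rule_derivative(x), numeric_derivative(f, x))  # agree closely
```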

Tip -- Why the Chain Rule Matters for ML

Neural networks are just chains of functions: layer after layer of transformations. Backpropagation -- the algorithm that trains neural networks -- is literally the chain rule applied repeatedly. When people say "deep learning requires calculus," this is what they mean.

Derivatives of Common Functions

Memorize these. They come up constantly:

f(x)     f'(x)          Notes
sin(x)   cos(x)
cos(x)   -sin(x)        Note the negative sign
tan(x)   sec²(x)
eˣ       eˣ             eˣ is its own derivative!
ln(x)    1/x            Only for x > 0
aˣ       aˣ · ln(a)     General exponential
Example -- Combining Multiple Rules

Find d/dx [eˣ · ln(x)]

Use product rule with f = eˣ, g = ln(x):

f' = eˣ, g' = 1/x

= eˣ · ln(x) + eˣ · (1/x)

= eˣ(ln(x) + 1/x)

Example -- Chain Rule with e

Find d/dx [e^(3x²)]

Outer: eᵘ, derivative = eᵘ

Inner: 3x², derivative = 6x

= 6x · e^(3x²)

5. Applications of Derivatives

Derivatives are not just abstract math -- they are tools for solving real problems. Here are the most important applications.

Finding Maxima and Minima

At the top of a hill or the bottom of a valley, the slope is zero. This means we can find peaks and valleys of any function by setting its derivative equal to zero and solving.

To find maxima/minima of f(x):

  1. Find f'(x).
  2. Set f'(x) = 0 and solve for x (these are "critical points").
  3. Use the second derivative to classify:
     - f''(x) > 0 at the critical point → local minimum (concave up)
     - f''(x) < 0 at the critical point → local maximum (concave down)
     - f''(x) = 0 → inconclusive (need further analysis)
Example -- Finding Extrema

Find the maximum and minimum of f(x) = x³ - 3x + 2

Step 1: f'(x) = 3x² - 3

Step 2: Set f'(x) = 0: 3x² - 3 = 0 → x² = 1 → x = 1 or x = -1

Step 3: f''(x) = 6x

  • At x = 1: f''(1) = 6 > 0, so x = 1 is a local minimum. f(1) = 0
  • At x = -1: f''(-1) = -6 < 0, so x = -1 is a local maximum. f(-1) = 4
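These critical points can be verified mechanically. A small Python check of the values computed above:

```python
# f(x) = x³ - 3x + 2 and its first two derivatives.
def f(x):
    return x**3 - 3*x + 2

def f_prime(x):
    return 3*x**2 - 3

def f_double_prime(x):
    return 6*x

for x in (-1.0, 1.0):
    print(x, f_prime(x), f_double_prime(x), f(x))
# f'(±1) = 0; f''(-1) < 0 → local max with f(-1) = 4;
# f''(1) > 0 → local min with f(1) = 0.
```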

Optimization Problems

Real-world optimization follows the same pattern: write a function for the thing you want to maximize or minimize, take the derivative, set it to zero, solve.

Example -- Optimization

A farmer has 100 meters of fencing. What dimensions maximize the area of a rectangular pen?

Let width = x, so length = (100 - 2x) / 2 = 50 - x

Area: A(x) = x(50 - x) = 50x - x²

A'(x) = 50 - 2x

Set A'(x) = 0: 50 - 2x = 0 → x = 25

So width = 25m, length = 25m. A square!

Maximum area = 25 × 25 = 625 m²
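A brute-force check of the fencing answer (illustrative Python, trying candidate widths in steps of 0.5):

```python
# A(x) = x(50 - x): area of the pen as a function of its width x.
def area(x):
    return x * (50 - x)

candidates = [x * 0.5 for x in range(0, 101)]   # widths 0.0 .. 50.0
best = max(candidates, key=area)
print(best, area(best))  # 25.0 625.0 -- the calculus answer
```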

Related Rates (Brief Introduction)

Related rates problems ask: if one quantity is changing, how fast is a related quantity changing? You use the chain rule with respect to time.

Example -- Related Rates

A circle's radius is increasing at 2 cm/s. How fast is the area increasing when r = 5 cm?

A = πr²

Differentiate both sides with respect to time t:

dA/dt = 2πr · (dr/dt)

Plug in r = 5 and dr/dt = 2:

dA/dt = 2π(5)(2) = 20π ≈ 62.83 cm²/s
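The arithmetic can be confirmed in a couple of lines of Python:

```python
import math

# dA/dt = 2πr · (dr/dt), evaluated at r = 5 with dr/dt = 2.
r, dr_dt = 5.0, 2.0
dA_dt = 2 * math.pi * r * dr_dt
print(dA_dt)  # 20π ≈ 62.83 cm²/s
```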

Gradient Descent -- Why ML Engineers Love Derivatives

In machine learning, you have a loss function that measures how wrong your model is. Training the model means finding the inputs (weights) that minimize this loss. How? Derivatives!

Gradient Descent Update Rule: w_new = w_old - learning_rate × dL/dw "Move the weight in the direction that reduces the loss"

The idea is beautifully simple:

  1. Compute the derivative of the loss with respect to each weight (the gradient).
  2. The derivative tells you which direction increases the loss.
  3. Go in the opposite direction (hence the minus sign).
  4. Repeat until you reach a minimum.
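The four steps above can be sketched in a few lines of Python, here minimizing the toy loss L(w) = (w - 3)² (a made-up example for illustration, not from the text):

```python
# Gradient descent on L(w) = (w - 3)², whose minimum is at w = 3.
def loss(w):
    return (w - 3) ** 2

def grad(w):
    return 2 * (w - 3)  # dL/dw

w = 0.0                 # start far from the minimum
learning_rate = 0.1
for _ in range(100):
    w = w - learning_rate * grad(w)   # step opposite the gradient

print(w)  # very close to 3
```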
Tip

Think of gradient descent like walking downhill in fog. You cannot see the bottom, but you can feel the slope under your feet (the derivative). Take a step in the steepest downhill direction, check the slope again, repeat. You will eventually reach a valley.

Warning

Gradient descent can get stuck in local minima -- a valley that is not the lowest point overall. This is a real problem in ML and is why techniques like momentum, Adam optimizer, and random restarts exist.

6. Integrals -- The Basics

If derivatives answer "how fast is it changing?", integrals answer "how much has accumulated?" Integration is the reverse process of differentiation -- finding the antiderivative.

Antiderivatives

If F'(x) = f(x), then F(x) is the antiderivative of f(x). We write this as:

∫ f(x) dx = F(x) + C where F'(x) = f(x) and C is the constant of integration

Why the +C?

Since the derivative of any constant is zero, there are infinitely many antiderivatives. For example, x² + 5 and x² - 100 and x² + π all have the same derivative: 2x. The "+C" captures all these possibilities.

Warning

Never forget the +C on indefinite integrals. It is a constant source of lost marks (pun intended) and it actually matters in applications -- the constant represents initial conditions (like starting position or initial amount).

Basic Integration Rules

These are the derivative rules, but reversed:

Function      Integral            Example
xⁿ (n ≠ -1)   xⁿ⁺¹/(n+1) + C      ∫ x³ dx = x⁴/4 + C
1/x           ln|x| + C           ∫ (1/x) dx = ln|x| + C
eˣ            eˣ + C              ∫ eˣ dx = eˣ + C
cos(x)        sin(x) + C          ∫ cos(x) dx = sin(x) + C
sin(x)        -cos(x) + C         ∫ sin(x) dx = -cos(x) + C
constant k    kx + C              ∫ 5 dx = 5x + C
Tip -- Checking Your Work

You can always verify an integral by differentiating your answer. If you get back the original function, you did it right. This is the best self-check in calculus.

Example -- Power Rule for Integration

Find ∫ x⁴ dx

Add 1 to the exponent: 4 + 1 = 5

Divide by the new exponent: x⁵ / 5

∫ x⁴ dx = x⁵/5 + C

Check: d/dx (x⁵/5 + C) = 5x⁴/5 = x⁴. Correct!

Example -- Integrating a Polynomial

Find ∫ (3x² + 2x - 7) dx

Integrate term by term:

  • ∫ 3x² dx = 3 · x³/3 = x³
  • ∫ 2x dx = 2 · x²/2 = x²
  • ∫ -7 dx = -7x

= x³ + x² - 7x + C

Example -- Rewriting Before Integrating

Find ∫ (1/x³) dx

Rewrite: 1/x³ = x⁻³

Apply power rule: x⁻³⁺¹ / (-3+1) = x⁻² / (-2) = -1/(2x²)

∫ (1/x³) dx = -1/(2x²) + C

Example -- Roots

Find ∫ √x dx

Rewrite: √x = x^(1/2)

Apply power rule: x^(1/2 + 1) / (1/2 + 1) = x^(3/2) / (3/2) = (2/3)x^(3/2)

∫ √x dx = (2/3)x^(3/2) + C

Integration Linearity (The One Rule That Governs All Integration):

∫[a·f(x) + b·g(x)]dx = a·∫f(x)dx + b·∫g(x)dx

Constants pull out. Sums split. This single rule, combined with the table of basic antiderivatives, lets you integrate any polynomial.

7. Definite Integrals

An indefinite integral gives you a family of functions (+C). A definite integral gives you a specific number -- the total accumulation between two points.

Area Under a Curve

The definite integral of f(x) from a to b represents the signed area between the curve and the x-axis, from x = a to x = b.

∫ from a to b of f(x) dx = "Area under f(x) from a to b"

Area above the x-axis is positive; area below the x-axis is negative.

The Fundamental Theorem of Calculus

This is the crown jewel of calculus. It says that differentiation and integration are inverses of each other, and it gives us a practical way to evaluate definite integrals:

∫ from a to b of f(x) dx = F(b) - F(a), where F(x) is any antiderivative of f(x).

"Find the antiderivative, plug in the upper limit, subtract the antiderivative at the lower limit."

Notice that the +C cancels out (it appears in both F(b) and F(a)), which is why definite integrals do not have a +C.

Tip

The notation F(x)|ₐᵇ (with the evaluation bar) is shorthand for F(b) - F(a). You will see this written as [F(x)]ₐᵇ in many textbooks.

Example -- Basic Definite Integral

Evaluate ∫ from 1 to 3 of x² dx

Step 1: Find antiderivative: F(x) = x³/3

Step 2: Evaluate at the bounds:

F(3) - F(1) = (3³/3) - (1³/3) = 27/3 - 1/3 = 26/3

Answer: 26/3 ≈ 8.667

This is the area under y = x² from x = 1 to x = 3.
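A midpoint Riemann sum reproduces this number -- the "tiny rectangles" idea made concrete (illustrative Python):

```python
# Approximate ∫ from 1 to 3 of x² dx with 100,000 midpoint rectangles.
def f(x):
    return x**2

n = 100_000
a, b = 1.0, 3.0
dx = (b - a) / n
total = sum(f(a + (i + 0.5) * dx) for i in range(n)) * dx
print(total)  # ≈ 26/3 ≈ 8.667
```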

Example -- Definite Integral with Multiple Terms

Evaluate ∫ from 0 to 2 of (3x² - 4x + 1) dx

Antiderivative: F(x) = x³ - 2x² + x

F(2) = 8 - 8 + 2 = 2

F(0) = 0 - 0 + 0 = 0

Answer: 2 - 0 = 2

Example -- Using e and ln

Evaluate ∫ from 1 to e of (1/x) dx

Antiderivative: F(x) = ln(x)

F(e) - F(1) = ln(e) - ln(1) = 1 - 0

Answer: 1

The area under 1/x from 1 to e is exactly 1. In fact, this property is one way the number e can be defined.

Warning

The definite integral gives signed area. If f(x) is below the x-axis, the integral is negative. To find the total (unsigned) area, you need to split the integral at the zeros and take absolute values.

8. Applications of Integrals

Area Between Two Curves

To find the area between two curves f(x) and g(x) (where f(x) ≥ g(x)) from a to b:

∫ from a to b of [f(x) - g(x)] dx    "Integral of (top curve minus bottom curve)"
Example -- Area Between Curves

Find the area between y = x² and y = x from x = 0 to x = 1.

On [0, 1], x ≥ x² (the line is above the parabola).

∫ from 0 to 1 of (x - x²) dx

= [x²/2 - x³/3] from 0 to 1

= (1/2 - 1/3) - (0 - 0) = 1/6

Area = 1/6
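The same midpoint-sum idea confirms the area between the curves (illustrative Python):

```python
# Area between y = x (top) and y = x² (bottom) on [0, 1].
def top_minus_bottom(x):
    return x - x**2

n = 100_000
dx = 1.0 / n
area = sum(top_minus_bottom((i + 0.5) * dx) for i in range(n)) * dx
print(area)  # ≈ 1/6 ≈ 0.1667
```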

Why CS People Care About Integrals

Probability Distributions

In statistics and ML, continuous probability distributions are defined using integrals. The probability that a random variable X falls between a and b is:

P(a ≤ X ≤ b) = ∫ from a to b of f(x) dx, where f(x) is the probability density function (PDF)

The total area under the entire PDF must equal 1 (there is a 100% chance of something happening). This is why integrals are unavoidable in probability.

Expected Value

The expected value (average outcome) of a continuous random variable is:

E[X] = ∫ from -∞ to ∞ of x · f(x) dx

This shows up everywhere in ML: loss functions, reward signals in reinforcement learning, and Bayesian statistics.

Computational Geometry

Integrals help compute areas and volumes of complex shapes, which matters for computer graphics, geographic information systems (GIS), physics simulations, and any application that works with continuous shapes.

Tip

In practice, computers almost never compute integrals symbolically. They use numerical integration -- approximating the area by dividing it into tiny rectangles or trapezoids and adding them up. But understanding the theory tells you what the computer is approximating and why.
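Here is what that looks like in practice: a minimal trapezoidal-rule sketch in Python (our own helper function, not a library API):

```python
import math

# Trapezoidal rule: approximate ∫ from a to b of f(x) dx with n trapezoids.
def trapezoid(f, a, b, n):
    dx = (b - a) / n
    total = 0.5 * (f(a) + f(b))      # endpoints count half
    for i in range(1, n):
        total += f(a + i * dx)       # interior points count fully
    return total * dx

# Area under 1/x from 1 to e -- should be very close to 1.
print(trapezoid(lambda x: 1 / x, 1.0, math.e, 10_000))
```

Libraries like SciPy offer far more sophisticated versions of this, but the core idea is just this loop.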

9. Partial Derivatives & Gradients (For ML)

Everything you learned about derivatives so far involved one variable: y = f(x). But in real-world CS, functions depend on many variables at once. A machine learning loss function might depend on millions of weights. How do you take the derivative when there are 10 inputs, not 1?

The answer: partial derivatives. You take the derivative with respect to one variable at a time, treating all the others as constants.

What Is a Partial Derivative?

Consider this function with two inputs:

f(x, y) = x² + 3xy + y²

The partial derivative with respect to x (written ∂f/∂x) asks: "If I nudge x slightly while keeping y fixed, how much does f change?"

To compute it, just differentiate normally with respect to x, treating y as a constant number:

Example: Partial Derivatives of f(x, y) = x² + 3xy + y²

∂f/∂x: Treat y as a constant, differentiate with respect to x:

x² → 2x    3xy → 3y (y is a constant times x)    y² → 0 (constant)

∂f/∂x = 2x + 3y


∂f/∂y: Treat x as a constant, differentiate with respect to y:

x² → 0 (constant)    3xy → 3x (x is a constant times y)    y² → 2y

∂f/∂y = 3x + 2y

That's it. Each partial derivative tells you how sensitive the output is to one specific input. If ∂f/∂x = 10 and ∂f/∂y = 2, then tweaking x has 5 times more effect than tweaking y.

Example: f(x, y, z) = 2x²y + z³ - 5xz

∂f/∂x (y and z are constants): 4xy + 0 - 5z = 4xy - 5z

∂f/∂y (x and z are constants): 2x² + 0 - 0 = 2x²

∂f/∂z (x and y are constants): 0 + 3z² - 5x = 3z² - 5x
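Partial derivatives can be checked the same way ordinary derivatives can: nudge one argument, hold the rest fixed. An illustrative Python sketch (partial is our own helper, not a library function):

```python
# f(x, y, z) = 2x²y + z³ - 5xz, from the example above.
def f(x, y, z):
    return 2 * x**2 * y + z**3 - 5 * x * z

def partial(f, args, i, h=1e-6):
    # Central difference in argument i, holding the others fixed.
    lo, hi = list(args), list(args)
    lo[i] -= h
    hi[i] += h
    return (f(*hi) - f(*lo)) / (2 * h)

x, y, z = 1.0, 2.0, 3.0
print(partial(f, (x, y, z), 0))  # ≈ 4xy - 5z = 8 - 15 = -7
print(partial(f, (x, y, z), 1))  # ≈ 2x² = 2
print(partial(f, (x, y, z), 2))  # ≈ 3z² - 5x = 27 - 5 = 22
```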

The Gradient Vector

If you collect all the partial derivatives into a single vector, you get the gradient (written ∇f, pronounced "nabla f" or "grad f"):

∇f = (∂f/∂x, ∂f/∂y, ∂f/∂z, ...)

For f(x, y) = x² + 3xy + y²:
∇f = (2x + 3y, 3x + 2y)

The gradient has a beautiful geometric meaning: it points in the direction of steepest increase. If you want the function to grow as fast as possible, walk in the direction of the gradient. If you want it to shrink as fast as possible, walk in the opposite direction.

Loss landscape (imagine a hilly surface):

  Loss
   ^
   | high
   |   *  ← you are here
   |    \
   |     \  ← gradient points uphill,
   |      \    so NEGATIVE gradient points downhill
   |       \
   | low    *  ← minimum (goal!)
   +-------------------> weights

Gradient descent: move OPPOSITE to the gradient
new_weights = old_weights - learning_rate × ∇Loss

Why This Is the Heart of Machine Learning

In machine learning, you have a loss function that measures how wrong your model is. It depends on all the model's weights (parameters). A neural network might have millions of weights: w₁, w₂, ..., w₁₀₀₀₀₀₀.

Training the model means minimizing the loss. How? Gradient descent:

1. Compute the gradient: ∇Loss = (∂Loss/∂w₁, ∂Loss/∂w₂, ..., ∂Loss/∂wₙ)
2. Update each weight: wᵢ = wᵢ - learning_rate × ∂Loss/∂wᵢ
3. Repeat until loss is small enough

Each partial derivative ∂Loss/∂wᵢ tells you: "How much does the loss change if I nudge weight i?" If the partial derivative is large and positive, increasing that weight makes the loss worse -- so decrease it. If it's negative, increasing that weight helps -- so increase it.

Concrete Example: Simple Linear Model

Model: prediction = wx + b (two parameters: weight w and bias b)

Loss for one data point: L = (prediction - actual)² = (wx + b - y)²


∂L/∂w = 2(wx + b - y) · x    (chain rule!)

∂L/∂b = 2(wx + b - y) · 1


If x = 3, y = 7, w = 1, b = 0:

prediction = 1(3) + 0 = 3    error = 3 - 7 = -4

∂L/∂w = 2(-4)(3) = -24 → w should increase (gradient is negative)

∂L/∂b = 2(-4)(1) = -8 → b should increase too


With learning_rate = 0.01:

new w = 1 - 0.01(-24) = 1 + 0.24 = 1.24

new b = 0 - 0.01(-8) = 0 + 0.08 = 0.08

New prediction: 1.24(3) + 0.08 = 3.80 (closer to 7 than before!)
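The worked numbers above translate directly into code (illustrative Python):

```python
# One gradient-descent step for the linear model: prediction = wx + b.
x, y = 3.0, 7.0          # one data point
w, b = 1.0, 0.0          # initial parameters
learning_rate = 0.01

prediction = w * x + b           # 3.0
error = prediction - y           # -4.0
dL_dw = 2 * error * x            # ∂L/∂w = 2(wx + b - y)·x = -24.0
dL_db = 2 * error                # ∂L/∂b = 2(wx + b - y)   = -8.0

w = w - learning_rate * dL_dw    # ≈ 1.24
b = b - learning_rate * dL_db    # ≈ 0.08
print(w, b, w * x + b)           # new prediction ≈ 3.8, closer to 7
```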

Backpropagation = Chain Rule on Steroids

A neural network is a chain of functions: each layer takes the previous layer's output and transforms it. To find ∂Loss/∂w for a weight in an early layer, you use the chain rule repeatedly through every layer between that weight and the output. This process is called backpropagation.

Layer 1 output: h₁ = f(w₁ · input)
Layer 2 output: h₂ = g(w₂ · h₁)
Loss: L = (h₂ - target)²

To find ∂L/∂w₁, chain rule through the layers:
∂L/∂w₁ = (∂L/∂h₂) · (∂h₂/∂h₁) · (∂h₁/∂w₁)

Each term is a partial derivative. Multiply them all together.
This is what PyTorch/TensorFlow compute automatically with "autograd."
The Big Picture

Partial derivatives let you ask "what happens if I change just this one thing?" The gradient collects all those answers into a direction. Gradient descent follows that direction downhill. Backpropagation uses the chain rule to compute the gradient efficiently through layers. That's all of machine learning optimization in four sentences.

You don't need to compute these by hand -- frameworks do it. But understanding what they compute helps you debug, tune learning rates, and understand why training fails (vanishing/exploding gradients).

10. Practice Quiz

Test your understanding of limits, derivatives, and integrals.

Q1: What is lim (x² - 9) / (x - 3) as x → 3?

Answer: 6. Factor the numerator: (x-3)(x+3) / (x-3) = x+3. Plug in x = 3: 3 + 3 = 6.

Q2: What is the derivative of f(x) = 5x⁴ - 3x + 8?

Answer: 20x³ - 3. Power rule on each term: d/dx(5x⁴) = 20x³, d/dx(-3x) = -3, d/dx(8) = 0.

Q3: Using the chain rule, what is d/dx [sin(3x)]?

Answer: 3 cos(3x). Outer: sin(u) → cos(u). Inner: 3x → 3. Chain rule: cos(3x) · 3 = 3 cos(3x).

Q4: What is ∫ (6x² + 4x) dx?

Answer: 2x³ + 2x² + C. Integrate term by term: ∫6x² dx = 6(x³/3) = 2x³, ∫4x dx = 4(x²/2) = 2x². Do not forget the +C!

Q5: Evaluate ∫ from 0 to 1 of 2x dx

Answer: 1. Antiderivative: x². Evaluate: (1)² - (0)² = 1 - 0 = 1. Note: definite integrals give a number, no +C needed.