The math of change and accumulation -- limits, derivatives, and integrals explained step by step for programmers.
"Do I really need calculus as a programmer?" If you want to work in areas like machine learning, computer graphics, physics simulation, or data science, the answer is yes:
Even if you never solve an integral by hand again, understanding what derivatives and integrals mean lets you use libraries and tools effectively instead of treating them as magic black boxes.
Calculus sounds intimidating, but at its core it answers two simple questions:
That is it. Every formula, every rule, every theorem in calculus is just a tool for answering one of those two questions more precisely.
Imagine you are driving a car. Your speedometer tells you your speed at this exact instant -- that is a derivative. Your odometer tells you the total distance you have traveled -- that is an integral. Calculus is the math that connects these two ideas.
These two operations are inverses of each other -- and that beautiful connection is called the Fundamental Theorem of Calculus.
You might wonder: "I write code, not physics papers. Why do I need this?" Here is why:
You do not need to be a calculus wizard to be a great programmer. But understanding the core ideas -- especially derivatives and what they mean -- will unlock entire areas of CS that would otherwise feel like black-box magic.
Before you can understand derivatives or integrals, you need limits. A limit answers the question: "What value does f(x) approach as x gets closer and closer to some number?"
Notice the word "approach." We do not care what f(x) actually equals at that point -- we care about what it is heading toward.
This means: as x gets closer and closer to a (from both sides), f(x) gets closer and closer to L.
Sometimes a function approaches different values depending on which direction you come from:
The full (two-sided) limit exists only if the left-hand and right-hand limits are equal. If they disagree, the limit does not exist.
We can also ask what happens as x grows without bound:
Find: lim (1/x) as x → ∞
As x gets bigger and bigger (10, 100, 1000, ...), 1/x gets smaller and smaller (0.1, 0.01, 0.001, ...).
Answer: The limit is 0.
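You can watch this limit happen numerically. Here is a minimal sketch in plain Python that evaluates f(x) = 1/x at ever-larger inputs:

```python
def f(x):
    return 1 / x

# As x grows without bound, f(x) shrinks toward 0.
for x in [10, 100, 1000, 10_000]:
    print(x, f(x))
# The outputs 0.1, 0.01, 0.001, 0.0001 are heading toward 0.
```

This is exactly what "the limit as x → ∞ is 0" means: you can make 1/x as close to 0 as you like by taking x large enough.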
A limit does not exist (DNE) when:
Method 1: Direct Substitution
Just plug in the value of a. If you get a real number, that is the limit. This works for most "nice" functions (polynomials, exponentials, trig).
Find: lim (x² + 3x) as x → 2
Plug in x = 2: (2)² + 3(2) = 4 + 6 = 10
Answer: The limit is 10.
Method 2: Factoring
If direct substitution gives 0/0 (an indeterminate form), try factoring and canceling.
Find: lim (x² - 4) / (x - 2) as x → 2
Direct substitution gives: (4 - 4) / (2 - 2) = 0/0. Indeterminate!
Factor the numerator: (x - 2)(x + 2) / (x - 2)
Cancel (x - 2): x + 2
Now substitute: 2 + 2 = 4
Answer: The limit is 4.
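A quick numeric sanity check makes the idea concrete: the expression is undefined at x = 2, but values near 2 close in on the limit from both sides.

```python
def g(x):
    # Undefined at exactly x = 2 (0/0), but well-behaved nearby.
    return (x**2 - 4) / (x - 2)

for x in [1.9, 1.99, 1.999, 2.001, 2.01, 2.1]:
    print(x, g(x))
# Values approach 4 from both the left and the right.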
Method 3: Rationalizing
When you have square roots and get 0/0, multiply by the conjugate.
Find: lim (√(x+1) - 1) / x as x → 0
Direct substitution: (√1 - 1) / 0 = 0/0. Indeterminate!
Multiply by the conjugate: [(√(x+1) - 1)(√(x+1) + 1)] / [x(√(x+1) + 1)]
Numerator becomes: (x+1) - 1 = x
So we get: x / [x(√(x+1) + 1)] = 1 / (√(x+1) + 1)
Now substitute x = 0: 1 / (√1 + 1) = 1/2
Answer: The limit is 1/2.
Find: lim (3x² + 2x) / (x² - 5) as x → ∞
Divide every term by x² (the highest power): (3 + 2/x) / (1 - 5/x²)
As x → ∞, 2/x → 0 and 5/x² → 0
So the limit becomes: 3 / 1 = 3
Answer: The limit is 3.
For a ratio of polynomials as x → ∞: compare the highest degree in the numerator and denominator. Same degree? Limit = ratio of leading coefficients. Numerator higher? Limit is ∞. Denominator higher? Limit is 0.
0/0 is NOT the answer -- it means the limit is "indeterminate" and you need to do more work (factor, rationalize, or use L'Hopital's rule). Do not confuse 0/0 with "the limit is 0" or "the limit does not exist."
A derivative measures the instantaneous rate of change of a function. If f(x) describes your position at time x, then f'(x) tells you your velocity -- how fast you are moving right at that instant.
This says: take two points on the curve that are very close together (distance h apart), compute the slope of the line between them, and see what happens as h shrinks to zero. The slope of that infinitely-close line is the derivative.
The derivative at a point is the slope of the tangent line to the curve at that point. A positive derivative means the function is going up; a negative derivative means it is going down; a zero derivative means it is flat (possibly a peak or valley).
You will see derivatives written in several ways. They all mean the same thing:
Leibniz notation (dy/dx) is not a fraction, but it behaves like one in many situations -- which is part of its power. It reminds you that the derivative is a ratio of tiny changes.
You do not need to use the formal limit definition every time. These rules let you find derivatives quickly:
| Rule | Formula | Example |
|---|---|---|
| Constant | d/dx(c) = 0 | d/dx(7) = 0 |
| Power Rule | d/dx(xⁿ) = n · xⁿ⁻¹ | d/dx(x³) = 3x² |
| Constant Multiple | d/dx(c · f) = c · f'(x) | d/dx(5x²) = 10x |
| Sum Rule | d/dx(f + g) = f' + g' | d/dx(x² + 3x) = 2x + 3 |
| Difference Rule | d/dx(f - g) = f' - g' | d/dx(x³ - x) = 3x² - 1 |
Find the derivative of f(x) = x²
Using the limit definition (to show why the power rule works):
f'(x) = lim [(x+h)² - x²] / h as h → 0
= lim [x² + 2xh + h² - x²] / h
= lim [2xh + h²] / h
= lim (2x + h)
= 2x
This matches the power rule: d/dx(x²) = 2x¹ = 2x.
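The limit definition also suggests how to approximate derivatives in code: compute the slope of a secant line over a tiny interval. This sketch uses a central difference (sampling on both sides of x, which is more accurate than a one-sided step):

```python
def derivative(f, x, h=1e-6):
    # Slope of a tiny secant line centered on x --
    # a numerical stand-in for the limit as h -> 0.
    return (f(x + h) - f(x - h)) / (2 * h)

f = lambda x: x**2
print(derivative(f, 3.0))  # ≈ 6.0, matching f'(x) = 2x at x = 3
```

This technique (a finite difference) is how you can sanity-check any hand-computed derivative.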
Find f'(x) where f(x) = 4x³ - 2x² + 7x - 5
Apply the rules term by term:
f'(x) = 12x² - 4x + 7
The power rule works for all exponents, not just positive integers:
The power rule only works on xⁿ where n is a constant. It does NOT work on functions like 2ˣ or xˣ -- those require different rules.
The basic rules handle simple terms, but what about products, quotients, and compositions of functions? That is where these three powerful rules come in.
Find d/dx [x² · sin(x)]
Let f = x² and g = sin(x)
f' = 2x, g' = cos(x)
Product rule: f'g + fg' = 2x · sin(x) + x² · cos(x)
Answer: 2x sin(x) + x² cos(x)
Find d/dx [(3x + 1) / (x² + 1)]
f = 3x + 1, g = x² + 1
f' = 3, g' = 2x
= [3(x² + 1) - (3x + 1)(2x)] / (x² + 1)²
= [3x² + 3 - 6x² - 2x] / (x² + 1)²
= (-3x² - 2x + 3) / (x² + 1)²
The chain rule is arguably the most important rule in all of calculus, especially for machine learning. It tells you how to differentiate compositions of functions -- a function inside another function.
Find d/dx [(3x + 2)³]
Outer function: f(u) = u³, so f'(u) = 3u²
Inner function: g(x) = 3x + 2, so g'(x) = 3
Chain rule: f'(g(x)) · g'(x) = 3(3x + 2)² · 3
= 9(3x + 2)²
Find d/dx [sin(x²)]
Outer: sin(u), derivative = cos(u)
Inner: x², derivative = 2x
= cos(x²) · 2x = 2x cos(x²)
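You can verify the chain-rule result numerically by comparing a finite-difference slope against the analytic answer. A small sketch, assuming the central-difference approach:

```python
import math

def derivative(f, x, h=1e-6):
    # Central-difference approximation of f'(x).
    return (f(x + h) - f(x - h)) / (2 * h)

f = lambda x: math.sin(x**2)
x = 1.3
numeric = derivative(f, x)
analytic = 2 * x * math.cos(x**2)  # the chain-rule result
print(numeric, analytic)           # the two agree to many decimal places
```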
Neural networks are just chains of functions: layer after layer of transformations. Backpropagation -- the algorithm that trains neural networks -- is literally the chain rule applied repeatedly. When people say "deep learning requires calculus," this is what they mean.
Memorize these. They come up constantly:
| f(x) | f'(x) | Notes |
|---|---|---|
| sin(x) | cos(x) | |
| cos(x) | -sin(x) | Note the negative sign |
| tan(x) | sec²(x) | |
| eˣ | eˣ | eˣ is its own derivative! |
| ln(x) | 1/x | Only for x > 0 |
| aˣ | aˣ · ln(a) | General exponential |
Find d/dx [eˣ · ln(x)]
Use product rule with f = eˣ, g = ln(x):
f' = eˣ, g' = 1/x
= eˣ · ln(x) + eˣ · (1/x)
= eˣ(ln(x) + 1/x)
Find d/dx [e^(3x²)]
Outer: eᵘ, derivative = eᵘ
Inner: 3x², derivative = 6x
= 6x · e^(3x²)
Derivatives are not just abstract math -- they are tools for solving real problems. Here are the most important applications.
At the top of a hill or the bottom of a valley, the slope is zero. This means we can find peaks and valleys of any function by setting its derivative equal to zero and solving.
Find the maximum and minimum of f(x) = x³ - 3x + 2
Step 1: f'(x) = 3x² - 3
Step 2: Set f'(x) = 0: 3x² - 3 = 0 → x² = 1 → x = 1 or x = -1
Step 3: f''(x) = 6x. Since f''(1) = 6 > 0, x = 1 is a local minimum (f(1) = 0); since f''(-1) = -6 < 0, x = -1 is a local maximum (f(-1) = 4).
Real-world optimization follows the same pattern: write a function for the thing you want to maximize or minimize, take the derivative, set it to zero, solve.
A farmer has 100 meters of fencing. What dimensions maximize the area of a rectangular pen?
Let width = x, so length = (100 - 2x) / 2 = 50 - x
Area: A(x) = x(50 - x) = 50x - x²
A'(x) = 50 - 2x
Set A'(x) = 0: 50 - 2x = 0 → x = 25
So width = 25m, length = 25m. A square!
Maximum area = 25 × 25 = 625 m²
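The calculus gives an exact answer, but you can confirm it with a brute-force scan over candidate widths, a minimal sketch in plain Python:

```python
def area(x):
    # Width x, length 50 - x (100 m of fencing around a rectangle).
    return x * (50 - x)

# Scan widths from 0 to 50 in steps of 0.01 and keep the best one.
best_x = max((x / 100 for x in range(0, 5001)), key=area)
print(best_x, area(best_x))  # 25.0 625.0 -- agrees with A'(x) = 0
```

The derivative found the peak in one line of algebra; the scan had to test 5001 candidates. That efficiency gap is exactly why optimization relies on derivatives.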
Related rates problems ask: if one quantity is changing, how fast is a related quantity changing? You use the chain rule with respect to time.
A circle's radius is increasing at 2 cm/s. How fast is the area increasing when r = 5 cm?
A = πr²
Differentiate both sides with respect to time t:
dA/dt = 2πr · (dr/dt)
Plug in r = 5 and dr/dt = 2:
dA/dt = 2π(5)(2) = 20π ≈ 62.83 cm²/s
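A numeric check of the same computation: advance time by a tiny step dt, grow the radius accordingly, and see how much area accumulated. This is a sketch of the chain-rule idea, not a general solver:

```python
import math

def area(r):
    return math.pi * r ** 2

r, dr_dt = 5.0, 2.0        # radius 5 cm, growing at 2 cm/s
dt = 1e-8                   # a tiny time step
dA_dt = (area(r + dr_dt * dt) - area(r)) / dt
print(dA_dt, 20 * math.pi)  # both ≈ 62.83
```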
In machine learning, you have a loss function that measures how wrong your model is. Training the model means finding the inputs (weights) that minimize this loss. How? Derivatives!
The idea is beautifully simple:
Think of gradient descent like walking downhill in fog. You cannot see the bottom, but you can feel the slope under your feet (the derivative). Take a step in the steepest downhill direction, check the slope again, repeat. You will eventually reach a valley.
Gradient descent can get stuck in local minima -- a valley that is not the lowest point overall. This is a real problem in ML and is why techniques like momentum, Adam optimizer, and random restarts exist.
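The walking-downhill loop is only a few lines of code. Here is a minimal sketch using a toy one-dimensional "loss" whose minimum is at x = 3 (a real ML loss has millions of dimensions, but the loop is the same):

```python
def loss(x):
    return (x - 3) ** 2      # toy loss, minimized at x = 3

def grad(x):
    return 2 * (x - 3)       # derivative of the loss

x = 10.0                      # arbitrary starting point
learning_rate = 0.1
for _ in range(100):
    x -= learning_rate * grad(x)  # step in the downhill direction
print(x)                          # converges to ≈ 3.0
```

Try changing the learning rate: too small and convergence crawls; too large (say 1.1) and the steps overshoot and diverge. Tuning this trade-off is a daily concern in ML practice.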
If derivatives answer "how fast is it changing?", integrals answer "how much has accumulated?" Integration is the reverse process of differentiation -- finding the antiderivative.
If F'(x) = f(x), then F(x) is the antiderivative of f(x). We write this as:
Since the derivative of any constant is zero, there are infinitely many antiderivatives. For example, x² + 5 and x² - 100 and x² + π all have the same derivative: 2x. The "+C" captures all these possibilities.
Never forget the +C on indefinite integrals. It is a constant source of lost marks (pun intended) and it actually matters in applications -- the constant represents initial conditions (like starting position or initial amount).
These are the derivative rules, but reversed:
| Function | Integral | Example |
|---|---|---|
| xⁿ (n ≠ -1) | xⁿ⁺¹ / (n+1) + C | ∫ x³ dx = x⁴/4 + C |
| 1/x | ln\|x\| + C | ∫ (1/x) dx = ln\|x\| + C |
| eˣ | eˣ + C | ∫ eˣ dx = eˣ + C |
| cos(x) | sin(x) + C | ∫ cos(x) dx = sin(x) + C |
| sin(x) | -cos(x) + C | ∫ sin(x) dx = -cos(x) + C |
| constant k | kx + C | ∫ 5 dx = 5x + C |
You can always verify an integral by differentiating your answer. If you get back the original function, you did it right. This is the best self-check in calculus.
Find ∫ x⁴ dx
Add 1 to the exponent: 4 + 1 = 5
Divide by the new exponent: x⁵ / 5
∫ x⁴ dx = x⁵/5 + C
Check: d/dx (x⁵/5 + C) = 5x⁴/5 = x⁴. Correct!
Find ∫ (3x² + 2x - 7) dx
Integrate term by term:
= x³ + x² - 7x + C
Find ∫ (1/x³) dx
Rewrite: 1/x³ = x⁻³
Apply power rule: x⁻³⁺¹ / (-3+1) = x⁻² / (-2) = -1/(2x²)
∫ (1/x³) dx = -1/(2x²) + C
Find ∫ √x dx
Rewrite: √x = x^(1/2)
Apply power rule: x^(1/2 + 1) / (1/2 + 1) = x^(3/2) / (3/2) = (2/3)x^(3/2)
∫ √x dx = (2/3)x^(3/2) + C
An indefinite integral gives you a family of functions (+C). A definite integral gives you a specific number -- the total accumulation between two points.
The definite integral of f(x) from a to b represents the signed area between the curve and the x-axis, from x = a to x = b.
This is the crown jewel of calculus. It says that differentiation and integration are inverses of each other, and it gives us a practical way to evaluate definite integrals:
Notice that the +C cancels out (it appears in both F(b) and F(a)), which is why definite integrals do not have a +C.
The notation F(x)|ₐᵇ (with the evaluation bar) is a shorthand for F(b) - F(a). You will see this written as [F(x)]ₐᵇ in many textbooks.
Evaluate ∫ from 1 to 3 of x² dx
Step 1: Find antiderivative: F(x) = x³/3
Step 2: Evaluate at the bounds:
F(3) - F(1) = (3³/3) - (1³/3) = 27/3 - 1/3 = 26/3
Answer: 26/3 ≈ 8.667
This is the area under y = x² from x = 1 to x = 3.
Evaluate ∫ from 0 to 2 of (3x² - 4x + 1) dx
Antiderivative: F(x) = x³ - 2x² + x
F(2) = 8 - 8 + 2 = 2
F(0) = 0 - 0 + 0 = 0
Answer: 2 - 0 = 2
Evaluate ∫ from 1 to e of (1/x) dx
Antiderivative: F(x) = ln(x)
F(e) - F(1) = ln(e) - ln(1) = 1 - 0
Answer: 1
The area under 1/x from 1 to e is exactly 1. In fact, this is one way the number e can be defined: it is the upper bound that makes this area equal 1.
The definite integral gives signed area. If f(x) is below the x-axis, the integral is negative. To find the total (unsigned) area, you need to split the integral at the zeros and take absolute values.
To find the area between two curves f(x) and g(x) (where f(x) ≥ g(x)) from a to b:
Find the area between y = x² and y = x from x = 0 to x = 1.
On [0, 1], x ≥ x² (the line is above the parabola).
∫ from 0 to 1 of (x - x²) dx
= [x²/2 - x³/3] from 0 to 1
= (1/2 - 1/3) - (0 - 0) = 1/6
Area = 1/6
Probability Distributions
In statistics and ML, continuous probability distributions are defined using integrals. The probability that a random variable X falls between a and b is:
The total area under the entire PDF must equal 1 (there is a 100% chance of something happening). This is why integrals are unavoidable in probability.
Expected Value
The expected value (average outcome) of a continuous random variable is:
This shows up everywhere in ML: loss functions, reward signals in reinforcement learning, and Bayesian statistics.
Computational Geometry
Integrals help compute areas and volumes of complex shapes, which matters for computer graphics, geographic information systems (GIS), physics simulations, and any application that works with continuous shapes.
In practice, computers almost never compute integrals symbolically. They use numerical integration -- approximating the area by dividing it into tiny rectangles or trapezoids and adding them up. But understanding the theory tells you what the computer is approximating and why.
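The rectangles-and-trapezoids idea fits in a few lines. A minimal sketch of the trapezoid rule, checked against the earlier worked example ∫ from 1 to 3 of x² dx = 26/3:

```python
def trapezoid(f, a, b, n=1000):
    # Approximate the definite integral of f from a to b
    # by summing the areas of n trapezoids.
    h = (b - a) / n
    total = 0.5 * (f(a) + f(b))        # endpoints count half
    for i in range(1, n):
        total += f(a + i * h)          # interior sample points
    return total * h

print(trapezoid(lambda x: x**2, 1, 3))  # ≈ 8.6667, i.e. 26/3
```

Production libraries use smarter schemes (adaptive step sizes, higher-order rules), but they are all refinements of this same "chop the area into strips and add them up" idea.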
Everything you learned about derivatives so far involved one variable: y = f(x). But in real-world CS, functions depend on many variables at once. A machine learning loss function might depend on millions of weights. How do you take the derivative when there are 10 inputs, not 1?
The answer: partial derivatives. You take the derivative with respect to one variable at a time, treating all the others as constants.
Consider this function with two inputs:
The partial derivative with respect to x (written ∂f/∂x) asks: "If I nudge x slightly while keeping y fixed, how much does f change?"
To compute it, just differentiate normally with respect to x, treating y as a constant number:
∂f/∂x: Treat y as a constant, differentiate with respect to x:
x² → 2x
3xy → 3y (y is a constant multiplying x)
y² → 0 (a constant)
∂f/∂x = 2x + 3y
∂f/∂y: Treat x as a constant, differentiate with respect to y:
x² → 0 (a constant)
3xy → 3x (x is a constant multiplying y)
y² → 2y
∂f/∂y = 3x + 2y
That's it. Each partial derivative tells you how sensitive the output is to one specific input. If ∂f/∂x = 10 and ∂f/∂y = 2, then tweaking x has 5 times more effect than tweaking y.
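Partial derivatives have the same finite-difference check as ordinary ones: nudge one input, hold the rest fixed. A sketch, using the f(x, y) = x² + 3xy + y² example from above:

```python
def partial(f, point, i, h=1e-6):
    # Nudge only coordinate i, holding the other coordinates fixed.
    up = list(point); up[i] += h
    dn = list(point); dn[i] -= h
    return (f(*up) - f(*dn)) / (2 * h)

f = lambda x, y: x**2 + 3 * x * y + y**2
x, y = 2.0, 1.0
print(partial(f, (x, y), 0))  # ≈ 2x + 3y = 7
print(partial(f, (x, y), 1))  # ≈ 3x + 2y = 8
```

Collecting both numbers into the vector (7, 8) gives the gradient at this point, which is exactly the construction the next section describes.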
For f(x, y, z) = 2x²y + z³ - 5xz:
∂f/∂x (y and z are constants): 4xy + 0 - 5z = 4xy - 5z
∂f/∂y (x and z are constants): 2x² + 0 - 0 = 2x²
∂f/∂z (x and y are constants): 0 + 3z² - 5x = 3z² - 5x
If you collect all the partial derivatives into a single vector, you get the gradient (written ∇f, pronounced "nabla f" or "grad f"):
The gradient has a beautiful geometric meaning: it points in the direction of steepest increase. If you want the function to grow as fast as possible, walk in the direction of the gradient. If you want it to shrink as fast as possible, walk in the opposite direction.
In machine learning, you have a loss function that measures how wrong your model is. It depends on all the model's weights (parameters). A neural network might have millions of weights: w₁, w₂, ..., w₁₀₀₀₀₀₀.
Training the model means minimizing the loss. How? Gradient descent:
Each partial derivative ∂Loss/∂wᵢ tells you: "How much does the loss change if I nudge weight i?" If the partial derivative is large and positive, increasing that weight makes the loss worse -- so decrease it. If it's negative, increasing that weight helps -- so increase it.
Model: prediction = wx + b (two parameters: weight w and bias b)
Loss for one data point: L = (prediction - actual)² = (wx + b - y)²
∂L/∂w = 2(wx + b - y) · x (chain rule!)
∂L/∂b = 2(wx + b - y) · 1
If x = 3, y = 7, w = 1, b = 0:
prediction = 1(3) + 0 = 3
error = 3 - 7 = -4
∂L/∂w = 2(-4)(3) = -24 → w should increase (gradient is negative)
∂L/∂b = 2(-4)(1) = -8 → b should increase too
With learning_rate = 0.01:
new w = 1 - 0.01(-24) = 1 + 0.24 = 1.24
new b = 0 - 0.01(-8) = 0 + 0.08 = 0.08
New prediction: 1.24(3) + 0.08 = 3.80 (closer to 7 than before!)
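The same single gradient-descent step, written out as plain Python so every number above can be traced:

```python
x, y = 3.0, 7.0           # one data point (input, target)
w, b = 1.0, 0.0           # initial weight and bias
learning_rate = 0.01

pred = w * x + b           # 3.0
error = pred - y           # -4.0
dL_dw = 2 * error * x      # -24.0 (chain rule)
dL_db = 2 * error          # -8.0

w -= learning_rate * dL_dw  # 1.24
b -= learning_rate * dL_db  # 0.08
print(w * x + b)            # ≈ 3.80 -- closer to 7 than before
```

Real training just repeats this step over many data points and many parameters at once; the arithmetic per parameter is exactly what you see here.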
A neural network is a chain of functions: each layer takes the previous layer's output and transforms it. To find ∂Loss/∂w for a weight in an early layer, you use the chain rule repeatedly through every layer between that weight and the output. This process is called backpropagation.
Partial derivatives let you ask "what happens if I change just this one thing?" The gradient collects all those answers into a direction. Gradient descent follows that direction downhill. Backpropagation uses the chain rule to compute the gradient efficiently through layers. That's all of machine learning optimization in four sentences.
You don't need to compute these by hand -- frameworks do it. But understanding what they compute helps you debug, tune learning rates, and understand why training fails (vanishing/exploding gradients).
Test your understanding of limits, derivatives, and integrals.