The mathematics of uncertainty -- from coin flips to machine learning, this is how we reason about the unknown.
Probability is the math that makes intelligent software possible: spam filters, recommendation engines, randomized algorithms, and A/B tests all depend on it.
Probability transforms you from someone who guesses to someone who quantifies uncertainty and makes data-driven decisions.
Probability is the branch of mathematics that measures how likely something is to happen. It gives us a number between 0 (impossible) and 1 (certain) -- or equivalently, between 0% and 100%. If you flip a fair coin, the probability of heads is 0.5, or 50%. Simple enough on the surface, but this idea powers everything from spam filters to self-driving cars.
Here's the easiest way to think about probability: imagine running the experiment a million times. The probability is just the fraction of those runs where your event happens.
When probability feels abstract, use this trick. "If I ran this a million times, what fraction would satisfy my condition?" That fraction IS the probability.
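This trick translates directly into code. A minimal Python sketch (the helper name `estimate_probability` is my own):

```python
import random

def estimate_probability(event, trials=1_000_000):
    """Run a random experiment many times and return the fraction
    of runs in which the event occurred."""
    return sum(event() for _ in range(trials)) / trials

# Event: a fair coin lands heads.
p_heads = estimate_probability(lambda: random.random() < 0.5)
# p_heads comes out close to 0.5
```

More trials means a better estimate: the error shrinks roughly like 1/√trials.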
The sample space (often written S or Ω) is the set of all possible outcomes of an experiment. An event is any subset of the sample space -- the outcomes we actually care about.
Sample space: S = {1, 2, 3, 4, 5, 6}
Event A = "rolling an even number" = {2, 4, 6}
Event B = "rolling greater than 4" = {5, 6}
P(even) = |{2, 4, 6}| / |{1, 2, 3, 4, 5, 6}| = 3/6 = 0.5
There are 3 even numbers out of 6 total equally likely outcomes.
Probability shows up everywhere in computer science, from hashing and randomized algorithms to machine learning and A/B testing.
Think of probability as a language for expressing uncertainty precisely. Instead of saying "it probably works," you can say "it works with probability 0.997." That precision is what makes probability so powerful in engineering.
These rules are the toolkit for combining probabilities. Once you know them, you can break complex problems into simple pieces.
The probability that an event does not happen is 1 minus the probability that it does: P(not A) = 1 - P(A). This is incredibly useful when "not happening" is easier to calculate.
Instead of counting all the ways to get at least one head (HHH, HHT, HTH, HTT, THH, THT, TTH), use the complement:
P(at least one head) = 1 - P(no heads) = 1 - P(TTT)
P(TTT) = (1/2)^3 = 1/8
P(at least one head) = 1 - 1/8 = 7/8 = 0.875
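Because all 8 flip sequences are equally likely, brute-force enumeration confirms the complement calculation. A quick sketch:

```python
from itertools import product

outcomes = list(product("HT", repeat=3))          # all 2^3 = 8 flip sequences
no_heads = [o for o in outcomes if "H" not in o]  # only ('T', 'T', 'T')

p_at_least_one_head = 1 - len(no_heads) / len(outcomes)
# 1 - 1/8 = 0.875
```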
The complement rule is like the ! (NOT) operator in code. If you know P(condition), then P(!condition) = 1 - P(condition). Whenever you see "at least one" in a probability problem, think complement first.
The probability that A or B (or both) happens is given by the addition rule: P(A or B) = P(A) + P(B) - P(A and B).
We subtract P(A and B) because we would count those outcomes twice otherwise -- once when counting A and again when counting B.
What is the probability of drawing a King or a Heart from a standard 52-card deck?
P(King) = 4/52
P(Heart) = 13/52
P(King AND Heart) = P(King of Hearts) = 1/52
P(King OR Heart) = 4/52 + 13/52 - 1/52 = 16/52 = 4/13 ≈ 0.308
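Enumerating the deck directly gives the same answer, which is a good sanity check on the addition rule. A sketch (the rank and suit labels are my own):

```python
from itertools import product

ranks = ["A", "2", "3", "4", "5", "6", "7", "8", "9", "10", "J", "Q", "K"]
suits = ["hearts", "diamonds", "clubs", "spades"]
deck = list(product(ranks, suits))  # 52 cards

hits = [card for card in deck if card[0] == "K" or card[1] == "hearts"]
p = len(hits) / len(deck)  # 16/52 ≈ 0.308
```

Note the King of Hearts appears in `hits` only once, which is exactly what subtracting P(King and Heart) accomplishes in the formula.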
Two events are mutually exclusive (or disjoint) if they cannot happen at the same time. When events are mutually exclusive, P(A and B) = 0, so the addition rule simplifies to P(A or B) = P(A) + P(B).
P(rolling a 2 or rolling a 5) = ?
These are mutually exclusive -- you cannot roll both at once.
P(2 or 5) = P(2) + P(5) = 1/6 + 1/6 = 2/6 = 1/3
The probability that both A and B happen is given by the multiplication rule: P(A and B) = P(A) × P(B | A).
Here, P(B | A) is the probability of B given that A has already occurred (conditional probability -- we will cover this fully in the next section).
What is the probability of drawing two Aces in a row from a deck (without putting the first card back)?
P(1st Ace) = 4/52
P(2nd Ace | 1st was Ace) = 3/51 (one fewer Ace, one fewer card in the deck)
P(both Aces) = (4/52) × (3/51) = 12/2652 = 1/221 ≈ 0.0045
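Python's `fractions` module keeps the arithmetic exact, which is handy for checking hand calculations like this one:

```python
from fractions import Fraction

p_first_ace = Fraction(4, 52)
p_second_ace_given_first = Fraction(3, 51)  # one ace and one card are gone

p_both = p_first_ace * p_second_ace_given_first
# Fraction(1, 221), about 0.0045
```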
Two events are independent if the occurrence of one does not affect the probability of the other. For independent events, the multiplication rule simplifies because P(B | A) = P(B), giving P(A and B) = P(A) × P(B):
P(1st coin heads AND 2nd coin heads) = ?
Coin flips are independent -- what the first coin does has no effect on the second.
P(HH) = P(H) × P(H) = (1/2) × (1/2) = 1/4 = 0.25
A 4-digit PIN where each digit is independent and chosen from 0-9:
P(guessing correctly) = P(1st digit correct) × P(2nd) × P(3rd) × P(4th)
= (1/10) × (1/10) × (1/10) × (1/10) = 1/10,000 = 0.0001
This is why longer passwords are exponentially harder to crack!
Do not confuse mutually exclusive with independent! They are different concepts. Mutually exclusive events are actually dependent -- if one happens, the other definitely cannot (P = 0). Independent events can absolutely happen at the same time.
Conditional probability answers the question: "What is the probability of A, given that B has already happened?" The notation P(A | B) reads as "the probability of A given B."
P(A | B) means: filter your universe to only the cases where B happened, then check how often A occurs in that filtered set.
Think of it like a database query:
-- P(A | B) in SQL (assuming a table of outcomes with boolean columns a, b):
SELECT
  (SELECT COUNT(*) FROM outcomes WHERE a AND b)  -- outcomes where both happen
  * 1.0 /
  (SELECT COUNT(*) FROM outcomes WHERE b)        -- total outcomes where B happens
The "|" in P(A | B) is like a WHERE clause -- it filters your world before you start counting.
The idea is that once B has happened, our sample space shrinks to only the outcomes where B is true. We then ask: of those outcomes, how many also include A?
You roll a fair die. Someone tells you the result is greater than 3. What is the probability it is a 5?
Event A: rolling a 5 = {5}
Event B: rolling greater than 3 = {4, 5, 6}
P(A and B) = P({5}) = 1/6
P(B) = P({4, 5, 6}) = 3/6 = 1/2
P(A | B) = (1/6) / (1/2) = 1/3 ≈ 0.333
Knowing the roll is greater than 3 narrows our world to three possibilities, and 5 is one of them.
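The "filter the universe" view maps directly onto set operations. A sketch:

```python
sample_space = {1, 2, 3, 4, 5, 6}

B = {x for x in sample_space if x > 3}  # condition: roll is greater than 3
A_in_B = {x for x in B if x == 5}       # of those, the rolls that are a 5

p = len(A_in_B) / len(B)  # 1/3
```

This counting shortcut works because all outcomes are equally likely; in general you divide probabilities, not counts.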
A bag has 5 red balls and 3 blue balls. You draw one ball (red), keep it out, then draw another. What is the probability the second ball is red?
After drawing one red ball, the bag has 4 red and 3 blue balls (7 total).
P(2nd red | 1st red) = 4/7 ≈ 0.571
This is conditional probability in action -- the first draw changes the conditions for the second.
A tree diagram is a visual tool for mapping out sequential events. Each branch represents a possible outcome with its probability. To find the probability of a specific path, you multiply along the branches. To find the total probability of an event, you add up all paths leading to it.
A company has two servers. Server A handles 60% of requests, Server B handles 40%. Server A fails 5% of the time, Server B fails 10% of the time. What is the probability a random request fails?
Start --+-- Server A (0.60) --+-- Fail (0.05) --> P = 0.60 x 0.05 = 0.030
        |                     +-- OK   (0.95) --> P = 0.60 x 0.95 = 0.570
        |
        +-- Server B (0.40) --+-- Fail (0.10) --> P = 0.40 x 0.10 = 0.040
                              +-- OK   (0.90) --> P = 0.40 x 0.90 = 0.360
P(fail) = 0.030 + 0.040 = 0.070 = 7%
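The whole tree collapses into one application of the law of total probability: multiply along each branch, add the parallel branches. A sketch with the servers as a small table:

```python
servers = {
    "A": {"share": 0.60, "fail_rate": 0.05},
    "B": {"share": 0.40, "fail_rate": 0.10},
}

# Each term is one "Fail" branch of the tree: share × fail_rate.
p_fail = sum(s["share"] * s["fail_rate"] for s in servers.values())
# 0.030 + 0.040 = 0.070
```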
Tree diagrams are your best friend for multi-step probability problems. Even if you do not draw them on paper, thinking in terms of "branches" helps you organize your calculation. Each branch multiplies, parallel branches add.
Bayes' theorem is arguably the single most important formula in machine learning and data science. It lets you update your beliefs when you receive new evidence. It flips conditional probability around: if you know P(B | A), Bayes tells you P(A | B).
You start with a belief (prior). You see evidence. You update your belief (posterior).
Example thought process: before reading an email, you might believe there is a 1% chance it is spam (your prior). Then you notice the word "FREE" in the subject line (the evidence). Because "FREE" shows up far more often in spam than in legitimate mail, you revise your estimate upward -- to, say, 8% (your posterior).
Seeing "FREE" updated our belief from 1% to 8%. That's Bayes -- evidence changes belief proportionally to how surprising the evidence is.
The derivation is straightforward from the definition of conditional probability: P(A | B) = P(A and B) / P(B) and P(B | A) = P(A and B) / P(A). Solving both for P(A and B) and equating gives Bayes' theorem: P(A | B) = P(B | A) × P(A) / P(B).
| Term | Name | Meaning |
|---|---|---|
| P(A) | Prior | What you believed before seeing evidence |
| P(B \| A) | Likelihood | How likely the evidence is if A is true |
| P(B) | Evidence (Marginal) | How likely the evidence is overall |
| P(A \| B) | Posterior | Your updated belief after seeing evidence |
A disease affects 1 in 1,000 people. A test for the disease is 99% accurate (it correctly identifies 99% of sick people, and correctly identifies 99% of healthy people). You test positive. What is the probability you actually have the disease?
Your intuition probably says 99%. The real answer is shocking.
Let D = has disease, T+ = tests positive.
P(D) = 0.001 (prior -- 1 in 1,000 people are sick)
P(T+ | D) = 0.99 (sensitivity -- test catches 99% of sick people)
P(T+ | not D) = 0.01 (false positive rate -- 1% of healthy people test positive)
We need P(D | T+). First, find P(T+) using the law of total probability:
P(T+) = P(T+ | D) × P(D) + P(T+ | not D) × P(not D)
P(T+) = (0.99)(0.001) + (0.01)(0.999) = 0.00099 + 0.00999 = 0.01098
Now apply Bayes:
P(D | T+) = (0.99 × 0.001) / 0.01098 = 0.00099 / 0.01098 = 0.0902 ≈ 9%
Even with a 99% accurate test, a positive result only means a 9% chance of having the disease! This is because the disease is so rare that false positives vastly outnumber true positives.
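The calculation is short enough to wrap in a reusable function. A sketch (the function name is my own):

```python
def bayes_posterior(prior, likelihood, false_positive_rate):
    """P(A | B) given P(A), P(B | A), and P(B | not A)."""
    evidence = likelihood * prior + false_positive_rate * (1 - prior)
    return likelihood * prior / evidence

p_disease = bayes_posterior(prior=0.001,
                            likelihood=0.99,
                            false_positive_rate=0.01)
# ≈ 0.090: roughly a 9% chance of disease despite the positive test
```

Try raising the prior to 0.10 (a high-risk patient) and watch the posterior jump -- the base rate drives everything.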
The medical test example is not just academic. This is the base rate fallacy -- ignoring how rare a condition is when interpreting test results. It is one of the most common reasoning errors humans make, and understanding Bayes' theorem is the cure.
This is exactly how early spam filters worked (and many still do). Suppose:
P(spam) = 0.40 (40% of all emails are spam)
P("free" | spam) = 0.70 (70% of spam emails contain the word "free")
P("free" | not spam) = 0.05 (5% of legitimate emails contain "free")
An email contains the word "free." What is the probability it is spam?
P("free") = P("free" | spam) × P(spam) + P("free" | not spam) × P(not spam)
P("free") = (0.70)(0.40) + (0.05)(0.60) = 0.28 + 0.03 = 0.31
P(spam | "free") = (0.70 × 0.40) / 0.31 = 0.28 / 0.31 = 0.903 ≈ 90.3%
An email containing "free" is about 90% likely to be spam! In practice, Naive Bayes classifiers combine evidence from many words, not just one.
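Running the spam numbers through the same arithmetic:

```python
p_spam = 0.40
p_free_given_spam = 0.70
p_free_given_ham = 0.05

# Law of total probability, then Bayes' theorem.
p_free = p_free_given_spam * p_spam + p_free_given_ham * (1 - p_spam)
p_spam_given_free = p_free_given_spam * p_spam / p_free
# 0.28 / 0.31 ≈ 0.903
```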
Factory X produces 70% of parts, Factory Y produces 30%. Factory X has a 2% defect rate, Factory Y has a 5% defect rate. A randomly chosen part is defective. Which factory most likely made it?
P(defective) = (0.02)(0.70) + (0.05)(0.30) = 0.014 + 0.015 = 0.029
P(X | defective) = (0.02 × 0.70) / 0.029 = 0.014 / 0.029 = 0.483 ≈ 48.3%
P(Y | defective) = (0.05 × 0.30) / 0.029 = 0.015 / 0.029 = 0.517 ≈ 51.7%
Even though Factory X produces most parts, Factory Y is slightly more likely to be the source of a defective part because of its higher defect rate.
When applying Bayes' theorem, always start by identifying your prior P(A), your likelihood P(B|A), and then compute P(B) using the law of total probability. This three-step approach works every time.
Permutations and combinations are the counting tools you need when computing probabilities. They answer a simple question: how many ways can we choose or arrange items? (For a deeper treatment, see the Discrete Math page.)
A permutation counts the number of ways to arrange r items from n items, where the order of selection matters: P(n, r) = n! / (n - r)!.
In a race with 10 runners, how many different ways can the gold, silver, and bronze be awarded?
P(10, 3) = 10! / 7! = 10 × 9 × 8 = 720
Order matters because gold is different from silver.
A combination counts the number of ways to choose r items from n items, where order does not matter: C(n, r) = n! / (r! × (n - r)!), often read "n choose r".
How many ways can you choose 3 people from a group of 10 to form a committee?
C(10, 3) = 10! / (3! × 7!) = (10 × 9 × 8) / (3 × 2 × 1) = 720 / 6 = 120
Order does not matter -- choosing {Alice, Bob, Charlie} is the same committee as {Charlie, Alice, Bob}.
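Python's standard library has both counts built in (Python 3.8+), so you rarely need the factorials by hand:

```python
import math

podiums = math.perm(10, 3)     # 720: order matters (gold ≠ silver)
committees = math.comb(10, 3)  # 120: order does not matter

# Each committee of 3 can be ordered in 3! = 6 ways, so 720 / 6 == 120.
```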
Ask yourself: "Does the order of selection change the outcome?" If yes, count permutations; if no, count combinations.
A random variable is a variable whose value is determined by a random process. It is a function that maps each outcome in the sample space to a number. We use capital letters (X, Y, Z) for random variables and lowercase (x, y, z) for specific values they take.
A discrete random variable takes on a countable number of values -- you can list them out. Think integers: number of heads in 10 flips, number of bugs in your code, number of users online.
A probability mass function (PMF) gives the probability that a discrete random variable equals each specific value: p(x) = P(X = x).
Let X = number of heads in 2 fair coin flips.
Possible outcomes: HH, HT, TH, TT
| x (heads) | Outcomes | P(X = x) |
|---|---|---|
| 0 | TT | 1/4 |
| 1 | HT, TH | 2/4 = 1/2 |
| 2 | HH | 1/4 |
Check: 1/4 + 1/2 + 1/4 = 1. The probabilities sum to 1.
A continuous random variable can take on any value in a range (uncountably many values). Think real numbers: exact height, exact time until a server responds, temperature.
Since there are infinitely many possible values, the probability of any single exact value is 0. Instead, we use a probability density function (PDF) and calculate probabilities over intervals: P(a ≤ X ≤ b) is the area under the PDF curve between a and b.
Think of discrete random variables like integer types (int) -- they take specific, countable values. Continuous random variables are like floating-point types (float, double) -- they can take any value in a range. The PMF is like a dictionary mapping values to probabilities; the PDF is like a mathematical function you integrate over a range.
A probability distribution describes all the possible values a random variable can take and their probabilities. Here are the distributions you will encounter most often in CS.
The simplest distribution: a single trial with two outcomes -- success (1) with probability p, or failure (0) with probability 1-p.
A network has a 3% bit error rate. For a single bit: X ~ Bernoulli(0.03)
P(bit is corrupted) = 0.03
P(bit is fine) = 0.97
The result of n independent Bernoulli trials. It counts how many successes you get in n attempts, each with success probability p: P(X = k) = C(n, k) × p^k × (1-p)^(n-k).
You have 5 independent servers, each with 95% uptime (p = 0.95). What is the probability that exactly 4 are running?
X ~ Binomial(5, 0.95), find P(X = 4):
P(X = 4) = C(5, 4) × (0.95)^4 × (0.05)^1
= 5 × 0.8145 × 0.05
= 0.2036 ≈ 20.4%
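The binomial PMF is easy to write with `math.comb`. A sketch:

```python
from math import comb

def binomial_pmf(k, n, p):
    """P(X = k) for X ~ Binomial(n, p)."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

p_exactly_4_up = binomial_pmf(4, n=5, p=0.95)
# ≈ 0.204, matching the hand calculation above
```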
Use the binomial distribution when you have: (1) a fixed number of trials n, (2) each trial is independent, (3) each trial has only two outcomes (success/failure), and (4) the probability p is the same for every trial.
Every outcome is equally likely. It comes in two flavors: discrete (a fair die gives each of its six faces probability 1/6) and continuous (any real number in a range is equally likely).
Math.random() in JavaScript returns a value from the continuous uniform distribution on [0, 1).
P(0.3 ≤ X ≤ 0.7) = 0.7 - 0.3 = 0.4
The probability of landing in any interval is just the length of that interval (divided by the total range).
The famous bell curve. It is defined by its mean μ (center) and standard deviation σ (spread). It appears everywhere in nature and statistics thanks to the Central Limit Theorem: the average of many independent random variables tends toward a normal distribution.
The 68-95-99.7 Rule (Empirical Rule): about 68% of values fall within 1σ of the mean, about 95% within 2σ, and about 99.7% within 3σ.
A web API has response times normally distributed with μ = 200ms and σ = 30ms.
68% of requests complete between 170ms and 230ms (200 ± 30)
95% complete between 140ms and 260ms (200 ± 60)
A response taking 290ms (3σ above mean) is very unusual -- only about 0.15% of requests are that slow.
Models the number of events occurring in a fixed interval of time or space, when events happen independently at a constant average rate λ (lambda): P(X = k) = (e^(-λ) × λ^k) / k!.
A website averages 3 server errors per hour (λ = 3). What is the probability of exactly 0 errors in the next hour?
P(X = 0) = (e^(-3) × 3^0) / 0! = e^(-3) × 1 / 1 = e^(-3) ≈ 0.0498 ≈ 5%
So there is about a 5% chance of a completely error-free hour.
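The Poisson PMF is just as short in code. A sketch:

```python
from math import exp, factorial

def poisson_pmf(k, lam):
    """P(X = k) for X ~ Poisson(lam)."""
    return exp(-lam) * lam**k / factorial(k)

p_error_free_hour = poisson_pmf(0, lam=3)
# e^-3 ≈ 0.0498
```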
Use Poisson when counting occurrences per unit of time/space: website hits per minute, typos per page, network packets per second, bugs per 1000 lines of code. The key assumption is that events happen independently at a constant rate.
| Distribution | Type | Use When | E[X] | Var(X) |
|---|---|---|---|---|
| Bernoulli(p) | Discrete | Single yes/no trial | p | p(1-p) |
| Binomial(n,p) | Discrete | Count successes in n trials | np | np(1-p) |
| Uniform(a,b) | Both | All outcomes equally likely | (a+b)/2 | (b-a)²/12 |
| Normal(μ,σ²) | Continuous | Natural phenomena, averages | μ | σ² |
| Poisson(λ) | Discrete | Events per time interval | λ | λ |
The expected value E[X] is the long-run average of a random variable -- what you would get on average if you repeated the experiment infinitely many times. It is a weighted average of all possible values, where each value is weighted by its probability: E[X] = Σ x × P(X = x), summing over every possible value x.
Imagine you run an experiment 1,000,000 times. The expected value is the average of all those results.
Why "weighted"? Because more likely outcomes contribute more to the average. If you roll a loaded die that shows 6 half the time, the expected value is pulled toward 6 -- not the middle.
Key insight: Expected value does NOT have to be a value you can actually get. The expected value of a die roll is 3.5 -- you can never roll 3.5, but it's the "center of gravity" of all possible outcomes.
X = result of rolling a fair six-sided die.
E[X] = 1(1/6) + 2(1/6) + 3(1/6) + 4(1/6) + 5(1/6) + 6(1/6)
= (1 + 2 + 3 + 4 + 5 + 6) / 6 = 21/6 = 3.5
You can never roll a 3.5, but over many rolls, the average converges to 3.5.
Searching for a random element in an array of n elements (assuming it is present and equally likely to be at any position):
E[comparisons] = 1(1/n) + 2(1/n) + 3(1/n) + ... + n(1/n)
= (1 + 2 + ... + n) / n = n(n+1)/(2n) = (n+1)/2
On average, linear search checks about half the array. This is why we say it has O(n) average complexity.
You roll two dice. What is the expected sum?
E[die1 + die2] = E[die1] + E[die2] = 3.5 + 3.5 = 7
This works even though the two dice create 36 different outcomes. Linearity of expectation is a surprisingly powerful shortcut!
While expected value tells you the center, variance tells you how spread out the values are around that center: Var(X) = E[X²] - (E[X])². Standard deviation is the square root of variance, σ = √Var(X), and has the same units as the original variable.
E[X] = 3.5 (from above)
E[X²] = 1²(1/6) + 2²(1/6) + 3²(1/6) + 4²(1/6) + 5²(1/6) + 6²(1/6)
= (1 + 4 + 9 + 16 + 25 + 36) / 6 = 91/6 ≈ 15.167
Var(X) = E[X²] - (E[X])² = 91/6 - (3.5)² = 91/6 - 12.25 = 35/12 ≈ 2.917
σ = √(35/12) ≈ 1.708
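Exact fractions make this computation painless to verify. A sketch:

```python
from fractions import Fraction

faces = range(1, 7)
p = Fraction(1, 6)  # fair die: each face equally likely

ex  = sum(x * p for x in faces)      # E[X]   = 7/2 = 3.5
ex2 = sum(x * x * p for x in faces)  # E[X^2] = 91/6
var = ex2 - ex**2                    # 35/12 ≈ 2.917
```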
Expected value is the foundation of algorithm analysis. When we say quicksort is O(n log n) "on average," we mean the expected number of comparisons is proportional to n log n. Variance tells you how much the actual runtime might deviate from that average -- a low-variance algorithm is more predictable, which matters for real-time systems.
Statistics is the practice of collecting, organizing, and interpreting data. While probability predicts what should happen, statistics analyzes what did happen. Here are the key descriptive statistics you need to know.
Mean (Average): Sum of all values divided by the count. Sensitive to outliers.
Median: The middle value when data is sorted. Robust to outliers.
Mode: The most frequently occurring value. Can have multiple modes or none.
Team salaries: $50k, $55k, $60k, $65k, $70k, $500k (the CEO)
Mean: (50 + 55 + 60 + 65 + 70 + 500) / 6 = 800/6 = $133.3k
Median: Sort and find middle. With 6 values, average positions 3 and 4: (60 + 65)/2 = $62.5k
The mean ($133.3k) is misleading -- nobody on the team earns close to that. The median ($62.5k) better represents the "typical" salary. This is why median household income is more meaningful than mean household income.
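Python's `statistics` module computes these measures directly. Salaries here are in thousands of dollars:

```python
import statistics

salaries = [50, 55, 60, 65, 70, 500]  # $k; the 500 is the CEO outlier

mean = statistics.mean(salaries)      # ≈ 133.3 -- dragged up by the outlier
median = statistics.median(salaries)  # 62.5 -- the "typical" salary
```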
Range: the difference between the maximum and minimum values. Simple but crude -- it only looks at two data points and is heavily influenced by outliers.
Measures how spread out the data is from the mean. A small standard deviation means data points cluster tightly around the mean; a large one means they are spread out.
If your data is the entire population (every possible data point), divide by N. If your data is a sample (a subset), divide by n-1. The n-1 correction (called Bessel's correction) prevents underestimating the true variability. In CS, you almost always work with samples.
A percentile tells you what percentage of data falls below a given value. Quartiles divide data into four equal parts: Q1 is the 25th percentile, Q2 is the median (50th percentile), and Q3 is the 75th percentile.
Web services commonly report latency percentiles: p50 (the median response time), p95, and p99 (the time within which 99% of requests complete).
The p99 is crucial -- it tells you the experience of your worst-off 1% of users. A service with great p50 but terrible p99 has a hidden reliability problem.
| Measure | Best For | Watch Out For |
|---|---|---|
| Mean | Symmetric data, calculating totals | Skewed by outliers |
| Median | Skewed data, income, home prices | Ignores extreme values |
| Mode | Categorical data, most common value | May not exist or may be multiple |
| Standard Deviation | Measuring consistency/reliability | Affected by outliers (like mean) |
| Percentiles | Understanding distribution shape | Need enough data points |
Here is where probability and statistics come alive in computer science. These are real problems you will encounter as a developer, and probability gives you the tools to reason about them.
A hash function maps inputs to a fixed-size output (e.g., 256 bits). When two different inputs produce the same hash, that is a collision. How likely is it?
If your hash function has m possible outputs and you hash n items, the probability of at least one collision is approximately P(collision) ≈ 1 - e^(-n² / (2m)).
For a 128-bit hash (m = 2^128), you need about 2^64 items before a collision becomes likely. This is why 128-bit hashes are considered secure for most purposes -- 2^64 is an astronomically large number.
How many people do you need in a room before there is a 50% chance that two of them share a birthday? Surprisingly, only 23!
There are 365 possible birthdays (ignoring leap years). Instead of directly computing the probability of a match, we use the complement: what is the probability that nobody shares a birthday?
Person 1: any birthday. Probability of no conflict: 365/365
Person 2: must differ from person 1: 364/365
Person 3: must differ from persons 1 and 2: 363/365
...
Person n: (365 - n + 1)/365
P(no shared birthday among n people) = (365/365) × (364/365) × ... × ((365 - n + 1)/365)
P(at least one shared birthday) = 1 - P(no shared birthday)
For n = 23: P(match) ≈ 50.7%
For n = 50: P(match) ≈ 97%
For n = 70: P(match) ≈ 99.9%
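The product of "no conflict" probabilities is a simple loop. A sketch:

```python
def p_shared_birthday(n, days=365):
    """Probability that at least two of n people share a birthday."""
    p_all_distinct = 1.0
    for i in range(n):
        p_all_distinct *= (days - i) / days  # person i avoids i earlier birthdays
    return 1 - p_all_distinct

p23 = p_shared_birthday(23)  # ≈ 0.507
```

Changing `days` lets you reuse the same function for hash collisions: pass the number of hash buckets instead of 365.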
The birthday problem is not about birthdays -- it is about collisions. The same math applies to hash collisions, UUID conflicts, random port assignments, and any situation where you are picking random values from a finite set. The general rule: collisions become likely after about √m selections from m possibilities. This is called the birthday bound.
Many algorithms use randomness to achieve better expected performance, even if worst-case performance is still bad.
Quicksort's performance depends on choosing a good pivot: a pivot near the median splits the array roughly in half and gives O(n log n), while a consistently bad pivot (the smallest or largest element) degrades to O(n²).
By choosing the pivot randomly, the probability of consistently picking the worst pivot is astronomically small. The expected number of comparisons is about 2n ln n ≈ 1.39 n log₂ n.
This is why randomized quicksort is preferred in practice -- it has no "adversarial" inputs that can force worst-case behavior (unlike always picking the first element as pivot).
Monte Carlo methods use random sampling to estimate quantities that are hard to compute exactly. The idea is simple: generate lots of random samples, and use them to approximate the answer.
Draw a unit square [0,1] x [0,1] and inscribe a quarter circle of radius 1.
Area of quarter circle = π/4. Area of square = 1.
Generate N random points in the square. Count how many fall inside the quarter circle (where x² + y² ≤ 1).
π ≈ 4 × (points inside circle) / (total points)
With 10,000 random points, you typically get π ≈ 3.14 ± 0.02. With 1,000,000 points, the estimate becomes much more accurate.
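A minimal version of the estimator, seeded so the result is reproducible:

```python
import random

def estimate_pi(n_points, seed=0):
    rng = random.Random(seed)
    inside = 0
    for _ in range(n_points):
        x, y = rng.random(), rng.random()
        if x * x + y * y <= 1:  # point lands inside the quarter circle
            inside += 1
    return 4 * inside / n_points

pi_estimate = estimate_pi(1_000_000)  # ≈ 3.14
```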
Probability is the backbone of machine learning: classifiers output probabilities rather than certainties, Naive Bayes is Bayes' theorem applied at scale, and language models assign probabilities to the next word.
A/B testing uses statistics to determine whether a change (new UI, different algorithm, pricing change) actually improves a metric or whether the difference is just random chance.
Version A (blue button): 1,000 visitors, 50 clicked (5.0% conversion)
Version B (green button): 1,000 visitors, 65 clicked (6.5% conversion)
Is the 1.5% difference real or just luck?
Statistical tests (like a chi-squared test or z-test for proportions) compute a p-value -- the probability of seeing this large a difference by pure chance.
If p-value < 0.05 (the conventional threshold), we say the result is statistically significant and the green button likely performs better.
Without understanding probability, you might deploy changes that only appeared to work due to random variation.
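Here is a sketch of the z-test for proportions mentioned above, using only the standard library (the function name is my own, and a real analysis would also consider statistical power and sample size):

```python
from math import sqrt, erf

def two_proportion_p_value(conv_a, n_a, conv_b, n_b):
    """Two-sided z-test: could this conversion-rate difference be chance?"""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)          # pooled conversion rate
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = abs(p_b - p_a) / se
    phi = 0.5 * (1 + erf(z / sqrt(2)))                # standard normal CDF
    return 2 * (1 - phi)

p_value = two_proportion_p_value(50, 1000, 65, 1000)
# ≈ 0.15 for the example above: not below 0.05, so with these sample
# sizes the 1.5% gap could still plausibly be random variation
```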
Probability is not just theory -- it is a practical tool used daily in software engineering. From load balancers to recommendation engines, from security analysis to performance testing, understanding probability makes you a better engineer.
Test your understanding. Think through each problem before clicking an answer.