Table of Contents

1. What is Probability?
2. Basic Probability Rules
3. Conditional Probability
4. Bayes' Theorem
5. Permutations & Combinations (Review)
6. Random Variables
7. Common Distributions
8. Expected Value & Variance
9. Basic Statistics
10. CS Applications
11. Practice Quiz

1. What is Probability?

Probability is the branch of mathematics that measures how likely something is to happen. It gives us a number between 0 (impossible) and 1 (certain) -- or equivalently, between 0% and 100%. If you flip a fair coin, the probability of heads is 0.5, or 50%. Simple enough on the surface, but this idea powers everything from spam filters to self-driving cars.

The Mental Model: Probability = Fraction of Worlds

Here's the easiest way to think about probability: imagine running the experiment a million times. The probability is just the fraction of those runs where your event happens.

  • P(heads) = 0.5 means: flip a coin 1,000,000 times → ~500,000 heads
  • P(rolling a 6) = 1/6 means: roll a die 1,000,000 times → ~166,667 sixes
  • P(server crash) = 0.001 means: out of 1,000,000 requests → ~1,000 crashes

When probability feels abstract, use this trick. "If I ran this a million times, what fraction would satisfy my condition?" That fraction IS the probability.
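
The "million worlds" trick translates directly into code. Here is a minimal simulation sketch (Python standard library only; the seed value is arbitrary, chosen just so the run is reproducible):

```python
import random

random.seed(0)  # arbitrary seed, so repeated runs give the same result

N = 1_000_000
# Each iteration is one "world"; random.random() < 0.5 plays the role of heads
heads = sum(1 for _ in range(N) if random.random() < 0.5)

# The fraction of worlds where the event happened approximates the probability
print(heads / N)  # close to 0.5
```

The same loop with a different condition estimates any of the probabilities above: replace the test with `random.randint(1, 6) == 6` to approximate 1/6, for instance.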

Sample Space and Events

The sample space (often written S or Ω) is the set of all possible outcomes of an experiment. An event is any subset of the sample space -- the outcomes we actually care about.

Example -- Rolling a Six-Sided Die

Sample space: S = {1, 2, 3, 4, 5, 6}

Event A = "rolling an even number" = {2, 4, 6}

Event B = "rolling greater than 4" = {5, 6}

The Basic Probability Formula

P(event) = (number of favorable outcomes) / (total number of outcomes)
Example -- Probability of Rolling Even

P(even) = |{2, 4, 6}| / |{1, 2, 3, 4, 5, 6}| = 3/6 = 0.5

There are 3 even numbers out of 6 total equally likely outcomes.

Why CS Cares About Probability

Probability shows up everywhere in computer science:

  • Hash functions and collision analysis
  • Randomized algorithms (e.g. quicksort's random pivot)
  • Spam filtering and machine learning (Bayes' theorem)
  • A/B testing and performance statistics

Section 10 works through each of these in detail.

Tip

Think of probability as a language for expressing uncertainty precisely. Instead of saying "it probably works," you can say "it works with probability 0.997." That precision is what makes probability so powerful in engineering.

2. Basic Probability Rules

These rules are the toolkit for combining probabilities. Once you know them, you can break complex problems into simple pieces.

Kolmogorov Axioms (The Foundation of All Probability):

1. Non-negativity: P(A) ≥ 0 for any event A
2. Unitarity: P(Ω) = 1 (the probability of the entire sample space is 1)
3. Additivity: If A and B are mutually exclusive, P(A ∪ B) = P(A) + P(B)

Every probability rule you'll ever learn is derived from these three axioms. They are the "source code" of probability.

The Complement Rule

The probability that an event does not happen is 1 minus the probability that it does. This is incredibly useful when "not happening" is easier to calculate.

P(not A) = 1 - P(A)

Equivalently written: P(A') = 1 - P(A)   or   P(A̅) = 1 - P(A)
Example -- At Least One Head in 3 Coin Flips

Instead of counting all the ways to get at least one head (HHH, HHT, HTH, HTT, THH, THT, TTH), use the complement:

P(at least one head) = 1 - P(no heads) = 1 - P(TTT)

P(TTT) = (1/2)^3 = 1/8

P(at least one head) = 1 - 1/8 = 7/8 = 0.875
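
Both routes, the complement shortcut and brute-force enumeration, fit in a few lines of Python, which makes a nice sanity check:

```python
from fractions import Fraction
from itertools import product

# Complement rule: P(at least one head) = 1 - P(TTT)
p_complement = 1 - Fraction(1, 2) ** 3

# Brute force: enumerate all 8 outcomes of 3 flips and count the favorable ones
outcomes = list(product("HT", repeat=3))
p_direct = Fraction(sum(1 for o in outcomes if "H" in o), len(outcomes))

print(p_complement, p_direct)  # 7/8 7/8
```

Using `Fraction` keeps the arithmetic exact, so the two answers match symbol-for-symbol rather than "up to floating-point error."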

Programming Analogy

The complement rule is like the ! (NOT) operator in code. If you know P(condition), then P(!condition) = 1 - P(condition). Whenever you see "at least one" in a probability problem, think complement first.

The Addition Rule (OR)

The probability that A or B (or both) happens:

P(A or B) = P(A) + P(B) - P(A and B)

We subtract P(A and B) because we would count those outcomes twice otherwise -- once when counting A and again when counting B.

Example -- Drawing a Card

What is the probability of drawing a King or a Heart from a standard 52-card deck?

P(King) = 4/52

P(Heart) = 13/52

P(King AND Heart) = P(King of Hearts) = 1/52

P(King OR Heart) = 4/52 + 13/52 - 1/52 = 16/52 = 4/13 ≈ 0.308
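
You can verify the inclusion-exclusion arithmetic by building the deck explicitly. A short sketch (the rank/suit encoding is ours, not from any library):

```python
from fractions import Fraction

ranks = ["A", "2", "3", "4", "5", "6", "7", "8", "9", "10", "J", "Q", "K"]
suits = ["hearts", "diamonds", "clubs", "spades"]
deck = [(r, s) for r in ranks for s in suits]  # 52 cards

kings = {c for c in deck if c[0] == "K"}
hearts = {c for c in deck if c[1] == "hearts"}

# |A ∪ B| = |A| + |B| - |A ∩ B|; the set union counts the King of Hearts once
favorable = len(kings | hearts)  # 4 + 13 - 1 = 16
p = Fraction(favorable, len(deck))
print(p)  # 4/13
```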

Mutually Exclusive Events

Two events are mutually exclusive (or disjoint) if they cannot happen at the same time. When events are mutually exclusive, P(A and B) = 0, so the addition rule simplifies:

If A and B are mutually exclusive:
P(A or B) = P(A) + P(B)
Example -- Rolling a Die

P(rolling a 2 or rolling a 5) = ?

These are mutually exclusive -- you cannot roll both at once.

P(2 or 5) = P(2) + P(5) = 1/6 + 1/6 = 2/6 = 1/3

The Multiplication Rule (AND)

The probability that both A and B happen:

P(A and B) = P(A) × P(B | A)

Here, P(B | A) is the probability of B given that A has already occurred (conditional probability -- we will cover this fully in the next section).

Example -- Drawing Two Cards Without Replacement

What is the probability of drawing two Aces in a row from a deck (without putting the first card back)?

P(1st Ace) = 4/52

P(2nd Ace | 1st was Ace) = 3/51   (one fewer Ace, one fewer card in the deck)

P(both Aces) = (4/52) × (3/51) = 12/2652 = 1/221 ≈ 0.0045
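
The multiplication rule can be cross-checked by enumerating every ordered two-card draw. A sketch (representing Aces simply as deck positions 0-3):

```python
from fractions import Fraction
from itertools import permutations

# Say positions 0-3 of the 52-card deck are the four Aces
aces = {0, 1, 2, 3}

# Every ordered draw of two distinct cards: 52 * 51 = 2652 pairs
pairs = list(permutations(range(52), 2))
hits = sum(1 for first, second in pairs if first in aces and second in aces)

p_enum = Fraction(hits, len(pairs))         # 12/2652 reduced
p_rule = Fraction(4, 52) * Fraction(3, 51)  # multiplication rule

print(p_rule, p_enum)  # 1/221 1/221
```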

Independent Events

Two events are independent if the occurrence of one does not affect the probability of the other. For independent events, the multiplication rule simplifies because P(B | A) = P(B):

If A and B are independent:
P(A and B) = P(A) × P(B)
Example -- Flipping Two Coins

P(1st coin heads AND 2nd coin heads) = ?

Coin flips are independent -- what the first coin does has no effect on the second.

P(HH) = P(H) × P(H) = (1/2) × (1/2) = 1/4 = 0.25

Example -- Password Brute Force

A 4-digit PIN where each digit is independent and chosen from 0-9:

P(guessing correctly) = P(1st digit correct) × P(2nd) × P(3rd) × P(4th)

= (1/10) × (1/10) × (1/10) × (1/10) = 1/10,000 = 0.0001

This is why longer passwords are exponentially harder to crack!

Common Mistake

Do not confuse mutually exclusive with independent! They are different concepts. Mutually exclusive events are actually dependent -- if one happens, the other definitely cannot (P = 0). Independent events can absolutely happen at the same time.

3. Conditional Probability

Conditional probability answers the question: "What is the probability of A, given that B has already happened?" The notation P(A | B) reads as "the probability of A given B."

The Key Intuition: "Given" = Filter First, Then Count

P(A | B) means: filter your universe to only the cases where B happened, then check how often A occurs in that filtered set.

Think of it like a database query:

-- P(A | B) in SQL, run against a table of outcomes:
SELECT (SELECT COUNT(*) FROM outcomes WHERE a AND b)   -- outcomes where both happen
     / (SELECT COUNT(*) FROM outcomes WHERE b);        -- total outcomes where B happens

The "|" in P(A | B) is like a WHERE clause -- it filters your world before you start counting.

P(A | B) = P(A and B) / P(B)     (where P(B) > 0)

The idea is that once B has happened, our sample space shrinks to only the outcomes where B is true. We then ask: of those outcomes, how many also include A?

Example -- Die Roll with Information

You roll a fair die. Someone tells you the result is greater than 3. What is the probability it is a 5?

Event A: rolling a 5 = {5}

Event B: rolling greater than 3 = {4, 5, 6}

P(A and B) = P({5}) = 1/6

P(B) = P({4, 5, 6}) = 3/6 = 1/2

P(A | B) = (1/6) / (1/2) = 1/3 ≈ 0.333

Knowing the roll is greater than 3 narrows our world to three possibilities, and 5 is one of them.
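
The "filter first, then count" reading of the die example is literally two list comprehensions in Python:

```python
from fractions import Fraction

sample_space = [1, 2, 3, 4, 5, 6]

# "Given B" = filter the universe down to outcomes where B holds...
B = [x for x in sample_space if x > 3]  # {4, 5, 6}
# ...then count how often A occurs inside that filtered set
A_in_B = [x for x in B if x == 5]       # {5}

p = Fraction(len(A_in_B), len(B))
print(p)  # 1/3
```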

Example -- Colored Balls in a Bag

A bag has 5 red balls and 3 blue balls. You draw one ball (red), keep it out, then draw another. What is the probability the second ball is red?

After drawing one red ball, the bag has 4 red and 3 blue balls (7 total).

P(2nd red | 1st red) = 4/7 ≈ 0.571

This is conditional probability in action -- the first draw changes the conditions for the second.

Tree Diagrams

A tree diagram is a visual tool for mapping out sequential events. Each branch represents a possible outcome with its probability. To find the probability of a specific path, you multiply along the branches. To find the total probability of an event, you add up all paths leading to it.

Example -- Tree Diagram (Textual)

A company has two servers. Server A handles 60% of requests, Server B handles 40%. Server A fails 5% of the time, Server B fails 10% of the time. What is the probability a random request fails?

                     +----- Fail (0.05)   --> P = 0.60 x 0.05 = 0.030
       +-- Server A --+
       |   (0.60)     +----- OK   (0.95)   --> P = 0.60 x 0.95 = 0.570
 Start-+
       |              +----- Fail (0.10)   --> P = 0.40 x 0.10 = 0.040
       +-- Server B --+
           (0.40)     +----- OK   (0.90)   --> P = 0.40 x 0.90 = 0.360

P(fail) = 0.030 + 0.040 = 0.070 = 7%
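
The "multiply along branches, add across branches" rule is a one-line sum in code. A sketch of the server example (the dictionary names are ours):

```python
# Branch probabilities from the tree above
p_server = {"A": 0.60, "B": 0.40}  # which server handles the request
p_fail = {"A": 0.05, "B": 0.10}    # failure rate given that server

# Multiply along each branch, add the branches that end in "Fail"
p_request_fails = sum(p_server[s] * p_fail[s] for s in p_server)
print(round(p_request_fails, 3))  # 0.07
```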

Tip

Tree diagrams are your best friend for multi-step probability problems. Even if you do not draw them on paper, thinking in terms of "branches" helps you organize your calculation. Each branch multiplies, parallel branches add.

4. Bayes' Theorem

Bayes' theorem is arguably the single most important formula in machine learning and data science. It lets you update your beliefs when you receive new evidence. It flips conditional probability around: if you know P(B | A), Bayes tells you P(A | B).

P(A | B) = [ P(B | A) × P(A) ] / P(B)
Bayes' in Plain English

You start with a belief (prior). You see evidence. You update your belief (posterior).

Example thought process:

  1. Prior: "1% of emails are spam" → P(spam) = 0.01
  2. Evidence: This email contains the word "FREE"
  3. Likelihood: "80% of spam emails say FREE" → P(FREE | spam) = 0.80
  4. Baseline: "10% of all emails say FREE" → P(FREE) = 0.10
  5. Posterior: P(spam | FREE) = (0.80 × 0.01) / 0.10 = 0.08 = 8%

Seeing "FREE" updated our belief from 1% to 8%. That's Bayes -- evidence changes belief proportionally to how surprising the evidence is.

Deriving Bayes' Theorem

The derivation is straightforward from the definition of conditional probability:

From the definition:  P(A | B) = P(A and B) / P(B)

We also know:  P(A and B) = P(B | A) × P(A)

Substitute:  P(A | B) = [ P(B | A) × P(A) ] / P(B)

The Terms Have Names

Term       Name                  Meaning
P(A)       Prior                 What you believed before seeing evidence
P(B | A)   Likelihood            How likely the evidence is if A is true
P(B)       Evidence (Marginal)   How likely the evidence is overall
P(A | B)   Posterior             Your updated belief after seeing evidence

Classic Example -- Medical Test Accuracy

A disease affects 1 in 1,000 people. A test for the disease is 99% accurate (it correctly identifies 99% of sick people, and correctly identifies 99% of healthy people). You test positive. What is the probability you actually have the disease?

Your intuition probably says 99%. The real answer is shocking.

Let D = has disease, T+ = tests positive.

P(D) = 0.001   (prior -- 1 in 1,000 people are sick)

P(T+ | D) = 0.99   (sensitivity -- test catches 99% of sick people)

P(T+ | not D) = 0.01   (false positive rate -- 1% of healthy people test positive)

We need P(D | T+). First, find P(T+) using the law of total probability:

P(T+) = P(T+ | D) × P(D) + P(T+ | not D) × P(not D)

P(T+) = (0.99)(0.001) + (0.01)(0.999) = 0.00099 + 0.00999 = 0.01098

Now apply Bayes:

P(D | T+) = (0.99 × 0.001) / 0.01098 = 0.00099 / 0.01098 = 0.0902 ≈ 9%

Even with a 99% accurate test, a positive result only means a 9% chance of having the disease! This is because the disease is so rare that false positives vastly outnumber true positives.
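
The prior/likelihood/evidence recipe fits in one small function. A sketch (the function name and parameters are ours, not a standard API):

```python
def bayes_posterior(prior, p_evidence_given_h, p_evidence_given_not_h):
    """P(H | E): update a prior belief given evidence E, via Bayes' theorem."""
    # Law of total probability gives P(E), the denominator
    p_evidence = (p_evidence_given_h * prior
                  + p_evidence_given_not_h * (1 - prior))
    return p_evidence_given_h * prior / p_evidence

# Medical test: 1-in-1000 prevalence, 99% sensitivity, 1% false-positive rate
p = bayes_posterior(prior=0.001,
                    p_evidence_given_h=0.99,
                    p_evidence_given_not_h=0.01)
print(round(p, 4))  # 0.0902
```

The same function reproduces the later examples: `bayes_posterior(0.40, 0.70, 0.05)` gives the spam filter's roughly 90.3%.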

Why This Matters

The medical test example is not just academic. This is the base rate fallacy -- ignoring how rare a condition is when interpreting test results. It is one of the most common reasoning errors humans make, and understanding Bayes' theorem is the cure.

CS Example -- Spam Filter (Naive Bayes)

This is exactly how early spam filters worked (and many still do). Suppose:

P(spam) = 0.40   (40% of all emails are spam)

P("free" | spam) = 0.70   (70% of spam emails contain the word "free")

P("free" | not spam) = 0.05   (5% of legitimate emails contain "free")

An email contains the word "free." What is the probability it is spam?

P("free") = P("free" | spam) × P(spam) + P("free" | not spam) × P(not spam)

P("free") = (0.70)(0.40) + (0.05)(0.60) = 0.28 + 0.03 = 0.31

P(spam | "free") = (0.70 × 0.40) / 0.31 = 0.28 / 0.31 = 0.903 ≈ 90.3%

An email containing "free" is about 90% likely to be spam! In practice, Naive Bayes classifiers combine evidence from many words, not just one.

Example -- Defective Parts from Two Factories

Factory X produces 70% of parts, Factory Y produces 30%. Factory X has a 2% defect rate, Factory Y has a 5% defect rate. A randomly chosen part is defective. Which factory most likely made it?

P(defective) = (0.02)(0.70) + (0.05)(0.30) = 0.014 + 0.015 = 0.029

P(X | defective) = (0.02 × 0.70) / 0.029 = 0.014 / 0.029 = 0.483 ≈ 48.3%

P(Y | defective) = (0.05 × 0.30) / 0.029 = 0.015 / 0.029 = 0.517 ≈ 51.7%

Even though Factory X produces most parts, Factory Y is slightly more likely to be the source of a defective part because of its higher defect rate.

Tip

When applying Bayes' theorem, always start by identifying your prior P(A), your likelihood P(B|A), and then compute P(B) using the law of total probability. This three-step approach works every time.

5. Permutations & Combinations (Review)

Permutations and combinations are the counting tools you need when computing probabilities. They answer a simple question: how many ways can we choose or arrange items? (For a deeper treatment, see the Discrete Math page.)

Factorial

n! = n × (n-1) × (n-2) × ... × 2 × 1

0! = 1   (by definition)
5! = 5 × 4 × 3 × 2 × 1 = 120

Permutations -- Order Matters

A permutation counts the number of ways to arrange r items from n items, where the order of selection matters.

P(n, r) = n! / (n - r)!
Example -- Podium Finishers

In a race with 10 runners, how many different ways can the gold, silver, and bronze be awarded?

P(10, 3) = 10! / 7! = 10 × 9 × 8 = 720

Order matters because gold is different from silver.

Combinations -- Order Does Not Matter

A combination counts the number of ways to choose r items from n items, where order does not matter.

C(n, r) = n! / [ r! × (n - r)! ]
Example -- Choosing a Team

How many ways can you choose 3 people from a group of 10 to form a committee?

C(10, 3) = 10! / (3! × 7!) = (10 × 9 × 8) / (3 × 2 × 1) = 720 / 6 = 120

Order does not matter -- choosing {Alice, Bob, Charlie} is the same committee as {Charlie, Alice, Bob}.
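
Python's standard library has both formulas built in (`math.perm` and `math.comb`, available since Python 3.8), so you rarely need the factorials by hand:

```python
import math

# Permutations: order matters (gold/silver/bronze among 10 runners)
print(math.perm(10, 3))  # 720

# Combinations: order does not matter (committee of 3 from 10)
print(math.comb(10, 3))  # 120
```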

Quick Decision Guide

Ask yourself: "Does the order of selection change the outcome?"

  • Yes (passwords, rankings, arrangements) -- use Permutations
  • No (teams, hands of cards, committees) -- use Combinations

6. Random Variables

A random variable is a variable whose value is determined by a random process. It is a function that maps each outcome in the sample space to a number. We use capital letters (X, Y, Z) for random variables and lowercase (x, y, z) for specific values they take.

Discrete Random Variables

A discrete random variable takes on a countable number of values -- you can list them out. Think integers: number of heads in 10 flips, number of bugs in your code, number of users online.

A probability mass function (PMF) gives the probability that a discrete random variable equals each specific value:

PMF: P(X = x) for each possible value x

Rules:   0 ≤ P(X = x) ≤ 1   for all x
        Σ P(X = x) = 1   (probabilities must sum to 1)
Example -- PMF of Two Coin Flips

Let X = number of heads in 2 fair coin flips.

Possible outcomes: HH, HT, TH, TT

x (heads)   Outcomes   P(X = x)
0           TT         1/4
1           HT, TH     2/4 = 1/2
2           HH         1/4

Check: 1/4 + 1/2 + 1/4 = 1. The probabilities sum to 1.
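
A PMF really is just a dictionary from values to probabilities, and for small experiments you can build it by enumeration. A sketch:

```python
from collections import Counter
from fractions import Fraction
from itertools import product

# Enumerate all outcomes of 2 fair flips and count heads in each
outcomes = list(product("HT", repeat=2))  # HH, HT, TH, TT
counts = Counter(o.count("H") for o in outcomes)

# PMF: map each value x to P(X = x)
pmf = {x: Fraction(c, len(outcomes)) for x, c in sorted(counts.items())}
for x, p in pmf.items():
    print(x, p)  # 0 1/4, then 1 1/2, then 2 1/4
```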

Continuous Random Variables

A continuous random variable can take on any value in a range (uncountably many values). Think real numbers: exact height, exact time until a server responds, temperature.

Since there are infinitely many possible values, the probability of any single exact value is 0. Instead, we use a probability density function (PDF) and calculate probabilities over intervals:

PDF: f(x)

P(a ≤ X ≤ b) = integral of f(x) from a to b

Rules:   f(x) ≥ 0   for all x
        Total area under f(x) = 1
Programming Analogy

Think of discrete random variables like integer types (int) -- they take specific, countable values. Continuous random variables are like floating-point types (float, double) -- they can take any value in a range. The PMF is like a dictionary mapping values to probabilities; the PDF is like a mathematical function you integrate over a range.

7. Common Distributions

A probability distribution describes all the possible values a random variable can take and their probabilities. Here are the distributions you will encounter most often in CS.

Bernoulli Distribution

The simplest distribution: a single trial with two outcomes -- success (1) with probability p, or failure (0) with probability 1-p.

X ~ Bernoulli(p)

P(X = 1) = p
P(X = 0) = 1 - p

E[X] = p     Var(X) = p(1 - p)
Example -- Single Bit Flip

A network has a 3% bit error rate. For a single bit: X ~ Bernoulli(0.03)

P(bit is corrupted) = 0.03

P(bit is fine) = 0.97

Binomial Distribution

The result of n independent Bernoulli trials. It counts how many successes you get in n attempts, each with success probability p.

X ~ Binomial(n, p)

P(X = k) = C(n, k) × p^k × (1 - p)^(n - k)

E[X] = np     Var(X) = np(1 - p)
Example -- Server Uptime

You have 5 independent servers, each with 95% uptime (p = 0.95). What is the probability that exactly 4 are running?

X ~ Binomial(5, 0.95), find P(X = 4):

P(X = 4) = C(5, 4) × (0.95)^4 × (0.05)^1

= 5 × 0.8145 × 0.05

= 0.2036 ≈ 20.4%
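
The binomial PMF is a direct transcription of the formula. A sketch (the helper name `binomial_pmf` is ours):

```python
import math

def binomial_pmf(k, n, p):
    """P(X = k) for X ~ Binomial(n, p)."""
    return math.comb(n, k) * p**k * (1 - p)**(n - k)

# 5 servers, each up with probability 0.95: P(exactly 4 running)
p4 = binomial_pmf(4, 5, 0.95)
print(round(p4, 4))  # 0.2036
```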

When to Use Binomial

Use the binomial distribution when you have: (1) a fixed number of trials n, (2) each trial is independent, (3) each trial has only two outcomes (success/failure), and (4) the probability p is the same for every trial.

Uniform Distribution

Every outcome is equally likely. Comes in two flavors:

Discrete Uniform: X takes values {a, a+1, ..., b}, each with probability 1/(b - a + 1)

Continuous Uniform: X ~ Uniform(a, b)
f(x) = 1/(b - a)   for a ≤ x ≤ b

E[X] = (a + b) / 2     Var(X) = (b - a)^2 / 12
Example -- Random Number Generator

Math.random() in JavaScript returns a value from the continuous uniform distribution on [0, 1).

P(0.3 ≤ X ≤ 0.7) = 0.7 - 0.3 = 0.4

The probability of landing in any interval is just the length of that interval (divided by the total range).

Normal (Gaussian) Distribution

The famous bell curve. It is defined by its mean μ (center) and standard deviation σ (spread). It appears everywhere in nature and statistics thanks to the Central Limit Theorem: the average of many independent random variables tends toward a normal distribution.

X ~ Normal(μ, σ²)

f(x) = (1 / σ√(2π)) × e^(-(x - μ)² / (2σ²))

E[X] = μ     Var(X) = σ²

The 68-95-99.7 Rule (Empirical Rule):

  • ~68% of values fall within 1 standard deviation of the mean (μ ± σ)
  • ~95% fall within 2 standard deviations (μ ± 2σ)
  • ~99.7% fall within 3 standard deviations (μ ± 3σ)

Example -- Response Times

A web API has response times normally distributed with μ = 200ms and σ = 30ms.

68% of requests complete between 170ms and 230ms (200 ± 30)

95% complete between 140ms and 260ms (200 ± 60)

A response taking 290ms (3σ above mean) is very unusual -- only about 0.15% of requests are that slow.
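
Python's `statistics.NormalDist` (standard library, Python 3.8+) lets you check these numbers without any tables:

```python
from statistics import NormalDist

# Response times: X ~ Normal(mu=200ms, sigma=30ms)
latency = NormalDist(mu=200, sigma=30)

# Fraction of requests between 170ms and 230ms (one sigma either side)
within_1sigma = latency.cdf(230) - latency.cdf(170)
print(round(within_1sigma, 4))  # 0.6827

# How unusual is a 290ms response (3 sigma above the mean)?
slower_than_290 = 1 - latency.cdf(290)
print(round(slower_than_290, 4))  # 0.0013
```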

Poisson Distribution

Models the number of events occurring in a fixed interval of time or space, when events happen independently at a constant average rate λ (lambda).

X ~ Poisson(λ)

P(X = k) = (e^(-λ) × λ^k) / k!

E[X] = λ     Var(X) = λ
Example -- Website Errors

A website averages 3 server errors per hour (λ = 3). What is the probability of exactly 0 errors in the next hour?

P(X = 0) = (e^(-3) × 3^0) / 0! = e^(-3) × 1 / 1 = e^(-3) ≈ 0.0498 ≈ 5%

So there is about a 5% chance of a completely error-free hour.
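
The Poisson PMF is another one-liner from the formula. A sketch (helper name ours):

```python
import math

def poisson_pmf(k, lam):
    """P(X = k) for X ~ Poisson(lam)."""
    return math.exp(-lam) * lam**k / math.factorial(k)

# 3 errors per hour on average: chance of a completely error-free hour
p0 = poisson_pmf(0, 3)
print(round(p0, 4))  # 0.0498

# Related quantities fall out for free, e.g. P(at most 2 errors):
p_at_most_2 = sum(poisson_pmf(k, 3) for k in range(3))
print(round(p_at_most_2, 4))
```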

When to Use Poisson

Use Poisson when counting occurrences per unit of time/space: website hits per minute, typos per page, network packets per second, bugs per 1000 lines of code. The key assumption is that events happen independently at a constant rate.

Distribution Summary Table

Distribution    Type         Use When                      E[X]      Var(X)
Bernoulli(p)    Discrete     Single yes/no trial           p         p(1-p)
Binomial(n,p)   Discrete     Count successes in n trials   np        np(1-p)
Uniform(a,b)    Both         All outcomes equally likely   (a+b)/2   (b-a)²/12
Normal(μ,σ²)    Continuous   Natural phenomena, averages   μ         σ²
Poisson(λ)      Discrete     Events per time interval      λ         λ

8. Expected Value & Variance

Expected Value (Mean)

The expected value E[X] is the long-run average of a random variable -- what you would get on average if you repeated the experiment infinitely many times. It is a weighted average of all possible values, where each value is weighted by its probability.

Think of Expected Value as a Weighted Average

Imagine you run an experiment 1,000,000 times. The expected value is the average of all those results.

Why "weighted"? Because more likely outcomes contribute more to the average. If you roll a loaded die that shows 6 half the time, the expected value is pulled toward 6 -- not the middle.

Key insight: Expected value does NOT have to be a value you can actually get. The expected value of a die roll is 3.5 -- you can never roll 3.5, but it's the "center of gravity" of all possible outcomes.

For discrete random variables:

E[X] = Σ x × P(X = x)   (sum over all possible values of x)

Translation: For each possible value, multiply it by how likely it is. Add them all up.
Example -- Expected Value of a Die Roll

X = result of rolling a fair six-sided die.

E[X] = 1(1/6) + 2(1/6) + 3(1/6) + 4(1/6) + 5(1/6) + 6(1/6)

= (1 + 2 + 3 + 4 + 5 + 6) / 6 = 21/6 = 3.5

You can never roll a 3.5, but over many rolls, the average converges to 3.5.

CS Example -- Expected Comparisons in Linear Search

Searching for a random element in an array of n elements (assuming it is present and equally likely to be at any position):

E[comparisons] = 1(1/n) + 2(1/n) + 3(1/n) + ... + n(1/n)

= (1 + 2 + ... + n) / n = n(n+1)/(2n) = (n+1)/2

On average, linear search checks about half the array. This is why we say it has O(n) average complexity.

Properties of Expected Value

E[aX + b] = a × E[X] + b    (scaling and shifting)

E[X + Y] = E[X] + E[Y]    (always true, even if X and Y are dependent!)

E[XY] = E[X] × E[Y]    (only if X and Y are independent)
Example -- Linearity of Expectation

You roll two dice. What is the expected sum?

E[die1 + die2] = E[die1] + E[die2] = 3.5 + 3.5 = 7

This works even though the two dice create 36 different outcomes. Linearity of expectation is a surprisingly powerful shortcut!

Variance and Standard Deviation

While expected value tells you the center, variance tells you how spread out the values are around that center. Standard deviation is the square root of variance and has the same units as the original variable.

Var(X) = E[(X - E[X])²] = E[X²] - (E[X])²

Standard Deviation: σ = √Var(X)

Properties:
Var(aX + b) = a² × Var(X)   (constants shift but don't spread; scaling squares)
Var(X + Y) = Var(X) + Var(Y)   (only if X and Y are independent)
Example -- Variance of a Die Roll

E[X] = 3.5 (from above)

E[X²] = 1²(1/6) + 2²(1/6) + 3²(1/6) + 4²(1/6) + 5²(1/6) + 6²(1/6)

= (1 + 4 + 9 + 16 + 25 + 36) / 6 = 91/6 ≈ 15.167

Var(X) = E[X²] - (E[X])² = 91/6 - (3.5)² = 91/6 - 12.25 = 35/12 ≈ 2.917

σ = √(35/12) ≈ 1.708
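
Both formulas are weighted sums over the PMF, which makes them two comprehensions in code. A sketch using exact fractions so the answers match the derivation above:

```python
from fractions import Fraction

# PMF of a fair die: each face 1..6 with probability 1/6
pmf = {x: Fraction(1, 6) for x in range(1, 7)}

ev = sum(x * p for x, p in pmf.items())        # E[X]
ev_sq = sum(x**2 * p for x, p in pmf.items())  # E[X²]
var = ev_sq - ev**2                            # Var(X) = E[X²] - (E[X])²

print(ev, var)  # 7/2 35/12
```

Swap in any other PMF dictionary (say, a loaded die) and the same two lines give its mean and variance.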

Why CS Cares

Expected value is the foundation of algorithm analysis. When we say quicksort is O(n log n) "on average," we mean the expected number of comparisons is proportional to n log n. Variance tells you how much the actual runtime might deviate from that average -- a low-variance algorithm is more predictable, which matters for real-time systems.

9. Basic Statistics

Statistics is the practice of collecting, organizing, and interpreting data. While probability predicts what should happen, statistics analyzes what did happen. Here are the key descriptive statistics you need to know.

Measures of Central Tendency

Mean (Average): Sum of all values divided by the count. Sensitive to outliers.

Mean = (x_1 + x_2 + ... + x_n) / n = (Σ x_i) / n

Median: The middle value when data is sorted. Robust to outliers.

Mode: The most frequently occurring value. Can have multiple modes or none.

Example -- Salary Data

Team salaries: $50k, $55k, $60k, $65k, $70k, $500k (the CEO)

Mean: (50 + 55 + 60 + 65 + 70 + 500) / 6 = 800/6 = $133.3k

Median: Sort and find middle. With 6 values, average positions 3 and 4: (60 + 65)/2 = $62.5k

The mean ($133.3k) is misleading -- nobody on the team earns close to that. The median ($62.5k) better represents the "typical" salary. This is why median household income is more meaningful than mean household income.
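
The `statistics` module computes both measures directly, which makes the outlier effect easy to see:

```python
from statistics import mean, median

# Salaries in $k, including the CEO outlier
salaries = [50, 55, 60, 65, 70, 500]

print(round(mean(salaries), 1))  # 133.3 -- dragged up by the outlier
print(median(salaries))          # 62.5  -- the "typical" salary
```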

Range

Range = Maximum value - Minimum value

Simple but crude -- it only looks at two data points and is heavily influenced by outliers.

Standard Deviation

Measures how spread out the data is from the mean. A small standard deviation means data points cluster tightly around the mean; a large one means they are spread out.

Population standard deviation:  σ = √[ Σ(x_i - μ)² / N ]

Sample standard deviation:  s = √[ Σ(x_i - x̅)² / (n - 1) ]
Population vs Sample

If your data is the entire population (every possible data point), divide by N. If your data is a sample (a subset), divide by n-1. The n-1 correction (called Bessel's correction) prevents underestimating the true variability. In CS, you almost always work with samples.

Percentiles and Quartiles

A percentile tells you what percentage of data falls below a given value. Quartiles divide data into four equal parts:

  • Q1 (25th percentile): a quarter of the data falls below this value
  • Q2 (50th percentile): the median
  • Q3 (75th percentile): three quarters of the data fall below this value

CS Example -- Latency Percentiles

Web services commonly report latency percentiles:

  • p50 = 45ms: Half of requests complete in 45ms or less
  • p95 = 120ms: 95% of requests complete in 120ms or less
  • p99 = 800ms: 99% complete in 800ms or less (the "tail latency")

The p99 is crucial -- it tells you the experience of your worst-off 1% of users. A service with great p50 but terrible p99 has a hidden reliability problem.

When to Use Each Measure

Measure              Best For                              Watch Out For
Mean                 Symmetric data, calculating totals    Skewed by outliers
Median               Skewed data, income, home prices      Ignores extreme values
Mode                 Categorical data, most common value   May not exist or may be multiple
Standard Deviation   Measuring consistency/reliability     Affected by outliers (like mean)
Percentiles          Understanding distribution shape      Need enough data points

10. CS Applications

Here is where probability and statistics come alive in computer science. These are real problems you will encounter as a developer, and probability gives you the tools to reason about them.

Hash Function Collision Probability

A hash function maps inputs to a fixed-size output (e.g., 256 bits). When two different inputs produce the same hash, that is a collision. How likely is it?

If your hash function has m possible outputs and you hash n items, the probability of at least one collision is approximately:

P(collision) ≈ 1 - e^(-n² / (2m))

For a 128-bit hash (m = 2^128), you need about 2^64 items before a collision becomes likely. This is why 128-bit hashes are considered secure for most purposes -- 2^64 is an astronomically large number.
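
The approximation is easy to evaluate directly. A sketch (helper name ours) showing why 2^64 is the tipping point for a 128-bit hash:

```python
import math

def p_collision(n, m):
    """Approximate P(at least one collision) when hashing n items into m slots."""
    return 1 - math.exp(-n**2 / (2 * m))

m = 2**128  # 128-bit hash space

print(p_collision(2**40, m))            # ~1e-15: a trillion items is still safe
print(round(p_collision(2**64, m), 3))  # 0.393: at the birthday bound, likely
```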

The Birthday Problem (Birthday Paradox)

How many people do you need in a room before there is a 50% chance that two of them share a birthday? Surprisingly, only 23!

Worked Example -- Birthday Problem

There are 365 possible birthdays (ignoring leap years). Instead of directly computing the probability of a match, we use the complement: what is the probability that nobody shares a birthday?

Person 1: any birthday. Probability of no conflict: 365/365

Person 2: must differ from person 1: 364/365

Person 3: must differ from persons 1 and 2: 363/365

...

Person n: (365 - n + 1)/365

P(no shared birthday among n people) = (365/365) × (364/365) × ... × ((365 - n + 1)/365)

P(at least one shared birthday) = 1 - P(no shared birthday)

For n = 23: P(match) ≈ 50.7%

For n = 50: P(match) ≈ 97%

For n = 70: P(match) ≈ 99.9%
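
The product above is a short loop in code, which lets you reproduce all three figures:

```python
def p_shared_birthday(n):
    """P(at least two of n people share a birthday), 365 equally likely days."""
    p_all_distinct = 1.0
    for i in range(n):
        p_all_distinct *= (365 - i) / 365  # person i+1 avoids all earlier birthdays
    return 1 - p_all_distinct              # complement rule

print(round(p_shared_birthday(23), 3))  # 0.507
print(round(p_shared_birthday(50), 3))  # 0.97
```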

CS Connection

The birthday problem is not about birthdays -- it is about collisions. The same math applies to hash collisions, UUID conflicts, random port assignments, and any situation where you are picking random values from a finite set. The general rule: collisions become likely after about √m selections from m possibilities. This is called the birthday bound.

Randomized Algorithms

Many algorithms use randomness to achieve better expected performance, even if worst-case performance is still bad.

Example -- Quicksort Pivot Selection

Quicksort's performance depends on choosing a good pivot:

  • Worst case (bad pivot every time): O(n²)
  • Expected case (random pivot): O(n log n)

By choosing the pivot randomly, the probability of consistently picking the worst pivot is astronomically small. The expected number of comparisons is about 2n ln n ≈ 1.39 n log₂ n.

This is why randomized quicksort is preferred in practice -- it has no "adversarial" inputs that can force worst-case behavior (unlike always picking the first element as pivot).

Monte Carlo Methods

Monte Carlo methods use random sampling to estimate quantities that are hard to compute exactly. The idea is simple: generate lots of random samples, and use them to approximate the answer.

Example -- Estimating Pi with Random Points

Draw a unit square [0,1] x [0,1] and inscribe a quarter circle of radius 1.

Area of quarter circle = π/4. Area of square = 1.

Generate N random points in the square. Count how many fall inside the quarter circle (where x² + y² ≤ 1).

π ≈ 4 × (points inside circle) / (total points)

With 10,000 random points, you typically get π ≈ 3.14 ± 0.02. With 1,000,000 points, the estimate becomes much more accurate.
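
The method described above is only a few lines of Python (seed chosen arbitrarily for reproducibility):

```python
import random

random.seed(1)  # arbitrary seed so the estimate is reproducible

N = 1_000_000
# Each sample is a random point (x, y) in the unit square;
# it lands inside the quarter circle when x^2 + y^2 <= 1
inside = sum(1 for _ in range(N)
             if random.random()**2 + random.random()**2 <= 1)

pi_estimate = 4 * inside / N
print(pi_estimate)  # close to 3.14159
```

Doubling the accuracy requires roughly four times as many samples, a hallmark of Monte Carlo methods: the error shrinks like 1/√N.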

Machine Learning Connections

Probability is the backbone of machine learning. Naive Bayes classifiers (Section 4) are built directly on Bayes' theorem, and classifiers in general output probabilities ("90% likely spam") rather than hard yes/no answers, so reasoning about their behavior means reasoning about probability.

A/B Testing

A/B testing uses statistics to determine whether a change (new UI, different algorithm, pricing change) actually improves a metric or whether the difference is just random chance.

Example -- Testing a New Button Color

Version A (blue button): 1,000 visitors, 50 clicked (5.0% conversion)

Version B (green button): 1,000 visitors, 65 clicked (6.5% conversion)

Is the 1.5% difference real or just luck?

Statistical tests (like a chi-squared test or z-test for proportions) compute a p-value -- the probability of seeing this large a difference by pure chance.

If p-value < 0.05 (the conventional threshold), we say the result is statistically significant and the green button likely performs better.

Without understanding probability, you might deploy changes that only appeared to work due to random variation.
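
A two-proportion z-test is one standard way to answer this question; here is a sketch using only the standard library (the pooled-proportion formula is the textbook version, not from any testing framework):

```python
from math import sqrt
from statistics import NormalDist

n_a, clicks_a = 1000, 50  # blue button:  5.0% conversion
n_b, clicks_b = 1000, 65  # green button: 6.5% conversion

p_a, p_b = clicks_a / n_a, clicks_b / n_b
p_pool = (clicks_a + clicks_b) / (n_a + n_b)       # combined rate under H0
se = sqrt(p_pool * (1 - p_pool) * (1/n_a + 1/n_b)) # standard error of the gap

z = (p_b - p_a) / se
p_value = 2 * (1 - NormalDist().cdf(abs(z)))  # two-sided test

print(round(z, 2), round(p_value, 3))
```

With these particular numbers the p-value comes out around 0.15, above the 0.05 threshold, so this 1.5% gap could plausibly be luck: you would need more visitors before declaring the green button a winner.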

Key Takeaway

Probability is not just theory -- it is a practical tool used daily in software engineering. From load balancers to recommendation engines, from security analysis to performance testing, understanding probability makes you a better engineer.

11. Practice Quiz

Test your understanding. Think through each problem before clicking an answer.

Q1: You flip a fair coin 4 times. What is the probability of getting at least one head?

C) 15/16. Use the complement rule: P(at least one head) = 1 - P(no heads) = 1 - P(TTTT) = 1 - (1/2)^4 = 1 - 1/16 = 15/16 = 0.9375. The complement approach avoids counting all 15 favorable outcomes individually.

Q2: A disease affects 1% of the population. A test is 90% accurate (90% sensitivity, 90% specificity). If you test positive, what is the approximate probability you have the disease?

A) About 8.3%. Applying Bayes' theorem: P(D|T+) = P(T+|D) × P(D) / P(T+). P(T+) = (0.90)(0.01) + (0.10)(0.99) = 0.009 + 0.099 = 0.108. So P(D|T+) = 0.009 / 0.108 ≈ 0.083 or 8.3%. The low base rate (1%) means most positive results are false positives. This is the base rate fallacy in action.

Q3: You have 8 programmers and need to choose a team of 3 for a hackathon. How many different teams are possible?

C) 56. Order does not matter (the team {Alice, Bob, Carol} is the same as {Carol, Alice, Bob}), so use combinations: C(8,3) = 8! / (3! × 5!) = (8 × 7 × 6) / (3 × 2 × 1) = 336 / 6 = 56.

Q4: A server has 99.9% uptime (p = 0.999). With 3 independent servers and the system fails only if ALL 3 fail, what is the system's uptime probability?

D) 99.9999999% (nine nines!). The system fails only if all 3 fail. P(all fail) = (0.001)^3 = 0.000000001. P(system up) = 1 - 0.000000001 = 0.999999999. This is why redundancy is so powerful in distributed systems -- multiplying independent failure probabilities gives exponentially better reliability.

Q5: A random variable X has values {1, 2, 3} with probabilities {0.2, 0.5, 0.3}. What is E[X]?

B) 2.1. E[X] = 1(0.2) + 2(0.5) + 3(0.3) = 0.2 + 1.0 + 0.9 = 2.1. Expected value is a weighted average -- each value is multiplied by how likely it is. Note that 2.1 is not one of the possible values of X, and that is fine. Expected value is a long-run average, not a value X must take.