Probability Cheat Sheet
Visual Overview: Probability Rules
[Diagram: probability rules. Key insight: always subtract the intersection when combining probabilities, to avoid double-counting.]
Basic Probability
// Probability: Likelihood of event (0 to 1)
P(A) = 0: Impossible
P(A) = 1: Certain
P(A) = 0.5: Equally likely
// Complement
P(A^c) = 1 - P(A)
// Conditional probability
P(A|B) = P(A and B) / P(B)
Probability of A given B happened
Example: P(rain | cloudy)
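A quick numeric sanity check of the definition; the joint and marginal probabilities below are made-up illustration values, not data:

```python
# Conditional probability from the definition P(A|B) = P(A and B) / P(B).
# Both inputs are hypothetical example numbers.
p_rain_and_cloudy = 0.20   # joint probability P(rain and cloudy), assumed
p_cloudy = 0.40            # marginal probability P(cloudy), assumed

p_rain_given_cloudy = p_rain_and_cloudy / p_cloudy
# P(rain | cloudy) = 0.20 / 0.40 = 0.5
```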
// Independence
A and B independent if P(A|B) = P(A)
Occurrence of B doesn't affect A
// Joint probability
P(A and B) = P(A) × P(B) if independent
P(A and B) = P(A) × P(B|A) if dependent
// Marginal probability
P(A) = Σ P(A and B_i) for all B_i
Sum over all possible other events
// Law of total probability
P(A) = P(A|B) × P(B) + P(A|¬B) × P(¬B)
// Example: Disease detection
P(positive) = P(positive|disease) × P(disease)
+ P(positive|no disease) × P(no disease)
= 0.95 × 0.01 + 0.05 × 0.99
= 0.0095 + 0.0495 = 0.059
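The total-probability arithmetic can be checked in a few lines, using the same numbers as the example (1% prevalence, 95% sensitivity, 5% false-positive rate):

```python
# Law of total probability:
# P(positive) = P(pos|disease)P(disease) + P(pos|no disease)P(no disease)
p_disease = 0.01          # prevalence (prior)
sensitivity = 0.95        # P(positive | disease)
false_positive = 0.05     # P(positive | no disease) = 1 - specificity

p_positive = sensitivity * p_disease + false_positive * (1 - p_disease)
# 0.95 * 0.01 + 0.05 * 0.99 = 0.059
```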
Bayes Theorem
// Bayes Theorem: Update beliefs with evidence
P(A|B) = P(B|A) × P(A) / P(B)
P(A|B): Posterior (updated probability)
P(B|A): Likelihood (evidence strength)
P(A): Prior (initial belief)
P(B): Evidence (normalizing constant)
// Bayesian reasoning
Start: Prior belief P(A)
Observe: Event B
Update: P(A|B) = P(B|A) × P(A) / P(B)
// Medical test example
Disease prevalence: 1% (prior)
Test sensitivity: 95% (P(positive|disease))
Test specificity: 95% (P(negative|no disease))
P(disease|positive) = 0.95 × 0.01 / P(positive)
P(positive) = 0.95 × 0.01 + 0.05 × 0.99 = 0.059
P(disease|positive) = 0.0095 / 0.059 ≈ 16.1%
Surprising! Despite the 95% accurate test, only a ~16% chance of disease
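The full Bayes computation for this test, with all numbers taken straight from the example:

```python
# Bayes' theorem for the medical-test example.
prior = 0.01              # P(disease), the prevalence
sensitivity = 0.95        # P(positive | disease)
specificity = 0.95        # P(negative | no disease)

evidence = sensitivity * prior + (1 - specificity) * (1 - prior)  # P(positive)
posterior = sensitivity * prior / evidence                        # P(disease | positive)
# posterior = 0.0095 / 0.059, about 0.161: a positive test leaves ~16% disease probability
```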
// Spam email example
Prior: 80% of emails are spam
P(word "cash" | spam) = 0.8
P(word "cash" | legitimate) = 0.1
If email contains "cash":
P(spam | "cash") = 0.8 × 0.8 / P("cash")
= 0.64 / [0.64 + 0.1 × 0.2]
= 0.64 / 0.66 ≈ 97%
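The same computation for the spam example, with the priors and likelihoods given above:

```python
# P(spam | "cash") via Bayes' theorem.
p_spam = 0.80              # prior: 80% of emails are spam
p_cash_given_spam = 0.80   # likelihood of the word "cash" in spam
p_cash_given_legit = 0.10  # likelihood of "cash" in legitimate email

p_cash = p_cash_given_spam * p_spam + p_cash_given_legit * (1 - p_spam)  # 0.66
p_spam_given_cash = p_cash_given_spam * p_spam / p_cash
# 0.64 / 0.66, about 0.97
```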
// Bayesian updates (multiple evidence)
Posterior becomes new prior for next observation
P(A|B,C) = P(C|A,B) × P(A|B) / P(C|B)
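A sketch of sequential updating, assuming each piece of evidence is conditionally independent given A (the same assumption Naive Bayes makes); here, two positive medical tests in a row:

```python
def bayes_update(prior, likelihood_true, likelihood_false):
    """One Bayes update: P(A | evidence) from P(A) and the two likelihoods."""
    evidence = likelihood_true * prior + likelihood_false * (1 - prior)
    return likelihood_true * prior / evidence

# Two positive test results, likelihoods assumed independent given disease.
p = 0.01                       # initial prior: 1% prevalence
for _ in range(2):
    p = bayes_update(p, 0.95, 0.05)  # posterior becomes the next prior
# after two positives the posterior rises well above the single-test ~16%
```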
// Machine learning
Naive Bayes classifier uses Bayes theorem
P(class|features) = P(features|class) × P(class) / P(features)
Expectation & Variance
// Expected value: Long-run average
E[X] = Σ x × P(x) (discrete)
E[X] = ∫ x × f(x) dx (continuous)
Example: Fair die
E[X] = 1×(1/6) + 2×(1/6) + ... + 6×(1/6) = 3.5
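The die computation, kept exact with `fractions` so no rounding hides the result:

```python
from fractions import Fraction

# Expected value of a fair die: E[X] = sum of x * P(x) over all faces.
faces = range(1, 7)
e_x = sum(Fraction(x, 6) for x in faces)   # each face has probability 1/6
# E[X] = 21/6 = 3.5
```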
// Linearity of expectation
E[X + Y] = E[X] + E[Y] (always true!)
E[c×X] = c × E[X]
// Variance: Spread around mean
Var(X) = E[(X - μ)²] = E[X²] - (E[X])²
σ = √Var(X) (standard deviation)
// Example: Fair die
μ = 3.5
E[X²] = 1²×(1/6) + ... + 6²×(1/6) = 91/6
Var(X) = 91/6 - (3.5)² = 35/12 ≈ 2.92
σ ≈ 1.71
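The same die, now with the variance computed via the shortcut formula Var(X) = E[X²] - (E[X])²:

```python
from fractions import Fraction

# Variance of a fair die via Var(X) = E[X^2] - (E[X])^2.
faces = range(1, 7)
e_x = sum(Fraction(x, 6) for x in faces)         # 21/6 = 7/2
e_x2 = sum(Fraction(x * x, 6) for x in faces)    # 91/6
var_x = e_x2 - e_x ** 2                          # 91/6 - 49/4 = 35/12
sigma = float(var_x) ** 0.5                      # about 1.708
```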
// Variance properties
Var(X + c) = Var(X) (adding constant doesn't change spread)
Var(c×X) = c²×Var(X) (scaling by c multiplies variance by c²)
Var(X + Y) = Var(X) + Var(Y) if independent
// Covariance: Joint variability
Cov(X,Y) = E[(X-μ_X)(Y-μ_Y)]
Positive: When X high, Y tends high
Negative: When X high, Y tends low
Zero: No linear relationship
// Correlation coefficient
ρ = Cov(X,Y) / (σ_X × σ_Y)
Range: -1 to 1
Standard measure of linear relationship
Distributions (Properties)
| Distribution | Mean μ | Variance σ² | Practical Use |
|---|---|---|---|
| Binomial (n,p) | np | np(1-p) | Number of successes in n trials |
| Poisson (λ) | λ | λ | Count of rare events |
| Normal (μ,σ) | μ | σ² | Natural phenomena |
| Exponential (λ) | 1/λ | 1/λ² | Waiting time |
| Uniform (a,b) | (a+b)/2 | (b-a)²/12 | Equal probability |
| Beta (α,β) | α/(α+β) | αβ / ((α+β)²(α+β+1)) | Probabilities as RV |
| Chi-squared (k) | k | 2k | Variance testing |
68-95-99.7 rule (Normal): 68% within μ±σ, 95% within μ±2σ, 99.7% within μ±3σ
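The 68-95-99.7 rule can be verified directly from the normal CDF (`statistics.NormalDist`, Python 3.8+):

```python
from statistics import NormalDist

# Check the 68-95-99.7 rule against the standard normal CDF.
z = NormalDist()  # mean 0, standard deviation 1

def within(k):
    """Probability mass within k standard deviations of the mean."""
    return z.cdf(k) - z.cdf(-k)

# within(1) ≈ 0.683, within(2) ≈ 0.954, within(3) ≈ 0.997
```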
Probability Inequalities & Laws
// Markov's inequality
For non-negative X: P(X ≥ a) ≤ E[X] / a
Loose but always true bound
// Chebyshev's inequality
P(|X - μ| ≥ kσ) ≤ 1/k²
E.g., P(|X - μ| ≥ 2σ) ≤ 1/4 = 25%
(Normal: actually ≈ 5%)
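Comparing the distribution-free Chebyshev bound against the exact normal tail shows how loose the bound is:

```python
from statistics import NormalDist

# Chebyshev bound vs. the exact normal tail for k = 2 standard deviations.
k = 2
chebyshev_bound = 1 / k**2                   # 0.25, valid for ANY distribution
z = NormalDist()
normal_tail = 1 - (z.cdf(k) - z.cdf(-k))     # about 0.0455 for the normal
# the universal bound (25%) is much looser than the normal-specific value (~5%)
```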
// Law of Large Numbers
Average of many independent samples → expected value
X̄ = (X₁ + ... + X_n) / n → E[X] as n → ∞
Justifies: Using sample mean as estimate
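A quick simulation of the law of large numbers for die rolls (seed fixed so the run is reproducible):

```python
import random

# Law of large numbers: the sample mean of die rolls approaches E[X] = 3.5.
random.seed(0)  # fixed seed for reproducibility
n = 100_000
mean = sum(random.randint(1, 6) for _ in range(n)) / n
# mean lands close to 3.5 for large n
```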
// Central Limit Theorem
Sum/average of many independent variables → Normal distribution
Regardless of original distribution!
X̄ ~ Normal(μ, σ²/n) approximately, for large n
Example: Average of 30 dice rolls ~ Normal
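Simulating that example: averages of 30 die rolls cluster around μ = 3.5 with spread σ/√30 ≈ 0.31, even though a single roll is uniform, not normal:

```python
import random
import statistics

# Central limit theorem: averages of 30 die rolls are approximately normal.
random.seed(1)  # fixed seed for reproducibility
averages = [statistics.mean(random.randint(1, 6) for _ in range(30))
            for _ in range(2_000)]

mu = statistics.mean(averages)    # close to E[X] = 3.5
sd = statistics.stdev(averages)   # close to sigma / sqrt(30) = 1.708 / 5.48
```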
// Bernoulli's inequality
(1 + x)^n ≥ 1 + nx for x ≥ -1 and n ≥ 1
// Union bound (Boole's inequality)
P(A ∪ B) ≤ P(A) + P(B)
Union probability ≤ sum of probabilities
// Bonferroni correction
Testing multiple hypotheses: Divide α by number of tests
α_adjusted = α / m
Prevents: Multiple comparisons increasing false positives
// Example: 20 hypothesis tests
Normal α = 0.05
Bonferroni α = 0.05 / 20 = 0.0025
Much stricter threshold
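The correction itself is one line; shown here with the 20-test example:

```python
# Bonferroni correction: divide the significance level by the number of tests.
alpha = 0.05
m = 20                        # number of hypothesis tests
alpha_adjusted = alpha / m    # 0.0025: each test uses the stricter threshold
```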