🧠 AI Computer Institute
Content is AI-generated for educational purposes. Verify critical information independently. A bharath.ai initiative.

Neural Network Math Cheat Sheet

ai-ml · Grades 11-12 · 7 sections

Forward Propagation

// Linear transformation
z = Wx + b
- W: weight matrix (n_out × n_in)
- x: input vector (n_in × 1)
- b: bias vector (n_out × 1)
- z: pre-activation (n_out × 1)

// Activation
a = σ(z)
- σ: activation function (ReLU, sigmoid, tanh, etc.)

// Output layer
ŷ = softmax(z) for classification
ŷ = z for regression

// Batch processing
Z = WX + b (broadcast b)
- X: batch of inputs (n_in × batch_size)
- Z: pre-activations (n_out × batch_size)
- A: activations (n_out × batch_size)

// Example: 2-layer network
a1 = σ1(W1×x + b1)
ŷ = σ2(W2×a1 + b2)
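The 2-layer forward pass above, sketched in NumPy (the layer sizes and the choices σ1 = ReLU, σ2 = sigmoid are illustrative):

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
n_in, n_hidden, n_out = 3, 4, 2

# Small random weights, zero biases (illustrative values)
W1 = rng.normal(0.0, 0.1, (n_hidden, n_in))
b1 = np.zeros((n_hidden, 1))
W2 = rng.normal(0.0, 0.1, (n_out, n_hidden))
b2 = np.zeros((n_out, 1))

x = rng.normal(size=(n_in, 1))   # input vector (n_in x 1)

z1 = W1 @ x + b1                 # pre-activation (n_hidden x 1)
a1 = relu(z1)                    # sigma1 = ReLU
z2 = W2 @ a1 + b2                # pre-activation (n_out x 1)
y_hat = sigmoid(z2)              # sigma2 = sigmoid
```

Note how every shape follows the rules above: (n_out × n_in) times (n_in × 1) plus (n_out × 1) gives (n_out × 1).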

Loss & Gradient

// Loss function
L = -Σ yi × log(ŷi)  [Cross-entropy]
L = (1/2m) × Σ(ŷ - y)²  [MSE]

// Cost function (with regularization)
J = L + λ/(2m) × Σ||W||²  [L2 regularization]

// Gradient (derivative w.r.t. parameter)
∂J/∂W = gradient of loss w.r.t. W
∂J/∂b = gradient of loss w.r.t. b

// Numerical gradient (approximate, slow)
∂J/∂w ≈ [J(w+ε) - J(w-ε)] / (2ε)
Used for gradient checking to verify correctness

// Analytical gradient (via backprop, fast)
Compute exactly using chain rule
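Gradient checking in miniature: compare the numerical and analytical gradients for a one-parameter MSE model (the data and the value of w are illustrative):

```python
import numpy as np

def mse_loss(w, x, y):
    # One-parameter model y_hat = w * x, loss = (1/2m) * sum((y_hat - y)^2)
    return 0.5 * np.mean((w * x - y) ** 2)

x = np.array([1.0, 2.0, 3.0])
y = np.array([2.0, 4.0, 6.0])
w = 1.5
eps = 1e-5

# Numerical gradient: [J(w+eps) - J(w-eps)] / (2*eps)
num_grad = (mse_loss(w + eps, x, y) - mse_loss(w - eps, x, y)) / (2 * eps)

# Analytical gradient: dJ/dw = mean((w*x - y) * x)
ana_grad = np.mean((w * x - y) * x)

print(abs(num_grad - ana_grad))  # tiny: the analytical gradient checks out
```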

Backpropagation (Chain Rule)

// Chain Rule
If y = f(g(x)), then dy/dx = (dy/dg) × (dg/dx)

// Backprop through layers
dL/dW2 = (dL/dŷ) × (dŷ/dz2) × (dz2/dW2)
dL/db2 = (dL/dŷ) × (dŷ/dz2) × (dz2/db2)
dL/da1 = (dL/dŷ) × (dŷ/dz2) × (dz2/da1)

dL/dW1 = (dL/da1) × (da1/dz1) × (dz1/dW1)
dL/db1 = (dL/da1) × (da1/dz1) × (dz1/db1)

// Key derivatives
d/dz[σ(z)] = σ'(z)  [activation derivative]
d/dW[Wx + b] = x^T  [matrix derivative]
d/db[Wx + b] = 1

// For softmax + cross-entropy
dL/dz = (ŷ - y)  [Nice form!]

// Batch dimensions
dL/dW shape = (n_out × batch_size) @ (batch_size × n_in) = (n_out × n_in)
Gradients averaged over batch: dL/dW = (1/m) × dL/dW_sum
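The softmax + cross-entropy shortcut and the batch shape rule, together in one sketch (layer sizes and batch size are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
n_in, n_out, m = 4, 3, 8                 # m = batch size

W = rng.normal(0.0, 0.1, (n_out, n_in))
b = np.zeros((n_out, 1))
X = rng.normal(size=(n_in, m))
Y = np.eye(n_out)[:, rng.integers(0, n_out, m)]   # one-hot labels (n_out x m)

Z = W @ X + b                            # broadcast b over the batch
Z = Z - Z.max(axis=0, keepdims=True)     # numerical stability
Y_hat = np.exp(Z) / np.exp(Z).sum(axis=0, keepdims=True)

dZ = Y_hat - Y                           # softmax + CE: dL/dZ = y_hat - y
dW = (dZ @ X.T) / m                      # (n_out x m) @ (m x n_in) = (n_out x n_in)
db = dZ.mean(axis=1, keepdims=True)      # averaged over the batch
```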

Gradient Descent & Variants

// Gradient Descent Update
w_new = w_old - learning_rate × ∂J/∂w

// Learning rate effects
Too high: Oscillate, diverge
Too low: Slow convergence, get stuck
Sweet spot: Fast, stable convergence

// Batch Gradient Descent
Use all data → accurate gradient, slow, stable

// Stochastic Gradient Descent (SGD)
Use 1 sample → noisy gradient, fast, chaotic

// Mini-batch SGD
Use subset (32, 64, 128) → balance
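A minimal mini-batch SGD loop for a one-parameter linear model (the synthetic data, learning rate, and batch size are illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(100, 1))
y = 3.0 * X[:, 0] + rng.normal(0.0, 0.1, 100)   # true slope is 3

w, lr, batch_size = 0.0, 0.1, 32
for epoch in range(50):
    idx = rng.permutation(len(X))               # reshuffle each epoch
    for start in range(0, len(X), batch_size):
        batch = idx[start:start + batch_size]
        xb, yb = X[batch, 0], y[batch]
        grad = np.mean((w * xb - yb) * xb)      # dJ/dw for MSE on the batch
        w -= lr * grad                          # gradient descent update
```

After training, w lands close to the true slope despite the noisy per-batch gradients.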

// Momentum (velocity)
v = β×v + ∂J/∂w  [accumulate gradient]
w = w - lr × v
β typically 0.9 → "friction", smooths updates

// Nesterov Momentum
Evaluate gradient at position ahead
w_ahead = w - β × v
v = β×v + ∂J/∂w|_{w_ahead}
w = w - lr × v

// AdaGrad (adaptive learning rate)
g = ∂J/∂w
s = s + g²  [accumulate squared gradients]
w = w - (lr/√(s + ε)) × g
Problem: s keeps growing, so the effective lr → 0

// RMSprop
s = β×s + (1-β)×g²
w = w - (lr/√(s + ε)) × g
Fixes AdaGrad: exponential moving average

// Adam (Adaptive Moment)
m = β1×m + (1-β1)×g  [1st moment: mean]
v = β2×v + (1-β2)×g²  [2nd moment: variance]
m_hat = m / (1 - β1^t)  [bias correction]
v_hat = v / (1 - β2^t)
w = w - (lr / √(v_hat + ε)) × m_hat
β1=0.9, β2=0.999, ε=1e-8
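The Adam update above as a reusable step function, sanity-checked on the toy objective f(w) = w² (the objective, learning rate, and step count are illustrative):

```python
import numpy as np

def adam_step(w, g, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    m = beta1 * m + (1 - beta1) * g          # 1st moment (mean of gradients)
    v = beta2 * v + (1 - beta2) * g ** 2     # 2nd moment (uncentered variance)
    m_hat = m / (1 - beta1 ** t)             # bias correction
    v_hat = v / (1 - beta2 ** t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v

# Minimize f(w) = w^2, whose gradient is 2w, starting from w = 1
w, m, v = 1.0, 0.0, 0.0
for t in range(1, 2001):                     # t starts at 1 for bias correction
    w, m, v = adam_step(w, 2.0 * w, m, v, t, lr=0.01)
```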

Weight Initialization

// Why initialization matters
Bad init → Dead neurons, vanishing/exploding gradients
Good init → Faster convergence, no saturation

// Zero initialization (BAD)
All weights = 0 → every neuron computes the same output and gets the same gradient, so symmetry never breaks and nothing is learned

// Random Normal
w ~ N(0, σ²)
Too small: Vanishing gradients
Too large: Exploding gradients

// Xavier/Glorot Initialization
For tanh, sigmoid
w ~ U[-√(6/(n_in+n_out)), √(6/(n_in+n_out))]
Or: w ~ N(0, 2/(n_in + n_out))
Keeps variance consistent across layers

// He Initialization
For ReLU
w ~ N(0, 2/n_in)
Accounts for ReLU zeroing out about half of the activations
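Xavier and He initialization side by side (the layer sizes are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_out = 256, 128

# Xavier/Glorot uniform: limit = sqrt(6 / (n_in + n_out))
limit = np.sqrt(6.0 / (n_in + n_out))
W_xavier = rng.uniform(-limit, limit, (n_out, n_in))

# He normal: variance = 2 / n_in, for ReLU layers
W_he = rng.normal(0.0, np.sqrt(2.0 / n_in), (n_out, n_in))

b = np.zeros((n_out, 1))   # biases usually start at zero
```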

// LSUV (Layer-Sequential Unit-Variance)
Initialize weights, then scale to unit variance
More robust than Xavier/He

// Biases
Usually initialized to 0
Sometimes small random values

// Batch Norm
Reduces sensitivity to initialization
Can use near-zero init with batch norm

Gradient Issues

// Vanishing Gradient
Gradients shrink as they backprop through layers
Common with sigmoid, tanh in deep networks
Sigmoid: ∂a/∂z = σ(z)(1-σ(z)) ≤ 0.25
Repeated multiplication → exponential decay

Solutions:
→ Use ReLU (∂/∂z = 0 or 1, no saturation)
→ Batch normalization (normalize activations)
→ Skip connections (ResNet)
→ Better initialization
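The ≤ 0.25 bound compounds fast. A quick check of the best case (z = 0 at every layer, where the sigmoid derivative peaks):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Product of sigmoid derivatives through 10 layers, all evaluated at z = 0
grad = 1.0
for _ in range(10):
    grad *= sigmoid(0.0) * (1.0 - sigmoid(0.0))   # = 0.25 per layer

print(grad)  # 0.25**10 ≈ 9.5e-7: the gradient has all but vanished
```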

// Exploding Gradient
Gradients grow exponentially
w = w - lr×∂J/∂w
If ∂J/∂w is huge → w jumps wildly

Solutions:
→ Gradient clipping: clip(∂J/∂w, -threshold, threshold)
→ Lower learning rate
→ Weight decay/L2 regularization
→ Batch normalization
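Both common clipping styles, element-wise (as in the clip() line above) and by global norm (the thresholds are illustrative):

```python
import numpy as np

def clip_by_value(g, threshold=1.0):
    # Element-wise: clip(dJ/dw, -threshold, threshold)
    return np.clip(g, -threshold, threshold)

def clip_by_norm(g, max_norm=5.0):
    # Rescale the whole gradient if its L2 norm exceeds max_norm,
    # preserving its direction
    norm = np.linalg.norm(g)
    return g * (max_norm / norm) if norm > max_norm else g

g = np.array([30.0, 40.0])     # L2 norm = 50
g_val = clip_by_value(g)       # -> [1.0, 1.0]
g_norm = clip_by_norm(g)       # -> [3.0, 4.0], norm = 5
```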

// Dead ReLU
Neurons output 0 for all inputs
∂L/∂w = 0 → weight never updates
Caused by a large negative bias or a big weight update (e.g. from a high learning rate) pushing z < 0 for all inputs

Solutions:
→ Leaky ReLU (small slope when x<0)
→ ELU (smooth negative)
→ Careful initialization
→ Lower learning rate

// Numerical Stability
Softmax: log-sum-exp trick
softmax(x) = softmax(x - max(x))
Prevents overflow

Cross-entropy + softmax combine nicely:
dL/dz = (ŷ - y)  [no exponential]
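The max-subtraction trick in code, on inputs that would overflow a naive implementation:

```python
import numpy as np

def softmax(z):
    z = z - np.max(z)     # shift: exp() never sees a large positive input
    e = np.exp(z)
    return e / e.sum()

z = np.array([1000.0, 1001.0, 1002.0])   # naive exp(1000) overflows to inf
p = softmax(z)                            # finite, sums to 1
```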

Matrix & Vector Calculus

// Scalar derivatives (chain rule)
df/dx = (df/du) × (du/dx)

// Vector derivatives (Jacobian)
∂f/∂x = [∂f₁/∂x₁ ... ∂f₁/∂xₙ]
        [  ...         ...    ]
        [∂fₘ/∂x₁ ... ∂fₘ/∂xₙ]
Shape: (m × n) for f: ℝⁿ → ℝᵐ

// Matrix derivatives (shape rules)
∂(Ax)/∂x = A^T  [denominator layout; the Jacobian itself is A]
∂(x^TAx)/∂x = (A + A^T)x  [quadratic form]
∂||Ax - b||²/∂x = 2A^T(Ax - b)
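The last identity can be verified against a numerical gradient (the matrix sizes are illustrative):

```python
import numpy as np

rng = np.random.default_rng(3)
A = rng.normal(size=(4, 3))
b = rng.normal(size=4)
x = rng.normal(size=3)

def f(x):
    return np.sum((A @ x - b) ** 2)   # ||Ax - b||^2

# Analytical gradient: 2 A^T (Ax - b)
ana = 2 * A.T @ (A @ x - b)

# Numerical gradient, one coordinate at a time (central differences)
eps = 1e-6
num = np.array([
    (f(x + eps * e) - f(x - eps * e)) / (2 * eps)
    for e in np.eye(3)
])

print(np.max(np.abs(ana - num)))  # tiny: the identity holds
```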

// Chain rule for matrices
dL/dW = (dL/dZ) × X^T  [for Z = WX]
Use shape matching: (n_out × n_in) = (n_out × m) × (m × n_in)

// Hessian (2nd derivative)
H = ∂²f/∂x²  [matrix of 2nd partials]
Shape: (n × n) for f: ℝⁿ → ℝ
Convex function: H positive semi-definite

// Common derivatives (for ML)
∂(Wx+b)/∂W = x^T  [combine with upstream gradient]
∂sigmoid(z)/∂z = sigmoid(z)(1-sigmoid(z))
∂ReLU(z)/∂z = 1 if z>0, else 0
∂tanh(z)/∂z = 1 - tanh²(z)
