🧠 AI Computer Institute
Content is AI-generated for educational purposes. Verify critical information independently. A bharath.ai initiative.

Deep Learning Cheat Sheet


Visual Overview: Neural Network Architecture

Neural Network Architecture (diagram): Input Layer (4 features) → Hidden Layer 1 (ReLU) → Hidden Layer 2 (ReLU) → Output Layer (2 classes, softmax)

How it works:
1. Input features pass through fully connected (dense) layers
2. Each hidden layer applies weights and an activation function (ReLU)
3. The output layer produces predictions (softmax for classification)
4. The network learns by backpropagation, adjusting weights to minimize the loss

Forward pass: Input flows left→right through weighted connections and activations

Neural Network Architectures

Architecture | Input | Use Case | Key Feature
MLP (Fully Connected) | Vectors | Tabular data | Dense layers
CNN | Images | Computer vision | Convolution, pooling
RNN | Sequences | Text, time series | Recurrent connections
LSTM | Sequences | Long dependencies | Memory cells, gates
GRU | Sequences | Faster LSTM alternative | Simplified gates
Transformer | Sequences | NLP, machine translation | Self-attention, parallel
Vision Transformer | Images | Image classification | Attention on patches
Autoencoder | Any | Dimensionality reduction | Encoder-decoder
GAN | Noise | Image generation | Generator vs. discriminator

Activation Functions

Function | Formula | Range | When to Use
ReLU | max(0, x) | [0, ∞) | Hidden layers (default)
Leaky ReLU | x if x > 0, else 0.01x | (-∞, ∞) | Avoids dying ReLU
ELU | x if x > 0, else α(e^x - 1) | (-α, ∞) | Smooth negative values
Sigmoid | 1/(1 + e^-x) | (0, 1) | Binary classification output
Tanh | (e^2x - 1)/(e^2x + 1) | (-1, 1) | Zero-centered activation
Softmax | e^xi / Σ e^xj | (0, 1) | Multi-class output (probabilities)
Swish | x × sigmoid(βx) | (-∞, ∞) | Often outperforms ReLU
GELU | x × Φ(x) | (-∞, ∞) | Transformers (BERT, GPT)
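The table above maps directly onto PyTorch's functional API. A minimal sketch (the sample tensor `x` is illustrative, not from the original):

```python
import torch
import torch.nn.functional as F

x = torch.tensor([-2.0, -0.5, 0.0, 1.0, 3.0])

relu = F.relu(x)                # max(0, x): negatives clipped to 0
leaky = F.leaky_relu(x, 0.01)   # 0.01 * x for x < 0
sig = torch.sigmoid(x)          # squashes each value into (0, 1)
gelu = F.gelu(x)                # x * Φ(x), used in BERT/GPT
probs = F.softmax(x, dim=0)     # non-negative, sums to 1
```

Note that softmax normalizes across a dimension (here the whole vector), while the others apply elementwise.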

Loss Functions

// Classification
Cross-Entropy (Binary): -[y*log(ŷ) + (1-y)*log(1-ŷ)]
Cross-Entropy (Multi-class): -Σ yi*log(ŷi)
Focal Loss: Handles class imbalance, focuses on hard examples

// Regression
Mean Squared Error (MSE): (1/n)Σ(y - ŷ)²
Mean Absolute Error (MAE): (1/n)Σ|y - ŷ|
Huber Loss: Combines MSE & MAE, robust to outliers

// Distance/Ranking
Contrastive Loss: Bring similar samples close, push apart different
Triplet Loss: Anchor, positive, negative samples
ArcFace, CosFace: Face recognition losses

// Regularization
L1: Σ|w| (sparse)
L2: Σw² (weight decay, small weights)
Combined (Elastic Net): L1 + L2

Code example (PyTorch):
import torch.nn as nn
criterion = nn.CrossEntropyLoss()  # expects raw logits, not softmax outputs
loss = criterion(predictions, targets)
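The regression losses from the list above are available the same way. A small sketch (the `pred`/`target` values are made up for illustration):

```python
import torch
import torch.nn as nn

pred = torch.tensor([2.5, 0.0, 2.0])
target = torch.tensor([3.0, -0.5, 2.0])

mse = nn.MSELoss()(pred, target)               # (1/n) Σ (y - ŷ)²
mae = nn.L1Loss()(pred, target)                # (1/n) Σ |y - ŷ|
huber = nn.HuberLoss(delta=1.0)(pred, target)  # quadratic for small errors, linear for large
```

With errors of ±0.5 here, Huber stays in its quadratic region and behaves like half the MSE, which is exactly its "robust to outliers" design: large errors would switch to the linear branch.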

Optimizers

Optimizer | Update Rule | When to Use | Notes
SGD | w = w - lr × ∇L | Simple baseline | Can oscillate
SGD + Momentum | v = βv + ∇L; w = w - lr × v | Faster convergence | Popular choice
Nesterov | Look-ahead gradient | Faster than momentum | Better convergence
AdaGrad | Per-parameter learning rate | Sparse data | Learning rate shrinks over time
RMSprop | Divides by root of mean squared gradients | RNNs, good default | Adaptive learning rate
Adam | Momentum + RMSprop | Most popular default | Works well out of the box
AdamW | Adam + decoupled weight decay | Better than Adam + L2 | Recommended for modern DL
LAMB | Layer-wise adaptive learning rates | Large-batch training | Scales to very large batches
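A typical training step with AdamW, the table's recommended default, looks like this (a minimal sketch; the tiny `nn.Linear` model and random data are stand-ins):

```python
import torch
import torch.nn as nn

model = nn.Linear(4, 2)
# AdamW applies weight decay directly to the weights, decoupled from the gradient
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=0.01)

x = torch.randn(8, 4)             # batch of 8 samples, 4 features
y = torch.randint(0, 2, (8,))     # class labels
loss = nn.CrossEntropyLoss()(model(x), y)

optimizer.zero_grad()   # clear gradients from the previous step
loss.backward()         # backpropagation: compute ∇L
optimizer.step()        # update weights
```

The zero_grad → backward → step pattern is the same for every optimizer in torch.optim; only the constructor changes.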

Regularization Techniques

// Dropout
Randomly deactivate neurons during training
Prevents co-adaptation, reduces overfitting
p = 0.5 is common (50% dropout)

// Batch Normalization
Normalize layer inputs during training
Reduces internal covariate shift
Allows higher learning rates
Typically placed between the linear/conv layer and the activation (placement after the activation is also used)

// Layer Normalization
Normalize across features (not batch)
Better for RNNs, Transformers
Batch-size independent
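The difference between the two normalizations is which axis gets normalized. A quick sketch (the 32×64 shape is arbitrary):

```python
import torch
import torch.nn as nn

x = torch.randn(32, 64)    # (batch, features)

bn = nn.BatchNorm1d(64)    # normalizes each feature across the batch dimension
ln = nn.LayerNorm(64)      # normalizes each sample across its own features

out_bn = bn(x)
out_ln = ln(x)

# After LayerNorm, every row (sample) has ~zero mean and unit variance,
# which is why it works with any batch size, including batch size 1
row_means = out_ln.mean(dim=1)
```

BatchNorm's statistics depend on the batch, so it behaves differently at train vs. eval time; LayerNorm does not, which is one reason Transformers use it.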

// Weight Decay (L2 Regularization)
Penalizes large weights
Encourages simpler models
In PyTorch: optimizer with weight_decay parameter

// Early Stopping
Monitor validation loss
Stop when it stops improving
Prevents overfitting
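The stopping rule above can be sketched as a small helper (illustrative only; `early_stop_epoch` and `patience` are hypothetical names, not a standard API):

```python
def early_stop_epoch(val_losses, patience=3):
    """Return the epoch at which training stops, or None if it runs to the end.

    Stops once `patience` epochs pass without a new best validation loss.
    """
    best, best_epoch = float("inf"), 0
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best, best_epoch = loss, epoch          # new best: reset the counter
        elif epoch - best_epoch >= patience:
            return epoch                            # no improvement for `patience` epochs
    return None

# Best loss at epoch 2 (0.7); three non-improving epochs later, stop at epoch 5
early_stop_epoch([1.0, 0.8, 0.7, 0.71, 0.72, 0.73])
```

In practice you would also restore the weights saved at the best epoch, not the final ones.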

// Data Augmentation
Image: Rotation, flip, crop, color jitter
Text: Paraphrase, back-translation
Time-series: Scaling, noise

// Mix-Up
Create virtual samples: x' = λx_i + (1-λ)x_j
Smooths decision boundary
λ ~ Beta(α, α)
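The mix-up formula translates to a few lines of PyTorch (a sketch; the `mixup` helper and the zeros/ones batches are illustrative):

```python
import torch

def mixup(x1, x2, alpha=0.2):
    """Return x' = lam * x1 + (1 - lam) * x2 with lam ~ Beta(alpha, alpha)."""
    lam = torch.distributions.Beta(alpha, alpha).sample().item()
    return lam * x1 + (1 - lam) * x2, lam

a = torch.zeros(4, 3)   # stand-in batch 1
b = torch.ones(4, 3)    # stand-in batch 2
mixed, lam = mixup(a, b)
# every element of `mixed` equals 1 - lam
```

Labels are mixed with the same λ, so the loss becomes a λ-weighted sum of the two per-sample losses.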

// CutMix
Cut and paste regions between images
Improves robustness

Convolutional Neural Networks (CNN)

// Convolution Operation
Filter size: 3x3, 5x5, 7x7 common
Stride: How much filter moves (1 or 2)
Padding: Add zeros around input
Output size = ⌊(input - kernel + 2×padding) / stride⌋ + 1
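The output-size formula is worth checking numerically (a sketch; `conv_out_size` is a hypothetical helper):

```python
def conv_out_size(n, k, p=0, s=1):
    """Output size = floor((n - k + 2p) / s) + 1 for input n, kernel k, padding p, stride s."""
    return (n - k + 2 * p) // s + 1

conv_out_size(224, 3, p=1, s=1)   # 3x3 conv with padding 1 preserves 224
conv_out_size(224, 2, p=0, s=2)   # 2x2 pooling with stride 2 halves to 112
```

This is why 3×3 convolutions with padding 1 are so common: they keep spatial size unchanged, leaving downsampling entirely to the pooling (or strided-conv) layers.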

// Pooling
Max Pooling: Take maximum value
Average Pooling: Take average
Stride usually 2, pooling size 2x2

// Common Architectures
AlexNet: Deep CNN, ImageNet breakthrough (2012)
VGG: Deeper, simpler (3x3 filters)
ResNet: Skip connections, very deep (50-152 layers)
Inception: Multi-scale features
MobileNet: Efficient, mobile-friendly
EfficientNet: Scales depth/width/resolution

// Typical CNN Structure
Input → Conv → ReLU → Pool → ... (repeat) → FC → Softmax

// Example (PyTorch) — assumes 3×224×224 input images
import torch.nn as nn
model = nn.Sequential(
    nn.Conv2d(3, 32, kernel_size=3, padding=1),    # 3 → 32 channels, size preserved
    nn.ReLU(),
    nn.MaxPool2d(2),                               # 224 → 112
    nn.Conv2d(32, 64, kernel_size=3, padding=1),   # 32 → 64 channels
    nn.ReLU(),
    nn.MaxPool2d(2),                               # 112 → 56
    nn.Flatten(),
    nn.Linear(64 * 56 * 56, 10)                    # 64 channels × 56 × 56 spatial → 10 classes
)

Recurrent & Attention

// LSTM (Long Short-Term Memory)
Cell state: Carries info across time
Gates: Input, output, forget
Mitigates vanishing gradient problem
Good for sequences with long dependencies

// GRU (Gated Recurrent Unit)
Simpler than LSTM (2 gates vs 3)
Slightly faster
Similar performance

// Transformer (Self-Attention)
Query (Q), Key (K), Value (V) vectors
Attention = softmax(QK^T / √d_k) × V
Parallel processing (vs sequential RNN)
Basis for BERT, GPT
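The scaled dot-product attention formula above is short enough to write out directly (a sketch; the `attention` helper and the tensor shapes are illustrative):

```python
import math
import torch
import torch.nn.functional as F

def attention(q, k, v):
    """softmax(Q K^T / sqrt(d_k)) V"""
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / math.sqrt(d_k)  # (batch, seq, seq) similarity
    weights = F.softmax(scores, dim=-1)                # each row sums to 1
    return weights @ v, weights

q = torch.randn(1, 5, 16)   # (batch, seq_len, d_k)
k = torch.randn(1, 5, 16)
v = torch.randn(1, 5, 32)   # value dim can differ from d_k
out, w = attention(q, k, v)  # out: (1, 5, 32)
```

The √d_k scaling keeps the dot products from growing with dimension, which would otherwise push the softmax into regions with tiny gradients. Because every position attends to every other in one matrix product, the whole sequence is processed in parallel, unlike an RNN.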

// Positional Encoding
Sin/cos functions based on position
Helps model learn position information
Absolute or relative positions
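The sinusoidal (absolute) variant can be sketched as follows; the helper name and the 50×64 shape are illustrative:

```python
import math
import torch

def positional_encoding(seq_len, d_model):
    """Sinusoidal encodings: sin on even dimensions, cos on odd dimensions."""
    pos = torch.arange(seq_len).unsqueeze(1).float()          # (seq_len, 1)
    div = torch.exp(torch.arange(0, d_model, 2).float()
                    * (-math.log(10000.0) / d_model))         # per-dimension frequencies
    pe = torch.zeros(seq_len, d_model)
    pe[:, 0::2] = torch.sin(pos * div)
    pe[:, 1::2] = torch.cos(pos * div)
    return pe

pe = positional_encoding(50, 64)   # added to the token embeddings before the first layer
```

Each dimension oscillates at a different frequency, so every position gets a distinct pattern, and relative offsets correspond to fixed linear transformations of the encoding.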

// Multi-Head Attention
Multiple attention heads in parallel
Captures different types of relationships
Concatenate and project results

// Sequence to Sequence (Seq2Seq)
Encoder: Process input sequence
Decoder: Generate output sequence
With attention: Decoder attends to encoder

// Example use cases
LSTM: Time-series forecasting, speech recognition
Transformer: Machine translation, text generation, classification
