NumPy: Supercharge Your Numerical Computing by 10-100x
NumPy powers data science, machine learning, and scientific computing worldwide. Organizations from Google and NASA to ISRO rely on it to process massive datasets. Pure Python loops are slow; NumPy arrays are typically 10-100x faster because their operations run in compiled C code. In this chapter, you'll learn vectorized operations that replace loops and unlock professional-grade data processing.
Why NumPy? Python is Slow, NumPy is Fast
Consider processing 1 million student marks:
import numpy as np
import time
marks = list(range(1000000)) # 1 million marks
# Pure Python (SLOW)
start = time.time()
squared = []
for mark in marks:
squared.append(mark ** 2)
python_time = time.time() - start
print(f"Python loop: {python_time:.3f} seconds")
# NumPy (FAST!)
marks_array = np.array(marks)
start = time.time()
squared_numpy = marks_array ** 2
numpy_time = time.time() - start
print(f"NumPy: {numpy_time:.3f} seconds")
print(f"NumPy is {python_time / numpy_time:.0f}x faster!")
# Typical output: "NumPy is 50x faster!" (the exact speedup varies by machine)
Creating NumPy Arrays: The Foundation
Arrays are NumPy's core data structure. They're like lists but much more powerful:
import numpy as np
print("=== Creating arrays ===")
# From Python list
arr = np.array([1, 2, 3, 4, 5])
print(f"Array from list: {arr}")
print(f"Array shape: {arr.shape}, dtype: {arr.dtype}") # shape: (5,), dtype: int64 (platform-dependent)
# 2D array (matrix)
matrix = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
print(f"2D array shape: {matrix.shape}") # (3, 3)
# Generate special arrays
print("\n=== Special arrays ===")
zeros = np.zeros(5) # [0. 0. 0. 0. 0.]
ones = np.ones((2, 3)) # [[1. 1. 1.]
# [1. 1. 1.]]
identity = np.eye(3) # Identity matrix (1s on diagonal)
range_arr = np.arange(0, 10, 2) # [0 2 4 6 8]
linspace_arr = np.linspace(0, 1, 5) # [0. 0.25 0.5 0.75 1. ]
print(f"Zeros: {zeros}")
print(f"Range: {range_arr}")
print(f"Linspace: {linspace_arr}")
# Random arrays (useful for simulations)
print("\n=== Random arrays ===")
random_marks = np.random.randint(0, 100, 10) # 10 random marks 0-99
print(f"Random marks: {random_marks}")
random_normal = np.random.normal(75, 10, 5) # Normal distribution (mean=75, std=10)
print(f"Random normal: {random_normal}")
Array Indexing and Slicing: Smart Data Access
import numpy as np
arr = np.array([10, 20, 30, 40, 50, 60, 70, 80, 90, 100])
print(f"Original array: {arr}")
print("\n=== 1D Indexing ===")
print(f"First element: {arr[0]}") # 10
print(f"Last element: {arr[-1]}") # 100
print(f"Fifth element: {arr[4]}") # 50
print("\n=== 1D Slicing ===")
print(f"Elements 2-5: {arr[2:5]}") # [30 40 50]
print(f"Every 2nd element: {arr[::2]}") # [10 30 50 70 90]
print(f"Reversed: {arr[::-1]}") # [100 90 80 ... 20 10]
print(f"Last 3 elements: {arr[-3:]}") # [80 90 100]
print("\n=== 2D Indexing (Matrix) ===")
matrix = np.arange(1, 13).reshape(3, 4)
print("Matrix:")
print(matrix)
# Output:
# [[ 1 2 3 4]
# [ 5 6 7 8]
# [ 9 10 11 12]]
print(f"Element at row 1, col 2: {matrix[1, 2]}") # 7
print(f"First row: {matrix[0, :]}") # [1 2 3 4]
print(f"Second column: {matrix[:, 1]}") # [2 6 10]
print(f"Submatrix (first 2 rows, last 2 cols):")
print(matrix[0:2, 2:4]) # [[3 4] [7 8]]
# Real example: Extract student marks for specific subjects
print("\n=== Real Example: Student Marks ===")
student_marks = np.array([
[92, 88, 85], # Student 1: Math, Science, English
[78, 82, 79], # Student 2
[95, 93, 94], # Student 3
])
print(f"All Science scores: {student_marks[:, 1]}") # [88 82 93]
print(f"Top 2 students:\n{student_marks[:2, :]}")
Vectorized Operations: Replace Loops with Array Operations
This is where NumPy shines. Operations on arrays happen element-wise automatically:
import numpy as np
marks1 = np.array([92, 88, 95, 78, 85])
marks2 = np.array([88, 90, 92, 80, 87])
print(f"Marks 1: {marks1}")
print(f"Marks 2: {marks2}")
print("\n=== Element-wise Operations ===")
print(f"Sum: {marks1 + marks2}") # [180 178 187 158 172]
print(f"Difference: {marks1 - marks2}") # [4 -2 3 -2 -2]
print(f"Product: {marks1 * marks2}") # Element-wise multiply
print(f"Division: {marks1 / marks2}") # Element-wise divide
print(f"Power: {marks1 ** 2}") # Square each mark
print("\n=== Statistical Operations ===")
all_marks = np.array([92, 88, 95, 78, 85, 91, 76, 89])
print(f"Mean: {all_marks.mean()}") # Average
print(f"Median: {np.median(all_marks)}") # Middle value
print(f"Std Dev: {all_marks.std()}") # Spread
print(f"Min: {all_marks.min()}, Max: {all_marks.max()}")
print(f"Sum: {all_marks.sum()}")
print(f"Percentile 75: {np.percentile(all_marks, 75)}")
print("\n=== Filtering Data ===")
high_performers = all_marks[all_marks > 85]
print(f"Marks > 85: {high_performers}")
passed = all_marks[all_marks >= 60]
print(f"Passed (>= 60): {passed}")
print("\n=== More Complex Operations ===")
# Normalize marks to 0-1 scale
normalized = (all_marks - all_marks.min()) / (all_marks.max() - all_marks.min())
print(f"Normalized marks: {normalized}")
# Calculate percentage improvement
old_marks = np.array([70, 75, 80, 65])
new_marks = np.array([85, 92, 88, 78])
improvement = ((new_marks - old_marks) / old_marks) * 100
print(f"Percentage improvement: {improvement.round(1)}%")
Broadcasting: Let NumPy Handle the Details
Broadcasting automatically expands arrays to match shapes for operations:
import numpy as np
print("=== Broadcasting Example ===")
# Test scores: 5 students, 3 subjects
marks = np.array([
[80, 85, 90], # Student 1
[75, 88, 92], # Student 2
[90, 87, 85], # Student 3
[82, 84, 89], # Student 4
[95, 91, 88], # Student 5
])
# Subject weights
weights = np.array([0.3, 0.3, 0.4]) # Math 30%, Science 30%, English 40%
print("Original marks:")
print(marks)
print(f"Weights: {weights}")
# Weighted score (broadcasting does the work!)
weighted_marks = marks * weights
print("\nWeighted marks (marks * weights):")
print(weighted_marks)
# Final score per student
final_scores = weighted_marks.sum(axis=1)
print(f"\nFinal scores: {final_scores}")
# Subtract average from each student
average_per_student = marks.mean(axis=1, keepdims=True)
deviation = marks - average_per_student
print("\nDeviation from student's average:")
print(deviation.round(2))
Matrix Operations: Linear Algebra with NumPy
Solve real mathematics problems with matrix operations:
import numpy as np
print("=== Matrix Operations ===")
A = np.array([[1, 2], [3, 4]])
B = np.array([[5, 6], [7, 8]])
print("Matrix A:")
print(A)
print("\nMatrix B:")
print(B)
# Matrix multiplication (not element-wise)
C = np.dot(A, B)
print("\nA · B (dot product):")
print(C)
# Or use @ operator (Python 3.5+)
C_alt = A @ B
print("\nA @ B (same result):")
print(C_alt)
# Transpose
print("\nTranspose of A:")
print(A.T)
# Determinant
det_A = np.linalg.det(A)
print(f"\nDeterminant of A: {det_A}")
# Inverse
inv_A = np.linalg.inv(A)
print("\nInverse of A:")
print(inv_A)
# Verify: A · A^-1 = Identity
identity_check = A @ inv_A
print("\nA @ A^-1 (should be identity):")
print(np.round(identity_check))
# Eigenvalues and eigenvectors
eigenvalues, eigenvectors = np.linalg.eig(A)
print(f"\nEigenvalues: {eigenvalues}")
print(f"Eigenvectors:")
print(eigenvectors)
# Solve Ax = b
b = np.array([5, 11])
x = np.linalg.solve(A, b)
print(f"\nSolving Ax = b where b = {b}:")
print(f"Solution x: {x}")
print(f"Verification (A @ x): {A @ x}")
Real-World Example: ISRO Satellite Image Processing
ISRO uses NumPy to process satellite data for crop monitoring across India:
import numpy as np
print("=== ISRO Satellite Image Analysis ===")
# Simulate satellite image (multi-spectral)
# Each image is bands: Red, Green, Blue, NIR (Near Infrared)
height, width = 100, 100
red_band = np.random.randint(50, 200, (height, width))
green_band = np.random.randint(50, 200, (height, width))
blue_band = np.random.randint(50, 200, (height, width))
nir_band = np.random.randint(50, 200, (height, width))
print(f"Image dimensions: {red_band.shape}")
# Calculate NDVI (Normalized Difference Vegetation Index)
# Higher NDVI = healthier vegetation
ndvi = (nir_band.astype(float) - red_band) / (nir_band + red_band + 1e-8)
print("\nNDVI Statistics:")
print(f"Mean NDVI: {ndvi.mean():.3f}")
print(f"Min NDVI: {ndvi.min():.3f}, Max NDVI: {ndvi.max():.3f}")
print(f"Std Dev: {ndvi.std():.3f}")
# Identify healthy crops (NDVI > 0.6)
# Note: with uniform random bands in [50, 200), NDVI stays within roughly ±0.6,
# so this standard threshold will match few (often zero) pixels in this simulation
healthy_crops = (ndvi > 0.6).sum()
total_pixels = ndvi.size
health_percentage = (healthy_crops / total_pixels) * 100
print("\nCrop Health Analysis:")
print(f"Healthy pixels: {healthy_crops}/{total_pixels} ({health_percentage:.1f}%)")
# Find areas needing irrigation (NDVI < 0.3)
stressed_areas = (ndvi < 0.3).sum()
print(f"Stressed areas: {stressed_areas} pixels")
# Calculate vegetation index for better visualization
vegetation_mask = np.where(ndvi > 0.4, 1, 0)
print(f"Vegetation coverage: {vegetation_mask.sum() / vegetation_mask.size * 100:.1f}%")
Advanced: Multi-Dimensional Arrays and Reshaping
Work with 3D and higher-dimensional data for complex applications:
import numpy as np
print("=== Multi-Dimensional Arrays ===")
# 3D array: 3 classes, 4 students, 3 subjects
school_data = np.array([
# Class 8
[[92, 88, 85], [78, 82, 79], [95, 93, 94], [81, 80, 82]],
# Class 9
[[88, 90, 87], [85, 88, 90], [92, 91, 93], [79, 81, 78]],
# Class 10
[[95, 94, 93], [90, 92, 91], [88, 89, 87], [93, 95, 94]],
])
print(f"Shape: {school_data.shape}") # (3, 4, 3)
print(f"Class 8, Student 1, Math: {school_data[0, 0, 0]}") # 92
print(f"All Class 9 English scores: {school_data[1, :, 2]}") # [87 90 93 78]
# Reshape arrays
print("\n=== Reshaping ===")
arr = np.arange(24) # 1D array with 24 elements
print(f"Original shape: {arr.shape}")
# Reshape to 2x3x4
reshaped = arr.reshape(2, 3, 4)
print(f"Reshaped to (2, 3, 4): {reshaped.shape}")
# Flatten back to 1D
flattened = reshaped.flatten()
print(f"Flattened: {flattened}")
# Transpose (swap axes)
print("\n=== Transpose ===")
matrix = np.arange(12).reshape(3, 4)
print("Original (3x4):")
print(matrix)
print("\nTransposed (4x3):")
print(matrix.T)
# Stacking arrays
print("\n=== Stacking Arrays ===")
arr1 = np.array([1, 2, 3])
arr2 = np.array([4, 5, 6])
stacked_v = np.vstack([arr1, arr2]) # Vertical stack
print("Vertical stack:")
print(stacked_v) # 2x3
stacked_h = np.hstack([arr1, arr2]) # Horizontal stack
print("Horizontal stack:")
print(stacked_h) # [1 2 3 4 5 6]
Working with Large Datasets: Handling Missing Data
Real-world datasets have missing values. NumPy provides tools to handle them:
import numpy as np
print("=== Handling Missing Data ===")
# Create data with NaN (Not a Number)
marks = np.array([92, 88, np.nan, 78, np.nan, 85, 91, 76])
print(f"Data with missing values: {marks}")
# Count missing values
missing_count = np.isnan(marks).sum()
print(f"Missing values: {missing_count}")
# Get valid values only
valid_marks = marks[~np.isnan(marks)]
print(f"Valid marks: {valid_marks}")
# Calculate statistics ignoring NaN
print(f"Mean (ignoring NaN): {np.nanmean(marks):.2f}")
print(f"Median (ignoring NaN): {np.nanmedian(marks):.2f}")
print(f"Std Dev (ignoring NaN): {np.nanstd(marks):.2f}")
# Fill missing values with mean
marks_filled = marks.copy()
marks_filled[np.isnan(marks_filled)] = np.nanmean(marks)
print(f"After filling with mean: {marks_filled}")
# Forward fill (use previous value)
marks_ffill = marks.copy()
mask = np.isnan(marks_ffill)
idx = np.where(~mask, np.arange(len(mask)), 0)
idx = np.maximum.accumulate(idx)
marks_ffill[mask] = marks_ffill[idx[mask]]
print(f"After forward fill: {marks_ffill}")
Real-World Example: Processing Aadhaar-Based School Attendance Data
Aadhaar-linked systems are used to track attendance in schools across India. Here's how such data can be processed:
import numpy as np
print("=== School Attendance Analytics ===")
# Attendance data: 30 students, 200 school days
np.random.seed(42)
# Give each student their own attendance rate between 60% and 100%, so the
# low-attendance and perfect-attendance checks below are meaningful
rates = np.random.uniform(0.6, 1.0, 30)
attendance = (np.random.rand(30, 200) < rates[:, None]).astype(int) # 1 = present, 0 = absent
# Calculate attendance percentage per student
attendance_percent = (attendance.sum(axis=1) / attendance.shape[1]) * 100
print(f"Attendance statistics:")
print(f"Mean attendance: {attendance_percent.mean():.1f}%")
print(f"Min attendance: {attendance_percent.min():.1f}%")
print(f"Max attendance: {attendance_percent.max():.1f}%")
# Find students with low attendance (<75%)
low_attendance = np.where(attendance_percent < 75)[0]
print(f"\nStudents with <75% attendance: {low_attendance + 1}")
print(f"Their attendance: {attendance_percent[low_attendance].round(1)}%")
# Days with highest/lowest overall attendance
daily_attendance = (attendance.sum(axis=0) / attendance.shape[0]) * 100
worst_day = daily_attendance.argmin()
best_day = daily_attendance.argmax()
print(f"\nWorst attendance day: Day {worst_day + 1} ({daily_attendance[worst_day]:.1f}%)")
print(f"Best attendance day: Day {best_day + 1} ({daily_attendance[best_day]:.1f}%)")
# Find students with perfect attendance
perfect = np.where(attendance_percent == 100)[0]
print(f"\nStudents with perfect attendance: {len(perfect)} - IDs: {perfect + 1}")
# 30-day rolling average of daily attendance
rolling_avg = np.convolve(daily_attendance, np.ones(30)/30, mode='valid')
print(f"30-day rolling average shape: {rolling_avg.shape}")
Performance Optimization: From Python Loops to NumPy
import numpy as np
import time
# Problem: Calculate distance from each point to centroid
# Data: 100,000 points, 3 dimensions
np.random.seed(42)
points = np.random.randn(100000, 3)
centroid = points.mean(axis=0)
# SLOW: Pure Python with loops
start = time.time()
distances_python = []
for point in points:
dist = np.sqrt(sum((point - centroid) ** 2))
distances_python.append(dist)
python_time = time.time() - start
# FAST: NumPy vectorized
start = time.time()
distances_numpy = np.sqrt(np.sum((points - centroid) ** 2, axis=1))
numpy_time = time.time() - start
print(f"Python loops: {python_time:.4f} seconds")
print(f"NumPy vectorized: {numpy_time:.4f} seconds")
print(f"NumPy is {python_time / numpy_time:.0f}x faster!")
# Verify results match
print(f"Results match: {np.allclose(distances_python, distances_numpy)}")
Saving and Loading NumPy Arrays
import numpy as np
# Create data
marks = np.array([[92, 88, 85], [78, 82, 79], [95, 93, 94]])
# Save as binary format (fast, small)
np.save('marks.npy', marks)
loaded = np.load('marks.npy')
print(f"Loaded from .npy: {loaded}")
# Save multiple arrays
np.savez('school_data.npz', marks=marks, classes=['8A', '8B'])
data = np.load('school_data.npz')
print(f"Loaded marks from .npz: {data['marks']}")
# Save as CSV (human-readable)
np.savetxt('marks.csv', marks, delimiter=',', fmt='%d')
loaded_csv = np.loadtxt('marks.csv', delimiter=',', dtype=int)
print(f"Loaded from CSV: {loaded_csv}")
Performance Tips: Make NumPy Even Faster
- Use vectorized operations instead of loops (100x faster)
- Use appropriate dtypes (int32 vs int64, float32 vs float64) to save memory
- Avoid creating temporary arrays—chain operations when possible
- Use in-place operations: arr += 5 instead of arr = arr + 5
- Use memory-mapped arrays for huge files: np.load('file.npy', mmap_mode='r')
- For millions of rows with missing data, use pandas which builds on NumPy
- Use np.nditer for complex multi-dimensional operations
- Profile code with timeit to find bottlenecks before optimizing
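To make a few of these tips concrete, here is a small sketch (the array sizes and names are illustrative) showing the memory saved by a smaller dtype, the difference between in-place and copying operations, and a quick timeit profile:

```python
import numpy as np
import timeit

# dtype choice: int32 holds the same small values in half the memory of int64
big = np.arange(1_000_000, dtype=np.int64)
small = big.astype(np.int32)
print(f"int64: {big.nbytes} bytes, int32: {small.nbytes} bytes")

# In-place operation: modifies the existing buffer instead of allocating a new array
a = np.ones(1_000_000)
a += 5          # in-place, no temporary array
b = np.ones(1_000_000)
b = b + 5       # allocates a new array, then rebinds b

# Profile before optimizing: time 1,000 repetitions of squaring 10,000 ints
t = timeit.timeit("x ** 2",
                  setup="import numpy as np; x = np.arange(10_000)",
                  number=1_000)
print(f"Squaring 10,000 ints 1,000 times: {t:.4f} s")
```

Both `a` and `b` end up holding the same values; the difference is that the in-place version avoids a temporary million-element allocation, which matters when arrays are large or the operation runs in a loop.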
Key Takeaways
- NumPy arrays are 10-100x faster than Python lists for numerical operations
- Always think vectorized—replace loops with array operations
- Slicing and indexing work on multi-dimensional arrays automatically
- Broadcasting lets you operate on different-shaped arrays without loops
- Linear algebra functions (determinant, inverse, eigenvalues) are built in
- Reshape, transpose, and stack to transform data for analysis
- Handle missing data with nan functions (nanmean, nanmedian)
- Save arrays efficiently with .npy format or .csv for compatibility
- ISRO, TCS, and all data science companies use NumPy in production daily
- NumPy is the foundation for pandas, scikit-learn, and TensorFlow
Practice Problems
- Create a 5x5 matrix and extract diagonal elements using np.diag()
- Calculate mean and standard deviation of 100 random test scores
- Create two matrices and perform element-wise and matrix multiplication
- Filter an array of marks and return only those in range 70-90
- Normalize marks to 0-1 scale using (marks - min) / (max - min)
- Create 3x4 matrix of student grades, calculate average by subject using broadcasting
- Simulate 10,000 student marks and find percentiles (25th, 50th, 75th, 90th)
- Process attendance data with missing values using nanmean and filling techniques
- Reshape a 1D array of 100 elements into different 2D configurations
- Calculate correlation matrix between 4 subjects from student marks data
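For the last problem, the chapter hasn't introduced np.corrcoef, so here is a minimal sketch (the marks below are made up). Note that np.corrcoef treats each row as one variable, so a (students x subjects) matrix must be transposed first:

```python
import numpy as np

# Made-up marks: 4 students (rows) x 4 subjects (columns)
marks = np.array([
    [92, 88, 85, 90],
    [78, 82, 79, 75],
    [95, 93, 94, 96],
    [60, 65, 70, 62],
])

# Transpose so each row is one subject's scores across students
corr = np.corrcoef(marks.T)   # 4x4 correlation matrix between subjects
print(corr.round(2))
```

The result is symmetric with 1.0 on the diagonal (every subject correlates perfectly with itself); off-diagonal entries near 1.0 mean students who do well in one subject tend to do well in the other.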
Challenge Exercises
Think about this: How would you explain NumPy basics to someone who has never programmed before? What analogy or metaphor would make it click? If you were building a real application, which concepts from this chapter would you use first?
Try this exercise: implement one concept from this chapter from scratch, without looking at the examples. Then compare your solution. What did you learn?