AI Computer Institute
Expert-curated CS & AI curriculum aligned to CBSE standards. A bharath.ai initiative.

Working with CSV and JSON Data

📚 Databases & Data Science · ⏱️ 17 min read · 🎓 Grade 8
✍️ AI Computer Institute Editorial Team · Published: March 2026 · CBSE-aligned · Peer-reviewed
Content curated by subject matter experts with IIT/NIT backgrounds. All chapters are fact-checked against official CBSE/NCERT syllabi.

Working with CSV and JSON Data: Handling Real-World Information

Every day, millions of data files flow through the internet. Banking systems, e-commerce platforms, weather forecasts, and medical records all rely on structured data formats like CSV and JSON. This chapter teaches you to read, parse, transform, and generate these data formats—essential skills for any data scientist, backend engineer, or data analyst.

Understanding CSV: The Universal Data Format

CSV (Comma-Separated Values) is one of the most widely used data formats. Excel spreadsheets, government datasets, and database exports can all be saved as CSV. Here's a sample file:

# File: student_marks.csv (actual content)
roll_no,name,math_score,science_score,english_score,grade
101,Raj,92,88,85,8
102,Priya,88,95,90,8
103,Amit,78,82,79,8
104,Neha,95,93,94,8
105,Arjun,70,68,72,8

# Notice: First row is header, data follows
# Commas separate values, newlines separate rows
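One rule the sample above doesn't show: a field that itself contains a comma must be wrapped in quotes, otherwise it would split into extra columns. The csv module used throughout this chapter applies that quoting rule automatically. A quick sketch:

```python
import csv
from io import StringIO

# The address field contains commas, so it is quoted in the file;
# csv.reader keeps it as a single value instead of splitting it.
sample = 'name,address\nRaj,"12, MG Road, Delhi"\n'
rows = list(csv.reader(StringIO(sample)))
print(rows[1])  # ['Raj', '12, MG Road, Delhi'] — one address field, not three
```

The same happens on output: csv.writer adds the quotes for you whenever a value contains a comma, quote, or newline.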
      

Reading CSV Files with Different Approaches

Learn multiple techniques for different scenarios:

import csv

# Approach 1: Simple reading (list of rows)
print("=== Reading as list of lists ===")
with open('student_marks.csv', 'r') as file:
    reader = csv.reader(file)
    for i, row in enumerate(reader):
        if i == 0:
            print(f"Header: {row}")
        else:
            print(f"Row {i}: {row}")
    # Output: Row 1: ['101', 'Raj', '92', '88', '85', '8']

# Approach 2: Reading as dictionaries (much better for real work)
print("\n=== Reading with column names ===")
with open('student_marks.csv', 'r') as file:
    reader = csv.DictReader(file)
    for row in reader:
        print(f"Student: {row['name']}, Math: {row['math_score']}")
        # row['name'] is easier than row[1]!

# Approach 3: Filter and process while reading
print("\n=== Filter high performers ===")
high_performers = []
with open('student_marks.csv', 'r') as file:
    reader = csv.DictReader(file)
    for row in reader:
        total = int(row['math_score']) + int(row['science_score']) + int(row['english_score'])
        avg = total / 3
        if avg > 85:
            high_performers.append({
                'name': row['name'],
                'average': round(avg, 2),
                'grade': row['grade']
            })

for student in high_performers:
    print(f"{student['name']}: Avg {student['average']}")
      

Writing CSV Files: Creating Datasets

Generate CSV files for reports, exports, or data sharing:

import csv
from datetime import datetime

# Write student performance report
print("=== Writing CSV output ===")

data = [
    {'roll_no': 101, 'name': 'Raj', 'math': 92, 'science': 88, 'grade': 8},
    {'roll_no': 102, 'name': 'Priya', 'math': 88, 'science': 95, 'grade': 8},
    {'roll_no': 103, 'name': 'Amit', 'math': 78, 'science': 82, 'grade': 8},
    {'roll_no': 104, 'name': 'Neha', 'math': 95, 'science': 93, 'grade': 8},
]

# Write data to CSV
output_filename = f"report_{datetime.now().strftime('%Y%m%d')}.csv"
with open(output_filename, 'w', newline='') as file:
    fieldnames = ['roll_no', 'name', 'math', 'science', 'grade']
    writer = csv.DictWriter(file, fieldnames=fieldnames)

    # Write header
    writer.writeheader()

    # Write data rows
    writer.writerows(data)

print(f"Report written to {output_filename}")

# Real example: Export filtered data
print("\n=== Export only Math A+ students ===")
math_toppers = [row for row in data if row['math'] >= 90]

with open('math_toppers.csv', 'w', newline='') as file:
    fieldnames = ['name', 'math_score']
    writer = csv.DictWriter(file, fieldnames=fieldnames)
    writer.writeheader()
    for student in math_toppers:
        writer.writerow({'name': student['name'], 'math_score': student['math']})
      

JSON: Structured Data for APIs and Web Services

JSON (JavaScript Object Notation) is the standard format for web APIs. Most API responses, from UPI payment confirmations to weather services and social media feeds, arrive as JSON.

import json

# JSON structure (notice keys, types, nesting)
json_string = '''
{
  "school_id": "DPS001",
  "school_name": "Delhi Public School",
  "established": 1995,
  "is_residential": false,
  "class_8_students": [
    {
      "roll_no": 101,
      "name": "Raj Kumar",
      "age": 14,
      "subjects": ["Math", "Science", "English"],
      "marks": {"math": 92, "science": 88, "english": 85}
    },
    {
      "roll_no": 102,
      "name": "Priya Singh",
      "age": 14,
      "subjects": ["Math", "Science", "English"],
      "marks": {"math": 88, "science": 95, "english": 90}
    }
  ],
  "contact": {
    "phone": "+91-11-2345-6789",
    "email": "info@dps.edu.in",
    "location": {"city": "Delhi", "state": "Delhi"}
  }
}
'''

# Parse JSON string to Python dictionary
print("=== Parse JSON ===")
data = json.loads(json_string)
print(f"School: {data['school_name']}")
print(f"Established: {data['established']}")
print(f"Students: {len(data['class_8_students'])}")

# Access nested data
print("\n=== Access nested values ===")
for student in data['class_8_students']:
    print(f"Roll {student['roll_no']}: {student['name']} - Math: {student['marks']['math']}")

# Convert Python object back to JSON string
print("\n=== Write JSON ===")
new_student = {
    "roll_no": 103,
    "name": "Amit Patel",
    "age": 14,
    "subjects": ["Math", "Science", "English"],
    "marks": {"math": 78, "science": 82, "english": 79}
}

# Add to existing data
data['class_8_students'].append(new_student)

# Save to file
with open('school_data.json', 'w') as f:
    json.dump(data, f, indent=2)  # indent=2 for readable formatting

print("JSON saved to school_data.json")
      

CSV vs JSON: When to Use Each

Format | Best For                                  | Example
CSV    | Tabular data, spreadsheets, simple data   | Student marks, sales records
CSV    | Smaller file size                         | Large datasets needing quick processing
JSON   | Complex nested data, APIs, configuration  | User profiles with addresses, API responses
JSON   | Human-readable structured data            | Configuration files, web APIs
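To make the file-size row concrete, here is a small sketch (the exact byte counts depend on the data): CSV states the column names once in the header, while JSON repeats the keys inside every record.

```python
import csv
import json
from io import StringIO

# The same three records, serialized both ways
records = [
    {"state": "Rajasthan", "rainfall": 500},
    {"state": "Kerala", "rainfall": 3000},
    {"state": "Delhi", "rainfall": 700},
]

# CSV: one header line, then bare values
buf = StringIO()
writer = csv.DictWriter(buf, fieldnames=["state", "rainfall"])
writer.writeheader()
writer.writerows(records)
csv_text = buf.getvalue()

# JSON: the keys "state" and "rainfall" appear in every record
json_text = json.dumps(records)

print(f"CSV:  {len(csv_text)} characters")
print(f"JSON: {len(json_text)} characters")
```

For a handful of rows the difference is small, but over millions of records the repeated keys add up, which is why large tabular exports usually ship as CSV.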

Data Cleaning: Handling Real-World Messy Data

Real-world data is rarely clean. Missing values, typos, and inconsistent formats are the norm, so you must clean data before analysis.

import csv

print("=== Data cleaning example ===")

# Raw data with problems
raw_data = [
    {'name': 'Raj', 'score': '92', 'grade': '8'},
    {'name': 'Priya', 'score': '', 'grade': '8'},           # Missing score
    {'name': 'Amit', 'score': 'N/A', 'grade': ''},          # N/A and missing grade
    {'name': '  Neha  ', 'score': '95', 'grade': '8'},      # Extra spaces
    {'name': 'Arjun', 'score': '70', 'grade': 'VIII'},      # Inconsistent format
]

def clean_data(row):
    """Clean a single data row"""
    cleaned = {}

    # Clean name: strip whitespace, title case
    cleaned['name'] = row['name'].strip().title()

    # Clean score: handle missing and invalid values
    score = row['score'].strip()
    if score in ['', 'N/A', 'NA', 'null']:
        cleaned['score'] = None  # Represent missing data as None
    else:
        try:
            cleaned['score'] = int(score)
        except ValueError:
            cleaned['score'] = None

    # Clean grade: standardize format
    grade = row['grade'].strip()
    if grade in ['VIII', '8', '8th']:
        cleaned['grade'] = '8'
    else:
        cleaned['grade'] = grade if grade else None

    return cleaned

print("Original data:")
for row in raw_data:
    print(row)

print("\nCleaned data:")
cleaned_rows = [clean_data(row) for row in raw_data]
for row in cleaned_rows:
    print(row)

# Save cleaned data
with open('student_marks_clean.csv', 'w', newline='') as f:
    writer = csv.DictWriter(f, fieldnames=['name', 'score', 'grade'])
    writer.writeheader()
    writer.writerows(cleaned_rows)
      

Real-World Example: Processing Zomato Restaurant Data

import json
import csv

# Sample JSON in the style of a Zomato API response
zomato_json = '''
{
  "restaurants": [
    {
      "id": 1,
      "name": "Spice Route",
      "cuisine": ["North Indian", "Mughlai"],
      "rating": 4.5,
      "avg_cost_for_two": 500,
      "location": "New Delhi",
      "is_vegetarian": false
    },
    {
      "id": 2,
      "name": "Green Plate",
      "cuisine": ["Vegan", "Organic"],
      "rating": 4.2,
      "avg_cost_for_two": 600,
      "location": "Bangalore",
      "is_vegetarian": true
    }
  ]
}
'''

# Parse JSON
data = json.loads(zomato_json)

# Filter and export to CSV
print("=== Export vegetarian restaurants ===")
veg_restaurants = [r for r in data['restaurants'] if r['is_vegetarian']]

with open('vegetarian_restaurants.csv', 'w', newline='') as f:
    fieldnames = ['name', 'cuisine', 'rating', 'cost', 'location']
    writer = csv.DictWriter(f, fieldnames=fieldnames)
    writer.writeheader()

    for restaurant in veg_restaurants:
        writer.writerow({
            'name': restaurant['name'],
            'cuisine': ', '.join(restaurant['cuisine']),
            'rating': restaurant['rating'],
            'cost': restaurant['avg_cost_for_two'],
            'location': restaurant['location']
        })

print(f"Exported {len(veg_restaurants)} vegetarian restaurants")
      

Using CSV and JSON with Government Data (India)

India's Data.gov.in provides open datasets in CSV and JSON formats:

import json
import csv
# requests (a third-party library) would be needed for real API calls; not used in this sketch

# Example: Fetch rainfall data from India's government portal
# and convert to different formats

def fetch_rainfall_data(state):
    """Fetch from data.gov.in (requires API key in real usage)"""
    # This is pseudocode—real API would require authentication
    return {
        "state": state,
        "annual_rainfall_mm": 800,
        "last_updated": "2024-01-20"
    }

# Convert between formats
rainfall_csv = [
    {"state": "Rajasthan", "rainfall": 500},
    {"state": "Kerala", "rainfall": 3000},
    {"state": "Delhi", "rainfall": 700}
]

# Save as CSV
with open('rainfall.csv', 'w', newline='') as f:
    writer = csv.DictWriter(f, fieldnames=['state', 'rainfall'])
    writer.writeheader()
    writer.writerows(rainfall_csv)

# Convert to JSON
rainfall_json = {"rainfall_data": rainfall_csv}
with open('rainfall.json', 'w') as f:
    json.dump(rainfall_json, f, indent=2)

print("Data exported to both CSV and JSON formats")
      

Advanced Data Transformation: CSV to JSON and Back

Professional data engineers constantly convert between formats. Master bidirectional conversion:

import csv
import json

# CSV to JSON conversion
def csv_to_json(csv_filename, json_filename):
    """Convert CSV file to JSON format"""
    data = []
    with open(csv_filename, 'r') as csvfile:
        reader = csv.DictReader(csvfile)
        for row in reader:
            # Type conversion if needed
            if 'score' in row:
                row['score'] = int(row['score'])
            if 'rating' in row:
                row['rating'] = float(row['rating'])
            data.append(row)

    # Write JSON
    with open(json_filename, 'w') as jsonfile:
        json.dump(data, jsonfile, indent=2)

    return len(data)

# JSON to CSV conversion
def json_to_csv(json_filename, csv_filename):
    """Convert JSON file to CSV format"""
    with open(json_filename, 'r') as jsonfile:
        data = json.load(jsonfile)

    # Handle both list and dict JSON structures
    if isinstance(data, dict) and len(data) == 1:
        # If single key containing array, extract it
        key = list(data.keys())[0]
        data = data[key]

    if not data:
        print("No data to write")
        return 0

    # Get field names from first record
    fieldnames = list(data[0].keys())

    # Write CSV
    with open(csv_filename, 'w', newline='') as csvfile:
        writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(data)

    return len(data)

# Example usage
print("CSV → JSON")
count1 = csv_to_json('students.csv', 'students.json')
print(f"Converted {count1} records")

print("\nJSON → CSV")
count2 = json_to_csv('students.json', 'students_export.csv')
print(f"Converted {count2} records")
      

Working with Large Files: Memory-Efficient Processing

Real datasets have millions of rows. Process in chunks to save memory:

import csv
import json

# Memory-efficient CSV reading
def process_large_csv_by_chunks(filename, chunk_size=1000):
    """Process CSV in chunks instead of loading all at once"""
    chunk = []

    with open(filename, 'r') as f:
        reader = csv.DictReader(f)
        for i, row in enumerate(reader):
            chunk.append(row)

            # Process when chunk is full
            if len(chunk) >= chunk_size:
                print(f"Processing chunk {i // chunk_size + 1}")
                # Your processing logic here
                yield chunk
                chunk = []

        # Don't forget last chunk
        if chunk:
            yield chunk

# Example: Process IRCTC booking data (millions of records)
print("Processing large IRCTC booking file...")
for i, chunk in enumerate(process_large_csv_by_chunks('irctc_bookings.csv', 5000)):
    print(f"Chunk {i + 1}: Processing {len(chunk)} records")
    # Filter high-value bookings
    premium_bookings = [r for r in chunk if float(r.get('fare') or 0) > 5000]
    print(f"  Found {len(premium_bookings)} premium bookings in this chunk")

# Memory-efficient JSON streaming
def process_large_json_streaming(filename):
    """Process a large JSON array one record per line.

    Assumes one complete JSON object per line, as in JSON Lines
    or an array written with one record per line.
    """
    with open(filename, 'r') as f:
        f.readline()  # skip the opening '[' line

        for line in f:
            line = line.strip().rstrip(',')  # Remove trailing comma
            if line and line != ']':
                try:
                    record = json.loads(line)
                    yield record
                except json.JSONDecodeError:
                    continue

# Example: Process JSON logs (streaming approach)
print("\nProcessing large JSON log file...")
error_count = 0
for log_entry in process_large_json_streaming('app_logs.json'):
    if log_entry.get('level') == 'ERROR':
        error_count += 1

print(f"Found {error_count} error entries without loading entire file")
      

Data Validation and Type Conversion

Ensure data is correct type before processing:

import csv
import json
from datetime import datetime

class DataValidator:
    """Validate and convert data types"""

    @staticmethod
    def validate_and_convert(row, schema):
        """
        Validate row against schema
        schema = {'score': int, 'date': str, 'rating': float}
        """
        converted = {}
        errors = []

        for field, field_type in schema.items():
            value = row.get(field)

            try:
                if field_type == int:
                    converted[field] = int(value)
                elif field_type == float:
                    converted[field] = float(value)
                elif field_type == str:
                    converted[field] = str(value).strip()
                elif field_type == bool:
                    converted[field] = value.lower() in ['true', '1', 'yes']
                elif field_type == datetime:
                    converted[field] = datetime.strptime(value, '%Y-%m-%d')
            except (ValueError, AttributeError) as e:
                errors.append(f"{field}: {value} → {field_type.__name__} failed")
                converted[field] = None

        return converted, errors

# Example: Validate student records
schema = {
    'roll_no': int,
    'name': str,
    'math_score': int,
    'attendance_percent': float,
    'is_scholarship': bool
}

student_data = [
    {'roll_no': '101', 'name': '  Raj  ', 'math_score': '92', 'attendance_percent': '85.5', 'is_scholarship': 'true'},
    {'roll_no': '102', 'name': 'Invalid', 'math_score': 'ABC', 'attendance_percent': '90', 'is_scholarship': 'false'},
]

validator = DataValidator()
for student in student_data:
    converted, errors = validator.validate_and_convert(student, schema)
    if errors:
        print(f"Validation errors: {errors}")
    else:
        print(f"Valid student: {converted}")
      

Real-World Case Study: Processing Flipkart Product Data

import csv
import json
from datetime import datetime

# Typical Flipkart product export (CSV)
flipkart_csv_data = '''
product_id,name,category,price,rating,stock_quantity,last_updated
SKU001,Laptop Pro,Electronics,75000,4.5,50,2024-01-20
SKU002,Headphones,Electronics,2500,4.2,150,2024-01-20
SKU003,Cotton Shirt,Fashion,800,4.0,500,2024-01-19
SKU004,Yoga Mat,Sports,1200,4.7,75,2024-01-20
'''

# Process Flipkart data
from io import StringIO

print("=== Processing Flipkart Product Catalog ===")

# Read CSV (strip the surrounding blank lines from the triple-quoted
# string so DictReader sees the header row first)
reader = csv.DictReader(StringIO(flipkart_csv_data.strip()))
products = list(reader)

# Transform: Add derived fields
for product in products:
    product['price'] = int(product['price'])
    product['rating'] = float(product['rating'])
    product['stock_quantity'] = int(product['stock_quantity'])
    product['in_stock'] = product['stock_quantity'] > 0
    product['discount_eligible'] = product['price'] > 5000
    product['last_updated'] = datetime.strptime(product['last_updated'], '%Y-%m-%d')

# Filter: High-value products with good ratings
premium_products = [p for p in products if p['price'] > 2000 and p['rating'] >= 4.5]

print(f"Total products: {len(products)}")
print(f"Premium products: {len(premium_products)}")

# Export: Save filtered data as JSON
output = {
    'export_date': datetime.now().isoformat(),
    'total_products': len(products),
    'premium_products': premium_products
}

with open('flipkart_premium.json', 'w') as f:
    json.dump(output, f, indent=2, default=str)  # default=str to handle datetime

print("Exported to flipkart_premium.json")
      

Common Pitfalls and How to Avoid Them

Pitfall                        | Problem                                   | Solution
Reading without header         | Confusing indices: row[1] vs row['name']  | Always use DictReader for CSV
Not handling missing values    | Crashes on None/empty values              | Check for None and provide defaults
Loading entire file in memory  | Out of memory for large files             | Use generators, process in chunks
Type mismatches                | Comparing "92" (string) with 92 (int)     | Validate and convert types explicitly
Encoding issues                | UTF-8 characters cause errors             | Specify encoding: open(file, encoding='utf-8')
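Two of these pitfalls are easy to demonstrate in a few lines (the filename students_utf8.csv is just a placeholder for this sketch):

```python
import csv

# Pitfall: type mismatch. csv always yields strings, so numeric
# comparisons must happen after an explicit conversion.
print('92' > '100')      # True  — string comparison, wrong for numbers!
print(int('92') > 100)   # False — correct numeric comparison

# Pitfall: encoding. Name the encoding explicitly whenever files may
# contain non-ASCII text (Hindi names, the ₹ symbol, and so on).
with open('students_utf8.csv', 'w', newline='', encoding='utf-8') as f:
    writer = csv.writer(f)
    writer.writerow(['name', 'city'])
    writer.writerow(['Radha', 'Bengaluru'])

with open('students_utf8.csv', newline='', encoding='utf-8') as f:
    print(list(csv.reader(f)))  # [['name', 'city'], ['Radha', 'Bengaluru']]
```

The string comparison '92' > '100' is True because strings compare character by character ('9' comes after '1'), which is exactly how silent bugs creep into unconverted CSV data.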

Key Takeaways

  • CSV is simple, universal, and perfect for tabular data
  • JSON is flexible, human-readable, and the standard for APIs
  • Always read CSV with DictReader (column names are clearer than indices)
  • Always validate and clean real-world data before analysis
  • Missing values must be handled explicitly—don't ignore them
  • Convert between formats as needed (CSV → JSON for APIs, JSON → CSV for reports)
  • Process large files in chunks or streams to save memory
  • Validate data types before processing to catch errors early
  • Major Indian organisations (Flipkart, Zomato, ISRO, TCS) process huge volumes of CSV/JSON data daily
  • Encoding matters—always specify UTF-8 for international data

Practice Problems

  1. Create a CSV file with movie data (title, year, rating, director) and read it with DictReader
  2. Convert a CSV file to JSON format and save the output with proper formatting
  3. Write a script that filters CSV data and writes only records where score > 80 to a new CSV
  4. Clean a CSV file with missing values, whitespace, and inconsistent data formats
  5. Create JSON representing a student with nested marks data, then extract and process specific subjects
  6. Combine CSV and JSON: read CSV with 1000+ rows, convert to JSON, add computed fields, save as JSON
  7. Process a large CSV file (10MB+) in chunks without loading entire file in memory
  8. Validate student records against a schema: convert types, handle errors, output validated records
  9. Build a data transformation pipeline: CSV → validate → transform → JSON → output
  10. Create a class DataProcessor that can convert CSV↔JSON bidirectionally with error handling


Challenge Exercises

Think about this: How would you explain CSV and JSON data formats to someone who has never programmed before? What analogy or metaphor would make it click? Imagine you are building a real application — which concepts from this chapter would you use first?

Try this exercise: implement one concept from this chapter from scratch, without looking at the examples. Then compare your solution. What did you learn?

