← Back to databases

Database Topic

Normalization

Master database normalization to eliminate redundancy and improve data integrity. Essential for database design interviews.

Normalization is the process of organizing data in a database to eliminate redundancy and dependency issues. It's a fundamental concept in database design that prevents data anomalies and ensures data integrity.


Why Normalize?

Problems without normalization:

  • Data redundancy: Same data stored multiple times (wastes space, causes inconsistency)
  • Update anomalies: Changing data requires updating multiple rows
  • Insertion anomalies: Can't add data without other related data
  • Deletion anomalies: Deleting data accidentally removes other data

Example of problems:

Orders Table (Unnormalized):
order_id | customer_name | customer_email | product_name | quantity | price
1        | John Doe      | john@email.com | Widget       | 2        | 10.00
1        | John Doe      | john@email.com | Gadget       | 1        | 20.00
2        | Jane Smith    | jane@email.com | Widget       | 3        | 10.00

Problems:
- Customer info repeated for each order item
- If John's email changes, must update multiple rows
- Can't add a customer without an order
- Deleting an order might lose customer information

Normal Forms

First Normal Form (1NF)

Rules:

  • Each column contains atomic (indivisible) values
  • Each row is unique
  • No repeating groups or arrays

Example - Before 1NF:

Users Table:
id | name | phone_numbers
1  | John | 555-1234, 555-5678

After 1NF:

Users Table:
id | name
1  | John

Phone Numbers Table:
user_id | phone_number
1       | 555-1234
1       | 555-5678

Interview tip: JSON columns in modern databases can violate 1NF if not used carefully. Consider if the data should be in a separate table.

Second Normal Form (2NF)

Rules:

  • Must be in 1NF
  • All non-key attributes are fully functionally dependent on the primary key
  • No partial dependencies (for composite keys)

Example - Violates 2NF:

Order Items Table:
order_id | product_id | product_name | quantity | price
1        | 101        | Widget       | 2        | 10.00
1        | 102        | Gadget       | 1        | 20.00

Problem: product_name depends only on product_id, not the full key (order_id, product_id)

After 2NF:

Order Items Table:
order_id | product_id | quantity | price

Products Table:
product_id | product_name
101        | Widget
102        | Gadget

Third Normal Form (3NF)

Rules:

  • Must be in 2NF
  • No transitive dependencies (non-key attributes don't depend on other non-key attributes)

Example - Violates 3NF:

Orders Table:
order_id | customer_id | customer_name | customer_email | total
1        | 100         | John Doe      | john@email.com | 30.00

Problem: customer_name and customer_email depend on customer_id, not order_id

After 3NF:

Orders Table:
order_id | customer_id | total
1        | 100         | 30.00

Customers Table:
customer_id | customer_name | customer_email
100         | John Doe      | john@email.com

Interview scenario: "Why is customer email in the orders table a problem?" Answer: It creates transitive dependency and update anomalies.

Boyce-Codd Normal Form (BCNF)

Rules:

  • Must be in 3NF
  • Every determinant is a candidate key

When needed: When you have overlapping candidate keys or multiple determinants.

Fourth Normal Form (4NF)

Rules:

  • Must be in BCNF
  • No multi-valued dependencies

Example: A table with independent multi-valued attributes should be split.

Fifth Normal Form (5NF)

Rules:

  • Must be in 4NF
  • No join dependencies

Rarely used in practice. Most databases stop at 3NF or BCNF.


When to Denormalize

While normalization reduces redundancy, sometimes denormalization improves performance:

Read-Heavy Workloads

Denormalized data can reduce JOIN operations:

-- Normalized (requires JOIN)
SELECT u.name, u.email, COUNT(o.id) as order_count
FROM users u
LEFT JOIN orders o ON u.id = o.user_id
GROUP BY u.id, u.name, u.email;

-- Denormalized (no JOIN needed)
SELECT name, email, order_count
FROM user_summary;  -- Pre-computed view

Reporting Databases

Data warehouses often use star/snowflake schemas (denormalized):

Fact Table (denormalized):
sale_id | date | customer_id | customer_name | product_id | product_name | amount

Caching Layers

Store computed/aggregated values:

-- Denormalize order count into users table
ALTER TABLE users ADD COLUMN order_count INT;
-- Update via trigger or application logic

Trade-off: Performance vs. data consistency and storage efficiency.


Best Practices

  1. Start normalized: Begin with 3NF, then denormalize only when needed
  2. Measure first: Profile queries before denormalizing
  3. Document decisions: If you denormalize, document why and how to maintain consistency
  4. Maintain consistency: Use triggers or application logic to keep denormalized data in sync
  5. Consider materialized views: Use views for read optimization without denormalizing base tables

Interview Questions

1. Beginner Question

Q: What is database normalization, and why is it important?

A: Normalization is the process of organizing database tables to eliminate redundancy and dependency issues. It's important because:

  1. Reduces redundancy: Saves storage space and prevents data inconsistency
  2. Prevents anomalies: Update, insertion, and deletion anomalies
  3. Improves data integrity: Easier to maintain consistent data
  4. Simplifies maintenance: Changes to data only need to be made in one place

Example: Without normalization, if a customer's email is stored in 10 order records, updating it requires changing 10 rows. With normalization, it's stored once in the customers table.

2. Intermediate Question

Q: Explain the difference between 2NF and 3NF with an example.

A:

2NF: Eliminates partial dependencies (for composite keys). All non-key attributes must depend on the entire primary key.

3NF: Eliminates transitive dependencies. Non-key attributes must not depend on other non-key attributes.

Example:

-- Violates 2NF (if order_id, product_id is composite key)
Order Items:
order_id | product_id | product_name | quantity
-- product_name depends only on product_id, not full key

-- After 2NF:
Order Items: order_id | product_id | quantity
Products: product_id | product_name

-- Still violates 3NF:
Orders:
order_id | customer_id | customer_name | customer_email | total
-- customer_name and customer_email depend on customer_id, not order_id

-- After 3NF:
Orders: order_id | customer_id | total
Customers: customer_id | customer_name | customer_email

Key difference: 2NF addresses partial dependencies in composite keys; 3NF addresses transitive dependencies in single-key tables.

3. Senior-Level System Question

Q: Design a database schema for an e-commerce platform handling 10M products, 100M orders, and 1B order items. How would you balance normalization with performance?

A:

Strategy: Hybrid approach—normalize for writes, denormalize for reads

  1. Core normalized schema (source of truth):

    -- Fully normalized for data integrity
    products (product_id, name, description, category_id, ...)
    categories (category_id, name, parent_id)
    orders (order_id, customer_id, created_at, status, ...)
    order_items (order_id, product_id, quantity, price, ...)
    customers (customer_id, name, email, ...)
    
  2. Denormalized read-optimized views:

    -- Materialized view for product catalog (read-heavy)
    CREATE MATERIALIZED VIEW product_catalog AS
    SELECT 
      p.product_id,
      p.name,
      p.description,
      c.name as category_name,
      c.parent_name,
      p.price,
      p.stock
    FROM products p
    JOIN categories c ON p.category_id = c.id;
    
    -- Refresh periodically or on change
    
  3. Caching layer for hot data:

    • Product details in Redis (frequently accessed products)
    • User shopping carts in Redis (temporary, high-read)
    • Search indexes (Elasticsearch) with denormalized product data
  4. Partitioning strategy:

    -- Partition orders by date for query performance
    CREATE TABLE orders_2024_01 PARTITION OF orders
    FOR VALUES FROM ('2024-01-01') TO ('2024-02-01');
    
  5. Write path (normalized):

    • All writes go to normalized tables
    • Triggers update denormalized views/cache
    • Ensures data consistency
  6. Read path (optimized):

    • Check cache first
    • Use materialized views for analytics
    • Fall back to normalized tables for complex queries

Trade-offs:

  • Storage: Denormalization increases storage (acceptable for read performance)
  • Consistency: Eventual consistency between normalized and denormalized data
  • Complexity: More moving parts (views, triggers, cache invalidation)

Monitoring:

  • Track query performance on normalized vs. denormalized
  • Monitor cache hit rates
  • Alert on data inconsistencies

Key Takeaways

  • Normalization eliminates redundancy and prevents data anomalies (update, insertion, deletion)
  • 1NF requires atomic values and unique rows—no repeating groups
  • 2NF eliminates partial dependencies in composite keys
  • 3NF eliminates transitive dependencies—most databases stop here
  • Start normalized, then denormalize only when performance requires it
  • Denormalization trades storage/consistency for read performance—measure before optimizing
  • Use materialized views for read optimization without denormalizing base tables
  • Document denormalization decisions and how to maintain consistency
  • Hybrid approach works best—normalize for writes (data integrity), denormalize for reads (performance)
  • Understand the trade-offs—normalization improves integrity but can hurt query performance

Keep exploring

Database concepts build on each other. Explore related topics to deepen your understanding of how data systems work.