Database Topic
Normalization
Master database normalization to eliminate redundancy and improve data integrity. Essential for database design interviews.
Normalization is the process of organizing data in a database to eliminate redundancy and dependency issues. It's a fundamental concept in database design that prevents data anomalies and ensures data integrity.
Why Normalize?
Problems without normalization:
- Data redundancy: Same data stored multiple times (wastes space, causes inconsistency)
- Update anomalies: Changing data requires updating multiple rows
- Insertion anomalies: Can't add data without other related data
- Deletion anomalies: Deleting data accidentally removes other data
Example of problems:
Orders Table (Unnormalized):
order_id | customer_name | customer_email | product_name | quantity | price
1 | John Doe | john@email.com | Widget | 2 | 10.00
1 | John Doe | john@email.com | Gadget | 1 | 20.00
2 | Jane Smith | jane@email.com | Widget | 3 | 10.00
Problems:
- Customer info repeated for each order item
- If John's email changes, must update multiple rows
- Can't add a customer without an order
- Deleting an order might lose customer information
Normal Forms
First Normal Form (1NF)
Rules:
- Each column contains atomic (indivisible) values
- Each row is unique
- No repeating groups or arrays
Example - Before 1NF:
Users Table:
id | name | phone_numbers
1 | John | 555-1234, 555-5678
After 1NF:
Users Table:
id | name
1 | John
Phone Numbers Table:
user_id | phone_number
1 | 555-1234
1 | 555-5678
Interview tip: JSON columns in modern databases can violate 1NF if not used carefully. Consider if the data should be in a separate table.
Second Normal Form (2NF)
Rules:
- Must be in 1NF
- All non-key attributes are fully functionally dependent on the primary key
- No partial dependencies (for composite keys)
Example - Violates 2NF:
Order Items Table:
order_id | product_id | product_name | quantity | price
1 | 101 | Widget | 2 | 10.00
1 | 102 | Gadget | 1 | 20.00
Problem: product_name depends only on product_id, not the full key (order_id, product_id)
After 2NF:
Order Items Table:
order_id | product_id | quantity | price
Products Table:
product_id | product_name
101 | Widget
102 | Gadget
Third Normal Form (3NF)
Rules:
- Must be in 2NF
- No transitive dependencies (non-key attributes don't depend on other non-key attributes)
Example - Violates 3NF:
Orders Table:
order_id | customer_id | customer_name | customer_email | total
1 | 100 | John Doe | john@email.com | 30.00
Problem: customer_name and customer_email depend on customer_id, not order_id
After 3NF:
Orders Table:
order_id | customer_id | total
1 | 100 | 30.00
Customers Table:
customer_id | customer_name | customer_email
100 | John Doe | john@email.com
Interview scenario: "Why is customer email in the orders table a problem?" Answer: It creates transitive dependency and update anomalies.
Boyce-Codd Normal Form (BCNF)
Rules:
- Must be in 3NF
- Every determinant is a candidate key
When needed: When you have overlapping candidate keys or multiple determinants.
Fourth Normal Form (4NF)
Rules:
- Must be in BCNF
- No multi-valued dependencies
Example: A table with independent multi-valued attributes should be split.
Fifth Normal Form (5NF)
Rules:
- Must be in 4NF
- No join dependencies
Rarely used in practice. Most databases stop at 3NF or BCNF.
When to Denormalize
While normalization reduces redundancy, sometimes denormalization improves performance:
Read-Heavy Workloads
Denormalized data can reduce JOIN operations:
-- Normalized (requires JOIN)
SELECT u.name, u.email, COUNT(o.id) as order_count
FROM users u
LEFT JOIN orders o ON u.id = o.user_id
GROUP BY u.id, u.name, u.email;
-- Denormalized (no JOIN needed)
SELECT name, email, order_count
FROM user_summary; -- Pre-computed view
Reporting Databases
Data warehouses often use star/snowflake schemas (denormalized):
Fact Table (denormalized):
sale_id | date | customer_id | customer_name | product_id | product_name | amount
Caching Layers
Store computed/aggregated values:
-- Denormalize order count into users table
ALTER TABLE users ADD COLUMN order_count INT;
-- Update via trigger or application logic
Trade-off: Performance vs. data consistency and storage efficiency.
Best Practices
- Start normalized: Begin with 3NF, then denormalize only when needed
- Measure first: Profile queries before denormalizing
- Document decisions: If you denormalize, document why and how to maintain consistency
- Maintain consistency: Use triggers or application logic to keep denormalized data in sync
- Consider materialized views: Use views for read optimization without denormalizing base tables
Interview Questions
1. Beginner Question
Q: What is database normalization, and why is it important?
A: Normalization is the process of organizing database tables to eliminate redundancy and dependency issues. It's important because:
- Reduces redundancy: Saves storage space and prevents data inconsistency
- Prevents anomalies: Update, insertion, and deletion anomalies
- Improves data integrity: Easier to maintain consistent data
- Simplifies maintenance: Changes to data only need to be made in one place
Example: Without normalization, if a customer's email is stored in 10 order records, updating it requires changing 10 rows. With normalization, it's stored once in the customers table.
2. Intermediate Question
Q: Explain the difference between 2NF and 3NF with an example.
A:
2NF: Eliminates partial dependencies (for composite keys). All non-key attributes must depend on the entire primary key.
3NF: Eliminates transitive dependencies. Non-key attributes must not depend on other non-key attributes.
Example:
-- Violates 2NF (if order_id, product_id is composite key)
Order Items:
order_id | product_id | product_name | quantity
-- product_name depends only on product_id, not full key
-- After 2NF:
Order Items: order_id | product_id | quantity
Products: product_id | product_name
-- Still violates 3NF:
Orders:
order_id | customer_id | customer_name | customer_email | total
-- customer_name and customer_email depend on customer_id, not order_id
-- After 3NF:
Orders: order_id | customer_id | total
Customers: customer_id | customer_name | customer_email
Key difference: 2NF addresses partial dependencies in composite keys; 3NF addresses transitive dependencies in single-key tables.
3. Senior-Level System Question
Q: Design a database schema for an e-commerce platform handling 10M products, 100M orders, and 1B order items. How would you balance normalization with performance?
A:
Strategy: Hybrid approach—normalize for writes, denormalize for reads
-
Core normalized schema (source of truth):
-- Fully normalized for data integrity products (product_id, name, description, category_id, ...) categories (category_id, name, parent_id) orders (order_id, customer_id, created_at, status, ...) order_items (order_id, product_id, quantity, price, ...) customers (customer_id, name, email, ...) -
Denormalized read-optimized views:
-- Materialized view for product catalog (read-heavy) CREATE MATERIALIZED VIEW product_catalog AS SELECT p.product_id, p.name, p.description, c.name as category_name, c.parent_name, p.price, p.stock FROM products p JOIN categories c ON p.category_id = c.id; -- Refresh periodically or on change -
Caching layer for hot data:
- Product details in Redis (frequently accessed products)
- User shopping carts in Redis (temporary, high-read)
- Search indexes (Elasticsearch) with denormalized product data
-
Partitioning strategy:
-- Partition orders by date for query performance CREATE TABLE orders_2024_01 PARTITION OF orders FOR VALUES FROM ('2024-01-01') TO ('2024-02-01'); -
Write path (normalized):
- All writes go to normalized tables
- Triggers update denormalized views/cache
- Ensures data consistency
-
Read path (optimized):
- Check cache first
- Use materialized views for analytics
- Fall back to normalized tables for complex queries
Trade-offs:
- Storage: Denormalization increases storage (acceptable for read performance)
- Consistency: Eventual consistency between normalized and denormalized data
- Complexity: More moving parts (views, triggers, cache invalidation)
Monitoring:
- Track query performance on normalized vs. denormalized
- Monitor cache hit rates
- Alert on data inconsistencies
Key Takeaways
- Normalization eliminates redundancy and prevents data anomalies (update, insertion, deletion)
- 1NF requires atomic values and unique rows—no repeating groups
- 2NF eliminates partial dependencies in composite keys
- 3NF eliminates transitive dependencies—most databases stop here
- Start normalized, then denormalize only when performance requires it
- Denormalization trades storage/consistency for read performance—measure before optimizing
- Use materialized views for read optimization without denormalizing base tables
- Document denormalization decisions and how to maintain consistency
- Hybrid approach works best—normalize for writes (data integrity), denormalize for reads (performance)
- Understand the trade-offs—normalization improves integrity but can hurt query performance
Keep exploring
Database concepts build on each other. Explore related topics to deepen your understanding of how data systems work.