Document Stores

Master document-oriented databases like MongoDB. Essential for NoSQL and modern database design interviews.

Why This Matters

Think of document stores like filing cabinets with flexible folders. In a relational database, you have rigid tables with fixed columns. In a document store, you have flexible documents (like folders) that can contain different information. A user document might have name, email, and address, while another user document might have name, email, and phone number. The structure is flexible.

This matters because real-world data is often semi-structured. Not every user has the same fields. Some users have addresses, others don't. Some users have multiple phone numbers, others have one. Document stores handle this naturally, while relational databases require nullable columns or separate tables.

In interviews, when someone asks "How would you design a database for X?", they're testing whether you understand when to use a document store versus a relational database. Do you know when flexible schemas help? Do you understand the trade-offs? Many engineers skip this question and default to a relational database for everything.

What Engineers Usually Get Wrong

A common misconception is that document stores are just JSON databases. In practice they offer much more than JSON storage: indexing, rich queries, transactions (in some products), replication, and sharding. Understanding these features helps you use document stores effectively. Document stores also aren't always the better choice: they suit flexible schemas, while relational databases suit complex queries and relationships.

Another common mistake is skipping schema design. A flexible schema still needs design: poorly designed documents lead to slow queries and hard-to-maintain code. Design your documents deliberately, even if the structure is allowed to vary.

How This Breaks Systems in the Real World

A service was using a document store but didn't design the schema. Documents had inconsistent structures—some had nested objects, others had flat structures. Queries were slow because they couldn't use indexes effectively. The fix? Design your document schema. Decide on a structure (even if flexible), create indexes, and design for your query patterns.

Another story: A service was using a document store for data that had many relationships. They tried to model relationships using embedded documents, but the documents grew huge and ran into MongoDB's 16 MB per-document limit. Queries were slow. The fix? Use references for relationships, or use a relational database. Document stores are great for flexible schemas, but relational databases are better for complex relationships.

What is a Document?

A document is a self-contained unit of data, similar to a row in a relational table but able to hold nested objects and arrays.

{
  "_id": "user_123",
  "name": "John Doe",
  "email": "john@example.com",
  "address": {
    "street": "123 Main St",
    "city": "New York",
    "zip": "10001"
  },
  "orders": [
    {
      "order_id": "o1",
      "date": "2024-01-15",
      "total": 100.50,
      "items": [
        {"product": "Widget", "quantity": 2, "price": 50.25}
      ]
    }
  ],
  "preferences": {
    "theme": "dark",
    "notifications": true
  }
}

Key Characteristics

Schema Flexibility

Documents in the same collection can have different structures:

// Document 1
{"_id": "1", "name": "John", "age": 30}

// Document 2
{"_id": "2", "name": "Jane", "email": "jane@example.com", "tags": ["vip", "premium"]}

Benefit: Easy schema evolution. Adding a field does not require a table-wide migration, although existing documents may still need a backfill.

Challenge: Application must handle varying structures.
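Handling varying structures usually means normalizing documents when you read them. Here is a minimal sketch using the two example documents above; `normalizeUser` and its default values are hypothetical choices for illustration, not part of any driver API.

```javascript
// Hypothetical helper: fill in the fields that some documents omit.
function normalizeUser(doc) {
  return {
    id: doc._id,
    name: doc.name,
    // Optional fields get explicit defaults so callers never see undefined.
    age: doc.age ?? null,
    email: doc.email ?? null,
    tags: Array.isArray(doc.tags) ? doc.tags : [],
  };
}

const users = [
  { _id: "1", name: "John", age: 30 },
  { _id: "2", name: "Jane", email: "jane@example.com", tags: ["vip", "premium"] },
].map(normalizeUser);

console.log(users);
```

After normalization, every caller can rely on `tags` being an array and on missing scalars being `null`, regardless of which shape was stored.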

Embedded vs. Referenced

Embedded (denormalized):

{
  "user_id": "123",
  "name": "John",
  "orders": [
    {"order_id": "o1", "total": 100},
    {"order_id": "o2", "total": 200}
  ]
}

Referenced (normalized):

// Users collection
{"_id": "123", "name": "John"}

// Orders collection
{"_id": "o1", "user_id": "123", "total": 100}
{"_id": "o2", "user_id": "123", "total": 200}

Rule of thumb: Embed when data is accessed together, reference when data is large or shared.
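To see what the referenced layout costs at read time, here is a small sketch of an application-level join that rebuilds the embedded shape from the two collections above. The collections are plain in-memory arrays, and `embedOrders` is a hypothetical helper, so treat this as an illustration of the idea rather than database code.

```javascript
// Referenced layout: two "collections" held as plain arrays for illustration.
const users = [{ _id: "123", name: "John" }];
const orders = [
  { _id: "o1", user_id: "123", total: 100 },
  { _id: "o2", user_id: "123", total: 200 },
];

// Hypothetical helper: build the embedded shape from the referenced one.
function embedOrders(user, allOrders) {
  const own = allOrders
    .filter((o) => o.user_id === user._id) // the "join" condition
    .map((o) => ({ order_id: o._id, total: o.total }));
  return { user_id: user._id, name: user.name, orders: own };
}

const embedded = embedOrders(users[0], orders);
console.log(JSON.stringify(embedded, null, 2));
```

With embedding, this work is done once at write time; with references, it is repeated on every read that needs both pieces.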

Data Modeling Patterns

One-to-Few: Embed

{
  "blog_post_id": "p1",
  "title": "My Post",
  "author": "John",
  "comments": [
    {"user": "Alice", "text": "Great post!"},
    {"user": "Bob", "text": "Thanks for sharing"}
  ]
}

When: Comments are always loaded with the post, limited in number.

One-to-Many: Reference

// Posts collection
{"_id": "p1", "title": "My Post", "author": "John"}

// Comments collection
{"_id": "c1", "post_id": "p1", "user": "Alice", "text": "Great post!"}
{"_id": "c2", "post_id": "p1", "user": "Bob", "text": "Thanks"}

When: Comments can be numerous, accessed independently, or shared across posts.

Many-to-Many: Reference Array

// Users collection
{"_id": "u1", "name": "John", "group_ids": ["g1", "g2"]}

// Groups collection
{"_id": "g1", "name": "Developers", "member_ids": ["u1", "u2"]}
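Because the relationship is stored on both sides, every membership change touches two documents. The sketch below models that with in-memory maps; `addMembership` and the second user are assumptions made for the example. In a real database the two writes would need a transaction, or the application would have to tolerate a brief inconsistency.

```javascript
const users = new Map([
  ["u1", { _id: "u1", name: "John", group_ids: ["g1", "g2"] }],
  ["u2", { _id: "u2", name: "Jane", group_ids: [] }], // hypothetical second user
]);
const groups = new Map([
  ["g1", { _id: "g1", name: "Developers", member_ids: ["u1"] }],
]);

// Hypothetical helper: update both sides of the relationship, idempotently.
function addMembership(userId, groupId) {
  const user = users.get(userId);
  const group = groups.get(groupId);
  if (!user || !group) throw new Error("unknown user or group");
  if (!user.group_ids.includes(groupId)) user.group_ids.push(groupId);
  if (!group.member_ids.includes(userId)) group.member_ids.push(userId);
}

addMembership("u2", "g1");
addMembership("u2", "g1"); // repeating the call changes nothing

console.log(groups.get("g1").member_ids);
console.log(users.get("u2").group_ids);
```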

Querying Documents

Basic Queries

// Equality
db.users.find({ "name": "John" })

// Comparison
db.users.find({ "age": { $gt: 18 } })

// Array contains
db.users.find({ "tags": "vip" })

// Nested field
db.users.find({ "address.city": "New York" })
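If the query semantics feel abstract, a toy matcher makes them concrete. The sketch below covers only the four cases shown above (equality, `$gt`, array containment, dotted paths) and is not how any real database evaluates queries; `getPath`, `matches`, and the sample data are all made up for illustration.

```javascript
// Resolve a dotted path such as "address.city" against a nested object.
function getPath(doc, path) {
  return path.split(".").reduce((v, key) => (v == null ? undefined : v[key]), doc);
}

// Toy matcher: every field condition in the query must hold.
function matches(doc, query) {
  return Object.entries(query).every(([path, cond]) => {
    const value = getPath(doc, path);
    if (cond !== null && typeof cond === "object" && "$gt" in cond) {
      return value > cond.$gt;
    }
    // A scalar condition matches an equal scalar or an array that contains it.
    return Array.isArray(value) ? value.includes(cond) : value === cond;
  });
}

const users = [
  { name: "John", age: 30, address: { city: "New York" } },
  { name: "Jane", age: 17, tags: ["vip", "premium"] },
];

const find = (query) => users.filter((u) => matches(u, query)).map((u) => u.name);

console.log(find({ name: "John" }));
console.log(find({ age: { $gt: 18 } }));
console.log(find({ tags: "vip" }));
console.log(find({ "address.city": "New York" }));
```

Note how a document that lacks the queried field simply fails to match instead of raising an error, which is what makes mixed-shape collections queryable.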

Aggregation Pipeline

db.orders.aggregate([
  // Match stage: filter documents
  { $match: { "status": "completed" } },
  
  // Group stage: group and aggregate
  { $group: {
    _id: "$user_id",
    total_spent: { $sum: "$total" },
    order_count: { $sum: 1 }
  }},
  
  // Sort stage: order results
  { $sort: { total_spent: -1 } },
  
  // Limit stage: top N
  { $limit: 10 }
])
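It can help to read the pipeline as ordinary data processing. The following sketch reproduces the same four stages over an in-memory array; the sample orders and the `topSpenders` name are assumptions for the example, and a real pipeline runs inside the database rather than in application code.

```javascript
const orders = [
  { user_id: "u1", status: "completed", total: 100 },
  { user_id: "u2", status: "completed", total: 250 },
  { user_id: "u1", status: "completed", total: 200 },
  { user_id: "u2", status: "pending", total: 999 },
];

function topSpenders(docs, limit) {
  const groups = new Map();
  for (const o of docs) {
    if (o.status !== "completed") continue; // $match
    const g = groups.get(o.user_id) ?? { _id: o.user_id, total_spent: 0, order_count: 0 };
    g.total_spent += o.total; // $sum: "$total"
    g.order_count += 1; // $sum: 1
    groups.set(o.user_id, g); // $group keyed by user_id
  }
  return [...groups.values()]
    .sort((a, b) => b.total_spent - a.total_spent) // $sort, descending
    .slice(0, limit); // $limit
}

console.log(topSpenders(orders, 10));
```

Filtering first matters for the same reason it does in the real pipeline: later stages only see the documents that survived `$match`.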

Indexing

Document stores support various index types:

// Single field index
db.users.createIndex({ "email": 1 })

// Compound index
db.users.createIndex({ "last_name": 1, "first_name": 1 })

// Text index (full-text search)
db.posts.createIndex({ "title": "text", "content": "text" })

// Geospatial index
db.locations.createIndex({ "coordinates": "2dsphere" })
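Conceptually, a single-field index maps each value to the documents that contain it, so a lookup does not have to scan the whole collection. The toy model below uses a `Map` to show the idea; real database indexes are ordered structures maintained by the engine, and `buildIndex`, `findByEmail`, and the sample data are hypothetical.

```javascript
// Toy single-field index: value -> list of documents holding that value.
function buildIndex(docs, field) {
  const index = new Map();
  for (const doc of docs) {
    const key = doc[field];
    if (key === undefined) continue; // documents without the field are skipped
    if (!index.has(key)) index.set(key, []);
    index.get(key).push(doc);
  }
  return index;
}

const users = [
  { _id: "1", name: "John", email: "john@example.com" },
  { _id: "2", name: "Jane", email: "jane@example.com" },
  { _id: "3", name: "Sam" }, // no email, so not reachable through the index
];

const byEmail = buildIndex(users, "email");

// One map lookup instead of a scan over every document.
function findByEmail(email) {
  return byEmail.get(email) ?? [];
}

console.log(findByEmail("jane@example.com"));
```

The trade-off is visible even in the toy version: the index has to be updated on every write, which is why indexing every field is rarely a good idea.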

When to Use Document Stores

Good Fit

Content management: Blogs, CMS, wikis
User profiles: Flexible user data
Catalogs: Product catalogs with varying attributes
Real-time analytics: Event logging, metrics
Mobile apps: Flexible schema for rapid iteration

Not a Good Fit

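Complex transactions: Multi-document ACID transactions are limited or costly (MongoDB has supported them since version 4.0, but they add overhead compared with single-document writes)
Heavy JOINs: Joins are limited (MongoDB offers $lookup for left outer joins in aggregations), so join-heavy workloads usually fit a relational database better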
Strict schema: When data structure must be consistent
Reporting: Complex analytical queries are harder

Best Practices

Design for access patterns: Structure documents based on how you query
Limit document size: Large documents are slow to transfer and update
Use indexes wisely: Index fields you query frequently
Plan for growth: Consider document size limits (MongoDB: 16MB)
Validate your data: Schema-less doesn't mean no validation; validate in the application or with the database's own validators (for example, MongoDB's $jsonSchema)
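Two of these practices, validation and watching document size, are easy to sketch in application code. The rules in `validateUser` are hypothetical, and the size check is only a rough proxy: it measures the JSON encoding with Node's `Buffer`, while MongoDB's 16 MB limit applies to the BSON encoding, which differs.

```javascript
const MAX_DOC_BYTES = 16 * 1024 * 1024; // MongoDB's per-document limit (16 MB)

// Rough proxy only: BSON encoding differs from JSON, so sizes will not match exactly.
function approxSizeBytes(doc) {
  return Buffer.byteLength(JSON.stringify(doc), "utf8");
}

// Hypothetical rules for a user document; adjust them to your own schema.
function validateUser(doc) {
  const errors = [];
  if (typeof doc.name !== "string" || doc.name.length === 0) {
    errors.push("name is required");
  }
  if (doc.email !== undefined && !/^[^@\s]+@[^@\s]+$/.test(doc.email)) {
    errors.push("email is malformed");
  }
  if (approxSizeBytes(doc) > MAX_DOC_BYTES / 2) {
    errors.push("document is over half the size limit");
  }
  return errors;
}

console.log(validateUser({ name: "John", email: "john@example.com" }));
console.log(validateUser({ email: "not-an-email" }));
```

Returning a list of errors instead of throwing on the first one makes it easier to report every problem with a document at once.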

Common Patterns

Versioning

{
  "_id": "user_123",
  "version": 2,
  "name": "John Doe",
  "previous_versions": [
    {"version": 1, "name": "John D."}
  ]
}
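One way to maintain this shape is to snapshot only the fields that are about to change before applying an update. This is a sketch under that assumption; `applyUpdate` is a hypothetical helper, and a production version would also need to cap how large `previous_versions` can grow.

```javascript
// Hypothetical helper: returns a new document with the version history extended.
function applyUpdate(doc, changes) {
  // Snapshot only the fields that are about to change.
  const snapshot = { version: doc.version };
  for (const key of Object.keys(changes)) snapshot[key] = doc[key];
  return {
    ...doc,
    ...changes,
    version: doc.version + 1,
    previous_versions: [...(doc.previous_versions ?? []), snapshot],
  };
}

const v1 = { _id: "user_123", version: 1, name: "John D.", previous_versions: [] };
const v2 = applyUpdate(v1, { name: "John Doe" });

console.log(JSON.stringify(v2, null, 2));
```

Applied to the first version of the user, this produces the versioned document shown above.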

Soft Deletes

{
  "_id": "post_123",
  "title": "My Post",
  "deleted": false,
  "deleted_at": null
}
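The two halves of the pattern are marking a document as deleted and filtering deleted documents out of every read. Here is a minimal in-memory sketch; `softDelete` and `isLive` are hypothetical names, and storing `deleted_at` as an ISO string is an assumption made for the example.

```javascript
// Mark a document as deleted without removing it.
function softDelete(doc, now = new Date()) {
  return { ...doc, deleted: true, deleted_at: now.toISOString() };
}

// Every read path must apply this filter, or deleted documents leak through.
const isLive = (doc) => doc.deleted !== true;

const posts = [
  { _id: "post_123", title: "My Post", deleted: false, deleted_at: null },
  { _id: "post_456", title: "Old Post", deleted: false, deleted_at: null },
];

const fixedNow = new Date("2024-01-15T00:00:00Z");
posts[1] = softDelete(posts[1], fixedNow);

console.log(posts.filter(isLive).map((p) => p._id));
```

The cost of the pattern is that every query has to remember the filter, so it is worth centralizing it in one data-access function.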

Polymorphic Collections

// All in "content" collection
{"_id": "1", "type": "article", "title": "...", "body": "..."}
{"_id": "2", "type": "video", "title": "...", "url": "..."}
{"_id": "3", "type": "image", "title": "...", "src": "..."}
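On the application side, a polymorphic collection usually leads to dispatching on the `type` field. The sketch below uses a `Map` of handlers; the rendered strings and sample values are made up, since the documents above elide their contents.

```javascript
// One handler per document type; each reads only the fields its type defines.
const renderers = new Map([
  ["article", (d) => `${d.title}: ${d.body}`],
  ["video", (d) => `${d.title} (video at ${d.url})`],
  ["image", (d) => `${d.title} (image at ${d.src})`],
]);

function render(doc) {
  const renderer = renderers.get(doc.type);
  // Fail loudly on unknown types instead of silently rendering nothing.
  if (!renderer) throw new Error(`unknown content type: ${doc.type}`);
  return renderer(doc);
}

// Hypothetical sample document.
console.log(render({ _id: "2", type: "video", title: "Intro", url: "https://example.com/v/2" }));
```

Adding a new content type then means adding one handler, without touching the documents already stored.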

Interview Questions

1. Beginner Question

Q: What is a document store, and how does it differ from a relational database?

A: A document store is a NoSQL database that stores data as documents (typically JSON/BSON), rather than rows and columns.

Key differences:

Schema flexibility: Documents can have different structures in the same collection
Limited JOINs: Related data is often embedded in documents instead of being joined at query time
Horizontal scaling: Most document stores include built-in sharding, which makes scaling across servers easier
Query language: Uses document-based queries instead of SQL

Example:

// Document store (MongoDB)
{
  _id: "user123",
  name: "John",
  orders: [
    {order_id: "o1", total: 100},
    {order_id: "o2", total: 200}
  ]
}

// Relational database (SQL)
// Requires separate tables: users, orders, order_items

2. Intermediate Question

Q: When should you embed data vs. reference it in a document store?

Embed when:

Data is accessed together (e.g., user and their profile)
One-to-few relationship (e.g., user and their addresses)
Data doesn't change frequently
Data size is small

Reference when:

One-to-many or many-to-many relationship
Data is large (e.g., large arrays)
Data is shared across documents
Data changes independently

Example:

// Embed: User and their preferences (small, accessed together)
{
  _id: "user123",
  name: "John",
  preferences: {theme: "dark", notifications: true}
}

// Reference: User and their orders (many orders, large data)
{
  _id: "user123",
  name: "John",
  order_ids: ["o1", "o2", "o3"]  // Reference to orders collection; if the list can grow without bound, store user_id on each order instead
}

Trade-off: Embedding improves read performance but can cause data duplication. Referencing reduces duplication but requires additional queries.

3. Senior-Level System Question

Q: Design a content management system (CMS) using MongoDB. The system needs to handle 1M articles, support versioning, handle comments, and enable full-text search. How would you model the data?

Data modeling strategy:

Articles collection (embedded approach for content):

{
  _id: ObjectId("..."),
  slug: "article-title",
  title: "Article Title",
  content: "...",  // Full article content
  author_id: "user123",
  published_at: ISODate("..."),
  status: "published",  // draft, published, archived
  tags: ["tech", "database"],
  metadata: {
    reading_time: 5,
    word_count: 1000
  },
  // Embedded: Current version
  version: 2,
  // Reference: Previous versions
  version_history: [
    {version: 1, created_at: ISODate("..."), snapshot_id: "..."}
  ]
}

Comments collection (reference approach):

{
  _id: ObjectId("..."),
  article_id: ObjectId("..."),  // Reference to article
  user_id: "user456",
  content: "Great article!",
  created_at: ISODate("..."),
  parent_id: null  // For nested comments
}
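Because each comment stores only its `parent_id`, the nested thread has to be assembled after the flat list is loaded. This is a minimal sketch of that step; `buildThread` and the sample comments are hypothetical, and comments whose parent is missing are treated as top-level.

```javascript
// Turn a flat list of comments into a tree using parent_id.
function buildThread(comments) {
  // Copy each comment and give it an empty replies array.
  const byId = new Map(comments.map((c) => [c._id, { ...c, replies: [] }]));
  const roots = [];
  for (const node of byId.values()) {
    const parent = node.parent_id === null ? undefined : byId.get(node.parent_id);
    if (parent) parent.replies.push(node);
    else roots.push(node); // top-level comment, or parent not loaded
  }
  return roots;
}

const flat = [
  { _id: "c1", parent_id: null, content: "Great article!" },
  { _id: "c2", parent_id: "c1", content: "Agreed" },
  { _id: "c3", parent_id: "c2", content: "Same here" },
  { _id: "c4", parent_id: null, content: "Thanks" },
];

console.log(JSON.stringify(buildThread(flat), null, 2));
```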

Search index (Elasticsearch):
- Denormalized article data for fast search
- Full-text indexing on title, content, tags
- Updated asynchronously when articles change

Indexes:

// Fast lookups
db.articles.createIndex({slug: 1}, {unique: true})
db.articles.createIndex({author_id: 1, published_at: -1})
db.articles.createIndex({status: 1, published_at: -1})
db.articles.createIndex({tags: 1})  // Multikey index (tags is an array)

// Comments
db.comments.createIndex({article_id: 1, created_at: -1})

Versioning strategy:

// Store versions in separate collection for large articles
{
  article_id: ObjectId("..."),
  version: 1,
  content: "...",
  created_at: ISODate("...")
}

Query patterns:

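// Get article with comments (Node.js driver style; `db` is assumed to be a connected Db handle)
const article = await db.collection("articles").findOne({slug: "..."})
const comments = await db.collection("comments")
  .find({article_id: article._id})
  .sort({created_at: -1})
  .limit(50)
  .toArray()  // find() returns a cursor; toArray() loads the documents

// Search (use Elasticsearch; `elasticsearch` is assumed to be an already-configured client)
const results = await elasticsearch.search({
  index: "articles",
  body: {query: {match: {content: "search term"}}}
})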

Optimizations:

Caching: Cache popular articles in Redis
Pagination: Use cursor-based pagination for comments
Async updates: Update search index asynchronously
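Sharding: Shard articles on a key that spreads writes evenly, such as author or a hashed ID; a purely date-based key tends to send all new writes to one shard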
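Cursor-based pagination is the least obvious item in this list, so here is a sketch of it. Instead of skipping N rows, each page starts after the last item already seen. The example runs over an in-memory array, with numeric `created_at` values and `_id` as a tie-breaker; `pageAfter` and the data are assumptions made for illustration. In the database, the same condition would become a range filter backed by the comments index.

```javascript
// Newest first; ties on created_at are broken by _id, also descending.
function byNewest(a, b) {
  return b.created_at - a.created_at || (a._id < b._id ? 1 : a._id > b._id ? -1 : 0);
}

// Return the page that follows `cursor` (null means "start from the top").
function pageAfter(comments, cursor, limit) {
  const remaining = comments
    .filter(
      (c) =>
        cursor === null ||
        c.created_at < cursor.created_at ||
        (c.created_at === cursor.created_at && c._id < cursor._id)
    )
    .sort(byNewest);
  const items = remaining.slice(0, limit);
  const last = items[items.length - 1];
  // A next cursor is returned only when more items remain after this page.
  const next = remaining.length > limit ? { created_at: last.created_at, _id: last._id } : null;
  return { items, next };
}

const comments = [
  { _id: "c1", created_at: 100 },
  { _id: "c2", created_at: 200 },
  { _id: "c3", created_at: 200 },
  { _id: "c4", created_at: 300 },
];

const page1 = pageAfter(comments, null, 2);
const page2 = pageAfter(comments, page1.next, 2);
console.log(page1.items.map((c) => c._id), page2.items.map((c) => c._id));
```

Unlike offset-based paging, this stays stable when new comments arrive between requests, because the cursor pins the position to a specific item.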

Related Topics

NoSQL Basics - Document stores are one type of NoSQL database, so the general NoSQL trade-offs explain many of their characteristics.
Key-Value Stores - A simpler NoSQL alternative for plain lookups by key. Knowing both helps you choose between NoSQL types.
Normalization - Document stores often denormalize data. Knowing normalization makes the trade-offs of embedding clearer.
Query Optimization - Document queries have their own optimization strategies, which differ from those used for SQL queries.
Data Replication - Document stores rely on replication for availability, which shapes how distributed deployments are designed.

Key Takeaways

Document stores use flexible schemas—documents can have different structures in the same collection

Embed vs. reference—embed for one-to-few, reference for one-to-many or large data

No JOINs—related data is embedded or requires application-level joins

Schema design matters—even without strict schemas, plan your document structure

Indexes are still important—index frequently queried fields for performance

Denormalization is common—store redundant data to avoid multiple queries

Horizontal scaling—document stores scale well across multiple servers

Query patterns—design documents based on how you'll query them

Versioning—use separate collections or embedded arrays for document versioning

Full-text search—often requires separate search engine (Elasticsearch) for complex queries
