← Back to databases

Database Topic

Document Stores

Master document-oriented databases like MongoDB. Essential for NoSQL and modern database design interviews.

Document stores are NoSQL databases that store data as documents, typically in JSON, BSON, or XML format.


What is a Document?

A document is a self-contained unit of data, similar to a row in SQL but more flexible.

{
  "_id": "user_123",
  "name": "John Doe",
  "email": "john@example.com",
  "address": {
    "street": "123 Main St",
    "city": "New York",
    "zip": "10001"
  },
  "orders": [
    {
      "order_id": "o1",
      "date": "2024-01-15",
      "total": 100.50,
      "items": [
        {"product": "Widget", "quantity": 2, "price": 50.25}
      ]
    }
  ],
  "preferences": {
    "theme": "dark",
    "notifications": true
  }
}

Key Characteristics

Schema Flexibility

Documents in the same collection can have different structures:

// Document 1
{"_id": "1", "name": "John", "age": 30}

// Document 2
{"_id": "2", "name": "Jane", "email": "jane@example.com", "tags": ["vip", "premium"]}

Benefit: Easy schema evolution, no migrations needed.

Challenge: Application must handle varying structures.

Embedded vs. Referenced

Embedded (denormalized):

{
  "user_id": "123",
  "name": "John",
  "orders": [
    {"order_id": "o1", "total": 100},
    {"order_id": "o2", "total": 200}
  ]
}

Referenced (normalized):

// Users collection
{"_id": "123", "name": "John"}

// Orders collection
{"_id": "o1", "user_id": "123", "total": 100}
{"_id": "o2", "user_id": "123", "total": 200}

Rule of thumb: Embed when data is accessed together, reference when data is large or shared.


MongoDB

  • Most popular document database
  • Rich query language
  • Aggregation pipeline
  • Horizontal scaling with sharding
// MongoDB query
db.users.find({ "address.city": "New York" })
db.users.aggregate([
  { $match: { "orders.total": { $gt: 100 } } },
  { $group: { _id: "$name", total_spent: { $sum: "$orders.total" } } }
])

CouchDB

  • Multi-master replication
  • RESTful API
  • Built-in conflict resolution
  • Good for offline-first applications

DynamoDB (AWS)

  • Fully managed
  • Serverless scaling
  • Integrated with AWS ecosystem
  • Pay-per-use pricing

Data Modeling Patterns

One-to-Few: Embed

{
  "blog_post_id": "p1",
  "title": "My Post",
  "author": "John",
  "comments": [
    {"user": "Alice", "text": "Great post!"},
    {"user": "Bob", "text": "Thanks for sharing"}
  ]
}

When: Comments are always loaded with the post, limited in number.

One-to-Many: Reference

// Posts collection
{"_id": "p1", "title": "My Post", "author": "John"}

// Comments collection
{"_id": "c1", "post_id": "p1", "user": "Alice", "text": "Great post!"}
{"_id": "c2", "post_id": "p1", "user": "Bob", "text": "Thanks"}

When: Comments can be numerous, accessed independently, or shared across posts.

Many-to-Many: Reference Array

// Users collection
{"_id": "u1", "name": "John", "group_ids": ["g1", "g2"]}

// Groups collection
{"_id": "g1", "name": "Developers", "member_ids": ["u1", "u2"]}

Querying Documents

Basic Queries

// Equality
db.users.find({ "name": "John" })

// Comparison
db.users.find({ "age": { $gt: 18 } })

// Array contains
db.users.find({ "tags": "vip" })

// Nested field
db.users.find({ "address.city": "New York" })

Aggregation Pipeline

db.orders.aggregate([
  // Match stage: filter documents
  { $match: { "status": "completed" } },
  
  // Group stage: group and aggregate
  { $group: {
    _id: "$user_id",
    total_spent: { $sum: "$total" },
    order_count: { $sum: 1 }
  }},
  
  // Sort stage: order results
  { $sort: { total_spent: -1 } },
  
  // Limit stage: top N
  { $limit: 10 }
])

Indexing

Document stores support various index types:

// Single field index
db.users.createIndex({ "email": 1 })

// Compound index
db.users.createIndex({ "last_name": 1, "first_name": 1 })

// Text index (full-text search)
db.posts.createIndex({ "title": "text", "content": "text" })

// Geospatial index
db.locations.createIndex({ "coordinates": "2dsphere" })

When to Use Document Stores

Good Fit

  • Content management: Blogs, CMS, wikis
  • User profiles: Flexible user data
  • Catalogs: Product catalogs with varying attributes
  • Real-time analytics: Event logging, metrics
  • Mobile apps: Flexible schema for rapid iteration

Not a Good Fit

  • Complex transactions: Multi-document ACID transactions are limited
  • Heavy JOINs: Document stores don't support SQL-style JOINs
  • Strict schema: When data structure must be consistent
  • Reporting: Complex analytical queries are harder

Best Practices

  1. Design for access patterns: Structure documents based on how you query
  2. Limit document size: Large documents are slow to transfer and update
  3. Use indexes wisely: Index fields you query frequently
  4. Plan for growth: Consider document size limits (MongoDB: 16MB)
  5. Validate at application level: Schema-less doesn't mean no validation

Common Patterns

Versioning

{
  "_id": "user_123",
  "version": 2,
  "name": "John Doe",
  "previous_versions": [
    {"version": 1, "name": "John D."}
  ]
}

Soft Deletes

{
  "_id": "post_123",
  "title": "My Post",
  "deleted": false,
  "deleted_at": null
}

Polymorphic Collections

// All in "content" collection
{"_id": "1", "type": "article", "title": "...", "body": "..."}
{"_id": "2", "type": "video", "title": "...", "url": "..."}
{"_id": "3", "type": "image", "title": "...", "src": "..."}

Interview Questions

1. Beginner Question

Q: What is a document store, and how does it differ from a relational database?

A: A document store is a NoSQL database that stores data as documents (typically JSON/BSON), rather than rows and columns.

Key differences:

  • Schema flexibility: Documents can have different structures in the same collection
  • No JOINs: Related data is often embedded in documents
  • Horizontal scaling: Easier to scale across multiple servers
  • Query language: Uses document-based queries instead of SQL

Example:

// Document store (MongoDB)
{
  _id: "user123",
  name: "John",
  orders: [
    {order_id: "o1", total: 100},
    {order_id: "o2", total: 200}
  ]
}

// Relational database (SQL)
// Requires separate tables: users, orders, order_items

2. Intermediate Question

Q: When should you embed data vs. reference it in a document store?

A:

Embed when:

  • Data is accessed together (e.g., user and their profile)
  • One-to-few relationship (e.g., user and their addresses)
  • Data doesn't change frequently
  • Data size is small

Reference when:

  • One-to-many or many-to-many relationship
  • Data is large (e.g., large arrays)
  • Data is shared across documents
  • Data changes independently

Example:

// Embed: User and their preferences (small, accessed together)
{
  _id: "user123",
  name: "John",
  preferences: {theme: "dark", notifications: true}
}

// Reference: User and their orders (many orders, large data)
{
  _id: "user123",
  name: "John",
  order_ids: ["o1", "o2", "o3"]  // Reference to orders collection
}

Trade-off: Embedding improves read performance but can cause data duplication. Referencing reduces duplication but requires additional queries.

3. Senior-Level System Question

Q: Design a content management system (CMS) using MongoDB. The system needs to handle 1M articles, support versioning, handle comments, and enable full-text search. How would you model the data?

A:

Data modeling strategy:

  1. Articles collection (embedded approach for content):

    {
      _id: ObjectId("..."),
      slug: "article-title",
      title: "Article Title",
      content: "...",  // Full article content
      author_id: "user123",
      published_at: ISODate("..."),
      status: "published",  // draft, published, archived
      tags: ["tech", "database"],
      metadata: {
        reading_time: 5,
        word_count: 1000
      },
      // Embedded: Current version
      version: 2,
      // Reference: Previous versions
      version_history: [
        {version: 1, created_at: ISODate("..."), snapshot_id: "..."}
      ]
    }
    
  2. Comments collection (reference approach):

    {
      _id: ObjectId("..."),
      article_id: ObjectId("..."),  // Reference to article
      user_id: "user456",
      content: "Great article!",
      created_at: ISODate("..."),
      parent_id: null  // For nested comments
    }
    
  3. Search index (Elasticsearch):

    • Denormalized article data for fast search
    • Full-text indexing on title, content, tags
    • Updated asynchronously when articles change
  4. Indexes:

    // Fast lookups
    db.articles.createIndex({slug: 1}, {unique: true})
    db.articles.createIndex({author_id: 1, published_at: -1})
    db.articles.createIndex({status: 1, published_at: -1})
    db.articles.createIndex({tags: 1})  // Array index
    
    // Comments
    db.comments.createIndex({article_id: 1, created_at: -1})
    
  5. Versioning strategy:

    // Store versions in separate collection for large articles
    {
      article_id: ObjectId("..."),
      version: 1,
      content: "...",
      created_at: ISODate("...")
    }
    
  6. Query patterns:

    // Get article with comments
    const article = await db.articles.findOne({slug: "..."})
    const comments = await db.comments.find({article_id: article._id})
      .sort({created_at: -1})
      .limit(50)
    
    // Search (use Elasticsearch)
    const results = await elasticsearch.search({
      index: "articles",
      body: {query: {match: {content: "search term"}}}
    })
    

Optimizations:

  • Caching: Cache popular articles in Redis
  • Pagination: Use cursor-based pagination for comments
  • Async updates: Update search index asynchronously
  • Sharding: Shard articles by date or author for scale

Key Takeaways

  • Document stores use flexible schemas—documents can have different structures in the same collection
  • Embed vs. reference—embed for one-to-few, reference for one-to-many or large data
  • No JOINs—related data is embedded or requires application-level joins
  • Schema design matters—even without strict schemas, plan your document structure
  • Indexes are still important—index frequently queried fields for performance
  • Denormalization is common—store redundant data to avoid multiple queries
  • Horizontal scaling—document stores scale well across multiple servers
  • Query patterns—design documents based on how you'll query them
  • Versioning—use separate collections or embedded arrays for document versioning
  • Full-text search—often requires separate search engine (Elasticsearch) for complex queries

Keep exploring

Database concepts build on each other. Explore related topics to deepen your understanding of how data systems work.