Database Topic
Document Stores
Master document-oriented databases like MongoDB. Essential for NoSQL and modern database design interviews.
Document Stores
Why This Matters
Think of document stores like filing cabinets with flexible folders. In a relational database, you have rigid tables with fixed columns. In a document store, you have flexible documents (like folders) that can contain different information. A user document might have name, email, and address, while another user document might have name, email, and phone number. The structure is flexible.
This matters because real-world data is often semi-structured. Not every user has the same fields. Some users have addresses, others don't. Some users have multiple phone numbers, others have one. Document stores handle this naturally, while relational databases require nullable columns or separate tables.
In interviews, when someone asks "How would you design a database for X?", they're testing whether you understand when to use document stores vs relational databases. Do you know when flexible schemas help? Do you understand the trade-offs? Most engineers don't. They just use relational databases for everything.
What Engineers Usually Get Wrong
Most engineers think "document stores are just JSON databases." But document stores have features beyond JSON: indexing, queries, transactions (in some), replication, sharding. Understanding these helps you use document stores effectively. Also, document stores aren't always better—they're better for flexible schemas, but relational databases are better for complex queries and relationships.
Engineers also don't understand that document stores still need schema design. Just because schemas are flexible doesn't mean you shouldn't design them. Poor schema design leads to poor performance and hard-to-maintain code. Design your documents thoughtfully, even if the structure is flexible.
How This Breaks Systems in the Real World
A service was using a document store but didn't design the schema. Documents had inconsistent structures—some had nested objects, others had flat structures. Queries were slow because they couldn't use indexes effectively. The fix? Design your document schema. Decide on a structure (even if flexible), create indexes, and design for your query patterns.
Another story: A service was using a document store for data that had many relationships. They tried to model relationships using embedded documents, but the documents became huge (16MB limit in MongoDB). Queries were slow. The fix? Use references for relationships, or use a relational database. Document stores are great for flexible schemas, but relational databases are better for complex relationships.
What is a Document?
A document is a self-contained unit of data, similar to a row in SQL but more flexible.
{
"_id": "user_123",
"name": "John Doe",
"email": "john@example.com",
"address": {
"street": "123 Main St",
"city": "New York",
"zip": "10001"
},
"orders": [
{
"order_id": "o1",
"date": "2024-01-15",
"total": 100.50,
"items": [
{"product": "Widget", "quantity": 2, "price": 50.25}
]
}
],
"preferences": {
"theme": "dark",
"notifications": true
}
}
Key Characteristics
Schema Flexibility
Documents in the same collection can have different structures:
// Document 1
{"_id": "1", "name": "John", "age": 30}
// Document 2
{"_id": "2", "name": "Jane", "email": "jane@example.com", "tags": ["vip", "premium"]}
Benefit: Easy schema evolution, no migrations needed.
Challenge: Application must handle varying structures.
Embedded vs. Referenced
Embedded (denormalized):
{
"user_id": "123",
"name": "John",
"orders": [
{"order_id": "o1", "total": 100},
{"order_id": "o2", "total": 200}
]
}
Referenced (normalized):
// Users collection
{"_id": "123", "name": "John"}
// Orders collection
{"_id": "o1", "user_id": "123", "total": 100}
{"_id": "o2", "user_id": "123", "total": 200}
Rule of thumb: Embed when data is accessed together, reference when data is large or shared.
Popular Document Stores
MongoDB
- Most popular document database
- Rich query language
- Aggregation pipeline
- Horizontal scaling with sharding
// MongoDB query
db.users.find({ "address.city": "New York" })
db.users.aggregate([
{ $match: { "orders.total": { $gt: 100 } } },
{ $group: { _id: "$name", total_spent: { $sum: "$orders.total" } } }
])
CouchDB
- Multi-master replication
- RESTful API
- Built-in conflict resolution
- Good for offline-first applications
DynamoDB (AWS)
- Fully managed
- Serverless scaling
- Integrated with AWS ecosystem
- Pay-per-use pricing
Data Modeling Patterns
One-to-Few: Embed
{
"blog_post_id": "p1",
"title": "My Post",
"author": "John",
"comments": [
{"user": "Alice", "text": "Great post!"},
{"user": "Bob", "text": "Thanks for sharing"}
]
}
When: Comments are always loaded with the post, limited in number.
One-to-Many: Reference
// Posts collection
{"_id": "p1", "title": "My Post", "author": "John"}
// Comments collection
{"_id": "c1", "post_id": "p1", "user": "Alice", "text": "Great post!"}
{"_id": "c2", "post_id": "p1", "user": "Bob", "text": "Thanks"}
When: Comments can be numerous, accessed independently, or shared across posts.
Many-to-Many: Reference Array
// Users collection
{"_id": "u1", "name": "John", "group_ids": ["g1", "g2"]}
// Groups collection
{"_id": "g1", "name": "Developers", "member_ids": ["u1", "u2"]}
Querying Documents
Basic Queries
// Equality
db.users.find({ "name": "John" })
// Comparison
db.users.find({ "age": { $gt: 18 } })
// Array contains
db.users.find({ "tags": "vip" })
// Nested field
db.users.find({ "address.city": "New York" })
Aggregation Pipeline
db.orders.aggregate([
// Match stage: filter documents
{ $match: { "status": "completed" } },
// Group stage: group and aggregate
{ $group: {
_id: "$user_id",
total_spent: { $sum: "$total" },
order_count: { $sum: 1 }
}},
// Sort stage: order results
{ $sort: { total_spent: -1 } },
// Limit stage: top N
{ $limit: 10 }
])
Indexing
Document stores support various index types:
// Single field index
db.users.createIndex({ "email": 1 })
// Compound index
db.users.createIndex({ "last_name": 1, "first_name": 1 })
// Text index (full-text search)
db.posts.createIndex({ "title": "text", "content": "text" })
// Geospatial index
db.locations.createIndex({ "coordinates": "2dsphere" })
When to Use Document Stores
Good Fit
- Content management: Blogs, CMS, wikis
- User profiles: Flexible user data
- Catalogs: Product catalogs with varying attributes
- Real-time analytics: Event logging, metrics
- Mobile apps: Flexible schema for rapid iteration
Not a Good Fit
- Complex transactions: Multi-document ACID transactions are limited
- Heavy JOINs: Document stores don't support SQL-style JOINs
- Strict schema: When data structure must be consistent
- Reporting: Complex analytical queries are harder
Best Practices
- Design for access patterns: Structure documents based on how you query
- Limit document size: Large documents are slow to transfer and update
- Use indexes wisely: Index fields you query frequently
- Plan for growth: Consider document size limits (MongoDB: 16MB)
- Validate at application level: Schema-less doesn't mean no validation
Common Patterns
Versioning
{
"_id": "user_123",
"version": 2,
"name": "John Doe",
"previous_versions": [
{"version": 1, "name": "John D."}
]
}
Soft Deletes
{
"_id": "post_123",
"title": "My Post",
"deleted": false,
"deleted_at": null
}
Polymorphic Collections
// All in "content" collection
{"_id": "1", "type": "article", "title": "...", "body": "..."}
{"_id": "2", "type": "video", "title": "...", "url": "..."}
{"_id": "3", "type": "image", "title": "...", "src": "..."}
Interview Questions
1. Beginner Question
Q: What is a document store, and how does it differ from a relational database?
A: A document store is a NoSQL database that stores data as documents (typically JSON/BSON), rather than rows and columns.
Key differences:
- Schema flexibility: Documents can have different structures in the same collection
- No JOINs: Related data is often embedded in documents
- Horizontal scaling: Easier to scale across multiple servers
- Query language: Uses document-based queries instead of SQL
Example:
// Document store (MongoDB)
{
_id: "user123",
name: "John",
orders: [
{order_id: "o1", total: 100},
{order_id: "o2", total: 200}
]
}
// Relational database (SQL)
// Requires separate tables: users, orders, order_items
2. Intermediate Question
Q: When should you embed data vs. reference it in a document store?
A:
Embed when:
- Data is accessed together (e.g., user and their profile)
- One-to-few relationship (e.g., user and their addresses)
- Data doesn't change frequently
- Data size is small
Reference when:
- One-to-many or many-to-many relationship
- Data is large (e.g., large arrays)
- Data is shared across documents
- Data changes independently
Example:
// Embed: User and their preferences (small, accessed together)
{
_id: "user123",
name: "John",
preferences: {theme: "dark", notifications: true}
}
// Reference: User and their orders (many orders, large data)
{
_id: "user123",
name: "John",
order_ids: ["o1", "o2", "o3"] // Reference to orders collection
}
Trade-off: Embedding improves read performance but can cause data duplication. Referencing reduces duplication but requires additional queries.
3. Senior-Level System Question
Q: Design a content management system (CMS) using MongoDB. The system needs to handle 1M articles, support versioning, handle comments, and enable full-text search. How would you model the data?
A:
Data modeling strategy:
-
Articles collection (embedded approach for content):
{ _id: ObjectId("..."), slug: "article-title", title: "Article Title", content: "...", // Full article content author_id: "user123", published_at: ISODate("..."), status: "published", // draft, published, archived tags: ["tech", "database"], metadata: { reading_time: 5, word_count: 1000 }, // Embedded: Current version version: 2, // Reference: Previous versions version_history: [ {version: 1, created_at: ISODate("..."), snapshot_id: "..."} ] } -
Comments collection (reference approach):
{ _id: ObjectId("..."), article_id: ObjectId("..."), // Reference to article user_id: "user456", content: "Great article!", created_at: ISODate("..."), parent_id: null // For nested comments } -
Search index (Elasticsearch):
- Denormalized article data for fast search
- Full-text indexing on title, content, tags
- Updated asynchronously when articles change
-
Indexes:
// Fast lookups db.articles.createIndex({slug: 1}, {unique: true}) db.articles.createIndex({author_id: 1, published_at: -1}) db.articles.createIndex({status: 1, published_at: -1}) db.articles.createIndex({tags: 1}) // Array index // Comments db.comments.createIndex({article_id: 1, created_at: -1}) -
Versioning strategy:
// Store versions in separate collection for large articles { article_id: ObjectId("..."), version: 1, content: "...", created_at: ISODate("...") } -
Query patterns:
// Get article with comments const article = await db.articles.findOne({slug: "..."}) const comments = await db.comments.find({article_id: article._id}) .sort({created_at: -1}) .limit(50) // Search (use Elasticsearch) const results = await elasticsearch.search({ index: "articles", body: {query: {match: {content: "search term"}}} })
Optimizations:
- Caching: Cache popular articles in Redis
- Pagination: Use cursor-based pagination for comments
- Async updates: Update search index asynchronously
- Sharding: Shard articles by date or author for scale
-
Document stores use flexible schemas—documents can have different structures in the same collection
-
Embed vs. reference—embed for one-to-few, reference for one-to-many or large data
-
No JOINs—related data is embedded or requires application-level joins
-
Schema design matters—even without strict schemas, plan your document structure
-
Indexes are still important—index frequently queried fields for performance
-
Denormalization is common—store redundant data to avoid multiple queries
-
Horizontal scaling—document stores scale well across multiple servers
-
Query patterns—design documents based on how you'll query them
-
Versioning—use separate collections or embedded arrays for document versioning
-
Full-text search—often requires separate search engine (Elasticsearch) for complex queries
-
NoSQL Basics - Document stores are a type of NoSQL database. Understanding NoSQL basics helps understand document store characteristics.
-
Key-Value Stores - Simpler NoSQL alternative for simple lookups. Understanding key-value stores helps choose between NoSQL types.
-
Normalization - Document stores often denormalize data. Understanding normalization helps understand document store trade-offs.
-
Query Optimization - Document store queries have different optimization strategies. Understanding query optimization helps optimize document queries.
-
Data Replication - Document stores use replication for availability. Understanding replication helps design distributed document stores.
Key Takeaways
Document stores use flexible schemas—documents can have different structures in the same collection
Embed vs. reference—embed for one-to-few, reference for one-to-many or large data
No JOINs—related data is embedded or requires application-level joins
Schema design matters—even without strict schemas, plan your document structure
Indexes are still important—index frequently queried fields for performance
Denormalization is common—store redundant data to avoid multiple queries
Horizontal scaling—document stores scale well across multiple servers
Query patterns—design documents based on how you'll query them
Versioning—use separate collections or embedded arrays for document versioning
Full-text search—often requires separate search engine (Elasticsearch) for complex queries
Related Topics
NoSQL Basics
Document stores are a type of NoSQL database. Understanding NoSQL basics helps understand document store characteristics.
Key-Value Stores
Simpler NoSQL alternative for simple lookups. Understanding key-value stores helps choose between NoSQL types.
Normalization
Document stores often denormalize data. Understanding normalization helps understand document store trade-offs.
Query Optimization
Document store queries have different optimization strategies. Understanding query optimization helps optimize document queries.
Data Replication
Document stores use replication for availability. Understanding replication helps design distributed document stores.
Keep exploring
Database concepts build on each other. Explore related topics to deepen your understanding of how data systems work.