Data Engineering · Intermediate

MongoDB: Complete Guide to Document Databases

Master MongoDB from data modeling to production deployment — schema design, aggregation pipelines, indexing, Atlas, Cosmos DB MongoDB API, and AWS DocumentDB.

Learnixo · April 17, 2026 · 8 min read
Tags: MongoDB, NoSQL, Database, Atlas, Cosmos DB, DocumentDB, Aggregation

Why MongoDB?

MongoDB stores data as BSON documents (Binary JSON), allowing each document in a collection to have a completely different structure. This makes it ideal for:

  • Product catalogs with category-specific attributes
  • User profiles with optional fields
  • CMS content with rich, nested structures
  • Event logs with variable payloads
  • Applications that evolve rapidly (schema changes are additive, not migrations)

Used by: Airbnb, Adobe, eBay, Bosch, Forbes, Toyota.


Setup

Bash
# Docker (recommended for dev)
docker run -d \
  --name mongo-dev \
  -e MONGO_INITDB_ROOT_USERNAME=admin \
  -e MONGO_INITDB_ROOT_PASSWORD=devpassword \
  -p 27017:27017 \
  mongo:7

# Connect with mongosh
mongosh "mongodb://admin:devpassword@localhost:27017"
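
Passwords that contain characters such as `@`, `:`, or `/` have to be percent-encoded before they go into a connection string. The following is a minimal sketch of that step; the helper name `buildMongoUri` and its defaults are assumptions for illustration and are not part of mongosh or any driver.

```javascript
// Hypothetical helper (not part of mongosh or any driver): builds a
// mongodb:// URI and percent-encodes the credentials.
function buildMongoUri({ user, password, host = "localhost", port = 27017, db = "" }) {
  const auth = `${encodeURIComponent(user)}:${encodeURIComponent(password)}`;
  return `mongodb://${auth}@${host}:${port}/${db}`;
}

console.log(buildMongoUri({ user: "admin", password: "devpassword" }));
// mongodb://admin:devpassword@localhost:27017/

console.log(buildMongoUri({ user: "admin", password: "p@ss:word/1", db: "myapp" }));
// mongodb://admin:p%40ss%3Aword%2F1@localhost:27017/myapp
```

The resulting string can be passed to mongosh or to a driver in the same way as the hard-coded URI above.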

Data Modeling: The Core Skill

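MongoDB has no JOIN clause in ordinary queries. The aggregation stage $lookup performs a left outer join, but it costs more than reading a single document, so you should decide upfront whether to embed related data or reference it.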

Embed When:

  • Data is always accessed together
  • One-to-few relationship (order → 5 line items)
  • Nested data doesn't grow unboundedly
JAVASCRIPT
// Good: embed line items inside the order
{
  _id: ObjectId("..."),
  orderId: "ORD-2026-001",
  customer: { id: "u_99", name: "Sarah K.", email: "sarah@example.com" },
  items: [
    { sku: "P001", name: "Laptop Stand", qty: 1, price: 49.99 },
    { sku: "P002", name: "USB Hub", qty: 2, price: 29.99 }
  ],
  totalCents: 10997,
  status: "shipped",
  shippedAt: ISODate("2026-04-16T09:00:00Z"),
  address: {
    line1: "123 Main St",
    city: "Berlin",
    country: "DE",
    postcode: "10115"
  }
}
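
Because the line items live inside the order, the application can derive or verify `totalCents` from a single document read. The sketch below assumes prices are stored as decimal numbers, as in the example document; the helper name `orderTotalCents` is hypothetical.

```javascript
// Hypothetical helper: derives the order total in integer cents from the
// embedded line items, rounding each price once to avoid float drift.
function orderTotalCents(items) {
  return items.reduce(
    (sum, item) => sum + Math.round(item.price * 100) * item.qty,
    0
  );
}

const items = [
  { sku: "P001", name: "Laptop Stand", qty: 1, price: 49.99 },
  { sku: "P002", name: "USB Hub", qty: 2, price: 29.99 }
];

console.log(orderTotalCents(items)); // 10997
```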

Reference When:

  • Data is accessed independently
  • One-to-many with large cardinality (user → thousands of orders)
  • Data is shared across documents
JAVASCRIPT
// Product referenced by ID — not duplicated in every order
{
  _id: ObjectId("..."),
  name: "Pro Gaming Mouse",
  sku: "MOUSE-PRO-01",
  category: "peripherals",
  specs: {
    dpi: 25600,
    buttons: 11,
    wireless: true,
    weight_grams: 95
  },
  variants: [
    { color: "black", stock: 150 },
    { color: "white", stock: 42 }
  ],
  price: 89.99,
  tags: ["gaming", "wireless", "ergonomic"]
}
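
With references, the application (or a $lookup stage) has to resolve the ids. One common approach is a single follow-up query using `$in`, then a merge in memory. This is a sketch under assumptions: the helper names are hypothetical, and the `products` array stands in for the result of a real query.

```javascript
// Hypothetical helpers for resolving references in application code.

// Builds the filter for one follow-up query that fetches every referenced product.
function productFilterFor(order) {
  const skus = [...new Set(order.items.map((item) => item.sku))];
  return { sku: { $in: skus } };
}

// Merges the fetched products back onto the order's line items.
function attachProducts(order, products) {
  const bySku = new Map(products.map((p) => [p.sku, p]));
  return {
    ...order,
    items: order.items.map((item) => ({ ...item, product: bySku.get(item.sku) ?? null }))
  };
}

const order = {
  orderId: "ORD-2026-001",
  items: [{ sku: "MOUSE-PRO-01", qty: 1 }, { sku: "MOUSE-PRO-01", qty: 2 }]
};
const filter = productFilterFor(order);
// With a driver this would be: db.collection("products").find(filter).toArray()
const products = [{ sku: "MOUSE-PRO-01", name: "Pro Gaming Mouse", price: 89.99 }];
const resolved = attachProducts(order, products);

console.log(JSON.stringify(filter));         // {"sku":{"$in":["MOUSE-PRO-01"]}}
console.log(resolved.items[0].product.name); // Pro Gaming Mouse
```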

CRUD Operations

JAVASCRIPT
// Use a database
use myapp

// Insert
db.products.insertOne({
  sku: "LAPTOP-001",
  name: "ThinkPad X1 Carbon",
  price: 1299.99,
  tags: ["laptop", "business", "lightweight"]
})

db.products.insertMany([
  { sku: "KB-001", name: "Mechanical Keyboard", price: 149.99 },
  { sku: "MOUSE-001", name: "Wireless Mouse", price: 59.99 }
])

// Read
db.products.findOne({ sku: "LAPTOP-001" })

db.products.find({
  price: { $gt: 100, $lt: 500 },
  tags: "laptop"
}).sort({ price: 1 }).limit(10)

// Update
db.products.updateOne(
  { sku: "LAPTOP-001" },
  {
    $set: { price: 1199.99, "specs.ssd": true },
    $push: { tags: "sale" },
    $currentDate: { updatedAt: true }
  }
)

// Delete
db.products.deleteOne({ sku: "DISCONTINUED-99" })
db.products.deleteMany({ stock: 0, createdAt: { $lt: new Date("2024-01-01") } })

Query Operators

JAVASCRIPT
// Comparison
{ price: { $gt: 100 } }        // greater than
{ price: { $gte: 100 } }       // >=
{ price: { $lt: 500 } }        // less than
{ price: { $ne: 0 } }          // not equal
{ status: { $in: ["active", "pending"] } }

// Array
{ tags: "laptop" }             // array contains value
{ tags: { $all: ["laptop", "sale"] } }  // contains all
{ tags: { $size: 3 } }         // array has exactly 3 elements

// Element
{ phone: { $exists: true } }   // field exists
{ age: { $type: "number" } }   // field type

// Logical
{ $and: [{ price: { $gt: 50 } }, { stock: { $gt: 0 } }] }
{ $or:  [{ category: "laptop" }, { category: "tablet" }] }
{ status: { $not: { $eq: "cancelled" } } }   // $not wraps an operator expression on a field; it is not a top-level operator

// Regex
{ name: { $regex: /^ThinkPad/i } }

// Array element match
{ "items.price": { $gt: 100 } }
{ items: { $elemMatch: { qty: { $gt: 5 }, price: { $lt: 50 } } } }
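
The difference between dotted-path conditions and `$elemMatch` is easy to miss: separate dotted conditions may each be satisfied by a different array element, while `$elemMatch` requires one element to satisfy all of them. The sketch below mimics both behaviours in plain JavaScript on made-up data. It is an in-memory illustration only; MongoDB evaluates the real filters on the server.

```javascript
// In-memory illustration only; MongoDB evaluates these filters server-side.

// { "items.qty": { $gt: 5 }, "items.price": { $lt: 50 } }
// Each condition may be satisfied by a different array element.
function matchesDotted(doc) {
  return doc.items.some((i) => i.qty > 5) && doc.items.some((i) => i.price < 50);
}

// { items: { $elemMatch: { qty: { $gt: 5 }, price: { $lt: 50 } } } }
// A single element has to satisfy both conditions.
function matchesElemMatch(doc) {
  return doc.items.some((i) => i.qty > 5 && i.price < 50);
}

const order = {
  items: [
    { sku: "A", qty: 10, price: 120 }, // qty matches, price does not
    { sku: "B", qty: 1, price: 20 }    // price matches, qty does not
  ]
};

console.log(matchesDotted(order));    // true
console.log(matchesElemMatch(order)); // false
```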

Aggregation Pipeline

The aggregation pipeline covers what SQL does with GROUP BY, JOIN, and analytic queries. Documents flow through an ordered list of stages, and each stage transforms the stream it receives from the previous one.

JAVASCRIPT
// Sales report: revenue by category, last 30 days
db.orders.aggregate([
  // Stage 1: Filter
  {
    $match: {
      status: "delivered",
      createdAt: { $gte: new Date(Date.now() - 30 * 24 * 3600 * 1000) }
    }
  },

  // Stage 2: Unwind array
  { $unwind: "$items" },

  // Stage 3: Lookup product details (like SQL JOIN)
  {
    $lookup: {
      from: "products",
      localField: "items.sku",
      foreignField: "sku",
      as: "productInfo"
    }
  },
  { $unwind: "$productInfo" },

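  // Stage 4: Group and calculate
  // After $unwind, each input document is one line item, so distinct
  // order ids are collected to count orders rather than line items.
  {
    $group: {
      _id: "$productInfo.category",
      totalRevenue: { $sum: { $multiply: ["$items.price", "$items.qty"] } },
      orderIds:     { $addToSet: "$_id" }
    }
  },

  // Stage 5: Sort
  { $sort: { totalRevenue: -1 } },

  // Stage 6: Project output shape
  {
    $project: {
      category: "$_id",
      totalRevenue: { $round: ["$totalRevenue", 2] },
      orderCount: { $size: "$orderIds" },
      // Category revenue divided by the number of orders containing it
      avgOrderValue: {
        $round: [{ $divide: ["$totalRevenue", { $size: "$orderIds" }] }, 2]
      },
      _id: 0
    }
  }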
])
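
To see what `$unwind` and `$group` do to the document stream, it can help to reproduce them in plain JavaScript on a few made-up orders. This is only a sketch: the sample data is invented, and the category is placed directly on each item so that the `$lookup` step can be skipped.

```javascript
// In-memory sketch of $unwind + $group; sample data is made up.
const orders = [
  { _id: 1, items: [{ category: "peripherals", price: 50, qty: 2 }, { category: "audio", price: 30, qty: 1 }] },
  { _id: 2, items: [{ category: "peripherals", price: 20, qty: 1 }] }
];

// $unwind: one output document per array element.
const unwound = orders.flatMap((order) =>
  order.items.map((item) => ({ ...order, items: item }))
);

// $group: accumulate per key.
const byCategory = new Map();
for (const doc of unwound) {
  const key = doc.items.category;
  const acc = byCategory.get(key) ?? { totalRevenue: 0, lineItems: 0 };
  acc.totalRevenue += doc.items.price * doc.items.qty;
  acc.lineItems += 1;
  byCategory.set(key, acc);
}

console.log(unwound.length);                // 3
console.log(byCategory.get("peripherals")); // { totalRevenue: 120, lineItems: 2 }
```

Two orders become three documents after the unwind, which is why a plain `{ $sum: 1 }` in a later `$group` counts line items rather than orders.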

Indexing

JAVASCRIPT
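// Single field (ascending). Pick one of the two forms: a second
// createIndex with the same keys but different options is rejected.
db.users.createIndex({ email: 1 })
// db.users.createIndex({ email: 1 }, { unique: true })   // unique variant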

// Compound
db.orders.createIndex({ tenantId: 1, status: 1, createdAt: -1 })

// Text search
db.products.createIndex({ name: "text", description: "text" })
db.products.find({ $text: { $search: "wireless mechanical keyboard" } })

// TTL — auto-delete documents after expiry
db.sessions.createIndex(
  { createdAt: 1 },
  { expireAfterSeconds: 3600 }   // delete after 1 hour
)

// Partial index
db.orders.createIndex(
  { userId: 1, createdAt: -1 },
  { partialFilterExpression: { status: { $in: ["pending", "processing"] } } }
)

// Wildcard — index all fields in a subdocument
db.products.createIndex({ "specs.$**": 1 })

// Explain a query
db.orders.find({ tenantId: "abc", status: "shipped" })
  .explain("executionStats")

Transactions (Multi-Document ACID)

Multi-document ACID transactions are supported on replica sets since MongoDB 4.0 and on sharded clusters since MongoDB 4.2.

JAVASCRIPT
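// Node.js driver example. Assumes a connected MongoClient `client` and
// `db = client.db("myapp")`, running inside an async function.
// Transactions need a replica set or sharded cluster; a standalone
// mongod (such as the Docker container above) rejects them.
const session = client.startSession()
session.startTransaction({
  readConcern: { level: "snapshot" },
  writeConcern: { w: "majority" }
})

try {
  // Every operation must pass { session } to take part in the transaction
  await db.collection("inventory")
    .updateOne({ sku: "P001" }, { $inc: { stock: -1 } }, { session })

  await db.collection("orders")
    .insertOne({ sku: "P001", userId: "u_99", status: "confirmed" }, { session })

  await session.commitTransaction()
} catch (err) {
  await session.abortTransaction()
  throw err
} finally {
  await session.endSession()
}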
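
Transactions can fail for transient reasons such as write conflicts, and the server marks those errors with the `TransientTransactionError` label so the whole transaction can be retried. The following is a sketch of such a retry loop. `withTransactionRetry` and `runTransaction` are hypothetical names, and the fake error only imitates the `hasErrorLabel` method that driver errors expose. The Node.js driver's `session.withTransaction()` offers comparable retry behaviour out of the box.

```javascript
// Sketch of a retry loop around a transaction body. `runTransaction` is a
// hypothetical async function that starts, runs, and commits one transaction.
async function withTransactionRetry(runTransaction, maxAttempts = 3) {
  for (let attempt = 1; ; attempt++) {
    try {
      return await runTransaction();
    } catch (err) {
      const transient =
        typeof err.hasErrorLabel === "function" &&
        err.hasErrorLabel("TransientTransactionError");
      // Give up on non-transient errors or when attempts are exhausted.
      if (!transient || attempt >= maxAttempts) throw err;
    }
  }
}

// Stand-in for a driver error carrying the transient label (demo only).
function fakeTransientError() {
  const err = new Error("write conflict");
  err.hasErrorLabel = (label) => label === "TransientTransactionError";
  return err;
}

let calls = 0;
withTransactionRetry(async () => {
  calls += 1;
  if (calls < 3) throw fakeTransientError();
  return "committed";
}).then((result) => console.log(result, "after", calls, "attempts"));
// committed after 3 attempts
```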

Cloud MongoDB Services

MongoDB Atlas (Official Cloud)

The fully managed MongoDB service from MongoDB Inc., available on AWS, Azure, and Google Cloud.

JAVASCRIPT
// Atlas connection string
const uri = "mongodb+srv://user:pass@cluster0.abcd.mongodb.net/myapp?retryWrites=true"

// Atlas Search — full-text powered by Lucene
db.products.aggregate([{
  $search: {
    index: "products_search",
    text: {
      query: "wireless keyboard",
      path: ["name", "description"],
      fuzzy: { maxEdits: 1 }
    }
  }
}])

Atlas features: Automatic sharding, global multi-region clusters, Atlas Search (Lucene), Atlas Vector Search (AI), Atlas Data Federation (query S3/Atlas together), Charts.


Azure Cosmos DB — MongoDB API

Cosmos DB is Microsoft's multi-model globally distributed database. The MongoDB API lets you use existing MongoDB drivers and queries against Cosmos DB.

Connection string:
mongodb://myaccount:KEY@myaccount.mongo.cosmos.azure.com:10255/?ssl=true

Key differences from native MongoDB:
  • Throughput is provisioned in Request Units (RU/s)
  • Global distribution with multi-region writes (formerly called multi-master)
  • Supports Mongo wire protocol 4.0+
  • No $where operator (security restriction)
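
Since throughput is provisioned in RU/s, capacity planning comes down to simple arithmetic once the RU cost of typical operations has been measured. The sketch below shows that arithmetic only. The function name and every number in it are assumptions for illustration; none of the values come from Azure.

```javascript
// Back-of-the-envelope RU/s estimate. All per-operation costs are inputs
// you measure yourself; none of the numbers below come from Azure.
function estimateRequestUnits({ readsPerSec, ruPerRead, writesPerSec, ruPerWrite, headroomPercent = 20 }) {
  const base = readsPerSec * ruPerRead + writesPerSec * ruPerWrite;
  return Math.ceil((base * (100 + headroomPercent)) / 100);
}

console.log(
  estimateRequestUnits({ readsPerSec: 500, ruPerRead: 2, writesPerSec: 50, ruPerWrite: 10 })
);
// 1800
```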

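When to choose Cosmos DB MongoDB API: You're already on Azure, need single-digit-millisecond P99 read and write latency in each region you deploy to, or want one managed service for workloads that also use other Cosmos DB APIs (SQL, Cassandra, Gremlin, Table). Note that each Cosmos DB account exposes a single API, so those workloads live in separate accounts.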


AWS DocumentDB (MongoDB-compatible)

Amazon DocumentDB is a fully managed document database that emulates the MongoDB 5.0 API, but it is not MongoDB itself: it runs on a proprietary AWS engine.

Bash
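# Creates the cluster only; add at least one instance with
# `aws docdb create-db-instance` before connecting.
aws docdb create-db-cluster \
  --db-cluster-identifier myapp-docdb \
  --engine docdb \
  --master-username admin \
  --master-user-password "$DOCDB_PASS" \
  --engine-version 5.0.0

# Connect via TLS (DocumentDB does not support retryable writes, so disable them)
mongosh "mongodb://admin:pass@myapp-docdb.cluster-xxx.us-east-1.docdb.amazonaws.com:27017/?tls=true&tlsCAFile=global-bundle.pem&retryWrites=false"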

Limitation: DocumentDB doesn't support all MongoDB operators. Test your aggregation pipelines before migrating.
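
A first pass over existing pipelines can be automated before running them against a test cluster. The sketch below compares the stage names in a pipeline with an allow-list. The helper name is hypothetical, and the allow-list is a placeholder that would have to be filled in from the target service's documentation; it is not DocumentDB's actual list. Only top-level stages are checked, not the operators inside them.

```javascript
// Hypothetical pre-migration check: lists pipeline stages that are missing
// from an allow-list you maintain from the target service's documentation.
function findUnsupportedStages(pipeline, supportedStages) {
  return pipeline
    .map((stage) => Object.keys(stage)[0])
    .filter((name) => !supportedStages.has(name));
}

// Placeholder allow-list for illustration; not DocumentDB's actual list.
const supported = new Set(["$match", "$unwind", "$group", "$sort", "$project"]);

const pipeline = [
  { $match: { status: "delivered" } },
  { $search: { text: { query: "keyboard", path: "name" } } },
  { $group: { _id: "$category", n: { $sum: 1 } } }
];

console.log(findUnsupportedStages(pipeline, supported)); // [ '$search' ]
```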


Schema Validation

JAVASCRIPT
// Enforce schema while keeping NoSQL flexibility
db.createCollection("orders", {
  validator: {
    $jsonSchema: {
      bsonType: "object",
      required: ["customerId", "status", "items"],
      properties: {
        customerId: { bsonType: "string" },
        status: {
          bsonType: "string",
          enum: ["pending", "processing", "shipped", "delivered", "cancelled"]
        },
        items: {
          bsonType: "array",
          minItems: 1,
          items: {
            bsonType: "object",
            required: ["sku", "qty", "price"],
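            properties: {
              sku:   { bsonType: "string" },
              // Drivers and mongosh may store whole numbers as int and
              // fractional ones as double, so accept every numeric type.
              qty:   { bsonType: ["int", "long"], minimum: 1 },
              price: { bsonType: ["double", "int", "long", "decimal"], minimum: 0 }
            }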
          }
        }
      }
    }
  },
  validationAction: "error"   // reject invalid documents
})
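
Validation can also be attached to a collection that already exists, using the `collMod` command. The sketch below only builds the command document. The helper name and its defaults are illustrative assumptions; `"moderate"` and `"warn"` are chosen here as a cautious starting point for collections that may already hold non-conforming documents.

```javascript
// Builds a collMod command that attaches a validator to an existing
// collection. The helper name and defaults are illustrative.
function buildCollModCommand(collection, jsonSchema, { level = "moderate", action = "warn" } = {}) {
  return {
    collMod: collection, // the command name has to be the first key
    validator: { $jsonSchema: jsonSchema },
    validationLevel: level,   // "moderate": updates to already-invalid documents are not checked
    validationAction: action  // "warn": log violations instead of rejecting writes
  };
}

const cmd = buildCollModCommand("orders", {
  bsonType: "object",
  required: ["customerId", "status", "items"]
});

console.log(JSON.stringify(cmd, null, 2));
```

In mongosh the resulting object could be passed to `db.runCommand(cmd)`. Once the logs show no more warnings, the action can be switched to `"error"`.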

Key Takeaways

  • Model for your queries, not for your domain — unlike SQL, your schema follows your access patterns.
  • Embed for locality (one-to-few), reference for scale (one-to-many, many-to-many).
  • Aggregation pipeline is extremely powerful — learn $match, $group, $lookup, $unwind, $project.
  • Always index your query fields. Without an index, a query on a 10M-document collection has to scan every document, which is slow enough to hit client or server timeouts.
  • Use Atlas for new production systems — it handles sharding, backups, search, and vector search in one platform.
  • Cosmos DB MongoDB API is the right choice when you're already in Azure and need global distribution.
