MongoDB Enterprise Advanced

Enterprise-grade features for security, management, and support. Everything in Community plus advanced capabilities for mission-critical deployments.

MongoDB EA Component Stack CLICK COMPONENTS FOR DETAILS

Application Layer

🖥️ Application Drivers (all languages)

🔌 BI Connector (SQL → MQL translation)

🧭 Compass GUI

🐚 mongosh

TLS / HTTPS

Management Layer — Ops Manager

⚙️ Automation

📊 Monitoring

💾 Backup

🔔 Alerting

🔧 REST API

🖥️ Web UI

Agent Protocol

Database Layer — MongoDB Enterprise Server

WiredTiger (Encrypted)

In-Memory Engine

🔐 LDAP / Kerberos / x.509

📋 Auditing

🔑 CSFLE / Queryable Encryption

🔗 Cluster-to-Cluster Sync

Process Mgmt

Infrastructure Layer

☸️ Kubernetes Operator

🐳 Container Support

🖧 Bare Metal / VM

☁️ Any Cloud (AWS, GCP, Azure)

EA vs Community Comparison

Key differentiators that make Enterprise Advanced the choice for production workloads.

Feature	Community	Enterprise Advanced
Storage Engine - WiredTiger	✓	✓
Encryption at Rest (Native)	✗	✓ KMIP, AWS KMS, Azure Key Vault, GCP KMS
In-Memory Storage Engine	✗	✓
LDAP Authentication & Authorization	✗	✓
Kerberos Authentication	✗	✓
Audit Logging	✗	✓ Configurable filters
Client-Side Field Level Encryption	⚠ Manual only	✓ Automatic + Queryable
Ops Manager	✗	✓ Full management platform
BI Connector	✗	✓ SQL → MQL
Kubernetes Operator	⚠ Community Operator	✓ Enterprise Operator
Enterprise Support	✗	✓ 24/7 SLA-backed
Cluster-to-Cluster Sync	✗	✓ mongosync
SNMP Monitoring	✗	✓

EA Product Components

⚙️ Ops Manager

Self-hosted management platform. Deploy, monitor, back up, and scale MongoDB clusters. Includes automation, alerting, and a web UI. The on-premises equivalent of Atlas.

🔐 Enterprise Server

Enhanced mongod/mongos with encryption at rest, LDAP, Kerberos, auditing, in-memory engine, SNMP, FIPS 140-2 compliance, and Queryable Encryption.

☸️ Kubernetes Operator

CustomResourceDefinitions for MongoDB, MongoDBOpsManager, MongoDBUser. Manages lifecycle via StatefulSets. Integrates with Ops Manager for automation.

🔌 BI Connector

Translates SQL queries to MQL. Enables BI tools (Tableau, Power BI, Looker) to query MongoDB directly via a MySQL-compatible wire protocol.

🔗 Cluster-to-Cluster Sync

mongosync enables continuous data synchronization between clusters. Supports migrations, DR, and active-passive topologies.

🛡️ Enterprise Support

24/7 SLA-backed support with <1 hour P1 response time. Access to MongoDB field engineers, proactive health checks, and upgrade advisory.

Ops Manager Architecture

Ops Manager is a self-hosted management platform. It consists of the Application Server, a backing Application Database, agents deployed on every managed host, and dedicated backup infrastructure.

High-Level Ops Manager Topology CLICK COMPONENTS FOR DETAILS

Users & Integrations

👤 Ops Manager Web UI

🔧 REST API v2.0

🔗 Terraform Provider

📧 Alert Webhooks / Email / PagerDuty / Slack

HTTPS :8443

Ops Manager Application Servers (2+ for HA)

🖥️ HTTP Server (Jetty)

💾 Backup Daemon

🔔 Alert Engine

⚙️ Goal State Engine

📊 Metric Aggregator

👥 Auth (LDAP/SAML/SCRAM)

MongoDB Wire Protocol

Ops Manager Backing Databases

Application DB (3-node RS)

Blockstore (snapshot chunks)

Oplog Store (oplog slices)

Head DB (staging)

Agent ← Poll → OM

Managed MongoDB Hosts (Agents)

🤖 MongoDB Agent (unified)

→ Automation Module

→ Monitoring Module

→ Backup Module

Ops Manager Subservices — Deep Dive

Click each subservice to expand its technical details, internal workings, and interaction patterns.

⚙️

Automation Agent

Runs on every managed host

Manages the entire lifecycle of MongoDB processes — provisioning, configuration, upgrades, and topology changes.

Pull-based model: Agent polls Ops Manager every 10s (configurable) for the "goal state" — a JSON document describing desired topology
Convergence engine: Compares current state to goal state and takes actions: start/stop mongod, modify configs, initiate replica set reconfig, add shards
Upgrade orchestration: Rolling upgrades one member at a time. Steps down primary last. Waits for secondaries to catch up before proceeding
Process management: Starts mongod/mongos with correct flags. Monitors process health. Restarts on crash with backoff
Configuration: Generates mongod.conf from goal state. Handles TLS certs, keyfiles, LDAP config, audit config
Auth bootstrap: Creates first admin user, configures keyfile auth, enables auth on replica set
Port: Outbound HTTPS to Ops Manager (port 8443). No inbound ports required
Log: /var/log/mongodb-mms-automation/automation-agent.log
Failure mode: If agent is down, existing MongoDB processes keep running. No new changes applied until agent reconnects

📊

Monitoring Agent

Runs on every managed host

Collects granular performance metrics from every MongoDB process and pushes them to Ops Manager for visualization and alerting.

Data collection methods: Runs serverStatus, replSetGetStatus, dbStats, collStats, top, currentOp, connPoolStats
Collection interval: Default every 60 seconds. Configurable down to 10 seconds
Hardware metrics: CPU usage, disk IOPS, disk utilization, memory (RSS, mapped, virtual), network I/O — collected via host agent
Replication lag: Measures optime difference between primary and each secondary. Alerts on configurable thresholds
Push model: Batches metrics and sends compressed payload to Ops Manager HTTP endpoint
Metric retention: 1-minute granularity for 48 hours → 5-min for 7 days → 1-hour for 90 days → daily for 2 years
Custom metrics: Can define custom serverStatus-based metrics for dashboards
Profiler integration: Can collect slow query logs (from profiler level 1/2) and display in Ops Manager
Data sent: ~2-5 KB per mongod per collection interval (compressed)

💾

Backup Agent

Runs on every managed host

Handles oplog tailing for continuous backup and coordinates with the Backup Daemon for snapshot creation.

Oplog tailing: Connects to each replica set member and tails the local.oplog.rs collection continuously
Initial sync: On first backup, performs a full data copy. Uses mongodump internally or filesystem snapshots
Oplog slicing: Divides oplog into time-based slices and sends compressed slices to the Oplog Store
Coordination: Reports status to Backup Daemon on Ops Manager. Daemon orchestrates when to take snapshots
Sharded cluster backup: Coordinates across all shard agents to create a consistent checkpoint using balancer pause + config server oplog position
Compression: Oplog and snapshot data compressed with zstd before transmission and storage
Bandwidth control: Configurable max oplog transfer rate per agent to avoid saturating network
Encryption: Data encrypted in transit (TLS). Optional encryption at rest in blockstore/oplog store

🏗️

Backup Daemon

Runs on Ops Manager App Server

Central coordinator for all backup operations. Manages snapshot scheduling, retention policies, and restore orchestration.

Snapshot scheduling: Creates snapshots at configurable intervals (default: every 6 hours). Base + incremental approach
Head Database: Maintains a staging copy of each backed-up replica set. Applies oplog slices to keep it current. Used as the base for snapshots
Snapshot creation: Takes a point-in-time copy of the Head DB. Chunks the data and stores in the Blockstore
Incremental snapshots: After initial full snapshot, subsequent snapshots only store changed blocks (deduplication)
Retention policy: Configurable per-project. Example: 24 hourly, 7 daily, 4 weekly, 12 monthly snapshots
Restore types: Download snapshot (.tar.gz), Restore to another cluster (automated), Point-in-time restore (to any second), Queryable backup (mount as read-only)
HA: Only ONE active backup daemon per deployment. If the active Ops Manager node fails, another takes over automatically
Storage backends: Blockstore (MongoDB RS), S3-compatible (AWS S3, MinIO, etc.), Filesystem store (NFS/SAN)

🔔

Alert Engine

Runs on Ops Manager App Server

Evaluates alert conditions against collected metrics and triggers notifications via multiple integration channels.

Alert conditions: Threshold-based (e.g., connections > 500), rate-based (e.g., page faults/sec), boolean (e.g., primary step-down)
Built-in alerts: Host down, replication lag, disk usage, oplog window, election events, backup delay, agent disconnect
Custom alerts: Create on any collected metric with AND/OR conditions
Integrations: Email, PagerDuty, Slack, OpsGenie, VictorOps, Webhook (custom), SNMP traps, HipChat
Evaluation cycle: Checks conditions every monitoring interval (default 60s)
Alert states: OPEN → ACKNOWLEDGED → CLOSED. Configurable auto-resolution
Maintenance windows: Suppress alerts during planned downtime

🖥️

HTTP Server & API

Runs on Ops Manager App Server

Jetty-based HTTP server serving the Web UI and REST API. Handles authentication, authorization, and session management.

Web UI: React-based SPA served by embedded Jetty. Provides dashboards, cluster management, user management, project settings
REST API v2.0: Full CRUD for all resources — organizations, projects, clusters, users, alerts, backup configs. Digest or API key auth
Default port: 8080 (HTTP) / 8443 (HTTPS). TLS strongly recommended for production
Authentication: Local accounts (SCRAM), LDAP bind, SAML 2.0 (SSO with Okta, ADFS, etc.), x.509 client certs
Authorization: RBAC with roles: Global Owner, Org Admin, Project Owner, Project Read Only, etc.
Session: HTTP sessions stored in Application Database. Configurable timeout (default 12 hours)
Rate limiting: Configurable per-user and per-API-key rate limits to protect against abuse
Load balancer: For HA, place 2+ Ops Manager app servers behind an L7 LB with sticky sessions

Agent ↔ Ops Manager Interaction Flow

Automation Agent Convergence Loop

Agent Polls

GET /api/agent/v1/goalState
Every 10s via HTTPS

→

Goal State Received

JSON describing desired
topology & config

→

Compare

Diff current state vs
goal state

→

Converge

Apply changes: start/stop
processes, reconfig RS

→

Report

POST status back to
Ops Manager

Monitoring Data Flow

mongod / mongos

serverStatus, replSetGetStatus
dbStats, currentOp

→

Monitoring Module

Collects every 60s
+ host hardware stats

→

Compressed Push

HTTPS POST to
Ops Manager

→

Metric Aggregator

Stores in App DB
with time-based rollup

→

UI / Alerts

Dashboards, charts
Alert evaluation

💡

Unified Agent (4.0+): Since Ops Manager 4.0, all three functions (Automation, Monitoring, Backup) are bundled into a single mongodb-agent binary. The agent runs as a single process and enables/disables modules based on Ops Manager configuration. This simplifies deployment from 3 agents to 1 per host.

Ops Manager Backing Databases

Ops Manager relies on several dedicated MongoDB instances for its own operation. These are critical infrastructure.

Application Database

Purpose: Stores all Ops Manager metadata — users, organizations, projects, cluster configs, alert definitions, metric data, automation goal states, audit logs.

Sizing: Requires 3+ node replica set. SSD recommended. Size depends on # of managed hosts: ~2 GB per 100 servers.

Requirements: Must use WiredTiger. Must have oplog sized for at least 24 hours. Dedicated hardware recommended (not co-located with managed clusters).

Blockstore

Purpose: Stores snapshot data as compressed, deduplicated chunks. Each snapshot is a series of blocks referencing a base snapshot plus deltas.

Sizing: Depends on total data being backed up. Rule of thumb: 2-3x the total dataset size (for retention + overhead).

Options: MongoDB replica set (default), S3-compatible (AWS S3, GCS, MinIO), filesystem (NFS/SAN). Can have multiple blockstores for distribution.

Oplog Store

Purpose: Stores oplog slices between snapshots. These slices enable point-in-time restore to any second between two snapshots.

Sizing: Grows based on write volume. Typically 10-20% of blockstore size. Oplog slices are compressed.

Retention: Slices are garbage-collected after the next snapshot covers that time range plus the retention window.

Head Database

Purpose: Temporary staging database per backed-up replica set. The Backup Daemon applies oplog slices to keep it up-to-date, then takes periodic snapshots from it.

Lifecycle: Created automatically. One per backed-up replica set. Stored on Backup Daemon host or dedicated storage.

Note: High I/O component. Place on fast SSD. Size equals the dataset of the replica set being backed up.

Backup & Recovery Architecture

Ops Manager provides enterprise-grade backup with continuous point-in-time recovery, automated scheduling, and multiple restore options.

Backup Data Flow HOVER LAYERS TO FOCUS

Source: Managed MongoDB Clusters

Primary (initial sync source)

Secondary (oplog tailing)

📋 local.oplog.rs (capped collection)

Oplog Tail

Processing: Backup Agent + Daemon

🤖 Backup Agent tails oplog continuously

📦 Slices oplog into time-based chunks

🏗️ Backup Daemon applies slices to Head DB

📸 Takes periodic snapshots from Head DB

Compressed Write

Storage: Blockstore + Oplog Store

💾 Blockstore: Compressed snapshot chunks

📋 Oplog Store: Oplog slices (PIT restore)

☁️ Optional: S3-compatible backend

📁 Optional: Filesystem (NFS/SAN)

📸

Snapshot Backup

Periodic full + incremental snapshots with configurable retention

▾

How Snapshots Work

Initial Sync

Full copy of dataset from primary to Head Database. Uses mongodump or filesystem snapshot.

↓

Continuous Oplog Tailing

Backup Agent tails oplog from a secondary. Sends compressed slices to Oplog Store every few seconds.

↓

Head DB Catch-up

Backup Daemon applies oplog slices to Head DB, keeping it within seconds of the live cluster.

↓

Snapshot Creation

At scheduled intervals, Daemon snapshots the Head DB. First snapshot is full; subsequent ones are incremental (block-level dedup).

↓

Storage in Blockstore

Snapshot chunks compressed with zstd and stored in Blockstore. Deduplication across snapshots saves 60-80% space.

Default Retention Policy

Frequency	Retention	Example
Every 6 hours	24 snapshots	6 days of 6-hourly snapshots
Daily	7 snapshots	One week of daily snapshots
Weekly	4 snapshots	One month of weekly snapshots
Monthly	12 snapshots	One year of monthly snapshots

💡

Sharded Cluster Snapshots: The Backup Daemon pauses the balancer, records the config server oplog position, and coordinates all shard agents to capture a consistent cluster-wide checkpoint. This ensures cross-shard consistency without stopping writes.

🔄

Continuous Backup (Point-in-Time Recovery)

Restore to any second within the oplog window

▾

How Point-in-Time Recovery Works

Closest Prior Snapshot

System identifies the most recent snapshot before the target timestamp

↓

Restore Snapshot Base

Reconstructs dataset from Blockstore chunks

↓

Replay Oplog Slices

Applies stored oplog slices from Oplog Store up to the exact target timestamp

↓

Deliver Restore

Download as archive or automated restore to a cluster

RPO (Recovery Point Objective)

Depends on oplog tailing lag. Typically <5 seconds under normal load. Worst case bounded by oplog window size.

                                    # Oplog window requirements

                                    Minimum recommended: 24 hours

                                    Ideal for backup: 48+ hours

                                    Check with: rs.printReplicationInfo()

RTO (Recovery Time Objective)

Depends on dataset size and oplog volume. Automated restore provisions new cluster and streams data.

                                    # Approximate restore times

                                    10 GB dataset: ~5-10 minutes

                                    100 GB dataset: ~30-60 minutes

                                    1 TB dataset: ~2-6 hours

🔧

Restore Options

Four distinct methods for different recovery scenarios

▾

📥 Download Snapshot

Download a compressed .tar.gz archive of a snapshot. Manually restore using mongorestore. Best for: dev/test environments, selective data recovery.

🔄 Automated Restore

Ops Manager provisions target cluster and streams snapshot data. Fully automated. Handles replica set initialization and shard configuration. Best for: DR, cluster rebuild.

⏱️ Point-in-Time Restore

Restore to any specific second within the PIT window. Combines closest snapshot + oplog replay. Best for: accidental data deletion, corruption recovery.

🔍 Queryable Backup

Mount any snapshot as a read-only MongoDB instance. Query it like a regular database without full restore. Best for: auditing, data verification, selective retrieval.

⚠️

Queryable Backup: Exposes a temporary read-only mongod on a configured port. TLS and auth are required for production use. Snapshot is mounted read-only — no writes possible. Automatically dismounts after configurable timeout.

🗄️

Storage Backend Options

MongoDB RS, S3-compatible, or filesystem stores

▾

Backend	Best For	Pros	Cons
MongoDB Replica Set	Default, most features	Deduplication, queryable backup, fast restore	Requires dedicated MongoDB infrastructure
S3-Compatible	Cloud, cost optimization	Cheap storage, AWS S3 / GCS / MinIO, scalable	No queryable backup, slower restore
Filesystem (NFS/SAN)	Air-gapped, on-prem	Works with existing storage infra, no extra MongoDB needed	No queryable backup, no dedup, manual capacity management

Security & Encryption

MongoDB Enterprise Advanced provides defense-in-depth security across authentication, authorization, encryption, and auditing.

Security Layers HOVER LAYERS TO FOCUS

Network Security

🔒 TLS/SSL (all connections)

🌐 VPC / VPN / Private Link

🛡️ IP Allowlists

🔗 x.509 mutual TLS

Authentication

🔐 SCRAM-SHA-256

📂 LDAP (Active Directory)

🎫 Kerberos (GSSAPI)

📜 x.509 Certificates

🔑 AWS IAM (Atlas)

Authorization

👤 RBAC (100+ built-in roles)

📂 LDAP Group → Role Mapping

🔧 Custom Roles (collection-level)

🔍 Privilege Actions (180+)

Encryption

💽 Encryption at Rest (WiredTiger)

🔑 KMIP / AWS KMS / Azure / GCP

🔐 Client-Side FLE (CSFLE)

🔍 Queryable Encryption

Auditing & Compliance

📋 Database Audit Log

🔍 Configurable Audit Filters

📜 FIPS 140-2

🏛️ SOC2, HIPAA, PCI-DSS, GDPR

💽

Encryption at Rest

WiredTiger native encryption with external key management

▾

Key Hierarchy

Master Key

Stored externally in KMIP server, AWS KMS, Azure Key Vault, or GCP KMS. Never stored on disk.

↓ wraps

Database Key

Per-database encryption key. Encrypted (wrapped) by the Master Key. Stored on disk alongside data.

↓ encrypts

Data Files

WiredTiger data files, indexes, journal, oplog — all encrypted with AES-256-CBC or AES-256-GCM.

                            # mongod.conf — Encryption at Rest with KMIP

                            security:

                              enableEncryption: true

                              kmip:

                                serverName: "kmip.example.com"

                                port: 5696

                                clientCertificateFile: "/etc/ssl/kmip-client.pem"

                                serverCAFile: "/etc/ssl/kmip-ca.pem"

                            # Alternative: AWS KMS

                            security:

                              enableEncryption: true

                              encryptionKeyFile: ""

                              kmip:

                                serverName: "kms.us-east-1.amazonaws.com"

🔄

Key Rotation: Master key rotation is online (no downtime). Use db.adminCommand({rotateMasterKey: 1}). The old master key decrypts database keys, which are then re-encrypted with the new master key. Data files are NOT re-encrypted (only the wrapping changes).

🔒

Encryption in Transit (TLS/SSL)

All connections encrypted with TLS 1.2+

▾

Client → mongod/mongos

Application drivers connect over TLS. Certificate validation, hostname verification, and optional client cert auth (x.509 mutual TLS).

Member → Member (Intra-cluster)

Replica set members and shards communicate over TLS with internal auth using keyFile or x.509 member certificates.

Agent → Ops Manager

All agent communication to Ops Manager over HTTPS (port 8443). Agent validates Ops Manager's TLS certificate.

mongod → KMIP Server

Key management traffic encrypted with mutual TLS. Client certificate authenticates the mongod to the KMIP server.

                            # mongod.conf — TLS Configuration

                            net:

                              tls:

                                mode: requireTLS

                                certificateKeyFile: "/etc/ssl/mongod.pem"

                                CAFile: "/etc/ssl/ca.pem"

                                clusterFile: "/etc/ssl/mongod-internal.pem"

                                disabledProtocols: TLS1_0,TLS1_1

                                allowConnectionsWithoutCertificates: false

🔐

Client-Side Field Level Encryption (CSFLE) & Queryable Encryption

Encrypt sensitive fields before they leave the application

▾

CSFLE (Client-Side Field Level Encryption)

Fields encrypted/decrypted by the driver. Server never sees plaintext.

Deterministic: Same input → same ciphertext. Supports equality queries
Randomized: Same input → different ciphertext each time. No query support but stronger security
Key vault: Collection storing Data Encryption Keys (DEKs), wrapped by a Customer Master Key (CMK)
CMK providers: AWS KMS, Azure Key Vault, GCP KMS, KMIP, Local (dev only)
Automatic mode: Schema map in driver config defines which fields to encrypt. Driver handles transparently
mongocryptd: Enterprise-only helper process for automatic encryption schema validation

Queryable Encryption

Query encrypted data without the server ever decrypting it. Uses structured encryption.

Equality queries: Find documents where encrypted field equals a value
Range queries: (Preview) Query encrypted numeric/date fields with $gt, $lt
Server-side: MongoDB stores encrypted index metadata. Queries evaluated on encrypted tokens
Contention factor: Configurable trade-off between query performance and information leakage
Compaction: Periodic metadata cleanup to reduce index overhead
Use cases: PII, SSN, medical records, financial data — regulatory compliance

Application

{ ssn: "123-45-6789" }

→

Driver + CMK

Encrypts ssn field

→

MongoDB Server

{ ssn: Binary("encrypted...") }

→

On Disk

Encrypted at field level
+ encrypted at rest

📋

Audit Logging

Granular audit trail for compliance and forensics

▾

                            # mongod.conf — Audit Configuration

                            auditLog:

                              destination: file # file | syslog | console

                              format: JSON # JSON | BSON

                              path: "/var/log/mongodb/audit.json"

                              filter: '{ atype: { $in: ["authenticate", "createCollection", "dropCollection", "createUser", "dropUser", "insert", "update", "delete"] } }'

Auditable Events

Auth Events

authenticate, authCheck, createUser, dropUser, grantRolesToUser, revokeRolesFromUser, updateUser

CRUD Events

insert, update, delete, find — with audit filter to select specific databases/collections

DDL Events

createCollection, dropCollection, createIndex, dropIndex, createDatabase, dropDatabase

Admin Events

replSetReconfig, enableSharding, addShard, shutdown, applicationMessage

Monitoring & Logging

Ops Manager provides deep monitoring of every MongoDB process with real-time dashboards, alerting, and log management.

Metrics Collected

📊

Server Metrics (from serverStatus)

100+ metrics collected per mongod/mongos process

▾

🔗 Connections

connections.current, connections.available, connections.totalCreated — Track connection pool usage, detect connection leaks or exhaustion.

📝 Operations

opcounters.insert/query/update/delete/getmore/command — Per-second operation rates. Baseline for capacity planning.

🧠 Memory

mem.resident, mem.virtual, mem.mapped, WiredTiger cache: bytesCurrentlyInCache, bytesReadIntoCache, bytesWrittenFromCache, cache eviction rates.

🔒 Locks

globalLock.activeClients.readers/writers, locks.Global/Database/Collection — Identify lock contention and concurrency bottlenecks.

💾 WiredTiger

Checkpoint duration, cache utilization %, eviction rates, bytes written to journal, log sync latency, concurrent transactions (read/write tickets).

🌐 Network

network.bytesIn/bytesOut, network.numRequests, network.compression.*.compressor — Network throughput and compression efficiency.

🔄

Replication Metrics

Lag, oplog window, member state tracking

▾

⏱️ Replication Lag

Time difference between primary optime and each secondary's optime. Measured in seconds. Alert threshold: typically 10-60 seconds.

📋 Oplog Window

Time span of operations in the oplog. If a secondary falls behind more than this, it needs a full resync. Alert if < 24 hours.

🔀 Member State

PRIMARY, SECONDARY, ARBITER, RECOVERING, STARTUP2, DOWN, ROLLBACK. Track state transitions and election events.

📊 Replication Throughput

Oplog entries applied per second on secondaries. replSetGetStatus.optimes. Tracks how fast secondaries are catching up.

🖥️

Host Hardware Metrics

System-level metrics from the agent host

▾

⚡ CPU

User %, system %, iowait %, idle %. Per-core breakdown. Normalized to mongod process CPU usage.

💾 Disk

IOPS (read/write), throughput (MB/s), latency (ms/op), queue depth, disk utilization %, free space.

🧠 System Memory

Total, free, cached, buffers, swap usage. Compare with WiredTiger cache to detect memory pressure.

🌐 Network

Bytes in/out per interface, packet errors, dropped packets. Critical for replica set and sharded cluster health.

Metric Retention & Granularity

Time Range	Granularity	Retention
Last 48 hours	1 minute	2 days
Last 7 days	5 minutes	7 days
Last 90 days	1 hour	90 days
Last 2 years	1 day	730 days

Logging Architecture

📄 MongoDB Server Logs

Structured JSON logs (since 4.4). Components: ACCESS, COMMAND, CONTROL, ELECTION, GEO, INDEX, NETWORK, QUERY, REPL, SHARDING, STORAGE, WRITE.

Configurable verbosity per component (0-5). Slow query logging via profiler or slowOpThresholdMs (default 100ms).

🤖 Agent Logs

/var/log/mongodb-mms-automation/

automation-agent.log — Goal state changes, process starts/stops, config changes.
monitoring-agent.log — Metric collection events, connection issues.
backup-agent.log — Oplog tailing status, sync progress.

🖥️ Ops Manager Logs

/opt/mongodb/mms/logs/

mms0.log — Application server logs (HTTP requests, API calls).
daemon.log — Backup daemon operations.
mms0-audit.log — Ops Manager user actions audit.

🔗 External Integration

Forward logs to external systems via:

Syslog: systemLog.destination: syslog
File → Fluentd/Logstash: Ship JSON logs to ELK/Splunk
SNMP: Enterprise SNMP traps for integration with Nagios, Zabbix

High Availability — From Nodes to Datacenters

MongoDB EA HA isn't a single feature — it's a layered defense. Each level protects against a broader class of failure, and they all build on the same replica set foundation. A production deployment should address every level.

Level 1 — Process

A single mongod crashes, runs OOM, or is killed. Replica set auto-elects a new primary in ~5-10s. Zero data loss with journaling.

Level 2 — Host / Rack

A physical server dies or a rack loses power. Members in other racks/AZs elect a new primary. Spread members across failure domains within a DC.

Level 3 — Datacenter

An entire DC goes offline — network, power, or disaster. Replica set members in surviving DCs elect a new primary. Requires members distributed across DCs.

Level 4 — Region / Global

A region-wide outage or you need zero write-latency in each region. Zone sharding pins data locally; mongosync enables standby clusters for manual failover.

💡

The same rs.initiate() and w: "majority" mechanics work identically whether members are on the same rack or spread across continents. MongoDB does not distinguish between "local HA" and "geo HA" — it's all replica set replication. The only things that change are network latency, member placement, and election configuration.

Component HA Matrix

Every component in the EA stack must be resilient. Here's the HA strategy for each, applicable whether running in a single DC or across multiple.

Component	HA Strategy	Min Nodes	Failover
Managed MongoDB RS	3+ member replica set. Spread across AZs (single DC) or DCs (multi-DC).	3 (5 for multi-DC)	Automatic election, ~5-10s within DC, ~10-30s cross-DC.
Ops Manager App Server	2+ instances behind L7 load balancer (sticky sessions). Can span DCs.	2	LB routes to healthy instance. 0 downtime.
Application Database	Standard MongoDB RS. Should span DCs in multi-DC deployments.	3	Automatic election. ~5-10s single DC, ~10-30s cross-DC.
Backup Daemon	Active-passive across Ops Manager instances.	2 (OM servers)	Auto failover to another OM instance. 1-2 min.
Blockstore / Oplog Store	MongoDB RS(s). Multiple stores for load. Should span DCs for DR.	3 per RS	RS automatic election. Multiple stores for distribution.
MongoDB Agent	Systemd auto-restart. Ops Manager alerts on disconnect.	1 per host	Agent down = no new automation. Running processes unaffected.
mongos (Sharded)	Stateless. Multiple instances behind app-level LB.	2+	Driver reconnects to available mongos. No state to lose.
Config Servers	3-member RS (CSRS). Spread across DCs for sharded multi-DC.	3	Automatic election. Cluster read-only if primary lost during metadata ops.

Production Deployment Topology

Whether running in one datacenter or three, the stack is the same — only the member placement changes.

Single-DC Production Topology HOVER LAYERS TO FOCUS

Load Balancer (L7, sticky sessions)

🌐 HAProxy / Nginx / AWS ALB / F5

🔒 TLS termination or passthrough

❤️ Health check: GET /user/login (HTTP 200)

Round Robin

Ops Manager Application Servers (2+)

🖥️ OM Server 1 (HTTP + Backup Daemon ACTIVE)

🖥️ OM Server 2 (HTTP + Backup Daemon STANDBY)

🖥️ OM Server 3 (HTTP + Backup Daemon STANDBY)

Wire Protocol

Backing Databases (All Replica Sets — spread across AZs/racks)

App DB: P + S + S (3 nodes)

Blockstore 1: P + S + S

Blockstore 2: P + S + S (optional)

Oplog Store: P + S + S

HTTPS Outbound

Managed MongoDB Infrastructure

🤖 Agent on every host (auto-restart via systemd)

📦 MongoDB RS / Sharded — members across AZs or racks

⚠️

Critical: The Ops Manager Application Database should never be managed by Ops Manager itself. Deploy it independently on separate hardware with its own backup strategy (e.g., filesystem snapshots or mongodump cron).

💡

Backup Daemon HA: Only one Backup Daemon is active at any time. If the active OM node goes down, another instance automatically takes over within 1-2 minutes. Head DB files must be on shared storage (NFS/SAN) or the new daemon rebuilds from the blockstore.

The topology above protects against node and rack failures within a single DC. But what happens if the entire datacenter goes down? The same replica set just needs its members placed across DCs.

Extending to Multi-DC: Same Replica Set, Wider Placement

There is no separate "multi-DC mode." You take the same replica set and distribute members across datacenters. Elections, replication, and write concerns all work identically — the only difference is network latency between members.

Multi-DC Replica Set — 5 Members Across 3 Datacenters CLICK NODES FOR DETAILS

Datacenter A — Primary Region (e.g., us-east-1)

P (Primary) — Priority: 10

S1 (Secondary) — Priority: 8

Oplog Replication (~5-50ms cross-DC)

Datacenter B — Secondary Region (e.g., us-west-2)

S2 (Secondary) — Priority: 6

S3 (Secondary) — Priority: 6

Heartbeat / Vote (~10-100ms)

Datacenter C — Tiebreaker (e.g., eu-west-1)

S4 or Arbiter — Priority: 0 (never primary)

💡

Why 3 fault domains? With only 2 DCs and a 2+3 member split, losing the 3-member DC means the remaining 2 members cannot form a majority (2 of 5). The cluster becomes read-only. A 3rd DC tiebreaker ensures any single DC failure always leaves a majority — enabling automatic failover with zero data loss.

⏱️ Election Timing

Default electionTimeoutMillis: 10000 (10s). In cross-DC setups, increase to 15-30s to avoid spurious elections from transient network blips between DCs.

🏷️ Priority-Based DC Preference

Set higher priority for members in the preferred primary DC (priority: 10). Lower priority for remote DCs (priority: 6). Highest priority + most current oplog wins elections.

🔀 Network Partition Handling

If DC-A partitions from DC-B+C: the 3-member side forms a majority and elects a new primary. DC-A members step down (no majority). On heal, they rejoin and rollback any uncommitted w:1 writes.

📊 Rollback Protection

w: "majority" ensures writes replicate across DCs before ack. These survive any single DC failure with zero rollback. Only w:1 writes that haven't replicated will be rolled back.

What Happens When a Datacenter Fails

The election process is the same whether one node crashes or an entire DC goes dark. The only difference is how many members are lost at once.

DC-A Goes Offline

DC-A loses connectivity — Primary (P) and Secondary (S1) become unreachable simultaneously. In-flight writes to the primary are interrupted.

↓

Heartbeat Timeout (10s)

S2, S3 (DC-B) and S4 (DC-C) stop receiving heartbeats from P and S1. After electionTimeoutMillis (10s default), secondaries detect the primary is unreachable — same mechanism as a single node failure.

↓

Majority Check: 3 of 5 Alive

Surviving members: S2 + S3 + S4 = 3 of 5 votes. This is a majority. S2 (highest priority among survivors, most current oplog) calls an election.

↓

S2 Elected Primary in DC-B (~10-15s total)

S3 and S4 vote for S2. Election completes. S2 is the new primary, now accepting writes in DC-B. S3 continues as secondary. S4 remains tiebreaker.

↓

Drivers Reconnect Automatically

MongoDB drivers detect the topology change via SDAM (Server Discovery & Monitoring). Writes redirect to S2. Reads with nearest preference already target the local DC. Applications resume within ~1-3s after election.

↓

DC-A Recovers & Rejoins

When DC-A comes back, P and S1 contact S2 (new primary). P steps down to secondary. Any w:1 writes not replicated before the failure are rolled back to a file. All w:majority writes are intact — they were acknowledged across DCs before the failure.

Write Concern & Read Preference — The Knobs That Control Durability vs Latency

These settings work the same in single-DC and multi-DC. The difference is that in multi-DC, w: "majority" must wait for cross-DC replication — adding network latency to every acknowledged write.

Write Concern	Single-DC Latency	Multi-DC Latency	Durability on DC Failure
`w: 1`	~1ms (local ack)	~1ms (local ack)	⚠ May lose writes if primary DC fails before replication
`w: "majority"`	~2-5ms (local RS)	~50-100ms (cross-DC round-trip)	✓ Zero data loss — survives any single DC failure
`w: "majority", j: true`	~5-10ms	~50-150ms	✓ Maximum — survives simultaneous power loss at any DC
`w: 3` (numeric)	~2-5ms	Variable	✓ 3 copies, but may not guarantee cross-DC spread

Read Preference	Behavior	Best For
`primary`	Always reads from primary. Cross-DC latency if app is in a different DC.	Strong consistency required
`primaryPreferred`	Primary if available; secondary during failover.	Prefer consistency, tolerate stale during failover
`nearest`	Lowest-latency member. Always reads from local DC if a member exists there.	✓ Recommended for multi-DC — local reads everywhere
`nearest` + `tag_sets`	Nearest member matching tags (e.g., `dc: "us-west"`).	Explicit DC-aware routing, data locality
`secondary`	Only secondaries. Can target local DC secondary for analytics.	Offload reads from primary for reporting

⚠️

The latency trade-off: In a multi-DC RS with 50ms cross-DC latency, w: "majority" adds ~50-100ms to every write. For latency-sensitive workloads, you have three options: (1) Accept w:1 with some RPO risk, (2) Use an Active-Standby pattern with mongosync, or (3) Use Zone Sharding to keep writes local.

Alternative Multi-DC Patterns

A 3-DC replica set is ideal but not always possible. Here are the alternatives and their trade-offs.

🔗

Active-Standby with mongosync (Cluster-to-Cluster Sync)

Two independent clusters, one active, one standby. Works with only 2 DCs. Manual failover.

▾

Architecture

Cluster A (Active): Full 3-node RS in DC-A. All reads & writes
Cluster B (Standby): Independent 3-node RS in DC-B. Receives changes via mongosync
mongosync: Continuous replication via change streams. Handles initial sync + tailing
Failover: Manual — reverse mongosync direction or promote Cluster B

Trade-offs

RPO: Seconds to minutes (async)
RTO: Minutes (manual failover + DNS switch)
Pro: Zero cross-DC latency on writes. Full local performance
Pro: Works with only 2 DCs
Con: Not automatic. Potential data loss
Use case: On-prem with 2 DCs, or cross-DC latency too high for w:majority

📦

Zone Sharding — Data Locality Across Regions

Pin data to specific DCs for sovereignty and local writes while maintaining global distribution.

▾

How It Works

Zones: Tag shards with zone names ("US", "EU", "APAC"). Each zone maps to a DC/region
Zone ranges: Map shard key ranges to zones. Data for US users stays on US shards
Balancer: Automatically migrates chunks to the correct zone. Sovereignty enforced at DB level
Local reads: App connects to local mongos → reads from local shard (sub-ms)
Global queries: Scatter-gather if query doesn't include the shard key/zone field

                                # Tag shards to zones

                                sh.addShardTag("shard-us-1", "US")

                                sh.addShardTag("shard-eu-1", "EU")

                                sh.addShardTag("shard-ap-1", "APAC")

                                # Pin data ranges to zones

                                sh.addTagRange("mydb.users",

                                  { region: "US" },

                                  { region: "US\uffff" },

                                  "US")

Making the Entire Stack Multi-DC

It's not just the data clusters — every supporting component should also span DCs for true datacenter-level resilience.

🖥️ Ops Manager App Servers

Deploy 2+ OM instances across DCs behind a global load balancer (AWS Global Accelerator, Cloudflare, F5 GTM). If the primary DC's OM fails, the LB routes agents and users to the surviving DC's OM.

🗄️ Ops Manager App DB

The Application Database RS should span DCs — same as managed clusters. Use w: "majority" so OM metadata survives a DC failure.

💾 Backup Stores

Blockstore and Oplog Store RS should span DCs too. Or use S3/GCS with cross-region replication. Backup data must survive a DC failure independently.

🤖 Agent Connectivity

Agents connect to OM via HTTPS. Configure mmsBaseUrl to a global LB endpoint so agents automatically reach whichever OM instance is healthy, regardless of which DC it's in.

Deployment Models

MongoDB Enterprise Advanced supports both traditional bare metal/VM deployments and modern Kubernetes-native deployments via the Enterprise Operator.

Comparison: Kubernetes vs Bare Metal / VM

Aspect	Bare Metal / VM	Kubernetes
Deployment	RPM/DEB packages, manual or Ansible/Terraform	Enterprise Operator + CRDs (kubectl apply)
Process Lifecycle	Automation Agent manages mongod processes	Operator manages StatefulSets → Pods → mongod
Storage	Local SSD, SAN, NAS — full control	PersistentVolumeClaims (PVC) — StorageClass dependent
Networking	Static IPs, DNS, direct port access	Kubernetes Services, Headless Services, optional Ingress/LoadBalancer
Scaling	Provision new VMs, install agent, add to topology	`kubectl patch` to change replica count. Operator handles the rest
Upgrades	Ops Manager rolling upgrade via Automation Agent	Operator performs rolling update of StatefulSet pods
HA / Anti-affinity	Manual rack/AZ placement	Pod anti-affinity rules, topology spread constraints
Resource Isolation	Dedicated hardware, cgroups	Resource limits/requests, QoS classes, node selectors
TLS/Cert Management	Manual cert deployment or Vault integration	cert-manager integration, automatic rotation
Monitoring	Ops Manager Agent (native)	Ops Manager Agent in sidecar + Prometheus endpoints
Backup	Ops Manager Backup Agent (native)	Ops Manager Backup Agent in sidecar
Ops Manager itself	Installed on dedicated VMs	Can run in K8s via MongoDBOpsManager CRD
Best For	Max control, air-gapped, regulatory, legacy infra	Cloud-native, GitOps, auto-scaling, dev/staging

🖧

Bare Metal / VM Deployment

Traditional deployment with full infrastructure control

▾

Deployment Steps

Provision Infrastructure

Allocate VMs/bare metal for: Ops Manager (2+), App DB (3), Blockstore (3), Oplog Store (3), Managed hosts (N)

↓

Install Ops Manager

RPM/DEB package. Configure gen.key, mms.conf. Point to Application DB connection string

↓

Configure Backup Stores

Via Ops Manager UI: add Blockstore RS, Oplog Store RS, or S3 backend. Set retention policies

↓

Install Agents on Managed Hosts

Download agent from Ops Manager. Install via RPM/DEB. Agent auto-registers with Ops Manager

↓

Deploy MongoDB Clusters

Via Ops Manager UI or API: define topology, agents converge to deploy RS/sharded clusters

Hardware Recommendations

Component	CPU	RAM	Storage	Network
Ops Manager Server	8+ cores	16+ GB	100 GB SSD	1 Gbps+
Application DB node	4+ cores	16+ GB	SSD, sized by metadata	1 Gbps+
Blockstore node	4+ cores	16+ GB	SSD, 2-3x total backup data	10 Gbps
Oplog Store node	4+ cores	8+ GB	SSD, ~10-20% of blockstore	1 Gbps+
Managed mongod host	Per workload	Per workload	SSD/NVMe recommended	10 Gbps

                            # /opt/mongodb/mms/conf/conf-mms.properties (key settings)

                            mongo.mongoUri=mongodb://appdb1:27017,appdb2:27017,appdb3:27017/?replicaSet=appdbRS

                            mms.centralUrl=https://opsmanager.example.com:8443

                            mms.fromEmailAddr=ops-manager@example.com

                            mms.adminEmailAddr=admin@example.com

                            mms.mail.transport=smtp

                            mms.mail.hostname=smtp.example.com

                            mms.mail.port=587

☸️

Kubernetes Deployment

Cloud-native deployment with MongoDB Enterprise Kubernetes Operator

▾

Operator Architecture

Kubernetes Operator Components HOVER LAYERS TO FOCUS

Custom Resource Definitions (CRDs)

📦 MongoDB — Replica sets and sharded clusters

⚙️ MongoDBOpsManager — Ops Manager itself in K8s

👤 MongoDBUser — Database user management

🔗 MongoDBMultiCluster — Multi-K8s clusters

Operator Controller (Deployment)

🎯 Watches CRDs for changes

🔄 Reconciliation loop

📋 Creates/updates StatefulSets, Services, ConfigMaps, Secrets

🔗 Communicates with Ops Manager API

Kubernetes Resources Created

📦 StatefulSet (one per RS / shard)

💾 PersistentVolumeClaim per pod

🌐 Headless Service (member discovery)

🔐 Secrets (certs, keyfiles, credentials)

🤖 Init containers (agent setup)

📊 Sidecar containers (agent process)

Example: Deploy a 3-Node Replica Set

                            # mongodb-rs.yaml

                            apiVersion: mongodb.com/v1

                            kind: MongoDB

                            metadata:

                              name: my-replica-set

                              namespace: mongodb

                            spec:

                              members: 3

                              version: "7.0.12-ent"

                              type: ReplicaSet

                              opsManager:

                                configMapRef:

                                  name: ops-manager-connection

                              credentials: ops-manager-org-credentials

                              persistent: true

                              podSpec:

                                podTemplate:

                                  spec:

                                    affinity:

                                      podAntiAffinity:

                                        requiredDuringSchedulingIgnoredDuringExecution:

                                        - labelSelector:

                                            matchExpressions:

                                            - key: app

                                              operator: In

                                              values: ["my-replica-set"]

                                          topologyKey: "topology.kubernetes.io/zone"

                                persistence:

                                  single:

                                    storage: 50Gi

                                    storageClass: gp3-encrypted

                              security:

                                tls:

                                  enabled: true

                                authentication:

                                  modes: ["SCRAM"]

Ops Manager in Kubernetes

                            # ops-manager.yaml — Deploy Ops Manager itself in K8s

                            apiVersion: mongodb.com/v1

                            kind: MongoDBOpsManager

                            metadata:

                              name: ops-manager

                            spec:

                              replicas: 2 # HA: 2 Ops Manager instances

                              version: "7.0.4"

                              adminCredentials: ops-manager-admin

                              applicationDatabase:

                                members: 3

                                podSpec:

                                  persistence:

                                    single:

                                      storage: 20Gi

                              backup:

                                enabled: true

                                s3Stores:

                                - name: s3-backup

                                  s3BucketEndpoint: s3.amazonaws.com

                                  s3BucketName: mongodb-backups

                                  s3SecretRef:

                                    name: s3-credentials

☸️

Multi-Kubernetes Cluster: The Enterprise Operator supports deploying a single MongoDB replica set across multiple Kubernetes clusters (e.g., one member per AZ/region). Uses MongoDBMultiCluster CRD. Requires network connectivity between clusters (e.g., VPC peering, service mesh).

When to Use Which?

Choose Bare Metal / VM when...

Air-gapped / disconnected environments
Strict regulatory requirements (gov, financial)
Maximum control over hardware and networking
Existing VM infrastructure (VMware, OpenStack)
Very high-performance workloads needing NVMe/local SSD
Team has limited Kubernetes expertise

Choose Kubernetes when...

Cloud-native infrastructure strategy
GitOps / Infrastructure-as-Code workflows
Rapid provisioning of dev/staging/test clusters
Auto-healing and self-service for developers
Multi-cloud or hybrid deployments
Integration with service mesh (Istio, Linkerd)