MongoDB Enterprise Advanced
Enterprise-grade features for security, management, and support. Everything in Community plus advanced capabilities for mission-critical deployments.
EA vs Community Comparison
Key differentiators that make Enterprise Advanced the choice for production workloads.
| Feature | Community | Enterprise Advanced |
|---|---|---|
| Storage Engine - WiredTiger | ✓ | ✓ |
| Encryption at Rest (Native) | ✗ | ✓ KMIP, AWS KMS, Azure Key Vault, GCP KMS |
| In-Memory Storage Engine | ✗ | ✓ |
| LDAP Authentication & Authorization | ✗ | ✓ |
| Kerberos Authentication | ✗ | ✓ |
| Audit Logging | ✗ | ✓ Configurable filters |
| Client-Side Field Level Encryption | ✓ Manual only | ✓ Automatic + Queryable |
| Ops Manager | ✗ | ✓ Full management platform |
| BI Connector | ✗ | ✓ SQL → MQL |
| Kubernetes Operator | ✓ Community Operator | ✓ Enterprise Operator |
| Enterprise Support | ✗ | ✓ 24/7 SLA-backed |
| Cluster-to-Cluster Sync | ✗ | ✓ mongosync |
| SNMP Monitoring | ✗ | ✓ |
EA Product Components
mongosync enables continuous data synchronization between clusters. It supports migrations, DR, and active-passive topologies.
Ops Manager Architecture
Ops Manager is a self-hosted management platform. It consists of the Application Server, a backing Application Database, agents deployed on every managed host, and dedicated backup infrastructure.
Ops Manager Subservices: Deep Dive
Click each subservice to expand its technical details, internal workings, and interaction patterns.
- Pull-based model: The agent polls Ops Manager every 10s (configurable) for the "goal state", a JSON document describing the desired topology
- Convergence engine: Compares current state to goal state and takes actions: start/stop mongod, modify configs, initiate replica set reconfig, add shards
- Upgrade orchestration: Rolling upgrades one member at a time. Steps down primary last. Waits for secondaries to catch up before proceeding
- Process management: Starts mongod/mongos with correct flags. Monitors process health. Restarts on crash with backoff
- Configuration: Generates mongod.conf from goal state. Handles TLS certs, keyfiles, LDAP config, audit config
- Auth bootstrap: Creates first admin user, configures keyfile auth, enables auth on replica set
- Port: Outbound HTTPS to Ops Manager (port 8443). No inbound ports required
- Log: /var/log/mongodb-mms-automation/automation-agent.log
- Failure mode: If the agent is down, existing MongoDB processes keep running. No new changes are applied until the agent reconnects
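The convergence engine described above can be sketched as a diff between current and goal state. This is a toy illustration of the pattern, not the real agent's code; all names here are hypothetical:

```python
# Illustrative sketch of the automation agent's convergence step: diff the
# observed state against the goal state and emit ordered actions.
# plan_actions and its inputs are hypothetical, not the real agent API.

def plan_actions(current: dict, goal: dict) -> list:
    actions = []
    # Version drift triggers a rolling upgrade (one member at a time).
    if current.get("version") != goal.get("version"):
        actions.append(f"rolling-upgrade to {goal['version']}")
    # Missing processes are started; surplus processes are stopped.
    for proc in goal.get("processes", []):
        if proc not in current.get("processes", []):
            actions.append(f"start mongod {proc}")
    for proc in current.get("processes", []):
        if proc not in goal.get("processes", []):
            actions.append(f"stop mongod {proc}")
    return actions

current = {"version": "6.0.4", "processes": ["rs0-0", "rs0-1"]}
goal = {"version": "6.0.5", "processes": ["rs0-0", "rs0-1", "rs0-2"]}
print(plan_actions(current, goal))
# → ['rolling-upgrade to 6.0.5', 'start mongod rs0-2']
```

The real agent repeats this loop every polling interval until current state equals goal state, which is why an agent outage pauses changes without affecting running processes.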
- Data collection methods: Runs serverStatus, replSetGetStatus, dbStats, collStats, top, currentOp, connPoolStats
- Collection interval: Default every 60 seconds. Configurable down to 10 seconds
- Hardware metrics: CPU usage, disk IOPS, disk utilization, memory (RSS, mapped, virtual), network I/O, collected via the host agent
- Replication lag: Measures optime difference between primary and each secondary. Alerts on configurable thresholds
- Push model: Batches metrics and sends compressed payload to Ops Manager HTTP endpoint
- Metric retention: 1-minute granularity for 48 hours → 5-minute for 7 days → 1-hour for 90 days → daily for 2 years
- Custom metrics: Can define custom serverStatus-based metrics for dashboards
- Profiler integration: Can collect slow query logs (from profiler level 1/2) and display in Ops Manager
- Data sent: ~2-5 KB per mongod per collection interval (compressed)
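The time-based rollup can be modeled as a lookup from a sample's age to its stored granularity. This is a toy model of the retention schedule above, not Ops Manager code:

```python
from datetime import timedelta

# Retention tiers from the text: 1-min for 48h, 5-min for 7d,
# 1-hour for 90d, daily for 2 years. Helper name is illustrative.
TIERS = [
    (timedelta(hours=48), timedelta(minutes=1)),
    (timedelta(days=7), timedelta(minutes=5)),
    (timedelta(days=90), timedelta(hours=1)),
    (timedelta(days=730), timedelta(days=1)),
]

def granularity(age):
    """Return the stored granularity for a metric point of a given age."""
    for horizon, gran in TIERS:
        if age <= horizon:
            return gran
    return None  # beyond 2 years: dropped

assert granularity(timedelta(hours=3)) == timedelta(minutes=1)
assert granularity(timedelta(days=30)) == timedelta(hours=1)
```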
- Oplog tailing: Connects to each replica set member and tails the local.oplog.rs collection continuously
- Initial sync: On first backup, performs a full data copy. Uses mongodump internally or filesystem snapshots
- Oplog slicing: Divides oplog into time-based slices and sends compressed slices to the Oplog Store
- Coordination: Reports status to Backup Daemon on Ops Manager. Daemon orchestrates when to take snapshots
- Sharded cluster backup: Coordinates across all shard agents to create a consistent checkpoint using balancer pause + config server oplog position
- Compression: Oplog and snapshot data compressed with zstd before transmission and storage
- Bandwidth control: Configurable max oplog transfer rate per agent to avoid saturating network
- Encryption: Data encrypted in transit (TLS). Optional encryption at rest in blockstore/oplog store
- Snapshot scheduling: Creates snapshots at configurable intervals (default: every 6 hours). Base + incremental approach
- Head Database: Maintains a staging copy of each backed-up replica set. Applies oplog slices to keep it current. Used as the base for snapshots
- Snapshot creation: Takes a point-in-time copy of the Head DB. Chunks the data and stores in the Blockstore
- Incremental snapshots: After initial full snapshot, subsequent snapshots only store changed blocks (deduplication)
- Retention policy: Configurable per-project. Example: 24 hourly, 7 daily, 4 weekly, 12 monthly snapshots
- Restore types: Download snapshot (.tar.gz), Restore to another cluster (automated), Point-in-time restore (to any second), Queryable backup (mount as read-only)
- HA: Only ONE active backup daemon per deployment. If the active Ops Manager node fails, another takes over automatically
- Storage backends: Blockstore (MongoDB RS), S3-compatible (AWS S3, MinIO, etc.), Filesystem store (NFS/SAN)
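The incremental, deduplicated snapshots described above hinge on detecting which blocks changed since the last snapshot, typically by comparing content hashes. A minimal sketch of that idea (block size and helper names are illustrative, and real blockstores use much larger blocks):

```python
import hashlib

BLOCK = 4  # bytes per block; tiny for illustration only

def changed_blocks(prev_hashes, curr: bytes, size=BLOCK):
    """Return indices of blocks in `curr` whose hash differs from the
    stored hash of the previous snapshot (or that are new)."""
    out = []
    for offset in range(0, len(curr), size):
        blk = offset // size
        h = hashlib.sha256(curr[offset:offset + size]).hexdigest()
        if blk >= len(prev_hashes) or prev_hashes[blk] != h:
            out.append(blk)
    return out

prev = b"aaaabbbbcccc"
prev_hashes = [hashlib.sha256(prev[i:i + BLOCK]).hexdigest()
               for i in range(0, len(prev), BLOCK)]
print(changed_blocks(prev_hashes, b"aaaaXbbbcccc"))  # → [1]
```

Only the listed blocks need to be written to the Blockstore; unchanged blocks are referenced from earlier snapshots, which is what keeps incremental snapshots small.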
- Alert conditions: Threshold-based (e.g., connections > 500), rate-based (e.g., page faults/sec), boolean (e.g., primary step-down)
- Built-in alerts: Host down, replication lag, disk usage, oplog window, election events, backup delay, agent disconnect
- Custom alerts: Create on any collected metric with AND/OR conditions
- Integrations: Email, PagerDuty, Slack, OpsGenie, VictorOps, Webhook (custom), SNMP traps, HipChat
- Evaluation cycle: Checks conditions every monitoring interval (default 60s)
- Alert states: OPEN → ACKNOWLEDGED → CLOSED. Configurable auto-resolution
- Maintenance windows: Suppress alerts during planned downtime
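The threshold evaluation and alert lifecycle above can be sketched as a small state machine. This is purely illustrative; the class and method names are not Ops Manager APIs:

```python
# Minimal sketch of a threshold alert with the OPEN -> ACKNOWLEDGED -> CLOSED
# lifecycle and auto-resolution. Evaluated once per monitoring interval.

class Alert:
    def __init__(self, metric, threshold):
        self.metric, self.threshold = metric, threshold
        self.state = None  # None means not firing

    def evaluate(self, sample: dict):
        firing = sample.get(self.metric, 0) > self.threshold
        if firing and self.state is None:
            self.state = "OPEN"
        elif not firing and self.state in ("OPEN", "ACKNOWLEDGED"):
            self.state = "CLOSED"  # auto-resolution when the condition clears
        return self.state

    def acknowledge(self):
        if self.state == "OPEN":
            self.state = "ACKNOWLEDGED"

a = Alert("connections", 500)
a.evaluate({"connections": 750})   # crosses threshold -> OPEN
a.acknowledge()                    # operator acks -> ACKNOWLEDGED
a.evaluate({"connections": 120})   # condition clears -> CLOSED
print(a.state)  # → CLOSED
```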
- Web UI: React-based SPA served by embedded Jetty. Provides dashboards, cluster management, user management, project settings
- REST API v2.0: Full CRUD for all resources β organizations, projects, clusters, users, alerts, backup configs. Digest or API key auth
- Default port: 8080 (HTTP) / 8443 (HTTPS). TLS strongly recommended for production
- Authentication: Local accounts (SCRAM), LDAP bind, SAML 2.0 (SSO with Okta, ADFS, etc.), x.509 client certs
- Authorization: RBAC with roles: Global Owner, Org Admin, Project Owner, Project Read Only, etc.
- Session: HTTP sessions stored in Application Database. Configurable timeout (default 12 hours)
- Rate limiting: Configurable per-user and per-API-key rate limits to protect against abuse
- Load balancer: For HA, place 2+ Ops Manager app servers behind an L7 LB with sticky sessions
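A common way to implement the per-user and per-API-key limits mentioned above is a token bucket. Ops Manager's actual rate-limiting internals are not public, so this is only a sketch of the general technique:

```python
import time

# Token bucket: capacity bounds the burst, rate bounds sustained throughput.
class TokenBucket:
    def __init__(self, rate: float, capacity: int):
        self.rate, self.capacity = rate, capacity
        self.tokens = float(capacity)
        self.last = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        # Refill proportionally to elapsed time, capped at capacity.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

bucket = TokenBucket(rate=10, capacity=5)  # 10 req/s sustained, burst of 5
results = [bucket.allow() for _ in range(6)]
print(results)  # → [True, True, True, True, True, False] when called back-to-back
```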
Agent to Ops Manager Interaction Flow
[Diagram] The agent polls Ops Manager every 10s via HTTPS: it reports its current topology & config, receives the goal state, and starts/stops processes or reconfigures replica sets to converge. Monitoring data (dbStats, currentOp, plus host hardware stats) flows to Ops Manager, which stores metrics with time-based rollup and runs alert evaluation against them.
The formerly separate automation, monitoring, and backup agents are now consolidated into a single mongodb-agent binary. The agent runs as a single process and enables/disables modules based on Ops Manager configuration. This simplifies deployment from 3 agents to 1 per host.
Ops Manager Backing Databases
Ops Manager relies on several dedicated MongoDB instances for its own operation. These are critical infrastructure.
Application Database
Sizing: Requires a 3+ node replica set. SSD recommended. Size depends on the number of managed hosts: ~2 GB per 100 servers.
Requirements: Must use WiredTiger. Must have an oplog sized for at least 24 hours. Dedicated hardware recommended (not co-located with managed clusters).
Blockstore
Sizing: Depends on the total data being backed up. Rule of thumb: 2-3x the total dataset size (for retention + overhead).
Options: MongoDB replica set (default), S3-compatible (AWS S3, GCS, MinIO), filesystem (NFS/SAN). Can have multiple blockstores for distribution.
Oplog Store
Sizing: Grows based on write volume. Typically 10-20% of blockstore size. Oplog slices are compressed.
Retention: Slices are garbage-collected after the next snapshot covers that time range plus the retention window.
Head Database
Lifecycle: Created automatically. One per backed-up replica set. Stored on the Backup Daemon host or dedicated storage.
Note: High I/O component. Place on fast SSD. Size equals the dataset of the replica set being backed up.
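The slice garbage-collection rule can be expressed as a predicate: a slice is collectable once its end time is covered by the latest snapshot and it has aged past the retention window. A toy sketch with hypothetical names and integer timestamps:

```python
# Oplog-slice GC sketch: slices maps slice id -> (start_ts, end_ts).
# A slice is collectable when its end is at or before both the latest
# snapshot timestamp and (now - retention window).

def collectable(slices, snapshot_ts, retention_hours, now):
    cutoff = min(snapshot_ts, now - retention_hours * 3600)
    return [sid for sid, (start, end) in slices.items() if end <= cutoff]

slices = {"s1": (0, 3600), "s2": (3600, 7200), "s3": (7200, 10800)}
# Latest snapshot at t=7200, retention 1h, current time t=14400:
print(collectable(slices, 7200, 1, 14400))  # → ['s1', 's2']
```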
Backup & Recovery Architecture
Ops Manager provides enterprise-grade backup with continuous point-in-time recovery, automated scheduling, and multiple restore options.
Security & Encryption
MongoDB Enterprise Advanced provides defense-in-depth security across authentication, authorization, encryption, and auditing.
Monitoring & Logging
Ops Manager provides deep monitoring of every MongoDB process with real-time dashboards, alerting, and log management.
Metrics Collected
Metric Retention & Granularity
| Time Range | Granularity | Retention |
|---|---|---|
| Last 48 hours | 1 minute | 2 days |
| Last 7 days | 5 minutes | 7 days |
| Last 90 days | 1 hour | 90 days |
| Last 2 years | 1 day | 730 days |
Logging Architecture
Configurable verbosity per component (0-5). Slow query logging via the profiler or slowOpThresholdMs (default 100ms).
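In mongod.conf terms, the verbosity and slow-query knobs above look roughly like this (a minimal sketch; the path and the choice of the replication component are illustrative):

```yaml
systemLog:
  destination: file
  path: /var/log/mongodb/mongod.log
  component:
    replication:
      verbosity: 2          # per-component verbosity, 0-5
operationProfiling:
  mode: slowOp              # profiler level 1: log slow operations only
  slowOpThresholdMs: 100    # default slow query threshold
```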
/var/log/mongodb-mms-automation/automation-agent.log β Goal state changes, process starts/stops, config changes.
monitoring-agent.log β Metric collection events, connection issues.
backup-agent.log β Oplog tailing status, sync progress.
/opt/mongodb/mms/logs/mms0.log β Application server logs (HTTP requests, API calls).
daemon.log β Backup daemon operations.
mms0-audit.log β Ops Manager user actions audit.
Syslog: systemLog.destination: syslog
File → Fluentd/Logstash: Ship JSON logs to ELK/Splunk
SNMP: Enterprise SNMP traps for integration with Nagios, Zabbix
High Availability: From Nodes to Datacenters
MongoDB EA HA isn't a single feature; it's a layered defense. Each level protects against a broader class of failure, and they all build on the same replica set foundation. A production deployment should address every level.
mongod crashes, runs OOM, or is killed: the replica set auto-elects a new primary in ~5-10s, with zero data loss thanks to journaling.
rs.initiate() and w: "majority" mechanics work identically whether members are on the same rack or spread across continents. MongoDB does not distinguish between "local HA" and "geo HA": it's all replica set replication. The only things that change are network latency, member placement, and election configuration.
Component HA Matrix
Every component in the EA stack must be resilient. Here's the HA strategy for each, applicable whether running in a single DC or across multiple.
| Component | HA Strategy | Min Nodes | Failover |
|---|---|---|---|
| Managed MongoDB RS | 3+ member replica set. Spread across AZs (single DC) or DCs (multi-DC). | 3 (5 for multi-DC) | Automatic election, ~5-10s within DC, ~10-30s cross-DC. |
| Ops Manager App Server | 2+ instances behind L7 load balancer (sticky sessions). Can span DCs. | 2 | LB routes to healthy instance. 0 downtime. |
| Application Database | Standard MongoDB RS. Should span DCs in multi-DC deployments. | 3 | Automatic election. ~5-10s single DC, ~10-30s cross-DC. |
| Backup Daemon | Active-passive across Ops Manager instances. | 2 (OM servers) | Auto failover to another OM instance. 1-2 min. |
| Blockstore / Oplog Store | MongoDB RS(s). Multiple stores for load. Should span DCs for DR. | 3 per RS | RS automatic election. Multiple stores for distribution. |
| MongoDB Agent | Systemd auto-restart. Ops Manager alerts on disconnect. | 1 per host | Agent down = no new automation. Running processes unaffected. |
| mongos (Sharded) | Stateless. Multiple instances behind app-level LB. | 2+ | Driver reconnects to available mongos. No state to lose. |
| Config Servers | 3-member RS (CSRS). Spread across DCs for sharded multi-DC. | 3 | Automatic election. Cluster read-only if primary lost during metadata ops. |
Production Deployment Topology
Whether running in one datacenter or three, the stack is the same; only the member placement changes.
The topology above protects against node and rack failures within a single DC. But what happens if the entire datacenter goes down? The same replica set just needs its members placed across DCs.
Extending to Multi-DC: Same Replica Set, Wider Placement
There is no separate "multi-DC mode." You take the same replica set and distribute members across datacenters. Elections, replication, and write concerns all work identically; the only difference is network latency between members.
- Election timeout: electionTimeoutMillis defaults to 10000 (10s). In cross-DC setups, increase it to 15-30s to avoid spurious elections from transient network blips between DCs.
- Member priorities: Give members in the preferred primary DC a high priority (e.g. priority: 10) and members in remote DCs a lower one (e.g. priority: 6). The member with the highest priority and the most current oplog wins elections.
- Durability: A DC failure can roll back unreplicated w:1 writes. w: "majority" ensures writes replicate across DCs before acknowledgment, so they survive any single DC failure with zero rollback. Only w:1 writes that haven't replicated will be rolled back.
What Happens When a Datacenter Fails
The election process is the same whether one node crashes or an entire DC goes dark. The only difference is how many members are lost at once.
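As a concrete sketch, a 5-member replica set spread 2+2+1 across three DCs with the priorities and election timeout discussed in this section could be described by a config document like the one below. Hosts and exact values are illustrative; in mongosh you would pass such a document to rs.reconfig() (shown here as a plain Python dict):

```python
# Illustrative 5-member, 3-DC replica set config: any single DC loss
# still leaves a voting majority (3 of 5).

rs_config = {
    "_id": "rs0",
    "members": [
        {"_id": 0, "host": "dc1-a:27017", "priority": 10},  # preferred primary DC
        {"_id": 1, "host": "dc1-b:27017", "priority": 10},
        {"_id": 2, "host": "dc2-a:27017", "priority": 6},   # remote DC
        {"_id": 3, "host": "dc2-b:27017", "priority": 6},
        {"_id": 4, "host": "dc3-a:27017", "priority": 6},   # tie-breaker DC
    ],
    "settings": {
        # Raised from the 10s default to tolerate cross-DC network blips.
        "electionTimeoutMillis": 20000,
    },
}

majority = len(rs_config["members"]) // 2 + 1
print(majority)  # → 3: losing any one DC (at most 2 members) keeps a majority
```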
- Detection: After electionTimeoutMillis (10s default), secondaries detect the primary is unreachable. This is the same mechanism as a single node failure.
- Recovery: Reads with the nearest preference already target the local DC. Applications resume within ~1-3s after the election.
- Rollback: w:1 writes not replicated before the failure are rolled back to a rollback file. All w:majority writes are intact; they were acknowledged across DCs before the failure.
Write Concern & Read Preference: The Knobs That Control Durability vs Latency
These settings work the same in single-DC and multi-DC. The difference is that in multi-DC, w: "majority" must wait for cross-DC replication, adding network latency to every acknowledged write.
| Write Concern | Single-DC Latency | Multi-DC Latency | Durability on DC Failure |
|---|---|---|---|
| w: 1 | ~1ms (local ack) | ~1ms (local ack) | ✗ May lose writes if primary DC fails before replication |
| w: "majority" | ~2-5ms (local RS) | ~50-100ms (cross-DC round-trip) | ✓ Zero data loss: survives any single DC failure |
| w: "majority", j: true | ~5-10ms | ~50-150ms | ✓ Maximum: survives simultaneous power loss at any DC |
| w: 3 (numeric) | ~2-5ms | Variable | ⚠ 3 copies, but may not guarantee cross-DC spread |
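The durability column follows directly from how many acknowledgments each write concern waits for. A bit of illustrative arithmetic for a 5-member replica set laid out 2+2+1 across three DCs (no driver involved, just the counting logic):

```python
# How many member acks each write concern requires on a 5-member RS.
def required_acks(w, members=5):
    if w == "majority":
        return members // 2 + 1
    return int(w)

assert required_acks(1) == 1           # local ack only: fast, but rollback risk
assert required_acks("majority") == 3  # in a 2+2+1 layout, 3 acks must
                                       # include at least one remote DC
assert required_acks(3) == 3           # also 3 copies, but nothing forces
                                       # them to span DCs
```

This is why w: "majority" guarantees cross-DC durability in a 2+2+1 layout while a numeric w: 3 does not: three acks chosen arbitrarily could all land in the two largest DCs.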
| Read Preference | Behavior | Best For |
|---|---|---|
| primary | Always reads from the primary. Cross-DC latency if the app is in a different DC. | Strong consistency required |
| primaryPreferred | Primary if available; secondary during failover. | Prefer consistency, tolerate stale reads during failover |
| nearest | Lowest-latency member. Always reads from the local DC if a member exists there. | ✓ Recommended for multi-DC: local reads everywhere |
| nearest + tag_sets | Nearest member matching tags (e.g., dc: "us-west"). | Explicit DC-aware routing, data locality |
| secondary | Only secondaries. Can target a local DC secondary for analytics. | Offload reads from primary for reporting |
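The nearest + tag_sets behavior can be sketched as a two-step selection: filter members by tags, then pick the lowest observed round-trip time. This mirrors driver server selection in spirit only; real drivers choose randomly among members within a latency window (localThresholdMS) rather than a strict minimum:

```python
# Toy model of 'nearest' read-preference selection with optional tag sets.
members = [
    {"host": "dc1-a", "tags": {"dc": "us-west"}, "rtt_ms": 1.2},
    {"host": "dc1-b", "tags": {"dc": "us-west"}, "rtt_ms": 1.5},
    {"host": "dc2-a", "tags": {"dc": "us-east"}, "rtt_ms": 62.0},
]

def select_nearest(members, tag_set=None):
    pool = [m for m in members
            if not tag_set
            or all(m["tags"].get(k) == v for k, v in tag_set.items())]
    return min(pool, key=lambda m: m["rtt_ms"])["host"] if pool else None

print(select_nearest(members))                     # → dc1-a (local DC wins)
print(select_nearest(members, {"dc": "us-east"}))  # → dc2-a (tag overrides RTT)
```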
The trade-off: w: "majority" adds ~50-100ms to every write. For latency-sensitive workloads, you have three options: (1) Accept w:1 with some RPO risk, (2) Use an Active-Standby pattern with mongosync, or (3) Use Zone Sharding to keep writes local.
Alternative Multi-DC Patterns
A 3-DC replica set is ideal but not always possible. Here are the alternatives and their trade-offs.
Making the Entire Stack Multi-DC
It's not just the data clusters β every supporting component should also span DCs for true datacenter-level resilience.
- Application Database: Run it as a replica set spanning DCs and write with w: "majority" so OM metadata survives a DC failure.
- Agents: Point mmsBaseUrl to a global LB endpoint so agents automatically reach whichever OM instance is healthy, regardless of which DC it's in.
Deployment Models
MongoDB Enterprise Advanced supports both traditional bare metal/VM deployments and modern Kubernetes-native deployments via the Enterprise Operator.
Comparison: Kubernetes vs Bare Metal / VM
| Aspect | Bare Metal / VM | Kubernetes |
|---|---|---|
| Deployment | RPM/DEB packages, manual or Ansible/Terraform | Enterprise Operator + CRDs (kubectl apply) |
| Process Lifecycle | Automation Agent manages mongod processes | Operator manages StatefulSets → Pods → mongod |
| Storage | Local SSD, SAN, NAS (full control) | PersistentVolumeClaims (PVC), StorageClass dependent |
| Networking | Static IPs, DNS, direct port access | Kubernetes Services, Headless Services, optional Ingress/LoadBalancer |
| Scaling | Provision new VMs, install agent, add to topology | kubectl patch to change replica count. Operator handles the rest |
| Upgrades | Ops Manager rolling upgrade via Automation Agent | Operator performs rolling update of StatefulSet pods |
| HA / Anti-affinity | Manual rack/AZ placement | Pod anti-affinity rules, topology spread constraints |
| Resource Isolation | Dedicated hardware, cgroups | Resource limits/requests, QoS classes, node selectors |
| TLS/Cert Management | Manual cert deployment or Vault integration | cert-manager integration, automatic rotation |
| Monitoring | Ops Manager Agent (native) | Ops Manager Agent in sidecar + Prometheus endpoints |
| Backup | Ops Manager Backup Agent (native) | Ops Manager Backup Agent in sidecar |
| Ops Manager itself | Installed on dedicated VMs | Can run in K8s via MongoDBOpsManager CRD |
| Best For | Max control, air-gapped, regulatory, legacy infra | Cloud-native, GitOps, auto-scaling, dev/staging |
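On the Kubernetes side, a managed replica set is declared as a custom resource and reconciled by the Enterprise Operator. A minimal sketch (names like my-replica-set, my-project, and my-credentials are placeholders; verify field names against your operator version's CRD reference):

```yaml
apiVersion: mongodb.com/v1
kind: MongoDB
metadata:
  name: my-replica-set
spec:
  type: ReplicaSet
  members: 3
  version: 6.0.5-ent          # Enterprise build
  opsManager:
    configMapRef:
      name: my-project        # ConfigMap with Ops Manager URL + project
  credentials: my-credentials # Secret with Ops Manager API key
```

Applying this with kubectl apply is the Kubernetes analogue of defining the topology in Ops Manager: the Operator creates the StatefulSet and the agent converges each pod to the goal state.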
When to Use Which?
Bare Metal / VM is the better fit for:
- Air-gapped / disconnected environments
- Strict regulatory requirements (gov, financial)
- Maximum control over hardware and networking
- Existing VM infrastructure (VMware, OpenStack)
- Very high-performance workloads needing NVMe/local SSD
- Team has limited Kubernetes expertise
Kubernetes is the better fit for:
- Cloud-native infrastructure strategy
- GitOps / Infrastructure-as-Code workflows
- Rapid provisioning of dev/staging/test clusters
- Auto-healing and self-service for developers
- Multi-cloud or hybrid deployments
- Integration with service mesh (Istio, Linkerd)