A deep dive into MongoDB's specialized storage for time-stamped data: bucketing, compression, and internal architecture
Understanding what Time Series collections are and when to use them
The Problem: Time series data (IoT sensors, metrics, logs) generates massive volumes - often millions of documents per day. Regular MongoDB collections store each document separately, leading to bloated storage, oversized indexes, and slow time-range scans.
The Solution: Time Series collections automatically bucket, compress, and optimize this data - giving you 90%+ storage savings and 10x faster queries with zero application changes.
Optimized for data that arrives sequentially over time: IoT sensors, metrics, logs, stock prices, and any timestamped events.
MongoDB groups documents into "buckets" based on time intervals and metadata, dramatically reducing storage overhead.
Data within buckets is stored in columnar format with delta encoding, achieving up to 90%+ compression ratios.
Temperature, humidity, pressure readings
Stock prices, trades, order book
CPU, memory, latency metrics
Application logs, audit trails
How buckets, granularity, data flow, and compression work together
Buckets are the secret sauce. Instead of storing 1 million individual documents, MongoDB groups them into ~1,000 buckets. Each bucket holds documents with the same metadata within a time window. This means: fewer documents to scan, better compression, and indexes that are 1000x smaller. Understanding bucket structure helps you choose the right granularity and metadata fields.
MongoDB automatically determines bucket boundaries based on the granularity setting (seconds, minutes, hours). Documents with the same metadata that arrive within the same time window go into the same bucket.
"seconds" granularity does NOT mean buckets close every second!
Granularity controls the bucket time span and timestamp rounding, not how often buckets close.
A bucket closes when ANY of these conditions is met:
- the bucket reaches its maximum measurement count (~1000 documents by default)
- the bucket reaches its maximum size (~125 KB by default)
- an incoming timestamp falls outside the bucket's maximum time span (1 hour for "seconds" granularity)
- memory pressure forces the bucket catalog to close and flush it
NOT every second! The bucket stays open until it fills up or hits the 1-hour time boundary.
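The rounding-versus-span distinction can be sketched in plain JavaScript (this is an illustrative model, not MongoDB's internal code; the rounding, span, and 1000-document values mirror the defaults described here):

```javascript
// Hypothetical sketch of granularity semantics (not MongoDB internals).
// Each granularity rounds the bucket's start time down and allows a max span.
const GRANULARITY = {
  seconds: { roundMs: 60 * 1000,    maxSpanMs: 3600 * 1000 },      // round to minute, 1h span
  minutes: { roundMs: 3600 * 1000,  maxSpanMs: 86400 * 1000 },     // round to hour, 24h span
  hours:   { roundMs: 86400 * 1000, maxSpanMs: 30 * 86400 * 1000 } // round to day, 30d span
};
const MAX_DOCS_PER_BUCKET = 1000;

function bucketStart(ts, granularity) {
  const { roundMs } = GRANULARITY[granularity];
  return new Date(Math.floor(ts.getTime() / roundMs) * roundMs);
}

function shouldClose(bucket, incomingTs, granularity) {
  const { maxSpanMs } = GRANULARITY[granularity];
  return bucket.count >= MAX_DOCS_PER_BUCKET ||       // bucket is full
         incomingTs - bucket.start >= maxSpanMs;      // timestamp outside the span
}

const b = { start: bucketStart(new Date("2024-01-01T10:17:42Z"), "seconds"), count: 999 };
console.log(b.start.toISOString()); // start rounded down to the minute, not the second
console.log(shouldClose(b, new Date("2024-01-01T10:30:00Z"), "seconds")); // false: still open
```

Note that a reading arriving 13 minutes later keeps the bucket open: only filling up or crossing the span boundary closes it.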
Under the hood, MongoDB stores time series data in a special system.buckets.<collection> collection. Each bucket is a single document with a specific structure:
{
"_id": ObjectId("..."), // Encodes bucket start time
"control": {
"version": 1, // Bucket format version
"min": {
"_id": ObjectId("..."), // Min values for pruning
"timestamp": ISODate("2024-01-01T10:00:00Z"),
"temperature": 22.1
},
"max": {
"timestamp": ISODate("2024-01-01T10:59:59Z"),
"temperature": 28.7
},
"closed": false, // Is bucket accepting writes?
"count": 847 // Number of measurements
},
"meta": { // Your metaField value
"sensorId": "sensor_001",
"location": "NYC"
},
"data": { // Columnar compressed data
"timestamp": { "0": ISODate(...), "1": ISODate(...), ... },
"temperature": { "0": 23.5, "1": 23.6, "2": 23.7, ... },
"humidity": { "0": 45, "1": 46, "2": 44, ... }
}
}
Instead of storing each measurement as a separate document, values are stored column-by-column, enabling efficient compression:
Compression directly impacts your costs. At scale, storing 1TB of raw sensor data might cost $100/month. With 90% compression, that drops to $10/month. Time series collections achieve this automatically through delta encoding - no application changes needed. Understanding how it works helps you design schemas that compress even better.
Time series data is highly compressible because consecutive values are often similar. MongoDB uses delta encoding to store only the differences between values:
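A toy illustration of the idea (plain JavaScript, not MongoDB's actual binary codec): storing the first value plus per-step differences turns a slowly changing series into small, repetitive numbers that compress far better than the absolute values:

```javascript
// Toy delta encoder: keep the first value, then store only the differences.
// MongoDB's real columnar codec is binary; this just demonstrates the principle.
function deltaEncode(values) {
  return values.map((v, i) => (i === 0 ? v : +(v - values[i - 1]).toFixed(6)));
}

function deltaDecode(deltas) {
  const out = [];
  deltas.forEach((d, i) => out.push(i === 0 ? d : +(out[i - 1] + d).toFixed(6)));
  return out;
}

const temps = [23.5, 23.6, 23.7, 23.7, 23.8];
const encoded = deltaEncode(temps);
console.log(encoded);              // [23.5, 0.1, 0.1, 0, 0.1] - tiny, repetitive deltas
console.log(deltaDecode(encoded)); // round-trips back to the original readings
```

The repeated 0.1 deltas are exactly what run-length encoding and block compression then exploit.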
Deep dive into how MongoDB manages buckets, memory, and storage
Your app writes documents with timestamp, metadata, and measurements
Routes documents, manages bucket lifecycle, handles compression
Creates, closes, and maintains buckets based on time + metadata
Applies delta encoding, RLE, and zstd compression
Internal collection storing compressed bucket documents
Time-based clustering for efficient range queries
Block-compressed storage with efficient I/O patterns
Memory management directly affects write performance. MongoDB keeps "open" buckets in RAM for fast inserts. If you have too many unique metadata combinations, you'll exhaust the bucket catalog and trigger constant disk I/O. Understanding the lifecycle helps you design metadata fields that don't explode your memory usage.
MongoDB maintains an in-memory Bucket Catalog that tracks all open buckets. This is crucial for high-performance writes - instead of searching disk for the right bucket, MongoDB keeps active buckets readily accessible in RAM.
When memory pressure increases, older buckets are closed and flushed to disk.
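A hypothetical sketch of that catalog (illustrative JavaScript, not MongoDB's implementation): open buckets are keyed by the metaField value, which is why every distinct metadata combination holds its own open bucket in RAM - and why high-cardinality metadata causes constant evictions:

```javascript
// Hypothetical in-memory bucket catalog: metaField value -> open bucket.
// When the catalog is full, an older bucket is closed and flushed to disk.
class BucketCatalog {
  constructor(maxOpenBuckets) {
    this.open = new Map();
    this.maxOpenBuckets = maxOpenBuckets;
    this.closed = 0; // buckets evicted under memory pressure
  }

  insert(doc) {
    const key = JSON.stringify(doc.metadata);
    let bucket = this.open.get(key);
    if (!bucket) {
      if (this.open.size >= this.maxOpenBuckets) {
        // Memory pressure: close + flush the oldest open bucket.
        const oldestKey = this.open.keys().next().value;
        this.open.delete(oldestKey);
        this.closed++;
      }
      bucket = { meta: doc.metadata, measurements: [] };
      this.open.set(key, bucket);
    }
    bucket.measurements.push(doc);
  }
}

const catalog = new BucketCatalog(2); // tiny limit to show the effect
for (const id of ["s1", "s2", "s3"]) {
  catalog.insert({ timestamp: new Date(), metadata: { sensorId: id }, temp: 20 });
}
console.log(catalog.open.size, catalog.closed); // 2 open buckets, 1 forced close
```

With three distinct sensor IDs and room for only two open buckets, the third insert forces an eviction - the same pressure pattern that exploding metadata cardinality creates at scale.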
Every bucket transitions through distinct states. Understanding this lifecycle is key to optimizing time series performance.
When a bucket is closed, it flows through MongoDB's WiredTiger storage engine. Here's the complete journey from memory to disk:
From 1000 KB down to 90 KB. This is why time series collections are so storage-efficient! The combination of columnar storage + delta encoding + RLE + block compression creates massive savings.
Creating collections, querying data, indexes, and aggregations
// Create a time series collection
db.createCollection("sensorData", {
  timeseries: {
    timeField: "timestamp",   // Required: field containing timestamp
    metaField: "metadata",    // Optional: field for grouping
    granularity: "minutes"    // "seconds" | "minutes" | "hours"
  },
  expireAfterSeconds: 86400 * 30  // Optional: auto-delete after 30 days
});

// Insert a document
db.sensorData.insertOne({
  timestamp: new Date(),
  metadata: { sensorId: "sensor_001", location: "building_A" },
  temperature: 23.5,
  humidity: 45.2,
  pressure: 1013.25
});
| Granularity | Bucket Time Span | Best For | Max Docs/Bucket |
|---|---|---|---|
| seconds | 1 hour | High-frequency data (100+ docs/min) | ~1000 |
| minutes | 24 hours | Medium frequency (1-100 docs/min) | ~1000 |
| hours | 30 days | Low frequency (<1 doc/min) | ~1000 |
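A rough estimate of bucket counts follows directly from the table: a bucket closes at ~1000 documents or at the granularity's max time span, whichever comes first. This hypothetical helper (assuming a steady arrival rate per metadata series) shows the arithmetic:

```javascript
// Rough per-series bucket-count estimate (hypothetical helper, steady rate assumed).
const SPAN_SECONDS = { seconds: 3600, minutes: 86400, hours: 2592000 };
const MAX_DOCS = 1000; // default max measurements per bucket

function bucketsPerDay(docsPerMinute, granularity) {
  const docsPerSpan = docsPerMinute * (SPAN_SECONDS[granularity] / 60);
  const spansPerDay = 86400 / SPAN_SECONDS[granularity];
  // Each span needs at least one bucket, plus extras once 1000 docs is exceeded.
  return Math.ceil(Math.max(docsPerSpan / MAX_DOCS, 1) * spansPerDay);
}

console.log(bucketsPerDay(100, "seconds")); // 6000 docs/hour -> 6 buckets/hour -> 144/day
console.log(bucketsPerDay(1, "minutes"));   // 1440 docs/day -> 2 buckets/day
```

This is why granularity should match frequency: 100 docs/min under "hours" granularity would burn through the 1000-doc limit in 10 minutes, creating far more buckets per span than intended.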
Optimal case: Documents arrive in chronological order. MongoDB efficiently appends to the current open bucket.
✓ Fast: Single bucket lookup, append operation
Out-of-order writes: When documents arrive with timestamps outside the current bucket's range, MongoDB must find or create the appropriate bucket.
⚠️ Creates additional buckets, reducing compression efficiency
MongoDB 6.0+: Closed buckets can be reopened if a new document falls within their time range. This improves handling of late-arriving data.
Late data → Creates new bucket
❌ More buckets, less compression
Late data → Reopens existing bucket
✓ Better bucket utilization
In MongoDB 6.3+, you can fine-tune bucket time spans with bucketMaxSpanSeconds for custom granularity beyond seconds/minutes/hours.
Time-based queries skip entire buckets using the control.min/max bounds. A query for "last 24 hours" only reads recent buckets, not the entire collection!
Bucket pruning is why time series queries are fast. When you query "last 24 hours", MongoDB doesn't scan all your data.
Each bucket has control.min and control.max
fields that let MongoDB skip entire buckets that can't possibly contain matching documents.
A year of data might have 8,760 buckets, but a "last hour" query only touches 1-2 buckets.
MongoDB parses the query and identifies filterable fields.
Index returns bucket IDs that MIGHT contain matching documents (based on metadata match).
Key insight: MongoDB checks each bucket's control.min.timestamp and control.max.timestamp.
If the query range doesn't overlap → bucket is skipped entirely!
Only matching buckets are decompressed. Documents are reconstructed from columnar format and filtered.
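The pruning check itself is a simple interval-overlap test against each bucket's control bounds. This sketch (illustrative only - MongoDB performs this inside the query planner) shows how buckets outside the query range are skipped without ever being decompressed:

```javascript
// A bucket can be skipped whenever [control.min.timestamp, control.max.timestamp]
// does not overlap the query's time range.
function overlaps(bucket, queryStart, queryEnd) {
  return bucket.control.min.timestamp <= queryEnd &&
         bucket.control.max.timestamp >= queryStart;
}

const buckets = [
  { control: { min: { timestamp: new Date("2024-01-01T00:00:00Z") },
               max: { timestamp: new Date("2024-01-01T00:59:59Z") } } },
  { control: { min: { timestamp: new Date("2024-01-01T01:00:00Z") },
               max: { timestamp: new Date("2024-01-01T01:59:59Z") } } },
  { control: { min: { timestamp: new Date("2024-01-01T02:00:00Z") },
               max: { timestamp: new Date("2024-01-01T02:59:59Z") } } },
];

const start = new Date("2024-01-01T01:30:00Z");
const end = new Date("2024-01-01T02:15:00Z");
const toScan = buckets.filter((b) => overlaps(b, start, end));
console.log(toScan.length); // only 2 of 3 buckets need decompression
```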
Indexes on time series collections are 1000x smaller than on regular collections because they index buckets, not individual documents.
A collection with 10 million documents might have only 10,000 buckets - so your index has 10,000 entries instead of 10 million.
But they work differently: indexes are created on the internal system.buckets.* collection, and query planning uses bucket metadata for pruning.
Time series collections support secondary indexes, but they work differently from regular collections:
Created automatically on the internal bucket collection
// Index on metadata subfields
db.sensors.createIndex({ "metadata.region": 1 })

// Compound with time (for range + filter)
db.sensors.createIndex({ "metadata.sensorType": 1, "timestamp": -1 })

// Index on measurement fields
db.sensors.createIndex({ "temperature": 1 })
Indexes are created on system.buckets.*, not the view
~1000x fewer index entries than regular collections
Bucket bounds enable efficient range pruning
⚠️ Important Distinction: The B-Tree Index (indexing structure) and Buckets (time series storage) are separate concepts. The index stores pointers TO bucket documents, not the buckets themselves.
MongoDB provides powerful aggregation stages designed specifically for time series analysis:
Fill gaps in time series data by generating documents for missing time intervals.
{ $densify: {
field: "timestamp",
range: {
step: 1,
unit: "hour",
bounds: "full"
}
}}
Fill null values using linear interpolation or last observed value (LOCF).
{ $fill: {
sortBy: { timestamp: 1 },
output: {
temperature: {
method: "linear"
}
}
}}
Compute moving averages, running totals, and rankings over time windows.
{ $setWindowFields: {
sortBy: { timestamp: 1 },
output: {
movingAvg: {
$avg: "$temp",
window: { range: [-1, 0], unit: "hour" }
}
}
}}
Understanding when to use time series collections and their limitations
| Aspect | Regular Collection | Time Series Collection |
|---|---|---|
| Document Storage | One document = one BSON document | Many documents = one bucket document |
| Field Storage | Row-oriented (all fields together) | Column-oriented (fields stored separately) |
| Compression | Block-level only (WiredTiger) | Delta + RLE + Block compression |
| Index Entries | One per document | One per bucket (~1000x fewer) |
| Time Range Queries | Scans matching documents | Bucket pruning + clustered access |
| Write Pattern | Individual inserts | Batched into bucket updates |
| Storage for 1M docs | ~500 MB | ~50 MB (90% savings) |
| Feature | MongoDB | InfluxDB | TimescaleDB |
|---|---|---|---|
| Data Model | Document (BSON) | Line Protocol | Relational (SQL) |
| Query Language | MQL + Aggregation | Flux / InfluxQL | SQL |
| Schema Flexibility | ⭐⭐⭐⭐⭐ | ⭐⭐⭐ | ⭐⭐ |
| Joins & Relations | $lookup, embedding | Limited | Full SQL joins |
| Ecosystem | Atlas, Charts, Search | Telegraf, Grafana | PostgreSQL ecosystem |
| Best For | Mixed workloads | Pure metrics | SQL shops |
| Compression | ~90% | ~95% | ~85% |
MongoDB lets you store time series data alongside your operational data in one database. No need for a separate TSDB + complex ETL pipelines. Combine sensor readings with device metadata, user profiles, and application state in unified queries.
Time series collections have some restrictions to be aware of. Understanding these helps you design better schemas:
Once inserted, the timestamp field cannot be updated. You must delete and re-insert to change it.
The metadata field is immutable after insertion. Plan your metadata structure carefully upfront.
Time series collections don't support multi-document transactions. Use for append-only workloads.
Change streams weren't supported until MongoDB 6.0. Now available with some limitations.
Individual deletes must decompress buckets. Use TTL instead for automatic expiration.
Can't change granularity or field mappings after creation. Plan schema carefully.
Time series collections work best for append-only, immutable data. If you need frequent updates, consider a regular collection with appropriate indexes.
Best practices, calculators, and interactive demos
Match granularity to your data frequency. Higher frequency = finer granularity for optimal bucketing.
Group related measurements together. High-cardinality metadata creates more buckets = less compression.
Insert documents in time order when possible. Out-of-order inserts may create additional buckets.
Create indexes on metaField sub-fields if you query by those values frequently.
Set expireAfterSeconds to automatically remove old data and keep storage costs down.
Always include time range filters in queries to benefit from bucket pruning optimization.
Estimate your storage savings by switching to time series collections:
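A minimal version of that estimate (the inputs - average raw document size, retention, and an expected compression ratio - are assumptions you supply, not MongoDB constants):

```javascript
// Minimal storage-savings estimator. The 0.9 default mirrors the ~90%
// compression figure discussed above; adjust it for your own data.
function estimateSavings({ docsPerDay, avgDocBytes, retentionDays, compressionRatio = 0.9 }) {
  const rawBytes = docsPerDay * avgDocBytes * retentionDays;
  const compressedBytes = rawBytes * (1 - compressionRatio);
  const toGB = (b) => +(b / 1024 ** 3).toFixed(2);
  return { rawGB: toGB(rawBytes), timeSeriesGB: toGB(compressedBytes) };
}

console.log(estimateSavings({
  docsPerDay: 1_000_000,  // 1M sensor readings per day
  avgDocBytes: 500,       // ~500 bytes per raw BSON document
  retentionDays: 30       // one month of retained data
}));
// -> { rawGB: 13.97, timeSeriesGB: 1.4 } at the ~90% compression described above
```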