What is a MongoDB source in Streamkap?
A MongoDB source is a connector that captures change data (CDC) from your MongoDB deployment and streams it into Streamkap pipelines in near real time.
What MongoDB versions are supported as sources?
- MongoDB 4.0+ for basic CDC; 6.0+ for advanced features like post-images and full document lookups.
- Compatible with MongoDB 3.6+ in some modes.
What MongoDB deployments are supported?
- Self-hosted (on-prem/VM).
- MongoDB Atlas.
- Amazon DocumentDB (MongoDB-compatible).
- Streamkap also supports sharded clusters and replica sets, with automatic handling of shard additions/removals and membership changes.
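For reference, a self-hosted replica-set connection typically uses a URI of this shape (the hostnames, user, and replica-set name below are placeholders, not Streamkap-specific values):

```
mongodb://streamkap_user:<password>@host1:27017,host2:27017,host3:27017/?replicaSet=rs0&tls=true
```

Listing all replica-set members in the URI lets the driver discover the current primary even after failovers.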
What are the key features of MongoDB sources in Streamkap?
- CDC: Change streams for inserts/updates/deletes; supports oplog for resume tracking.
- Snapshots: Ad-hoc/initial backfills using incremental or blocking methods; phased chunking for minimal impact.
- Schema Evolution: Automatic handling of document changes; field renaming/exclusion.
- Data Types: Supports integers, floats, strings, dates, arrays, objects, binary (configurable), JSON; extended JSON for identifiers.
- Ingestion Modes: Inserts (append) or upserts.
- Security: SSL, authentication, access control.
- Monitoring: Latency, lag, queue sizes in-app; heartbeat messages.
- Streamkap adds transaction metadata, filtering by collections/fields, and aggregation pipelines.
How does CDC work for MongoDB sources?
- Streamkap opens a MongoDB change stream to capture inserts, updates, and deletes as they happen. Progress is tracked with oplog resume tokens, so the connector can pick up from its last position after a restart or interruption.
How do snapshots work for MongoDB sources?
- Trigger ad-hoc at source/table level. Methods: incremental (phased, chunked by ID, default 1,024 documents) or blocking (temporarily pauses streaming). Uses watermarking to track progress; supports partial snapshots via conditions.
- Modes: initial (default), always, initial_only, no_data, when_needed, configuration_based, custom. Streamkap simplifies triggering via the UI.
What are heartbeats and how do they work?
Heartbeats are periodic messages that keep connector offsets advancing even when no data changes occur. They are especially important for:
- Low-traffic or intermittent databases
- Custom aggregation pipelines that filter out many events
- Scenarios requiring reliable liveness detection
Layer 1: Connector heartbeats (enabled by default)
The connector periodically emits heartbeat messages to an internal topic, even when no actual data changes are detected. This keeps offsets fresh and prevents staleness. No configuration is necessary for this layer; it is enabled automatically, and we recommend keeping it on for all deployments.
Layer 2: Source database heartbeats (recommended)
Enable this layer when:
- Your database has low or intermittent traffic
- You use custom aggregation pipelines that filter out many events
- You need reliable liveness monitoring
Create the heartbeat collection
Create a collection named streamkap_heartbeat in the streamkap database. Using the MongoDB Atlas UI:
- Navigate to your cluster and click Browse Collections
- Click Create Database or select an existing database
- Create a collection named streamkap_heartbeat
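On a self-hosted deployment, the same step can be sketched in mongosh (the streamkap database and collection names follow the convention above):

```javascript
// mongosh — create the heartbeat collection in the streamkap database
const hbDb = db.getSiblingDB("streamkap");
hbDb.createCollection("streamkap_heartbeat"); // collection the heartbeat trigger will write to
```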
Grant permissions to the Streamkap user
Grant the Streamkap user read access to the heartbeat collection. Using the MongoDB Atlas UI:
- Go to Security > Database Access
- Edit the streamkap_user
- Under Specific Privileges, add: read@streamkap.streamkap_heartbeat
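On a self-hosted deployment, a minimal mongosh sketch of the equivalent grant (assuming streamkap_user was created in the admin database; database-level read is slightly broader than the collection-level Atlas privilege above):

```javascript
// mongosh — grant read on the streamkap database to the Streamkap user
const admin = db.getSiblingDB("admin");
admin.grantRolesToUser("streamkap_user", [
  { role: "read", db: "streamkap" } // covers streamkap_heartbeat
]);
```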
Create an Atlas Scheduled Trigger
- In MongoDB Atlas, go to App Services (or Triggers in the left menu)
- Click Create a Trigger
- Select Scheduled trigger type
- Configure the trigger:
  - Name: streamkap_heartbeat_trigger
  - Schedule Type: Basic
  - Repeat once by: Minute
  - Every: 1 minute(s)
- In the Function section, select Function and create a new function:
- Click Save
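The function body is not shown above; as a sketch, an Atlas Function that writes a heartbeat document might look like the following (the data source name "mongodb-atlas" is the Atlas default, and the single-document upsert keeps the collection from growing):

```javascript
// Atlas Function sketch — upsert one heartbeat document per scheduled run
exports = async function () {
  const coll = context.services
    .get("mongodb-atlas")                // linked data source name (assumption: Atlas default)
    .db("streamkap")
    .collection("streamkap_heartbeat");
  // Upsert a single fixed-_id document so the collection stays one document in size
  await coll.updateOne(
    { _id: "heartbeat" },
    { $set: { last_beat: new Date() } },
    { upsert: true }
  );
};
```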
Cost impact of a one-minute schedule:
- ~1,440 requests/day (60/hour × 24 hours)
- Daily free tier: 50,000 requests; this heartbeat uses only ~3% of the free allowance
- Cost if exceeding the free tier: $0.000002 per request
What data types are supported?
- Basics: Integers (INT32/64), floats (FLOAT32/64), strings, dates/timestamps.
- Advanced: Arrays, objects (STRUCT/Tuple), binary (BYTES/base64/hex), decimals, JSON (STRING/io.debezium.data.Json).
- Identifiers: Integer, float, string, document, ObjectId, binary (extended JSON strict mode).
- Unsupported: Inconsistent nested structures without preprocessing; non-UTF8; oversized BSON (strategies: fail/skip/split in 6.0.9+).
How to monitor for MongoDB sources?
- Streamkap surfaces latency, lag, and queue-size metrics in-app, and heartbeat messages confirm the pipeline is live even when the source is quiet.
What are common limitations?
- Standalone servers unsupported (convert to replica set)
- Oplog purging during downtime may lose events
- BSON size limits (fail/skip/split)
- No transactions in older versions
- General: Sharded clusters need careful config; incremental snapshots require stable primary keys (non-strings recommended)
How to handle deletes?
- MongoDB change streams emit a delete event containing the document key when a document is removed, so deletes are captured alongside inserts and updates. Note that delete events do not include the full document unless pre-images are enabled (MongoDB 6.0+).
What security features are available?
- SSL/TLS-encrypted connections, username/password authentication, and MongoDB role-based access control; grant the Streamkap user only the read privileges it needs.
Troubleshooting common issues
- Oplog Buildup: Monitor retention; resume from last position.
- Connection Failures: Verify firewall, SSL, authentication.
- Missing Events: Check include/exclude lists; resnapshot.
- Streamkap-Specific: Check logs for resume token issues.
Can CDC capture database Views and other virtual objects?
CDC captures changes by reading the database transaction log (binlog, WAL, oplog, redo log, etc.). Views are query-time computations over base tables: they don't store data, so they don't generate transaction log entries. When you query a view, the database engine simply executes the underlying SELECT statement against the base tables.
What cannot be captured:
- Views: Virtual collections defined by aggregation pipelines, no physical storage or oplog entries
- System Collections (system.*, admin.*, config.*): Metadata and internal state, not user data
- Time Series Collections (MongoDB 5.0+): These use a specialized internal storage format optimized for time-stamped data (IoT sensors, metrics, logs). Change streams are NOT supported on time series collections because the bucketing storage engine compresses and reorganizes documents internally, making individual document change tracking incompatible with the oplog-based change stream mechanism. Solution: If you need CDC on time series data, store it in regular collections (with appropriate indexes on timestamp fields), or use aggregation pipelines to periodically export data snapshots.
- On-Demand Materialized Views ($merge, $out results): Generated data, not original sources
- Capped Collections (fixed-size collections): These fixed-size collections automatically overwrite oldest documents when full. CDC can capture them, but there’s risk of missing events. If your connector is offline or falls behind, and the oplog position it needs to resume from gets overwritten by newer operations, those change events are lost forever. Capped collections are typically used for high-throughput logs where data loss is acceptable. Recommendation: For mission-critical data requiring guaranteed capture, use regular collections instead.
Configure CDC on the underlying base tables that power your views. The view logic can be recreated in your destination or transformation layer.
MongoDB-specific notes:
- Aggregation pipelines on views: Capture the source collections and apply the pipeline logic downstream
- Standalone MongoDB servers: Not supported for CDC at all—must use a replica set (minimum 1 member)
- Time Series Collections (MongoDB 5.0+): As noted above, the internal bucketing mechanism groups documents by time windows and metadata, then compresses them, so individual document changes cannot be tracked and change streams are unavailable. Workaround: Use regular collections with compound indexes on {metadata_field: 1, timestamp: 1} for CDC-compatible time series data.
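The workaround can be sketched in mongosh (database, collection, and field names below are placeholders):

```javascript
// mongosh — regular collection with a CDC-compatible compound index
const iotDb = db.getSiblingDB("iot");          // placeholder database name
iotDb.createCollection("sensor_readings");     // regular collection, so change streams work
iotDb.sensor_readings.createIndex(
  { sensor_id: 1, ts: 1 }                      // metadata field first, then timestamp
);
```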
If you have a view active_users_summary created from the users collection with filters and projections, capture the users collection instead, then apply the same aggregation logic in your destination.
ChangeStreamFatalError (Error 280) — resume token expired
This error means the connector's resume token has been purged from the oplog, typically after extended downtime. To recover:
- Contact Streamkap support for an offset reset
- After the offset reset, trigger a snapshot to backfill any missed data
- Increase oplog retention to 48 hours or more to prevent recurrence
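On a self-hosted replica set, raising retention can be sketched in mongosh (the 48-hour value matches the recommendation above; the size in MB is an assumption to tune for your write volume):

```javascript
// mongosh — run against the replica set primary
const admin = db.getSiblingDB("admin");
admin.runCommand({
  replSetResizeOplog: 1,
  size: 16384,            // oplog size in MB (assumption; tune for your workload)
  minRetentionHours: 48   // keep at least 48 hours of oplog (MongoDB 4.4+)
});
```

On Atlas, the minimum oplog window is configured in the cluster settings rather than with this command.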
Oplog retention recommendations
Keep at least 48 hours of oplog retention so the connector can resume after maintenance windows or outages without change events being purged.
Best practices for MongoDB sources
- Use replica sets (min 3 members for production)
- Enable pre/post-images for full updates
- Limit collections to needed ones
- Test snapshots in staging
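Enabling pre/post-images (the second bullet above) is done per collection; a mongosh sketch on MongoDB 6.0+ (database and collection names are placeholders):

```javascript
// mongosh — enable pre- and post-images for change streams (MongoDB 6.0+)
const appDb = db.getSiblingDB("mydb");   // placeholder database name
appDb.runCommand({
  collMod: "orders",                     // placeholder collection name
  changeStreamPreAndPostImages: { enabled: true }
});
```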
Planning a database version upgrade?