Streamkap Setup

To set up the connector, you will need to gather connection details and configure your Databricks cluster. Log in to your Databricks account and then follow the steps below.

Get connection details

Streamkap connects to Databricks via a JDBC URL. You can use either an All-Purpose Compute or a SQL Warehouse as the compute resource.

Option A: All-Purpose Compute

  1. Open the Compute page from the sidebar and choose your cluster
  2. Click on Advanced Options
  3. Open the JDBC/ODBC tab
  4. Copy the JDBC Connection URL

Option B: SQL Warehouse

A SQL Warehouse can automatically scale across multiple Spark clusters to handle concurrent workloads, but is generally more expensive than an All-Purpose Cluster. To get the JDBC URL for a SQL Warehouse:
  1. Open the SQL Warehouses page from the sidebar
  2. Select your warehouse
  3. Open the Connection Details tab
  4. Copy the JDBC URL
For both options, you can append ConnCatalog=<your catalog name> to the JDBC URL to select a catalog other than the default.
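
For example, a JDBC URL for a SQL Warehouse with the catalog parameter appended might look like the following (the hostname, HTTP path, and catalog name here are placeholders; substitute the values you copied above):

    jdbc:databricks://dbc-a1b2c3d4-e5f6.cloud.databricks.com:443/default;transportMode=http;ssl=1;AuthMech=3;httpPath=/sql/1.0/warehouses/abc123def456;ConnCatalog=streamkap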

Generate an access token

To generate a Databricks personal access token for Streamkap:
  1. Open the Settings page from the sidebar and then User Settings
  2. Open the Personal Access Tokens tab
  3. Click + Generate New Token
  4. (Optional) Enter a comment and change the token lifetime
  5. Click Generate
  6. Copy the access token

Create a temporary directory

  1. Create a tmp directory on the Databricks File System (DBFS)
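
If you are working in a notebook, this is a one-line command using dbutils (the directory path is an assumption; use whichever path you plan to configure in Streamkap):

    # Create the staging directory on DBFS (path is a placeholder)
    dbutils.fs.mkdirs("dbfs:/tmp/streamkap")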

How it works

As data is streamed from the source into topics (think of them as partitioned tables), the Databricks Sink connector will:
  • Check whether tables for the topics exist in Databricks and create them if they do not
  • Automatically handle schema evolution when the source schema changes (e.g. new columns, data type changes)
  • Stream change data into Parquet files, upload them to the tmp directory on the Databricks File System (DBFS), and then:
    • Load the data into the target table using a bulk COPY INTO statement (see the sketch after this list)
    • Clean up the Parquet files
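
As a rough illustration of the load step, the bulk import can be expressed with a COPY INTO statement along these lines (the table name and staging path are placeholders; the exact SQL Streamkap generates may differ):

    -- Bulk-load staged Parquet files into the target Delta table
    COPY INTO streamkap.orders
    FROM 'dbfs:/tmp/streamkap/orders/'
    FILEFORMAT = PARQUET
    COPY_OPTIONS ('mergeSchema' = 'true');  -- tolerate new columns from schema evolution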

Ingestion Modes

Streamkap supports two ingestion modes for writing data to Databricks Delta Lake: Upsert and Append.

Upsert

Upsert mode uses a MERGE INTO statement to insert new records and update existing ones based on the primary key columns from the source table.
  • New records (no matching primary key in the target) are inserted
  • Existing records (matching primary key) are updated with the latest values
  • Deleted records (when hard delete is enabled) are physically removed from the target table
  • Out-of-order protection: Streamkap tracks record timestamps and offsets to ensure older records never overwrite newer data
Upsert is the recommended mode for most use cases, as it keeps your target table in sync with the source and handles updates and deletes automatically.
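
To make the behavior concrete, the merge logic is roughly equivalent to a statement like this (the table, key, and delete-marker column names are placeholders; Streamkap's generated SQL may differ):

    MERGE INTO streamkap.customers AS t
    USING staged_updates AS s
      ON t.id = s.id                          -- match on the primary key
    WHEN MATCHED AND s.__deleted = 'true'     -- hard delete enabled: remove the row
      THEN DELETE
    WHEN MATCHED                              -- existing key: overwrite with the latest values
      THEN UPDATE SET *
    WHEN NOT MATCHED                          -- new key: insert the record
      THEN INSERT *;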

Append

Append mode uses a simple INSERT INTO statement to add all incoming records as new rows.
  • Every record is inserted regardless of whether a row with the same key already exists
  • No deduplication or update logic is applied
  • Deletes from the source are not reflected in the target
Append is useful for event logs, audit trails, or any scenario where you want to preserve every change as a separate row rather than maintaining a current-state replica.
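
In append mode the write reduces to a plain insert, roughly as follows (placeholder names again):

    -- Every staged record becomes a new row; no key matching, updates, or deletes
    INSERT INTO streamkap.customer_events
    SELECT * FROM staged_updates;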