Streamkap Setup
To set up the connector, you will need to gather connection details and configure your Databricks cluster. Log in to your Databricks account and then follow the steps below.

Get connection details
Streamkap connects to Databricks via a JDBC URL. You can use either an All-Purpose Compute cluster or a SQL Warehouse as the compute resource.

Option A: All-Purpose Compute
- Open the Compute page from the sidebar and choose your cluster
- Click on Advanced Options
- Open the JDBC/ODBC tab
- Copy the JDBC Connection URL
Option B: SQL Warehouse
A SQL Warehouse can automatically scale across multiple Spark clusters to handle concurrent workloads, but is generally more expensive than an All-Purpose Cluster. To get the JDBC URL for a SQL Warehouse:
- Open the SQL Warehouses page from the sidebar
- Select your warehouse
- Open the Connection Details tab
- Copy the JDBC URL
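For reference, a copied SQL Warehouse JDBC URL looks roughly like the following; the hostname and warehouse ID shown here are placeholders, not real values:

```
jdbc:databricks://dbc-xxxxxxxx-xxxx.cloud.databricks.com:443/default;transportMode=http;ssl=1;AuthMech=3;httpPath=/sql/1.0/warehouses/xxxxxxxxxxxxxxxx
```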
For both options, you can append ConnCatalog=<your catalog name> to the JDBC URL to select a catalog other than the default.

Generate an access token
To generate the access token Streamkap will use to authenticate with Databricks:
- Open the Settings page from the sidebar, then select User Settings
- Open the Personal Access Tokens tab
- Click + Generate New Token
- (Optional) Enter a comment and change the token lifetime
- Click Generate
- Copy the access token
Create a temporary directory
- Create a tmp directory on the Databricks File System (DBFS)
How it works
As data is streamed from the source into topics (think of them as partitioned tables), the Databricks Sink connector will:
- Check whether tables for the topics exist in Databricks and, if not, create them
- Automatically handle schema evolution when the source schema changes (e.g. new columns, data type changes)
- Stream change data into Parquet files and upload them to the tmp directory on the Databricks File System (DBFS)
- Load the data into the target table using the SQL bulk import command COPY INTO
- Clean up the Parquet files
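As a rough sketch of the final load step, the bulk import can be expressed with Databricks SQL's COPY INTO; the catalog, table, and path names here are illustrative, not the connector's actual internals:

```sql
-- Bulk-load the staged Parquet files into the target Delta table
COPY INTO my_catalog.my_schema.orders
FROM 'dbfs:/tmp/streamkap/orders/'
FILEFORMAT = PARQUET
COPY_OPTIONS ('mergeSchema' = 'true');
```

COPY INTO is idempotent by default: files that were already loaded are skipped, which makes the load-then-clean-up cycle safe to retry.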
Ingestion Modes
Streamkap supports two ingestion modes for writing data to Databricks Delta Lake: Upsert and Append.

Upsert
Upsert mode uses a MERGE INTO statement to insert new records and update existing ones based on the primary key columns from the source table.
- New records (no matching primary key in the target) are inserted
- Existing records (matching primary key) are updated with the latest values
- Deleted records (when hard delete is enabled) are physically removed from the target table
- Out-of-order protection: Streamkap tracks record timestamps and offsets to ensure older records never overwrite newer data
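The upsert behaviour above can be sketched as a Databricks SQL MERGE INTO statement; the table names, the id key, and the __deleted soft-delete flag are assumptions for illustration, not the connector's actual column names:

```sql
MERGE INTO orders AS t
USING staged_changes AS s
  ON t.id = s.id
-- Hard delete: physically remove rows the source deleted
WHEN MATCHED AND s.__deleted = 'true' THEN DELETE
-- Existing primary keys are updated with the latest values
WHEN MATCHED THEN UPDATE SET *
-- New primary keys are inserted
WHEN NOT MATCHED THEN INSERT *;
```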
Append
Append mode uses a simple INSERT INTO statement to add all incoming records as new rows.
- Every record is inserted regardless of whether a row with the same key already exists
- No deduplication or update logic is applied
- Deletes from the source are not reflected in the target
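In append mode the equivalent statement is a plain insert; table names here are again illustrative:

```sql
-- Every change event becomes a new row; no keys are checked
INSERT INTO orders
SELECT * FROM staged_changes;
```

Append mode effectively gives you a change history of the source table, at the cost of duplicates that downstream queries must deduplicate themselves.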