Deferred Merge
Overview
Deferred Merge package nodes are a simplified implementation of a Lambda architecture, focused on integrating ongoing real-time or near-real-time updates with previously loaded data.
Installation
- In Coalesce, open the Workspace where you wish to install the package.
- Go to the Build Settings of the Workspace, select the Packages tab, and click the Install button at the top right of the page.
- Paste the Package ID, and proceed with the installation process.
Description
Deferred Merge Package
The Deferred Merge Package includes mechanisms for handling merge operations with different streaming strategies:
Deferred Merge Append Stream
The Deferred Merge - Append Stream node includes several key steps to ensure efficient and up-to-date data processing. First, a stream is created to capture row inserts. Then a target table is created and initially loaded with data. A hybrid view is established to provide access to the most current data by combining the initial load with subsequent updates.
Finally, a scheduled task manages ongoing updates by merging changes into the target table. This process keeps the target table synchronized with the source data, maintaining accuracy and timeliness.
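The steps above follow a common Snowflake pattern: an append-only stream, a target table, a hybrid view over both, and a scheduled merge task. A minimal sketch with illustrative object names (SRC, TGT, COMPUTE_WH, ID, VAL are placeholders, not the DDL the node generates verbatim):

```sql
-- 1. Create Stream: append-only stream capturing row inserts on the source
CREATE OR REPLACE STREAM TGT_STREAM ON TABLE SRC APPEND_ONLY = TRUE;

-- 2. Create and initially load the target table
CREATE OR REPLACE TABLE TGT AS SELECT * FROM SRC;

-- 3. Hybrid view: rows already merged plus changes still waiting in the stream
CREATE OR REPLACE VIEW TGT_HYBRID AS
SELECT * FROM TGT
UNION ALL
SELECT * EXCLUDE (METADATA$ACTION, METADATA$ISUPDATE, METADATA$ROW_ID)
FROM TGT_STREAM;

-- 4. Scheduled task that periodically merges stream contents into the target
CREATE OR REPLACE TASK TGT_MERGE_TASK
    WAREHOUSE = COMPUTE_WH
    SCHEDULE  = '60 MINUTE'
    WHEN SYSTEM$STREAM_HAS_DATA('TGT_STREAM')
AS
MERGE INTO TGT t USING TGT_STREAM s ON t.ID = s.ID
WHEN MATCHED THEN UPDATE SET t.VAL = s.VAL
WHEN NOT MATCHED THEN INSERT (ID, VAL) VALUES (s.ID, s.VAL);
```

Consuming the stream inside the task advances its offset, so each run merges only changes captured since the previous run.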
Deferred Merge Append Stream Node Configuration
The Deferred Merge Append node has the following configuration groups:
- General Options
- Stream Options
- Target Loading Options
- Target Row DML Operations
- Target Delete Options
- Target Clustering Options
- Scheduling Options
Append Stream General Options
Option | Description |
---|---|
Development Mode | True/False toggle that determines task creation: - True: Table created and SQL executes as Run action - False: SQL wraps into task with Scheduling Options |
Create Target As | Choose target object type: - Table: Creates table - Transient Table: Creates transient table |
Prior to creating a task, it is helpful to test the SQL the task will execute to make sure it runs without errors and returns the expected data.
Append Stream Options
Option | Description |
---|---|
Source Object | Type of object for stream creation (Required): - Table - View |
Show Initial Rows | True: Returns only rows that existed when the stream was created False: Returns DML changes since the most recent offset |
Redeployment Behavior | Determines stream recreation behavior |
Redeployment Behavior | Stage Executed |
---|---|
Create Stream if not exists | Re-Create Stream at existing offset |
Create or Replace | Create Stream |
Create at existing stream | Re-Create Stream at existing offset |
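The three behaviors differ in how they treat the stream's offset. A rough sketch in Snowflake DDL (object names are illustrative):

```sql
-- Create Stream if not exists: keeps the stream and its offset if present
CREATE STREAM IF NOT EXISTS TGT_STREAM ON TABLE SRC APPEND_ONLY = TRUE;

-- Create or Replace: recreates the stream; change data captured since the
-- last offset but not yet consumed is lost
CREATE OR REPLACE STREAM TGT_STREAM ON TABLE SRC APPEND_ONLY = TRUE;

-- Create at existing stream: recreate the stream, inheriting the offset
-- of an existing stream
CREATE OR REPLACE STREAM TGT_STREAM ON TABLE SRC
    AT (STREAM => 'OLD_STREAM');
```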
Append Stream Target Loading Options
Option | Description |
---|---|
Table keys | Business key columns used to merge records into the target table |
Record Versioning | Date, datetime, or timestamp column used to track the latest version of each record |
Append Stream Target Row DML Operations
Option | Description |
---|---|
Column Identifier | Column identifying DML operations |
Include Value for Update | Value that flags a record for update; matching records in the target table are updated with the corresponding values from the source. |
Insert Value | Value that flags a record for insertion; only records flagged with this value are inserted into the target table during the merge operation. |
Delete Value | Value that flags a record for deletion; the record is either soft-deleted (when the Soft Delete toggle is enabled) or hard-deleted from the target table. |
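Taken together, these options drive a conditional merge. A hedged sketch, assuming a hypothetical flag column OP with values 'I', 'U', and 'D' (table and column names are illustrative):

```sql
MERGE INTO TGT t
USING TGT_STREAM s ON t.ID = s.ID
-- Delete Value: hard delete (with Soft Delete enabled, this branch would
-- instead mark the target row as deleted)
WHEN MATCHED AND s.OP = 'D' THEN DELETE
-- Include Value for Update: overwrite the target row with source values
WHEN MATCHED AND s.OP = 'U' THEN UPDATE SET t.VAL = s.VAL
-- Insert Value: only rows flagged for insert are inserted
WHEN NOT MATCHED AND s.OP = 'I' THEN INSERT (ID, VAL) VALUES (s.ID, s.VAL);
```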
Append Stream Target Delete Options
Option | Description |
---|---|
Soft Delete | Toggle to retain deleted records in the target table instead of removing them |
Retain Last Non-Deleted Values | Preserves the most recent non-deleted record, even as other records are marked as deleted or become inactive. |
Append Stream Target Clustering Options
Option | Description |
---|---|
Cluster key | True: Specify clustering column and allow expressions False: No clustering implemented |
Allow Expressions Cluster Key | Add an expression to the specified cluster key |
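Enabling a cluster key corresponds to an ALTER on the target table; with Allow Expressions enabled, the key can wrap the column in an expression (table and column names are illustrative):

```sql
-- Plain cluster key on a single column
ALTER TABLE TGT CLUSTER BY (EVENT_DATE);

-- Cluster key with an expression applied to the column
ALTER TABLE TGT CLUSTER BY (TO_DATE(EVENT_TS));
```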
Append Stream Scheduling Options
If Development Mode is set to False, Scheduling Options can be used to configure how and when the task will run.
Option | Description |
---|---|
Scheduling Mode | Choose compute type: - Warehouse Task: User managed warehouse executes tasks - Serverless Task: Uses serverless compute |
When Source Stream has Data Flag | True/False toggle to check for stream data: - True: Only run the task if the source stream has captured change data - False: Run the task on schedule regardless of whether the source stream has data. If the source is not a stream, set this to False. |
Select Warehouse | Visible if Scheduling Mode is set to Warehouse Task. Enter the name of the warehouse you want the task to run on without quotes. |
Select initial serverless size | Visible when Scheduling Mode is set to Serverless Task. Select the initial compute size on which to run the task. Snowflake will adjust size from there based on target schedule and task run times. |
Task Schedule | Choose schedule type: - Minutes - Specify interval in minutes. Enter a whole number from 1 to 11520 which represents the number of minutes between task runs. - Cron - Uses Cron expressions. Specifies a cron expression and time zone for periodically running the task. Supports a subset of standard cron utility syntax. - Predecessor - Specify dependent tasks |
Enter predecessor tasks separated by a comma | Visible when Task Schedule is set to Predecessor. One or more task names that precede the task being created in the current node. Task names are case sensitive, should not be quoted, and must exist in the same schema in which the current task is being created. If there are multiple predecessor tasks, separate the task names with a comma and no spaces. |
Root task name | Visible when Task Schedule is set to Predecessor. Name of the root task that controls scheduling for the DAG of tasks. The task name is case sensitive, should not be quoted, and must exist in the same schema in which the current task is being created. |
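These options map onto Snowflake task DDL. A few hedged sketches (warehouse, task, and stream names are placeholders, and the merge body is elided):

```sql
-- Warehouse task on a minute interval, skipped when the stream is empty
CREATE OR REPLACE TASK TGT_MERGE_TASK
    WAREHOUSE = COMPUTE_WH
    SCHEDULE  = '15 MINUTE'
    WHEN SYSTEM$STREAM_HAS_DATA('TGT_STREAM')
AS
    SELECT 1;  -- merge statement goes here

-- Serverless task: no warehouse, only an initial size hint, cron schedule
CREATE OR REPLACE TASK TGT_MERGE_TASK
    USER_TASK_MANAGED_INITIAL_WAREHOUSE_SIZE = 'XSMALL'
    SCHEDULE = 'USING CRON 0 2 * * * America/Los_Angeles'
AS
    SELECT 1;  -- merge statement goes here

-- Predecessor schedule: the task runs after the listed tasks, not on a timer
CREATE OR REPLACE TASK TGT_MERGE_TASK
    WAREHOUSE = COMPUTE_WH
    AFTER TASK_A,TASK_B
AS
    SELECT 1;  -- merge statement goes here
```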
Append Stream Limitations
🚧 Applying Transformations Transformations cannot be applied to columns for this node type.
Append Stream Deployment
Append Stream Parameters
The node includes an environment parameter that allows you to specify a different warehouse to run the task in different environments. The parameter name is targetTaskWarehouse, and the default value is DEV ENVIRONMENT.
When set to DEV ENVIRONMENT, the value entered in the Scheduling Options config "Select Warehouse on which to run the task" will be used when creating the task.
{"targetTaskWarehouse": "DEV ENVIRONMENT"}
When set to any value other than DEV ENVIRONMENT, the node will attempt to create the task using a Snowflake warehouse with the specified value.
For example, with the setting below for the parameter in a QA environment, the task will execute using a warehouse named compute_wh.
{"targetTaskWarehouse": "compute_wh"}
Append Stream Initial Deployment
When deployed for the first time into an environment, the node executes the following stages:
Stage | Description |
---|---|
Create Stream | Executes CREATE OR REPLACE statement to create Stream in target environment |
Create Target Table | Creates destination table for processed data storage |
Target Table Initial Load | Populates the target table with the existing data from the source object. This step ensures that the target table starts with a complete set of data before any further changes are applied |
Create Hybrid View | Provides access to the most up-to-date data by combining the initial data load with any subsequent changes captured by the stream. The hybrid view ensures that users always have access to the latest version of the data without waiting for batch updates. |
Create Task | Creates merge operation task for stream changes |
Resume Task | Activates the created task for scheduled execution |
Apply Table Clustering | Alters table with cluster key if enabled |
Resume Recluster Clustering | Enables periodic reclustering to maintain optimal clustering |
Append Stream Predecessor Task Deployment
When deploying with predecessor tasks, the node executes:
Stage | Description |
---|---|
Suspend Root Task | Suspends root task to add task into DAG |
Create Task | Creates task for loading target table |
📘 Task DAG Note
For tasks in a DAG, include a Task Dag Resume Root node to resume the root task after dependent task creation. Tasks use CREATE OR REPLACE only, with no ALTER capabilities.
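The two stages, plus the resume handled by a Task Dag Resume Root node, correspond roughly to the following DDL (task names are illustrative):

```sql
-- Suspend Root Task: the DAG must be suspended before a child is added
ALTER TASK ROOT_TASK SUSPEND;

-- Create Task: the new child task is created with CREATE OR REPLACE
CREATE OR REPLACE TASK TGT_MERGE_TASK
    WAREHOUSE = COMPUTE_WH
    AFTER ROOT_TASK
AS
    SELECT 1;  -- merge statement goes here

-- Resumed afterwards by a Task Dag Resume Root node, e.g.:
SELECT SYSTEM$TASK_DEPENDENTS_ENABLE('ROOT_TASK');
```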
Append Stream Redeployment
Redeployment Behavior | Stage Executed |
---|---|
Create Stream if not exists | Re-Create Stream at existing offset |
Create or Replace | Create Stream |
Create at existing stream | Re-Create Stream at existing offset |
Append Stream Table Redeployment
The following column/table changes trigger ALTER statements:
- Table name changes
- Column drops
- Column data type alterations
- New column additions
Executes these stages:
Stage | Description |
---|---|
Alter Table Operations | Executes appropriate ALTER statements for schema changes |
Target Initial Load | Executes load based on configuration: - If initial load enabled + "Create or Replace": Uses INSERT OVERWRITE - All other scenarios: Uses INSERT INTO |
Append Stream View Redeployment
Stream or table changes trigger Hybrid View recreation.
Append Stream Task Redeployment
Stream or table changes trigger:
- Task recreation
- Task resumption
🚧 Redeployment Behavior
Redeployment with changes in Stream, Table, or Task properties results in execution of all steps mentioned in the initial deployment.
Append Stream Undeployment
When the node is deleted, the following stages execute:
- Drop Stream
- Drop Table/Transient Table
- Drop View
- Drop Current Task
Deferred Merge Delta Stream
The Deferred Merge - Delta Stream node includes several key steps to ensure efficient and up-to-date data processing. First, a stream is created on the source object to track all DML changes, including inserts, updates, and deletes. Then a target table is created and initially loaded with data. A hybrid view is established to provide access to the most current data by combining the initial load with subsequent updates.
Finally, a scheduled task manages ongoing updates by merging changes into the target table. This process keeps the target table synchronized with the source data, maintaining accuracy and timeliness.
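The difference from the Append Stream variant is the stream type: a standard (delta) stream is created without APPEND_ONLY and records every DML operation, exposing it through metadata columns (object names are illustrative):

```sql
-- Standard (delta) stream: captures inserts, updates, and deletes
CREATE OR REPLACE STREAM TGT_STREAM ON TABLE SRC;

-- SELECT * on a stream returns the source columns plus metadata columns:
--   METADATA$ACTION   ('INSERT' or 'DELETE')
--   METADATA$ISUPDATE (TRUE when the row belongs to an update pair)
--   METADATA$ROW_ID   (stable row identifier)
SELECT * FROM TGT_STREAM;
```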
Deferred Merge Delta Stream Node Configuration
The Deferred Merge Delta Stream node has the following configuration groups:
- General Options
- Stream Options
- Target Loading Options
- Target Row DML Operations
- Target Delete Options
- Target Clustering Options
- Scheduling Options
Delta Stream General Options
Option | Description |
---|---|
Development Mode | True/False toggle that determines task creation: - True: Table created and SQL executes as Run action - False: SQL wraps into task with Scheduling Options |
Create Target As | Choose target object type: - Table: Creates table - Transient Table: Creates transient table |
Prior to creating a task, it is helpful to test the SQL the task will execute to make sure it runs without errors and returns the expected data.
Delta Stream Options
Option | Description |
---|---|
Source Object | Type of object for stream creation (Required): - Table - View |
Show Initial Rows | True: Returns only rows that existed when the stream was created False: Returns DML changes since the most recent offset |
Redeployment Behavior | Determines stream recreation behavior |
Redeployment Behavior | Stage Executed |
---|---|
Create Stream if not exists | Re-Create Stream at existing offset |
Create or Replace | Create Stream |
Create at existing stream | Re-Create Stream at existing offset |
Delta Stream Target Loading Options
Option | Description |
---|---|
Table keys | Business key columns used to merge records into the target table |
Record Versioning | Date, datetime, or timestamp column used to track the latest version of each record |
Delta Stream Target Row DML Operations
Option | Description |
---|---|
Column Identifier | Column identifying DML operations |
Include Value for Update | Value that flags a record for update; matching records in the target table are updated with the corresponding values from the source. |
Insert Value | Value that flags a record for insertion; only records flagged with this value are inserted into the target table during the merge operation. |
Delete Value | Value that flags a record for deletion; the record is either soft-deleted (when the Soft Delete toggle is enabled) or hard-deleted from the target table. |
Delta Stream Target Delete Options
Option | Description |
---|---|
Soft Delete | Toggle to retain deleted records in the target table instead of removing them |
Retain Last Non-Deleted Values | Preserves the most recent non-deleted record, even as other records are marked as deleted or become inactive. |
Delta Stream Target Clustering Options
Option | Description |
---|---|
Cluster key | True: Specify clustering column and allow expressions False: No clustering implemented |
Allow Expressions Cluster Key | Add an expression to the specified cluster key |
Delta Stream Scheduling Options
If Development Mode is set to False, Scheduling Options can be used to configure how and when the task will run.
Option | Description |
---|---|
Scheduling Mode | Choose compute type: - Warehouse Task: User managed warehouse executes tasks - Serverless Task: Uses serverless compute |
When Source Stream has Data Flag | True/False toggle to check for stream data: - True: Only run the task if the source stream has captured change data - False: Run the task on schedule regardless of whether the source stream has data. If the source is not a stream, set this to False. |
Select Warehouse | Visible if Scheduling Mode is set to Warehouse Task. Enter the name of the warehouse you want the task to run on without quotes. |
Select initial serverless size | Visible when Scheduling Mode is set to Serverless Task. Select the initial compute size on which to run the task. Snowflake will adjust size from there based on target schedule and task run times. |
Task Schedule | Choose schedule type: - Minutes - Specify interval in minutes. Enter a whole number from 1 to 11520 which represents the number of minutes between task runs. - Cron - Uses Cron expressions. Specifies a cron expression and time zone for periodically running the task. Supports a subset of standard cron utility syntax. - Predecessor - Specify dependent tasks |
Enter predecessor tasks separated by a comma | Visible when Task Schedule is set to Predecessor. One or more task names that precede the task being created in the current node. Task names are case sensitive, should not be quoted, and must exist in the same schema in which the current task is being created. If there are multiple predecessor tasks, separate the task names with a comma and no spaces. |
Root task name | Visible when Task Schedule is set to Predecessor. Name of the root task that controls scheduling for the DAG of tasks. The task name is case sensitive, should not be quoted, and must exist in the same schema in which the current task is being created. |
Delta Stream Limitations
🚧 Applying Transformations Transformations cannot be applied to columns for this node type.
Delta Stream Deployment
Delta Stream Parameters
The node includes an environment parameter that allows you to specify a different warehouse to run the task in different environments. The parameter name is targetTaskWarehouse, and the default value is DEV ENVIRONMENT.
When set to DEV ENVIRONMENT, the value entered in the Scheduling Options config "Select Warehouse on which to run the task" will be used when creating the task.
{"targetTaskWarehouse": "DEV ENVIRONMENT"}
When set to any value other than DEV ENVIRONMENT, the node will attempt to create the task using a Snowflake warehouse with the specified value.
For example, with the setting below for the parameter in a QA environment, the task will execute using a warehouse named compute_wh.
{"targetTaskWarehouse": "compute_wh"}
Delta Stream Initial Deployment
When deployed for the first time into an environment, the node executes the following stages:
Stage | Description |
---|---|
Create Stream | Executes CREATE OR REPLACE statement to create Stream in target environment |
Create Target Table | Creates destination table for processed data storage |
Target Table Initial Load | Populates the target table with the existing data from the source object. This step ensures that the target table starts with a complete set of data before any further changes are applied |
Create Hybrid View | Provides access to the most up-to-date data by combining the initial data load with any subsequent changes captured by the stream. The hybrid view ensures that users always have access to the latest version of the data without waiting for batch updates. |
Create Task | Creates merge operation task for stream changes |
Resume Task | Activates the created task for scheduled execution |
Apply Table Clustering | Alters table with cluster key if enabled |
Resume Recluster Clustering | Enables periodic reclustering to maintain optimal clustering |
Delta Stream Predecessor Task Deployment
When deploying with predecessor tasks, the node executes:
Stage | Description |
---|---|
Suspend Root Task | Suspends root task to add task into DAG |
Create Task | Creates task for loading target table |
📘 Task DAG Note
For tasks in a DAG, include a Task Dag Resume Root node to resume the root task after dependent task creation. Tasks use CREATE OR REPLACE only, with no ALTER capabilities.
Delta Stream Redeployment
Redeployment Behavior | Stage Executed |
---|---|
Create Stream if not exists | Re-Create Stream at existing offset |
Create or Replace | Create Stream |
Create at existing stream | Re-Create Stream at existing offset |
Delta Stream Table Redeployment
The following column/table changes trigger ALTER statements:
- Table name changes
- Column drops
- Column data type alterations
- New column additions
Executes these stages:
Stage | Description |
---|---|
Alter Table Operations | Executes appropriate ALTER statements for schema changes |
Target Initial Load | Executes load based on configuration: - If initial load enabled + "Create or Replace": Uses INSERT OVERWRITE - All other scenarios: Uses INSERT INTO |
Delta Stream View Redeployment
Stream or table changes trigger Hybrid View recreation.
Delta Stream Task Redeployment
Stream or table changes trigger:
- Task recreation
- Task resumption
🚧 Redeployment Behavior
Redeployment with changes in Stream, Table, or Task properties results in execution of all steps mentioned in the initial deployment.
Delta Stream Undeployment
When the node is deleted, the following stages execute:
- Drop Stream
- Drop Table/Transient Table
- Drop View
- Drop Current Task
Code
Deferred Merge - Append Stream
Deferred Merge - Delta Stream
Versions
Available versions of the package.
Version # | Release Date | Notes |
---|---|---|
1.1.2 | November 14, 2024 | |
1.1.1 | August 06, 2024 | |
1.1.0 | June 14, 2024 | |
Support
If you need help, please see our support section or contact us.