Skip to main content

Deferred Merge

Overview

Deferred Merge package nodes are a simplified implementation of a Lambda architecture, focusing on the integration of an ongoing real-time or near real-time updates with previously loaded data.

Installation

  • Copy the Package ID  @coalesce/deferred-merge
  • In Coalesce, open the Workspace where you wish to install the package.
  • Go to the Build Setting of the Workspace, tab Packages, and click the Install button on the top right of the page.
  • Paste the Package ID, and proceed with the installation process.

Description

Deferred Merge Package

The Deferred Merge Package includes mechanisms for handling merge operations with different streaming strategies:


Deferred Merge Append Stream

The Deferred Merge - Append Stream Node includes several key steps to ensure efficient and up-to-date data processing. First, a stream is created to capture row inserts. Then a target table is created and initially loaded with data. A hybrid view is established to provide access to the most current data by combining initial updates.

Finally, a scheduled task manages ongoing updates by merging changes into the target table. This process ensures that the target table remains synchronized with the source data maintaining accuracy and timeliness.

Deferred Merge Append Stream Node Configuration

The Deferred Merge Append node has the following configuration groups:

Append Stream General Options

Append General Options

OptionDescription
Development ModeTrue/False toggle that determines task creation:
- True: Table created and SQL executes as Run action
- False: SQL wraps into task with Scheduling Options
Create Target AsChoose target object type:
- Table: Creates table
- Transient Table: Creates transient table

Prior to creating a task, it is helpful to test the SQL the task will execute to make sure it runs without errors and returns the expected data.

Append Stream Options

Append Stream Options

OptionDescription
Source ObjectType of object for stream creation (Required):
- Table
- View
Show Initial RowsTrue: Returns only rows that existed when stream created
False: Returns DML changes since most recent offset
Redeployment BehaviorDetermines stream recreation behavior
Redeployment BehaviorStage Executed
Create Stream if not existsRe-Create Stream at existing offset
Create or ReplaceCreate Stream
Create at existing streamRe-Create Stream at existing offset

Append Stream Target Loading Options

Append Target Loading Options

OptionDescription
Table keysBusiness keys columns for merging into target table
Record VersioningDate Time or Date and Timestamp column for latest record tracking

Append Stream Target Row DML Operations

Append Target Row DML Operations

OptionDescription
Column IdentifierColumn identifying DML operations
Include Value for UpdateFor records flagged under Update, the existing records in the target table are updated with the corresponding values from the source table.
Insert ValueIt indicates that the corresponding record is meant to be inserted into the target table. This condition ensures that only records flagged for insertion are actually inserted into the target table during the merge operation.
Delete ValueThis value indicates that the corresponding record should either be soft-deleted (if the condition is met by enabling the soft delete toggle) or hard-deleted from the target table.

Append Stream Target Delete Options

Append Target Delete Options

OptionDescription
Soft DeleteToggle to maintain deleted data record
Retain Last Non-Deleted ValuesPreserves most recent non-deleted record, even as other records are marked as deleted or become inactive.

Append Stream Target Clustering Options

Append Target Clustering Options

OptionDescription
Cluster keyTrue: Specify clustering column and allow expressions
False: No clustering implemented
Allow Expressions Cluster KeyAadd an expression to the specified cluster key

Append Stream Scheduling Options

If development mode is set to false then Scheduling Options can be used to configure how and when the task will run.

image

OptionDescription
Scheduling ModeChoose compute type:
- Warehouse Task: User managed warehouse executes tasks
- Serverless Task: Uses serverless compute
When Source Stream has Data FlagTrue/False toggle to check for stream data
True - Only run task if source stream has capture change data
False - Run task on schedule regardless of whether the source stream has data. If the source is not a stream should set this to false.
Select WarehouseVisible if Scheduling Mode is set to Warehouse Task. Enter the name of the warehouse you want the task to run on without quotes.
Select initial serverless sizeVisible when Scheduling Mode is set to Serverless Task.
Select the initial compute size on which to run the task. Snowflake will adjust size from there based on target schedule and task run times.
Task ScheduleChoose schedule type:
- Minutes - Specify interval in minutes. Enter a whole number from 1 to 11520 which represents the number of minutes between task runs.
- Cron - Uses Cron expressions. Specifies a cron expression and time zone for periodically running the task. Supports a subset of standard cron utility syntax.
- Predecessor - Specify dependent tasks
Enter predecessor tasks separated by a commaVisible when Task Schedule is set to Predecessor.
One or more task names that precede the task being created in the current node. Task names are case sensitive, should not be quoted and must exist in the same schema in which the current task is being created. If there are multiple predecessor task separate the task names using a comma and no spaces.
Root task nameVisible when Task Schedule is set to Predecessor.
Name of the root task that controls scheduling for the DAG of tasks. Task names are case sensitive, should not be quoted and must exist in the same schema in which the current task is being created. If there are multiple predecessor task separate the task names using a comma and no spaces.

Append Stream Limitations

🚧 Appyling Transformation This node can't apply transformations to the columns for this node type.

Append Stream Deployment

Append Stream Parameters

It includes an environment parameter that allows you to specify a different warehouse to run a task in different environments. The parameter name is targetTaskWarehouse, and the default value is DEV ENVIRONMENT.

When set to DEV ENVIRONMENT, the value entered in the Scheduling Options config "Select Warehouse on which to run the task" will be used when creating the task.

{"targetTaskWarehouse": "DEV ENVIRONMENT"}

When set to any value other than DEV ENVIRONMENT, the node will attempt to create the task using a Snowflake warehouse with the specified value. For example, with the setting below for the parameter in a QA environment, the task will execute using a warehouse named compute_wh.

{"targetTaskWarehouse": "compute_wh"}

Append Stream Initial Deployment

When deployed for the first time into an environment, executes the following stages:

StageDescription
Create StreamExecutes CREATE OR REPLACE statement to create Stream in target environment
Create Target TableCreates destination table for processed data storage
Target Table Initial LoadPopulates the target table with the existing data from the source object. This step ensures that the target table starts with a complete set of data before any further changes are applied
Create Hybrid ViewProvides access to the most up-to-date data by combining the initial data load with any subsequent changes captured by the stream. The hybrid view ensures that users always have access to the latest version of the data without waiting for batch updates.
Create TaskCreates merge operation task for stream changes
Resume TaskActivates the created task for scheduled execution
Apply Table ClusteringAlters table with cluster key if enabled
Resume Recluster ClusteringEnables periodic reclustering to maintain optimal clustering
Append Stream Predecessor Task Deployment

When deploying with predecessor tasks, executes:

StageDescription
Suspend Root TaskSuspends root task to add task into DAG
Create TaskCreates task for loading target table

📘 Task DAG Note

For tasks in a DAG, include Task Dag Resume Root node type to resume root node after dependent task creation. Tasks use CREATE OR REPLACE only, with no ALTER capabilities.

Append Stream Redeployment
Redeployment BehaviorStage Executed
Create Stream if not existsRe-Create Stream at existing offset
Create or ReplaceCreate Stream
Create at existing streamRe-Create Stream at existing offset
Append Stream Table Redeployment

The following column/table changes trigger ALTER statements:

  • Table name changes
  • Column drops
  • Column data type alterations
  • New column additions

Executes these stages:

StageDescription
Alter Table OperationsExecutes appropriate ALTER statements for schema changes
Target Initial LoadExecutes load based on configuration:
- If initial load enabled + "Create or Replace": Uses INSERT OVERWRITE
- All other scenarios: Uses INSERT INTO
Append Stream View Redeployment

Stream or table changes trigger Hybrid View recreation.

Append Stream Task Redeployment

Stream or table changes trigger:

  1. Task recreation
  2. Task resumption

🚧 Redeployment Behavior

Redeployment with changes in Stream/Table/Task properties will result in execution of all steps mentioned in inital deployment.

Append Stream Undeployment

When node is deleted, executes:

  • Drop Stream
  • Drop Table/Transient Table
  • Drop View
  • Drop Current Task

Deferred Merge Delta Stream

The Deferred Merge - Delta Stream Node includes several key steps to ensure efficient and up-to-date data processing. First, a stream is created to capture changes from the source object to tracks all DML changes to the source object, including inserts, updates, and deletes. Then a target table is created and initially loaded with data. A hybrid view is established to provide access to the most current data by combining initial updates.

Finally, a scheduled task manages ongoing updates by merging changes into the target table. This process ensures that the target table remains synchronized with the source data maintaining accuracy and timeliness

Deferred Merge Delta Stream Node Configuration

The Deferred Merge Append node has the following configuration groups:

Delta Stream Stream General Options

Append General Options

OptionDescription
Development ModeTrue/False toggle that determines task creation:
- True: Table created and SQL executes as Run action
- False: SQL wraps into task with Scheduling Options
Create Target AsChoose target object type:
- Table: Creates table
- Transient Table: Creates transient table

Prior to creating a task, it is helpful to test the SQL the task will execute to make sure it runs without errors and returns the expected data.

Delta Stream Options

Append Stream Options

OptionDescription
Source ObjectType of object for stream creation (Required):
- Table
- View
Show Initial RowsTrue: Returns only rows that existed when stream created
False: Returns DML changes since most recent offset
Redeployment BehaviorDetermines stream recreation behavior
Redeployment BehaviorStage Executed
Create Stream if not existsRe-Create Stream at existing offset
Create or ReplaceCreate Stream
Create at existing streamRe-Create Stream at existing offset

Delta Stream Target Loading Options

Append Target Loading Options

OptionDescription
Table keysBusiness keys columns for merging into target table
Record VersioningDate Time or Date and Timestamp column for latest record tracking

Delta Stream Target Row DML Operations

Append Target Row DML Operations

OptionDescription
Column IdentifierColumn identifying DML operations
Include Value for UpdateFor records flagged under Update, the existing records in the target table are updated with the corresponding values from the source table.
Insert ValueIt indicates that the corresponding record is meant to be inserted into the target table. This condition ensures that only records flagged for insertion are actually inserted into the target table during the merge operation.
Delete ValueThis value indicates that the corresponding record should either be soft-deleted (if the condition is met by enabling the soft delete toggle) or hard-deleted from the target table.

Delta Stream Target Delete Options

Append Target Delete Options

OptionDescription
Soft DeleteToggle to maintain deleted data record
Retain Last Non-Deleted ValuesPreserves most recent non-deleted record, even as other records are marked as deleted or become inactive.

Delta Stream Target Clustering Options

Append Target Clustering Options

OptionDescription
Cluster keyTrue: Specify clustering column and allow expressions
False: No clustering implemented
Allow Expressions Cluster KeyAadd an expression to the specified cluster key

Delta Stream Scheduling Options

If development mode is set to false then Scheduling Options can be used to configure how and when the task will run.

image

OptionDescription
Scheduling ModeChoose compute type:
- Warehouse Task: User managed warehouse executes tasks
- Serverless Task: Uses serverless compute
When Source Stream has Data FlagTrue/False toggle to check for stream data
True - Only run task if source stream has capture change data
False - Run task on schedule regardless of whether the source stream has data. If the source is not a stream should set this to false.
Select WarehouseVisible if Scheduling Mode is set to Warehouse Task. Enter the name of the warehouse you want the task to run on without quotes.
Select initial serverless sizeVisible when Scheduling Mode is set to Serverless Task.
Select the initial compute size on which to run the task. Snowflake will adjust size from there based on target schedule and task run times.
Task ScheduleChoose schedule type:
- Minutes - Specify interval in minutes. Enter a whole number from 1 to 11520 which represents the number of minutes between task runs.
- Cron - Uses Cron expressions. Specifies a cron expression and time zone for periodically running the task. Supports a subset of standard cron utility syntax.
- Predecessor - Specify dependent tasks
Enter predecessor tasks separated by a commaVisible when Task Schedule is set to Predecessor.
One or more task names that precede the task being created in the current node. Task names are case sensitive, should not be quoted and must exist in the same schema in which the current task is being created. If there are multiple predecessor task separate the task names using a comma and no spaces.
Root task nameVisible when Task Schedule is set to Predecessor.
Name of the root task that controls scheduling for the DAG of tasks. Task names are case sensitive, should not be quoted and must exist in the same schema in which the current task is being created. If there are multiple predecessor task separate the task names using a comma and no spaces.

Delta Stream Limitations

🚧 Appyling Transformation This node can't apply transformations to the columns for this node type.

Delta Stream Deployment

Delta Stream Parameters

It includes an environment parameter that allows you to specify a different warehouse to run a task in different environments. The parameter name is targetTaskWarehouse, and the default value is DEV ENVIRONMENT.

When set to DEV ENVIRONMENT, the value entered in the Scheduling Options config "Select Warehouse on which to run the task" will be used when creating the task.

{"targetTaskWarehouse": "DEV ENVIRONMENT"}

When set to any value other than DEV ENVIRONMENT, the node will attempt to create the task using a Snowflake warehouse with the specified value. For example, with the setting below for the parameter in a QA environment, the task will execute using a warehouse named compute_wh.

{"targetTaskWarehouse": "compute_wh"}

Delta Stream Initial Deployment

When deployed for the first time into an environment, executes the following stages:

StageDescription
Create StreamExecutes CREATE OR REPLACE statement to create Stream in target environment
Create Target TableCreates destination table for processed data storage
Target Table Initial LoadPopulates the target table with the existing data from the source object. This step ensures that the target table starts with a complete set of data before any further changes are applied
Create Hybrid ViewProvides access to the most up-to-date data by combining the initial data load with any subsequent changes captured by the stream. The hybrid view ensures that users always have access to the latest version of the data without waiting for batch updates.
Create TaskCreates merge operation task for stream changes
Resume TaskActivates the created task for scheduled execution
Apply Table ClusteringAlters table with cluster key if enabled
Resume Recluster ClusteringEnables periodic reclustering to maintain optimal clustering
Delta Stream Predecessor Task Deployment

When deploying with predecessor tasks, executes:

StageDescription
Suspend Root TaskSuspends root task to add task into DAG
Create TaskCreates task for loading target table

📘 Task DAG Note

For tasks in a DAG, include Task Dag Resume Root node type to resume root node after dependent task creation. Tasks use CREATE OR REPLACE only, with no ALTER capabilities.

Delta Stream Redeployment
Redeployment BehaviorStage Executed
Create Stream if not existsRe-Create Stream at existing offset
Create or ReplaceCreate Stream
Create at existing streamRe-Create Stream at existing offset
Delta Stream Table Redeployment

The following column/table changes trigger ALTER statements:

  • Table name changes
  • Column drops
  • Column data type alterations
  • New column additions

Executes these stages:

StageDescription
Alter Table OperationsExecutes appropriate ALTER statements for schema changes
Target Initial LoadExecutes load based on configuration:
- If initial load enabled + "Create or Replace": Uses INSERT OVERWRITE
- All other scenarios: Uses INSERT INTO
Delta Stream View Redeployment

Stream or table changes trigger Hybrid View recreation.

Delta Stream Task Redeployment

Stream or table changes trigger:

  1. Task recreation
  2. Task resumption

🚧 Redeployment Behavior

Redeployment with changes in Stream/Table/Task properties will result in execution of all steps mentioned in inital deployment.

Delta Stream Undeployment

When node is deleted, executes:

  • Drop Stream
  • Drop Table/Transient Table
  • Drop View
  • Drop Current Task

Code

Deferred Merge - Append Stream

Deferred Merge - Delta Stream

Versions

Available versions of the package.

Version #Release DateNotes
1.1.2November 14, 2024
 Node description/comment changes 
1.1.1August 06, 2024
 Added support for iceberg tables as source and fix for empty storage location 
1.1.0June 14, 2024
 Deferred Merge Package updated 

Support

If you need help, please see our support section or contact us.