Incremental Loading with Coalesce

Overview

Incremental data loading is a technique used in data warehousing to efficiently update large datasets by only processing and loading new or changed data since the last update, instead of reloading the entire dataset each time.

For example, you're an e-commerce company and only want to load order changes, not all orders every time. You can use the Incremental Node to only update records that have shipped.

Coalesce makes this process easy by offering the Incremental Node Package. By using our Package, you can quickly get started building your pipeline with minimal code and configuration. The package contains the following Node Types:

Incremental Load - Performs updates by loading only new or changed records. It minimizes data duplication and optimizes resource use by focusing only on changes since the last load.
Looped Load - Runs multiple loads in a loop, which allows the system to handle large datasets in smaller batches. Each loop processes a subset of data, reducing memory usage and improving performance for larger tables.
Run View - Creates a view that tracks the most recent runs. The view is refreshed with each execution, allowing for easy identification of the latest updates or changes without reloading the entire dataset.

These Node Types make sure you can follow best practices such as:

Change detection - Identifying new, modified, or deleted records in the source system since the last data load.
Efficiency - Reducing processing time and resource usage by only handling changes instead of the full dataset.
Timestamp or version tracking - Using timestamps or version numbers to determine which records have changed.
Maintaining historical data - Preserving historical information while updating with new data.

You can learn more about Incremental Processing Strategies in our blog post.

High Water Mark

Coalesce uses High Water Mark as a fast and efficient way to process data

Before You Begin

If you don’t have a Coalesce account, you can sign up through Snowflake Partner Connect.
This demo uses the Snowflake sample data sets.
Set up your Storage Locations and Storage Mappings.
Add your source data.

Install the Incremental Loading Package

You're going to install the Incremental Loading Package.

Choose a package from the Coalesce Marketplace.
Make note of the Package ID.
In the Coalesce App, go to Build Settings > Packages. Click Install.
Enter the Package ID you previously copied.
Select the version you want to install.
Enter a Package Alias. The alias is a unique and descriptive name for the package so you can easily identify it later.
1. A package alias can only contain letters, numbers, and dashes. It can’t contain consecutive dashes.
2. Package aliases can’t start or end with a dash.
Install your package.

Click on Node Types to see the three Node Types included in the package:

Incremental Load
Loop Load
Run View

This image shows the Coalesce interface, focusing on the Build Settings tab. The main section displays a list of Node Types, including Incremental Load, Looped Load, and Run View, which are highlighted. These node types are part of the doc-inc-load package version 1.1.0. The left sidebar shows a project structure with various nodes, and the top of the interface includes options for Build, Deploy, and Docs.

Add Your Nodes

Right click on the LINEITEMsource node and select Add Node > Select the Incremental Node Package.
Open the Node Properties.
Take a look at the Storage Location. Make sure it’s not the same location as your source data.

Best Practice
You should create a separate Storage Location for your Incremental data when using this in a production environment.
Click Create, to create the Incremental Node. You'll configure the Node in a later step.

Viewing SQL
You can always see the SQL that is executed by viewing the results.
Go back to the DAG and right click on the Incremental Node and select Add Node > Persistent Stage.
Open the Node Properties and make sure it’s not the same as the source Node. Make note of the Storage Location name.
Create and Run the Persistent Stage Node.

Best Practice

It’s best practice to add a Persistent Stage table to hold the historical data.

At this point your DAG should look similar to the image. The information in the Source Node changes, the Incremental Node only checks the changed information and the Persistent Stage holds the historical data.

This image shows the Coalesce interface for a project named Default Project Incremental Loading. The main view displays a graph of data nodes, with LINEITEM connected to INC_LINEITEM and then to PSTG_LINEITEM. The left sidebar lists various nodes under Source, Incremental Load, and Persistent Stage categories. The interface includes options for viewing the graph, running operations, and searching nodes. The top of the interface shows navigation options and the current focus is on the INC_LINEITEM node.

Configure the Incremental Node

Open the Incremental Node and go to the Options.
Leave it as a view, toggle on the Filter data based on Persistent Table.
Enter the Storage Location of the Persistent Table. In this example, it’s WORK.
Enter the name of the Persistent Table. In this example, it’s PSTG_LINEITEM.
Change the Incremental Load Column to L_SHIPDATE. This is the field that the table will use to determine if the data has changed.
Go to the Join tab, click Generate Join, and Copy to Editor.

The query is looking for new or updated records in LINEITEM that have a ship date more recent than anything currently in PSTG_LINEITEM.

FROM{{ ref('SAMPLE', 'LINEITEM') }} "LINEITEM"
WHERE "LINEITEM"."L_SHIPDATE" > 
(SELECT COALESCE(MAX("L_SHIPDATE"), '1900-01-01') 
            FROM {{ ref_no_link('WORK', 'PSTG_LINEITEM') }} )

This image shows the Coalesce interface focused on the INC_LINEITEM node. The main section displays a SQL-like query that joins and filters data from the LINEITEM table. The query uses functions like ref, COALESCE, and MAX to manipulate the data, specifically working with the L_SHIPDATE column. The interface includes options for mapping and joining data, as well as a Config tab with Node Properties. There are buttons for generating joins, copying to clipboard, and copying to editor. The top of the interface shows navigation options for Build, Deploy, and Docs.

Congratulations

You've gone through the steps of settings up an Incremental Node that updates the data in a Persistent Table. The process is quick and easy.

Add Incremental Loading to an Existing Pipeline (Optional)

Add the Incremental Node from the source Node you want to track.
Either create or change an existing Stage Node to a Persistent Node Type.
If you're updating an existing Node. You’ll need to do a few more things.
1. Update the Persistent Node Type to reference the Incremental Node. You can do this by updating the Join to use the Incremental Node as the Source and changing the source in the columns.

This image shows the Coalesce interface for the PSTG_LINEITEM node. The main section displays a SQL-like query that references the INC_LINEITEM table from the WORK schema. On the right side, there's a Bulk Editor panel showing the node's attributes, with the Source set to INC_LINEITEM in the WORK schema, and the Operation set to Node. The interface includes tabs for Config, Column Editor, and Testing, as well as options for Mapping and Join. There's also a Generate Join button visible, indicating the ability to create join conditions automatically.

Conclusion

Incremental loading is a vital strategy in data warehousing and ETL processes. It enhances efficiency by processing only new or updated data since the last load, significantly reducing load times and data transfer volumes. This approach ensures that target systems stay current with the latest information while minimizing resource usage.

Overview​

Before You Begin​

Install the Incremental Loading Package​

Add Your Nodes​

Configure the Incremental Node​

Add Incremental Loading to an Existing Pipeline (Optional)​

Conclusion​

Resources​