Skip to main content

Incremental Loading with Coalesce

    Overview

    Incremental data loading is a technique used in data warehousing to efficiently update large datasets by only processing and loading new or changed data since the last update, instead of reloading the entire dataset each time.

    For example, you're an e-commerce company and only want to load order changes, not all orders every time. You can use the Incremental Node to only update records that have shipped.

    Coalesce makes this process easy by offering the Incremental Node Package. By using our Package, you can quickly get started building your pipeline with minimal code and configuration. The package contains the following Node Types:

    • Incremental Load - Performs updates by loading only new or changed records. It minimizes data duplication and optimizes resource use by focusing only on changes since the last load.
    • Looped Load - Runs multiple loads in a loop, which allows the system to handle large datasets in smaller batches. Each loop processes a subset of data, reducing memory usage and improving performance for larger tables.
    • Run View - Creates a view that tracks the most recent runs. The view is refreshed with each execution, allowing for easy identification of the latest updates or changes without reloading the entire dataset.

    These Node Types make sure you can follow best practices such as:

    • Change detection - Identifying new, modified, or deleted records in the source system since the last data load.
    • Efficiency - Reducing processing time and resource usage by only handling changes instead of the full dataset.
    • Timestamp or version tracking - Using timestamps or version numbers to determine which records have changed.
    • Maintaining historical data - Preserving historical information while updating with new data.

    You can learn more about Incremental Processing Strategies in our blog post.

    High Water Mark

    Coalesce uses High Water Mark as a fast and efficient way to process data

    Before You Begin

    1. If you don’t have a Coalesce account, you can sign up through Snowflake Partner Connect.
    2. This demo uses the Snowflake sample data sets.
    3. Set up your Storage Locations and Storage Mappings.
    4. Add your source data.

    Install the Incremental Loading Package

    You're going to install the Incremental Loading Package.

    1. Choose a package from the Coalesce Marketplace.

    2. Make note of the Package ID.

      The Dynamic Tables Package streamlines Snowflake data pipelines with tools for creating, deploying, and managing dynamic tables. It enhances efficiency by automating table updates and ensuring data consistency. This package is essential for maintaining robust and scalable data workflows.
    3. In the Coalesce App, go to Build Settings > Packages. Click Install.

      The installation screen for packages requires the user to enter the Package ID and Version, as well as a unique and descriptive Package Alias. For further details and guidelines, users are directed to the Coalesce Documentation. This ensures proper configuration and management of installed packages.
    4. Enter the Package ID you previously copied.

    5. Select the version you want to install.

    6. Enter a Package Alias. The alias is a unique and descriptive name for the package so you can easily identify it later.

      1. A package alias can only contain letters, numbers, and dashes. It can’t contain consecutive dashes.
      2. Package aliases can’t start or end with a dash.
    7. Install your package.

    Click on Node Types to see the three Node Types included in the package:

    • Incremental Load
    • Loop Load
    • Run View
    This image shows the Coalesce interface, focusing on the Build Settings tab. The main section displays a list of Node Types, including Incremental Load, Looped Load, and Run View, which are highlighted. These node types are part of the doc-inc-load package version 1.1.0. The left sidebar shows a project structure with various nodes, and the top of the interface includes options for Build, Deploy, and Docs.

    Add Your Nodes

    1. Right click on the LINEITEMsource node and select Add Node > Select the Incremental Node Package.

      The main view displays a graph of connected nodes representing different data sources and processing steps, with a sidebar listing various nodes and their properties, and a context menu offering options to edit, add, or manipulate nodes.
    2. Open the Node Properties.

    3. Take a look at the Storage Location. Make sure it’s not the same location as your source data.

      This image shows a configuration interface from Coalesce, focusing on the Node Properties section. It displays settings for a node, including the Storage Location set to a work database with incremental tables in the DEV environment, and the Node Type specified as Incremental Load using version 1.1.0 of the doc-inc-load feature.
      Best Practice

      You should create a separate Storage Location for your Incremental data when using this in a production environment.

    4. Click Create, to create the Incremental Node. You'll configure the Node in a later step.

      Viewing SQL

      You can always see the SQL that is executed by viewing the results.

      This image shows the Results tab of a Coalesce interface after running a data pipeline operation. The top of the interface displays action buttons like Validate Select, Create, and Run. The results section shows that the process has been validated and two items have been created. Specifically, it shows the successful creation of a view named INC_ORDERS, with the SQL statement for creating or replacing the view visible. The view includes columns like O_ORDERKEY, O_CUSTKEY, O_ORDERSTATUS, and others.
    5. Go back to the DAG and right click on the Incremental Node and select Add Node > Persistent Stage.

    6. Open the Node Properties and make sure it’s not the same as the source Node. Make note of the Storage Location name.

    7. Create and Run the Persistent Stage Node.

    Best Practice

    It’s best practice to add a Persistent Stage table to hold the historical data.

    At this point your DAG should look similar to the image. The information in the Source Node changes, the Incremental Node only checks the changed information and the Persistent Stage holds the historical data.

    This image shows the Coalesce interface for a project named Default Project Incremental Loading. The main view displays a graph of data nodes, with LINEITEM connected to INC_LINEITEM and then to PSTG_LINEITEM. The left sidebar lists various nodes under Source, Incremental Load, and Persistent Stage categories. The interface includes options for viewing the graph, running operations, and searching nodes. The top of the interface shows navigation options and the current focus is on the INC_LINEITEM node.

    Configure the Incremental Node

    1. Open the Incremental Node and go to the Options.

    2. Leave it as a view, toggle on the Filter data based on Persistent Table.

    3. Enter the Storage Location of the Persistent Table. In this example, it’s WORK.

    4. Enter the name of the Persistent Table. In this example, it’s PSTG_LINEITEM.

    5. Change the Incremental Load Column to L_SHIPDATE. This is the field that the table will use to determine if the data has changed.

      This image displays a configuration interface for filtering data based on a Persistent Table in Coalesce. It shows a toggle switch to enable or disable the filter, along with an explanation that a Persistent table can be a Persistent stage, Dimension, or Fact node. The interface includes fields for the Persistent Table location (set to WORK) and name (PSTG_LINEITEM). An Incremental Load Column field is also present, with L_SHIPDATE selected as the date column. Some fields have Expand buttons, indicating additional options or details are available.
    6. Go to the Join tab, click Generate Join, and Copy to Editor.

    7. The query is looking for new or updated records in LINEITEM that have a ship date more recent than anything currently in PSTG_LINEITEM.

      FROM{{ ref('SAMPLE', 'LINEITEM') }} "LINEITEM"
      WHERE "LINEITEM"."L_SHIPDATE" >
      (SELECT COALESCE(MAX("L_SHIPDATE"), '1900-01-01')
      FROM {{ ref_no_link('WORK', 'PSTG_LINEITEM') }} )
      This image shows the Coalesce interface focused on the INC_LINEITEM node. The main section displays a SQL-like query that joins and filters data from the LINEITEM table. The query uses functions like ref, COALESCE, and MAX to manipulate the data, specifically working with the L_SHIPDATE column. The interface includes options for mapping and joining data, as well as a Config tab with Node Properties. There are buttons for generating joins, copying to clipboard, and copying to editor. The top of the interface shows navigation options for Build, Deploy, and Docs.
    Congratulations

    You've gone through the steps of settings up an Incremental Node that updates the data in a Persistent Table. The process is quick and easy.

    Add Incremental Loading to an Existing Pipeline (Optional)

    1. Add the Incremental Node from the source Node you want to track.
    2. Either create or change an existing Stage Node to a Persistent Node Type.
    3. If you're updating an existing Node. You’ll need to do a few more things.
      1. Update the Persistent Node Type to reference the Incremental Node. You can do this by updating the Join to use the Incremental Node as the Source and changing the source in the columns.
    This image shows the Coalesce interface for the PSTG_LINEITEM node. The main section displays a SQL-like query that references the INC_LINEITEM table from the WORK schema. On the right side, there's a Bulk Editor panel showing the node's attributes, with the Source set to INC_LINEITEM in the WORK schema, and the Operation set to Node. The interface includes tabs for Config, Column Editor, and Testing, as well as options for Mapping and Join. There's also a Generate Join button visible, indicating the ability to create join conditions automatically.

    Conclusion

    Incremental loading is a vital strategy in data warehousing and ETL processes. It enhances efficiency by processing only new or updated data since the last load, significantly reducing load times and data transfer volumes. This approach ensures that target systems stay current with the latest information while minimizing resource usage.

    Resources