Incremental Loading

With incremental loading, you process only new or changed rows since the last good run instead of reading the full data set every time. This page explains how incremental loading works in general and how you can apply it to your data platform.

Incremental Loading Compared to a Full Refresh

Use this table to compare full refreshes and incremental loads at a high level.

| Aspect | Full refresh | Incremental load |
| --- | --- | --- |
| Data processed | The full source every run | Only new or changed rows |
| Cost | Rises with total size | Rises with how much changes |
| Latency | Often high for large sources | Often lower when changes are small |
| Complexity | Simpler patterns | You need a watermark, CDC, or careful state tracking |
| Delete handling | If you compare two full snapshots, missing rows can mean deletes | You usually model deletes with flags, CDC, or reconciliation |
| Typical fit | Small sources, full exports, planned rebuilds | Large facts, APIs with UPDATED_AT, stream-style CDC |

Core Concepts

These ideas show up on every warehouse: how you bound each run, where you store loaded rows, what counts as changed data, and when change data capture is the right tool.

Watermark

A watermark is the latest value of a column you've already loaded, often a timestamp. The next run only pulls rows above that line, like WHERE incremental_column > watermark.

Persistent Target

With a persistent table, you keep the business rows you've already loaded from the incremental stage, not just bookkeeping metadata. The next run often takes the watermark from MAX(incremental_column) on that table. Stay consistent about where that watermark value comes from in your graph.
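As a sketch of this idea, assuming a persistent target named persistent_stg_orders and a staging object named stg_orders with an updated_at column (all three names are illustrative, not package defaults), the next run's bound could come from:

```sql
-- Derive the watermark from rows already in the persistent target
SET watermark = (
  SELECT COALESCE(MAX(updated_at), '1900-01-01'::TIMESTAMP_NTZ)
  FROM persistent_stg_orders
);

-- Pull only rows above that line
SELECT *
FROM stg_orders
WHERE updated_at > $watermark;
```

The COALESCE fallback matters on the first run, when the persistent table is empty and MAX returns NULL.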

Delta

The delta is what you want this run: inserts, updates, and deletes if you track them. Timestamp-style loads see inserts and updates that appear in the extract. Hard deletes in the source don't show up unless you add CDC, soft-delete columns, snapshot compares, or a delete feed.

Change Data Capture

CDC means the source, or a layer in front of it, sends change events or change rows for inserts, updates, and deletes. Snowflake Streams, Kafka topics, database log CDC, and SaaS event APIs are examples.
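For the Snowflake Streams case, a minimal sketch looks like this, assuming a source table raw.orders (an illustrative name):

```sql
-- A standard stream records inserts, updates, and deletes on a table
CREATE OR REPLACE STREAM orders_stream ON TABLE raw.orders;

-- Selecting from a stream returns the delta since the last consumed offset,
-- plus change metadata columns such as METADATA$ACTION and METADATA$ISUPDATE
SELECT * FROM orders_stream;
```

Consuming the stream inside a completed DML transaction advances its offset, so the next read starts from the new position.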

Choose an Incremental Strategy

This section will help you figure out which strategy works best for your data.

Batch Jobs and Timestamp Column

This is the most common pattern: you run jobs on a schedule, such as hourly or daily, and the source has an UPDATED_AT column.

Use: Watermark with a timestamp column

On Snowflake, install the Incremental Loading package and use the Incremental Load Node type.

  1. In Build Settings > Packages, install the Incremental Loading package, then add an Incremental Load Node after your source. Add a Persistent Stage, Dimension, or Fact, or another persistent target, to hold rows you have already loaded.
  2. For the first run, leave Filter data based on Persistent table off, Create the Node, and Run the persistent target so it is populated. For later runs, turn Filter data based on Persistent table on so the template compares staging to that table.
  3. Set Persistent table location and Persistent table name to match your persistent Node. Set Incremental load column to the source column you use as the watermark, for example UPDATED_AT.
  4. On the Join tab, use Generate Join and Copy to Editor so the incremental predicate matches your column names. Use Pre-SQL or Post-SQL when you need extra filters or housekeeping.
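The predicate that Generate Join produces is conceptually the following, assuming stg_orders as the staging Node, persistent_stg_orders as the persistent target, and updated_at as the incremental load column (all illustrative names):

```sql
-- Conceptual shape of the incremental filter the template applies
SELECT s.*
FROM stg_orders s
WHERE s.updated_at > (
  SELECT COALESCE(MAX(updated_at), '1900-01-01'::TIMESTAMP_NTZ)
  FROM persistent_stg_orders
);
```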

Batch Jobs and Very Large Data

You run scheduled jobs, but the data is too large for a single run. Use this pattern when one pass over the source won't finish in your window and you need buckets, load history, or parameterized windows:

Use: Looped or batched incremental

On Snowflake, use Looped Load from the Incremental Loading package. It builds load buckets from your grouping keys, loops through them, and can maintain a load history table for orchestration.

  1. Install the package in Build Settings > Packages, then add a Looped Load Node downstream of the source or staging you need to chunk.
  2. Configure Load Type and Job Parameters so each run uses the right mode. The Node supports values such as full load, full reload, incremental load, and reprocess load. Use reprocess-style runs when you need to retry failed buckets without reloading successes. See Looped Load Parameters on the Incremental Loading package listing.
  3. Under Looped Load Incremental Load Options, turn on Set Incremental Load Options when you still want a watermark inside each bucket, pick Incremental Load Column (date), and set Group Incremental when you need more than one bucket per run.
  4. Under Looped Load Group Table Options, choose Dedicated Load History Table and Load History Table Location when you want this Node's history isolated from other looped loads.
  5. Use Pre-SQL and Post-SQL for setup and cleanup around each loop when the platform allows.

Batch Jobs and Full Dumps Without a Timestamp

You only receive full snapshots of data.

Use: Snapshot Comparison

When each Job receives a full snapshot and you must diff it against what is already in the warehouse:

Compare snapshots with base Nodes, optionally combined with the Incremental Loading package when your diff is already a view the package can load.

  1. Land the latest extract in a Work or Persistent Stage Node from Snowflake Base Node Types, usually with Truncate Before so the object holds only the current file or batch.
  2. Add Work Nodes that implement the diff, or use Create with Override SQL on staging paths for the same logic: for example keys present in the snapshot but absent from the target, row-level hashes or column compares for updates, and optional NOT EXISTS patterns for rows that disappeared if your design treats that as a delete signal.
  3. Point Dimension Advanced Deploy or Fact Advanced Deploy from Snowflake Base Node Types Advanced Deploy at the reduced row set. Use documented Insert Strategy, Update Strategy, Unmatched Record Strategy, and MERGE methods so inserts and updates land in one Job.
  4. When the compare SQL lives in a single view you want to treat like an incremental source, you can still add Incremental Load from the Incremental Loading package for load hygiene. You own the predicate or full replace semantics in the view logic.
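The three diffs in step 2 can be sketched like this, assuming a landed extract snapshot_today, a target dim_customer, a key customer_id, and compared columns name, email, and status (all illustrative):

```sql
-- Inserts: keys in the new snapshot but not yet in the target
SELECT s.*
FROM snapshot_today s
WHERE NOT EXISTS (SELECT 1 FROM dim_customer t WHERE t.customer_id = s.customer_id);

-- Updates: same key, different row hash
SELECT s.*
FROM snapshot_today s
JOIN dim_customer t ON t.customer_id = s.customer_id
WHERE HASH(s.name, s.email, s.status) <> HASH(t.name, t.email, t.status);

-- Optional delete signal: keys that disappeared from the snapshot
SELECT t.customer_id
FROM dim_customer t
WHERE NOT EXISTS (SELECT 1 FROM snapshot_today s WHERE s.customer_id = t.customer_id);
```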

Batch Jobs and Full Accuracy With Deletes

You must detect inserts, updates, and deletes without CDC.

Use: Full Reload and Compare

When the staging table represents the full current source and you need MERGE behavior that can insert, update, and remove keys that disappeared from the extract, without CDC:

On Snowflake, Snowflake Base Node Types hold the full snapshot and Snowflake Base Node Types Advanced Deploy supplies MERGE with Unmatched Record Strategy.

  1. Load every run's extract into Work or Persistent Stage with Snowflake Base Node Types, typically Truncate Before on so staging matches the file or table your Job just received.
  2. Point Dimension Advanced Deploy or Fact Advanced Deploy at that staging Node. Set Business Key when the listing requires it, choose MERGE as the load method, and set Update Strategy to Merge so unmatched handling is available.
  3. Set Unmatched Record Strategy to NO DELETE, SOFT DELETE, or HARD DELETE depending on governance. Dimension Advanced Deploy documents SOFT DELETE and HARD DELETE for rows no longer in the source. Fact Advanced Deploy documents HARD DELETE and NO DELETE where the listing applies.
  4. Keep dependent Facts and Dimensions in DAG order when deletes propagate. If a Node must run after a delete staging step but doesn't reference it in SQL, enforce order with ref_link(). Read Deleting Records from Fact and Dimension Nodes for patterns that keep facts and dimensions consistent.
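Under the hood, the MERGE plus unmatched-record handling amounts to something like the following sketch. Snowflake's MERGE has no WHEN NOT MATCHED BY SOURCE clause, so removing vanished keys is a separate statement; all table and column names here are illustrative:

```sql
-- Upsert from the full snapshot in staging
MERGE INTO dim_customer t
USING stg_customer s
  ON t.customer_id = s.customer_id
WHEN MATCHED THEN UPDATE SET t.name = s.name, t.email = s.email
WHEN NOT MATCHED THEN INSERT (customer_id, name, email)
  VALUES (s.customer_id, s.name, s.email);

-- SOFT DELETE variant: flag keys missing from the snapshot
UPDATE dim_customer t
SET is_deleted = TRUE,
    deleted_at = CURRENT_TIMESTAMP()
WHERE NOT EXISTS (SELECT 1 FROM stg_customer s WHERE s.customer_id = t.customer_id)
  AND NOT t.is_deleted;
```

A HARD DELETE variant replaces the UPDATE with a DELETE using the same NOT EXISTS predicate.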

Streaming or Frequent Jobs and Change Events

Data arrives continuously or includes change metadata.

Use: Change data capture, CDC

When change rows or stream metadata are already available and you need near-continuous loads instead of one large batch window:

Snowflake CDC in Coalesce centers on the Snowflake Streams and Tasks package: Stream Nodes hold the offset, and Task Nodes run the work on a schedule or when the stream shows data.

  1. Install the package under Build Settings > Packages.
  2. Add a Stream Node on the Snowflake table or view you want to track so Snowflake records inserts, updates, and deletes according to your stream options, for example Append Only or Propagate Deletes where the matrix applies.
  3. Add Stream and Insert or Merge or Delta Stream Merge downstream. Stream and Insert or Merge applies the stream delta with INSERT or MERGE. Delta Stream Merge uses stream metadata such as METADATA$ACTION when you need tighter alignment on updates and deletes.
  4. Attach Work with Task, Dimension with Task, Fact with Task, or Insert or Merge with Task when you need Schedule, Stream Has Data, warehouse sizing, or Multi-Stream orchestration the matrices describe.
  5. Use Task DAG Create Root and Task DAG Resume Root so suspended Snowflake tasks actually resume in the right dependency order.
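The stream-plus-task pairing the steps above describe reduces to a pattern like this, with illustrative object, warehouse, and column names:

```sql
-- Run every 5 minutes, but only do work when the stream has rows
CREATE OR REPLACE TASK load_orders_task
  WAREHOUSE = transform_wh
  SCHEDULE  = '5 MINUTE'
WHEN SYSTEM$STREAM_HAS_DATA('ORDERS_STREAM')
AS
  INSERT INTO stg_orders
  SELECT order_id, amount, updated_at
  FROM orders_stream
  WHERE METADATA$ACTION = 'INSERT';

-- Tasks are created suspended and must be resumed explicitly
ALTER TASK load_orders_task RESUME;
```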

Multiple Sources Feeding One Target

You have multiple inputs updating one table.

Use: Multi-source incremental, often paired with another scenario in this section

When more than one source must land in the same conformed table and you still want incremental or parameterized runs, align business keys and grain before you combine branches. If identifiers differ across sources, fix them upstream in Work Nodes.

Both patterns live in the Incremental Loading package. Pick Run View for parameterized UNION-style runs across separately named sources. Pick Grouped Incremental Load when the same template should run across Storage Mapping locations and the target stores which location each row came from.

Run View

Use Run View when you need parameterized UNION-style runs across separately named sources.

  1. Add a Run View Node and turn Multi Source on. Choose the Multi Source Strategy the listing documents so branches combine the way you expect.
  2. Name or suffix source Nodes so the sourcesuffix value in your deploy or refresh configuration can select one source, a list, or ALL, as described in Run View Parameters on the Incremental Loading package listing.
  3. Optional: enable Filter data based on source and Incremental load based on Persistent table when a single shared watermark on the persistent target should bound every branch you include in that run.

Grouped Incremental Load

Use Grouped Incremental Load when the same template should run across Storage Mapping locations and you track which location each row came from.

  1. Add Grouped Incremental Load and set Storage Mapping ID to the comma-separated list of Storage Locations this load should visit.
  2. Enter Persistent table location, Persistent table name, Persistent table Column Name for the column that stores the location ID, and Grouped Incremental load column for the date or timestamp that drives increments.
  3. Toggle Batch Load when you need batched passes per location instead of a straight pass. Use Joining Table Column Name when the package's join template needs an extra key between locations.
  4. Generate join and Copy to Editor when the template prompts you so the join matches your schema.

Watermark Columns and Pitfalls

Pick and test the column that bounds each run. For Incremental Load, Run View, Looped Load, and Grouped Incremental field-by-field setup, use the scenario subsection that matches your load shape earlier on this page.

Choosing a Watermark Column

Pick a column type that matches how the source updates and how wide your batch windows are.

| Type | Notes |
| --- | --- |
| DATE | Broad strokes. Good for simple daily batches. |
| TIMESTAMP or DATETIME | Usual choice when the source keeps it current. |
| Monotonic version or sequence | Use when timestamps lie or drift. |
| Avoid | Non-monotonic or very dense surrogate keys, unless you accept the scan cost. |

Use COALESCE(updated_at, created_at) when new rows only fill created_at until the first update.

Watermark Pitfalls

Watch for sparse values, time zones, and first-run behavior before you rely on a filter in production.

  • Sparse timestamps: If many rows have null in the watermark column, your filter skips them. Fix upstream or normalize with COALESCE.
  • Time zones: Put warehouse and source on one rule before you compare values.
-- Compare in one zone
WHERE CONVERT_TIMEZONE('UTC', 'America/New_York', src_ts) > :watermark_nyc;
  • First empty persistent table: Use the first-run pattern in Batch Jobs and Timestamp Column, or seed a safe minimum watermark so you don't drop history.

Hard cases, such as custom delete consensus or unusual CDC feeds, usually combine custom SQL in Pre-SQL and Post-SQL with ref(), ref_no_link(), and ref_link().


Handle Deletes With Incremental Loads

A high-water mark filter tells you what is new or changed. It doesn't remove rows when the source hard deletes a key unless you add explicit delete logic.

If a Node must run after a deletes staging Node but doesn't reference it in FROM, add a line with ref_link() so the DAG enforces the correct order.

Soft Deletes

Mark rows as inactive instead of removing them. Incremental loads treat these as updates and pass them through merge logic.

Common patterns:

  • Maintain IS_DELETED or DELETED_AT in the source

  • Filter active records downstream:

    • WHERE IS_DELETED = 0
    • WHERE DELETED_AT IS NULL

Apply this filtering as early as possible so downstream Nodes don't depend on implicit rules.

Hard Deletes

Physically remove rows when required for compliance or cost.

This requires explicit handling because incremental filters alone won't catch removed records. Typical approaches include:

  • Merge logic with delete conditions
  • Separate delete feeds or CDC streams
  • Pre-SQL or Post-SQL cleanup

Coordinate carefully with facts and aggregates to avoid orphaned keys.

Delete Feeds and CDC

Capture deletes explicitly from the source system rather than inferring them.

Options include:

  • Staging delete records separately
  • Reading CDC metadata, for example operation types
  • Applying deletes through merge logic or dedicated Packages

This is the most reliable pattern when available.

Snapshot or Full Comparison

Compare a previous extract to the current one and treat missing rows as deletes.

Use when:

  • No delete flag or CDC exists
  • Full extracts are available

This approach is more compute-intensive but ensures completeness.

Periodic Rebuilds

Run incremental loads between scheduled full or scoped rebuilds.

Use rebuilds to:

  • Reconcile missed deletes
  • Clean up drift over time

This is a practical fallback when real-time delete handling is incomplete.

Advanced Deploy Merge Options

Fact and Dimension Advanced Deploy can apply unmatched-record and delete behavior where supported.

Use these features when:

  • You want built-in merge handling
  • Your Node type supports delete strategies

Refer to product documentation for supported configurations.

Cascading Soft Deletes Through the DAG

Soft-deleted dimensions can still appear in reports if facts continue to join to them.

To prevent this:

  • Filter active rows in dimension outputs, or
  • Join facts with IS_DELETED = 0, or
  • Propagate delete flags into facts and aggregates

Apply the rule early to avoid inconsistent downstream behavior.

Multi-Source Delete Consensus

When multiple sources define delete status differently, define a clear policy before implementation:

  • Any-source delete Delete if any source marks the record deleted, for example MAX(is_deleted)

  • All-sources delete Delete only if all sources agree

  • Priority source One system determines delete status

These are custom patterns implemented in staging or merge logic, not default behavior.
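An any-source consensus, for example, can be sketched with BOOLOR_AGG across the branches, assuming two staging Nodes stg_crm_customer and stg_erp_customer that each carry an is_deleted flag (all names illustrative):

```sql
-- Any-source delete: a key is deleted if any branch marks it deleted
SELECT customer_id,
       BOOLOR_AGG(is_deleted) AS delete_consensus
FROM (
    SELECT customer_id, is_deleted FROM stg_crm_customer
    UNION ALL
    SELECT customer_id, is_deleted FROM stg_erp_customer
)
GROUP BY customer_id;
```

Swapping BOOLOR_AGG for BOOLAND_AGG gives the all-sources-agree policy; a priority-source policy instead filters or ranks by source before aggregating.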

Multi-Source Targets and Deletes

  • Per-source ownership Apply deletes only to rows owned by that source

  • Conformed keys and tombstones Prefer a single shared key and unified delete logic

  • Partial runs with sourcesuffix Skipped sources retain their rows until processed again

  • Orphan cleanup Retired sources may require one-time cleanup

Custom Multi-Source State

Optionally, when sources refresh on different schedules, you can store per-source rows in a small state table for last watermark, last load time, and status. After a failure, move only successful sources forward so the next run retries stale sources without losing good progress.
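A minimal version of that state table, and the per-source advance after a successful load, might look like this (the schema and the :source_name / :new_watermark bind variables are illustrative assumptions):

```sql
-- One row of bookkeeping per source
CREATE TABLE IF NOT EXISTS incremental_source_state (
    source_name    STRING,
    last_watermark TIMESTAMP_NTZ,
    last_load_at   TIMESTAMP_NTZ,
    status         STRING          -- e.g. 'SUCCEEDED', 'FAILED'
);

-- Advance only the source that actually succeeded
UPDATE incremental_source_state
SET last_watermark = :new_watermark,
    last_load_at   = CURRENT_TIMESTAMP(),
    status         = 'SUCCEEDED'
WHERE source_name = :source_name;
```

The SELECT in the Resilience section below then reads last_watermark from this table for the source it is about to load.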

Resilience

  • All-or-nothing - One transaction across all sources. Simplest consistency, but one failure can block everything.
  • Per source - Separate commits. Sources reach eventual consistency. Often easier to run in production.
-- Watermark for one source name
SELECT COALESCE(MAX(last_watermark), TIMESTAMP '1900-01-01')
FROM incremental_source_state
WHERE source_name = :source_name;

Edge Cases and Gotchas

These issues often appear after the design is working for the happy path.

  • Truncate Before upstream - If you truncate staging in a way that wipes the persistent incremental target, you can reset the watermark and trigger reloads or duplicates. Keep persistent targets away from volatile truncates unless you want a rebuild.
  • Identity or sequence gaps - If surrogate keys reset after you drop and recreate tables, facts can point at wrong dimensions. Prefer stable keys or the recovery steps your warehouse documents.
  • NULL keys or NULL hashes - Bad keys make merges unpredictable. Fix keys upstream or drop bad rows on purpose.
  • Multi-source partial failure - If one branch of a UNION fails mid-run, you can end up with a mixed state. Prefer the Insert strategy with multiple INSERT statements on Multi Source, split Nodes by source, or use transactions your platform supports.
  • Soft deletes hidden in facts - Join to active dimension versions or carry delete flags on purpose. See Cascading Soft Deletes Through the DAG.

Troubleshooting

Use this section when merges fail, watermarks look wrong, or Pre-SQL and Post-SQL behave unexpectedly.

Non-Deterministic Merge

Duplicate merge keys, unstable row order, or ties on the watermark can cause errors or double updates.

  • Checks: Run GROUP BY business_key HAVING COUNT(*) > 1 on staging. Remove duplicates upstream. Strengthen the business key.

Watermark Appears Stuck

Check that the persistent table still shows the MAX(incremental_column) you expect, the filter uses the same column and time zone, and no stray truncate cleared history.

Statement Count or Stage Separation

Give each Pre-SQL or Post-SQL stage one clear job when the platform allows one statement per phase, or split work across stages in the graph when supported.

Performance and Cost

Tune warehouse compute and batch sizing so incremental Jobs finish inside your SLA without over-scanning.

  • On Snowflake, use clustering keys so micro-partition pruning stays effective on the columns you filter by; Snowflake has no OLTP-style B-tree indexes. Keep filters selective.
  • Batch sizing: Bigger windows mean fewer Jobs. Smaller windows mean fresher data. Balance SLA and spend.
  • Looped Load - Use it when one full scan won't finish in your window.

Monitoring Ideas

For each source or job, track:

  • Delay between max timestamp in source and max timestamp loaded
  • Job failures and last success time
  • Row counts and error tables if you quarantine bad rows

What's Next?