Incremental Loading

With incremental loading, you process only new or changed rows since the last good run instead of reading the full data set every time. This page explains how incremental loading works in general and how you can apply it to your data platform.

Incremental Loading Compared to a Full Refresh

Use this table to compare full refreshes and incremental loads at a high level.

| Aspect | Full refresh | Incremental load |
| --- | --- | --- |
| Data processed | The full source every run | Only new or changed rows |
| Cost | Rises with total size | Rises with how much changes |
| Latency | Often high for large sources | Often lower when changes are small |
| Complexity | Simpler patterns | You need a watermark, CDC, or careful state tracking |
| Delete handling | If you compare two full snapshots, missing rows can mean deletes | You usually model deletes with flags, CDC, or reconciliation |
| Typical fit | Small sources, full exports, planned rebuilds | Large facts, APIs with UPDATED_AT, stream-style CDC |

Core Concepts

These ideas show up on every warehouse: how you bound each run, where you store loaded rows, what counts as changed data, and when change data capture is the right tool.

Watermark

A watermark is the latest value of a column you've already loaded, often a timestamp. The next run only pulls rows above that line, like WHERE incremental_column > watermark.

Persistent Target

With a persistent table, you keep the business rows you've already loaded from the incremental stage, not just bookkeeping metadata. The next run often takes the watermark from MAX(incremental_column) on that table. Stay consistent about where that watermark value comes from in your graph.
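As a sketch of this idea, assuming a persistent target named persistent_stg_orders and a staging object named stg_orders with an updated_at column (all three names are illustrative, not package defaults), the next run's bound could come from:

```sql
-- Derive the watermark from rows already in the persistent target
SET watermark = (
  SELECT COALESCE(MAX(updated_at), '1900-01-01'::TIMESTAMP_NTZ)
  FROM persistent_stg_orders
);

-- Pull only rows above that line
SELECT *
FROM stg_orders
WHERE updated_at > $watermark;
```

The COALESCE fallback matters on the first run, when the persistent table is empty and MAX returns NULL.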

Delta

The delta is what you want this run: inserts, updates, and deletes if you track them. Timestamp-style loads see inserts and updates that appear in the extract. Hard deletes in the source don't show up unless you add CDC, soft-delete columns, snapshot compares, or a delete feed.

Change Data Capture

CDC means the source, or a layer in front of it, sends change events or change rows for inserts, updates, and deletes. Snowflake Streams, Kafka topics, database log CDC, and SaaS event APIs are examples.
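For the Snowflake Streams case, a minimal sketch looks like this, assuming a source table raw.orders (an illustrative name):

```sql
-- A standard stream records inserts, updates, and deletes on a table
CREATE OR REPLACE STREAM orders_stream ON TABLE raw.orders;

-- Selecting from a stream returns the delta since the last consumed offset,
-- plus change metadata columns such as METADATA$ACTION and METADATA$ISUPDATE
SELECT * FROM orders_stream;
```

Consuming the stream inside a completed DML transaction advances its offset, so the next read starts from the new position.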

Choose an Incremental Strategy

This section will help you figure out which strategy works best for your data.

Batch Jobs and Timestamp Column

This is the most common pattern: you run jobs on a schedule, such as hourly or daily, and the source has an UPDATED_AT column.

Use: Watermark with a timestamp column

On Snowflake, install the Incremental Loading package and use the Incremental Load Node type.

  1. In Build Settings > Packages, install the Incremental Loading package, then add an Incremental Load Node after your source. Add a Persistent Stage, Dimension, or Fact, or another persistent target, to hold rows you have already loaded.
  2. For the first run, leave Filter data based on Persistent table off, Create the Node, and Run the persistent target so it is populated. For later runs, turn Filter data based on Persistent table on so the template compares staging to that table.
  3. Set Persistent table location and Persistent table name to match your persistent Node. Set Incremental load column to the source column you use as the watermark, for example UPDATED_AT.
  4. On the Join tab, use Generate Join and Copy to Editor so the incremental predicate matches your column names. Use Pre-SQL or Post-SQL when you need extra filters or housekeeping.
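The predicate that Generate Join produces is conceptually the following, assuming stg_orders as the staging Node, persistent_stg_orders as the persistent target, and updated_at as the incremental load column (all illustrative names):

```sql
-- Conceptual shape of the incremental filter the template applies
SELECT s.*
FROM stg_orders s
WHERE s.updated_at > (
  SELECT COALESCE(MAX(updated_at), '1900-01-01'::TIMESTAMP_NTZ)
  FROM persistent_stg_orders
);
```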

Batch Jobs and Very Large Data

You run scheduled jobs, but the data is too large for a single run. Use this pattern when one pass over the source won't finish in your window and you need buckets, load history, or parameterized windows:

Use: Looped or batched incremental

On Snowflake, use Looped Load from the Incremental Loading package. It builds load buckets from your grouping keys, loops through them, and can maintain a load history table for orchestration.

  1. Install the package in Build Settings > Packages, then add a Looped Load Node downstream of the source or staging you need to chunk.
  2. Configure Load Type and Job Parameters so each run uses the right mode. The Node supports values such as full load, full reload, incremental load, and reprocess load. Use reprocess-style runs when you need to retry failed buckets without reloading successes. See Looped Load Parameters on the Incremental Loading package listing.
  3. Under Looped Load Incremental Load Options, turn on Set Incremental Load Options when you still want a watermark inside each bucket, pick Incremental Load Column (date), and set Group Incremental when you need more than one bucket per run.
  4. Under Looped Load Group Table Options, choose Dedicated Load History Table and Load History Table Location when you want this Node's history isolated from other looped loads.
  5. Use Pre-SQL and Post-SQL for setup and cleanup around each loop when the platform allows.

Batch Jobs and Full Dumps Without a Timestamp

You only receive full snapshots of data.

Use: Snapshot Comparison

When each Job receives a full snapshot and you must diff it against what is already in the warehouse:

Compare snapshots with base Nodes, optionally combined with the Incremental Loading package when your diff is already a view the package can load.

  1. Land the latest extract in a Work or Persistent Stage Node from Snowflake Base Node Types, usually with Truncate Before so the object holds only the current file or batch.
  2. Add Work Nodes that implement the diff, or use Create with Override SQL on staging paths for the same logic: for example keys present in the snapshot but absent from the target, row-level hashes or column compares for updates, and optional NOT EXISTS patterns for rows that disappeared if your design treats that as a delete signal.
  3. Point Dimension Advanced Deploy or Fact Advanced Deploy from Snowflake Base Node Types Advanced Deploy at the reduced row set. Use documented Insert Strategy, Update Strategy, Unmatched Record Strategy, and MERGE methods so inserts and updates land in one Job.
  4. When the compare SQL lives in a single view you want to treat like an incremental source, you can still add Incremental Load from the Incremental Loading package for load hygiene. You own the predicate or full replace semantics in the view logic.
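The three diffs in step 2 can be sketched like this, assuming a landed extract snapshot_today, a target dim_customer, a key customer_id, and compared columns name, email, and status (all illustrative):

```sql
-- Inserts: keys in the new snapshot but not yet in the target
SELECT s.*
FROM snapshot_today s
WHERE NOT EXISTS (SELECT 1 FROM dim_customer t WHERE t.customer_id = s.customer_id);

-- Updates: same key, different row hash
SELECT s.*
FROM snapshot_today s
JOIN dim_customer t ON t.customer_id = s.customer_id
WHERE HASH(s.name, s.email, s.status) <> HASH(t.name, t.email, t.status);

-- Optional delete signal: keys that disappeared from the snapshot
SELECT t.customer_id
FROM dim_customer t
WHERE NOT EXISTS (SELECT 1 FROM snapshot_today s WHERE s.customer_id = t.customer_id);
```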

Batch Jobs and Full Accuracy With Deletes

You must detect inserts, updates, and deletes without CDC.

Use: Full Reload and Compare

When the staging table represents the full current source and you need MERGE behavior that can insert, update, and remove keys that disappeared from the extract, without CDC:

On Snowflake, Snowflake Base Node Types hold the full snapshot and Snowflake Base Node Types Advanced Deploy supplies MERGE with Unmatched Record Strategy.

  1. Load every run's extract into Work or Persistent Stage with Snowflake Base Node Types, typically Truncate Before on so staging matches the file or table your Job just received.
  2. Point Dimension Advanced Deploy or Fact Advanced Deploy at that staging Node. Set Business Key when the listing requires it, choose MERGE as the load method, and set Update Strategy to Merge so unmatched handling is available.
  3. Set Unmatched Record Strategy to NO DELETE, SOFT DELETE, or HARD DELETE depending on governance. Dimension Advanced Deploy documents SOFT DELETE and HARD DELETE for rows no longer in the source. Fact Advanced Deploy documents HARD DELETE and NO DELETE where the listing applies.
  4. Keep dependent Facts and Dimensions in DAG order when deletes propagate. If a Node must run after a delete staging step but doesn't reference it in SQL, enforce order with ref_link(). Read Deleting Records from Fact and Dimension Nodes for patterns that keep facts and dimensions consistent.
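Under the hood, the MERGE plus unmatched-record handling amounts to something like the following sketch. Snowflake's MERGE has no WHEN NOT MATCHED BY SOURCE clause, so removing vanished keys is a separate statement; all table and column names here are illustrative:

```sql
-- Upsert from the full snapshot in staging
MERGE INTO dim_customer t
USING stg_customer s
  ON t.customer_id = s.customer_id
WHEN MATCHED THEN UPDATE SET t.name = s.name, t.email = s.email
WHEN NOT MATCHED THEN INSERT (customer_id, name, email)
  VALUES (s.customer_id, s.name, s.email);

-- SOFT DELETE variant: flag keys missing from the snapshot
UPDATE dim_customer t
SET is_deleted = TRUE,
    deleted_at = CURRENT_TIMESTAMP()
WHERE NOT EXISTS (SELECT 1 FROM stg_customer s WHERE s.customer_id = t.customer_id)
  AND NOT t.is_deleted;
```

A HARD DELETE variant replaces the UPDATE with a DELETE using the same NOT EXISTS predicate.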

Streaming or Frequent Jobs and Change Events

Data arrives continuously or includes change metadata.

Use: Change data capture, CDC

When change rows or stream metadata are already available and you need near-continuous loads instead of one large batch window:

Snowflake CDC in Coalesce centers on the Snowflake Streams and Tasks package: Stream Nodes hold the offset, and Task Nodes run the work on a schedule or when the stream shows data.

  1. Install the package under Build Settings > Packages.
  2. Add a Stream Node on the Snowflake table or view you want to track so Snowflake records inserts, updates, and deletes according to your stream options, for example Append Only or Propagate Deletes where the matrix applies.
  3. Add Stream and Insert or Merge or Delta Stream Merge downstream. Stream and Insert or Merge applies the stream delta with INSERT or MERGE. Delta Stream Merge uses stream metadata such as METADATA$ACTION when you need tighter alignment on updates and deletes.
  4. Attach Work with Task, Dimension with Task, Fact with Task, or Insert or Merge with Task when you need Schedule, Stream Has Data, warehouse sizing, or Multi-Stream orchestration the matrices describe.
  5. Use Task DAG Create Root and Task DAG Resume Root so suspended Snowflake tasks actually resume in the right dependency order.
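The stream-plus-task pairing the steps above describe reduces to a pattern like this, with illustrative object, warehouse, and column names:

```sql
-- Run every 5 minutes, but only do work when the stream has rows
CREATE OR REPLACE TASK load_orders_task
  WAREHOUSE = transform_wh
  SCHEDULE  = '5 MINUTE'
WHEN SYSTEM$STREAM_HAS_DATA('ORDERS_STREAM')
AS
  INSERT INTO stg_orders
  SELECT order_id, amount, updated_at
  FROM orders_stream
  WHERE METADATA$ACTION = 'INSERT';

-- Tasks are created suspended and must be resumed explicitly
ALTER TASK load_orders_task RESUME;
```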

Multiple Sources Feeding One Target

You have multiple inputs updating one table.

Use: Multi-source incremental, often paired with another scenario in this section

When more than one source must land in the same conformed table and you still want incremental or parameterized runs, align business keys and grain before you combine branches. If identifiers differ across sources, fix them upstream in Work Nodes.

Both patterns live in the Incremental Loading package. Pick Run View for parameterized UNION-style runs across separately named sources. Pick Grouped Incremental Load when the same template should run across Storage Mapping locations and the target stores which location each row came from.

Run View

Use Run View when you need parameterized UNION-style runs across separately named sources.

  1. Add a Run View Node and turn Multi Source on. Choose the Multi Source Strategy the listing documents so branches combine the way you expect.
  2. Name or suffix source Nodes so the sourcesuffix value in your deploy or refresh configuration can select one source, a list, or ALL, as described in Run View Parameters on the Incremental Loading package listing.
  3. Optional: enable Filter data based on source and Incremental load based on Persistent table when a single shared watermark on the persistent target should bound every branch you include in that run.

Grouped Incremental Load

Use Grouped Incremental Load when the same template should run across Storage Mapping locations and you track which location each row came from.

  1. Add Grouped Incremental Load and set Storage Mapping ID to the comma-separated list of Storage Locations this load should visit.
  2. Enter Persistent table location, Persistent table name, Persistent table Column Name for the column that stores the location ID, and Grouped Incremental load column for the date or timestamp that drives increments.
  3. Toggle Batch Load when you need batched passes per location instead of a straight pass. Use Joining Table Column Name when the package's join template needs an extra key between locations.
  4. Generate join and Copy to Editor when the template prompts you so the join matches your schema.

Watermark Columns and Pitfalls

Pick and test the column that bounds each run. For Incremental Load, Run View, Looped Load, and Grouped Incremental field-by-field setup, use the scenario subsection that matches your load shape earlier on this page.

Choosing a Watermark Column

Pick a column type that matches how the source updates and how wide your batch windows are.

| Type | Notes |
| --- | --- |
| DATE | Broad strokes. Good for simple daily batches. |
| TIMESTAMP or DATETIME | Usual choice when the source keeps it current. |
| Monotonic version or sequence | Use when timestamps lie or drift. |
| Avoid | Non-monotonic or very dense surrogate keys, unless you accept the scan cost. |

Use COALESCE(updated_at, created_at) when new rows only fill created_at until the first update.

Watermark Pitfalls

Watch for sparse values, time zones, and first-run behavior before you rely on a filter in production.

  • Sparse timestamps: If many rows have null in the watermark column, your filter skips them. Fix upstream or normalize with COALESCE.
  • Time zones: Put warehouse and source on one rule before you compare values.
-- Compare in one zone
WHERE CONVERT_TIMEZONE('UTC', 'America/New_York', src_ts) > :watermark_nyc;
  • First empty persistent table: Use the first-run pattern in Batch Jobs and Timestamp Column, or seed a safe minimum watermark so you don't drop history.

Hard cases, such as custom delete consensus or unusual CDC feeds, usually combine custom SQL in Pre-SQL and Post-SQL with ref(), ref_no_link(), and ref_link().


Handle Deletes With Incremental Loads

A high-water mark filter tells you what is new or changed. It doesn't remove rows when the source hard deletes a key unless you add explicit delete logic.

If a Node must run after a deletes staging Node but doesn't reference it in FROM, add a line with ref_link() so the DAG enforces the correct order.

Soft Deletes

Mark rows as inactive instead of removing them. Incremental loads treat these as updates and pass them through merge logic.

Common patterns:

  • Maintain IS_DELETED or DELETED_AT in the source

  • Filter active records downstream:

    • WHERE IS_DELETED = 0
    • WHERE DELETED_AT IS NULL

Apply this filtering as early as possible so downstream Nodes don't depend on implicit rules.

Hard Deletes

Physically remove rows when required for compliance or cost.

This requires explicit handling because incremental filters alone won't catch removed records. Typical approaches include:

  • Merge logic with delete conditions
  • Separate delete feeds or CDC streams
  • Pre-SQL or Post-SQL cleanup

Coordinate carefully with facts and aggregates to avoid orphaned keys.

Delete Feeds and CDC

Capture deletes explicitly from the source system rather than inferring them.

Options include:

  • Staging delete records separately
  • Reading CDC metadata, for example operation types
  • Applying deletes through merge logic or dedicated Packages

This is the most reliable pattern when available.

Snapshot or Full Comparison

Compare a previous extract to the current one and treat missing rows as deletes.

Use when:

  • No delete flag or CDC exists
  • Full extracts are available

This approach is more compute-intensive but ensures completeness.

Periodic Rebuilds

Run incremental loads between scheduled full or scoped rebuilds.

Use rebuilds to:

  • Reconcile missed deletes
  • Clean up drift over time

This is a practical fallback when real-time delete handling is incomplete.

Advanced Deploy Merge Options

Fact and Dimension Advanced Deploy can apply unmatched-record and delete behavior where supported.

Use these features when:

  • You want built-in merge handling
  • Your Node type supports delete strategies

Refer to product documentation for supported configurations.

Cascading Soft Deletes Through the DAG

Soft-deleted dimensions can still appear in reports if facts continue to join to them.

To prevent this:

  • Filter active rows in dimension outputs, or
  • Join facts with IS_DELETED = 0, or
  • Propagate delete flags into facts and aggregates

Apply the rule early to avoid inconsistent downstream behavior.

Multi-Source Delete Consensus

When multiple sources define delete status differently, define a clear policy before implementation:

  • Any-source delete Delete if any source marks the record deleted, for example MAX(is_deleted)

  • All-sources delete Delete only if all sources agree

  • Priority source One system determines delete status

These are custom patterns implemented in staging or merge logic, not default behavior.
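An any-source consensus, for example, can be sketched with BOOLOR_AGG across the branches, assuming two staging Nodes stg_crm_customer and stg_erp_customer that each carry an is_deleted flag (all names illustrative):

```sql
-- Any-source delete: a key is deleted if any branch marks it deleted
SELECT customer_id,
       BOOLOR_AGG(is_deleted) AS delete_consensus
FROM (
    SELECT customer_id, is_deleted FROM stg_crm_customer
    UNION ALL
    SELECT customer_id, is_deleted FROM stg_erp_customer
)
GROUP BY customer_id;
```

Swapping BOOLOR_AGG for BOOLAND_AGG gives the all-sources-agree policy; a priority-source policy instead filters or ranks by source before aggregating.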

Multi-Source Targets and Deletes

  • Per-source ownership Apply deletes only to rows owned by that source

  • Conformed keys and tombstones Prefer a single shared key and unified delete logic

  • Partial runs with sourcesuffix Skipped sources retain their rows until processed again

  • Orphan cleanup Retired sources may require one-time cleanup

Custom Multi-Source State

Optionally, when sources refresh on different schedules, you can store per-source rows in a small state table for last watermark, last load time, and status. After a failure, move only successful sources forward so the next run retries stale sources without losing good progress.
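A minimal version of that state table, and the per-source advance after a successful load, might look like this (the schema and the :source_name / :new_watermark bind variables are illustrative assumptions):

```sql
-- One row of bookkeeping per source
CREATE TABLE IF NOT EXISTS incremental_source_state (
    source_name    STRING,
    last_watermark TIMESTAMP_NTZ,
    last_load_at   TIMESTAMP_NTZ,
    status         STRING          -- e.g. 'SUCCEEDED', 'FAILED'
);

-- Advance only the source that actually succeeded
UPDATE incremental_source_state
SET last_watermark = :new_watermark,
    last_load_at   = CURRENT_TIMESTAMP(),
    status         = 'SUCCEEDED'
WHERE source_name = :source_name;
```

The SELECT in the Resilience section below then reads last_watermark from this table for the source it is about to load.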

Resilience

  • All-or-nothing - One transaction across all sources. Simplest consistency, but one failure can block everything.
  • Per source - Separate commits. Sources reach eventual consistency. Often easier to run in production.
-- Watermark for one source name
SELECT COALESCE(MAX(last_watermark), TIMESTAMP '1900-01-01')
FROM incremental_source_state
WHERE source_name = :source_name;

Edge Cases and Gotchas

These issues often appear after the design is working for the happy path.

  • Truncate Before upstream - If you truncate staging in a way that wipes the persistent incremental target, you can reset the watermark and trigger reloads or duplicates. Keep persistent targets away from volatile truncates unless you want a rebuild.
  • Identity or sequence gaps - If surrogate keys reset after you drop and recreate tables, facts can point at wrong dimensions. Prefer stable keys or the recovery steps your warehouse documents.
  • NULL keys or NULL hashes - Bad keys make merges unpredictable. Fix keys upstream or drop bad rows on purpose.
  • Multi-source partial failure - If one branch of a UNION fails mid-run, you can end up with a mixed state. Prefer the Insert strategy with multiple INSERT statements on Multi Source, split Nodes by source, or use transactions your platform supports.
  • Soft deletes hidden in facts - Join to active dimension versions or carry delete flags on purpose. See Cascading Soft Deletes Through the DAG.

Troubleshooting

Use this section when merges fail, watermarks look wrong, or Pre-SQL and Post-SQL behave unexpectedly.

Non-Deterministic Merge

Duplicate merge keys, unstable row order, or ties on the watermark can cause errors or double updates.

  • Checks: Run GROUP BY business_key HAVING COUNT(*) > 1 on staging. Remove duplicates upstream. Strengthen the business key.

Watermark Appears Stuck

Check that the persistent table still shows the MAX(incremental_column) you expect, the filter uses the same column and time zone, and no stray truncate cleared history.

Statement Count or Stage Separation

Give each Pre-SQL or Post-SQL stage one clear job when the platform allows one statement per phase, or split work across stages in the graph when supported.

Performance and Cost

Tune warehouse compute and batch sizing so incremental Jobs finish inside your SLA without over-scanning.

  • On Snowflake, use clustering keys so micro-partition pruning stays effective on the columns you filter by; Snowflake has no OLTP-style B-tree indexes. Keep filters selective.
  • Batch sizing: Bigger windows mean fewer Jobs. Smaller windows mean fresher data. Balance SLA and spend.
  • Looped Load - Use it when one full scan won't finish in your window.

Monitoring Ideas

For each source or job, track:

  • Delay between max timestamp in source and max timestamp loaded
  • Job failures and last success time
  • Row counts and error tables if you quarantine bad rows

What's Next?