Incremental Loading
With incremental loading, you process only new or changed rows since the last good run instead of reading the full data set every time. This page explains how incremental loading works in general and how you can apply it to your data platform.
Incremental Loading Compared to a Full Refresh
Use this table to compare full refreshes and incremental loads at a high level.
| Aspect | Full refresh | Incremental load |
|---|---|---|
| Data processed | The full source every run | Only new or changed rows |
| Cost | Rises with total size | Rises with how much changes |
| Latency | Often high for large sources | Often lower when changes are small |
| Complexity | Simpler patterns | You need a watermark, CDC, or careful state tracking |
| Delete handling | If you compare two full snapshots, missing rows can mean deletes | You usually model deletes with flags, CDC, or reconciliation |
| Typical fit | Small sources, full exports, planned rebuilds | Large facts, APIs with UPDATED_AT, stream-style CDC |
Core Concepts
These ideas show up on every warehouse: how you bound each run, where you store loaded rows, what counts as changed data, and when change data capture is the right tool.
Watermark
A watermark is the latest value of a column you've already loaded, often a timestamp. The next run only pulls rows above that line, like `WHERE incremental_column > watermark`.
Persistent Target
With a persistent table, you keep the business rows you've already loaded from the incremental stage, not just bookkeeping metadata. The next run often takes the watermark from MAX(incremental_column) on that table. Stay consistent about where that watermark value comes from in your graph.
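As a minimal sketch of the pattern above, the next run can read its watermark from the persistent target inside a subquery. Table and column names here (`STG_ORDERS`, `PSTG_ORDERS`, `UPDATED_AT`) are assumptions for illustration:

```sql
-- Pull only rows above the high-water mark already stored in the
-- persistent target. COALESCE guards the first run against an empty table.
SELECT src.*
FROM STG_ORDERS src
WHERE src.UPDATED_AT > (
  SELECT COALESCE(MAX(UPDATED_AT), TIMESTAMP '1900-01-01')
  FROM PSTG_ORDERS
);
```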
Delta
The delta is what you want this run: inserts, updates, and deletes if you track them. Timestamp-style loads see inserts and updates that appear in the extract. Hard deletes in the source don't show up unless you add CDC, soft-delete columns, snapshot compares, or a delete feed.
Change Data Capture
CDC means the source, or a layer in front of it, sends change events or change rows for inserts, updates, and deletes. Snowflake Streams, Kafka topics, database log CDC, and SaaS event APIs are examples.
Choose an Incremental Strategy
This section will help you figure out which strategy works best for your data.
Batch Jobs and Timestamp Column
This pattern is the most common. You run jobs on a schedule such as hourly or daily and have an UPDATED_AT column.
Use: Watermark with a timestamp column
- Snowflake
- Databricks
- BigQuery
On Snowflake, install the Incremental Loading package and use the Incremental Load Node type.
- In Build Settings > Packages, install the Incremental Loading package, then add an Incremental Load Node after your source. Add a Persistent Stage, Dimension, or Fact, or another persistent target, to hold rows you have already loaded.
- For the first run, leave Filter data based on Persistent table off, Create the Node, and Run the persistent target so it is populated. For later runs, turn Filter data based on Persistent table on so the template compares staging to that table.
- Set Persistent table location and Persistent table name to match your persistent Node. Set Incremental load column to the source column you use as the watermark, for example `UPDATED_AT`.
- On the Join tab, use Generate Join and Copy to Editor so the incremental predicate matches your column names. Use Pre-SQL or Post-SQL when you need extra filters or housekeeping.
On Databricks, install Incremental Nodes and use the Incremental Load Node.
- In Build Settings > Packages, install Incremental Nodes.
- Add an Incremental Load Node downstream of your source. Set Create As to table or view, and set Persistent table location and Persistent table name to the table where loaded history lives.
- For the initial load, set Filter data based on Persistent table to False, deploy, and load. For incremental runs, set it to True so the Node limits rows using the persistent table.
- Set Incremental load column (date) to your watermark column. Review the generated SQL and adjust if your keys or filters need custom handling.
BigQuery uses BigQuery Base Node Types and BigQuery Base Node Types Advanced Deploy, not a single incremental Package.
- Model your source in a Work, Persistent Stage, Dimension, or Fact Node, and point a downstream persistent Node at the table that should keep history.
- Write the incremental slice in Node SQL with a predicate on your watermark column. Read the current high-water mark from the persistent table inside a subquery. Use ref_no_link() when you need the fully qualified name without the DAG edge you would get from ref() alone, and use ref() when downstream order must depend on that target.
- Open Advanced Deploy on Dimensions or Facts when you want Last Modified Comparison or documented merge options that fit incremental batches.
- First run: either disable the incremental filter until the target has data, or seed a safe minimum watermark so you don't drop history by mistake.
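The seeded-watermark option above can look like this in BigQuery SQL. Table names are hypothetical; the point is that `COALESCE` supplies a safe minimum so a run against an empty target keeps all history:

```sql
-- First-run-safe incremental slice: an empty persistent table yields the
-- seed value, so no history is dropped by the filter.
SELECT src.*
FROM `project.raw.orders` src
WHERE src.updated_at > (
  SELECT COALESCE(MAX(updated_at), TIMESTAMP '1900-01-01')
  FROM `project.stage.pstg_orders`
);
```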
Batch Jobs and Very Large Data
You run scheduled jobs, but the data is too large to process in a single run. Use this pattern when one pass over the source won't finish and you need buckets, history, or parameterized windows:
Use: Looped or batched incremental
- Snowflake
- Databricks
- BigQuery
On Snowflake, use Looped Load from the Incremental Loading package. It builds load buckets from your grouping keys, loops through them, and can maintain a load history table for orchestration.
- Install the package in Build Settings > Packages, then add a Looped Load Node downstream of the source or staging you need to chunk.
- Configure Load Type and Job Parameters so each run uses the right mode. The Node supports values such as full load, full reload, incremental load, and reprocess load. Use reprocess-style runs when you need to retry failed buckets without reloading successes. See Looped Load Parameters on the Incremental Loading package listing.
- Under Looped Load Incremental Load Options, turn on Set Incremental Load Options when you still want a watermark inside each bucket, pick Incremental Load Column (date), and set Group Incremental when you need more than one bucket per run.
- Under Looped Load Group Table Options, choose Dedicated Load History Table and Load History Table Location when you want this Node's history isolated from other looped loads.
- Use Pre-SQL and Post-SQL for setup and cleanup around each loop when the platform allows.
On Databricks, build batches with base Nodes and SQL you control.
- Keep the Incremental Load Node from Incremental Nodes when each batch is small enough. When it isn't, add Work or Persistent Stage Nodes whose SQL limits each run to a slice of the source, for example a date range, numeric key band, or file group.
- Drive the slice with Parameters you set at deploy or refresh so you can change the window without editing the Node again. One Job can process one `batch_id` value today and another tomorrow.
- Chain Nodes or stagger Jobs so heavy batches don't overlap on one cluster. Use Multi Source Nodes when several slices should land in one target with a defined load order.
- Reuse the same watermark ideas as a standard incremental load, but apply them inside the batch predicate so you don't widen the scan past the current chunk.
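A sketch of a batch predicate that combines a window with the watermark, as the last bullet describes. `${batch_start}` and `${batch_end}` stand in for Parameter values, and all table names are assumptions:

```sql
-- Limit the scan to one batch window, then apply the normal watermark
-- inside that window so the filter never widens past the current chunk.
SELECT *
FROM raw_events
WHERE event_date >= '${batch_start}'
  AND event_date <  '${batch_end}'
  AND updated_at >  (SELECT MAX(updated_at) FROM pstg_events);
```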
On BigQuery, batching is SQL and scheduling.
- Filter on partition columns, ingest timestamps, or business dates so each query prunes storage instead of scanning the full data set.
- Expose batch start, end, or shard id as Parameters and reference them in Work or Persistent Stage SQL. Run the Job once per window, or run parallel Jobs with different parameter sets when your process allows it.
- Land each batch in a staging Node, then merge or append into the persistent Fact or Dimension with BigQuery Base Node Types Advanced Deploy so keys stay consistent across batches.
- Watch bytes processed and slot time when sizing windows. Smaller batches cost more round trips; larger batches risk timeouts.
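On BigQuery, the batch filter should land on a partition column so each query prunes storage. A minimal sketch, assuming a date-partitioned table and query parameters named `@batch_start` and `@batch_end`:

```sql
-- Filtering on the partitioning column lets BigQuery prune partitions,
-- so bytes processed scale with the window, not the whole table.
SELECT *
FROM `project.raw.events`
WHERE event_date >= @batch_start
  AND event_date <  @batch_end;
```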
Batch Jobs and Full Dumps Without a Timestamp
You only receive full snapshots of data.
Use: Snapshot Comparison
When each Job receives a full snapshot and you must diff it against what is already in the warehouse:
- Snowflake
- Databricks
- BigQuery
On Snowflake, compare snapshots with base Nodes, optionally combined with the Incremental Loading package when your diff is already a view the package can load.
- Land the latest extract in a Work or Persistent Stage Node from Snowflake Base Node Types, usually with Truncate Before so the object holds only the current file or batch.
- Add Work Nodes that implement the diff, or use Create with Override SQL on staging paths for the same logic: for example keys present in the snapshot but absent from the target, row-level hashes or column compares for updates, and optional `NOT EXISTS` patterns for rows that disappeared if your design treats that as a delete signal.
- Point Dimension Advanced Deploy or Fact Advanced Deploy from Snowflake Base Node Types Advanced Deploy at the reduced row set. Use documented Insert Strategy, Update Strategy, Unmatched Record Strategy, and MERGE methods so inserts and updates land in one Job.
- When the compare SQL lives in a single view you want to treat like an incremental source, you can still add Incremental Load from the Incremental Loading package for load hygiene. You own the predicate or full replace semantics in the view logic.
On Databricks, build the compare entirely in SQL on Databricks Base Node Types Work, Persistent Stage, Dimension, and Fact Nodes.
- Stage the full extract in Work with Truncate Before or INSERT OVERWRITE behavior so each run replaces the staging table with the new snapshot.
- Join or subtract against the current persistent table in another Work Node. Use patterns your SQL engine supports, for example `EXCEPT`, `ANTI JOIN`, or hash compares, to emit insert and update rows.
- For deletes inferred from missing keys, branch that logic in SQL or handle it in Post-SQL with a `MERGE` when you need database-native merge semantics that handle inserts and updates together.
- Load the target Fact or Dimension with append or overwrite settings that match whether you are loading a full recomputed table or a delta table. Use Multi Source Nodes when the compare splits into multiple ordered inserts.
BigQuery follows the same staging-and-compare pattern using BigQuery Base Node Types.
- Materialize the incoming snapshot in Work or Persistent Stage SQL, keeping grain and keys identical to the persistent model.
- Compare to the target using `EXCEPT DISTINCT`, `NOT EXISTS`, or a hash column you calculate in SQL. Keep compare sets selective so you don't scan more bytes than needed.
- Apply BigQuery Base Node Types Advanced Deploy on the downstream Dimension or Fact so `MERGE` or other documented load methods apply your inserts and updates. Turn Truncate Before off when you merge into an existing table instead of reloading it wholesale.
- Deleting keys missing from the snapshot requires explicit merge delete rules or a separate delete step. See Deleting Records from Fact and Dimension Nodes when conformed models must stay aligned.
Batch Jobs and Full Accuracy With Deletes
You must detect inserts, updates, and deletes without CDC.
Use: Full Reload and Compare
When the staging table represents the full current source and you need MERGE behavior that can insert, update, and remove keys that disappeared from the extract, without CDC:
- Snowflake
- Databricks
- BigQuery
On Snowflake, Snowflake Base Node Types hold the full snapshot and Snowflake Base Node Types Advanced Deploy supplies MERGE with Unmatched Record Strategy.
- Load every run's extract into Work or Persistent Stage with Snowflake Base Node Types, typically Truncate Before on so staging matches the file or table your Job just received.
- Point Dimension Advanced Deploy or Fact Advanced Deploy at that staging Node. Set Business Key when the listing requires it, choose MERGE as the load method, and set Update Strategy to Merge so unmatched handling is available.
- Set Unmatched Record Strategy to NO DELETE, SOFT DELETE, or HARD DELETE depending on governance. Dimension Advanced Deploy documents SOFT DELETE and HARD DELETE for rows no longer in the source. Fact Advanced Deploy documents HARD DELETE and NO DELETE where the listing applies.
- Keep dependent Facts and Dimensions in DAG order when deletes propagate. If a Node must run after a delete staging step but doesn't reference it in SQL, enforce order with ref_link(). Read Deleting Records from Fact and Dimension Nodes for patterns that keep facts and dimensions consistent.
Databricks relies on Databricks Base Node Types plus SQL MERGE you own.
- Stage the full extract in Work or Persistent Stage with Truncate Before so the table mirrors the latest snapshot.
- Write a `MERGE` into the target Fact or Dimension table using the staging table as the source. Include `WHEN MATCHED` for updates, `WHEN NOT MATCHED` for inserts, and a `WHEN NOT MATCHED BY SOURCE` branch that deletes or flags rows when policy allows keys missing from the snapshot to leave the target.
- Run that statement from Post-SQL on the target Node, or split it into a Work Node whose sole job is the MERGE, depending on how you split deploy and refresh steps.
- Document orphan-key rules for Facts before you HARD DELETE Dimension rows. Use Multi Source Nodes only when multiple pieces of data must land in order before the MERGE runs.
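A sketch of the three-branch MERGE described above, in Databricks SQL. Table and column names (`dim_customer`, `stg_customer_snapshot`, `row_hash`, `is_deleted`) are assumptions, not your schema:

```sql
-- Full-snapshot MERGE: update changed rows, insert new keys, and
-- soft-delete keys that no longer appear in the extract.
MERGE INTO dim_customer AS tgt
USING stg_customer_snapshot AS src
  ON tgt.customer_key = src.customer_key
WHEN MATCHED AND tgt.row_hash <> src.row_hash THEN
  UPDATE SET tgt.customer_name = src.customer_name,
             tgt.row_hash      = src.row_hash,
             tgt.is_deleted    = false
WHEN NOT MATCHED THEN
  INSERT (customer_key, customer_name, row_hash, is_deleted)
  VALUES (src.customer_key, src.customer_name, src.row_hash, false)
WHEN NOT MATCHED BY SOURCE THEN
  UPDATE SET tgt.is_deleted = true;  -- or THEN DELETE, per governance
```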
BigQuery uses BigQuery Base Node Types for the snapshot and BigQuery Base Node Types Advanced Deploy for MERGE with unmatched handling.
- Materialize the full source in Work or Persistent Stage SQL each run so downstream MERGE always sees the complete current set.
- On Dimension Advanced Deploy or Fact Advanced Deploy, choose MERGE and set Unmatched Record Strategy to NO DELETE, SOFT DELETE on Dimensions where documented, or HARD DELETE so rows absent from the source follow your retention policy.
- Keep Business Key and Truncate Before aligned with the listing: you usually turn Truncate Before off on the MERGE target when you are merging into an existing table rather than replacing it.
- For conformed stars, follow Deleting Records from Fact and Dimension Nodes so Fact and Dimension deletes stay coordinated.
Streaming or Frequent Jobs and Change Events
Data arrives continuously or includes change metadata.
Use: Change data capture, CDC
When change rows or stream metadata are already available and you need near-continuous loads instead of one large batch window:
- Snowflake
- Databricks
- BigQuery
Snowflake CDC in Coalesce centers on the Snowflake Streams and Tasks package: Stream Nodes hold the offset, and Task Nodes run the work on a schedule or when the stream shows data.
- Install the package under Build Settings > Packages.
- Add a Stream Node on the Snowflake table or view you want to track so Snowflake records inserts, updates, and deletes according to your stream options, for example Append Only or Propagate Deletes where the matrix applies.
- Add Stream and Insert or Merge or Delta Stream Merge downstream. Stream and Insert or Merge applies the stream delta with INSERT or MERGE. Delta Stream Merge uses stream metadata such as `METADATA$ACTION` when you need tighter alignment on updates and deletes.
- Attach Work with Task, Dimension with Task, Fact with Task, or Insert or Merge with Task when you need Schedule, Stream Has Data, warehouse sizing, or Multi-Stream orchestration the matrices describe.
- Use Task DAG Create Root and Task DAG Resume Root so suspended Snowflake tasks actually resume in the right dependency order.
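For context on what the stream metadata looks like, here is a hand-written sketch of a stream-driven MERGE, assuming a stream named `orders_stream` and a target `dim_orders`; the generated Node SQL will differ:

```sql
-- METADATA$ACTION is 'INSERT' or 'DELETE'; METADATA$ISUPDATE marks the
-- pair of rows Snowflake emits for an update.
MERGE INTO dim_orders t
USING orders_stream s
  ON t.order_id = s.order_id
WHEN MATCHED AND s.METADATA$ACTION = 'DELETE'
             AND NOT s.METADATA$ISUPDATE THEN DELETE
WHEN MATCHED AND s.METADATA$ACTION = 'INSERT' THEN
  UPDATE SET t.amount = s.amount
WHEN NOT MATCHED AND s.METADATA$ACTION = 'INSERT' THEN
  INSERT (order_id, amount) VALUES (s.order_id, s.amount);
```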
Databricks streaming incremental loads often start with Databricks Lakeflow Declarative Pipelines. That package, abbreviated LDP, creates streaming tables Lakeflow can refresh.
- Install Databricks Lakeflow Declarative Pipelines under Build Settings > Packages and add an LDP Node in the Storage Location where the streaming table should live.
- Set Read Files to True when ingestion starts in cloud storage, or False when the pipeline reads from an existing table source. Match File Location, patterns, and File Type to your landing zone when files are the source.
- Turn on Schedule refresh when you want the listing's schedule options to drive refreshes, and use Refresh Stream when structure changes require a full streaming table rebuild.
- After column inference changes, use Include Columns Inferred and Re-Sync behavior the package describes so the streaming table schema stays aligned with files.
- Downstream Coalesce Nodes on Databricks Base Node Types can consume the streaming table on a Coalesce Job cadence you choose when you need joins or curated layers outside LDP.
On BigQuery, you run CDC by landing change rows or log tables with your replication tooling, then calling Coalesce often enough that those tables stay small.
- Keep CDC tables or change streams in BigQuery current with the product you already use for replication so each Coalesce run sees only recent events.
- Schedule Coalesce Jobs at the frequency your SLA allows, whether that is minutes or hours. Shorter intervals mean smaller reads and fresher marts.
- In BigQuery Base Node Types, filter on the change timestamp, sequence, or `_PARTITIONTIME` column your CDC pattern provides. Use ref_no_link() or ref() the same way you would for timestamp-based incremental loads so watermarks stay consistent.
- When CDC delivers explicit insert, update, or delete rows, apply BigQuery Base Node Types Advanced Deploy MERGE options or staging Work Nodes that normalize operations before the MERGE.
- Use Parameters for look-back windows or source system IDs when one Workspace processes multiple CDC feeds.
Multiple Sources Feeding One Target
You have multiple inputs updating one table.
Use: Multi-source incremental, often paired with another scenario in this section
When more than one source must land in the same conformed table and you still want incremental or parameterized runs, align business keys and grain before you combine branches. If identifiers differ across sources, fix them upstream in Work Nodes.
- Snowflake
- Databricks
- BigQuery
Both patterns live in the Incremental Loading package. Pick Run View for parameterized UNION-style runs across separately named sources. Pick Grouped Incremental Load when the same template should run across Storage Mapping locations and the target stores which location each row came from.
Run View
Use Run View when you need parameterized UNION-style runs across separately named sources.
- Add a Run View Node and turn Multi Source on. Choose the Multi Source Strategy the listing documents so branches combine the way you expect.
- Name or suffix source Nodes so the `sourcesuffix` value in your deploy or refresh configuration can select one source, a list, or `ALL`, as described in Run View Parameters on the Incremental Loading package listing.
- Optional: enable Filter data based on source and Incremental load based on Persistent table when a single shared watermark on the persistent target should bound every branch you include in that run.
Grouped Incremental Load
Use Grouped Incremental Load when the same template should run across Storage Mapping locations and you track which location each row came from.
- Add Grouped Incremental Load and set Storage Mapping ID to the comma-separated list of Storage Locations this load should visit.
- Enter Persistent table location, Persistent table name, Persistent table Column Name for the column that stores the location ID, and Grouped Incremental load column for the date or timestamp that drives increments.
- Toggle Batch Load when you need batched passes per location instead of a straight pass. Use Joining Table Column Name when the package's join template needs an extra key between locations.
- Generate join and Copy to Editor when the template prompts you so the join matches your schema.
Databricks Base Node Types implements Multi Source directly on Work, Persistent Stage, Dimension, and Fact Nodes. Read Multi Source Nodes for how Insert, UNION, and UNION ALL differ and how load order works.
- Decide whether you need separate INSERT statements with the Insert strategy, duplicate removal with UNION, or retained duplicates with UNION ALL. Order sources top to bottom on the Multi Source screen so failures and retries behave the way you expect.
- Turn Multi Source on in each Node that should own the combine, then wire one upstream branch per source table or view.
- Normalize columns in upstream Work Nodes so every branch shares the same grain and column names before the union.
- Add filters or branch-specific logic with Parameters when only some sources run in a given Job, or when you must stamp a source system column you will use later in MERGE logic.
- If joins fit the problem better than a union, keep Multi Source off and express the join in Node SQL instead.
BigQuery Base Node Types exposes the same Multi Source pattern: UNION DISTINCT or UNION ALL across branches.
- Enable Multi Source on the Node that should materialize the combined data set, then attach each source branch in the order you want statements to run.
- Align schemas across systems in Work Nodes first. Add a literal or computed `source` or `source_system` column when downstream MERGE or delete rules need to know which feed produced a row.
- Use Parameters in template SQL when each Job should include different sources or date windows without editing the graph.
- Pair the combined stage with an incremental or Advanced Deploy MERGE pattern from an earlier scenario in this section so inserts, updates, and deletes stay consistent after the UNION.
Watermark Columns and Pitfalls
Pick and test the column that bounds each run. For Incremental Load, Run View, Looped Load, and Grouped Incremental field-by-field setup, use the scenario subsection that matches your load shape earlier on this page.
Choosing a Watermark Column
Pick a column type that matches how the source updates and how wide your batch windows are.
| Type | Notes |
|---|---|
| DATE | Broad strokes. Good for simple daily batches. |
| TIMESTAMP or DATETIME | Usual choice when the source keeps it current. |
| Monotonic version or sequence | Use when timestamps lie or drift. |
| Avoid | Using only non-monotonic or very dense surrogate keys unless you accept the scan cost. |
Use `COALESCE(updated_at, created_at)` when new rows only fill `created_at` until the first update.
Watermark Pitfalls
Watch for sparse values, time zones, and first-run behavior before you rely on a filter in production.
- Sparse timestamps: If many rows have null in the watermark column, your filter skips them. Fix upstream or normalize with `COALESCE`.
- Time zones: Put warehouse and source on one rule before you compare values.
- Snowflake
- Databricks
- BigQuery
-- Compare in one zone
WHERE CONVERT_TIMEZONE('UTC', 'America/New_York', src_ts) > :watermark_nyc;
-- Use Databricks time zone functions for your runtime
WHERE from_utc_timestamp(src_ts, 'America/New_York') > :watermark_nyc;
-- Civil datetime in the zone that defines your watermark
WHERE DATETIME(src_ts, 'America/New_York') > @watermark_nyc;
- First empty persistent table: Use the first-run pattern in Batch Jobs and Timestamp Column, or seed a safe minimum watermark so you don't drop history.
Hard cases, such as custom consensus or special CDC, usually mix custom SQL with Pre-SQL and Post-SQL and ref(), ref_no_link(), and ref_link().
Handle Deletes With Incremental Loads
A high-water mark filter tells you what is new or changed. It doesn't remove rows when the source hard deletes a key unless you add explicit delete logic.
If a Node must run after a deletes staging Node but doesn't reference it in FROM, add a line with ref_link() so the DAG enforces the correct order.
Soft Deletes
Mark rows as inactive instead of removing them. Incremental loads treat these as updates and pass them through merge logic.
Common patterns:
- Maintain `IS_DELETED` or `DELETED_AT` in the source
- Filter active records downstream: `WHERE IS_DELETED = 0` or `WHERE DELETED_AT IS NULL`
Apply this filtering as early as possible so downstream Nodes don't depend on implicit rules.
Hard Deletes
Physically remove rows when required for compliance or cost.
This requires explicit handling because incremental filters alone won't catch removed records. Typical approaches include:
- Merge logic with delete conditions
- Separate delete feeds or CDC streams
- Pre-SQL or Post-SQL cleanup
Coordinate carefully with facts and aggregates to avoid orphaned keys.
Delete Feeds and CDC
Capture deletes explicitly from the source system rather than inferring them.
Options include:
- Staging delete records separately
- Reading CDC metadata, for example operation types
- Applying deletes through merge logic or dedicated Packages
This is the most reliable pattern when available.
Snapshot or Full Comparison
Compare a previous extract to the current one and treat missing rows as deletes.
Use when:
- No delete flag or CDC exists
- Full extracts are available
This approach is more compute-intensive but ensures completeness.
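A minimal sketch of the compare, assuming hypothetical tables `pstg_target` (current warehouse state) and `stg_snapshot` (latest full extract):

```sql
-- Keys in the target that no longer appear in the snapshot are
-- delete candidates for a soft-delete flag or a MERGE delete branch.
SELECT t.business_key
FROM pstg_target t
WHERE NOT EXISTS (
  SELECT 1
  FROM stg_snapshot s
  WHERE s.business_key = t.business_key
);
```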
Periodic Rebuilds
Run incremental loads between scheduled full or scoped rebuilds.
Use rebuilds to:
- Reconcile missed deletes
- Clean up drift over time
This is a practical fallback when real-time delete handling is incomplete.
Advanced Deploy Merge Options
Fact and Dimension Advanced Deploy can apply unmatched-record and delete behavior where supported.
Use these features when:
- You want built-in merge handling
- Your Node type supports delete strategies
Refer to product documentation for supported configurations.
Cascading Soft Deletes Through the DAG
Soft-deleted dimensions can still appear in reports if facts continue to join to them.
To prevent this:
- Filter active rows in dimension outputs, or
- Join facts with `IS_DELETED = 0`, or
- Propagate delete flags into facts and aggregates
Apply the rule early to avoid inconsistent downstream behavior.
Multi-Source Delete Consensus
When multiple sources define delete status differently, define a clear policy before implementation:
- Any-source delete: Delete if any source marks the record deleted, for example `MAX(is_deleted)`
- All-sources delete: Delete only if all sources agree
- Priority source: One system determines delete status
These are custom patterns implemented in staging or merge logic, not default behavior.
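The first two policies can be sketched with aggregates over a combined staging set. Assumes a hypothetical `stg_all_sources` table with a 0/1 `is_deleted` column per source row:

```sql
-- Any-source: one source voting "deleted" is enough (MAX over 0/1).
-- All-sources: every source must agree (MIN over 0/1).
SELECT
  business_key,
  MAX(is_deleted) AS deleted_any_source,
  MIN(is_deleted) AS deleted_all_sources
FROM stg_all_sources
GROUP BY business_key;
```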
Multi-Source Targets and Deletes
- Per-source ownership: Apply deletes only to rows owned by that source
- Conformed keys and tombstones: Prefer a single shared key and unified delete logic
- Partial runs with `sourcesuffix`: Skipped sources retain their rows until processed again
- Orphan cleanup: Retired sources may require one-time cleanup
Custom Multi-Source State
Optionally, when sources refresh on different schedules, you can store per-source rows in a small state table for last watermark, last load time, and status. After a failure, move only successful sources forward so the next run retries stale sources without losing good progress.
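One possible shape for that state table, matching the watermark query shown under Resilience; the exact types and status values are assumptions:

```sql
-- Per-source bookkeeping: one row per source feed. A scheduler can
-- advance last_watermark only for sources whose run succeeded.
CREATE TABLE IF NOT EXISTS incremental_source_state (
  source_name    VARCHAR,
  last_watermark TIMESTAMP,
  last_load_time TIMESTAMP,
  status         VARCHAR  -- e.g. 'SUCCESS' or 'FAILED'
);
```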
Resilience
- All-or-nothing - One transaction across all sources. Simplest consistency, but one failure can block everything.
- Per source - Separate commits. Sources reach eventual consistency. Often easier to run in production.
-- Watermark for one source name
SELECT COALESCE(MAX(last_watermark), TIMESTAMP '1900-01-01')
FROM incremental_source_state
WHERE source_name = :source_name;
Edge Cases and Gotchas
These issues often appear after the design is working for the happy path.
- Truncate Before upstream - If you truncate staging in a way that wipes the persistent incremental target, you can reset the watermark and trigger reloads or duplicates. Keep persistent targets away from volatile truncates unless you want a rebuild.
- Identity or sequence gaps - If surrogate keys reset after you drop and recreate tables, facts can point at wrong dimensions. Prefer stable keys or the recovery steps your warehouse documents.
- NULL keys or NULL hashes - Bad keys make merges unpredictable. Fix keys upstream or drop bad rows on purpose.
- Multi-source partial failure - If one branch of a UNION fails mid-run, you can end up with a mixed state. Prefer the Insert strategy with multiple INSERT statements on Multi Source, split Nodes by source, or use transactions your platform supports.
- Soft deletes hidden in facts - Join to active dimension versions or carry delete flags on purpose. See Cascading Soft Deletes Through the DAG.
Troubleshooting
Use this section when merges fail, watermarks look wrong, or Pre-SQL and Post-SQL behave unexpectedly.
Non-Deterministic Merge
Duplicate merge keys, unstable row order, or ties on the watermark can cause errors or double updates.
- Checks: Run `GROUP BY business_key HAVING COUNT(*) > 1` on staging. Remove duplicates upstream. Strengthen the business key.
Watermark Appears Stuck
Check that the persistent table still shows the MAX(incremental_column) you expect, the filter uses the same column and time zone, and no stray truncate cleared history.
Statement Count or Stage Separation
Give each Pre-SQL or Post-SQL stage one clear job when the platform allows one statement per phase, or split work across stages in the graph when supported.
Performance and Cost
Tune warehouse compute and batch sizing so incremental Jobs finish inside your SLA without over-scanning.
- On Snowflake, use clustering keys or query pruning on partition-like columns. Snowflake doesn't use OLTP-style B-tree indexes, so keep filters selective.
- Batch sizing: Bigger windows mean fewer Jobs. Smaller windows mean fresher data. Balance SLA and spend.
- Looped Load - Use it when one full scan won't finish in your window.
Monitoring Ideas
For each source or job, track:
- Delay between max timestamp in source and max timestamp loaded
- Job failures and last success time
- Row counts and error tables if you quarantine bad rows
What's Next?
- Incremental loading tutorial, step-by-step
- Incremental Loading package, Snowflake Marketplace reference
- Incremental Nodes, Databricks package
- Snowflake Base Node Types, Marketplace reference
- Snowflake Base Node Types Advanced Deploy, merge and load strategies
- Databricks Base Node Types, Work through Fact Node types
- Databricks Lakeflow Declarative Pipelines
- Databricks External Data Package
- Databricks Materialized View
- BigQuery Base Node Types
- BigQuery Base Node Types Advanced Deploy
- Multi Source Nodes
- Snowflake Streams and Tasks
- Deleting Records from Fact and Dimension Nodes
- Pre-SQL and Post-SQL
- ref() and ref_no_link()
- Surrogate Keys
- SCD Type 2 change detection
- Incremental processing strategies
- Parameters, deploy and refresh overrides