Skip to main content

Identity and Surrogate Keys in Coalesce

A surrogate key is a system-generated identifier that uniquely represents a row in a table. Unlike business keys such as customer ID or order number, surrogate keys have no business meaning.

Surrogate keys are commonly used to:

  • Join Fact and Dimension Nodes.
  • Support slowly changing dimensions (SCDs).
  • Guarantee uniqueness for each row.
  • Improve performance when joining tables.

Surrogate Keys in Coalesce

  • Coalesce automatically generates surrogate keys for supported Node Types.
  • Surrogate key behavior is defined at table creation time and is handled through Node Types or templates.

Identity Columns in Coalesce

Coalesce implements surrogate keys using identity columns. An identity column is a numeric column whose values are automatically generated by the data platform when rows are inserted.

Identity columns in Coalesce:

  • Are defined in the CREATE TABLE statement.
  • Generate sequential numeric values.
  • Are excluded from insert and merge statements.
  • Cannot be populated using transforms or SQL expressions.

Because identity generation occurs at the database level, identity columns must be defined when the table is created.

Snowflake Identity Implementation

When using Snowflake, Coalesce generates surrogate keys using NUMBER IDENTITY columns in the table DDL.

Example pattern generated by Coalesce:

SURROGATE_KEY NUMBER IDENTITY

In Snowflake:

  • IDENTITY is a column attribute applied to a numeric type.
  • Values are generated automatically during inserts.
  • The identity column doesn't appear in INFORMATION_SCHEMA.SEQUENCES because it's embedded in the table definition.

Databricks Identity Implementation

When using Databricks, Coalesce generates surrogate keys using the following syntax:

BIGINT GENERATED ALWAYS AS IDENTITY (START WITH 1 INCREMENT BY 1)

In Databricks:

  • Identity values are always generated by the engine.
  • The column must be defined in the table DDL.
  • The surrogate key column is excluded from insert logic.
Creating Identity Columns
  • Use a Node Type with an existing identity column.
  • Create a custom Node using isSurrogateKey: true in the Node Template.

Node Types That Use Surrogate Keys

Different Coalesce Node Types use surrogate keys in different ways.

Dimension Nodes

Dimension Nodes include surrogate keys by default. These surrogate keys are required to support SCD Type 2 behavior and to maintain consistent joins with downstream Fact Nodes.

Fact Nodes

Fact Nodes can include surrogate keys when required. In these cases, the surrogate key typically acts as the table's primary key, while foreign keys reference Dimension surrogate keys.

Persistent Stage Nodes

Persistent Stage Nodes can include surrogate keys when they're designed to store data long-term. This ensures row-level uniqueness even when business keys overlap.

Data Vault Components

Data Vault hubs, satellites, and links often use surrogate keys to ensure uniqueness and maintain relationships between entities.

How Surrogate Keys Are Defined

Surrogate keys are identified in Node Definitions using metadata:

isSurrogateKey: true

This metadata tells the Node template to:

  • Generate the appropriate identity syntax in the CREATE TABLE statement.
  • Exclude the surrogate key column from insert and merge operations.

Surrogate key behavior must be handled by the Node Type or template. It cannot be added to Nodes that don't already have it as part of the Node Template.

Alternative Surrogate Key Strategies

While identity columns are the default implementation, Coalesce also supports alternative approaches through templates.

Sequences

Sequences can be used instead of identity columns when you need explicit control, such as during migrations or when coordinating key ranges across environments. Unlike identity columns, sequences appear in INFORMATION_SCHEMA.SEQUENCES.

Hash-Based Keys

Hash-based surrogate keys can be generated by hashing one or more business keys. This approach produces repeatable values that are consistent across environments.

Coalesce supports MD5, SHA1, and SHA256 hash algorithms.

Incremental Load Behavior

During incremental processing:

  • New records receive new surrogate keys.
  • Existing records retain their surrogate keys.
  • For SCD Type 2 dimensions, changed records create new versions with new surrogate keys, while keeping the same business key.
  • Fact Nodes maintain referential integrity by joining on surrogate keys.

Migration Considerations

When loading legacy data:

  • Existing surrogate keys can be inserted directly.
  • New records continue incrementing from the highest existing value.
  • Identity columns cannot be reseeded after creation.

If reseeding is required, templates can be customized to:

  • Use a starting value parameter.
  • Replace identity columns with sequences seeded above the legacy maximum.

When loading legacy data:

  • Existing surrogate keys can be inserted directly.
  • New records continue incrementing from the highest existing value.
  • Identity columns can't be reseeded after creation.

If reseeding is required, templates can be customized to:

  • Use a starting value parameter.
  • Replace identity columns with sequences seeded above the legacy maximum.