Identity and Surrogate Keys in Coalesce
A surrogate key is a system-generated identifier that uniquely represents a row in a table. Unlike business keys such as customer ID or order number, surrogate keys have no business meaning.
Surrogate keys are commonly used to:
- Join Fact and Dimension Nodes.
- Support slowly changing dimensions (SCDs).
- Guarantee uniqueness for each row.
- Improve performance when joining tables.
Surrogate Keys in Coalesce
- Coalesce automatically generates surrogate keys for supported Node Types.
- Surrogate key behavior is defined at table creation time and is handled through Node Types or templates.
Identity Columns in Coalesce
Coalesce implements surrogate keys using identity columns. An identity column is a numeric column whose values are automatically generated by the data platform when rows are inserted.
Identity columns in Coalesce:
- Are defined in the
CREATE TABLEstatement. - Generate sequential numeric values.
- Are excluded from insert and merge statements.
- Cannot be populated using transforms or SQL expressions.
Because identity generation occurs at the database level, identity columns must be defined when the table is created.
Snowflake Identity Implementation
When using Snowflake, Coalesce generates surrogate keys using NUMBER IDENTITY columns in the table DDL.
Example pattern generated by Coalesce:
SURROGATE_KEY NUMBER IDENTITY
In Snowflake:
IDENTITYis a column attribute applied to a numeric type.- Values are generated automatically during inserts.
- The identity column doesn't appear in
INFORMATION_SCHEMA.SEQUENCESbecause it's embedded in the table definition.
Databricks Identity Implementation
When using Databricks, Coalesce generates surrogate keys using the following syntax:
BIGINT GENERATED ALWAYS AS IDENTITY (START WITH 1 INCREMENT BY 1)
In Databricks:
- Identity values are always generated by the engine.
- The column must be defined in the table DDL.
- The surrogate key column is excluded from insert logic.
- Use a Node Type with an existing identity column.
- Create a custom Node using
isSurrogateKey: truein the Node Template.
Node Types That Use Surrogate Keys
Different Coalesce Node Types use surrogate keys in different ways.
Dimension Nodes
Dimension Nodes include surrogate keys by default. These surrogate keys are required to support SCD Type 2 behavior and to maintain consistent joins with downstream Fact Nodes.
Fact Nodes
Fact Nodes can include surrogate keys when required. In these cases, the surrogate key typically acts as the table's primary key, while foreign keys reference Dimension surrogate keys.
Persistent Stage Nodes
Persistent Stage Nodes can include surrogate keys when they're designed to store data long-term. This ensures row-level uniqueness even when business keys overlap.
Data Vault Components
Data Vault hubs, satellites, and links often use surrogate keys to ensure uniqueness and maintain relationships between entities.
How Surrogate Keys Are Defined
Surrogate keys are identified in Node Definitions using metadata:
isSurrogateKey: true
This metadata tells the Node template to:
- Generate the appropriate identity syntax in the
CREATE TABLEstatement. - Exclude the surrogate key column from insert and merge operations.
Surrogate key behavior must be handled by the Node Type or template. It cannot be added to Nodes that don't already have it as part of the Node Template.
Alternative Surrogate Key Strategies
While identity columns are the default implementation, Coalesce also supports alternative approaches through templates.
Sequences
Sequences can be used instead of identity columns when you need explicit control, such as during migrations or when coordinating key ranges across environments. Unlike identity columns, sequences appear in INFORMATION_SCHEMA.SEQUENCES.
Hash-Based Keys
Hash-based surrogate keys can be generated by hashing one or more business keys. This approach produces repeatable values that are consistent across environments.
Coalesce supports MD5, SHA1, and SHA256 hash algorithms.
Incremental Load Behavior
During incremental processing:
- New records receive new surrogate keys.
- Existing records retain their surrogate keys.
- For SCD Type 2 dimensions, changed records create new versions with new surrogate keys, while keeping the same business key.
- Fact Nodes maintain referential integrity by joining on surrogate keys.
Migration Considerations
When loading legacy data:
- Existing surrogate keys can be inserted directly.
- New records continue incrementing from the highest existing value.
- Identity columns cannot be reseeded after creation.
If reseeding is required, templates can be customized to:
- Use a starting value parameter.
- Replace identity columns with sequences seeded above the legacy maximum.
When loading legacy data:
- Existing surrogate keys can be inserted directly.
- New records continue incrementing from the highest existing value.
- Identity columns can't be reseeded after creation.
If reseeding is required, templates can be customized to:
- Use a starting value parameter.
- Replace identity columns with sequences seeded above the legacy maximum.