What Is a Surrogate Key?
A surrogate key is a system-generated identifier that uniquely represents a row in a table. Unlike business keys such as customer ID or order number, surrogate keys have no business meaning.
Surrogate keys are mainly used to:
- Joins with Fact and Dimension Nodes.
- Support slowly changing dimensions (SCDs).
- Guarantee uniqueness for each row.
- Improve performance when joining tables.
Surrogate Keys in Coalesce
Coalesce automatically generates surrogate keys for Dimension Nodes.
Automatic Generation
- Surrogate keys are added to Dimension Nodes by default.
- In Snowflake, they use identity columns (
IDENTITY
) that auto-increment. - Because identity columns are embedded in the table definition, they don’t appear in
INFORMATION_SCHEMA.SEQUENCES
.
Implementation Methods
- Identity columns are the default because they are embedded in the table, require no additional management, and they auto-increment.
- Sequences can be used instead when you need explicit control. Sequences are often useful for migrations or multi-environment consistency.
Node Types That Use Surrogate Keys
Different Coalesce Node types use surrogate keys in different ways:
- Dimension Nodes include surrogate keys by default. They are critical for maintaining uniqueness and supporting SCD Type 2.
- Fact Nodes can include surrogate keys, although they may require customization. The surrogate key can act as the table’s primary key, while foreign keys reference dimension surrogate keys.
- Persistent stage Nodes can include surrogate keys when designed as persistent tables. This ensures uniqueness even if business keys overlap.
- Data Vault components often use surrogate keys in hubs, satellites, and links.
Surrogate keys are identified in Node Definition with:
isSurrogateKey: true
Configuration and Customization
You can configure surrogate keys in different ways:
- YAML configuration: Add
isSurrogateKey: true
to the Node definition. - Custom templates: Modify templates to change how surrogate keys are created. This can include applying surrogate keys to fact tables, using naming conventions, or replacing
IDENTITY
with a sequence or hash-based key.
Example Use Cases
Surrogate keys can be generated in one node and referenced downstream in other Nodes.
In this example:
- The Dimension Node creates the
CUSTOMER_KEY
. - The Staging Node references
CUSTOMER_KEY
to align customer data with orders. - The Fact Node uses the same
CUSTOMER_KEY
to join sales back to customers.
-- Dimension node with surrogate key
CREATE TABLE DIM_CUSTOMER (
CUSTOMER_KEY NUMBER AUTOINCREMENT,
CUSTOMER_ID VARCHAR,
CUSTOMER_NAME VARCHAR,
PRIMARY KEY (CUSTOMER_KEY)
);
-- Staging node referencing the surrogate key
CREATE TABLE STG_CUSTOMER_ORDERS (
CUSTOMER_KEY NUMBER, -- reference to DIM_CUSTOMER surrogate key
ORDER_ID VARCHAR,
ORDER_DATE DATE
);
-- Fact node referencing both dimension and staging data
CREATE TABLE FACT_SALES (
SALES_KEY NUMBER AUTOINCREMENT,
CUSTOMER_KEY NUMBER, -- reference to DIM_CUSTOMER surrogate key
ORDER_ID VARCHAR,
SALES_AMOUNT FLOAT
);
Hash-Based Keys
You can also create surrogate keys by hashing business keys. This produces repeatable values that are consistent across environments.
{% if col.isSurrogateKey %}
MD5(TO_VARCHAR(business_key_column))
{% endif %}
Coalesce supports MD5, SHA1, and SHA256 hash algorithms.
Incremental Load Behavior
- New records: Coalesce generates new surrogate keys using Snowflake identity columns. The keys continue incrementing from the highest existing value.
- SCD Type 2 dimensions:
- New business entities get new surrogate keys.
- Changed attributes create new versions with new surrogate keys while keeping the same business key.
- Unchanged records keep their existing surrogate keys.
- Process steps:
- Incoming records are compared against existing ones using business keys.
- New surrogate keys are assigned only to new or changed records.
- Existing surrogate keys are preserved when records don’t change.
- Referential integrity is maintained with Fact Nodes.
Migration Considerations
When migrating legacy data into Coalesce:
- Existing surrogate keys can be loaded directly. New records will continue incrementing from the last inserted value, because Coalesce uses Snowflake identity columns by default.
- Identity columns in Snowflake can’t be reseeded or restarted. To avoid overlaps with legacy keys, you can:
- Use a custom dimension Node template with a parameter such as
SK_START_VALUE
to set the starting value. - Use a Snowflake sequence instead of an identity column, seeded above the maximum legacy key.
- Use a custom dimension Node template with a parameter such as
{% if col.isSurrogateKey %}
identity start {{ parameters.SK_START_VALUE }} increment 1
{% endif %}
If you configure a dimension to use a sequence, you can view it in:
SELECT * FROM INFORMATION_SCHEMA.SEQUENCES;
Identity-based surrogate keys don’t appear here because they are defined in the table itself.
Best Practices
- Use surrogate keys in both Dimension and Fact Nodes for consistent joins.
- Ensure there are no overlaps when migrating legacy data.
- Use surrogate keys for SCD Type 2 to handle historical versions.
- Consider hash-based keys for deterministic, reproducible IDs.