Skip to main content

What Is a Surrogate Key?

A surrogate key is a system-generated identifier that uniquely represents a row in a table. Unlike business keys such as customer ID or order number, surrogate keys have no business meaning.

Surrogate keys are mainly used to:

  • Joins with Fact and Dimension Nodes.
  • Support slowly changing dimensions (SCDs).
  • Guarantee uniqueness for each row.
  • Improve performance when joining tables.

Surrogate Keys in Coalesce

Coalesce automatically generates surrogate keys for Dimension Nodes.

Automatic Generation

  • Surrogate keys are added to Dimension Nodes by default.
  • In Snowflake, they use identity columns (IDENTITY) that auto-increment.
  • Because identity columns are embedded in the table definition, they don’t appear in INFORMATION_SCHEMA.SEQUENCES.

Implementation Methods

  • Identity columns are the default because they are embedded in the table, require no additional management, and they auto-increment.
  • Sequences can be used instead when you need explicit control. Sequences are often useful for migrations or multi-environment consistency.

Node Types That Use Surrogate Keys

Different Coalesce Node types use surrogate keys in different ways:

  • Dimension Nodes include surrogate keys by default. They are critical for maintaining uniqueness and supporting SCD Type 2.
  • Fact Nodes can include surrogate keys, although they may require customization. The surrogate key can act as the table’s primary key, while foreign keys reference dimension surrogate keys.
  • Persistent stage Nodes can include surrogate keys when designed as persistent tables. This ensures uniqueness even if business keys overlap.
  • Data Vault components often use surrogate keys in hubs, satellites, and links.

Surrogate keys are identified in Node Definition with:

isSurrogateKey: true

Configuration and Customization

You can configure surrogate keys in different ways:

  • YAML configuration: Add isSurrogateKey: true to the Node definition.
  • Custom templates: Modify templates to change how surrogate keys are created. This can include applying surrogate keys to fact tables, using naming conventions, or replacing IDENTITY with a sequence or hash-based key.

Example Use Cases

Surrogate keys can be generated in one node and referenced downstream in other Nodes.

In this example:

  • The Dimension Node creates the CUSTOMER_KEY.
  • The Staging Node references CUSTOMER_KEY to align customer data with orders.
  • The Fact Node uses the same CUSTOMER_KEY to join sales back to customers.
-- Dimension node with surrogate key
CREATE TABLE DIM_CUSTOMER (
CUSTOMER_KEY NUMBER AUTOINCREMENT,
CUSTOMER_ID VARCHAR,
CUSTOMER_NAME VARCHAR,
PRIMARY KEY (CUSTOMER_KEY)
);

-- Staging node referencing the surrogate key
CREATE TABLE STG_CUSTOMER_ORDERS (
CUSTOMER_KEY NUMBER, -- reference to DIM_CUSTOMER surrogate key
ORDER_ID VARCHAR,
ORDER_DATE DATE
);

-- Fact node referencing both dimension and staging data
CREATE TABLE FACT_SALES (
SALES_KEY NUMBER AUTOINCREMENT,
CUSTOMER_KEY NUMBER, -- reference to DIM_CUSTOMER surrogate key
ORDER_ID VARCHAR,
SALES_AMOUNT FLOAT
);

Hash-Based Keys

You can also create surrogate keys by hashing business keys. This produces repeatable values that are consistent across environments.

{% if col.isSurrogateKey %}
MD5(TO_VARCHAR(business_key_column))
{% endif %}

info

Coalesce supports MD5, SHA1, and SHA256 hash algorithms.

Incremental Load Behavior

  • New records: Coalesce generates new surrogate keys using Snowflake identity columns. The keys continue incrementing from the highest existing value.
  • SCD Type 2 dimensions:
    • New business entities get new surrogate keys.
    • Changed attributes create new versions with new surrogate keys while keeping the same business key.
    • Unchanged records keep their existing surrogate keys.
  • Process steps:
    1. Incoming records are compared against existing ones using business keys.
    2. New surrogate keys are assigned only to new or changed records.
    3. Existing surrogate keys are preserved when records don’t change.
    4. Referential integrity is maintained with Fact Nodes.

Migration Considerations

When migrating legacy data into Coalesce:

  • Existing surrogate keys can be loaded directly. New records will continue incrementing from the last inserted value, because Coalesce uses Snowflake identity columns by default.
  • Identity columns in Snowflake can’t be reseeded or restarted. To avoid overlaps with legacy keys, you can:
    • Use a custom dimension Node template with a parameter such as SK_START_VALUE to set the starting value.
    • Use a Snowflake sequence instead of an identity column, seeded above the maximum legacy key.
{% if col.isSurrogateKey %}
identity start {{ parameters.SK_START_VALUE }} increment 1
{% endif %}

If you configure a dimension to use a sequence, you can view it in:

SELECT * FROM INFORMATION_SCHEMA.SEQUENCES;

Identity-based surrogate keys don’t appear here because they are defined in the table itself.

Best Practices

  • Use surrogate keys in both Dimension and Fact Nodes for consistent joins.
  • Ensure there are no overlaps when migrating legacy data.
  • Use surrogate keys for SCD Type 2 to handle historical versions.
  • Consider hash-based keys for deterministic, reproducible IDs.