Skip to main content

Generate Hash Columns

Generate Hash Column creates a new column containing a hash value calculated from selected columns in your Node.

Creating a Hash Column

To generate a hash column:

  1. Select one or more columns in your Node.
  2. Right-click and choose Generate Hash Column.
  3. Choose your preferred algorithm.
  4. Customize the column name if needed.
  5. Create and Run the Node.
  6. The hash column appears at the bottom of your column list by default.
  7. The selected columns Contained in Hash with the hash column name.
info

A column can belong to more than one hash.

Delete a Hash Column

If you delete the hash column, Coalesce automatically removes it from Contained in Hash.

How It Works

When you select columns and right-click Generate Hash Column, the following happens:

  • Creates a new column at the bottom of your column list.
  • Calculates a hash value using your selected columns as input.
  • Handles proper concatenation, delimiters, and null values automatically.
  • Allows you to choose from multiple hashing algorithms.

The new column uses GH as the default name. You can change this name.

Available Hash Algorithms

You can choose from these hashing algorithms when generating your hash column:

  • MD5: Fast processing, smaller hash size.
  • SHA1: Balanced security and performance.
  • SHA256: Higher security, larger hash size.

Use Cases

Hash columns solve common data problems and make your work faster. Here are a few ways to use them.

Efficient Data Comparison

Hash columns simplify multi-column comparisons. Instead of writing complex WHERE clauses like code = code AND type = type AND data = data, you can compare two hash values to detect changes across multiple columns.

Data Vault Architecture

Hash columns are essential for Data Vault modeling. They enable you to create:

  • Hub hash keys: Business key hashes for hub entities.
  • Hash diffs for satellites: Track changes in attributes over time.
  • Link hash keys: Connect related entities across the data model.

Change Detection

Hash columns excel in incremental loading and change data capture scenarios. They help you identify modified records since the last data load, making your ETL processes more efficient.

Performance Optimization

For wide tables with many columns, pre-calculated hash values improve query performance. You compare a single hash column instead of multiple individual columns, reducing computational overhead.

Best Practices

Follow these tips to get the most out of hash columns in your projects

Column Selection

Choose columns that represent the business key or the attributes you want to monitor for changes. Avoid including frequently changing timestamps unless they're relevant to your use case.

Algorithm Selection

  • Use MD5 for internal processing where speed is prioritized over cryptographic security.
  • Use SHA1 for balanced performance and security requirements.
  • Use SHA256 for scenarios requiring higher security standards.

Naming Conventions

Use descriptive names for your hash columns. Instead of the default "GH", consider names like:

  • customer_hash for customer-related hash keys.
  • product_hash for product dimension hashes.
  • change_hash for change detection scenarios.

When to Use This Feature

Hash columns work well in data warehouse projects. They're especially helpful when you're building Data Vault models or need to spot changes across many columns in big data sets.