Generate Hash Columns
Generate Hash Column creates a new column containing a hash value calculated from selected columns in your Node.
Creating a Hash Column
To generate a hash column:
- Select one or more columns in your Node.
- Right-click and choose Generate Hash Column.
- Choose your preferred algorithm.
- Customize the column name if needed.
- Create and Run the Node.
- The hash column appears at the bottom of your column list by default.
- The selected columns Contained in Hash with the hash column name.
A column can belong to more than one hash.
Delete a Hash Column
If you delete the hash column, Coalesce automatically removes it from Contained in Hash.
How It Works
When you select columns and right-click Generate Hash Column, the following happens:
- Creates a new column at the bottom of your column list.
- Calculates a hash value using your selected columns as input.
- Handles proper concatenation, delimiters, and null values automatically.
- Allows you to choose from multiple hashing algorithms.
The new column uses GH
as the default name. You can change this name.
Available Hash Algorithms
You can choose from these hashing algorithms when generating your hash column:
- MD5: Fast processing, smaller hash size.
- SHA1: Balanced security and performance.
- SHA256: Higher security, larger hash size.
Use Cases
Hash columns solve common data problems and make your work faster. Here are a few ways to use them.
Efficient Data Comparison
Hash columns simplify multi-column comparisons. Instead of writing complex WHERE clauses like code = code AND type = type AND data = data
, you can compare two hash values to detect changes across multiple columns.
Data Vault Architecture
Hash columns are essential for Data Vault modeling. They enable you to create:
- Hub hash keys: Business key hashes for hub entities.
- Hash diffs for satellites: Track changes in attributes over time.
- Link hash keys: Connect related entities across the data model.
Change Detection
Hash columns excel in incremental loading and change data capture scenarios. They help you identify modified records since the last data load, making your ETL processes more efficient.
Performance Optimization
For wide tables with many columns, pre-calculated hash values improve query performance. You compare a single hash column instead of multiple individual columns, reducing computational overhead.
Best Practices
Follow these tips to get the most out of hash columns in your projects
Column Selection
Choose columns that represent the business key or the attributes you want to monitor for changes. Avoid including frequently changing timestamps unless they're relevant to your use case.
Algorithm Selection
- Use MD5 for internal processing where speed is prioritized over cryptographic security.
- Use SHA1 for balanced performance and security requirements.
- Use SHA256 for scenarios requiring higher security standards.
Naming Conventions
Use descriptive names for your hash columns. Instead of the default "GH", consider names like:
customer_hash
for customer-related hash keys.product_hash
for product dimension hashes.change_hash
for change detection scenarios.
When to Use This Feature
Hash columns work well in data warehouse projects. They're especially helpful when you're building Data Vault models or need to spot changes across many columns in big data sets.