Using Regular Expressions with Jinja
Regular expressions are useful for pattern matching and string manipulation. You can combine Jinja's templating features with SQL regex functions to create reusable data transformations.
What Is a Regular Expression?
A regular expression (regex): A sequence of characters that defines a search pattern. Use regex when you need to match patterns in text rather than exact strings. This helps with data cleaning, validation, and transformation tasks such as removing special characters, validating formats, and extracting substrings.
Common Regex Functions
These SQL regex functions work well with Jinja templating. The examples use helper tokens to reference columns dynamically.
REGEXP_REPLACE
Searches for a pattern and replaces it with another value. Use this function when you need to clean, format, or transform string data.
REGEXP_REPLACE(column_name, 'pattern', 'replacement')
Example: Remove all dashes from a phone number column.
REGEXP_REPLACE({{SRC}}, '-', '')
If {{SRC}} resolves to "CUSTOMER"."C_PHONE" with value 555-123-4567, the output is 5551234567.
Example: Replace multiple spaces with a single space.
REGEXP_REPLACE({{SRC}}, '\\s+', ' ')
REGEXP_LIKE
Returns true if a string matches a specified pattern. Use this function for validation or filtering rows based on patterns.
REGEXP_LIKE(column_name, 'pattern')
Example: Check if a column contains only digits.
REGEXP_LIKE({{SRC}}, '^[0-9]+$')
If {{SRC}} resolves to "CUSTOMER"."C_CUSTKEY" with value 12345, the output is TRUE.
Example: Filter rows where city names start with "San".
WHERE REGEXP_LIKE({{SRC}}, '^San.*')
REGEXP_SUBSTR
Extracts the substring that matches a pattern. Use this function to pull specific parts of a string.
REGEXP_SUBSTR(column_name, 'pattern')
Example: Extract the first word from a column.
REGEXP_SUBSTR({{SRC}}, '\\w+')
If {{SRC}} resolves to "CUSTOMER"."C_NAME" with value John Smith, the output is John.
Example: Extract a year from a date string column.
REGEXP_SUBSTR({{SRC}}, '[0-9]{4}')
If {{SRC}} resolves to "ORDERS"."O_COMMENT" with value Order Date: 2024-03-15, the output is 2024.
REGEXP_INSTR
Returns the position of the first character in the substring that matches the pattern. Use this function to find where a pattern occurs.
REGEXP_INSTR(column_name, 'pattern')
Example: Find the position of the first digit in a column.
REGEXP_INSTR({{SRC}}, '[0-9]')
If {{SRC}} resolves to "ORDERS"."O_ORDERSTATUS" with value Order ABC123, the output is 10.
REGEXP_COUNT
Returns the number of times a pattern occurs in a string. Use this function to count matches.
REGEXP_COUNT(column_name, 'pattern')
Example: Count how many words are in a column.
REGEXP_COUNT({{SRC}}, '\\w+')
If {{SRC}} resolves to "CUSTOMER"."C_COMMENT" with value The quick brown fox, the output is 4.
Example: Count digits in a phone number column.
REGEXP_COUNT({{SRC}}, '[0-9]')
If {{SRC}} resolves to "CUSTOMER"."C_PHONE" with value 555-123-4567, the output is 10.
Macro: Remove Non-Numeric Characters
This macro removes all non-numeric characters from a column. It's useful for cleaning phone numbers, IDs, or any field that should contain only digits.
Enter the following macro into Build Settings > Macros.
{%- macro remove_non_numeric(column) -%}
REGEXP_REPLACE({{ column }}, '[^0-9]', '')
{%- endmacro %}
You can use this by entering this in the Transform field {{remove_non_numeric('{{SRC}}')}}
Example output:
| Input | Output |
|---|---|
| 555-123-4567 | 5551234567 |
| (800) 555-1234 | 8005551234 |
| Phone: 123.456.7890 | 1234567890 |
Macro: Clean Email Addresses
This macro removes whitespace and converts email addresses to lowercase for consistent data storage.
{%- macro clean_email(column) -%}
LOWER(REGEXP_REPLACE({{ column }}, '\\s', ''))
{%- endmacro %}
Usage: Enter {{clean_email('{{SRC}}')}} in the transforms field.
Macro: Extract Domain from Email
This macro extracts the domain portion from an email address.
{%- macro extract_email_domain(column) -%}
REGEXP_SUBSTR({{ column }}, '@([A-Za-z0-9.-]+)', 1, 1, 'e')
{%- endmacro %}
Example output:
| Input | Output |
|---|---|
user@example.com | example.com |
john.doe@company.org | company.org |
Macro: Validate Email Format
This macro checks if a column value matches a basic email pattern. Use it in a WHERE clause or conditional logic.
{%- macro is_valid_email(column) -%}
REGEXP_LIKE({{ column }}, '^[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\\.[A-Za-z]{2,}$')
{%- endmacro %}
Usage in a filter:
WHERE {{ is_valid_email('"CUSTOMER"."EMAIL"') }}
Macro: Mask Sensitive Data
This macro masks all but the last 4 characters of sensitive data such as credit card numbers or social security numbers.
{%- macro mask_sensitive(column) -%}
REGEXP_REPLACE({{ column }}, '.(?=.{4})', '*')
{%- endmacro %}
Example output:
| Input | Output |
|---|---|
| 1234567890123456 | ************3456 |
| 123-45-6789 | *******6789 |
Using Regex with Conditional Logic
You can combine regex functions with Jinja conditionals for more complex transformations.
{%- macro categorize_phone(column) -%}
CASE
WHEN REGEXP_LIKE({{ column }}, '^1?800') THEN 'Toll Free'
WHEN REGEXP_LIKE({{ column }}, '^1?900') THEN 'Premium'
ELSE 'Standard'
END
{%- endmacro %}
Common Regex Patterns
Here are some patterns you can use in your macros:
| Pattern | Matches |
|---|---|
[0-9] | Any digit |
[^0-9] | Any non-digit |
[A-Za-z] | Any letter |
\\s | Any whitespace |
\\w | Any word character |
^ | Start of string |
$ | End of string |
. | Any single character |
* | Zero or more of previous |
+ | One or more of previous |
Escaping Special Characters
When you use regex patterns in SQL strings, you need to escape backslashes. Use \\ in your patterns where you would normally use \.
-- Single backslash in regex needs double backslash in SQL
REGEXP_REPLACE({{SRC}}, '\\d+', 'NUMBER')
-- To match a literal backslash, use four backslashes
REGEXP_REPLACE({{SRC}}, '\\\\', '/')