Skip to main content

Using Regular Expressions with Jinja

Regular expressions are useful for pattern matching and string manipulation. You can combine Jinja's templating features with SQL regex functions to create reusable data transformations.

What Is a Regular Expression?

A regular expression (regex): A sequence of characters that defines a search pattern. Use regex when you need to match patterns in text rather than exact strings. This helps with data cleaning, validation, and transformation tasks such as removing special characters, validating formats, and extracting substrings.

Common Regex Functions

These SQL regex functions work well with Jinja templating. The examples use helper tokens to reference columns dynamically.

REGEXP_REPLACE

Searches for a pattern and replaces it with another value. Use this function when you need to clean, format, or transform string data.

REGEXP_REPLACE(column_name, 'pattern', 'replacement')

Example: Remove all dashes from a phone number column.

Transform
REGEXP_REPLACE({{SRC}}, '-', '')

If {{SRC}} resolves to "CUSTOMER"."C_PHONE" with value 555-123-4567, the output is 5551234567.

Example: Replace multiple spaces with a single space.

Transform
REGEXP_REPLACE({{SRC}}, '\\s+', ' ')

REGEXP_LIKE

Returns true if a string matches a specified pattern. Use this function for validation or filtering rows based on patterns.

REGEXP_LIKE(column_name, 'pattern')

Example: Check if a column contains only digits.

Transform
REGEXP_LIKE({{SRC}}, '^[0-9]+$')

If {{SRC}} resolves to "CUSTOMER"."C_CUSTKEY" with value 12345, the output is TRUE.

Example: Filter rows where city names start with "San".

Join or Filter
WHERE REGEXP_LIKE({{SRC}}, '^San.*')

REGEXP_SUBSTR

Extracts the substring that matches a pattern. Use this function to pull specific parts of a string.

REGEXP_SUBSTR(column_name, 'pattern')

Example: Extract the first word from a column.

Transform
REGEXP_SUBSTR({{SRC}}, '\\w+')

If {{SRC}} resolves to "CUSTOMER"."C_NAME" with value John Smith, the output is John.

Example: Extract a year from a date string column.

Transform
REGEXP_SUBSTR({{SRC}}, '[0-9]{4}')

If {{SRC}} resolves to "ORDERS"."O_COMMENT" with value Order Date: 2024-03-15, the output is 2024.

REGEXP_INSTR

Returns the position of the first character in the substring that matches the pattern. Use this function to find where a pattern occurs.

REGEXP_INSTR(column_name, 'pattern')

Example: Find the position of the first digit in a column.

Transform
REGEXP_INSTR({{SRC}}, '[0-9]')

If {{SRC}} resolves to "ORDERS"."O_ORDERSTATUS" with value Order ABC123, the output is 10.

REGEXP_COUNT

Returns the number of times a pattern occurs in a string. Use this function to count matches.

REGEXP_COUNT(column_name, 'pattern')

Example: Count how many words are in a column.

Transform
REGEXP_COUNT({{SRC}}, '\\w+')

If {{SRC}} resolves to "CUSTOMER"."C_COMMENT" with value The quick brown fox, the output is 4.

Example: Count digits in a phone number column.

Transform
REGEXP_COUNT({{SRC}}, '[0-9]')

If {{SRC}} resolves to "CUSTOMER"."C_PHONE" with value 555-123-4567, the output is 10.

Macro: Remove Non-Numeric Characters

This macro removes all non-numeric characters from a column. It's useful for cleaning phone numbers, IDs, or any field that should contain only digits.

Enter the following macro into Build Settings > Macros.

{%- macro remove_non_numeric(column) -%}
REGEXP_REPLACE({{ column }}, '[^0-9]', '')
{%- endmacro %}

You can use this by entering this in the Transform field {{remove_non_numeric('{{SRC}}')}}

Example output:

InputOutput
555-123-45675551234567
(800) 555-12348005551234
Phone: 123.456.78901234567890

Macro: Clean Email Addresses

This macro removes whitespace and converts email addresses to lowercase for consistent data storage.

{%- macro clean_email(column) -%}
LOWER(REGEXP_REPLACE({{ column }}, '\\s', ''))
{%- endmacro %}

Usage: Enter {{clean_email('{{SRC}}')}} in the transforms field.

Macro: Extract Domain from Email

This macro extracts the domain portion from an email address.

{%- macro extract_email_domain(column) -%}
REGEXP_SUBSTR({{ column }}, '@([A-Za-z0-9.-]+)', 1, 1, 'e')
{%- endmacro %}

Example output:

InputOutput
user@example.comexample.com
john.doe@company.orgcompany.org

Macro: Validate Email Format

This macro checks if a column value matches a basic email pattern. Use it in a WHERE clause or conditional logic.

{%- macro is_valid_email(column) -%}
REGEXP_LIKE({{ column }}, '^[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\\.[A-Za-z]{2,}$')
{%- endmacro %}

Usage in a filter:

WHERE {{ is_valid_email('"CUSTOMER"."EMAIL"') }}

Macro: Mask Sensitive Data

This macro masks all but the last 4 characters of sensitive data such as credit card numbers or social security numbers.

{%- macro mask_sensitive(column) -%}
REGEXP_REPLACE({{ column }}, '.(?=.{4})', '*')
{%- endmacro %}

Example output:

InputOutput
1234567890123456************3456
123-45-6789*******6789

Using Regex with Conditional Logic

You can combine regex functions with Jinja conditionals for more complex transformations.

{%- macro categorize_phone(column) -%}
CASE
WHEN REGEXP_LIKE({{ column }}, '^1?800') THEN 'Toll Free'
WHEN REGEXP_LIKE({{ column }}, '^1?900') THEN 'Premium'
ELSE 'Standard'
END
{%- endmacro %}

Common Regex Patterns

Here are some patterns you can use in your macros:

PatternMatches
[0-9]Any digit
[^0-9]Any non-digit
[A-Za-z]Any letter
\\sAny whitespace
\\wAny word character
^Start of string
$End of string
.Any single character
*Zero or more of previous
+One or more of previous

Escaping Special Characters

When you use regex patterns in SQL strings, you need to escape backslashes. Use \\ in your patterns where you would normally use \.

-- Single backslash in regex needs double backslash in SQL
REGEXP_REPLACE({{SRC}}, '\\d+', 'NUMBER')

-- To match a literal backslash, use four backslashes
REGEXP_REPLACE({{SRC}}, '\\\\', '/')

What's Next?