
Data Quality

Overview

The Data Quality package is a set of node types that provide the flexibility to test the data quality of your source data.

Installation

  • Copy the Package ID: @coalesce/data-quality
  • In Coalesce, open the Workspace where you wish to install the package.
  • Go to the Build Settings of the Workspace, open the Packages tab, and click the Install button at the top right of the page.
  • Paste the Package ID and proceed with the installation process.

Description

Data Quality Node Types

The Coalesce Data Quality Node Types package includes:

  • DMF (Data Metric Functions)
  • Data Profiling

DMF

The Coalesce DMF node allows you to monitor the state and integrity of your data. You can use DMFs to measure key metrics, such as, but not limited to, freshness and counts of duplicates, NULLs, rows, and unique values.

You can set a DMF on the following kinds of table objects:

  • Dynamic table
  • Event table
  • External table
  • Apache Iceberg™ table
  • Materialized view
  • Table (including temporary and transient tables)
  • View

Limitations:

  • You cannot set a DMF on a hybrid table or a stream object.
  • You can only have 10,000 total associations of DMFs on objects per account. Each instance of setting a DMF on a table or view counts as one association.
  • You cannot grant privileges on a DMF to a share or set a DMF on a shared table or view.
  • Setting a DMF on an object tag is not supported.
  • You cannot set a DMF on objects in a reader account.
  • Trial accounts do not support this feature.

DMF Node Configuration

The DMF node type has four configuration groups:


Node Properties

  • Storage Location: Storage Location where the node will be created
  • Node Type: Name of template used to create node objects
  • Deploy Enabled: If TRUE, the node will be deployed / redeployed when changes are detected. If FALSE, the node will not be deployed, or will be dropped during redeployment.

Scheduling Options

You can create a DMF run schedule to automate the data quality measurements on a table:

  • Schedule Options: Choose the schedule type:
    - Minutes: Specify the interval in minutes. Enter a whole number from 1 to 11520, which represents the number of minutes between DMF runs.
    - Cron: Uses cron expressions. Specifies a cron expression and time zone for periodically running the DMF. Supports a subset of standard cron utility syntax.
    - TRIGGER_ON_CHANGES: Sets the data metric functions to run when a general DML operation, such as inserting a new row, modifies the table.
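
These options correspond to Snowflake's DATA_METRIC_SCHEDULE parameter. A minimal sketch of the statement each choice maps to, assuming a placeholder table named SALES:

-- Minutes: run the associated DMFs every 60 minutes
ALTER TABLE SALES SET DATA_METRIC_SCHEDULE = '60 MINUTE';

-- Cron: run daily at 08:00 in the given time zone
ALTER TABLE SALES SET DATA_METRIC_SCHEDULE = 'USING CRON 0 8 * * * America/Los_Angeles';

-- TRIGGER_ON_CHANGES: run whenever a DML operation modifies the table
ALTER TABLE SALES SET DATA_METRIC_SCHEDULE = 'TRIGGER_ON_CHANGES';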

Object Level DMFs

  • Add Object Level DMF: Enable this to add an object-level DMF
  • DMF Function: Select the DMF function to use for the object

Column Level DMFs

  • Add Column Level DMF: Enable this to add a column-level DMF
  • DMF Function: Select the column name and DMF function to use
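
Together, the object-level and column-level settings correspond to DMF associations on the table. A minimal sketch using Snowflake's system DMFs, assuming a placeholder table SALES with a column CUSTOMER_ID:

-- Object-level DMF: row count for the whole table
ALTER TABLE SALES ADD DATA METRIC FUNCTION SNOWFLAKE.CORE.ROW_COUNT ON ();

-- Column-level DMF: count of NULLs in a single column
ALTER TABLE SALES ADD DATA METRIC FUNCTION SNOWFLAKE.CORE.NULL_COUNT ON (CUSTOMER_ID);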

DMF Deployment

DMF Initial Deployment

When deployed for the first time into an environment, the DMF node will update the metadata with the configured DMFs.

  • Create: Executes ALTER statements and creates the DMF entries in metadata.

DMF Redeployment

Subsequent deployments of the DMF node with changes to the DMF definitions, such as adding or dropping DMFs, result in the metadata being updated with the latest changes.

DMF Undeployment

If a DMF node is deleted from a Workspace, that Workspace is committed to Git, and that commit is deployed to a higher-level environment, then all the DMFs will be dropped from metadata.


Data Profiling

The Data Profiling node type creates a data profiling table containing details such as the object name, column name, metric function applied, and the corresponding profiling result. It supports operations like count, min, max, avg, and other statistical measures on source table columns. Additionally, this node type allows scheduling, enabling automated and periodic data profiling to maintain data quality and integrity.

The Data Profiling node type will create the Data Profiling table with the following fixed column structure: OBJECT_NAME, COLUMN_NAME, FUNCTION_APPLIED, and RESULT. The Coalesce Mapping grid will display the source column list; however, the Data Profiling table created in Snowflake will maintain this fixed column structure.

The node name in Coalesce will be used as the name of the Snowflake Data Profiling table.
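
As a rough sketch of that fixed structure (the column data types shown here are assumptions; the actual definition comes from the node's create template, and the table name comes from the Coalesce node name):

CREATE TABLE IF NOT EXISTS DATA_PROFILING_RESULTS (
    OBJECT_NAME      VARCHAR,   -- source object that was profiled
    COLUMN_NAME      VARCHAR,   -- profiled column (empty for table-level metrics such as row count)
    FUNCTION_APPLIED VARCHAR,   -- metric applied, e.g. ROW_COUNT, NULL_COUNT, MAX
    RESULT           VARIANT    -- metric result; VARIANT is assumed so JSON results such as the
                                -- Top 100 Value Distribution can be stored alongside numbers
);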

Data Profiling Node Configuration

The Data Profiling node type has the following configuration groups:


Node Properties

  • Storage Location: Storage Location where the Data Profiling table will be created
  • Node Type: Name of template used to create node objects
  • Description: A description of the node's purpose
  • Deploy Enabled: If TRUE, the node will be deployed / redeployed when changes are detected. If FALSE, the node will not be deployed, or will be dropped during redeployment.

Options

  • Schedule Task: True/False toggle that determines whether a task will be created or whether the SQL used in the task will execute as DML in a Run action.
    - True: The insert-data SQL will be wrapped in a task with the specified Scheduling Options.
    - False: A table will be created and the SQL will execute as a Run action.
  • Truncate Before: True/False toggle that determines whether the table will be overwritten each time a task executes (see the sketch below).
    - True: Uses INSERT OVERWRITE.
    - False: Uses INSERT to append data.
  • Enable tests: True/False toggle that determines whether tests are enabled.
  • Sample Mode: Options: Sample or Full Table.
    - Sample: Choose Percent or Fixed Number of Rows.
    - Full Table: Performs data profiling on all records of the source table.
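
A minimal sketch of how the Truncate Before and Sample Mode options translate into SQL, assuming a placeholder source table SALES and the profiling table sketched above (not the exact template output):

-- Truncate Before = True: replace previous profiling rows with INSERT OVERWRITE
INSERT OVERWRITE INTO DATA_PROFILING_RESULTS
SELECT 'SALES', NULL, 'ROW_COUNT', TO_VARIANT(COUNT(*)) FROM SALES;

-- Sample Mode = Sample, Percent: profile roughly 10% of the rows
SELECT COUNT(*) FROM SALES SAMPLE (10);

-- Sample Mode = Sample, Fixed Number of Rows: profile 1000 rows
SELECT COUNT(*) FROM SALES SAMPLE (1000 ROWS);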

Counts

  • Row Count: True/False toggle that determines whether to add a record for the row count of the source table.
    - True: A record for the row count of the source table is added.
    - False: A record for the row count of the source table is not added.
  • Distinct Count: Add one or more columns to get the count of distinct values for each selected column.
  • Null Count: Add one or more columns to get the count of null values for each selected column.
  • Not Null Count: Add one or more columns to get the count of not-null values for each selected column.

Min / Max / Avg

  • MAX Value: Add one or more columns to get the maximum value of each selected column.
  • MIN Value: Add one or more columns to get the minimum value of each selected column.
  • Average Value: Add one or more columns to get the average value of each selected column.

Column Details

  • Top 100 Value Distribution: Add one or more columns to get a JSON array that contains the top 100 distinct values of each column along with their occurrence counts.
  • Max Data Length: Add one or more columns to get the maximum length of non-null values in each selected column.
  • Min Data Length: Add one or more columns to get the minimum length of non-null values in each selected column.
  • Standard Deviation: Add one or more columns to get the standard deviation of the lengths of non-null values in each selected column.
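
Conceptually, each option you enable adds one row per object or column to the profiling table. A simplified sketch of the kind of statement involved, assuming a placeholder source table SALES with a column AMOUNT (the FUNCTION_APPLIED labels are illustrative, not the exact values the template writes):

INSERT INTO DATA_PROFILING_RESULTS (OBJECT_NAME, COLUMN_NAME, FUNCTION_APPLIED, RESULT)
SELECT 'SALES', NULL,     'ROW_COUNT',  TO_VARIANT(COUNT(*))                 FROM SALES
UNION ALL
SELECT 'SALES', 'AMOUNT', 'NULL_COUNT', TO_VARIANT(COUNT_IF(AMOUNT IS NULL)) FROM SALES
UNION ALL
SELECT 'SALES', 'AMOUNT', 'MAX_VALUE',  TO_VARIANT(MAX(AMOUNT))              FROM SALES
UNION ALL
SELECT 'SALES', 'AMOUNT', 'AVG_VALUE',  TO_VARIANT(AVG(AMOUNT))              FROM SALES;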

Data Profiling Scheduling Options

Task Scheduling Options

  • Scheduling Mode: Choose the compute type:
    - Warehouse Task: A user-managed warehouse executes the task.
    - Serverless Task: Uses serverless compute.
  • When Source Stream has Data Flag: True/False toggle to check for stream data.
    - True: Only run the task if the source stream has captured change data.
    - False: Run the task on schedule regardless of whether the source stream has data. If the source is not a stream, set this to False.
  • Multiple Stream has Data Logic: AND/OR logic when there are multiple streams (visible if the Stream has Data Flag is true).
    - AND: If there are multiple streams, the task will run only if all streams have data.
    - OR: If there are multiple streams, the task will run if one or more streams have data.
  • Select Warehouse: Visible if Scheduling Mode is set to Warehouse Task. Enter the name of the warehouse you want the task to run on, without quotes.
  • Select initial serverless size: Visible when Scheduling Mode is set to Serverless Task. Select the initial compute size on which to run the task. Snowflake will adjust the size from there based on the target schedule and task run times.
  • Task Schedule: Choose the schedule type:
    - Minutes: Specify the interval in minutes. Enter a whole number from 1 to 11520, which represents the number of minutes between task runs.
    - Cron: Uses cron expressions. Specifies a cron expression and time zone for periodically running the task. Supports a subset of standard cron utility syntax.
    - Predecessor: Specify dependent tasks.
  • Enter predecessor tasks separated by a comma: Visible when Task Schedule is set to Predecessor. One or more task names that precede the task being created in the current node. Task names are case sensitive, should not be quoted, and must exist in the same schema in which the current task is being created. If there are multiple predecessor tasks, separate the task names using commas and no spaces.
  • Root task name: Visible when Task Schedule is set to Predecessor. Name of the root task that controls scheduling for the DAG of tasks. Task names are case sensitive, should not be quoted, and must exist in the same schema in which the current task is being created.

Example of Serverless Task with Multiple Predecessors and Root Task

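A minimal sketch of the kind of task such a configuration corresponds to; the task, predecessor, and table names are placeholders, and the exact SQL is generated by the node template:

-- Serverless task: Snowflake manages the compute, starting from the initial size
CREATE OR REPLACE TASK DATA_PROFILING_TASK
  USER_TASK_MANAGED_INITIAL_WAREHOUSE_SIZE = 'XSMALL'
  AFTER LOAD_SALES_TASK, LOAD_CUSTOMERS_TASK   -- multiple predecessor tasks in the same DAG
AS
  INSERT INTO DATA_PROFILING_RESULTS
  SELECT 'SALES', NULL, 'ROW_COUNT', TO_VARIANT(COUNT(*)) FROM SALES;

-- Scheduling for the DAG is controlled by the root task named in the Root task name option.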

Example of Warehouse Task With 60 Minute Task Schedule

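A sketch of the corresponding standalone task, assuming a warehouse named COMPUTE_WH and, to illustrate the stream flag, a source stream named SALES_STREAM (whether this example reads from a stream is an assumption):

CREATE OR REPLACE TASK DATA_PROFILING_TASK
  WAREHOUSE = COMPUTE_WH                        -- user-managed warehouse executes the task
  SCHEDULE = '60 MINUTE'                        -- run every 60 minutes
  WHEN SYSTEM$STREAM_HAS_DATA('SALES_STREAM')   -- only run when the stream has captured changes
AS
  INSERT INTO DATA_PROFILING_RESULTS
  SELECT 'SALES', NULL, 'ROW_COUNT', TO_VARIANT(COUNT(*)) FROM SALES;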

Example of Warehouse Task With Cron Schedule Not Using a Stream

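A similar sketch with a cron schedule and no stream condition (names remain placeholders):

CREATE OR REPLACE TASK DATA_PROFILING_TASK
  WAREHOUSE = COMPUTE_WH
  SCHEDULE = 'USING CRON 0 6 * * * UTC'   -- every day at 06:00 UTC
AS
  INSERT INTO DATA_PROFILING_RESULTS
  SELECT 'SALES', NULL, 'ROW_COUNT', TO_VARIANT(COUNT(*)) FROM SALES;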

Data Profiling Deployment

Deployment Parameters

The Data Profiling node with a task includes an environment parameter that allows you to specify a different warehouse used to run the task in different environments.

The parameter name is targetTaskWarehouse and the default value is DEV ENVIRONMENT.

When set to DEV ENVIRONMENT, the value entered in the Select Warehouse Scheduling Options config will be used when creating the task.

{
"targetTaskWarehouse": "DEV ENVIRONMENT"
}

When set to any value other than DEV ENVIRONMENT the node will attempt to create the task using a Snowflake warehouse with the specified value.

For example, with the below setting for the parameter in a QA environment, the task will execute using a warehouse named compute_wh.

{
"targetTaskWarehouse": "compute_wh"
}

Data Profiling With Task Initial Deployment

When deployed for the first time into an environment the Data Profiling With Task node will execute the following stages:

For tasks without predecessors:

  • Create Data Profiling Table: Creates a Data Profiling table that will be loaded by the task. Note that the structure of the data profiling table is defined in the create template and will not be the same as the columns in the mapping grid of the Coalesce node.
  • Create Task: Creates a task that will load the Data Profiling table on schedule.
  • Resume Task: Resumes the task so it runs on schedule.

For tasks with predecessors:

  • Create Target Table: Creates the table that will be loaded by the task.
  • Suspend Root Task: Suspends the root task for DAG modification.
  • Create Task: Creates a task that will load the target table on schedule.

If a task is part of a DAG of tasks, the DAG needs to include a node type called "Task DAG Resume Root." This node will resume the root node once all the dependent tasks have been created as part of a deployment.
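
In Snowflake terms, the deployment stages above roughly correspond to the following sequence (ROOT_TASK is a placeholder name):

-- Suspend the root task so the DAG can be modified
ALTER TASK ROOT_TASK SUSPEND;

-- ...child tasks in the DAG are created or replaced here...

-- The Task DAG Resume Root node then resumes the root task,
-- which re-enables scheduling for the whole DAG
ALTER TASK ROOT_TASK RESUME;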

The task node has no ALTER capabilities. All task-enabled nodes are CREATE OR REPLACE only, though this is subject to change.

Data Profiling With Task Redeployment

After initial deployment, changes to the task schedule, warehouse, or scheduling options will result in the following stages.

For tasks without predecessors:

  • Create Task: Recreates the task with the new schedule.
  • Resume Task: Resumes the task with the new schedule.

For tasks with predecessors:

  • Suspend Root Task: Suspends the root task for DAG modification.
  • Create Task: Recreates the task with the new schedule.

Data Profiling With Task Altering Tables

The structure of the data profiling table is defined in the create template and cannot be changed, so altering the table is not supported on redeployment.

Data Profiling With Task Undeployment

If a Data Profiling with Task node is deleted from a Workspace, that Workspace is committed to Git, and that commit is deployed to a higher-level environment, then all objects created by the node in the target environment will be dropped.

For tasks without predecessors:

  • Drop Table: Drops the data profiling table originally created to be loaded by the task.
  • Drop Current Task: Drops the task.

For tasks with predecessors:

  • Drop Table: Drops the data profiling table.
  • Suspend Root Task: To drop a task from a DAG of tasks, the root task needs to be put into a suspended state.
  • Drop Task: Drops the task.

Code

DMF Code

Data Profiling Code

Versions

Available versions of the package.

  • 1.1.0 (April 02, 2025): Added Data Profiling node type.
  • 1.0.0 (March 31, 2025): New node type Data Metric Functions added to the Data Quality package.