
Data Quality

Overview

The Data Quality package is a set of node types that provide the flexibility to test the data quality of your source data.

Installation

  • Copy the Package ID: @coalesce/data-quality
  • In Coalesce, open the Workspace where you wish to install the package.
  • Go to the Build Settings of the Workspace, open the Packages tab, and click the Install button at the top right of the page.
  • Paste the Package ID and proceed with the installation process.

Description

Data Quality Node Types

The Coalesce Data Quality Node Types package includes:

  • DMF (Data Metric Functions)
  • Data Profiling

DMF

The Coalesce DMF node allows you to monitor the state and integrity of your data. You can use DMFs to measure key metrics, such as, but not limited to, freshness and counts of duplicates, NULLs, rows, and unique values.

You can set a DMF on the following kinds of table objects:

  • Dynamic table
  • Event table
  • External table
  • Apache Iceberg™ table
  • Materialized view
  • Table (including temporary and transient tables)
  • View

Limitations:

  • You cannot set a DMF on a hybrid table or a stream object.
  • You can only have 10,000 total associations of DMFs on objects per account. Each instance of setting a DMF on a table or view counts as one association.
  • You cannot grant privileges on a DMF to a share or set a DMF on a shared table or view.
  • Setting a DMF on an object tag is not supported.
  • You cannot set a DMF on objects in a reader account.
  • Trial accounts do not support this feature.

DMF Node Configuration

The DMF node type has four configuration groups:


Node Properties

  • Storage Location: Storage Location where the node will be created
  • Node Type: Name of template used to create node objects
  • Deploy Enabled: If TRUE, the node will be deployed / redeployed when changes are detected. If FALSE, the node will not be deployed, or will be dropped during redeployment.

Scheduling Options

You can create a DMF run schedule to automate the data quality measurements on a table:

  • Schedule Options: Choose the schedule type:
    - Minutes: Specify the interval in minutes. Enter a whole number from 1 to 11520, which represents the number of minutes between DMF runs.
    - Cron: Uses cron expressions. Specifies a cron expression and time zone for periodically running the DMF. Supports a subset of standard cron utility syntax.
    - TRIGGER_ON_CHANGES: Sets the data metric functions to run when a general DML operation, such as inserting a new row, modifies the table.
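
These options correspond to Snowflake's DATA_METRIC_SCHEDULE parameter. A minimal sketch of the statement each choice maps to, assuming a placeholder table named SALES:

-- Minutes: run the associated DMFs every 60 minutes
ALTER TABLE SALES SET DATA_METRIC_SCHEDULE = '60 MINUTE';

-- Cron: run daily at 08:00 in the given time zone
ALTER TABLE SALES SET DATA_METRIC_SCHEDULE = 'USING CRON 0 8 * * * America/Los_Angeles';

-- TRIGGER_ON_CHANGES: run whenever a DML operation modifies the table
ALTER TABLE SALES SET DATA_METRIC_SCHEDULE = 'TRIGGER_ON_CHANGES';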

Object Level DMFs

  • Add Object Level DMF: Enable this to add an object-level DMF
  • DMF Function: Select the DMF function to use for the object

Column Level DMFs

  • Add Column Level DMF: Enable this to add a column-level DMF
  • DMF Function: Select the column name and DMF function to use
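
Together, the object-level and column-level settings correspond to DMF associations on the table. A minimal sketch using Snowflake's system DMFs, assuming a placeholder table SALES with a column CUSTOMER_ID:

-- Object-level DMF: row count for the whole table
ALTER TABLE SALES ADD DATA METRIC FUNCTION SNOWFLAKE.CORE.ROW_COUNT ON ();

-- Column-level DMF: count of NULLs in a single column
ALTER TABLE SALES ADD DATA METRIC FUNCTION SNOWFLAKE.CORE.NULL_COUNT ON (CUSTOMER_ID);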

DMF Deployment

DMF Initial Deployment

When deployed for the first time into an environment, the DMF node will update the metadata with the configured DMFs.

  • Create: Executes ALTER statements and creates the DMF entries in metadata.

DMF Redeployment

Subsequent deployments of the DMF node with changes to the DMF definitions, such as adding or dropping DMFs, result in the metadata being updated with the latest changes.

DMF Undeployment

If a DMF node is deleted from a Workspace, that Workspace is committed to Git, and that commit is deployed to a higher-level environment, then all the DMFs will be dropped from metadata.


Data Profiling

The Data Profiling node type creates a data profiling table containing details such as the object name, column name, metric function applied, and the corresponding profiling result. It supports operations like count, min, max, avg, and other statistical measures on source table columns. Additionally, this node type allows scheduling, enabling automated and periodic data profiling to maintain data quality and integrity.

The Data Profiling node type will create the Data Profiling table with the following fixed column structure: OBJECT_NAME, COLUMN_NAME, FUNCTION_APPLIED, and RESULT. The Coalesce Mapping grid will display the source column list; however, the Data Profiling table created in Snowflake will maintain this fixed column structure.

The node name in Coalesce will be used as the name of the Snowflake Data Profiling table.
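
As a rough sketch of that fixed structure (the column data types shown here are assumptions; the actual definition comes from the node's create template, and the table name comes from the Coalesce node name):

CREATE TABLE IF NOT EXISTS DATA_PROFILING_RESULTS (
    OBJECT_NAME      VARCHAR,   -- source object that was profiled
    COLUMN_NAME      VARCHAR,   -- profiled column (empty for table-level metrics such as row count)
    FUNCTION_APPLIED VARCHAR,   -- metric applied, e.g. ROW_COUNT, NULL_COUNT, MAX
    RESULT           VARIANT    -- metric result; VARIANT is assumed so JSON results such as the
                                -- Top 100 Value Distribution can be stored alongside numbers
);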

Data Profiling Node Configuration

The Data Profiling node type has the following configuration groups:


Node Properties

  • Storage Location: Storage Location where the Data Profiling table will be created
  • Node Type: Name of template used to create node objects
  • Description: A description of the node's purpose
  • Deploy Enabled: If TRUE, the node will be deployed / redeployed when changes are detected. If FALSE, the node will not be deployed, or will be dropped during redeployment.

Options

  • Schedule Task: True/False toggle that determines whether a task will be created or whether the SQL used in the task will execute as DML in a Run action.
    - True: The insert-data SQL will be wrapped in a task with the specified Scheduling Options.
    - False: A table will be created and the SQL will execute as a Run action.
  • Truncate Before: True/False toggle that determines whether the table will be overwritten each time a task executes (see the sketch below).
    - True: Uses INSERT OVERWRITE.
    - False: Uses INSERT to append data.
  • Enable tests: True/False toggle that determines whether tests are enabled.
  • Sample Mode: Options: Sample or Full Table.
    - Sample: Choose Percent or Fixed Number of Rows.
    - Full Table: Performs data profiling on all records of the source table.
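
A minimal sketch of how the Truncate Before and Sample Mode options translate into SQL, assuming a placeholder source table SALES and the profiling table sketched above (not the exact template output):

-- Truncate Before = True: replace previous profiling rows with INSERT OVERWRITE
INSERT OVERWRITE INTO DATA_PROFILING_RESULTS
SELECT 'SALES', NULL, 'ROW_COUNT', TO_VARIANT(COUNT(*)) FROM SALES;

-- Sample Mode = Sample, Percent: profile roughly 10% of the rows
SELECT COUNT(*) FROM SALES SAMPLE (10);

-- Sample Mode = Sample, Fixed Number of Rows: profile 1000 rows
SELECT COUNT(*) FROM SALES SAMPLE (1000 ROWS);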

Counts

  • Row Count: True/False toggle that determines whether to add a record for the row count of the source table.
    - True: A record for the row count of the source table is added.
    - False: A record for the row count of the source table is not added.
  • Distinct Count: Add one or more columns to get the count of distinct values for each selected column.
  • Null Count: Add one or more columns to get the count of null values for each selected column.
  • Not Null Count: Add one or more columns to get the count of not-null values for each selected column.

Min / Max / Avg

  • MAX Value: Add one or more columns to get the maximum value of each selected column.
  • MIN Value: Add one or more columns to get the minimum value of each selected column.
  • Average Value: Add one or more columns to get the average value of each selected column.

Column Details

  • Top 100 Value Distribution: Add one or more columns to get a JSON array that contains the top 100 distinct values of each column along with their occurrence counts.
  • Max Data Length: Add one or more columns to get the maximum length of non-null values in each selected column.
  • Min Data Length: Add one or more columns to get the minimum length of non-null values in each selected column.
  • Standard Deviation: Add one or more columns to get the standard deviation of the lengths of non-null values in each selected column.
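
Conceptually, each option you enable adds one row per object or column to the profiling table. A simplified sketch of the kind of statement involved, assuming a placeholder source table SALES with a column AMOUNT (the FUNCTION_APPLIED labels are illustrative, not the exact values the template writes):

INSERT INTO DATA_PROFILING_RESULTS (OBJECT_NAME, COLUMN_NAME, FUNCTION_APPLIED, RESULT)
SELECT 'SALES', NULL,     'ROW_COUNT',  TO_VARIANT(COUNT(*))                 FROM SALES
UNION ALL
SELECT 'SALES', 'AMOUNT', 'NULL_COUNT', TO_VARIANT(COUNT_IF(AMOUNT IS NULL)) FROM SALES
UNION ALL
SELECT 'SALES', 'AMOUNT', 'MAX_VALUE',  TO_VARIANT(MAX(AMOUNT))              FROM SALES
UNION ALL
SELECT 'SALES', 'AMOUNT', 'AVG_VALUE',  TO_VARIANT(AVG(AMOUNT))              FROM SALES;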

Data Profiling Scheduling Options

Task Scheduling Options

  • Scheduling Mode: Choose the compute type:
    - Warehouse Task: A user-managed warehouse executes the task.
    - Serverless Task: Uses serverless compute.
  • When Source Stream has Data Flag: True/False toggle to check for stream data.
    - True: Only run the task if the source stream has captured change data.
    - False: Run the task on schedule regardless of whether the source stream has data. If the source is not a stream, set this to False.
  • Multiple Stream has Data Logic: AND/OR logic when there are multiple streams (visible if the Stream has Data Flag is true).
    - AND: If there are multiple streams, the task will run only if all streams have data.
    - OR: If there are multiple streams, the task will run if one or more streams have data.
  • Select Warehouse: Visible if Scheduling Mode is set to Warehouse Task. Enter the name of the warehouse you want the task to run on, without quotes.
  • Select initial serverless size: Visible when Scheduling Mode is set to Serverless Task. Select the initial compute size on which to run the task. Snowflake will adjust the size from there based on the target schedule and task run times.
  • Task Schedule: Choose the schedule type:
    - Minutes: Specify the interval in minutes. Enter a whole number from 1 to 11520, which represents the number of minutes between task runs.
    - Cron: Uses cron expressions. Specifies a cron expression and time zone for periodically running the task. Supports a subset of standard cron utility syntax.
    - Predecessor: Specify dependent tasks.
  • Enter predecessor tasks separated by a comma: Visible when Task Schedule is set to Predecessor. One or more task names that precede the task being created in the current node. Task names are case sensitive, should not be quoted, and must exist in the same schema in which the current task is being created. If there are multiple predecessor tasks, separate the task names using commas and no spaces.
  • Root task name: Visible when Task Schedule is set to Predecessor. Name of the root task that controls scheduling for the DAG of tasks. Task names are case sensitive, should not be quoted, and must exist in the same schema in which the current task is being created.

Example of Serverless Task with Multiple Predecessors and Root Task

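A minimal sketch of the kind of task such a configuration corresponds to; the task, predecessor, and table names are placeholders, and the exact SQL is generated by the node template:

-- Serverless task: Snowflake manages the compute, starting from the initial size
CREATE OR REPLACE TASK DATA_PROFILING_TASK
  USER_TASK_MANAGED_INITIAL_WAREHOUSE_SIZE = 'XSMALL'
  AFTER LOAD_SALES_TASK, LOAD_CUSTOMERS_TASK   -- multiple predecessor tasks in the same DAG
AS
  INSERT INTO DATA_PROFILING_RESULTS
  SELECT 'SALES', NULL, 'ROW_COUNT', TO_VARIANT(COUNT(*)) FROM SALES;

-- Scheduling for the DAG is controlled by the root task named in the Root task name option.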

Example of Warehouse Task With 60 Minute Task Schedule

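A sketch of the corresponding standalone task, assuming a warehouse named COMPUTE_WH and, to illustrate the stream flag, a source stream named SALES_STREAM (whether this example reads from a stream is an assumption):

CREATE OR REPLACE TASK DATA_PROFILING_TASK
  WAREHOUSE = COMPUTE_WH                        -- user-managed warehouse executes the task
  SCHEDULE = '60 MINUTE'                        -- run every 60 minutes
  WHEN SYSTEM$STREAM_HAS_DATA('SALES_STREAM')   -- only run when the stream has captured changes
AS
  INSERT INTO DATA_PROFILING_RESULTS
  SELECT 'SALES', NULL, 'ROW_COUNT', TO_VARIANT(COUNT(*)) FROM SALES;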

Example of Warehouse Task With Cron Schedule Not Using a Stream

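A similar sketch with a cron schedule and no stream condition (names remain placeholders):

CREATE OR REPLACE TASK DATA_PROFILING_TASK
  WAREHOUSE = COMPUTE_WH
  SCHEDULE = 'USING CRON 0 6 * * * UTC'   -- every day at 06:00 UTC
AS
  INSERT INTO DATA_PROFILING_RESULTS
  SELECT 'SALES', NULL, 'ROW_COUNT', TO_VARIANT(COUNT(*)) FROM SALES;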

Data Profiling Deployment

Deployment Parameters

The Data Profiling node with a task includes an environment parameter that allows you to specify a different warehouse used to run the task in different environments.

The parameter name is targetTaskWarehouse and the default value is DEV ENVIRONMENT.

When set to DEV ENVIRONMENT, the value entered in the Select Warehouse Scheduling Options config will be used when creating the task.

{
"targetTaskWarehouse": "DEV ENVIRONMENT"
}

When set to any value other than DEV ENVIRONMENT the node will attempt to create the task using a Snowflake warehouse with the specified value.

For example, with the below setting for the parameter in a QA environment, the task will execute using a warehouse named compute_wh.

{
"targetTaskWarehouse": "compute_wh"
}

Data Profiling With Task Initial Deployment

When deployed for the first time into an environment the Data Profiling With Task node will execute the following stages:

For tasks without predecessors:

  • Create Data Profiling Table: Creates a Data Profiling table that will be loaded by the task. Note that the structure of the data profiling table is defined in the create template and will not be the same as the columns in the mapping grid of the Coalesce node.
  • Create Task: Creates a task that will load the Data Profiling table on schedule.
  • Resume Task: Resumes the task so it runs on schedule.

For tasks with predecessors:

  • Create Target Table: Creates the table that will be loaded by the task.
  • Suspend Root Task: Suspends the root task for DAG modification.
  • Create Task: Creates a task that will load the target table on schedule.

If a task is part of a DAG of tasks, the DAG needs to include a node type called "Task DAG Resume Root." This node will resume the root node once all the dependent tasks have been created as part of a deployment.
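
In Snowflake terms, the deployment stages above roughly correspond to the following sequence (ROOT_TASK is a placeholder name):

-- Suspend the root task so the DAG can be modified
ALTER TASK ROOT_TASK SUSPEND;

-- ...child tasks in the DAG are created or replaced here...

-- The Task DAG Resume Root node then resumes the root task,
-- which re-enables scheduling for the whole DAG
ALTER TASK ROOT_TASK RESUME;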

The task node has no ALTER capabilities. All task-enabled nodes are CREATE OR REPLACE only, though this is subject to change.

Data Profiling With Task Redeployment

After initial deployment, changes to the task schedule, warehouse, or scheduling options will result in the following stages.

For tasks without predecessors:

  • Create Task: Recreates the task with the new schedule.
  • Resume Task: Resumes the task with the new schedule.

For tasks with predecessors:

  • Suspend Root Task: Suspends the root task for DAG modification.
  • Create Task: Recreates the task with the new schedule.

Data Profiling With Task Altering Tables

The structure of the data profiling table is defined in the create template and cannot be changed, so altering the table is not supported on redeployment.

Data Profiling With Task Undeployment

If a Data Profiling with Task node is deleted from a Workspace, that Workspace is committed to Git, and that commit is deployed to a higher-level environment, then all objects created by the node in the target environment will be dropped.

For tasks without predecessors:

  • Drop Table: Drops the data profiling table originally created to be loaded by the task.
  • Drop Current Task: Drops the task.

For tasks with predecessors:

  • Drop Table: Drops the data profiling table.
  • Suspend Root Task: To drop a task from a DAG of tasks, the root task needs to be put into a suspended state.
  • Drop Task: Drops the task.

Code

DMF Code

Data Profiling Code

Versions

Available versions of the package.

  • 1.1.0 (April 02, 2025): Added Data Profiling node type.
  • 1.0.0 (March 31, 2025): New node type Data Metric Functions added to the Data Quality package.