External Data Package
Overview
Provides tools for loading and managing external data in Snowflake, including schema inference, data copying and unloading, API integration, and parsing of Excel and JSON files.
Installation
- Copy the Package ID: @coalesce/snowflake/external-data-package
- In Coalesce, open the Workspace where you wish to install the package.
- Go to the Build Settings of the Workspace, select the Packages tab, and click the Install button at the top right of the page.
- Paste the Package ID and proceed with the installation process.
Description
The Coalesce External Data Package includes:
- CopyInto
- Snowpipe
- External Table
- InferSchema
- CopyUnload
- API
- API-NODEUPDATE
- JDBC LOAD
- Parse Excel
- Parse Json
- Code
CopyInto
CopyInto Node Configuration
The CopyInto node type has the following configurations available:
- Node Properties
- General Options
- Source Data
- File Format
- Extended CSV File Format Options
- Copy Options
CopyInto - Node Properties
There are four configs within the Node Properties group.
| Property | Description |
|---|---|
| Storage Location | Storage Location where the target table will be created |
| Node Type | Name of template used to create node objects |
| Description | A description of the node's purpose |
| Deploy Enabled | If TRUE, the node will be deployed or redeployed when changes are detected. If FALSE, the node will not be deployed, or will be dropped during redeployment. |
Key points to use the CopyInto node
- A CopyInto node can be created by just clicking Create node from the browser if you want the data from the file to be loaded into a single variant column in the target table.
- A CopyInto node can be added on top of an inferred table (a table created by running an InferSchema node) if you want to load data into specific columns as defined in the files. Refer to InferSchema for more on how to use that node and add a CopyInto node on top of it.
- The data can be reloaded into the table by truncating the table before the load, using either the TruncateBefore option in the node config or the reload parameter.
- The path or subfolder name inside the stage where the file is located can be specified using the 'Path or subfolder' config. Do not prefix or suffix '/' in the path name. For example, one level: 'SUBFOLDER'; two levels: 'SUBFOLDER/INNERFOLDER'. See the sketch after this list.
- Blank options have been added to the CopyInto and CSV file format options so that any specific option can be reverted.
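For orientation, here is a minimal sketch of the kind of COPY INTO statement such a node generates; the database, schema, stage, table, and file names are hypothetical placeholders:
COPY INTO MY_DB.MY_SCHEMA.MY_TABLE
  FROM @MY_DB.MY_SCHEMA.MY_STAGE/SUBFOLDER/INNERFOLDER  -- 'Path or subfolder' config, no leading or trailing '/'
  FILES = ('a.csv', 'b.csv')                            -- optional explicit list of file names
  FILE_FORMAT = (TYPE = CSV)
  ON_ERROR = ABORT_STATEMENT;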
Use CopyInto node with InferSchema option
- Set the Infer Schema toggle to true.
- Hit the Create button to infer the schema.
- To choose the file format configs, see the File Format section below.
- Click the Re-Sync Columns button.
- If all looks good, set the Infer Schema toggle to false.
- Hit the Create button to create the table based on the inferred schema.
- This is mainly a test to make sure the create will work.
- Hit the Run button to execute the DML.
If the above works, it should be deployable as is. Deploy will simply take the columns and execute a create table.
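Under the hood, schema inference relies on Snowflake's INFER_SCHEMA table function. A minimal sketch, assuming a hypothetical stage, file format, and file name (for CSV files, the file format needs PARSE_HEADER = TRUE for column names to be detected):
SELECT *
FROM TABLE(
  INFER_SCHEMA(
    LOCATION    => '@MY_DB.MY_SCHEMA.MY_STAGE',      -- hypothetical stage
    FILE_FORMAT => 'MY_DB.MY_SCHEMA.MY_CSV_FORMAT',  -- hypothetical named file format
    FILES       => 'a.csv'                           -- hypothetical staged file
  )
);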
CopyInto - General Options
InferSchema-True
InferSchema-False
| Option | Description |
|---|---|
| Create As | Select from the options to create the target as a Table or Transient Table. * Table * Transient Table |
| TruncateBefore (Disabled when InferSchema is true) | True / False toggle that determines whether the table is truncated before reloading. * True: The table is truncated and the COPY INTO statement is executed to reload the data into the target table. * False: Data is loaded directly into the target table and no truncate action takes place. |
| InferSchema | True / False toggle that determines whether to infer the columns of the file before loading. * True: The node is created with the inferred columns. * False: No infer table step is executed. |
CopyInto - Source Data
InferSchema-true
InferSchema-false
Internal or External Stage
| Setting | Description |
|---|---|
| Coalesce Storage Location of Stage | A storage location in Coalesce where the stage is located. |
| Stage Name (Required) | Internal or External stage where the files containing data to be loaded are staged |
| File Name(s) (Ex: a.csv,b.csv) | Specifies a list of one or more file names (separated by commas) to be loaded |
| Path or subfolder | Not mandatory. Specifies the path or subfolders inside the stage where the file is located. Ensure that '/' is not prefixed before or after the subfolder name |
| File Pattern (Ex: '.*hea.*[.]csv') | A regular expression pattern string, enclosed in single quotes, specifying the file names or paths to match |
External location
| Setting | Description |
|---|---|
| External URI | Enter the URI of the External location. This URI can point to either a private or public bucket/container. |
| Storage Integration | Specifies the name of the storage integration used to delegate authentication responsibility for external cloud storage to a Snowflake identity and access management (IAM) entity. This field is not mandatory for publicly accessible buckets or when another authentication method (such as presigned URLs or public access) is used. |
CopyInto - File Format
File format definition-File format name
File format definiton-File format values
File format config-Inferschema
- It is not mandatory to create a file format in Snowflake to infer the structure of the file.
- If the file format definition is set to 'File format values', a temporary file format is created based on the config values and dropped after the structure of the table is successfully inferred.
- If the file format definition is set to 'File format name' and the file type is CSV, a temporary file format is created to adapt it for inferring the structure, and it is dropped after the structure of the table is successfully inferred.
- If the file format definition is set to 'File format name' and the file type is anything other than CSV, the same file format is used for inferring. No temporary file formats are created.
- The temporary file format follows the naming convention TEMP_{{file_type}}_{{node name}}. Ensure that no predefined file format created in Snowflake has the same name.
File Format Definition - File Format Name
| Setting | Description |
|---|---|
| * File Format Name | Specifies an existing named file format to use for loading data into the table |
| * File Type | * CSV * JSON * ORC * AVRO * PARQUET * XML |
File Format Definition - File Format Values
| Setting | Description |
|---|---|
| File Format Values | Provides file format options for the File Type chosen |
| File Type | Each file type has different configurations available |
| CSV | Compression - String (constant) that specifies the current compression algorithm for the data files to be loaded. Record delimiter - Characters that separate records in an input file. Field delimiter - One or more singlebyte or multibyte characters that separate fields in an input file. Field optionally enclosed by - Character used to enclose strings (octal notation preferred). Number of header lines to skip - Number of lines at the start of the file to skip. Parse Header (InferSchema-true) - Boolean that specifies whether to use the first row headers in the data files to determine column names. Trim Space - Boolean that specifies whether to remove white space from fields. Replace invalid characters - Boolean that specifies whether to replace invalid UTF-8 characters with the Unicode replacement character. Encoding - Specifies the character set of the source data when loading data into a table. Date format - String that defines the format of date values in the data files to be loaded. Time format - String that defines the format of time values in the data files to be loaded. Timestamp format - String that defines the format of timestamp values in the data files to be loaded. Refer to the extended options below |
| JSON | Compression- String (constant) that specifies the current compression algorithm for the data files to be loaded Replace invalid characters - Boolean that specifies whether to replace invalid UTF-8 characters with the Unicode replacement character Trim Space - Boolean that specifies whether to remove white space from fields Strip Outer Array- Boolean that instructs the JSON parser to remove outer brackets [ ] Date format- String that defines the format of date values in the data files to be loaded Time format- String that defines the format of time values in the data files to be loaded Timestamp format- String that defines the format of timestamp values in the data files to be loaded |
| ORC | Trim Space - Specifies whether to remove white space from fields Replace invalid characters - Boolean that specifies whether to replace invalid UTF-8 characters with the Unicode replacement character |
| AVRO | Trim Space - Boolean that specifies whether to remove white space from fields Replace invalid characters - Boolean that specifies whether to replace invalid UTF-8 characters with the Unicode replacement character |
| PARQUET | Trim Space - Boolean that specifies whether to remove white space from fields Replace invalid characters - Boolean that specifies whether to replace invalid UTF-8 characters with the Unicode replacement character Using vectorised scanner - Boolean that specifies whether to use a vectorized scanner for loading Parquet files |
| XML | Replace invalid characters - Boolean that specifies whether to replace invalid UTF-8 characters with the Unicode replacement character |
Extended CSV File Format Options
| Setting | Description |
|---|---|
| CSV | Skip blank lines - Boolean that specifies to skip any blank lines encountered in the data files. Nulls for empty field - When loading data, specifies whether to insert SQL NULL for empty fields in an input file. Skip byte order mark - Boolean that specifies whether to skip the BOM (byte order mark), if present in a data file. Parsing error for column mismatch - Boolean that specifies whether to generate a parsing error if the number of delimited columns (i.e. fields) in an input file does not match the number of columns in the corresponding table. Multiple lines - Boolean that specifies whether multiple lines are allowed. If MULTI_LINE is set to FALSE and the specified record delimiter is present within a CSV field, the record containing the field will be interpreted as an error. Binary format - Defines the encoding format for binary input or output. Replace values with NULL - To specify more than one string, enclose the list of strings in parentheses and use commas to separate each value. Escape optionally enclosed by - A singlebyte character string used as the escape character for unenclosed field values only. Octal notation is preferred. Escape enclosed/unenclosed values - A singlebyte character string used as the escape character for enclosed or unenclosed field values. Octal notation is preferred. |
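As a rough illustration of how these options map onto Snowflake, here is a hypothetical named CSV file format that sets several of the extended options; the format name is a placeholder:
CREATE OR REPLACE FILE FORMAT MY_DB.MY_SCHEMA.MY_CSV_FORMAT
  TYPE = CSV
  SKIP_BLANK_LINES = TRUE                 -- skip blank lines in the data files
  EMPTY_FIELD_AS_NULL = TRUE              -- insert SQL NULL for empty fields
  SKIP_BYTE_ORDER_MARK = TRUE             -- ignore a leading BOM
  ERROR_ON_COLUMN_COUNT_MISMATCH = FALSE  -- no parsing error on column count mismatch
  BINARY_FORMAT = HEX                     -- encoding format for binary input or output
  NULL_IF = ('NULL', 'null', '')          -- values to replace with SQL NULL
  ESCAPE_UNENCLOSED_FIELD = '\\';         -- escape character for unenclosed field values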
CopyInto - Copy Options
| Setting | Description |
|---|---|
| On Error Behavior | String (constant) that specifies the error handling for the load operation. * CONTINUE * SKIP_FILE * SKIP_FILE_num * SKIP_FILE_num% * ABORT_STATEMENT * BLANK OPTION |
| Specify the number of errors that can be skipped | Required when On Error Behavior is either SKIP_FILE_num or SKIP_FILE_num%. Specify the number (or percentage) of errors that can be skipped. |
| Size Limit | Number (> 0) that specifies the maximum size (in bytes) of data to be loaded for a given COPY statement |
| Purge Behavior | Boolean that specifies whether to remove the data files from the stage automatically after the data is loaded successfully |
| Return Failed Only | Boolean that specifies whether to return only files that have failed to load in the statement result |
| Force | Boolean that specifies to load all files, regardless of whether they've been loaded previously and have not changed since they were loaded |
| Load Uncertain Files | Boolean that specifies to load files for which the load status is unknown. The COPY command skips these files by default |
| Enforce Length | Boolean with the reverse logic of Truncate Columns: if TRUE, strings that exceed the target column length produce an error instead of being truncated |
| Truncate Columns | Boolean that specifies whether to truncate text strings that exceed the target column length |
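A sketch of how a few of these options might appear in the generated statement; the object names are placeholders and the option values are only examples:
COPY INTO MY_DB.MY_SCHEMA.MY_TABLE
  FROM @MY_DB.MY_SCHEMA.MY_STAGE
  FILE_FORMAT = (FORMAT_NAME = 'MY_DB.MY_SCHEMA.MY_CSV_FORMAT')
  ON_ERROR = SKIP_FILE_10     -- skip a file after 10 error rows
  SIZE_LIMIT = 1000000000     -- stop loading once roughly 1 GB has been loaded
  PURGE = FALSE               -- keep the staged files after loading
  RETURN_FAILED_ONLY = TRUE   -- report only failed files in the statement result
  FORCE = FALSE               -- do not reload unchanged, already-loaded files
  TRUNCATECOLUMNS = TRUE;     -- truncate strings that exceed the column length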
CopyInto - System Columns
The set of columns that hold the source data and file metadata information.
- SRC - The data from the file is loaded into this variant column.
- LOAD_TIMESTAMP - Current timestamp when the file gets loaded.
- FILENAME - Name of the staged data file the current row belongs to. Includes the full path to the data file.
- FILE_ROW_NUMBER - Row number for each record in the staged data file.
- FILE_LAST_MODIFIED - Last modified timestamp of the staged data file the current row belongs to
- SCAN_TIME - Start timestamp of operation for each record in the staged data file. Returned as TIMESTAMP_LTZ.
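These columns correspond to Snowflake's file metadata pseudo-columns. A hedged sketch of how the node's load statement could populate them, with hypothetical object names:
COPY INTO MY_DB.MY_SCHEMA.MY_TABLE
  (SRC, LOAD_TIMESTAMP, FILENAME, FILE_ROW_NUMBER, FILE_LAST_MODIFIED, SCAN_TIME)
FROM (
  SELECT
    $1,                           -- file contents into the variant column
    CURRENT_TIMESTAMP,            -- load time
    METADATA$FILENAME,            -- staged file name, including path
    METADATA$FILE_ROW_NUMBER,     -- row number within the file
    METADATA$FILE_LAST_MODIFIED,  -- last-modified timestamp of the file
    METADATA$START_SCAN_TIME      -- scan start time, returned as TIMESTAMP_LTZ
  FROM @MY_DB.MY_SCHEMA.MY_STAGE
)
FILE_FORMAT = (TYPE = JSON);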
CopyInto Deployment
CopyInto Deployment Parameters
The CopyInto node type includes an environment parameter that allows you to specify whether to perform a full load or a reload when you are performing a COPY INTO operation.
The parameter name is loadType and the default value is ``.
{
"loadType": ""
}
When the parameter value is set to Reload, the data is reloaded into the table regardless of whether the files have been loaded previously and have not changed since they were loaded. As an alternative, you can use the TruncateBefore option in the node config.
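For example, to force a reload in a target environment, set:
{
"loadType": "Reload"
}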
CopyInto Initial Deployment
When deployed for the first time into an environment, the CopyInto node of materialization type table will execute the below stage:
| Deployment Behavior | Load Type | Stages Executed |
|---|---|---|
| Initial Deployment | `` | Create Table/Transient Table |
| Initial Deployment | Reload | Create Table/Transient Table |
CopyInto Redeployment
Altering the CopyInto Tables
Column or table changes, such as a change in table name, dropping an existing column, altering a column data type, or adding a new column, whether made in isolation or all together, will result in an ALTER statement to modify the target table in the target environment.
The following stages are executed:
- Rename Table / Alter Column / Delete Column / Add Column / Edit table description: An ALTER TABLE statement is executed to perform the operation.
CopyInto change in materialization type
When the materialization type of a CopyInto node is changed from table to transient table, or vice versa, the below stages are executed:
- Drop table/transient table
- Create transient table/table
Redeployment with no changes
If the nodes are redeployed with no changes compared to the previous deployment, then no stages are executed.
CopyInto Undeployment
If the CopyInto node is deleted from a Workspace, that Workspace is committed to Git, and that commit is deployed to a higher-level environment, then the target table in the target environment will be dropped.
- Drop table/transient table: Target table in Snowflake is dropped
Snowpipe
The Coalesce Snowpipe node performs two operations. It can be used to load historical data using CopyInto, and it can create a pipe to auto-ingest files from AWS, GCP, or Azure.
Snowpipe enables loading data from files as soon as they're available in a stage.
This means you can load data from files in micro-batches, making it available to users within minutes, rather than manually executing COPY statements on a schedule to load larger batches.
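A minimal sketch of the pipe object such a node creates; all names are hypothetical placeholders:
CREATE OR REPLACE PIPE MY_DB.MY_SCHEMA.MY_PIPE
  AUTO_INGEST = TRUE  -- load files automatically as cloud event notifications arrive
AS
COPY INTO MY_DB.MY_SCHEMA.MY_TABLE
  FROM @MY_DB.MY_SCHEMA.MY_STAGE
  FILE_FORMAT = (TYPE = JSON);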
Key points to use the Snowpipe node
- A Snowpipe node can be created by just clicking Create node from the browser if you want the data from the file to be loaded into a single variant column in the target table.
- A Snowpipe node can be added on top of an inferred table (a table created by running an InferSchema node) if you want to load data into specific columns as defined in the files. Refer to InferSchema for more on how to use that node and add a Snowpipe node on top of it.
- The data can be reloaded into the table by truncating the table before the load, using the TruncateBefore option in the node config or the reload parameter.
- The path or subfolder name inside the stage where the file is located can be specified using the 'Path or subfolder' config. Do not prefix or suffix '/' in the path name. For example, one level: 'SUBFOLDER'; two levels: 'SUBFOLDER/INNERFOLDER'.
Use Snowpipe node with InferSchema option
- Set the Infer Schema toggle to true.
- Hit the Create button to infer the schema.
- To choose the file format configs, see the File Format section below.
- Click the Re-Sync Columns button.
- If all looks good, set the Infer Schema toggle to false.
- Hit the Create button to create the table based on the inferred schema.
- This is mainly a test to make sure the create will work.
- Hit the Run button to execute the DML.
Snowpipe Node Configuration
The Snowpipe node type has the following configurations available:
- Node Properties
- General Options
- Snowpipe Options
- File Location
- File Format
- Extended CSV File Format Options
- Copy Options
InferSchema-true
InferSchema-false
Snowpipe Node Properties
There are four configs within the Node Properties group.
| Property | Description |
|---|---|
| Storage Location | Storage Location where the target table will be created |
| Node Type | Name of template used to create node objects |
| Description | A description of the node's purpose |
| Deploy Enabled | If TRUE, the node will be deployed or redeployed when changes are detected. If FALSE, the node will not be deployed, or will be dropped during redeployment. |
General Options
| Setting | Description |
|---|---|
| Create As | Dropdown to create the target as a Table or Transient Table for loading data from the external stage. * Table * Transient Table |
| InferSchema | True / False toggle that determines whether to infer the columns of the file before loading. * True: The node is created with the inferred columns. * False: No infer table step is executed. |
Snowpipe Options
| Setting | Description |
|---|---|
| Enable Snowpipe | Dropdown that either creates a pipe to auto-ingest files from an external stage or validates the COPY INTO statement. * Enable Snowpipe - Load data by auto-ingesting files from AWS, Azure, or GCP. 'Choose your SNS service' is a toggle that determines whether a notification service is used; if false, the default Snowflake notification is used. Cloud Provider options: * AWS - AWS SNS Topic. Specifies the Amazon Resource Name (ARN) for the SNS topic for your S3 bucket. * Azure - Integration. Specifies the existing notification integration used to access the storage queue. * GCP - Integration. Specifies the existing notification integration used to access the storage queue. * Test Copy Statement - Validates the COPY INTO statement before it is used to create the pipe. |
| Load historical data | Loads the historical data into the target table by executing a COPY INTO statement |
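For auto-ingest, the generated pipe carries the notification binding for the chosen cloud provider. A hedged sketch; the ARN and integration name are invented placeholders:
-- AWS with an SNS topic:
CREATE OR REPLACE PIPE MY_DB.MY_SCHEMA.MY_PIPE
  AUTO_INGEST = TRUE
  AWS_SNS_TOPIC = 'arn:aws:sns:us-east-1:123456789012:MY_TOPIC'
AS
COPY INTO MY_DB.MY_SCHEMA.MY_TABLE FROM @MY_DB.MY_SCHEMA.MY_STAGE FILE_FORMAT = (TYPE = CSV);

-- Azure or GCP with a notification integration:
CREATE OR REPLACE PIPE MY_DB.MY_SCHEMA.MY_PIPE
  AUTO_INGEST = TRUE
  INTEGRATION = 'MY_NOTIFICATION_INT'
AS
COPY INTO MY_DB.MY_SCHEMA.MY_TABLE FROM @MY_DB.MY_SCHEMA.MY_STAGE FILE_FORMAT = (TYPE = CSV);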
Snowpipe File Location
InferSchema-true
InferSchema-false
| Setting | Description |
|---|---|
| Coalesce Storage Location of Stage | A storage location in Coalesce where the stage is located. |
| Stage Name (Required) | Internal or External stage where the files containing data to be loaded are staged |
| File Name(s) (Ex: 'a.csv','b.csv') | Enabled when 'Enable Snowpipe' under Snowpipe Options is toggled off. Specifies a list of one or more file names (separated by commas) to be loaded. For example, 'a.csv','b.csv' |
| Path or subfolder | Not mandatory. Specifies the path or subfolders inside the stage where the file is located. Ensure that '/' is not prefixed before or after the subfolder name |
| File Pattern (Ex: '.*hea.*[.]csv') | A regular expression pattern string, enclosed in single quotes, specifying the file names or paths to match. For example, '.*hea.*[.]csv' |
Snowpipe File Format
File format definition-File format name
File format definition-File format values
File format config-Inferschema
- It is not mandatory to create a file format in Snowflake to infer the structure of the file.
- If the file format definition is set to 'File format values', a temporary file format is created based on the config values and dropped after the structure of the table is successfully inferred.
- If the file format definition is set to 'File format name' and the file type is CSV, a temporary file format is created to adapt it for inferring the structure, and it is dropped after the structure of the table is successfully inferred.
- If the file format definition is set to 'File format name' and the file type is anything other than CSV, the same file format is used for inferring. No temporary file formats are created.
- The temporary file format follows the naming convention TEMP_{{file_type}}_{{node name}}. Ensure that no predefined file format created in Snowflake has the same name.
- Blank options have been added for a few config options in case you want to exclude any file format option.
File Format Definition - File Format Name
| Setting | Description |
|---|---|
| * File Format Name | Specifies an existing named file format to use for loading data into the table |
| * File Type | * CSV * JSON * ORC * AVRO * PARQUET * XML |
File Format Definition - File Format Values
| Setting | Description |
|---|---|
| File Format Values | Provides file format options for the File Type chosen |
| File Type | Each file type has different configurations available |
| CSV | Compression- String (constant) that specifies the current compression algorithm for the data files to be loaded Record delimiter-Characters that separate records in an input file.Octal notation preferred. Field delimiter- One or more singlebyte or multibyte characters that separate fields in an input file Field optionally enclosed by- Character used to enclose strings Number of header lines to skip- Number of lines at the start of the file to skip Parse Header(InferSchema-true)-Boolean that specifies whether to use the first row headers in the data files to determine column names. Trim Space- Boolean that specifies whether to remove white space from fields Replace invalid characters- Boolean that specifies whether to replace invalid UTF-8 characters with the Unicode replacement character. Encoding- Specifies the character set of the source data when loading data into a table Date format- String that defines the format of date values in the data files to be loaded. Time format- String that defines the format of time values in the data files to be loaded Timestamp format- String that defines the format of timestamp values in the data files to be loaded. Refer to extended options here |
| JSON | Compression- String (constant) that specifies the current compression algorithm for the data files to be loaded Replace invalid characters - Boolean that specifies whether to replace invalid UTF-8 characters with the Unicode replacement character. Trim Space - Boolean that specifies whether to remove white space from fields Strip Outer Array- Boolean that instructs the JSON parser to remove outer brackets [ ]. Date format- String that defines the format of date values in the data files to be loaded Time format- String that defines the format of time values in the data files to be loaded Timestamp format- String that defines the format of timestamp values in the data files to be loaded |
| ORC | Trim Space - Specifies whether to remove white space from fields Replace invalid characters - Boolean that specifies whether to replace invalid UTF-8 characters with the Unicode replacement character |
| AVRO | Trim Space - Boolean that specifies whether to remove white space from fields Replace invalid characters - Boolean that specifies whether to replace invalid UTF-8 characters with the Unicode replacement character. |
| PARQUET | Trim Space - Boolean that specifies whether to remove white space from fields Replace invalid characters - Boolean that specifies whether to replace invalid UTF-8 characters with the Unicode replacement character Using vectorised scanner - Boolean that specifies whether to use a vectorized scanner for loading Parquet files |
| XML | Replace invalid characters - Boolean that specifies whether to replace invalid UTF-8 characters with the Unicode replacement character |
Snowpipe Extended CSV File Format Options
| Setting | Description |
|---|---|
| CSV | Skip blank lines - Boolean that specifies to skip any blank lines encountered in the data files. Nulls for empty field - When loading data, specifies whether to insert SQL NULL for empty fields in an input file. Skip byte order mark - Boolean that specifies whether to skip the BOM (byte order mark), if present in a data file. Parsing error for column mismatch - Boolean that specifies whether to generate a parsing error if the number of delimited columns (i.e. fields) in an input file does not match the number of columns in the corresponding table. Multiple lines - Boolean that specifies whether multiple lines are allowed. If MULTI_LINE is set to FALSE and the specified record delimiter is present within a CSV field, the record containing the field will be interpreted as an error. Binary format - Defines the encoding format for binary input or output. Replace values with NULL - To specify more than one string, enclose the list of strings in parentheses and use commas to separate each value. Escape optionally enclosed by - A singlebyte character string used as the escape character for unenclosed field values only. Escape enclosed/unenclosed values - A singlebyte character string used as the escape character for enclosed or unenclosed field values. |
Snowpipe Copy Options
| Setting | Description |
|---|---|
| On Error Behavior | String (constant) that specifies the error handling for the load operation. * CONTINUE * SKIP_FILE * SKIP_FILE_num - Specify the number of errors that can be skipped. * SKIP_FILE_num% - Specify the percentage of errors that can be skipped. * BLANK OPTION |
| Enforce Length | Boolean with the reverse logic of Truncate Columns: if TRUE, strings that exceed the target column length produce an error instead of being truncated |
| Truncate Columns | Boolean that specifies whether to truncate text strings that exceed the target column length |
Snowpipe System Columns
The set of columns that hold the source data and file metadata information.
- SRC - The data from the file is loaded into this variant column.
- LOAD_TIMESTAMP - Current timestamp when the file gets loaded.
- FILENAME - Name of the staged data file the current row belongs to. Includes the full path to the data file.
- FILE_ROW_NUMBER - Row number for each record in the staged data file.
- FILE_LAST_MODIFIED - Last modified timestamp of the staged data file the current row belongs to
- SCAN_TIME - Start timestamp of operation for each record in the staged data file. Returned as TIMESTAMP_LTZ.
Key points to use the Snowpipe node
- A Snowpipe node can be created by just clicking Create node from the browser if you want the data from the file to be loaded into a single variant column in the target table.
- A Snowpipe node can be added on top of an inferred table (a table created by running an InferSchema node) if you want to load data into specific columns as defined in the files. Refer to InferSchema for more on how to use that node and add a Snowpipe node on top of it.
Use Snowpipe node with InferSchema option
- Set the Infer Schema toggle to true.
- Hit the Create button to infer the schema.
- Click the Re-Sync Columns button.
- If all looks good, set the Infer Schema toggle to false.
- Hit the Create button to create the table based on the inferred schema and the pipe to ingest data.
If the above works, it should be deployable as is. Deploy will simply take the columns and execute a create table and create the pipe.
Snowpipe Deployment
Snowpipe Deployment Parameters
The Snowpipe node includes environment parameters that control deployment behavior. The loadType parameter allows you to specify whether to perform a full load or a reload when you are performing a COPY INTO operation.
The targetIntegration parameter allows you to specify the integration name to be used for auto-ingesting files from cloud providers.
The parameter name is loadType and the default value is ``.
{
"loadType": ""
}
When the parameter value is set to Reload, the data is reloaded into the table regardless of whether the files have been loaded previously and have not changed since they were loaded.
The parameter name is targetIntegration and the default value is DEV ENVIRONMENT.
When set to DEV ENVIRONMENT, the value entered in the Snowpipe Options config Integration is used for ingesting files from the Azure/GCP cloud providers.
{
"targetIntegration": "DEV ENVIRONMENT"
}
When set to any value other than DEV ENVIRONMENT, the node will use the specified value for the integration.
For example, the Snowpipe node will use this specific value for auto-ingestion:
{
"targetIntegration": "arn:aws:sqs:us-east-1:832620633027:sf-snowpipe-AIDA4DXAOTPB7IF437EPD-lZe8ciYpPqGgC9KwWLmiIQ"
}
The parameter name is targetAWSSNSTopic and the default value is DEV ENVIRONMENT.
When set to DEV ENVIRONMENT, the value entered in the Snowpipe Options config AWS SNS Topic is used for ingesting files from AWS where the customer prefers the SNS service.
{
"targetAWSSNSTopic": "DEV ENVIRONMENT"
}
When set to any value other than DEV ENVIRONMENT, the node will use the specified value for the SNS topic.
For example, the Snowpipe node will use this specific value for auto-ingestion:
{
"targetAWSSNSTopic": "arn:aws:sqs:us-east-1:832620633027:sf-snowpipe-AIDA4DXAOTPB7IF437EPD-lZe8ciYpPqGgC9KwWLmiIQ"
}
Snowpipe Initial Deployment
When deployed for the first time into an environment, the Snowpipe node will execute the below stages, depending on whether Enable Snowpipe is enabled, whether 'Load historical data' is enabled, and the loadType parameter.
| Deployment Behavior | Enable Snowpipe | Historical Load | Load Type | Stages Executed |
|---|---|---|---|---|
| Initial Deployment | Enable Snowpipe | true | `` | Create Table, Historical full load using CopyInto, Create Pipe, Alter Pipe |
| Initial Deployment | Enable Snowpipe | true | Reload | Create Table, Truncate Target Table, Historical full load using CopyInto, Create Pipe, Alter Pipe |
| Initial Deployment | Enable Snowpipe | false | Reload or Empty | Create Table, Truncate Target Table, Create Pipe |
| Initial Deployment | Test Copy Statement | false | Reload or Empty | Create Table, Test Copy Statement (no pipe creation) |
Snowpipe Redeployment
Altering the Snowpipe node
Column or table changes, such as a change in table name, dropping an existing column, altering a column data type, or adding a new column, whether made in isolation or all together, will result in an ALTER statement to modify the target table in the target environment. Any table-level changes or node config changes result in recreation of the pipe.
The following stages are executed:
- Rename Table / Alter Column / Delete Column / Add Column / Edit table description: An ALTER TABLE statement is executed to perform the operation.
- Truncate target table: The target table is truncated if the loadType parameter is set to 'Reload'.
- Historical full load using CopyInto: Historical data is loaded if the 'Load historical data' toggle is on.
- Create Pipe: The pipe is recreated if the Enable Snowpipe option is true.
- Alter Pipe: The pipe is refreshed if the 'Load historical data' toggle is on.
Redeployment with no changes
If the nodes are redeployed with no changes compared to the previous deployment, then no stages are executed.
Snowpipe Undeployment
If the Snowpipe node is deleted from a Workspace, that Workspace is committed to Git, and that commit is deployed to a higher-level environment, then the target table in the target environment will be dropped.
This is executed in two stages:
- Drop Table: Target table is dropped
- Drop Pipe: Pipe is dropped
External Tables
The Coalesce External Table node creates a new external table in the current/specified schema or replaces an existing external table.
An external table reads data from a set of one or more files in a specified external stage, which can point to the AWS, GCP, or Azure cloud providers.
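A minimal sketch of the DDL such a node generates, with hypothetical names:
CREATE OR REPLACE EXTERNAL TABLE MY_DB.MY_SCHEMA.MY_EXT_TABLE
  WITH LOCATION = @MY_DB.MY_SCHEMA.MY_STAGE/SUBFOLDER/  -- stage pointing at S3, GCS, or Azure storage
  FILE_FORMAT = (TYPE = PARQUET)
  AUTO_REFRESH = TRUE;  -- refresh table metadata when new files land in the stage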
External Tables Node Configuration
The External Table node type has the following configuration groups:
External Tables Node Properties
There are four configs within the Node Properties group.
| Property | Description |
|---|---|
| Storage Location | Storage Location where the external table will be created |
| Node Type | Name of template used to create node objects |
| Description | A description of the node's purpose |
| Deploy Enabled | If TRUE, the node will be deployed or redeployed when changes are detected. If FALSE, the node will not be deployed, or will be dropped during redeployment. |
General Options
| Settings | Description |
|---|---|
| InferSchema | True / False toggle that determines whether to infer the columns of the file before loading. * True: The node is created with the inferred columns. * False: No infer table step is executed. |
| Partition by toggle | When set to true, you can choose a partition column from a dropdown. A partition column must evaluate as an expression that parses the path and/or filename information in the METADATA$FILENAME column. By default, partitioning is done by the FILENAME column |
| Partition Column dropdown (Partition by toggle true) | Choose the partition column, with an appropriate transformation, from the dropdown |
External Tables Node File Location
InferSchema-true
InferSchema-false
| Settings | Description |
|---|---|
| Coalesce Storage Location of Stage (Required) | A storage location in Coalesce where the stage is located |
| Stage Name (Required) | Internal or External stage where the files containing data to be loaded are staged |
| File Name(s) (Ex: 'a.csv','b.csv') | Enabled when the InferSchema toggle is true. Specifies a list of one or more file names (separated by commas) to be loaded. For example, 'a.csv','b.csv' |
| File Pattern | A regular expression pattern string, enclosed in single quotes, specifying the file names or paths to match. For example, '.*hea.*[.]csv' |
External Tables File Format
InferSchema-true
InferSchema-false
File Format Definition - File Format Name
| Settings | Description |
|---|---|
| Coalesce Storage Location of File Format | Location in Coalesce pointing to the database and schema where the file format resides. Mandatory when File Format Name is chosen |
| File Format Name | Specifies an existing named file format to use for loading data into the table |
| File Type | * CSV * JSON * ORC * AVRO * PARQUET |
File Format Definition - File Format Values
| Settings | Description |
|---|---|
| File Format Values | Provides file format options for the File Type chosen |
| File Type | Each file type has different configurations available |
| CSV | Compression- String (constant) that specifies the current compression algorithm for the data files to be loaded Record delimiter-Characters that separate records in an input file Field delimiter- One or more singlebyte or multibyte characters that separate fields in an input file Field optionally enclosed by- Character used to enclose strings Number of header lines to skip- Number of lines at the start of the file to skip Skip blank lines- Boolean that specifies to skip any blank lines encountered in the data files Trim Space- Boolean that specifies whether to remove white space from fields Replace invalid characters- Boolean that specifies whether to replace invalid UTF-8 characters with the Unicode replacement character. Date format- String that defines the format of date values in the data files to be loaded. Time format- String that defines the format of time values in the data files to be loaded Timestamp format-String that defines the format of timestamp values in the data files to be loaded |
| JSON | Compression- String (constant) that specifies the current compression algorithm for the data files to be loaded Replace invalid characters - Boolean that specifies whether to replace invalid UTF-8 characters with the Unicode replacement character. Trim Space - Boolean that specifies whether to remove white space from fields Strip Outer Array- Boolean that instructs the JSON parser to remove outer brackets [ ]. Date format- String that defines the format of date values in the data files to be loaded Time format- String that defines the format of time values in the data files to be loaded Timestamp format- String that defines the format of timestamp values in the data files to be loaded |
| ORC | Trim Space - Specifies whether to remove white space from fields Replace invalid characters - Boolean that specifies whether to replace invalid UTF-8 characters with the Unicode replacement character |
| AVRO | Trim Space - Boolean that specifies whether to remove white space from fields Replace invalid characters - Boolean that specifies whether to replace invalid UTF-8 characters with the Unicode replacement character. |
| PARQUET | Trim Space - Boolean that specifies whether to remove white space from fields Replace invalid characters - Boolean that specifies whether to replace invalid UTF-8 characters with the Unicode replacement character. |
| XML | Replace invalid characters - Boolean that specifies whether to replace invalid UTF-8 characters with the Unicode replacement character |
External Tables Additional Options
| Settings | Description |
|---|---|
| Auto refresh | True/False toggle that specifies whether Snowflake should enable triggering automatic refreshes of the external table metadata when new or updated data files are available in the named external stage. * False - Auto refresh does not take place. * True - Auto refresh takes place. Setting this to true gives you the option to choose a Cloud Provider from AWS, Azure, and GCP. * AWS - AWS SNS Topic. Enabled only when Cloud Provider is AWS and Auto refresh is true. Required only when configuring AUTO_REFRESH for Amazon S3 stages using Amazon Simple Notification Service (SNS). * GCP - Integration. Enabled when Cloud Provider is GCP/Azure. Specifies the existing notification integration used to access the storage queue. * Azure - Integration. Enabled when Cloud Provider is GCP/Azure. Specifies the existing notification integration used to access the storage queue. |
System Columns
The set of columns that hold the source data and file metadata information.
- VALUE - The data from the file is loaded into this variant column
- FILENAME - Name of the staged data file the current row belongs to. Includes the full path to the data file.
- PARTITION_COLUMN - A column with default transformation split_part(metadata$filename,'/',2) is added to partition external table.
Key points to use the External Table node
- An External Table node can be created by just clicking Create node from the browser if you want the data from the file to be loaded into a single variant column in the target table.
- An External Table node can be added on top of an inferred table (a table created by running an InferSchema node) if you want to load data into specific columns as defined in the files. Refer to InferSchema for more on how to use that node and add an External Table node on top of it.
- A column with the default transformation split_part(metadata$filename,'/',2) is added to partition the external table. You can rename the column and modify the transformation as required, and any other column with an appropriate transformation can be used for partitioning; see the sketch after this list.
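A hedged sketch of what the partitioned DDL might look like with the default transformation; the object names are placeholders:
CREATE OR REPLACE EXTERNAL TABLE MY_DB.MY_SCHEMA.MY_EXT_TABLE (
  PARTITION_COLUMN VARCHAR AS SPLIT_PART(METADATA$FILENAME, '/', 2)  -- default transformation
)
PARTITION BY (PARTITION_COLUMN)
WITH LOCATION = @MY_DB.MY_SCHEMA.MY_STAGE
FILE_FORMAT = (TYPE = CSV);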
Use External Table with InferSchema option
- Set the Infer Schema toggle to true.
- Hit the Create button to infer the schema.
- Click the Re-Sync Columns button.
- If all looks good, set the Infer Schema toggle to false.
- Hit the Create button to create the external table based on the inferred schema.
If the above works, it should be deployable as is. Deploy will simply take the columns and execute a create external table.
External table Deployment
Initial Deployment
When deployed for the first time into an environment the External table node will execute the below stage:
- Create External Table - This will execute a CREATE OR REPLACE statement and create a table in the target environment.
Redeployment
Recreating the External table
The subsequent deployment of External table with changes in config options will recreate the table.
The following stages are executed:
- Create External Table
Redeployment with no changes
If the nodes are redeployed with no changes compared to the previous deployment, then no stages are executed.
Undeployment
If the External Table node is deleted from a Workspace, that Workspace is committed to Git, and that commit is deployed to a higher-level environment, then the target table in the target environment will be dropped.
- Drop External Table
InferSchema
The Coalesce InferSchema UDN is a versatile node that infers the schema of a file in an internal or external stage and dynamically creates a target table with the same name as the InferSchema node.
InferSchema in Snowflake automatically detects the file metadata schema in a set of staged data files that contain semi-structured data and retrieves the column definitions.
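The 'Infer and Create table' behavior corresponds to Snowflake's CREATE TABLE ... USING TEMPLATE pattern on top of the INFER_SCHEMA function. A hedged sketch with hypothetical names:
CREATE OR REPLACE TABLE MY_DB.MY_SCHEMA.INFER_JSON
USING TEMPLATE (
  SELECT ARRAY_AGG(OBJECT_CONSTRUCT(*))  -- column definitions returned by INFER_SCHEMA
  FROM TABLE(
    INFER_SCHEMA(
      LOCATION    => '@MY_DB.MY_SCHEMA.MY_STAGE',
      FILE_FORMAT => 'MY_DB.MY_SCHEMA.MY_JSON_FORMAT',
      FILES       => 'a.json'
    )
  )
);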
Key points for InferSchema
- A sample file must be present in the internal or external stage.
- An existing file format is needed to parse the file.
- The file format is also expected to be in the same location as the stage.
- After creating the inferred node, the mapping grid can be re-synced with the exact structure of the table/transient table using the Re-Sync Columns button in the mapping grid.
InferSchema Node Configuration
The InferSchema node type has the following configuration groups. Go to the node and select the Config tab to see Node Properties, Source Data, and File Format Options.
InferSchema Node Properties
There are four configs within the Node Properties group.
| Property | Description |
|---|---|
| Storage Location | Storage Location where the inferred table will be created |
| Node Type | Name of template used to create node objects |
| Description | A description of the node's purpose |
| Deploy Enabled | If TRUE, the node will be deployed or redeployed when changes are detected. If FALSE, the node will not be deployed, or will be dropped during redeployment. |
InferSchema Source Data
| Settings | Description |
|---|---|
| Create As | Select from the options to create as Table or Transient Table. * Table * Transient Table |
| Coalesce Storage Location of Stage(Required) | A storage location in Coalesce where the stage is located |
| Stage Name (Required) | Internal or External stage where the files containing data to be loaded are staged |
| File Names (Required) | File whose metadata is to be inferred. Use commas to separate multiple files |
| Coalesce Storage Location of File Format | Location in Coalesce pointing to the database and schema where the file format resides |
| File Format Name (Required) | Name of the file format object that describes the data contained in the staged files. It is expected to be in the same location as the Stage |
| Redeployment Behavior | CREATE OR REPLACE: Dynamically creates the target table based on the inferred schema from the file in the staging area. ALTER EXISTING TABLE: Dynamically alters the inferred table by comparing the inferred schema of the same file (with changes, if any) and the created table. DROP EXISTING TABLE: Drops the inferred table |
InferSchema File Format Options
File format config-Inferschema
- It is not mandatory to create a file format in Snowflake to infer the structure of the file.
- If the file format definition is set to 'File format values', a temporary file format is created based on the config values and dropped after the structure of the table is successfully inferred.
- If the file format definition is set to 'File format name' and the file type is CSV, a temporary file format is created to adapt it for inferring the structure, and it is dropped after the structure of the table is successfully inferred.
- If the file format definition is set to 'File format name' and the file type is anything other than CSV, the same file format is used for inferring. No temporary file formats are created.
- The temporary file format follows the naming convention TEMP_{{file_type}}_{{node name}}. Ensure that no predefined file format created in Snowflake has the same name.
File Format Definition - File Format Name
| Settings | Description |
|---|---|
| Coalesce Storage Location of File Format | Location in Coalesce pointing to the database and schema where the file format resides. Mandatory when File Format Name is chosen |
| File Format Name | Specifies an existing named file format to use for loading data into the table |
| File Type | * CSV * JSON * ORC * AVRO * PARQUET |
File Format Definition - File Format Values
| Settings | Description |
|---|---|
| File Format Values | Provides file format options for the File Type chosen |
| File Type | Each file type has different configurations available |
| CSV | Compression - String (constant) that specifies the current compression algorithm for the data files to be loaded. Record delimiter - Characters that separate records in an input file. Field delimiter - One or more singlebyte or multibyte characters that separate fields in an input file. Field optionally enclosed by - Character used to enclose strings. Parse Header - Boolean that specifies whether to use the first row headers in the data files to determine column names. Encoding - Specifies the character set of the source data when loading data into a table. Refer to the Extended CSV File Format Options below |
| JSON | Compression- String (constant) that specifies the current compression algorithm for the data files to be loaded Replace invalid characters - Boolean that specifies whether to replace invalid UTF-8 characters with the Unicode replacement character. Trim Space - Boolean that specifies whether to remove white space from fields Strip Outer Array- Boolean that instructs the JSON parser to remove outer brackets [ ]. |
| ORC | Trim Space - Specifies whether to remove white space from fields Replace invalid characters - Boolean that specifies whether to replace invalid UTF-8 characters with the Unicode replacement character |
| AVRO | Trim Space - Boolean that specifies whether to remove white space from fields Replace invalid characters - Boolean that specifies whether to replace invalid UTF-8 characters with the Unicode replacement character. |
| PARQUET | Trim Space - Boolean that specifies whether to remove white space from fields Replace invalid characters - Boolean that specifies whether to replace invalid UTF-8 characters with the Unicode replacement character. |
| XML | Replace invalid characters - Boolean that specifies whether to replace invalid UTF-8 characters with the Unicode replacement character |
InferSchema Extended CSV File Format Options
| Setting | Description |
|---|---|
| CSV | Skip blank lines - Boolean that specifies to skip any blank lines encountered in the data files. Trim Space - Boolean that specifies whether to remove white space from fields. Replace invalid characters - Boolean that specifies whether to replace invalid UTF-8 characters with the Unicode replacement character. Nulls for empty field - When loading data, specifies whether to insert SQL NULL for empty fields in an input file. Skip byte order mark - Boolean that specifies whether to skip the BOM (byte order mark), if present in a data file. Parsing error for column mismatch - Boolean that specifies whether to generate a parsing error if the number of delimited columns (i.e. fields) in an input file does not match the number of columns in the corresponding table. Multiple lines - Boolean that specifies whether multiple lines are allowed. If MULTI_LINE is set to FALSE and the specified record delimiter is present within a CSV field, the record containing the field will be interpreted as an error. Binary format - Defines the encoding format for binary input or output. Replace values with NULL - To specify more than one string, enclose the list of strings in parentheses and use commas to separate each value. Escape optionally enclosed by - A singlebyte character string used as the escape character for unenclosed field values only. Escape enclosed/unenclosed values - A singlebyte character string used as the escape character for enclosed or unenclosed field values. |
InferSchema Usage
Option 1 - Using API-NODEUPDATE
- Add an InferSchema node, for example INFER_JSON, and hit Create. The 'Infer and Create table' stage runs and creates a table with the same name as the InferSchema node.
- Use an API-NODEUPDATE node to update the columns of the infer node created in the above step.
- Now we can add a CopyInto, Snowpipe, or External Table node on top of the inferred table to load staged files.
- Once the required metadata columns are in the downstream node (CopyInto, Snowpipe, or External Table), bulk edit the source in the node to be blank. Also delete the FROM clause in the Join tab. This information is not required for CopyInto, Snowpipe, or External Table to load data from files.
- Delete the API-NODEUPDATE node created in step 2, as the columns required for the downstream nodes have been updated and their data is derived from files.
- Ultimately, API-NODEUPDATE nodes do not need to be deployed. It is sufficient to deploy the CopyInto, Snowpipe, or External Table nodes, which hold the data from the staged files.
Option 2 - Using the Re-Sync Columns button
- Add an InferSchema node, for example INFER_JSON, and hit Create. The 'Infer and Create table' stage runs and creates a table with the same name as the InferSchema node.
- Use the Re-Sync Columns button in the mapping grid to update the columns of the infer node created in the above step.
- Now we can add a CopyInto, Snowpipe, or External Table node on top of the inferred table to load staged files.
InferSchema Deployment
InferSchema Initial Deployment
When deployed for the first time into an environment, the InferSchema node will execute the following stage:
- Stage executed: Infer and Create target table