External Data Package
Overview
Node types for data loading from files
Installation
- In Coalesce, open the Workspace where you wish to install the package.
- Go to the Build Settings of the Workspace, open the Packages tab, and click the Install button at the top right of the page.
- Paste the Package ID, and proceed with the installation process.
Description
External Data Package
The Coalesce Databricks External Data Package includes:
CopyInto
CopyInto Node Configuration
The CopyInto node type has the following configurations available:
CopyInto - Node Properties
There are four configs within the Node Properties group.
Property | Description |
---|---|
Storage Location | Storage Location where the table will be created |
Node Type | Name of template used to create node objects |
Description | A description of the node's purpose |
Deploy Enabled | If TRUE, the node will be deployed or redeployed when changes are detected. If FALSE, the node will not be deployed, or will be dropped during redeployment. |
Key points for using the CopyInto node
- A CopyInto node can be created directly with Create Node from the browser when you want the file data loaded into a single variant column in the target table.
- Data can be reloaded into the table by truncating it before the load, using either the TruncateBefore option in the node config or the reload parameter.
- The path or subfolder inside the stage where the file is located can be specified with the 'Path or subfolder' config. Do not prefix or suffix the path with '/'. For example: one level is 'SUBFOLDER', two levels is 'SUBFOLDER/INNERFOLDER'.
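As a hedged sketch, the TruncateBefore reload pattern corresponds to running a truncate ahead of the load. Table, volume, and subfolder names below are placeholders, not the exact SQL Coalesce generates:

```sql
-- TruncateBefore = true: clear the target table, then reload (illustrative names)
TRUNCATE TABLE my_catalog.my_schema.target_table;

COPY INTO my_catalog.my_schema.target_table
FROM '/Volumes/my_catalog/my_schema/landing/SUBFOLDER'  -- subfolder appended without leading/trailing '/'
FILEFORMAT = JSON;
```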
Using the CopyInto node with the InferSchema option
- Set the Infer Schema toggle to true.
- Click the Create button to infer the schema.
- To choose the file format configs, refer to the File Format Options section below.
- Click the Re-Sync Columns button.
- If the columns look correct, set the Infer Schema toggle back to false.
- Click the Create button to create the table based on the inferred schema.
- This is mainly a test to make sure the create will work.
- Click the Run button to execute the DML.
If the above works, the node should be deployable as is. Deploy will simply take the columns and execute a CREATE TABLE.
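Under the hood, the Create and Run steps above roughly correspond to the following SQL. This is an illustrative sketch: the column list comes from whatever schema was inferred, and all names here are placeholders.

```sql
-- Create: table DDL generated from the inferred columns (placeholders shown)
CREATE TABLE IF NOT EXISTS my_catalog.my_schema.target_table (
  id        BIGINT,    -- placeholder for an inferred column
  name      STRING,    -- placeholder for an inferred column
  loaded_at TIMESTAMP  -- placeholder for an inferred column
);

-- Run: load the files into the created table
COPY INTO my_catalog.my_schema.target_table
FROM '/Volumes/my_catalog/my_schema/landing'
FILEFORMAT = CSV
FORMAT_OPTIONS ('header' = 'true', 'inferSchema' = 'true');
```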
CopyInto - General Options
Option | Description |
---|---|
Create As | Select how the node is materialized. Currently the only option is Table. |
TruncateBefore | True / False toggle that determines whether the table is truncated before reloading. True: the table is truncated and a Copy-Into statement is executed to reload the data into the target table. False: data is loaded directly into the target table and no truncate takes place. |
InferSchema | True / False toggle that determines whether to infer the columns of the file before loading. True: the node is created with the inferred columns. False: no infer-table step is executed. |
Source data
Setting | Description |
---|---|
Path (Ex: gs://bucket/base/path) | The path can be an internal volume, an external volume in Databricks, or an external location pointing to an S3 bucket or GCS container. |
File Name (Ex: a.csv,b.csv) | Specifies the file names to be loaded. Note: it is advised to add either the file name or the file pattern for file loads. |
Path or subfolder | Optional. Specifies the path or subfolders inside the stage where the file is located. Ensure that '/' is not prefixed before or after the subfolder name. |
File Pattern (Ex: '.*hea.*[.]csv') | A regular expression pattern string, enclosed in single quotes, specifying the file names or paths to match. Note: it is advised to add either the file name or the file pattern for file loads. |
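The source data settings map onto the FROM clause of the generated COPY INTO. A hedged illustration (bucket, table, and file names are placeholders):

```sql
COPY INTO my_catalog.my_schema.target_table
FROM 'gs://bucket/base/path/SUBFOLDER'  -- Path plus 'Path or subfolder', no leading/trailing '/'
FILEFORMAT = CSV
FILES = ('a.csv', 'b.csv');             -- corresponds to the File Name setting
-- or, instead of listing files, match by pattern (the File Pattern setting):
-- PATTERN = '.*hea.*[.]csv'
```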
CopyInto - File Format Options
Setting | Description |
---|---|
File Format Values | Provides file format options for the File Type chosen |
File Type | Each file type has different configurations available. Supported file types: CSV, JSON, XML, ORC, AVRO, PARQUET. |
GENERIC (ALL FILE TYPES) | ignoreCorruptFiles: whether to ignore corrupt files. ignoreMissingFiles: whether to ignore missing files. |
CSV | Record delimiter: characters that separate records in an input file. Field delimiter: one or more singlebyte or multibyte characters that separate fields in an input file. Parse Header (InferSchema = true): boolean that specifies whether to use the first-row headers in the data files to determine column names. Encoding: specifies the character set of the source data. Date format: string that defines the format of date values in the data files to be loaded. Timestamp format: string that defines the format of timestamp values in the data files to be loaded. Multi-line: whether the CSV records span multiple lines. |
JSON | Date format: string that defines the format of date values in the data files to be loaded. Timestamp format: string that defines the format of timestamp values in the data files to be loaded. Multi-line: whether the JSON records span multiple lines. Encoding: specifies the character set of the source data. |
XML | Date format: string that defines the format of date values in the data files to be loaded. Timestamp format: string that defines the format of timestamp values in the data files to be loaded. Multi-line: whether the XML records span multiple lines. rowTag: the row tag of the XML files to treat as a row; in the example XML `<books> <book></book>...</books>`, the appropriate value is `book` (required for XML). rootTag: root tag of the XML files; for example, in `<books> <book></book>...</books>`, the appropriate value is `books`. Encoding: specifies the character set of the source data. |
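As a hedged illustration, the file format values are passed through the FORMAT_OPTIONS clause of COPY INTO. Option names below follow the Databricks file-reader conventions, and all table and path names are placeholders:

```sql
COPY INTO my_catalog.my_schema.target_table
FROM '/Volumes/my_catalog/my_schema/landing'
FILEFORMAT = CSV
FORMAT_OPTIONS (
  'delimiter'       = ',',                    -- field delimiter
  'lineSep'         = '\n',                   -- record delimiter
  'header'          = 'true',                 -- parse header
  'encoding'        = 'UTF-8',                -- character set of the source data
  'dateFormat'      = 'yyyy-MM-dd',           -- date format
  'timestampFormat' = 'yyyy-MM-dd HH:mm:ss',  -- timestamp format
  'multiLine'       = 'false'                 -- records do not span multiple lines
);
```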
CopyInto - Copy Options
Setting | Description |
---|---|
Force | Boolean, default false. If set to true, idempotency is disabled and files are loaded regardless of whether they've been loaded before. |
MergeSchema | Boolean, default false. If set to true, the schema can be evolved according to the incoming data. |
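These two settings map onto the COPY_OPTIONS clause. A hedged sketch with placeholder names:

```sql
COPY INTO my_catalog.my_schema.target_table
FROM '/Volumes/my_catalog/my_schema/landing'
FILEFORMAT = PARQUET
COPY_OPTIONS (
  'force'       = 'true',  -- disable idempotency; reload files even if already loaded
  'mergeSchema' = 'true'   -- allow the table schema to evolve with incoming data
);
```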
CopyInto - System Columns
A set of system columns that carry source data and file metadata information.
- LOAD_TIMESTAMP - Current timestamp when the file gets loaded.
- FILENAME - Name of the file the current row belongs to.
- FILEPATH - Full path of the file in storage
- FILEBLOCKSTART - Start offset of the file split being read
- FILEBLOCKEND - Length of the file split being read
- FILESIZE - Size of the file in bytes
- FILE_LAST_MODIFIED - Last modified timestamp of the file
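These system columns appear to correspond to the Databricks `_metadata` file-metadata fields. A hedged sketch of how such a load could be expressed (placeholder names; the exact SQL Coalesce generates may differ):

```sql
COPY INTO my_catalog.my_schema.target_table
FROM (
  SELECT
    *,
    current_timestamp()              AS LOAD_TIMESTAMP,      -- load time
    _metadata.file_name              AS FILENAME,            -- name of the source file
    _metadata.file_path              AS FILEPATH,            -- full path in storage
    _metadata.file_block_start       AS FILEBLOCKSTART,      -- start offset of the file split
    _metadata.file_block_length      AS FILEBLOCKEND,        -- length of the file split
    _metadata.file_size              AS FILESIZE,            -- size in bytes
    _metadata.file_modification_time AS FILE_LAST_MODIFIED   -- last modified timestamp
  FROM '/Volumes/my_catalog/my_schema/landing'
)
FILEFORMAT = CSV;
```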
CopyInto Deployment
CopyInto Initial Deployment
When deployed for the first time into an environment, a CopyInto node of materialization type Table executes the following stage:
Deployment Behavior | Stages Executed |
---|---|
Initial Deployment | Create Table |
CopyInto Redeployment
Altering the CopyInto Tables
Column or table changes such as renaming the table, dropping an existing column, altering a column's data type, or adding a new column, whether made in isolation or all together, will result in an ALTER statement to modify the target table in the target environment.
The following stages are executed:
- Rename Table | Alter Column | Delete Column | Add Column | Edit table description: an ALTER TABLE statement is executed to perform the operation.
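As a hedged illustration, the redeployment stages correspond to ALTER statements along these lines. Names are placeholders, and some operations (such as dropping or renaming columns) may require Delta features like column mapping on the target table:

```sql
ALTER TABLE my_schema.target_table RENAME TO my_schema.target_table_v2;       -- rename table
ALTER TABLE my_schema.target_table ADD COLUMN new_col STRING;                 -- add column
ALTER TABLE my_schema.target_table DROP COLUMN old_col;                       -- drop column
ALTER TABLE my_schema.target_table ALTER COLUMN amount COMMENT 'net amount';  -- edit description
```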
Redeployment with no changes
If the nodes are redeployed with no changes compared to the previous deployment, no stages are executed.
CopyInto Undeployment
If the CopyInto node is deleted from a Workspace, that Workspace is committed to Git, and that commit is deployed to a higher-level environment, then the target table in the target environment will be dropped.
- Drop table: the target table in Databricks is dropped.
Code
CopyInto
Versions
Available versions of the package.
Version # | Release Date | Notes |
---|---|---|
1.0.0 | September 11, 2025 | |