External Data Package

  • Copy-Into

Overview

Node types for data loading from files

Installation

  • Copy the Package ID  @coalesce/databricks/external-data-package
  • In Coalesce, open the Workspace where you wish to install the package.
  • Go to the Build Settings of the Workspace, open the Packages tab, and click the Install button at the top right of the page.
  • Paste the Package ID, and proceed with the installation process.

Description

External Data Package

The Coalesce Databricks External Data Package includes:

CopyInto

CopyInto Node Configuration

The Copy-Into node type has the following configurations available:

CopyInto - Node Properties

There are four configs within the Node Properties group.

Property | Description
Storage Location | Storage Location where the table will be created
Node Type | Name of template used to create node objects
Description | A description of the node's purpose
Deploy Enabled | If TRUE, the node will be deployed / redeployed when changes are detected.
If FALSE, the node will not be deployed, or will be dropped during redeployment.

Key points to use CopyInto Node

  • A CopyInto node can be created by simply clicking Create node from the browser if you want the data from the file loaded into a single variant column in the target table.
  • The data can be reloaded into the table by truncating it before the load, using the TruncateBefore option in the node config or the reload parameter.
  • The path or subfolder name inside the stage where the file is located can be specified using the 'Path or subfolder' config. Do not prefix or suffix '/' in the path name. For example, one level: 'SUBFOLDER'; two levels: 'SUBFOLDER/INNERFOLDER'.
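The truncate-and-reload behavior described above can be sketched in Databricks SQL. The catalog, schema, table, and volume names below are hypothetical, for illustration only:

```sql
-- With TruncateBefore enabled, the node first empties the target table:
TRUNCATE TABLE my_catalog.my_schema.stg_orders;

-- Then COPY INTO reloads files from the configured location and subfolder:
COPY INTO my_catalog.my_schema.stg_orders
FROM '/Volumes/my_catalog/my_schema/landing/SUBFOLDER/INNERFOLDER'
FILEFORMAT = CSV
FORMAT_OPTIONS ('header' = 'true')
COPY_OPTIONS ('force' = 'true');  -- reload files even if previously loaded
```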

Use CopyInto node with InferSchema option

  • Set the Infer Schema toggle to true
  • Hit the Create button to infer the schema
  • To choose the file format configs, refer to the File Format Options section
  • Click the Re-Sync Columns button
  • If all looks good, set the Infer Schema toggle to false
  • Hit the Create button to execute a create table based on the inferred schema
  • This is mainly a test to make sure the create will work
  • Hit the Run button to execute the DML

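One way to preview what the infer-schema step will produce is Databricks' `read_files` table function, which infers columns from the source files. This is a sketch with hypothetical paths, not the node's actual generated SQL:

```sql
-- Preview the columns that schema inference would produce for CSV files:
SELECT * FROM read_files(
  '/Volumes/my_catalog/my_schema/landing/',
  format => 'csv',
  header => true
)
LIMIT 10;

-- Once the inferred columns look correct, the node creates the table
-- and the subsequent COPY INTO loads it with the same format options.
```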
Resync

If the above works, it should be deployable as is. Deploy will simply take the columns and execute a create table.

CopyInto - General Options


Option | Description
Create As | Select from the options to create as:
- Table
TruncateBefore | True / False toggle that determines whether or not the table is truncated before reloading
- True: The table is truncated and a Copy-Into statement is executed to reload the data into the target table
- False: Data is loaded directly into the target table and no truncate action takes place
InferSchema | True / False toggle that determines whether or not to infer the columns of the file before loading
- True: The node is created with the inferred columns
- False: No infer-table step is executed
Source data

Setting | Description
Path (Ex: gs://bucket/base/path) | A path can be an internal volume or external volume in Databricks, or an external location pointing to an S3 bucket / GCP container.
File Name (Ex: a.csv, b.csv) | Specifies the file names to be loaded. Note: it is advised to add either the file name or the file pattern for file loads.
Path or subfolder | Not mandatory. Specifies the path or subfolders inside the stage where the file is located. Ensure that '/' is not prefixed before or after the subfolder name.
File Pattern (Ex: '.hea.[.]csv') | A regular expression pattern string, enclosed in single quotes, specifying the file names or paths to match. Note: it is advised to add either the file name or the file pattern for file loads.
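The source data settings above correspond to clauses of the Databricks COPY INTO statement. A hypothetical sketch (table and bucket names are placeholders):

```sql
COPY INTO my_catalog.my_schema.stg_events
FROM 'gs://bucket/base/path/SUBFOLDER'   -- Path plus 'Path or subfolder'
FILEFORMAT = CSV
FILES = ('a.csv', 'b.csv');              -- File Name: an explicit list

-- Alternatively, select files by regular expression instead of by name
-- (FILES and PATTERN should not be combined):
-- PATTERN = '.*[.]csv'
```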

CopyInto - File Format Options

Setting | Description
File Format Values | Provides file format options for the File Type chosen
File Type | Each file type has different configurations available. File types supported: CSV, JSON, XML, ORC, AVRO, PARQUET

GENERIC (all file types)
- ignoreCorruptFiles: Whether to ignore corrupt files.
- ignoreMissingFiles: Boolean; whether to ignore missing files.

CSV
- Record delimiter: Characters that separate records in an input file
- Field delimiter: One or more singlebyte or multibyte characters that separate fields in an input file
- Parse Header (InferSchema: true): Boolean that specifies whether to use the first-row headers in the data files to determine column names
- Encoding: Specifies the character set of the source data when loading data into a table
- Date format: String that defines the format of date values in the data files to be loaded
- Timestamp format: String that defines the format of timestamp values in the data files to be loaded
- Multi-line: Whether the CSV records span multiple lines

JSON
- Date format: String that defines the format of date values in the data files to be loaded
- Timestamp format: String that defines the format of timestamp values in the data files to be loaded
- Multi-line: Whether the JSON records span multiple lines
- Encoding: Specifies the character set of the source data when loading data into a table

XML
- Date format: String that defines the format of date values in the data files to be loaded
- Timestamp format: String that defines the format of timestamp values in the data files to be loaded
- Multi-line: Whether the XML records span multiple lines
- rowTag: The row tag of the XML files to treat as a row. In the example XML <books> <book></book>...</books>, the appropriate value is book. This is a required option for XML.
- rootTag: Root tag of the XML files. For example, in <books> <book></book>...</books>, the appropriate value is books.
- Encoding: Specifies the character set of the source data when loading data into a table
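These settings map onto the FORMAT_OPTIONS clause of COPY INTO. A hypothetical CSV example using standard Databricks option names (table and path are placeholders):

```sql
COPY INTO my_catalog.my_schema.stg_csv
FROM '/Volumes/my_catalog/my_schema/landing/'
FILEFORMAT = CSV
FORMAT_OPTIONS (
  'header' = 'true',                          -- Parse Header
  'delimiter' = ',',                          -- Field delimiter
  'lineSep' = '\n',                           -- Record delimiter
  'encoding' = 'UTF-8',                       -- Encoding
  'dateFormat' = 'yyyy-MM-dd',                -- Date format
  'timestampFormat' = 'yyyy-MM-dd HH:mm:ss',  -- Timestamp format
  'multiLine' = 'true',                       -- Multi-line
  'ignoreCorruptFiles' = 'true'               -- generic option
);
```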

CopyInto - Copy Options

Setting | Description
Force | Boolean, default false. If set to true, idempotency is disabled and files are loaded regardless of whether they have been loaded before.
MergeSchema | Boolean, default false. If set to true, the schema can be evolved according to the incoming data.
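These two settings correspond to the COPY_OPTIONS clause of COPY INTO. A hypothetical sketch:

```sql
COPY INTO my_catalog.my_schema.stg_orders
FROM '/Volumes/my_catalog/my_schema/landing/'
FILEFORMAT = PARQUET
COPY_OPTIONS (
  'force' = 'true',       -- disable idempotency: reload already-loaded files
  'mergeSchema' = 'true'  -- evolve the table schema to match incoming data
);
```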

CopyInto - System Columns

The set of columns that carry source data and file metadata information:

  • LOAD_TIMESTAMP - Current timestamp when the file gets loaded.
  • FILENAME - Name of the file the current row belongs to.
  • FILEPATH - Full path of the file in storage
  • FILEBLOCKSTART - Start offset of the file split being read
  • FILEBLOCKEND - Length of the file split being read
  • FILESIZE - Size of the file in bytes
  • FILE_LAST_MODIFIED - Last modified timestamp of the file
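These system columns plausibly map onto the `_metadata` struct that Databricks file readers expose (the `_metadata` field names below are real; the mapping to the node's column names is an assumption):

```sql
-- Sketch: how the system columns could be derived from _metadata
-- when reading the staged files (path is hypothetical):
SELECT
  current_timestamp()              AS LOAD_TIMESTAMP,
  _metadata.file_name              AS FILENAME,
  _metadata.file_path              AS FILEPATH,
  _metadata.file_block_start       AS FILEBLOCKSTART,
  _metadata.file_block_length      AS FILEBLOCKEND,
  _metadata.file_size              AS FILESIZE,
  _metadata.file_modification_time AS FILE_LAST_MODIFIED,
  *
FROM read_files('/Volumes/my_catalog/my_schema/landing/', format => 'csv');
```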

CopyInto Deployment

CopyInto Initial Deployment

When deployed for the first time into an environment, a Copy-Into node of materialization type table will execute the stage below:

Deployment Behavior | Stages Executed
Initial Deployment | Create Table

CopyInto Redeployment

Altering the CopyInto Tables

Column or table changes such as a change in table name, dropping an existing column, altering a column data type, or adding a new column, whether made in isolation or all together, will result in an ALTER statement to modify the table in the target environment.

The following stages are executed:

  • Rename Table | Alter Column | Delete Column | Add Column | Edit table description: an ALTER statement is executed to perform the operation.
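Hypothetical examples of the ALTER statements each change would produce (table and column names are placeholders; some operations depend on Delta table features such as column mapping or type widening):

```sql
-- Rename Table
ALTER TABLE my_catalog.my_schema.stg_orders RENAME TO my_catalog.my_schema.stg_orders_v2;

-- Add Column
ALTER TABLE my_catalog.my_schema.stg_orders ADD COLUMN region STRING;

-- Delete Column (requires column mapping on the Delta table)
ALTER TABLE my_catalog.my_schema.stg_orders DROP COLUMN legacy_flag;

-- Alter Column data type (only widening conversions are supported)
ALTER TABLE my_catalog.my_schema.stg_orders ALTER COLUMN amount TYPE BIGINT;

-- Edit table description
COMMENT ON TABLE my_catalog.my_schema.stg_orders IS 'Orders staging table';
```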

Redeployment with no changes

If the nodes are redeployed with no changes compared to the previous deployment, then no stages are executed.

CopyInto Undeployment

If a CopyInto node is deleted from a Workspace, that Workspace is committed to Git, and that commit is deployed to a higher-level environment, then the target table in the target environment will be dropped.

  • Drop table: the target table in Databricks is dropped

Code

CopyInto

Versions

Available versions of the package.

Version # | Release Date | Notes
1.0.0 | September 11, 2025 | Copy-Into node type to load data from files into Databricks