Databricks

Extract Databricks metadata into Catalog using the castor-extractor package.

Prerequisites

Installation Required

Follow the castor-extractor installation instructions before running the extraction.

We recommend creating a dedicated service principal to extract your metadata.

Follow the instructions for creating a service principal on Databricks to create the OAuth credentials.

Run Extraction Script

Once the package has been installed, you should be able to run the following command in your terminal:

castor-extract-databricks [arguments]

The script will run and display logs like the following:

INFO - Extracting `CATALOG` ...
INFO - Available databases: ['main', 'samples']
INFO - Extracted 8 databases to /tmp/catalog/1780550718-database.csv
INFO - Extracted 203 schemas to /tmp/catalog/1780550718-schema.csv

...

INFO - Extracted 128 view_ddl to /tmp/catalog/1780550718-view_ddl.csv
INFO - Wrote output file: /tmp/catalog/1780550718-summary.json
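The sample log above suggests that every output file name starts with the run's timestamp. Under that assumption, a minimal sketch for locating the most recent summary file in an output folder could look like this. The function name `latest_summary` is hypothetical and not part of castor-extractor, and the sketch assumes the timestamp prefix has a fixed width so that sorted file names are also in chronological order.

```shell
# Hypothetical helper, not part of castor-extractor.
# Prints the path of the newest "<timestamp>-summary.json" in the given folder.
latest_summary() {
  latest=""
  for file in "$1"/*-summary.json; do
    # Skip the unexpanded pattern when no file matches.
    [ -e "$file" ] || continue
    # Glob results are sorted, so the last match has the highest timestamp prefix.
    latest="$file"
  done
  if [ -z "$latest" ]; then
    echo "No summary file found in $1" >&2
    return 1
  fi
  printf '%s\n' "$latest"
}
```

For example, `latest_summary /tmp/catalog` would print the path of the newest summary file in the folder used in the sample log.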

Credentials

  • -H, --host: Databricks workspace hostname (for example, https://dbc-abc12345-6789.cloud.databricks.com)
  • -p, --http-path: SQL warehouse path (for example, /sql/1.0/warehouses/xxxxx)
  • --client-id: OAuth client ID from your service principal
  • --client-secret: OAuth client secret from your service principal

Other Arguments

  • -o, --output: target folder to store the extracted files
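As an illustration, the credential and output arguments listed above can be combined into a single call. This is only a sketch: the values are the placeholders used on this page, and the `command -v` guard is an addition for illustration rather than something the package requires.

```shell
# Placeholder values copied from the examples on this page; replace them with your own.
host="https://dbc-abc12345-6789.cloud.databricks.com"
http_path="/sql/1.0/warehouses/xxxxx"
client_id="your-client-id"
client_secret="your-client-secret"
output_dir="/tmp/catalog"

# Run the extractor only if it is available on the PATH.
if command -v castor-extract-databricks >/dev/null 2>&1; then
  castor-extract-databricks \
    --host "$host" \
    --http-path "$http_path" \
    --client-id "$client_id" \
    --client-secret "$client_secret" \
    --output "$output_dir"
else
  echo "castor-extract-databricks not found: install castor-extractor first" >&2
fi
```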

Optional Arguments

  • --catalog-allowed: Catalogs you want to extract
  • --catalog-blocked: Catalogs you do not want to extract
  • --skip-existing: Skip files already extracted instead of replacing them

Help

You can also get help with the --help argument.

Use ENV Variables

If you don't want to specify arguments every time, you can export the following environment variables, for example in your .bashrc:

export CASTOR_DATABRICKS_HOST=https://dbc-abc12345-6789.cloud.databricks.com
export CASTOR_DATABRICKS_HTTP_PATH=/sql/1.0/warehouses/xxxxx
export CASTOR_DATABRICKS_CLIENT_ID=your-client-id
export CASTOR_DATABRICKS_CLIENT_SECRET=your-client-secret

export CASTOR_OUTPUT_DIRECTORY="/tmp/catalog_output"

Then the script can be executed without any arguments:

castor-extract-databricks

It can also be executed with only some of the arguments; for any argument you omit, the script falls back to the corresponding environment variable:

castor-extract-databricks --output /tmp/catalog
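When relying on environment variables, it can help to confirm they are all exported before launching the extraction. The following is a minimal sketch, assuming that a run without arguments needs all five variables listed above. The function name `check_castor_env` is hypothetical and not part of castor-extractor.

```shell
# Hypothetical helper, not part of castor-extractor.
# Returns 0 when every variable listed on this page is exported and non-empty.
check_castor_env() {
  missing=""
  for name in CASTOR_DATABRICKS_HOST CASTOR_DATABRICKS_HTTP_PATH \
              CASTOR_DATABRICKS_CLIENT_ID CASTOR_DATABRICKS_CLIENT_SECRET \
              CASTOR_OUTPUT_DIRECTORY; do
    # printenv only sees exported variables, which is what a child process receives.
    if [ -z "$(printenv "$name" || true)" ]; then
      missing="$missing $name"
    fi
  done
  if [ -n "$missing" ]; then
    echo "Missing environment variables:$missing" >&2
    return 1
  fi
  return 0
}
```

A typical use would be `check_castor_env && castor-extract-databricks`, so the extraction only starts when nothing is missing.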