Databricks
Extract Databricks metadata into Catalog using the castor-extractor package.
Prerequisites
Follow the castor-extractor installation instructions before running the extraction.
We recommend creating a dedicated service principal to extract your metadata.
Follow the instructions for creating a service principal on Databricks to create the OAuth credentials.
Run Extraction Script
Once the package has been installed, you should be able to run the following command in your terminal:
castor-extract-databricks [arguments]
The script runs and displays logs like the following:
INFO - Extracting `CATALOG` ...
INFO - Available databases: ['main', 'samples']
INFO - Extracted 8 databases to /tmp/catalog/1780550718-database.csv
INFO - Extracted 203 schemas to /tmp/catalog/1780550718-schema.csv
...
INFO - Extracted 128 view_ddl to /tmp/catalog/1780550718-view_ddl.csv
INFO - Wrote output file: /tmp/catalog/1780550718-summary.json
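The log lines above suggest that all files from one run share the same numeric prefix. As a minimal sketch under that assumption, the helper below lists the files written by the most recent run in an output folder. The name `latest_run_files` is hypothetical and not part of castor-extractor, and the sketch also assumes that a larger prefix means a more recent run.

```shell
# Hypothetical helper, not part of castor-extractor.
# Assumes output files are named <numeric-prefix>-<name>.<ext>, as in the
# sample logs, and that a larger prefix means a more recent run.
latest_run_files() {
  dir="$1"
  # Collect the numeric prefixes and keep the largest one.
  latest=$(ls "$dir" | sed -n 's/^\([0-9][0-9]*\)-.*/\1/p' | sort -n | tail -n 1)
  # No matching file: report failure to the caller.
  [ -n "$latest" ] || return 1
  ls "$dir"/"$latest"-*
}
```

For example, `latest_run_files /tmp/catalog` would print the paths of the CSV and summary files from the latest run, or return a non-zero status if the folder holds no extracted files.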
Credentials
-H, --host: Databricks workspace hostname (for example, https://dbc-abc12345-6789.cloud.databricks.com)
-p, --http-path: SQL warehouse path (for example, /sql/1.0/warehouses/xxxxx)
--client-id: OAuth client ID from your service principal
--client-secret: OAuth client secret from your service principal
Other Arguments
-o, --output: target folder to store the extracted files
Optional Arguments
--catalog-allowed: Catalogs you want to extract
--catalog-blocked: Catalogs you do not want to extract
--skip-existing: Skip files already extracted instead of replacing them
You can also get help with the --help argument.
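Putting the credential and output arguments together, a complete invocation might look like the sketch below. Every value is a placeholder taken from the argument descriptions above, and the wrapper function `run_databricks_extraction` is a hypothetical name used only for illustration. If the command is not installed, the sketch prints the command instead of running it.

```shell
# Hypothetical wrapper; the flags are those documented above,
# the values are placeholders to replace with your own.
run_databricks_extraction() {
  set -- \
    --host "https://dbc-abc12345-6789.cloud.databricks.com" \
    --http-path "/sql/1.0/warehouses/xxxxx" \
    --client-id "your-client-id" \
    --client-secret "your-client-secret" \
    --output "/tmp/catalog"

  if command -v castor-extract-databricks >/dev/null 2>&1; then
    castor-extract-databricks "$@"
  else
    # Package not installed yet: show what would have been executed.
    echo "castor-extract-databricks not found on PATH. Command that would run:"
    echo "castor-extract-databricks $*"
  fi
}

run_databricks_extraction
```

Passing the client secret on the command line can leave it in your shell history, which is one reason to prefer environment variables for repeated runs.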
Use ENV Variables
If you don't want to specify arguments every time, you can set the following environment variables in your .bashrc:
export CASTOR_DATABRICKS_HOST=https://dbc-abc12345-6789.cloud.databricks.com
export CASTOR_DATABRICKS_HTTP_PATH=/sql/1.0/warehouses/xxxxx
export CASTOR_DATABRICKS_CLIENT_ID=your-client-id
export CASTOR_DATABRICKS_CLIENT_SECRET=your-client-secret
export CASTOR_OUTPUT_DIRECTORY="/tmp/catalog_output"
Then the script can be executed without any arguments:
castor-extract-databricks
It can also be executed with partial arguments (the script falls back to the environment variables for any argument that is not provided):
castor-extract-databricks --output /tmp/catalog
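Before running the script without arguments, it can help to confirm that the variables are actually set in the current shell. The following is a minimal sketch of such a check. The function name `missing_castor_vars` is hypothetical, and treating all five variables as required is an assumption based on the list above.

```shell
# Variable names copied from the export lines above.
required_vars="CASTOR_DATABRICKS_HOST CASTOR_DATABRICKS_HTTP_PATH CASTOR_DATABRICKS_CLIENT_ID CASTOR_DATABRICKS_CLIENT_SECRET CASTOR_OUTPUT_DIRECTORY"

# Hypothetical helper: print the name of every variable that is unset or empty.
missing_castor_vars() {
  for name in $required_vars; do
    # Indirect lookup of the variable called $name (POSIX-compatible).
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      echo "$name"
    fi
  done
}

missing=$(missing_castor_vars)
if [ -n "$missing" ]; then
  echo "Missing environment variables:" >&2
  echo "$missing" >&2
else
  echo "All variables are set; castor-extract-databricks can run without arguments."
fi
```

Any variable reported as missing can still be supplied on the command line with its corresponding argument.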