Castor Extractor

The castor-extractor package is a set of command-line tools that connect to your data sources, read metadata only, and write local JSON or CSV files. You upload those files to Catalog when an integration is client managed: Catalog does not connect to your BI tool or warehouse directly. You control credentials, scheduling, and network access on your side.

This page belongs to Developers, the documentation area for engineers automating Catalog. It documents castor-extractor, an installable Python package.

When to Use the Extractor

Choose client-managed extraction when Catalog cannot reach your source over the network, or when your security policy requires metadata to leave your environment only as files you send.

Typical cases include:

  • BI tools listed as client managed in Quick Set Up (for example, a one-time load during a trial, or scheduled daily pushes to a Catalog GCP bucket after the trial).
  • Warehouses or other sources where you run extraction on your own hosts instead of granting Catalog a persistent connection.
  • Custom or generic warehouse CSV flows where you build files yourself and validate them before upload.

If Catalog manages the connection for your technology, configure credentials in Settings > Integrations and Catalog runs extraction. You do not need the extractor for that path.

How Extraction Fits in Catalog Setup

The extractor sits between your systems and Catalog ingestion. Catalog never reads your warehouse or BI tool during client-managed sync; it ingests the files you upload.

In practice, follow this order:

  1. Install the castor-extractor package and the extra for your source (for example [tableau] or [snowflake]). See Installation below.
  2. Extract metadata with the CLI for your technology (for example castor-extract-tableau). Output lands in a directory you choose.
  3. Optionally validate warehouse CSV files with File Checker before upload.
  4. Upload files with castor-upload using your Catalog token and source ID.
  5. Wait for ingestion in Catalog after the upload completes.

Technology-specific credentials, flags, and environment variables are in the BI Tools extraction guides and Warehouse extraction guides.
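
The order above can be sketched as one shell function. This is a minimal sketch, not an official wrapper: run, castor_sync, DRY_RUN, and VALIDATE_CSV are hypothetical names introduced here, Snowflake stands in for your source, and the function assumes the CASTOR_SNOWFLAKE_* and CASTOR_UPLOADER_* variables described under Using Environment Variables are already exported. It also assumes castor-file-check exits with a non-zero status when validation fails.

```shell
# run: execute a command, or only print it when DRY_RUN=1 (hypothetical helper).
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    printf '+ %s\n' "$*"
  else
    "$@"
  fi
}

# castor_sync: extract, optionally validate, then upload from one directory.
# Expects the CASTOR_SNOWFLAKE_* and CASTOR_UPLOADER_* variables to be exported.
castor_sync() {
  sync_dir="${1:-/tmp/castor}"

  # Point both the extractor and the uploader at the same directory.
  export CASTOR_OUTPUT_DIRECTORY="$sync_dir"
  export CASTOR_UPLOADER_DIRECTORY_PATH="$sync_dir"

  run mkdir -p "$sync_dir" || return 1
  run castor-extract-snowflake || return 1

  # Only relevant for generic warehouse CSV flows.
  # Assumption: a failed check returns a non-zero exit status.
  if [ "${VALIDATE_CSV:-0}" = "1" ]; then
    run castor-file-check --directory "$sync_dir" || return 1
  fi

  run castor-upload
}
```

Calling castor_sync /tmp/castor with DRY_RUN=1 exported prints the commands without running them, which is a cheap way to review the sequence before scheduling it.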

What Gets Extracted

The package writes structured metadata files to disk. Asset types depend on the source.

Warehouse Assets

Warehouse metadata typically includes:

  • databases
  • schemas
  • tables
  • columns
  • queries

Visualization Assets

Visualization metadata typically includes:

  • dashboards
  • users
  • folders

Knowledge Assets

Knowledge content can come from tools such as Confluence and Notion when you use those extractors.

The package also includes utilities to validate files and push metadata into Catalog:

  • File Checker to validate your Warehouse Importer CSV files before upload
  • castor-upload to push extracted files to Google Cloud Storage (GCS) for Catalog ingestion

Before You Begin

  • Python 3.10, 3.11, 3.12, or 3.13.
  • A source ID from Catalog for the integration you are loading. The upload CLI flag is --source_id.
  • A Catalog API token for upload.
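
The Python requirement can be checked before installing. The helper below is a small sketch; python_supported is a hypothetical name, and it only compares the major and minor numbers against the versions listed above.

```shell
# python_supported: return 0 when a "major.minor[.patch]" version string
# falls in the supported range (3.10 through 3.13), non-zero otherwise.
python_supported() {
  version="$1"
  major="${version%%.*}"   # text before the first dot
  rest="${version#*.}"     # text after the first dot
  minor="${rest%%.*}"      # text before the next dot, if any
  [ "$major" = "3" ] && [ "$minor" -ge 10 ] 2>/dev/null && [ "$minor" -le 13 ] 2>/dev/null
}

# Usage:
#   python_supported "$(python --version | cut -d' ' -f2)" \
#     || echo "unsupported Python version" >&2
```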

Installation

Create an Isolated Environment

We recommend creating an isolated Python environment. The following example uses pyenv:

brew install pyenv pyenv-virtualenv

pyenv install -v 3.11.9
pyenv virtualenv 3.11.9 castor-env
pyenv shell castor-env
python --version # should print Python 3.11.9

If pyenv shell doesn't work, add the following lines to your shell profile and restart your terminal:

eval "$(pyenv init -)"
eval "$(pyenv init --path)"
eval "$(pyenv virtualenv-init -)"

Install the Package

pip install --upgrade pip
pip install castor-extractor

Most sources need an extra. Install every extra with [all], or only the ones you use. Quote the package name so that shells such as zsh do not treat the square brackets as a glob pattern:

pip install "castor-extractor[all]"
# or only the integrations you need, for example:
pip install "castor-extractor[bigquery]"
pip install "castor-extractor[count]"
pip install "castor-extractor[databricks]"
pip install "castor-extractor[glue-athena]"
pip install "castor-extractor[looker]"
pip install "castor-extractor[lookerstudio]"
pip install "castor-extractor[metabase]"
pip install "castor-extractor[mysql]"
pip install "castor-extractor[omni]"
pip install "castor-extractor[powerbi]"
pip install "castor-extractor[qlik]"
pip install "castor-extractor[postgres]"
pip install "castor-extractor[redshift]"
pip install "castor-extractor[snowflake]"
pip install "castor-extractor[sqlserver]"
pip install "castor-extractor[strategy]"
pip install "castor-extractor[tableau]"

Create an Output Directory

mkdir -p /tmp/castor

Example: Extract Tableau Metadata

After you install castor-extractor[tableau], run the Tableau extractor from a terminal. The command writes JSON files under your output directory and prints progress in the log.

castor-extract-tableau \
--username YOUR_TABLEAU_USER \
--password YOUR_TABLEAU_PASSWORD \
--server-url https://YOUR_SITE.online.tableau.com \
--site-id YOUR_SITE_ID \
--output /tmp/castor

Example log output:

INFO - Logging in using user and password authentication
INFO - Signed into https://eu-west-1a.online.tableau.com as user with id ****
INFO - Extracting USER from Tableau API
INFO - Fetching USER
INFO - Querying all users on site
INFO - Wrote output file: /tmp/castor/1649078755-custom_sql_queries.json
INFO - Wrote output file: /tmp/castor/1649078755-summary.json

For PAT sign-in, flags, and environment variables, see Tableau extraction.
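
To see what a run produced, you can list the files from the most recent extraction. This sketch assumes that all files from one run share the numeric prefix shown in the log lines above (for example 1649078755-summary.json) and that a larger prefix means a later run. latest_extract_files is a hypothetical helper, not part of the package.

```shell
# latest_extract_files: print the files of the most recent run in a directory.
# Assumption: file names look like <numeric prefix>-<name>, as in the sample log.
latest_extract_files() {
  dir="${1:-/tmp/castor}"
  latest=""

  # Find the largest numeric prefix among the files in the directory.
  for path in "$dir"/*-*; do
    [ -e "$path" ] || continue
    name="${path##*/}"
    prefix="${name%%-*}"
    case "$prefix" in
      ''|*[!0-9]*) continue ;;   # skip files without a numeric prefix
    esac
    if [ -z "$latest" ] || [ "$prefix" -gt "$latest" ]; then
      latest="$prefix"
    fi
  done

  # Nothing matched: report failure to the caller.
  [ -n "$latest" ] || return 1

  for path in "$dir/$latest"-*; do
    printf '%s\n' "$path"
  done
}
```

Running latest_extract_files /tmp/castor after an extraction prints one path per line, and returns a non-zero status when the directory holds no matching files.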

1. Extract Metadata

You can either run an extractor for your technology or place files you built yourself in the output directory.

Using an Extractor Package

After you install the extra for your source (for example, pip install "castor-extractor[snowflake]"), run the matching extractor. It writes metadata files to the output directory created during setup, ready to upload to Catalog.

castor-extract-snowflake \
--account xy12345.eu-west-1 \
--user svc_castor \
--password secret \
--output /tmp/castor

See the full list in Castor Extractor Reference.

Using a Folder

Place your generic CSV files in the output directory.

2. Validate Data (Optional)

This step is optional. Run castor-file-check to validate your generic warehouse CSV files before uploading.

castor-file-check --directory /tmp/castor

See Castor Extractor Reference for all castor-file-check flags.

3. Upload to Catalog

You need a source ID and Catalog API token from Catalog.

castor-upload \
--token YOUR_CATALOG_TOKEN \
--source_id YOUR_SOURCE_ID \
--file_type WAREHOUSE \
--zone EU \
--directory_path /tmp/castor

See Castor Extractor Reference for all castor-upload flags and zone selection.

Using Environment Variables

You can set environment variables for extract and upload commands instead of passing every flag on the command line.

Snowflake extractor example:

export CASTOR_SNOWFLAKE_ACCOUNT="xy12345.eu-west-1"
export CASTOR_SNOWFLAKE_USER="svc_castor"
export CASTOR_SNOWFLAKE_PASSWORD="secret"
export CASTOR_OUTPUT_DIRECTORY="/tmp/castor"

castor-extract-snowflake

Castor uploader example:
export CASTOR_UPLOADER_TOKEN="your-token"
export CASTOR_UPLOADER_SOURCE_ID="your-source-id"
export CASTOR_UPLOADER_FILE_TYPE="WAREHOUSE"
export CASTOR_UPLOADER_ZONE="US"
export CASTOR_UPLOADER_DIRECTORY_PATH="/tmp/castor"

castor-upload
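
When a command relies on environment variables, a missing one is easy to overlook in a scheduled job. The following sketch fails fast before the CLI runs. require_env is a hypothetical helper, and it uses printenv so that only exported variables count, since those are the ones a child process can see.

```shell
# require_env: return non-zero if any named variable is not exported or is empty.
require_env() {
  missing=0
  for name in "$@"; do
    # printenv only reports variables present in the environment.
    if [ -z "$(printenv "$name")" ]; then
      echo "missing environment variable: $name" >&2
      missing=1
    fi
  done
  return "$missing"
}

# Usage:
#   require_env CASTOR_UPLOADER_TOKEN CASTOR_UPLOADER_SOURCE_ID \
#     CASTOR_UPLOADER_FILE_TYPE CASTOR_UPLOADER_ZONE \
#     CASTOR_UPLOADER_DIRECTORY_PATH && castor-upload
```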

Troubleshooting

If uploads time out or fail intermittently, you can increase the timeout or configure retries with environment variables:

  • CASTOR_TIMEOUT_OVERRIDE: seconds before timeout (default 60)
  • CASTOR_RETRY_OVERRIDE: number of retries (default 1)
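
As an illustration, the two overrides can be exported in the shell that runs the upload. The values below are arbitrary examples, not recommendations.

```shell
# Wait up to 5 minutes before timing out (default is 60 seconds).
export CASTOR_TIMEOUT_OVERRIDE=300
# Retry up to 3 times (default is 1).
export CASTOR_RETRY_OVERRIDE=3

# Then rerun the upload in the same shell so it inherits both settings:
#   castor-upload
```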

What's Next?