Castor Extractor
The castor-extractor package is a set of command-line tools that connect to your data sources, read metadata only, and write local JSON or CSV files. When an integration is client managed, you upload those files to Catalog; Catalog does not connect to your BI tool or warehouse directly. You control credentials, scheduling, and network access on your side.
Developers is the documentation area for engineers automating Catalog; castor-extractor is the installable Python package documented in this section.
When to Use the Extractor
Choose client-managed extraction when Catalog cannot reach your source over the network, or when your security policy requires metadata to leave your environment only as files you send.
Typical cases include:
- BI tools listed as client managed in Quick Set Up (for example, a one-time load during a trial, or scheduled daily pushes to a Catalog GCP bucket after the trial).
- Warehouses or other sources where you run extraction on your own hosts instead of granting Catalog a persistent connection.
- Custom or generic warehouse CSV flows where you build files yourself and validate them before upload.
If Catalog manages the connection for your technology, configure credentials in Settings > Integrations and Catalog runs extraction. You do not need the extractor for that path.
How Extraction Fits in Catalog Setup
The extractor sits between your systems and Catalog ingestion. Catalog never reads your warehouse or BI tool during client-managed sync; it ingests the files you upload.
In practice, follow this order:
- Install the castor-extractor package and the extra for your source (for example [tableau] or [snowflake]). See Installation below.
- Extract metadata with the CLI for your technology (for example castor-extract-tableau). Output lands in a directory you choose.
- Optionally validate warehouse CSV files with File Checker before upload.
- Upload files with castor-upload using your Catalog token and source ID.
- Wait for ingestion in Catalog after the upload completes.
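The order above can be scripted. The following is a minimal Python sketch, not part of castor-extractor: build_steps and run_steps are hypothetical helper names, and the flags are only the ones shown in the examples on this page. It assumes each CLI exits with a non-zero status on failure, which this page does not state.

```python
import subprocess
from typing import Callable, List


def build_steps(
    extract_cmd: List[str],
    output_dir: str,
    token: str,
    source_id: str,
    file_type: str = "WAREHOUSE",
    zone: str = "EU",
    validate: bool = False,
) -> List[List[str]]:
    """Return the CLI invocations in the order described above (hypothetical helper)."""
    # Step 2: extraction, writing into the chosen output directory.
    steps = [list(extract_cmd) + ["--output", output_dir]]
    if validate:
        # Step 3 (optional): File Checker for generic warehouse CSV files.
        steps.append(["castor-file-check", "--directory", output_dir])
    # Step 4: upload with the Catalog token and source ID.
    steps.append([
        "castor-upload",
        "--token", token,
        "--source_id", source_id,
        "--file_type", file_type,
        "--zone", zone,
        "--directory_path", output_dir,
    ])
    return steps


def run_steps(steps: List[List[str]], runner: Callable = subprocess.run) -> None:
    """Run each step in order; check=True stops at the first non-zero exit."""
    for step in steps:
        runner(step, check=True)
```

A call such as run_steps(build_steps(["castor-extract-snowflake"], "/tmp/castor", token, source_id)) would then run extraction followed by upload. Source credentials are left out of the sketch on purpose; they can be supplied as described in Using Environment Variables.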
Technology-specific credentials, flags, and environment variables are in the BI Tools extraction guides and Warehouse extraction guides.
What Gets Extracted
The package writes structured metadata files to disk. Asset types depend on the source.
Warehouse Assets
Warehouse metadata typically includes:
- databases
- schemas
- tables
- columns
- queries
Visualization Assets
Visualization metadata typically includes:
- dashboards
- users
- folders
Knowledge Assets
Knowledge content can come from tools such as Confluence and Notion when you use those extractors.
The package also includes utilities to push metadata into Catalog:
- File Checker to validate your Warehouse Importer CSV files before upload
- castor-upload to push extracted files to Google Cloud Storage (GCS) for Catalog ingestion
Before You Begin
- Castor Extractor requires Python 3.10, 3.11, 3.12, or 3.13.
- A source ID from Catalog for the integration you are loading. The upload CLI flag is --source_id.
- A Catalog API token for upload.
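A scheduled job can check the interpreter version before it calls any extractor. The sketch below only encodes the version list given above; is_supported and require_supported_python are hypothetical names, not functions of the package.

```python
import sys
from typing import Sequence

# Minor versions listed as supported in the requirement above.
SUPPORTED = {(3, 10), (3, 11), (3, 12), (3, 13)}


def is_supported(version: Sequence[int] = sys.version_info) -> bool:
    """True when the (major, minor) pair is in the supported list."""
    return (version[0], version[1]) in SUPPORTED


def require_supported_python() -> None:
    """Exit with a readable message when the running interpreter is unsupported."""
    if not is_supported():
        found = ".".join(str(part) for part in sys.version_info[:3])
        raise SystemExit(f"castor-extractor needs Python 3.10 to 3.13, found {found}")
```

Calling require_supported_python() at the top of a wrapper script fails fast instead of surfacing an installation or import error later.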
Installation
Create an Isolated Environment
We recommend creating an isolated Python environment. The following example uses pyenv:
brew install pyenv pyenv-virtualenv
pyenv install -v 3.11.9
pyenv virtualenv 3.11.9 castor-env
pyenv shell castor-env
python --version # should print 3.11.9
If pyenv shell doesn't work, add the following lines to your shell profile and restart your terminal:
eval "$(pyenv init -)"
eval "$(pyenv init --path)"
eval "$(pyenv virtualenv-init -)"
Install the Package
pip install --upgrade pip
pip install castor-extractor
Most sources need an extra. Install only the ones you use. The quotes keep shells such as zsh from treating the square brackets as a glob pattern:
pip install "castor-extractor[all]"
# or only one integration, for example:
pip install "castor-extractor[bigquery]"
pip install "castor-extractor[count]"
pip install "castor-extractor[databricks]"
pip install "castor-extractor[glue-athena]"
pip install "castor-extractor[looker]"
pip install "castor-extractor[lookerstudio]"
pip install "castor-extractor[metabase]"
pip install "castor-extractor[mysql]"
pip install "castor-extractor[omni]"
pip install "castor-extractor[powerbi]"
pip install "castor-extractor[qlik]"
pip install "castor-extractor[postgres]"
pip install "castor-extractor[redshift]"
pip install "castor-extractor[snowflake]"
pip install "castor-extractor[sqlserver]"
pip install "castor-extractor[strategy]"
pip install "castor-extractor[tableau]"
Create an Output Directory
mkdir -p /tmp/castor
Example: Extract Tableau Metadata
After you install castor-extractor[tableau], run the Tableau extractor from a terminal. The command writes JSON files under your output directory and prints progress in the log.
castor-extract-tableau \
--username YOUR_TABLEAU_USER \
--password YOUR_TABLEAU_PASSWORD \
--server-url https://YOUR_SITE.online.tableau.com \
--site-id YOUR_SITE_ID \
--output /tmp/castor
Example log output:
INFO - Logging in using user and password authentication
INFO - Signed into https://eu-west-1a.online.tableau.com as user with id ****
INFO - Extracting USER from Tableau API
INFO - Fetching USER
INFO - Querying all users on site
INFO - Wrote output file: /tmp/castor/1649078755-custom_sql_queries.json
INFO - Wrote output file: /tmp/castor/1649078755-summary.json
For PAT sign-in, flags, and environment variables, see Tableau extraction.
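The file names in the example log share a numeric prefix. Assuming that prefix identifies one extraction run (it looks like a Unix timestamp, but this page does not say so), a short Python sketch can group the files in the output directory by run. group_by_run and latest_run are hypothetical helpers written for this illustration.

```python
import re
from collections import defaultdict
from typing import Dict, Iterable, List, Optional, Tuple

# Matches names such as "1649078755-summary.json" from the example log.
_NAME = re.compile(r"^(\d+)-(.+)\.json$")


def group_by_run(filenames: Iterable[str]) -> Dict[int, List[str]]:
    """Map each numeric prefix to the sorted asset names written with it."""
    runs = defaultdict(list)
    for name in filenames:
        match = _NAME.match(name)
        if match:  # skip files that do not follow the pattern
            runs[int(match.group(1))].append(match.group(2))
    return {prefix: sorted(names) for prefix, names in runs.items()}


def latest_run(filenames: Iterable[str]) -> Optional[Tuple[int, List[str]]]:
    """Return (prefix, asset names) for the highest prefix, or None if empty."""
    runs = group_by_run(filenames)
    if not runs:
        return None
    newest = max(runs)
    return newest, runs[newest]
```

For a real directory, the names could come from something like (p.name for p in pathlib.Path("/tmp/castor").iterdir()).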
1. Extract Metadata
Either run an extractor to generate the files, or place files you built yourself in the output directory.
Using an Extractor Package
After you install the extra for your source (for example castor-extractor[snowflake]), run its extractor to write metadata directly into the output directory you created during setup. The example below uses Snowflake:
castor-extract-snowflake \
--account xy12345.eu-west-1 \
--user svc_castor \
--password secret \
--output /tmp/castor
See the full list in Castor Extractor Reference.
Using a Folder
Place your generic CSV files in the output directory.
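If the CSV files are produced elsewhere, a small script can stage them. This is a minimal sketch using only the Python standard library; stage_csv_files is a hypothetical helper, and it assumes the files end in .csv and sit directly in one source folder.

```python
import shutil
from pathlib import Path
from typing import List


def stage_csv_files(source_dir: str, output_dir: str) -> List[str]:
    """Copy every *.csv file from source_dir into output_dir; return the names copied."""
    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)  # same effect as mkdir -p
    copied = []
    for src in sorted(Path(source_dir).glob("*.csv")):
        shutil.copy2(src, out / src.name)  # copy2 also preserves timestamps
        copied.append(src.name)
    return copied
```

The returned list makes it easy to log what was staged, or to stop before upload when nothing was copied.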
2. Validate Data (Optional)
This step is optional. Run castor-file-check to validate your generic warehouse CSV files before uploading.
castor-file-check --directory /tmp/castor
See Castor Extractor Reference for all castor-file-check flags.
3. Upload to Catalog
You need a source ID and Catalog API token from Catalog.
castor-upload \
--token YOUR_CATALOG_TOKEN \
--source_id YOUR_SOURCE_ID \
--file_type WAREHOUSE \
--zone EU \
--directory_path /tmp/castor
See Castor Extractor Reference for all castor-upload flags and zone selection.
Using Environment Variables
You can set environment variables for extract and upload commands instead of passing every flag on the command line.
export CASTOR_SNOWFLAKE_ACCOUNT="xy12345.eu-west-1"
export CASTOR_SNOWFLAKE_USER="svc_castor"
export CASTOR_SNOWFLAKE_PASSWORD="secret"
export CASTOR_OUTPUT_DIRECTORY="/tmp/castor"
castor-extract-snowflake
export CASTOR_UPLOADER_TOKEN="your-token"
export CASTOR_UPLOADER_SOURCE_ID="your-source-id"
export CASTOR_UPLOADER_FILE_TYPE="WAREHOUSE"
export CASTOR_UPLOADER_ZONE="US"
export CASTOR_UPLOADER_DIRECTORY_PATH="/tmp/castor"
castor-upload
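The same variables can be set from a Python wrapper instead of a shell profile, which keeps the token out of the command line. The sketch below uses only the variable names shown above; uploader_env and run_upload are hypothetical helpers, and the runner parameter exists only so the call can be replaced in tests.

```python
import os
import subprocess
from typing import Callable, Dict, Mapping, Optional


def uploader_env(
    token: str,
    source_id: str,
    file_type: str,
    zone: str,
    directory_path: str,
    base: Optional[Mapping[str, str]] = None,
) -> Dict[str, str]:
    """Return a copy of the environment with the CASTOR_UPLOADER_* variables set."""
    env = dict(os.environ if base is None else base)
    env.update({
        "CASTOR_UPLOADER_TOKEN": token,
        "CASTOR_UPLOADER_SOURCE_ID": source_id,
        "CASTOR_UPLOADER_FILE_TYPE": file_type,
        "CASTOR_UPLOADER_ZONE": zone,
        "CASTOR_UPLOADER_DIRECTORY_PATH": directory_path,
    })
    return env


def run_upload(env: Mapping[str, str], runner: Callable = subprocess.run):
    """Run castor-upload with no flags; all settings come from env."""
    return runner(["castor-upload"], env=dict(env), check=True)
```

A job would typically read the token from a secret store and call run_upload(uploader_env(token, source_id, "WAREHOUSE", "US", "/tmp/castor")).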
Troubleshooting
If you have problems uploading your files, you can increase the timeout or configure retries with environment variables:
- CASTOR_TIMEOUT_OVERRIDE: seconds before timeout (default 60)
- CASTOR_RETRY_OVERRIDE: number of retries (default 1)
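When the upload runs from a script, the overrides can be applied to that one process only. This is a sketch under assumptions: with_upload_overrides is a hypothetical helper, and it assumes both variables take whole numbers, which this page implies but does not state.

```python
from typing import Dict, Mapping, Optional


def with_upload_overrides(
    env: Mapping[str, str],
    timeout_seconds: Optional[int] = None,
    retries: Optional[int] = None,
) -> Dict[str, str]:
    """Return a copy of env with the timeout and retry overrides added."""
    out = dict(env)
    if timeout_seconds is not None:
        if timeout_seconds <= 0:
            raise ValueError("timeout_seconds must be positive")
        out["CASTOR_TIMEOUT_OVERRIDE"] = str(int(timeout_seconds))
    if retries is not None:
        if retries < 0:
            raise ValueError("retries must not be negative")
        out["CASTOR_RETRY_OVERRIDE"] = str(int(retries))
    return out
```

The result can be passed as the env argument of subprocess.run(["castor-upload", ...], env=...), leaving the rest of the session at the defaults.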
What's Next?
- BI Tools extraction guides for Looker, Tableau, Metabase, and other visualization sources
- Warehouse extraction guides for Snowflake, Postgres, BigQuery, and other warehouses
- File Checker for validating generic warehouse CSV files
- Quick Set Up for Catalog-managed versus client-managed onboarding
- Warehouse Importer when you supply generic warehouse CSV files