Castor Extractor

Castor Extractor is a set of command-line tools that pull metadata from your data stack and output local JSON and CSV files. You can then upload those files to Coalesce.

Castor Extractor pulls 3 kinds of assets:

  • Warehouse: Databases, schemas, tables, columns, and queries.
  • Visualization: Dashboards, users, and folders.
  • Knowledge: Content from tools such as Confluence and Notion.

How It Works

The extractor works in the following way:

  1. Extract metadata from your source system into local files, using credentials for that system (for example, Snowflake or Looker).
  2. Review or validate files if needed.
  3. Upload files to Coalesce using your source_id and Catalog token.

Before You Begin

You need the following:

  • Python 3.10, 3.11, 3.12, or 3.13.
  • Your source_id, provided by Catalog.
  • Your Catalog token, provided by Catalog.
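
A quick pre-flight check for the Python requirement can be sketched in shell. The helper name python_supported is hypothetical and not part of Castor Extractor; it only compares a version string against the supported 3.10 to 3.13 range.

```shell
# Hypothetical helper (not part of Castor Extractor): succeed only if the
# given version string falls in the supported 3.10-3.13 range.
python_supported() {
  case "$1" in
    3.10|3.10.*|3.11|3.11.*|3.12|3.12.*|3.13|3.13.*) return 0 ;;
    *) return 1 ;;
  esac
}

# Check the interpreter on PATH, if there is one.
if command -v python >/dev/null 2>&1; then
  version="$(python --version 2>&1 | awk '{print $2}')"
  if python_supported "$version"; then
    echo "Python $version is supported"
  else
    echo "Python $version is not supported; use 3.10 to 3.13" >&2
  fi
fi
```

Running this inside the environment you plan to install into can catch a version mismatch before pip does.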

Installation

Create an Isolated Environment

We recommend creating an isolated Python environment. The following example uses pyenv:

brew install pyenv pyenv-virtualenv

pyenv install -v 3.11.9
pyenv virtualenv 3.11.9 castor-env
pyenv shell castor-env
python --version # should print 3.11.9

If pyenv shell doesn't work, add the following lines to your shell profile and restart your terminal:

eval "$(pyenv init -)"
eval "$(pyenv init --path)"
eval "$(pyenv virtualenv-init -)"

Install the Package

pip install --upgrade pip
pip install castor-extractor

Most sources need an extra. Install only the ones you use. Quote the package name so that shells such as zsh do not interpret the square brackets as a glob pattern:

pip install "castor-extractor[all]"
# or only one integration, for example:
pip install "castor-extractor[bigquery]"
pip install "castor-extractor[count]"
pip install "castor-extractor[databricks]"
pip install "castor-extractor[looker]"
pip install "castor-extractor[lookerstudio]"
pip install "castor-extractor[metabase]"
pip install "castor-extractor[mysql]"
pip install "castor-extractor[powerbi]"
pip install "castor-extractor[qlik]"
pip install "castor-extractor[postgres]"
pip install "castor-extractor[redshift]"
pip install "castor-extractor[snowflake]"
pip install "castor-extractor[sqlserver]"
pip install "castor-extractor[strategy]"
pip install "castor-extractor[tableau]"
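
If you need several integrations, pip accepts a comma-separated list of extras in one requirement. The sketch below builds such a requirement string; the helper name castor_requirement is hypothetical and only illustrates the syntax.

```shell
# Hypothetical helper (not part of Castor Extractor): join extra names into
# one requirement string such as castor-extractor[snowflake,looker].
castor_requirement() {
  extras=""
  for extra in "$@"; do
    extras="${extras:+$extras,}$extra"
  done
  if [ -n "$extras" ]; then
    printf 'castor-extractor[%s]\n' "$extras"
  else
    printf 'castor-extractor\n'
  fi
}

# Print the install command. The double quotes keep shells such as zsh
# from treating the square brackets as a glob pattern.
echo "pip install \"$(castor_requirement snowflake looker)\""
```

The extra names passed in should be ones from the list above.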

Create an Output Directory

mkdir -p /tmp/castor

1. Extract Metadata

You can either run an extractor or place your own files in the output directory.

Using an Extractor Package

Install one of the extractor packages, for example with pip install castor-extractor[snowflake]. You can then extract metadata directly into the output directory created during setup, ready to upload to Catalog.

castor-extract-snowflake \
--account xy12345.eu-west-1 \
--user svc_castor \
--password secret \
--output /tmp/castor

See the full list of extractors in Castor Extractor Reference.

Using a Folder

Place your generic CSV files in the output directory.

2. Validate Data (Optional)

This step is optional. Run castor-file-check to validate your generic warehouse CSV files before uploading:

castor-file-check --directory /tmp/castor

All flags:

  • -d, --directory: directory containing generic warehouse CSV files
  • --verbose: show detailed validation logs

3. Upload to Coalesce

You will need:

  • source_id: Provided by Catalog
  • Catalog API token: Provided by Catalog

castor-upload \
--token YOUR_CATALOG_TOKEN \
--source_id YOUR_SOURCE_ID \
--file_type WAREHOUSE \
--directory_path /tmp/castor

All flags:

  • -k, --token: API token from Coalesce. Required
  • -s, --source_id: source id from Coalesce. Required
  • -t, --file_type: file type (WAREHOUSE, VIZ, DBT, QUALITY)
  • -z, --zone: upload zone. Required
    • Use US if your instance is on app.us.castordoc.com.
    • Use EU if your instance is on app.castordoc.com.
  • -f, --file_path: upload one file
  • -d, --directory_path: upload all files in a directory

You can only use --file_path or --directory_path, not both.
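
A wrapper script can enforce this rule before calling the uploader. The following is a minimal sketch; upload_target_args is a hypothetical helper, not part of castor-upload, and it assumes paths without spaces.

```shell
# Hypothetical helper (not part of castor-upload): print the target flag
# for exactly one of a file path ($1) or a directory path ($2).
upload_target_args() {
  file_path="$1"
  directory_path="$2"
  if [ -n "$file_path" ] && [ -n "$directory_path" ]; then
    echo "error: set --file_path or --directory_path, not both" >&2
    return 1
  fi
  if [ -z "$file_path" ] && [ -z "$directory_path" ]; then
    echo "error: set one of --file_path or --directory_path" >&2
    return 1
  fi
  if [ -n "$file_path" ]; then
    printf '%s %s\n' "--file_path" "$file_path"
  else
    printf '%s %s\n' "--directory_path" "$directory_path"
  fi
}

# Directory upload: prints the flag and value to pass to castor-upload.
upload_target_args "" /tmp/castor
```

The printed flag and value can then be appended to a castor-upload command line.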

Using Environment Variables

Both the extractors and the uploader can read their settings from environment variables instead of command-line flags.

Snowflake extractor example:

export CASTOR_SNOWFLAKE_ACCOUNT="xy12345.eu-west-1"
export CASTOR_SNOWFLAKE_USER="svc_castor"
export CASTOR_SNOWFLAKE_PASSWORD="secret"
export CASTOR_OUTPUT_DIRECTORY="/tmp/castor"

castor-extract-snowflake

Castor uploader example:

export CASTOR_UPLOADER_TOKEN="your-token"
export CASTOR_UPLOADER_SOURCE_ID="your-source-id"
export CASTOR_UPLOADER_FILE_TYPE="WAREHOUSE"
export CASTOR_UPLOADER_ZONE="US"
export CASTOR_UPLOADER_DIRECTORY_PATH="/tmp/castor"

castor-upload
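
When commands are driven by the environment, a missing variable is easy to overlook. The sketch below checks the uploader variables from the example above before running castor-upload. The helper require_env is hypothetical, and treating all five variables as mandatory is an assumption made for illustration.

```shell
# Hypothetical helper (not part of Castor Extractor): report every named
# variable that is unset or empty in the environment.
require_env() {
  missing=0
  for name in "$@"; do
    if [ -z "$(printenv "$name")" ]; then
      echo "missing environment variable: $name" >&2
      missing=1
    fi
  done
  return "$missing"
}

# Check the uploader variables before calling castor-upload.
if require_env CASTOR_UPLOADER_TOKEN CASTOR_UPLOADER_SOURCE_ID \
               CASTOR_UPLOADER_FILE_TYPE CASTOR_UPLOADER_ZONE \
               CASTOR_UPLOADER_DIRECTORY_PATH; then
  echo "uploader environment looks complete"
fi
```

Because printenv only sees exported variables, the check also catches values that were assigned without export.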

Troubleshooting

If uploads fail or time out, you can increase the timeout or the number of retries by setting these environment variables:

  • CASTOR_TIMEOUT_OVERRIDE: number of seconds before timeout (default = 60)
  • CASTOR_RETRY_OVERRIDE: number of retries (default = 1)
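
For example, the overrides can be exported in the same shell session before the upload. The values 180 and 3 are illustrative assumptions, not recommendations.

```shell
# Give slow uploads more time and allow more retries.
# The values 180 and 3 are illustrative only.
export CASTOR_TIMEOUT_OVERRIDE=180   # seconds before timeout (default 60)
export CASTOR_RETRY_OVERRIDE=3       # number of retries (default 1)

# Then run the upload as usual, for example:
# castor-upload --token YOUR_CATALOG_TOKEN --source_id YOUR_SOURCE_ID \
#   --file_type WAREHOUSE --directory_path /tmp/castor
```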

Castor Extractor Reference

Global Variables

These variables apply across all commands:

Variable                        Purpose
CASTOR_OUTPUT_DIRECTORY         Default output directory for all extractors.
GOOGLE_APPLICATION_CREDENTIALS  Default GCP credentials file for BigQuery and Looker Studio.

Zone Selection

  • Use US if your instance is on app.us.castordoc.com.
  • Use EU if your instance is on app.castordoc.com.
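
In a script, the same rule can be written as a small lookup. The helper zone_for_host below is hypothetical and covers only the two hosts named above.

```shell
# Hypothetical helper (not part of Castor Extractor): map the app host
# to the upload zone.
zone_for_host() {
  case "$1" in
    app.us.castordoc.com) echo "US" ;;
    app.castordoc.com)    echo "EU" ;;
    *) echo "unknown host: $1" >&2; return 1 ;;
  esac
}

# Set the uploader zone from the host you sign in to.
CASTOR_UPLOADER_ZONE="$(zone_for_host app.us.castordoc.com)"
export CASTOR_UPLOADER_ZONE
echo "$CASTOR_UPLOADER_ZONE"
```

Any other host returns an error rather than a guess, so a typo does not silently select the wrong zone.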

Upload and Validate

castor-file-check

Validate generic warehouse CSV files before upload.

  • -d, --directory: directory containing generic warehouse CSV files
  • --verbose: show detailed validation logs

castor-upload

Push extracted files to Coalesce-managed GCS.

  • -k, --token: API token from Coalesce
  • -s, --source_id: source id from Coalesce
  • -t, --file_type: file type (WAREHOUSE, VIZ, DBT, QUALITY)
    • WAREHOUSE: warehouse extractors.
    • VIZ: visualization extractors. Knowledge bases (Confluence and Notion) also use VIZ.
    • QUALITY: external data quality tools, along with generic CSV files.
  • -z, --zone: upload zone (US or EU, default EU)
  • -f, --file_path: upload one file
  • -d, --directory_path: upload all files in a directory

You can only use --file_path or --directory_path, not both.

Note: Use --help to see the most up-to-date flags, for example castor-extract-sqlserver --help.

Warehouse Extractors

These extractors use the upload file type WAREHOUSE.

castor-extract-bigquery

Flag                 Description
-c, --credentials    Path to Google credentials file.
-o, --output         Output directory.
--skip-existing      Keep previously extracted files.
--db-allowed <list>  Allowed GCP projects.
--db-blocked <list>  Blocked GCP projects.
-s, --safe-mode      Safe mode.

Environment variables:

  • GOOGLE_APPLICATION_CREDENTIALS

castor-extract-databricks

Flag                      Description
-H, --host                Databricks host.
-t, --token               Access token.
-p, --http-path           HTTP path.
-o, --output              Output directory.
--catalog-allowed <list>  Allowed catalogs.
--catalog-blocked <list>  Blocked catalogs.
--skip-existing           Keep previously extracted files.

Environment variables:

  • CASTOR_DATABRICKS_HOST
  • CASTOR_DATABRICKS_HTTP_PATH
  • CASTOR_DATABRICKS_TOKEN

castor-extract-mysql

Flag             Description
-H, --host       MySQL host.
-P, --port       MySQL port.
-u, --user       MySQL user.
-p, --password   MySQL password.
-o, --output     Output directory.
--skip-existing  Keep previously extracted files.

Environment variables:

  • CASTOR_MYSQL_USER
  • CASTOR_MYSQL_PASSWORD
  • CASTOR_MYSQL_HOST
  • CASTOR_MYSQL_PORT (optional)

castor-extract-postgres

Flag             Description
-H, --host       Postgres host.
-P, --port       Postgres port.
-d, --database   Postgres database.
-u, --user       Postgres user.
-p, --password   Postgres password.
-o, --output     Output directory.
--skip-existing  Keep previously extracted files.

Environment variables:

  • CASTOR_POSTGRES_USER
  • CASTOR_POSTGRES_PASSWORD
  • CASTOR_POSTGRES_HOST
  • CASTOR_POSTGRES_PORT
  • CASTOR_POSTGRES_DATABASE

castor-extract-redshift

Flag             Description
-H, --host       Redshift host.
-P, --port       Redshift port.
-d, --database   Redshift database.
-u, --user       Redshift user.
-p, --password   Redshift password.
-o, --output     Output directory.
--skip-existing  Keep previously extracted files.
--serverless     Extract from Redshift Serverless.

Environment variables:

  • CASTOR_REDSHIFT_USER
  • CASTOR_REDSHIFT_PASSWORD
  • CASTOR_REDSHIFT_HOST
  • CASTOR_REDSHIFT_PORT
  • CASTOR_REDSHIFT_DATABASE
  • CASTOR_REDSHIFT_SERVERLESS (optional; true/false)

castor-extract-snowflake

Flag                    Description
-a, --account           Snowflake account.
-u, --user              Snowflake user.
-p, --password          Password. Mutually exclusive with --private-key.
-pk, --private-key      Private key. Mutually exclusive with --password.
--warehouse             Warehouse override.
--role                  Role override.
--db-allowed <list>     Allowed databases.
--db-blocked <list>     Blocked databases.
--query-blocked <list>  Blocked query patterns. Supports % and _ wildcards.
--fetch-transient       Include transient tables.
--insecure-mode         Disable OCSP checking.
-o, --output            Output directory.
--skip-existing         Keep previously extracted files.

Environment variables:

  • CASTOR_SNOWFLAKE_ACCOUNT
  • CASTOR_SNOWFLAKE_USER
  • CASTOR_SNOWFLAKE_PASSWORD (optional if using private key)
  • CASTOR_SNOWFLAKE_PRIVATE_KEY (optional if using password)
  • CASTOR_SNOWFLAKE_INSECURE_MODE (optional)

castor-extract-sqlserver

Flag                 Description
-H, --host           MSSQL host.
-P, --port           MSSQL port.
-u, --user           MSSQL user.
-p, --password       MSSQL password.
-s, --skip-queries   Skip SQL query extraction.
--db-allowed <list>  Allowed databases.
--db-blocked <list>  Blocked databases.
--default-db         Fallback database for login issues.
-o, --output         Output directory.
--skip-existing      Keep previously extracted files.

Environment variables:

  • CASTOR_MSSQL_USER
  • CASTOR_MSSQL_PASSWORD
  • CASTOR_MSSQL_HOST
  • CASTOR_MSSQL_PORT
  • CASTOR_MSSQL_DEFAULT_DB (optional)

Visualization Extractors

These extractors use the upload file type VIZ.

castor-extract-count

Flag               Description
-c, --credentials  GCP credentials as string.
-d, --dataset_id   Data set ID storing Count data.
-o, --output       Output directory.

castor-extract-domo

Flag                   Description
-b, --base-url         Domo host.
-a, --api-token        API token.
-d, --developer-token  Developer token.
-c, --client-id        Client ID.
-C, --cloud-id         External warehouse ID.
-o, --output           Output directory.

Environment variables:

  • CASTOR_DOMO_API_TOKEN
  • CASTOR_DOMO_BASE_URL
  • CASTOR_DOMO_CLIENT_ID
  • CASTOR_DOMO_DEVELOPER_TOKEN
  • CASTOR_DOMO_CLOUD_ID
  • CLOUD_ID

castor-extract-looker

Flag                 Description
-b, --base-url       Looker base URL.
-c, --client-id      Client ID.
-s, --client-secret  Client secret.
-t, --timeout        Timeout in seconds.
--thread-pool-size   Thread pool size.
-S, --safe-mode      Safe mode.
--log-to-stdout      Log to stdout.
--search-per-folder  Fetch looks and dashboards per folder.
-o, --output         Output directory.

Environment variables:

  • CASTOR_LOOKER_BASE_URL
  • CASTOR_LOOKER_CLIENT_ID
  • CASTOR_LOOKER_CLIENT_SECRET
  • CASTOR_LOOKER_TIMEOUT_SECOND (optional override)
  • CASTOR_LOOKER_PAGE_SIZE (optional override)
  • CASTOR_LOOKER_THREAD_POOL_SIZE (optional override)
  • CASTOR_LOOKER_IS_SAFE_MODE (optional; true/false)
  • CASTOR_LOOKER_LOG_TO_STDOUT (optional; true/false)
  • CASTOR_LOOKER_SEARCH_PER_FOLDER (optional; true/false)

castor-extract-looker-studio

Flag                        Description
-o, --output                Output directory.
--source-queries-only       Only extract BigQuery source queries.
--skip-view-activity-logs   Skip activity log extraction.
-c, --credentials           Service account credentials file.
-a, --admin-email           Google Workspace admin email.
--users-file-path           Path to JSON array of user emails.
-b, --bigquery-credentials  BigQuery service account credentials file.
--db-allowed <list>         Allowed GCP projects for source queries.
--db-blocked <list>         Blocked GCP projects for source queries.

Environment variables:

  • GOOGLE_APPLICATION_CREDENTIALS
  • CASTOR_OUTPUT_DIRECTORY

castor-extract-metabase-api

Flag            Description
-b, --base-url  Metabase base URL.
-u, --user      Username.
-p, --password  Password.
-o, --output    Output directory.

Environment variables:

  • CASTOR_METABASE_API_BASE_URL
  • CASTOR_METABASE_API_USERNAME
  • CASTOR_METABASE_API_USER
  • CASTOR_METABASE_API_PASSWORD

castor-extract-metabase-db

Flag                         Description
-H, --host                   Host.
-P, --port                   Port.
-d, --database               Database.
-s, --schema                 Schema.
-u, --user                   Username.
-p, --password               Password.
-k, --encryption_secret_key  Encryption key.
--require_ssl                Require SSL.
-o, --output                 Output directory.

Environment variables:

  • CASTOR_METABASE_DB_HOST
  • CASTOR_METABASE_DB_PORT
  • CASTOR_METABASE_DB_DATABASE
  • CASTOR_METABASE_DB_SCHEMA
  • CASTOR_METABASE_DB_USERNAME
  • CASTOR_METABASE_DB_PASSWORD
  • CASTOR_METABASE_DB_ENCRYPTION_SECRET_KEY (optional)
  • CASTOR_METABASE_DB_REQUIRE_SSL_KEY (optional)

castor-extract-mode

Flag             Description
-H, --host       Mode host.
-w, --workspace  Workspace.
-t, --token      API token.
-s, --secret     API token password.
-o, --output     Output directory.

Environment variables:

  • CASTOR_MODE_ANALYTICS_HOST
  • CASTOR_MODE_ANALYTICS_SECRET
  • CASTOR_MODE_ANALYTICS_TOKEN
  • CASTOR_MODE_ANALYTICS_WORKSPACE

castor-extract-powerbi

Flag                  Description
-t, --tenant_id       Tenant ID.
-c, --client_id       Client ID.
-s, --secret          Client secret. Mutually exclusive with --certificate.
-cert, --certificate  Certificate file. Mutually exclusive with --secret.
-sc, --scopes <list>  API scopes. Optional.
-l, --login_url       Login URL. Optional.
-a, --api_base        Power BI REST API base. Optional.
-g, --graph_api_base  Microsoft Graph API base. Optional.
-o, --output          Output directory.

Environment variables:

  • CASTOR_POWERBI_CLIENT_ID
  • CASTOR_POWERBI_TENANT_ID
  • CASTOR_POWERBI_SECRET (optional if using certificate)
  • CASTOR_POWERBI_CERTIFICATE (optional if using secret)
  • CASTOR_POWERBI_API_BASE (optional)
  • CASTOR_POWERBI_GRAPH_API_BASE (optional)
  • CASTOR_POWERBI_LOGIN_URL (optional)
  • CASTOR_POWERBI_SCOPES (optional)

castor-extract-qlik

Flag                                     Description
-b, --base-url                           Qlik base URL.
-a, --api-key                            API key.
-e, --except-http-error-statuses <list>  HTTP status codes to ignore as warnings.
-s, --include-sheets                     Include sheets extraction.
-o, --output                             Output directory.

Environment variables:

  • CASTOR_QLIK_API_KEY
  • CASTOR_QLIK_BASE_URL

castor-extract-salesforce

Flag                  Description
-u, --username        Salesforce username.
-p, --password        Password.
-c, --client-id       Client ID.
-s, --client-secret   Client secret.
-t, --security-token  Security token.
-b, --base-url        Instance URL.
-o, --output          Output directory.
--skip-existing       Keep previously extracted files.

Environment variables:

  • CASTOR_SALESFORCE_BASE_URL
  • CASTOR_SALESFORCE_CLIENT_ID
  • CASTOR_SALESFORCE_CLIENT_SECRET
  • CASTOR_SALESFORCE_PASSWORD
  • CASTOR_SALESFORCE_SECURITY_TOKEN
  • CASTOR_SALESFORCE_USERNAME

castor-extract-salesforce-viz

Flag                  Description
-u, --username        Salesforce username.
-p, --password        Password.
-c, --client-id       Client ID.
-s, --client-secret   Client secret.
-t, --security-token  Security token.
-b, --base-url        Instance URL.
-o, --output          Output directory.

Environment variables:

  • CASTOR_SALESFORCE_BASE_URL
  • CASTOR_SALESFORCE_CLIENT_ID
  • CASTOR_SALESFORCE_CLIENT_SECRET
  • CASTOR_SALESFORCE_PASSWORD
  • CASTOR_SALESFORCE_SECURITY_TOKEN
  • CASTOR_SALESFORCE_USERNAME

castor-extract-sigma

Flag             Description
-H, --host       Sigma host.
-c, --client-id  Client ID.
-a, --api-token  API key.
-o, --output     Output directory.

Environment variables:

  • CASTOR_SIGMA_API_TOKEN
  • CASTOR_SIGMA_CLIENT_ID
  • CASTOR_SIGMA_HOST
  • CASTOR_SIGMA_GRANT_TYPE (optional)

castor-extract-strategy

Flag                      Description
-u, --username            Username.
-p, --password            Password.
-b, --base-url            Strategy URL.
-i, --project-ids <list>  Project IDs. Optional.
-o, --output              Output directory.

Environment variables:

  • CATALOG_STRATEGY_BASE_URL
  • CATALOG_STRATEGY_PASSWORD
  • CATALOG_STRATEGY_USERNAME
  • CATALOG_STRATEGY_PROJECT_IDS (optional; comma-separated supported)

castor-extract-tableau

Flag              Description
-u, --user        Tableau user.
-n, --token-name  Token name.
-p, --password    Password.
-t, --token       Token.
-b, --server-url  Server URL.
-i, --site-id     Site ID.
--skip-columns    Skip column extraction.
--skip-fields     Skip field extraction.
--with-pulse      Extract Pulse assets.
--page-size       Custom pagination size.
--ignore-ssl      Disable SSL verification.
-o, --output      Output directory.

Environment variables:

  • CASTOR_TABLEAU_SERVER_URL
  • CASTOR_TABLEAU_SITE_ID
  • CASTOR_TABLEAU_USER (required for username/password auth)
  • CASTOR_TABLEAU_PASSWORD (required for username/password auth)
  • CASTOR_TABLEAU_TOKEN_NAME (required for PAT auth)
  • CASTOR_TABLEAU_TOKEN (required for PAT auth)
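
Because the two authentication methods need different variable pairs, a pre-flight check can report which one the environment is set up for. This is a sketch only: tableau_auth_mode is a hypothetical helper, and reporting PAT first when both pairs are set is an assumption, not documented extractor behavior.

```shell
# Hypothetical pre-flight check (not part of castor-extract-tableau):
# report which authentication method the environment is set up for.
tableau_auth_mode() {
  if [ -n "${CASTOR_TABLEAU_TOKEN_NAME:-}" ] && [ -n "${CASTOR_TABLEAU_TOKEN:-}" ]; then
    echo "pat"
  elif [ -n "${CASTOR_TABLEAU_USER:-}" ] && [ -n "${CASTOR_TABLEAU_PASSWORD:-}" ]; then
    echo "password"
  else
    echo "set CASTOR_TABLEAU_TOKEN_NAME and CASTOR_TABLEAU_TOKEN, or CASTOR_TABLEAU_USER and CASTOR_TABLEAU_PASSWORD" >&2
    return 1
  fi
}

# Report the detected method, or the missing variables on stderr.
if mode="$(tableau_auth_mode)"; then
  echo "Tableau auth method: $mode"
fi
```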

castor-extract-thoughtspot

Flag            Description
-b, --base_url  Base URL.
-u, --username  Username.
-p, --password  Password.
-o, --output    Output directory.

Environment variables:

  • CASTOR_THOUGHTSPOT_BASE_URL
  • CASTOR_THOUGHTSPOT_USERNAME
  • CASTOR_THOUGHTSPOT_PASSWORD

Knowledge Extractors

These extractors use the upload file type VIZ.

castor-extract-confluence

Flag                        Description
-a, --account_id            Confluence account ID.
-b, --base_url              Confluence base URL.
-t, --token                 API token.
-u, --username              Username.
--include-archived-spaces   Include archived spaces.
--include-personal-spaces   Include personal spaces.
--space-ids-allowed <list>  Only include these space IDs.
--space-ids-blocked <list>  Exclude these space IDs.
-o, --output                Output directory.

Environment variables:

  • CASTOR_CONFLUENCE_ACCOUNT_ID
  • CASTOR_CONFLUENCE_BASE_URL
  • CASTOR_CONFLUENCE_TOKEN
  • CASTOR_CONFLUENCE_USERNAME

castor-extract-notion

Flag          Description
-t, --token   Notion token.
-o, --output  Output directory.

Environment variables:

  • CASTOR_NOTION_TOKEN