Castor Extractor
Castor Extractor is a set of command-line tools that pull metadata from your data stack and output local JSON and CSV files. You can then upload those files to Coalesce.
Castor Extractor pulls three kinds of assets:
- Warehouse: Databases, schemas, tables, columns, and queries.
- Visualization: Dashboards, users, and folders.
- Knowledge: Content from tools such as Confluence and Notion.
How It Works
The extractor works in the following way:
- Extract metadata from your source system (Snowflake, Looker, and so on) into local files, using that system's credentials.
- Review or validate the files if needed.
- Upload the files to Coalesce using your `source_id` and Catalog token.
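The steps above, end to end, using Snowflake as an example. The account, credentials, token, and source id values are placeholders; the commands themselves are documented in the sections that follow.

```shell
# 1. Extract metadata into a local directory
castor-extract-snowflake \
  --account xy12345.eu-west-1 \
  --user svc_castor \
  --password secret \
  --output /tmp/castor

# 2. Optionally validate the generated files
castor-file-check --directory /tmp/castor

# 3. Upload the files to Coalesce
castor-upload \
  --token YOUR_CATALOG_TOKEN \
  --source_id YOUR_SOURCE_ID \
  --file_type WAREHOUSE \
  --zone EU \
  --directory_path /tmp/castor
```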
Before You Begin
- Castor Extractor requires Python 3.10, 3.11, 3.12, or 3.13.
- Your `source_id`, provided by Catalog.
- Your Catalog token, provided by Catalog.
Installation
Create an Isolated Environment
We recommend creating an isolated Python environment. The following example uses pyenv:
brew install pyenv pyenv-virtualenv
pyenv install -v 3.11.9
pyenv virtualenv 3.11.9 castor-env
pyenv shell castor-env
python --version # should print 3.11.9
If pyenv shell doesn't work, add the following lines to your shell profile and restart your terminal:
eval "$(pyenv init -)"
eval "$(pyenv init --path)"
eval "$(pyenv virtualenv-init -)"
Install the Package
pip install --upgrade pip
pip install castor-extractor
Most sources need an extra. Install only the ones you use:
pip install castor-extractor[all]
# or only one integration, for example:
pip install castor-extractor[bigquery]
pip install castor-extractor[count]
pip install castor-extractor[databricks]
pip install castor-extractor[looker]
pip install castor-extractor[lookerstudio]
pip install castor-extractor[metabase]
pip install castor-extractor[mysql]
pip install castor-extractor[powerbi]
pip install castor-extractor[qlik]
pip install castor-extractor[postgres]
pip install castor-extractor[redshift]
pip install castor-extractor[snowflake]
pip install castor-extractor[sqlserver]
pip install castor-extractor[strategy]
pip install castor-extractor[tableau]
Create an Output Directory
mkdir -p /tmp/castor
1. Extract Metadata
You can either use an extractor package or place the files in the directory yourself.
Using an Extractor Package
After installing one of the extractor packages (for example, `pip install castor-extractor[snowflake]`), you can extract the data directly into the output directory created during setup, ready to upload to Catalog.
castor-extract-snowflake \
--account xy12345.eu-west-1 \
--user svc_castor \
--password secret \
--output /tmp/castor
See the full list of commands in the Castor Extractor Reference below.
Using a Folder
Place your generic CSV files in the output directory.
2. Validate Data (Optional)
This step is optional. Run castor-file-check to validate your generic warehouse CSV files before uploading:
castor-file-check --directory /tmp/castor
All flags:
- `-d, --directory`: directory containing generic warehouse CSV files
- `--verbose`: show detailed validation logs
3. Upload to Coalesce
You will need:
- `source_id`: provided by Catalog
- Catalog API token: provided by Catalog
castor-upload \
--token YOUR_CATALOG_TOKEN \
--source_id YOUR_SOURCE_ID \
--file_type WAREHOUSE \
--directory_path /tmp/castor
All flags:
- `-k, --token`: API token from Coalesce. Required.
- `-s, --source_id`: source id from Coalesce. Required.
- `-t, --file_type`: file type (`WAREHOUSE`, `VIZ`, `DBT`, `QUALITY`)
- `-z, --zone`: upload zone. Required.
  - Use `US` if your instance is on app.us.castordoc.com.
  - Use `EU` if your instance is on app.castordoc.com.
- `-f, --file_path`: upload one file
- `-d, --directory_path`: upload all files in a directory
You can only use --file_path or --directory_path, not both.
Using Environment Variables
Instead of passing flags on the command line, you can set environment variables before running the extract and upload commands.
export CASTOR_SNOWFLAKE_ACCOUNT="xy12345.eu-west-1"
export CASTOR_SNOWFLAKE_USER="svc_castor"
export CASTOR_SNOWFLAKE_PASSWORD="secret"
export CASTOR_OUTPUT_DIRECTORY="/tmp/castor"
castor-extract-snowflake
export CASTOR_UPLOADER_TOKEN="your-token"
export CASTOR_UPLOADER_SOURCE_ID="your-source-id"
export CASTOR_UPLOADER_FILE_TYPE="WAREHOUSE"
export CASTOR_UPLOADER_ZONE="US"
export CASTOR_UPLOADER_DIRECTORY_PATH="/tmp/castor"
castor-upload
Troubleshooting
If you encounter problems uploading your files, you can increase the timeout or configure retries by setting the following environment variables:
- `CASTOR_TIMEOUT_OVERRIDE`: number of seconds before timeout (default: 60)
- `CASTOR_RETRY_OVERRIDE`: number of retries (default: 1)
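For example, to allow a five-minute timeout and three retries before running the upload (token, source id, and directory values are placeholders):

```shell
export CASTOR_TIMEOUT_OVERRIDE=300
export CASTOR_RETRY_OVERRIDE=3

castor-upload \
  --token YOUR_CATALOG_TOKEN \
  --source_id YOUR_SOURCE_ID \
  --file_type WAREHOUSE \
  --zone EU \
  --directory_path /tmp/castor
```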
Castor Extractor Reference
Global Variables
These variables apply across all commands:
| Variable | Purpose |
|---|---|
| CASTOR_OUTPUT_DIRECTORY | Default output directory for all extractors. |
| GOOGLE_APPLICATION_CREDENTIALS | Default GCP credentials file for BigQuery and Looker Studio. |
Zone Selection
- Use `US` if your instance is on app.us.castordoc.com.
- Use `EU` if your instance is on app.castordoc.com.
Upload and Validate
castor-file-check
Validate generic warehouse CSV files before upload.
- `-d, --directory`: directory containing generic warehouse CSV files
- `--verbose`: show detailed validation logs
castor-upload
Push extracted files to Coalesce-managed GCS.
- `-k, --token`: API token from Coalesce
- `-s, --source_id`: source id from Coalesce
- `-t, --file_type`: file type (`WAREHOUSE`, `VIZ`, `DBT`, `QUALITY`)
  - `WAREHOUSE`: warehouse extractors
  - `VIZ`: visualization extractors; knowledge bases (Confluence and Notion) also use `VIZ`
  - `QUALITY`: used for external data quality tools along with generic CSV files
- `-z, --zone`: upload zone (`US` or `EU`, default `EU`)
- `-f, --file_path`: upload one file
- `-d, --directory_path`: upload all files in a directory
You can only use --file_path or --directory_path, not both.
Use --help to get the most up-to-date flags, for example castor-extract-sqlserver --help.
Warehouse Extractors
These use upload file type `WAREHOUSE`.
castor-extract-bigquery
| Flag | Description |
|---|---|
-c, --credentials | Path to Google credentials file. |
-o, --output | Output directory. |
--skip-existing | Keep previously extracted files. |
--db-allowed <list> | Allowed GCP projects. |
--db-blocked <list> | Blocked GCP projects. |
-s, --safe-mode | Safe mode. |
Environment variables:
- `GOOGLE_APPLICATION_CREDENTIALS`
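A minimal invocation sketch using the flags in the table above; the credentials path is a placeholder for your own service-account file.

```shell
# Extract BigQuery metadata using an explicit credentials file
castor-extract-bigquery \
  --credentials /path/to/service-account.json \
  --output /tmp/castor
```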
castor-extract-databricks
| Flag | Description |
|---|---|
-H, --host | Databricks host. |
-t, --token | Access token. |
-p, --http-path | HTTP path. |
-o, --output | Output directory. |
--catalog-allowed <list> | Allowed catalogs. |
--catalog-blocked <list> | Blocked catalogs. |
--skip-existing | Keep previously extracted files. |
Environment variables:
- `CASTOR_DATABRICKS_HOST`
- `CASTOR_DATABRICKS_HTTP_PATH`
- `CASTOR_DATABRICKS_TOKEN`
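An example invocation using the flags in the table above; the host, HTTP path, and token values are placeholders for your workspace's own connection details.

```shell
castor-extract-databricks \
  --host https://my-workspace.cloud.databricks.com \
  --http-path /sql/1.0/warehouses/abc123 \
  --token dapi-your-token \
  --output /tmp/castor
```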
castor-extract-mysql
| Flag | Description |
|---|---|
-H, --host | MySQL host. |
-P, --port | MySQL port. |
-u, --user | MySQL user. |
-p, --password | MySQL password. |
-o, --output | Output directory. |
--skip-existing | Keep previously extracted files. |
Environment variables:
- `CASTOR_MYSQL_USER`
- `CASTOR_MYSQL_PASSWORD`
- `CASTOR_MYSQL_HOST`
- `CASTOR_MYSQL_PORT` (optional)
castor-extract-postgres
| Flag | Description |
|---|---|
-H, --host | Postgres host. |
-P, --port | Postgres port. |
-d, --database | Postgres database. |
-u, --user | Postgres user. |
-p, --password | Postgres password. |
-o, --output | Output directory. |
--skip-existing | Keep previously extracted files. |
Environment variables:
- `CASTOR_POSTGRES_USER`
- `CASTOR_POSTGRES_PASSWORD`
- `CASTOR_POSTGRES_HOST`
- `CASTOR_POSTGRES_PORT`
- `CASTOR_POSTGRES_DATABASE`
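An example invocation using the flags in the table above; host, database, and credential values are placeholders.

```shell
castor-extract-postgres \
  --host db.example.com \
  --port 5432 \
  --database analytics \
  --user castor_reader \
  --password secret \
  --output /tmp/castor
```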
castor-extract-redshift
| Flag | Description |
|---|---|
-H, --host | Redshift host. |
-P, --port | Redshift port. |
-d, --database | Redshift database. |
-u, --user | Redshift user. |
-p, --password | Redshift password. |
-o, --output | Output directory. |
--skip-existing | Keep previously extracted files. |
--serverless | Extract from Redshift Serverless. |
Environment variables:
- `CASTOR_REDSHIFT_USER`
- `CASTOR_REDSHIFT_PASSWORD`
- `CASTOR_REDSHIFT_HOST`
- `CASTOR_REDSHIFT_PORT`
- `CASTOR_REDSHIFT_DATABASE`
- `CASTOR_REDSHIFT_SERVERLESS` (optional; `true`/`false`)
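An example invocation using the flags in the table above; host, database, and credential values are placeholders.

```shell
castor-extract-redshift \
  --host my-cluster.abc123.eu-west-1.redshift.amazonaws.com \
  --port 5439 \
  --database analytics \
  --user castor_reader \
  --password secret \
  --output /tmp/castor
# Add --serverless when extracting from Redshift Serverless
```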
castor-extract-snowflake
| Flag | Description |
|---|---|
-a, --account | Snowflake account. |
-u, --user | Snowflake user. |
-p, --password | Password. Mutually exclusive with --private-key. |
-pk, --private-key | Private key. Mutually exclusive with --password. |
--warehouse | Warehouse override. |
--role | Role override. |
--db-allowed <list> | Allowed databases. |
--db-blocked <list> | Blocked databases. |
--query-blocked <list> | Blocked query patterns. Supports % and _ wildcards. |
--fetch-transient | Include transient tables. |
--insecure-mode | Disable OCSP checking. |
-o, --output | Output directory. |
--skip-existing | Keep previously extracted files. |
Environment variables:
- `CASTOR_SNOWFLAKE_ACCOUNT`
- `CASTOR_SNOWFLAKE_USER`
- `CASTOR_SNOWFLAKE_PASSWORD` (optional if using private key)
- `CASTOR_SNOWFLAKE_PRIVATE_KEY` (optional if using password)
- `CASTOR_SNOWFLAKE_INSECURE_MODE` (optional)
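A basic invocation is shown in the Extract Metadata section above; the sketch below adds the filtering flags from the table. The account, credentials, database name, and query pattern are placeholders; note that --query-blocked supports % and _ wildcards.

```shell
# Limit extraction to one database and skip matching queries
castor-extract-snowflake \
  --account xy12345.eu-west-1 \
  --user svc_castor \
  --password secret \
  --db-allowed ANALYTICS \
  --query-blocked 'SELECT 1%' \
  --output /tmp/castor
```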
castor-extract-sqlserver
| Flag | Description |
|---|---|
-H, --host | MSSQL host. |
-P, --port | MSSQL port. |
-u, --user | MSSQL user. |
-p, --password | MSSQL password. |
-s, --skip-queries | Skip SQL query extraction. |
--db-allowed <list> | Allowed databases. |
--db-blocked <list> | Blocked databases. |
--default-db | Fallback database for login issues. |
-o, --output | Output directory. |
--skip-existing | Keep previously extracted files. |
Environment variables:
- `CASTOR_MSSQL_USER`
- `CASTOR_MSSQL_PASSWORD`
- `CASTOR_MSSQL_HOST`
- `CASTOR_MSSQL_PORT`
- `CASTOR_MSSQL_DEFAULT_DB` (optional)
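An example invocation using the flags in the table above; host and credential values are placeholders.

```shell
castor-extract-sqlserver \
  --host mssql.example.com \
  --port 1433 \
  --user castor_reader \
  --password secret \
  --output /tmp/castor
```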
Visualization Extractors
These use upload file type `VIZ`.
castor-extract-count
| Flag | Description |
|---|---|
-c, --credentials | GCP credentials as string. |
-d, --dataset_id | Data set ID storing Count data. |
-o, --output | Output directory. |
castor-extract-domo
| Flag | Description |
|---|---|
-b, --base-url | Domo host. |
-a, --api-token | API token. |
-d, --developer-token | Developer token. |
-c, --client-id | Client ID. |
-C, --cloud-id | External warehouse ID. |
-o, --output | Output directory. |
Environment variables:
- `CASTOR_DOMO_API_TOKEN`
- `CASTOR_DOMO_BASE_URL`
- `CASTOR_DOMO_CLIENT_ID`
- `CASTOR_DOMO_DEVELOPER_TOKEN`
- `CASTOR_DOMO_CLOUD_ID`
castor-extract-looker
| Flag | Description |
|---|---|
-b, --base-url | Looker base URL. |
-c, --client-id | Client ID. |
-s, --client-secret | Client secret. |
-t, --timeout | Timeout in seconds. |
--thread-pool-size | Thread pool size. |
-S, --safe-mode | Safe mode. |
--log-to-stdout | Log to stdout. |
--search-per-folder | Fetch looks and dashboards per folder. |
-o, --output | Output directory. |
Environment variables:
- `CASTOR_LOOKER_BASE_URL`
- `CASTOR_LOOKER_CLIENT_ID`
- `CASTOR_LOOKER_CLIENT_SECRET`
- `CASTOR_LOOKER_TIMEOUT_SECOND` (optional override)
- `CASTOR_LOOKER_PAGE_SIZE` (optional override)
- `CASTOR_LOOKER_THREAD_POOL_SIZE` (optional override)
- `CASTOR_LOOKER_IS_SAFE_MODE` (optional; `true`/`false`)
- `CASTOR_LOOKER_LOG_TO_STDOUT` (optional; `true`/`false`)
- `CASTOR_LOOKER_SEARCH_PER_FOLDER` (optional; `true`/`false`)
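An example invocation using the flags in the table above; the URL and API credential values are placeholders.

```shell
castor-extract-looker \
  --base-url https://mycompany.looker.com \
  --client-id your-client-id \
  --client-secret your-client-secret \
  --output /tmp/castor
```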
castor-extract-looker-studio
| Flag | Description |
|---|---|
-o, --output | Output directory. |
--source-queries-only | Only extract BigQuery source queries. |
--skip-view-activity-logs | Skip activity log extraction. |
-c, --credentials | Service account credentials file. |
-a, --admin-email | Google Workspace admin email. |
--users-file-path | Path to JSON array of user emails. |
-b, --bigquery-credentials | BigQuery service account credentials file. |
--db-allowed <list> | Allowed GCP projects for source queries. |
--db-blocked <list> | Blocked GCP projects for source queries. |
Environment variables:
CASTOR_LOOKER_BASE_URL="https://mycompany.looker.com"
CASTOR_LOOKER_CLIENT_ID="xxxxx"
CASTOR_LOOKER_CLIENT_SECRET="yyyyy"
CASTOR_OUTPUT_DIRECTORY="/tmp/castor"
castor-extract-metabase-api
| Flag | Description |
|---|---|
-b, --base-url | Metabase base URL. |
-u, --user | Username. |
-p, --password | Password. |
-o, --output | Output directory. |
Environment variables:
- `CASTOR_METABASE_API_BASE_URL`
- `CASTOR_METABASE_API_USERNAME`
- `CASTOR_METABASE_API_USER`
- `CASTOR_METABASE_API_PASSWORD`
castor-extract-metabase-db
| Flag | Description |
|---|---|
-H, --host | Host. |
-P, --port | Port. |
-d, --database | Database. |
-s, --schema | Schema. |
-u, --user | Username. |
-p, --password | Password. |
-k, --encryption_secret_key | Encryption key. |
--require_ssl | Require SSL. |
-o, --output | Output directory. |
Environment variables:
- `CASTOR_METABASE_DB_HOST`
- `CASTOR_METABASE_DB_PORT`
- `CASTOR_METABASE_DB_DATABASE`
- `CASTOR_METABASE_DB_SCHEMA`
- `CASTOR_METABASE_DB_USERNAME`
- `CASTOR_METABASE_DB_PASSWORD`
- `CASTOR_METABASE_DB_ENCRYPTION_SECRET_KEY` (optional)
- `CASTOR_METABASE_DB_REQUIRE_SSL_KEY` (optional)
castor-extract-mode
| Flag | Description |
|---|---|
-H, --host | Mode host. |
-w, --workspace | Workspace. |
-t, --token | API token. |
-s, --secret | API token password. |
-o, --output | Output directory. |
Environment variables:
- `CASTOR_MODE_ANALYTICS_HOST`
- `CASTOR_MODE_ANALYTICS_SECRET`
- `CASTOR_MODE_ANALYTICS_TOKEN`
- `CASTOR_MODE_ANALYTICS_WORKSPACE`
castor-extract-powerbi
| Flag | Description |
|---|---|
-t, --tenant_id | Tenant ID. |
-c, --client_id | Client ID. |
-s, --secret | Client secret. Mutually exclusive with --certificate. |
-cert, --certificate | Certificate file. Mutually exclusive with --secret. |
-sc, --scopes <list> | API scopes. Optional. |
-l, --login_url | Login URL. Optional. |
-a, --api_base | Power BI REST API base. Optional. |
-g, --graph_api_base | Microsoft Graph API base. Optional. |
-o, --output | Output directory. |
Environment variables:
- `CASTOR_POWERBI_CLIENT_ID`
- `CASTOR_POWERBI_TENANT_ID`
- `CASTOR_POWERBI_SECRET` (optional if using certificate)
- `CASTOR_POWERBI_CERTIFICATE` (optional if using secret)
- `CASTOR_POWERBI_API_BASE` (optional)
- `CASTOR_POWERBI_GRAPH_API_BASE` (optional)
- `CASTOR_POWERBI_LOGIN_URL` (optional)
- `CASTOR_POWERBI_SCOPES` (optional)
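An example invocation using client-secret authentication with the flags in the table above; the tenant ID, client ID, and secret are placeholders.

```shell
castor-extract-powerbi \
  --tenant_id 00000000-0000-0000-0000-000000000000 \
  --client_id 11111111-1111-1111-1111-111111111111 \
  --secret your-client-secret \
  --output /tmp/castor
```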
castor-extract-qlik
| Flag | Description |
|---|---|
-b, --base-url | Qlik base URL. |
-a, --api-key | API key. |
-e, --except-http-error-statuses <list> | HTTP status codes to ignore as warnings. |
-s, --include-sheets | Include sheets extraction. |
-o, --output | Output directory. |
Environment variables:
- `CASTOR_QLIK_API_KEY`
- `CASTOR_QLIK_BASE_URL`
castor-extract-salesforce
| Flag | Description |
|---|---|
-u, --username | Salesforce username. |
-p, --password | Password. |
-c, --client-id | Client ID. |
-s, --client-secret | Client secret. |
-t, --security-token | Security token. |
-b, --base-url | Instance URL. |
-o, --output | Output directory. |
--skip-existing | Keep previously extracted files. |
Environment variables:
- `CASTOR_SALESFORCE_BASE_URL`
- `CASTOR_SALESFORCE_CLIENT_ID`
- `CASTOR_SALESFORCE_CLIENT_SECRET`
- `CASTOR_SALESFORCE_PASSWORD`
- `CASTOR_SALESFORCE_SECURITY_TOKEN`
- `CASTOR_SALESFORCE_USERNAME`
castor-extract-salesforce-viz
| Flag | Description |
|---|---|
-u, --username | Salesforce username. |
-p, --password | Password. |
-c, --client-id | Client ID. |
-s, --client-secret | Client secret. |
-t, --security-token | Security token. |
-b, --base-url | Instance URL. |
-o, --output | Output directory. |
Environment variables:
- `CASTOR_SALESFORCE_BASE_URL`
- `CASTOR_SALESFORCE_CLIENT_ID`
- `CASTOR_SALESFORCE_CLIENT_SECRET`
- `CASTOR_SALESFORCE_PASSWORD`
- `CASTOR_SALESFORCE_SECURITY_TOKEN`
- `CASTOR_SALESFORCE_USERNAME`
castor-extract-sigma
| Flag | Description |
|---|---|
-H, --host | Sigma host. |
-c, --client-id | Client ID. |
-a, --api-token | API key. |
-o, --output | Output directory. |
Environment variables:
- `CASTOR_SIGMA_API_TOKEN`
- `CASTOR_SIGMA_CLIENT_ID`
- `CASTOR_SIGMA_HOST`
- `CASTOR_SIGMA_GRANT_TYPE` (optional)
castor-extract-strategy
| Flag | Description |
|---|---|
-u, --username | Username. |
-p, --password | Password. |
-b, --base-url | Strategy URL. |
-i, --project-ids <list> | Project IDs. Optional. |
-o, --output | Output directory. |
Environment variables:
- `CATALOG_STRATEGY_BASE_URL`
- `CATALOG_STRATEGY_PASSWORD`
- `CATALOG_STRATEGY_USERNAME`
- `CATALOG_STRATEGY_PROJECT_IDS` (optional; comma-separated values supported)
castor-extract-tableau
| Flag | Description |
|---|---|
-u, --user | Tableau user. |
-n, --token-name | Token name. |
-p, --password | Password. |
-t, --token | Token. |
-b, --server-url | Server URL. |
-i, --site-id | Site ID. |
--skip-columns | Skip column extraction. |
--skip-fields | Skip field extraction. |
--with-pulse | Extract Pulse assets. |
--page-size | Custom pagination size. |
--ignore-ssl | Disable SSL verification. |
-o, --output | Output directory. |
Environment variables:
- `CASTOR_TABLEAU_SERVER_URL`
- `CASTOR_TABLEAU_SITE_ID`
- `CASTOR_TABLEAU_USER` (required for username/password auth)
- `CASTOR_TABLEAU_PASSWORD` (required for username/password auth)
- `CASTOR_TABLEAU_TOKEN_NAME` (required for PAT auth)
- `CASTOR_TABLEAU_TOKEN` (required for PAT auth)
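An example invocation using personal access token (PAT) authentication with the flags in the table above; the server URL, site ID, and token values are placeholders.

```shell
castor-extract-tableau \
  --server-url https://tableau.example.com \
  --site-id mysite \
  --token-name castor-pat \
  --token your-pat-secret \
  --output /tmp/castor
```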
castor-extract-thoughtspot
| Flag | Description |
|---|---|
-b, --base_url | Base URL. |
-u, --username | Username. |
-p, --password | Password. |
-o, --output | Output directory. |
Environment variables:
- `CASTOR_THOUGHTSPOT_BASE_URL`
- `CASTOR_THOUGHTSPOT_USERNAME`
- `CASTOR_THOUGHTSPOT_PASSWORD`
castor-extract-confluence
| Flag | Description |
|---|---|
-a, --account_id | Confluence account ID. |
-b, --base_url | Confluence base URL. |
-t, --token | API token. |
-u, --username | Username. |
--include-archived-spaces | Include archived spaces. |
--include-personal-spaces | Include personal spaces. |
--space-ids-allowed <list> | Only include these space IDs. |
--space-ids-blocked <list> | Exclude these space IDs. |
-o, --output | Output directory. |
Environment variables:
- `CASTOR_CONFLUENCE_ACCOUNT_ID`
- `CASTOR_CONFLUENCE_BASE_URL`
- `CASTOR_CONFLUENCE_TOKEN`
- `CASTOR_CONFLUENCE_USERNAME`
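An example invocation using the flags in the table above; the base URL, account ID, username, and token are placeholders for your own Confluence site and credentials.

```shell
castor-extract-confluence \
  --base_url https://mycompany.atlassian.net \
  --account_id your-account-id \
  --username user@example.com \
  --token your-api-token \
  --output /tmp/castor
```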
castor-extract-notion
| Flag | Description |
|---|---|
-t, --token | Notion token. |
-o, --output | Output directory. |
Environment variables:
- `CASTOR_NOTION_TOKEN`