BigQuery Setup
Set up BigQuery with the Catalog. Create a Google Service Account, grant the required roles, and add credentials to the Catalog App.
Which Rights Do You Need To Complete the Onboarding?
To create the Google service account, you need at least one of the following roles:
- Service Account Admin
- Editor (basic role)
What Will Catalog Do With This Access?
Catalog's access: these roles only allow Catalog to see your data model, queries, and users, by running queries against BigQuery metadata. They are the minimum set of roles that lets us read your metadata without being able to read your data.
Need Some More Info?
- If you want further information on service accounts, see:
- Service account role documentation
- Roles and permissions documentation
1. Whitelist Catalog's IP on BigQuery
Needed only if you've set up network policies, meaning your BigQuery instance only accepts connections coming from specific IPs.
To allow Catalog to connect to BigQuery, you will first need to whitelist Catalog's IP address.
Here are our fixed IPs:
- For instances on app.us.castordoc.com: 34.42.92.72
- For instances on app.castordoc.com: 35.246.176.138
BigQuery's documentation on configuring public IP connectivity can be found in the Google Cloud docs.
2. Create Catalog User on BigQuery
2.1 Create Google Service Account for Catalog
Create a service account for Catalog from the Google Cloud console. See how to create and manage service accounts.
Make sure to create and download a JSON key for that service account. See how to create service account keys.
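The two steps above can also be sketched with the gcloud CLI. The project id (my-project) and account name (catalog-reader) below are placeholders, so substitute your own values:

```shell
# Create a service account for Catalog (names are placeholders).
gcloud iam service-accounts create catalog-reader \
  --project=my-project \
  --display-name="Catalog metadata reader"

# Create and download a JSON key for the new account.
gcloud iam service-accounts keys create catalog-key.json \
  --iam-account=catalog-reader@my-project.iam.gserviceaccount.com
```

The downloaded key is a credential: store it securely and share it only through the Catalog App, as described in step 3.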
2.2 Grant Roles to Google Service Account
Grant the needed access to this new service account using the following roles. See how to grant and change access.
You can find the documentation on the different roles in the BigQuery access control docs.
Make sure to add roles for all projects you would like to see in the Catalog App.
- BigQuery Read Session User: allows Catalog to create and use read sessions. It enables read access via SQL but does not, by itself, grant any access to your data.
- BigQuery Metadata Viewer: the metadataViewer role allows Catalog to fetch the schemas of your data.
- BigQuery Job User: the jobUser role is used by Catalog to parse queries and compute lineage and usage stats.
- BigQuery Resource Viewer: the resourceViewer role is also used by Catalog to parse queries and compute lineage and usage stats.
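As a sketch, the four roles can also be granted from the gcloud CLI; the project id and service-account email below are placeholders. Repeat the binding for every project you want visible in the Catalog App:

```shell
# Placeholder service-account email -- use the one you created.
SA="catalog-reader@my-project.iam.gserviceaccount.com"

# Bind each of the four required BigQuery roles to the account.
for ROLE in roles/bigquery.readSessionUser \
            roles/bigquery.metadataViewer \
            roles/bigquery.jobUser \
            roles/bigquery.resourceViewer; do
  gcloud projects add-iam-policy-binding my-project \
    --member="serviceAccount:${SA}" \
    --role="${ROLE}"
done
```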
3. Add New Credentials in Catalog App
You must be a Catalog admin to complete this step.
You can now enter the newly created credentials in the Catalog App integrations settings.
- Go to "Settings > Integrations"
- Click "Add" next to BigQuery
- Add the credentials
If you are using a separate project to host your service accounts:
Make sure to update the JSON key. You'll have to overwrite the project_id with a project from which the service user can run queries.
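One way to overwrite the field is sketched below, assuming the key was saved as catalog-key.json and that my-query-project (a placeholder) is a project where the service account can run jobs:

```shell
# Point the key at a project the service account can run queries in.
# catalog-key.json and my-query-project are placeholder names.
python3 - catalog-key.json my-query-project <<'EOF'
import json, sys

path, project = sys.argv[1], sys.argv[2]
with open(path) as f:
    key = json.load(f)
key["project_id"] = project  # overwrite with the query project
with open(path, "w") as f:
    json.dump(key, f, indent=2)
EOF
```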
Appendix
Queries Run by the Catalog User
- Databases
WITH catalogs AS (
SELECT DISTINCT catalog_name
FROM `{project}.region-{region}.INFORMATION_SCHEMA.SCHEMATA`
)
SELECT
catalog_name AS database_id,
catalog_name AS database_name
FROM catalogs
- Schemas
SELECT
catalog_name AS database_id,
catalog_name AS database_name,
schema_name,
schema_owner,
creation_time,
last_modified_time,
location,
CONCAT(catalog_name, '.', schema_name) AS schema_id
FROM `{project}.region-{region}.INFORMATION_SCHEMA.SCHEMATA`
- Tables
WITH d AS (
SELECT
table_catalog,
table_schema,
table_name,
option_value AS `comment`
FROM
`{project}.region-{region}.INFORMATION_SCHEMA.TABLE_OPTIONS`
WHERE TRUE
AND option_name = 'description'
AND option_value IS NOT NULL
AND option_value != ''
AND option_value != '""'
),
t AS
(
SELECT
table_catalog,
table_schema,
table_name,
option_value AS tags
FROM
`{project}.region-{region}.INFORMATION_SCHEMA.TABLE_OPTIONS`
WHERE TRUE
AND option_name = 'labels'
AND option_value IS NOT NULL
AND option_value != ''
AND option_value != '""'
),
dt AS
(
SELECT
catalog_name,
schema_name,
option_value AS tags
FROM
`{project}.region-{region}.INFORMATION_SCHEMA.SCHEMATA_OPTIONS`
WHERE TRUE
AND option_name = 'labels'
AND option_value IS NOT NULL
AND option_value != ''
AND option_value != '""'
)
SELECT
i.table_catalog AS database_name,
i.table_catalog AS database_id,
i.table_schema AS `schema_name`,
i.table_name AS table_name,
i.table_type,
i.is_insertable_into,
i.is_typed,
i.creation_time,
d.comment,
CONCAT(
COALESCE(t.tags, ""),
COALESCE(dt.tags, "")
) AS tags,
CONCAT(i.table_catalog, '.', i.table_schema) AS schema_id,
CONCAT(i.table_catalog, '.', i.table_schema, '.', i.table_name) AS table_id
FROM
`{project}.region-{region}.INFORMATION_SCHEMA.TABLES` AS i
LEFT JOIN d ON i.table_catalog = d.table_catalog
AND i.table_schema = d.table_schema
AND i.table_name = d.table_name
LEFT JOIN t ON i.table_catalog = t.table_catalog
AND i.table_schema = t.table_schema
AND i.table_name = t.table_name
LEFT JOIN dt ON i.table_catalog = dt.catalog_name
AND i.table_schema = dt.schema_name
- Columns
WITH field_path AS (
SELECT
table_catalog,
table_schema,
table_name,
column_name,
description,
field_path,
data_type
FROM
`{project}.region-{region}.INFORMATION_SCHEMA.COLUMN_FIELD_PATHS`
)
SELECT
c.table_catalog AS database_id,
c.table_catalog AS database_name,
c.table_schema AS `schema_name`,
c.table_name AS table_name,
f.field_path AS column_name,
c.ordinal_position,
c.is_nullable,
f.data_type,
c.is_generated,
c.generation_expression,
c.is_stored,
c.is_hidden,
c.is_updatable,
c.is_system_defined,
c.is_partitioning_column,
c.clustering_ordinal_position,
f.description AS `comment`,
CONCAT(c.table_catalog, '.', c.table_schema) AS schema_id,
CONCAT(c.table_catalog, '.', c.table_schema, '.', c.table_name) AS table_id,
CONCAT(c.table_catalog, '.', c.table_schema, '.', c.table_name, '.', f.field_path) AS column_id
FROM
`{project}.region-{region}.INFORMATION_SCHEMA.COLUMNS` AS c
LEFT JOIN field_path AS f ON
c.table_catalog = f.table_catalog
AND c.table_schema = f.table_schema
AND c.table_name = f.table_name
AND c.column_name = f.column_name
WHERE TRUE
AND c.column_name != '_PARTITIONTIME'
- Users
SELECT DISTINCT user_email AS user_email
FROM `{project}.region-{region}.INFORMATION_SCHEMA.JOBS_BY_PROJECT`
WHERE TRUE
AND DATE(creation_time) >= DATE_ADD(CURRENT_DATE, INTERVAL -30 DAY)
- Queries
SELECT
job_id AS query_id,
creation_time,
project_id AS database_name,
user_email AS user_name,
user_email AS user_id,
job_type,
statement_type,
priority,
start_time,
end_time,
query AS query_text,
state,
reservation_id,
total_bytes_processed,
total_bytes_billed,
total_slot_ms,
error_result,
cache_hit,
destination_table,
referenced_tables,
labels,
parent_job_id
FROM `{project}.region-{region}.INFORMATION_SCHEMA.JOBS_BY_PROJECT`
WHERE TRUE
AND DATE(creation_time) = DATE_ADD(CURRENT_DATE, INTERVAL -1 DAY)
AND EXTRACT(hour FROM creation_time) BETWEEN 0 AND 23
AND job_type = 'QUERY'
- View DDL
SELECT
table_catalog AS database_name,
table_schema AS schema_name,
table_name AS view_name,
'{{"project_id": "' || table_catalog || '", "dataset_id": "' || table_schema || '", "table_id": "' || table_name || '"}}' as destination_table,
view_definition
FROM `{project}.{dataset}.INFORMATION_SCHEMA.VIEWS`
References
Here's a list of all Google Cloud documentation referenced above: