
Databricks Quickstart

    Before You Begin

    Need Databricks Access?

    Reach out to our sales team.

    Add a Data Source

    When you open your Workspace, you land on the Build Interface. This is where you'll build your data pipelines.

    1. Make sure you are on Nodes.

    2. Click on the "+" sign > Add Sources.

      A dashboard interface is displayed with a left navigation panel labeled Nodes, a top menu bar featuring options like Browser and Build, and a central workflow diagram. The diagram shows several connected nodes named forecast_hourly_metrics, city_weather_metrics, and city_weather_facts. A button labeled Run All is located at the top right.
    3. The Add Sources to SQL Pipeline window will open. Find the forecast_daily_calendar_metric table. Double-click the source to preview the data.

      A window labeled Add Sources to SQL Pipeline is open, showing a list of data sources in a panel on the left and a table preview with columns city_name, country_code, latitude, longitude, and date on the right. A button labeled Add 3 sources is located in the bottom-right corner.
    4. Then click Add Source.

    5. The source Node is added to your DAG. Now you can start building.
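
    If you want to double-check the source before building on it, you can preview it from a Databricks SQL editor. A minimal sketch, assuming the sample data lives in a schema named accuweather; adjust the name to your environment:

      -- Quick preview of the raw source (schema name is an assumption)
      SELECT *
      FROM accuweather.forecast_daily_calendar_metric
      LIMIT 10;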

    Adding Data From Different Sources

    We support multiple data types from different sources. Take a look at the External Data Package.

    Add a Stage Node

    1. Right-click on the source Node > Add Node > Stage.

      A screenshot of the Coalesce interface shows a flow diagram in the center with a node named forecast_hourly_metrics. A right-click menu offers options like Edit, Add Node, or changing the node type to Stage or Fact. On the left, the Nodes panel lists available items, and the pipeline status at the bottom is marked Validated.
    2. The new Stage Node opens in the Node Editor.

    3. Next to the name, click the pencil and name it stg_weather_metrics. Databricks Node names are formatted as lowercase.

    4. Then click Create and Run.

      A Coalesce interface displays a stage node named stg_weather_metrics with a configurable description field at the top. A table below shows columns such as city_name, transform, data type, source, and nullable status. At the bottom, the pipeline is marked as validated, with Create and Run buttons.
    5. You’ve just populated your first table in Databricks. Click on the Results tab to see each stage complete and the SQL that was executed.

    What Create and Run Do

    Create renders and executes the SQL stages of the Create template defined in the Node Type definition; Run does the same for the Run template.

    A Coalesce interface displays a node named stg_weather_metrics at the top, with a table listing columns like city_name and measurement_time beneath it. A panel on the right shows Node Properties and configuration options. Tabs labeled Results and DDL appear under the table, revealing a SQL script and status indicators for validation, creation, and data loading, with a Run button positioned near the bottom.
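
    For a Stage Node like stg_weather_metrics, the rendered SQL looks roughly like the sketch below. This is illustrative only: the exact statements come from the Node Type's Create and Run templates, so Coalesce's actual output will differ.

      -- Create template output (illustrative): build or rebuild the target table
      CREATE OR REPLACE TABLE stg_weather_metrics (
        city_name STRING,
        datetime_valid_local TIMESTAMP,
        temperature DOUBLE
        -- ...remaining source columns
      );

      -- Run template output (illustrative): load the table from the source
      INSERT OVERWRITE stg_weather_metrics
      SELECT
        `forecast_hourly_metric`.`city_name`,
        `forecast_hourly_metric`.`datetime_valid_local`,
        `forecast_hourly_metric`.`temperature`
        -- ...remaining source columns
      FROM `forecast_hourly_metric`;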

    Add a Transformation

    1. If the Node isn’t open, you can open the stg_weather_metrics by double clicking the Node or right-click and select Edit.

    2. Sort the columns by name by clicking the Column Name table header.

    3. Scroll to the column named temperature.

    4. Double-click in the Transform column next to temperature.

    5. Add ROUND({{SRC}}, 1). This uses standard SQL plus Helper Tokens: {{SRC}} resolves to the fully qualified source column, so once added, the transform renders as ROUND(`forecast_hourly_metric`.`temperature`, 1). A rendered example appears after this list.

      A Coalesce interface shows a node named stg_weather_metrics. The user is editing a transform cell with an expression that uses the ROUND function to round the temperature value. The table lists columns including city_name and wind_speed, with associated data types and descriptions.
    6. Click Validate Select to check your transformations and joins before running them.

    7. Scroll down in the Results panel to see the SQL applied to temperature.

      A Coalesce interface with a node named stg_weather_metrics at the top, a table listing columns such as cloud_cover_percent, and a code snippet below that includes a SELECT statement referencing forecast_hourly_metric. The node is validated, and the interface shows options for creating, data loading, and reviewing results.
    8. Click Create and Run.

    9. In the Data panel, scroll over to temperature, and see that the value has been rounded to one decimal place.
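
    For reference, the statement Validate Select renders for this step looks something like the following (an abridged sketch; most columns are omitted and the exact output may differ):

      SELECT
        `forecast_hourly_metric`.`city_name` AS `city_name`,
        ROUND(`forecast_hourly_metric`.`temperature`, 1) AS `temperature`
        -- ...remaining columns
      FROM `forecast_hourly_metric`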

    Add a Dimension Node

    You’ll create a Dimension Node that will track how weather metrics change over time at each location. First we'll add some transformations to our stage Node to prep the data.

    1. Make sure you're in the stg_weather_metrics Node.

    2. Delete any columns you aren't using. First, un-sort the columns by clicking the Column Name header until the sort arrows disappear. Then click and drag the following columns so they are at the top:

      1. city_name
      2. country_code
      3. datetime_valid_local
      4. temperature
      5. humidity_relative
      6. wind_speed
      7. precipitation_type_desc
      8. cloud_cover_perc_total
    3. Click the first column to be deleted, hold Shift (or Command) and select the last column to be deleted, then right-click and choose Delete.

      A demo showing selecting multiple columns for deletion.
    4. Rename the column datetime_valid_local to measurement_time. Double-click the column name to rename it.

    5. Rename cloud_cover_perc_total to cloud_cover_percent.

    6. Add the following transformations to standardize the data:

      1. temperature - ROUND({{SRC}}, 1) to round to one decimal place, if not already added.
      2. humidity_relative - ROUND({{SRC}}, 1) to round to one decimal place.
      3. precipitation_type_desc - COALESCE({{SRC}}, 'None') to replace null values.
    7. Then double-click the Data Type for cloud_cover_percent and type INTEGER.

      A Coalesce interface screenshot shows a node named stg_weather_metrics with a table listing columns such as city_name, measurement_time, and humidity_relative. Each column has a Transform, Data Type, Source, and Nullable status. The bottom area shows pipeline validation status alongside Create and Run buttons.
    8. Click Validate Select to preview the SQL that will be run.

    9. Then click Create and Run.

    10. Return to your DAG and create a Dimension Node off the stg_weather_metrics. Name it dim_weather_metrics.

    11. Under Options, set the following columns as Business Keys: select each one in the box, then click the arrow to move it over.

      1. city_name
      2. country_code
      3. measurement_time
    12. In Change Tracking, select the following columns. The Node will record a new version of a row whenever one of these values changes (see the sketch after this list):

      1. temperature
      2. precipitation_type_desc
      3. wind_speed
      4. cloud_cover_percent
    13. Then click Create and Run.

    A Coalesce interface shows a node named stg_weather_metrics with a table listing columns, transforms, and data types. On the right, a Node Properties panel includes multi source, business key, and change tracking options. The bottom of the screen displays pipeline validation status and Create or Run buttons.
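
    Under the hood, the Business Keys identify each entity while the Change Tracking columns drive Type 2 history: the Node adds a surrogate key, dim_weather_metrics_key, and when a tracked column changes for an existing business key, a new version of the record is written. The sketch below illustrates the kind of change detection this implies; it is an assumption for illustration, not Coalesce's generated template:

      -- Illustrative change detection for the Type 2 dimension (not the actual template)
      SELECT s.*
      FROM stg_weather_metrics s
      LEFT JOIN dim_weather_metrics d
        ON s.city_name = d.city_name
        AND s.country_code = d.country_code
        AND s.measurement_time = d.measurement_time
      WHERE d.city_name IS NULL                        -- new business key
         OR s.temperature <> d.temperature             -- or a tracked column changed
         OR s.precipitation_type_desc <> d.precipitation_type_desc
         OR s.wind_speed <> d.wind_speed
         OR s.cloud_cover_percent <> d.cloud_cover_percent;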

    Create a Fact Node

    You'll create a Fact Node to track historical data. You'll also use the Join tab.

    1. Create a Stage Node called stg_weather_facts off the Dimension Node dim_weather_metrics.

    2. You'll need to create a Join. The Join tab accepts any valid SQL. Here we join the source data, forecast_hourly_metric, to dim_weather_metrics on matching city_name, country_code, and measurement_time. Because this is a LEFT JOIN, all source rows are returned even when there is no match; unmatched dimension columns will be null.

      FROM {{ ref('ACCUWEATHER', 'forecast_hourly_metric') }} `forecast_hourly_metric`
      LEFT JOIN {{ ref('<YOUR-STORAGE-LOCATION-NAME>', 'dim_weather_metrics') }} `dim_weather_metrics`
      ON `forecast_hourly_metric`.`city_name` = `dim_weather_metrics`.`city_name`
      AND `forecast_hourly_metric`.`country_code` = `dim_weather_metrics`.`country_code`
      AND `forecast_hourly_metric`.`datetime_valid_local` = `dim_weather_metrics`.`measurement_time`
      A Coalesce interface displays a node named stg_weather_facts. The Join tab is selected, showing a SQL statement that pulls data from different sources, including forecast_hourly_metric. The bottom section indicates the pipeline is validated, with Create and Run buttons nearby.
    3. Create and Run the Node.

    4. Click on the Browser tab to see that stg_weather_facts references both the source and the Dimension Node.

      A Coalesce interface displays a pipeline with four nodes arranged left to right: forecast_hourly_metrics, stg_weather_metrics, dim_weather_metrics, and stg_weather_facts. They are connected by lines, with a top menu featuring build, deploy, and docs options, and a nodes list on the left panel.
    5. Create a Fact Node from the stg_weather_facts.

    6. Name it fct_weather_facts.

    7. Set the Business Keys as:

      1. city_name
      2. country_code
      3. measurement_time
    8. Create and Run the Node.
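
    After the run completes, you can sanity-check the grain in Databricks: the Business Keys should uniquely identify every row in the fact table, so this query (an assumed check, not part of the tutorial pipeline) should return no rows:

      -- Any rows returned would mean duplicate business keys in the fact table
      SELECT city_name, country_code, measurement_time, COUNT(*) AS row_count
      FROM fct_weather_facts
      GROUP BY city_name, country_code, measurement_time
      HAVING COUNT(*) > 1;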

    A Coalesce interface displays four rectangular nodes arranged in a horizontal workflow, labeled forecast_hourly_metrics, stg_weather_metrics, dim_weather_metrics, and stg_weather_facts. A left panel lists available nodes by category, and a top navigation bar includes options like Browser, Build, and Deploy, with a Run All button in the upper right corner.
    The final DAG

    Wrap Up

    Here's what you did:

    You joined the source Node to stg_weather_facts to pull in raw measurements like humidity and precipitation_type_desc. This is the data you are tracking.

    stg_weather_facts carries the dim_weather_metrics_key, which ensures each row ties to the correct historical weather record. Another way of saying this is: "On this day, time, and place, the weather description was X." For example, "This forecast measurement ties to dim_weather_metrics_key 124 in the fact Node, and that record said it was cloudy, 68°F, and humid at that time."

    Databricks Query

    You can also query this in Databricks.

    Avg humidity by cloud cover percent

    SELECT
      d.cloud_cover_percent,
      COUNT(*) AS measurement_count,
      ROUND(AVG(f.humidity_relative), 1) AS avg_humidity
    FROM fct_weather_facts f
    JOIN dim_weather_metrics d
      ON f.dim_weather_metrics_key = d.dim_weather_metrics_key
    GROUP BY d.cloud_cover_percent
    ORDER BY d.cloud_cover_percent;
    Daily temperature trends by city

    SELECT
      f.measurement_time::DATE AS date,
      f.city_name,
      ROUND(AVG(d.temperature), 1) AS avg_temp
    FROM fct_weather_facts f
    JOIN dim_weather_metrics d
      ON f.dim_weather_metrics_key = d.dim_weather_metrics_key
    GROUP BY date, f.city_name
    ORDER BY date, f.city_name;
    A table labeled Raw results displays columns for date, city_name, and average temperature. The rows show entries for 2024-07-11 from various cities such as amsterdam and athens, each with numeric temperature values. A note at the bottom indicates 1 s 378 ms with 300 rows returned.