DataOps Best Practices with Git and Coalesce
Overview
What Is DataOps?
DataOps has emerged in recent years as an approach to improving the speed, efficiency, and quality of data analytics processes. It is a methodology that combines cross-functional teams, automated data workflows, and standardized data management practices to quickly and effectively process, analyze, and deliver data-driven insights.
In this guide, you will find best practices for applying the principles of DataOps within the context of Coalesce and how to best implement them by fully leveraging our Git integration for version control.
By saving (or committing) your code changes in Coalesce to Git, you ensure that previous versions of your data warehouse are preserved. These versions are accessible through a dedicated Git repository, which is a collection of files that stores the different versions of your data warehouse.
We strongly recommend setting up Git within your Coalesce instance to perform proper version control and avoid losing your work.
Introduction to Git
What Is Git?
Git is a widely used distributed version control system for storing, branching, and merging developers' work. Teams typically keep a shared copy of their code (known as a Git repository) on a hosted provider such as GitHub, Bitbucket, GitLab, or Azure DevOps. This shared repository is copied (cloned) to a Git client, such as Coalesce, and used to manage development operations.
How Does Git Work?
Git allows developers working on the default / live version of their data warehouse (the main branch) to isolate code changes by creating new branches of work. Branches are then merged together to incorporate new changes into the main branch. All deployments to non-development environments (for example, Test / Production) are performed from the main branch.
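To make the idea of isolating and merging changes concrete, here is a minimal sketch of branches as named pointers to commits. It is a toy in-memory model, not real Git and not how Coalesce stores anything; `ToyRepo`, `Commit`, and the branch name `feature/new-nodes` are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass
from itertools import count


@dataclass(frozen=True)
class Commit:
    """One saved snapshot; `parents` link it to the history it builds on."""
    id: int
    message: str
    parents: tuple = ()


class ToyRepo:
    """Hypothetical in-memory stand-in for a Git repository (not real Git)."""

    def __init__(self):
        self._ids = count(1)
        root = Commit(next(self._ids), "initial commit")
        self.branches = {"main": root}  # branch name -> latest commit

    def branch(self, name, start="main"):
        # A new branch is just a new name pointing at an existing commit.
        self.branches[name] = self.branches[start]

    def commit(self, branch, message):
        # Committing moves only the named branch forward.
        parent = self.branches[branch]
        self.branches[branch] = Commit(next(self._ids), message, (parent,))

    def merge(self, source, target="main"):
        # A merge commit has two parents: the target tip and the source tip.
        tips = (self.branches[target], self.branches[source])
        message = f"merge {source} into {target}"
        self.branches[target] = Commit(next(self._ids), message, tips)

    def history(self, branch):
        # Walk every parent link and return all reachable commit messages.
        seen, stack = {}, [self.branches[branch]]
        while stack:
            c = stack.pop()
            if c.id not in seen:
                seen[c.id] = c.message
                stack.extend(c.parents)
        return [seen[i] for i in sorted(seen)]


repo = ToyRepo()
repo.branch("feature/new-nodes")
repo.commit("feature/new-nodes", "add staging nodes")
isolated = repo.history("main")   # main has not changed yet
repo.merge("feature/new-nodes")
combined = repo.history("main")   # main now includes the feature work
print(isolated)
print(combined)
```

Until the merge, work committed on the feature branch is invisible from `main`; after the merge, the history of `main` contains it.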
Branching and Merging
Branching provides flexibility in team environments by separating development from deployment. Types of branches include:
- An integration branch, which is used to merge development work for system testing and reviews.
- A feature branch, which is a temporary branch created from the integration branch.
  - A feature branch tracks all changes that make up a "deployable unit of work" (for example, a new or modified pipeline of nodes). All development should be committed and unit tested in a feature branch.
  - Several feature branches can be worked on in parallel by different members of the team.
  - When development is complete and ready for integration, a feature branch is merged back into the integration branch (or the main branch) for system testing and review.
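The life cycle of a feature branch described above can be sketched as an ordered list of standard Git commands. This is one possible sequence, not a prescribed Coalesce workflow: the branch names `integration` and `feature/customer-pipeline` are hypothetical, and deleting the feature branch after the merge is an assumption based on it being temporary. The function only builds the commands; it does not run them.

```python
import shlex


def feature_workflow(feature, integration="integration", message="Describe the change"):
    """Return the Git commands, in order, for one feature branch life cycle."""
    return [
        # Start the temporary feature branch from the integration branch.
        ["git", "checkout", "-b", feature, integration],
        # Stage and commit the deployable unit of work on the feature branch.
        ["git", "add", "--all"],
        ["git", "commit", "-m", message],
        # Switch back and merge the finished feature into integration.
        ["git", "checkout", integration],
        ["git", "merge", "--no-ff", feature],
        # Assumption: the feature branch is deleted once it has been merged.
        ["git", "branch", "-d", feature],
    ]


commands = feature_workflow(
    "feature/customer-pipeline", message="Add customer pipeline nodes"
)
for argv in commands:
    # shlex.join quotes arguments so each line is safe to paste into a shell.
    print(shlex.join(argv))
```

Keeping the commands as argument lists makes it easy to review the sequence first and, if desired, hand each entry to `subprocess.run` later.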
Git Permissions
Git permissions and approvals are set up directly on your Git platform (not through Coalesce). Git pipelines (also known as actions or runners) can be configured together with Coalesce's command-line interface (CLI) to automate deployments and executions.
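As an illustration of what such a pipeline might check before deploying, the sketch below enforces the rule that non-development environments are deployed only from the main branch. Everything here is an assumption made for the example: the environment names, the helper names `may_deploy` and `check_pipeline`, and reading the branch from `GITHUB_REF_NAME` (the variable GitHub Actions provides; other providers use different ones). The Coalesce CLI call itself is deliberately left out; see the CLI documentation for the exact command.

```python
import os

# Assumption: environment names, and which of them count as development.
DEVELOPMENT_ENVIRONMENTS = {"development"}
MAIN_BRANCH = "main"


def may_deploy(branch, environment):
    """Allow any branch to reach development; only main may go further."""
    if environment.lower() in DEVELOPMENT_ENVIRONMENTS:
        return True
    return branch == MAIN_BRANCH


def check_pipeline(environment, env=os.environ):
    # Assumption: the pipeline exposes the branch name in GITHUB_REF_NAME,
    # as GitHub Actions does; other providers use different variables.
    branch = env.get("GITHUB_REF_NAME", "")
    if not may_deploy(branch, environment):
        # Stop the pipeline with a non-zero exit status and a clear message.
        raise SystemExit(f"Refusing to deploy {branch!r} to {environment}")
    return f"OK to deploy {branch} to {environment}"


print(check_pipeline("Production", env={"GITHUB_REF_NAME": "main"}))
```

A pipeline step could run a check like this first and invoke the deployment only if it succeeds, so that a feature branch never reaches Test or Production by accident.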
Git operations such as pull requests and merge conflict resolution can be done through Coalesce, or outside the platform using Git commands or other clients (for example, VS Code).
Using Version Control in Coalesce
1. Create a Project
The first step in setting up version control is to create a Project in Coalesce. Projects are logical groupings of data warehousing efforts in Coalesce, and a good way to organize your work by a particular initiative or team focus.
When you create a Project, you'll be asked for a Git repository URL. Git hosting services (for example, GitHub) use repository URLs to point to a Git repository. Although the repository is hosted centrally, everyone who clones it still has a full copy on their own machine. Coalesce generates and manages metadata, which is converted to SQL code and executed; it is this metadata that is stored in Git as YAML files.
After adding the Git repository URL, you'll be asked to add your Git credentials. Your credentials are similar to a username and password, and are a way of controlling access to the Git repository.
2. Create a Workspace Within Your Project
Workspaces are dedicated development areas within Coalesce where you can build out your data warehouse and apply transformations to data.
When you create a Workspace, you'll be asked to select a Git branch and create a new branch from it.
Branches are one of the most powerful features of Git, as they allow different people to work on the same items at the same time, and later combine (merge) their changes. This is particularly useful if you need to work on a hotfix for your data warehouse, or need to add new items to it that shouldn't be exposed to end users just yet.
With feature branches, some team members can make changes to the main data warehouse while others work on hotfixes or new features. Any changes made to the main data warehouse can be synced with the new features, and vice versa.
3. Transform Data in Your Workspace
Once your Workspace has been created, you can start creating or modifying your data warehouse. Although your Workspace is linked to Git, Coalesce doesn't automatically commit your changes to Git. Your changes are still saved outside of Git, so you won't risk losing your work, but it is a good idea to commit them when you reach a good stopping point (for example, after creating one or more nodes, mapping storage locations, building a subgraph, or building a job). You should also consider committing if you need to work in another Workspace, or if you want to make your work visible to other members of your team.