Automating Database Builds
This is the fourth post in a series we’re calling CI/CD:DB. Read from the beginning.
In our previous blog, we defined our changes and how we will arrange and group them. Now, we need to get serious about what we are going to do with the groups or ‘batches’.
Continuous Integration, the “CI” in CI/CD, provides good guidance for frequently ‘building’ your code and applying automation and quality to that process. We run into a bit of trouble here because there is no ‘build’ for database code — it really is a set of scripts you run in a database engine. Even so, we can absolutely provide a parallel. In this post, we’ll focus on the core considerations for automatically processing database changes in a CI-like fashion.
The core CI cycle
The classic CI cycle is a setup where, whenever new code is added to a source repository, a tool automatically does an integration build of the codebase. If the build is successful, it is put in the stack of ‘testable’ builds. If the build is not successful, it notifies everyone that the codebase is ‘broken’ and the team has to correct that break before it can proceed. So, no changes can get into the testing pipeline without passing the CI process successfully and everyone knows it. That is a powerful paradigm and creates tremendous leverage around the definition of ‘successful build’.
The core CI cycle can therefore be broken down into three basic steps.
- Get the code.
- Process the code.
- Deliver the code that’s been processed to a place it can be consumed.
Retrieving the batch from source control
In most application code CI scenarios, the process of retrieving the batch is usually an exercise in a source control system. The simplest answer is ‘newest versions of files in this branch’. That is fine for application code that can be recompiled at will. In fact, you can have many branches of application code — each with its own build — and not worry about one breaking another until you try to merge the code.
Unfortunately, batch retrieval for database changes is significantly more rigid. You will likely not be able to just use the latest/newest versions of files in the branch. You will also need additional, scripted intelligence to manage and enforce your batch definitions. There are two fundamental reasons for this.
First, database code is a lot less forgiving when it comes to branching. That comes in part from its nature as a stateful, order-specific, additive definition of the database’s structure that makes merging after it has been applied to a database very difficult. Branching limitations are also driven by the fact that very few teams have enough databases to give each branch its own testing sandbox. As a result, the smartest approach is a simple structure — ideally just one branch — where database changes get merged into your batches before they are retrieved for processing. This is why we spent all of the last post discussing changes and batches — changes and batches are infinitely cumulative in target databases.
The second key area to be aware of is that the dividing lines in batches matter a lot more and need to be enforced if automation is going to flow quickly and easily. While a batch is being processed, the system must NOT pick up anything from a subsequent batch. That does not mean it can’t process more than one batch, but due to the nature of database changes, they must be handled serially. While there is some inefficiency here — for example, some would argue that ‘if they are on different objects, they could be parallel’ — the automation would usually be done long before someone could figure out if parallelism was safe. Best stated, the principle here is: Never ingest a partial batch. Changes and batches are serially cumulative and order matters.
This need to serialize changes is why database change management tools all have a specific ordering structure; be it a simplistic naming convention or a more sophisticated construct like Liquibase’s ChangeLog which supports adding metadata for grouping changes. Each approach provides a clean sequence as well as a means for programmatically identifying batch beginning and ending.
Processing database changes
Processing batches of database changes involves running through each included change in each batch in the prescribed order and doing some level of verification on each in order to certify them as ‘good enough to test for production readiness’. The simplest way to process them is to run them in their designated order into the first database in the pipeline. This run is the first rehearsal for deploying those changes in production. If that deployment works and that database is nominally production-representative, you know that the change should work equally well in Production. This is about as close an equivalent process to ‘make sure it compiles’ as exists in the database world.
At first this sounds like a simple, iterative process, but databases have more going on than simple structural changes. In addition to the structural changes, databases also have programmed logic (stored procedures, functions, triggers, etc.) and the data that gives them their name. These three broad categories of changes must be processed in order and then each individual change within each category must be processed in a designated order. For example:
- Structural changes go first
- Structural change 1
- Structural change 2 (depends on structural change 1)
- Structural change 3
- Programmable logic goes second — may depend on structural changes
- Prog. logic 1
- Prog. logic 2
- Prog. logic 3 (depends on prog. logic 1 and structural change 3)
- Data updates go last — may depend on structure, programmable logic, or both
- Data update 1
- Data update 2 (depends on structural change 2 and prog logic 2)
Assuming the type and order of the changes were specified during batch creation, then the automation process should be able to follow its instructions to run through the batch relatively easily.
Database batch problems
The last thing the Processing phase must handle is when there is a problem with the batch. For example, one of the changes fails with an error when applied to the database. There must be a way to reset the database to its state before the Processing stage. This is crucial — if the batch deployment in the lower environment is to be a rehearsal for the eventual deployment to Production, it has to be redone from the start of that batch. Remember that when the batch is deployed into Production, Production will be at the before state — not the partial state. This is another example of why smaller batches are better.
Backing changes out of a database can be difficult. The traditional approach is to ‘rollback’ each of the changes in the batch in reverse order until the start of the batch is reached. If the team has a good discipline of defining the back-out process for each change while defining the batch, then that will work. Unfortunately, ‘rollback’ falls short in the case of data and data transforms performed by programmed logic during a load and the rollback process itself might have a problem. Fortunately, in the 40 years since database rollback became a tradition, technology has provided better alternatives in the form of Cloud or VM snapshots, Storage snapshot technology, features in database engines such as Oracle Flashback, and even some bespoke tools such as Delphix.
Delivering the post-processing results
Once the batches have been processed and marked ‘done’, they need to be delivered to a known point so that they can be consumed in the pipeline for delivery into testing environments and, eventually, deployment to production. That means a structure that isolates the batch or batches as having been processed and then a record of all the changes in an executable form. This can be a ZIP file of the scripts or something more sophisticated. Bespoke database change automation tools, for example, allow you to have a specific snapshot point of your changes, ensure that batches are not skipped on any target databases, and enable you to reconcile change records in any given target database with the expected list from your batches.
Whatever the mechanism, when post-processing is complete, the object delivered must be separated from the inlet point. It must be standalone and deployable to any target database without requiring a recheck with anything from the pre-processing world. This is crucial for ensuring consistency during deployment. We will discuss that in a future post, but it reflects the core CI/CD principle of ‘build once/deploy many’.
Summing it up
By accommodating the specific nature of database code, exploiting good batch definition, and changing a few paradigms, a CI process for database changes can reliably deliver deployable groups of database changes that we can control and track through our pipelines just like we do our application code changes. However, just as application code developers learned, CI is a lot more than getting working packages of code (which are binaries in their case). It also provides an opportunity to improve the overall quality of the changes — getting them right the first time — which reduces rework and improves the productivity of everyone involved. We will delve into the quality dimension of the CI process in our next post.
A Guide for Bringing Database Changes into Your CI/CD Pipeline
Automate BigQuery schema change and version control with database DevOps
Google's BigQuery is a fully managed, serverless cloud data warehouse, or database as a service (DBaaS), that brings unparalleled scalability and convenience to data analytics.