Batching Database Changes
This is the third post in a series we’re calling CI/CD:DB. Read from the beginning.
In part 1 and part 2, we worked to understand the problems associated with bringing database changes into a CI/CD flow and break it into manageable and, hopefully, solvable chunks.
In this part of the series, we begin to address those chunks one at a time. With that in mind, the first thing we have to do is to get our inbound changes into some sort of structure that we can handle. If we cannot get this first piece organized, we will fall victim to the lesser discussed truth of computing — “chaos in, chaos out”. (Closely related to “garbage in, garbage out” — but with more downtime.)
Changes and batches
First, we have to understand what a ‘change’ is and, from there, what a ‘batch’ is. For the purposes of a database, a change is simply a command that causes the contents of the database to be different after the command is executed. So, an ‘ALTER’ would typically be a change, but a ‘SELECT’ would not. (Well… unless you added an INTO, anyway.) In this context, a single change is an easier concept. Agile and DevOps practices like CI/CD get interesting when it comes to the notion of “batch size”.
The term “batch” comes from classic manufacturing. The notion is that you wanted to produce a large number of items. To do so at scale, you set up a machine to do the job, and then the machine would take identical raw material in and churn identical post-worked pieces out. You would do this in ‘batches’ because it was costly and time-consuming to set the machine up. This approach is economical when all the inputs and outputs are the same — whether they are cookies or car parts. With software and databases, you are never making the exact same change twice. Therefore, every change is effectively a unique batch of one.
SQL scripts are NOT automatically batches
The idea that every change is effectively a unique batch of one usually stimulates a conversation. Changes are typically packaged and tracked as SQL files but they can contain an unlimited number of statements. Also, the nature of the contained statements can alter how the SQL file is perceived.
For example, you perform 10 INSERTs into a table in a single SQL file.
- Is that one change or one batch of 10 changes?
- What if they were 10 ALTER statements?
- What if they were 10 ALTER statements, but three were for one table, five were for a second table and the remaining two were for a third table?
Extending the problem is the scenario where there are multiple SQL files and each contains a number of potentially unrelated changes.
These scenarios illustrate a common problem:
“We just track and review SQL scripts and then someone runs them against the database.”
This approach can be so wildly variable that it is very difficult and time-consuming to manage. This is why database change management tools such as Liquibase are precise in defining the boundaries of any given change they are tracking.
Explicitly defining what a “single change” means
Given the variable nature of database change types and the subjective nature of what a ‘single change’ is, the first step is to establish rules for what constitutes a single change.
For example, you might declare something like the following:
- Each DDL operation is a change.
- If multiple DDL operations on the same table are desired, each of them is still an individual change.
- Operations on stored logic objects, such as stored procedures, functions, etc. are represented as one stored logic object per script (similar to the best practice of ‘only one Class per .java file). Each stored logic script is a change.
- Multiple DML operations on the same table AND in the same SQL file can be considered a single change.
- Once a change has been applied to any database (including non-production databases) as part of a batch, it cannot be modified. It must be replaced or corrected with another change in a subsequent batch.
- And so on…
Once you have established the definition of a ‘single change’, you must stringently enforce it — ideally by technological means. Get and use a code analysis tool. If something does not pass, the change contributor must refactor the submission — no exceptions. That may sound harsh, but remember that these change scripts are intended to run against your Production database someday. So, why is it OK to let something slide?
Batches: Grouping database changes
It’s important to remember that because each specific change in software is effectively a one-off, the “ideal” batch size is exactly one. The idea is that there will be no dependencies within a group of changes if there is only one. It will either work or not. If it does not work, it will be quickly apparent and it will be easy to find and fix because only one thing changed. This line of thinking underpins Continuous Integration, Continuous Delivery, and other practice doctrines within the realm of DevOps principles.
As a practical matter, one-change batches are difficult to achieve in most organizations. There will need to be some compromises to the ideal— particularly in the early days of adopting the new practices. The key point is that smaller batches are better.
Building a batch of database changes
The application development world and common CI practices give us a clear pattern to use for defining batches. Every CI build, by definition, represents the production of a batch. Code changes come into a particular branch of the source code repository and a build engine picks them up, checks them over, builds them, runs some tests, and then, hopefully, delivers a compiled binary that includes the incremental new batch of changes.
To do this with database changes, we need to specify how we are going to do certain things every time we create a batch — and do them automatically as we do for a CI build for code.
- Collect changes
The most basic task is to decide how to collect the database changes. Database changes typically come in as code (.sql files), so they should be handled by a Source Code Management (SCM) system, such as Git. There are plenty of well-established patterns for merging changes in an SCM and there is no need to reinvent them for database changes. (This may be obvious to some, but it bears repeating because there are many organizations where these files are processed as attachments to Change Management tickets by people who never use an SCM.)
Good practices are key for identifying new changes and ignoring old ones. SCM tools are great at facilitating this, but unlike application code, where the same files evolve over time through many builds, most database changes cannot be rerun. Therefore the contents of a .sql file cannot persist between successful runs of the CI process. The change collection phase must specifically account for this.
- Inspect changes
Each inbound change must be verified for compliance with the agreed rules defining a ‘single change’. The CI process is an ideal time to use automation to scan and verify compliance. It also explicitly has an exception handling procedure — ‘breaking the build’ — to deal with problems.
- Organize changes into a batch
Once changes pass inspection, they can be organized into a batch. This is the heart of the CI process where the inbound changes are assembled and a specification is created for their assumed sequence when the batch is processed against a target database.
The sequence can be declared by any means the team sees fit — everything from explicit manual ordering to simple automated First-in/First-out processing and everything in between. The important thing is that the sequence is as intrinsic to the batch of changes as the changes themselves. This is a crucial difference between database changes and application changes — database changes are cumulative.
This is why Liquibase maintains its core changelog file. Ideally, this sequence should be tested in some way before the CI process declares that the batch has been successfully created.
- Create an immutable artifact
Once you have created a ‘build’ of some application code, a binary or similar product exists. That binary represents the prior changes plus the new ones added as part of the build and is not going to change. In other words, the end product is important. To make a change, you would have to modify the source code, run a new build of the software, and get a new binary.
The CI process for the database changes also needs to produce an equivalent artifact. It can be as simple as putting the batch definition into a ZIP file or it can be more complex. Whatever the containment solution, each batch’s contents must be known and remain unchanged once defined.
Any change to something in a defined batch — even if it is just to override one thing — needs to go into a new batch. This way batches, just like individual changes, can be identified at all times in all areas of the CI/CD pipeline.
- Track changes
The point of being able to identify changes and batches individually is so that they are trackable. You can track who created them, which batch a change is in, which databases where a change has been applied, and so on.
Precise tracking is important for people managing the CI/CD pipeline, but it is absolutely critical to the automation around the pipeline. This tracking data is what allows the automation to decide whether changes are in an environment or not, whether to run or rerun a change, whether there are any gaps in the prior change history, and so on. In other words, without good tracking, automation cannot function.
Summing it up
We’ve covered how to define database changes and creating batches in source control so that database changes are usable and trackable.
The next problem we’ll tackle in CI/CD for databases is how to wrap automation around these batches of database changes. As we have learned, CI provides guidance, but each phase of batch creation brings challenges for automation. We will be digging into those more in Part IV: Automating Database Builds.