How to Handle Problems with Database Changes
This is the eleventh post in a series we’re calling CI/CD:DB. Read from the beginning.
CI/CD is an engineering practice geared for change management. With any such process, no matter how automated, you must address what to do when something goes wrong. This is particularly important because software systems are made of many parts and a problem with just one of them can mean that any or all of the other parts have to be modified in some way to get the system to a running state. So, the owners of the processes for each piece, such as the databases, must have a clear plan for what they will do in case of a problem.
When you bring up the topic of dealing with problematic database changes in a software delivery pipeline, one of the first things to come up is the contrast between “Rollback” and “Roll Forward”, sometimes called “Fix Forward”. (Read our whitepaper on rollback vs. fix forward.)
This is a debate between a classic IT management solution for handling problems in Change Management and a newer speed-oriented approach. A rollback approach brings discipline but is a poor framework for understanding how to deal with problem changes in databases. A roll forward approach is a better structure for dealing with the technical nature of databases but can be difficult to adopt because it is perceived to lack discipline. The truth is that both have lessons for us, so it is worth looking closer at each approach.
“Rollback” is the most common approach for IT shops dealing with bad database changes. The idea of Rollback is that you can simply “undo” a change. For an application binary, that can be as simple as replacing the new, changed binary with an older one. While it is a compelling idea to ‘just put it back the way it was and try again later’ — to “roll back” the changes — it is not always possible or preferable where database changes are concerned due to two fundamental truths about database changes.
All database changes are cumulative
First, databases are stateful. Any action that adds or removes changes is, in effect, a forward change. This leads to a slightly academic conversation of whether databases can ever truly be rolled back or if they are really rolling forward to a new state that resembles an older state they previously held. While this is conceptually academic, it has technical ramifications for how problem handling processes must be designed.
Database changes may not be directly reversible
Second, “Rollback” or “Undo” implies the direct reversal of changes. On the surface, that would seem to be great for databases and their always cumulative changes. However, because there is data in databases — not just structure and logic — the order in which you add changes may not be easily reversed. Even when the changes are directly reversible, it may be less efficient to do so when compared to simply applying a new fixing change.
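To make that concrete, here is a minimal sketch using Python's standard-library `sqlite3` module. The `customers` table, its `fax` column, and the sample row are hypothetical, invented only for illustration. The forward change drops a column, and the attempted "undo" adds it back. The structure returns, but the data does not.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, fax TEXT)")
conn.execute("INSERT INTO customers VALUES (1, 'Ada', '555-0100')")
conn.commit()

# Forward change: remove the fax column by rebuilding the table.
conn.executescript("""
    CREATE TABLE customers_new (id INTEGER PRIMARY KEY, name TEXT);
    INSERT INTO customers_new (id, name) SELECT id, name FROM customers;
    DROP TABLE customers;
    ALTER TABLE customers_new RENAME TO customers;
""")

# Attempted "rollback": put the column back.
conn.execute("ALTER TABLE customers ADD COLUMN fax TEXT")

columns = [row[1] for row in conn.execute("PRAGMA table_info(customers)")]
restored_row = conn.execute("SELECT id, name, fax FROM customers").fetchone()

print(columns)       # ['id', 'name', 'fax']  -> the structure is back
print(restored_row)  # (1, 'Ada', None)       -> the fax number is gone
```

Getting the fax number back would take a separate step, such as restoring from a backup or a copy taken before the change. That step is itself another forward change.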
But we have to have a rollback plan…
Rollback shows up in most organizations because it has been around for a very long time. Classic IT management approaches rely on rollback as the safety net for what happens when a database change goes wrong. Those management approaches were invented in an era when system architectures were generally more monolithic, things were less automated, and patterns such as ‘blue/green’ or ‘canary’ did not yet exist.
The classic approaches are correct in that you definitely should have a plan to deal with problems. However, they are (or are frequently implemented to be) very prescriptive that the remedy must be a ‘rollback’. Many organizations have an inflexible dependence on the old IT management frameworks and therefore force a rollback-centered solution even while admitting it is technically dubious for a database. This leads teams either to force-fit reversal scripts or to create remediations that go ‘forward’ while claiming and documenting them as rollback plans. Neither practice is particularly healthy.
Roll forward / fix forward
A popular contrast to Rollback is “Roll Forward”, sometimes called “Fix Forward”. This approach prioritizes getting the new business value of the changes and features into the hands of the users; the opportunity cost of waiting until the next release cycle is unacceptable. Therefore, rather than reverting to an older configuration to deal with problems, the team should diagnose the problem and be good enough at applying changes so they can push a fix to the newly pushed, but problematic, version of the system quickly.
Technically speaking, from a database perspective, “Roll Forward” or “Fix Forward” makes a lot of sense. It better reflects how databases work. It takes into account the cumulative nature of database changes and how the forward sequence of database changes is not necessarily the backward sequence.
There are two main pitfalls of a “Roll Forward” / “Fix Forward” approach:
The first problem is cultural. Many teams struggle with the raw concept of having a “Roll BACK” plan for the application pieces and a “Roll FORWARD” plan for the database — even if the end state of the database is backward to where it started pre-change. This may sound a bit silly, but it is a very serious cultural issue for many organizations.
The second, more serious pitfall concerns rollback doctrines. The old “Rollback” doctrine enforced having a plan, but typical “Roll Forward” / “Fix Forward” discussions have no such codified structure. Fixing forward is often seen, and too often implemented, as an exception-based approach that does not require a plan. Too many teams treat it as a reactive situation where they just figure it out and slap a fix in. That is not a great approach, and it gives “Roll Forward” / “Fix Forward” a reputation that makes organizations uncomfortable about, and therefore resistant to, adopting it.
Pragmatic problem handling
It is easy to get pulled into the “Back versus Forward” discussion, but that does not really solve the problem. Rather than obsessing about one approach or the other, focus instead on building a set of layered defenses for your pipeline so that your team can efficiently deal with problem changes holistically. Working from the beginning of the pipeline forward to production, we can think about the following:
- Shifting problem detection left
- Considering fixes as quality checks
- Deliberately handling exceptions
Shift left to avoid problems
“Shift Left” is a common mantra in CI/CD. It tackles how to detect problems or deficiencies in inbound changes at the very beginning of the pipeline. The premise is simple; the sooner you can detect a problem, the less impact it will have, and therefore less rework needs to be done. This is why earlier entries in this series focused on ensuring change quality and using a build-like process as a screen for inbound changes. If you can detect problems before they ever get into the pipeline, then the database changes, at least, will not be the cause of exceptions when delivering changes to your system.
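As a rough illustration of screening inbound changes, the sketch below flags risky statements before they enter the pipeline. The two rules, the `screen_change` function, and the sample SQL are all hypothetical. A real screen would use whatever rules your team has agreed on, and probably a proper SQL parser rather than regular expressions.

```python
import re
from typing import List, NamedTuple


class Finding(NamedTuple):
    rule: str
    statement: str


# Hypothetical screening rules: (rule name, pattern that flags a statement).
RULES = [
    ("drop-table", re.compile(r"^\s*DROP\s+TABLE\b", re.IGNORECASE)),
    ("delete-without-where", re.compile(r"^\s*DELETE\s+FROM\s+\w+\s*$", re.IGNORECASE)),
]


def screen_change(sql: str) -> List[Finding]:
    """Return one finding for each statement that matches a screening rule."""
    findings = []
    # A naive split on ';' is enough for a sketch; it does not handle
    # semicolons inside string literals.
    for statement in (s.strip() for s in sql.split(";")):
        if not statement:
            continue
        for name, pattern in RULES:
            if pattern.search(statement):
                findings.append(Finding(name, statement))
    return findings


inbound = "ALTER TABLE orders ADD COLUMN note TEXT; DELETE FROM orders; DROP TABLE audit_log;"
for finding in screen_change(inbound):
    print(f"{finding.rule}: {finding.statement}")
```

Run against the sample, this reports the unqualified `DELETE` and the `DROP TABLE` and lets the `ALTER TABLE` through.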
Consider fixes as quality checks
However good you are at screening for problems early in your pipeline, you have to prepare for the fact that some things will still go wrong. In fact, you should expect things to go wrong early in a development pipeline. After all, that is where people run experiments for new features and solutions. A fair number of those experiments will fail and require cleanup. Consider that a serious change delivery problem might happen in production once or twice a year, while similar problems in the integration environment can happen several times a week. The ones in the ‘lower’ environments are less visible but can take just as much time for someone to clean up.
With the work in the early environments in mind, a part of your pipeline’s quality checks should include verifying that there is a remediation script with every new database change or batch of changes. It does not matter if that script goes ‘back’ or ‘forward’ — just that it remediates the problem. Additionally, you should deliberately test the remediation scripts in an early phase of the pipeline. As you progressively apply this discipline in your CI/CD pipeline, you will discover that you spend less time keeping the pipeline tidy and, should something happen in production where the DB changes have to be remediated, you will have very high confidence that it will work because it has already been tested.
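The two checks described above might look something like the following sketch. It assumes a hypothetical `Change` record that carries its own remediation script, and it uses an in-memory SQLite database as a stand-in for a scratch environment. The health check is supplied by the caller, since the remediation may go ‘back’ or ‘forward’.

```python
import sqlite3
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Change:
    change_id: str
    apply_sql: str
    remediation_sql: str = ""  # may go 'back' or 'forward'


def missing_remediation(changes: List[Change]) -> List[str]:
    """IDs of changes that arrive without any remediation script."""
    return [c.change_id for c in changes if not c.remediation_sql.strip()]


def remediation_works(baseline_sql: str, change: Change,
                      healthy: Callable[[sqlite3.Connection], bool]) -> bool:
    """Apply the change and then its remediation on a scratch database,
    and report whether the result passes the caller's health check."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.executescript(baseline_sql)
        conn.executescript(change.apply_sql)
        conn.executescript(change.remediation_sql)
        return healthy(conn)
    except sqlite3.Error:
        return False  # any script that fails to run fails the check
    finally:
        conn.close()


# Hypothetical baseline schema and inbound changes.
baseline = "CREATE TABLE orders (id INTEGER PRIMARY KEY, status TEXT);"
changes = [
    Change("add-status-index",
           "CREATE INDEX idx_orders_status ON orders(status);",
           "DROP INDEX idx_orders_status;"),
    Change("add-audit-table", "CREATE TABLE audit (id INTEGER PRIMARY KEY);"),
]


def index_is_gone(conn: sqlite3.Connection) -> bool:
    row = conn.execute(
        "SELECT COUNT(*) FROM sqlite_master "
        "WHERE type = 'index' AND name = 'idx_orders_status'"
    ).fetchone()
    return row[0] == 0


print(missing_remediation(changes))                            # ['add-audit-table']
print(remediation_works(baseline, changes[0], index_is_gone))  # True
```

In a real pipeline the first check would fail the build for `add-audit-table`, and the second would run against a disposable copy of the actual target database rather than SQLite.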
Deliberately handling exceptions
Inevitably, there is a worst-case scenario where a new fixing change is needed because something unforeseen has happened. This will require a rapid response and some way to get a database change through the pipeline very quickly. Deliberately designing a ‘hotfix’ path into your pipeline ensures that you have the means to handle the situation in an organized way. There should be no need for heroic effort or high-risk, manual hacks.
The hotfix path should answer two questions:
- What is the minimum bar for acceptable checks we must have in place for a database change to bypass the pipeline and go to production?
- How are we capturing the hotfix change so that it goes back through the pipeline to avoid regression and ensure quality?
Minimally, these two questions should be answerable by the ‘shift left’ checks mentioned above. Those checks are theoretically the minimum standard for a database change to get into the pipeline and should apply as the minimum safety standard for an emergency or hotfix change. Additional checks are obviously possible if required, but that requirement might be an indication that the shift left process needs to be enhanced.
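One way to picture a hotfix path that answers both questions is the sketch below. The `HotfixPath` class, the two sample checks, and the change IDs are hypothetical. The point is only that the expedited route reuses a named set of minimum checks and records every hotfix it lets through, so the change can be replayed through the full pipeline later.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

Check = Callable[[str], bool]  # takes the change SQL, returns True if it passes


@dataclass
class HotfixPath:
    minimum_checks: Dict[str, Check]
    replay_queue: List[str] = field(default_factory=list)

    def submit(self, change_id: str, sql: str) -> List[str]:
        """Run the minimum checks and return the names of any that failed.
        An empty list means the hotfix may proceed to production."""
        failed = [name for name, check in self.minimum_checks.items()
                  if not check(sql)]
        if not failed:
            # Record the hotfix so it also goes back through the full pipeline.
            self.replay_queue.append(change_id)
        return failed


# Hypothetical minimum bar, standing in for your 'shift left' checks.
path = HotfixPath(minimum_checks={
    "not-empty": lambda sql: bool(sql.strip()),
    "no-drop-table": lambda sql: "drop table" not in sql.lower(),
})

accepted = path.submit("hotfix-101",
                       "UPDATE orders SET status = 'open' WHERE status IS NULL;")
rejected = path.submit("hotfix-102", "DROP TABLE orders;")

print(accepted)           # []
print(rejected)           # ['no-drop-table']
print(path.replay_queue)  # ['hotfix-101']
```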
Understanding the common approaches organizations take to handling problem changes is important. These approaches shape the standards your organization will expect from a CI/CD pipeline and will influence your solution’s design. They also help you build efficient problem handling into the pipeline end-to-end.