One of the challenges with making changes in a database environment is that undoing those changes can be hard. Rolling forward with a new change to correct the issue is often preferred, but that's usually done with limited analysis and thought. In effect, we hope our staff makes a quick patch and a better decision under pressure than they did when they had more time to examine the problem. That works if the problem is a simple implementation mistake, but not if we didn't design our solution well at the start.
I ran across an article on DoorDash that I thought was interesting. During the pandemic, their business exploded and they outgrew their Aurora PostgreSQL database. They migrated to CockroachDB, a distributed SQL database that is compatible with the PostgreSQL wire protocol and can (theoretically) scale much higher.
What I found interesting is that the engineers at DoorDash were trying to break apart their monolith and get better scalability, primarily by extracting certain tables to get single writers in a cluster, which should help them handle a larger workload. They wanted to use their main identity table as a test, which I assume is the table that tracks each user in the system. They tried to migrate this table and cut over to a new cluster four times before a fifth attempt worked.
I think any large migration is fraught with issues, but I appreciated the design here that allowed them to roll back their change and revert to the previous version of the database. That's something I don't see many teams think about or build into their database change process. I think having a clear, known, tested way to undo changes is important, at least for some of your tables.
There are two pieces of advice they give that I often give to customers as well. First, learn to spread out changes across batches. When I work with Flyway customers, I always let them know they need to think of a migration script as a unit of deployment and break changes apart as best they can. Those scripts often also become units of rollback, so keep them small. That doesn't necessarily mean every change gets its own script, but don't bundle too many things together.
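To make the "unit of deployment, unit of rollback" idea concrete, here is a minimal sketch in Python using the standard library's sqlite3 module. It is not how Flyway or DoorDash implement anything. The Migration class, the schema_history table, and the customer table are all hypothetical names chosen for illustration. The point is only that each small change carries its own undo step and is committed on its own, so you can back out one step without touching the rest.

```python
import sqlite3
from dataclasses import dataclass


@dataclass(frozen=True)
class Migration:
    """One small unit of deployment, paired with its own undo (hypothetical)."""
    version: int
    apply_sql: str
    undo_sql: str


# Illustrative changes only: each script does one thing, so each undo is simple.
MIGRATIONS = [
    Migration(1,
              "CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT)",
              "DROP TABLE customer"),
    Migration(2,
              "CREATE INDEX ix_customer_name ON customer (name)",
              "DROP INDEX ix_customer_name"),
]


def current_version(conn: sqlite3.Connection) -> int:
    """Return the highest applied version, or 0 if nothing has been applied."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS schema_history (version INTEGER PRIMARY KEY)"
    )
    row = conn.execute("SELECT MAX(version) FROM schema_history").fetchone()
    return row[0] or 0


def migrate(conn: sqlite3.Connection, target: int) -> None:
    """Apply pending migrations up to and including `target`, one at a time."""
    start = current_version(conn)
    for m in sorted(MIGRATIONS, key=lambda m: m.version):
        if start < m.version <= target:
            conn.execute(m.apply_sql)
            conn.execute("INSERT INTO schema_history (version) VALUES (?)",
                         (m.version,))
            conn.commit()  # each migration is recorded as its own step


def rollback(conn: sqlite3.Connection, target: int) -> None:
    """Undo applied migrations newer than `target`, newest first."""
    start = current_version(conn)
    for m in sorted(MIGRATIONS, key=lambda m: m.version, reverse=True):
        if target < m.version <= start:
            conn.execute(m.undo_sql)
            conn.execute("DELETE FROM schema_history WHERE version = ?",
                         (m.version,))
            conn.commit()


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    migrate(conn, target=2)
    rollback(conn, target=1)  # removes only the index; the table stays
    print(current_version(conn))
```

Because the index lives in its own script, rolling back to version 1 leaves the customer table and its data alone. If both changes had been bundled into a single script, the only undo available would have been dropping the table as well.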
Second, keep things simple. Too often I find engineers build clever solutions that make sense to them but to no one else. You never know the quality of your next hire, so don't overcomplicate things without a really good reason.
Did their process work? They've grown to about 1.9 PB of data. That's a lot of food orders. They've also had other metrics of success and seem to be saving time for their tech team, which is often one of the main reasons to build a better process and use it consistently.