Multi-Tier Production Release Environment by Grant Fritchey
In order to maintain a stable, protected production system, my company has built a multi-tier Development, Quality Assurance (QA), Staging & Production environment. A number of processes have been defined around how to move data & structures out of development and into production. We’ve also established a set of tools that we use for maintaining these processes. During the building of these systems and the development of these processes, a number of problems have come up that we’ve had to work around. Hopefully, by communicating our processes and the problems they address, you can learn from our pain.
Obviously, the normal progression would be to develop new structures, procedures, etc., against the Development environment. Once unit testing is complete, those objects are moved into QA for more thorough testing. Not everyone uses a Staging environment, but it is normally a place where the build is rehearsed to verify its impact on Production before the build is actually performed there. This prevents as much inadvertent damage to Production as possible. We also do performance and load testing on Staging.
The Production Environment
Our Production environment is an active/active SQL Server cluster pair using multi-processor machines. Four different instances of SQL Server run on the two machines. All the rest of the environments mimic this basic setup: they are all clustered environments running similar instances of SQL Server, though the QA and Development clusters are much less powerful machines. Staging also doubles as a backup cluster in case of a complete catastrophic loss of Production. We built the systems to mirror Production in order to identify and deal with the unique constraints of the clustered environment from the very beginning of development. We use the same instance and database names on all the clusters for the same reason. At no time can a developer or DBA accidentally build a process that ignores the cluster or the instances. This way we avoid, for example, hard-coded path statements that may not be applicable in Production.
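As a small illustration of that last point, a release or maintenance script can ask the instance where a database's files actually live rather than assuming a path. This is only a sketch, and the database name AppDB is hypothetical; sys.master_files is available on current versions of SQL Server.

-- Minimal sketch: look up the current machine, instance and file locations
-- instead of hard coding them. The database name AppDB is hypothetical.
SELECT SERVERPROPERTY('MachineName')  AS MachineName,
       SERVERPROPERTY('InstanceName') AS InstanceName,
       name                           AS LogicalFileName,
       physical_name                  AS CurrentPath
FROM   sys.master_files
WHERE  database_id = DB_ID(N'AppDB');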
The real meat of this set-up is in the process that makes the rollout to Production possible. It starts in Development, where we have multiple development teams working on a single database at various stages of implementation. The Development environment is frequently referred to as the Wild West, because the developers can, to a large degree, act as cowboys, creating and changing procedures & functions at will. Because of this, we collect our structural changes and publish them once a week to the Development environment. The developers have built a day into their schedule to address issues with code caused by database changes, whether to their own team's objects or to dependencies on other teams' objects. Our model is maintained using Embarcadero ER/Studio and stored in the Embarcadero Repository. Our SQL code (procedures, views, functions and so on) is maintained in Visual SourceSafe (VSS). Source code control and versioning for both the model and the code have proven invaluable. The development build process takes about an hour. Generally, in the Wild West, there are several DBAs, each working with a development team. Communication between the DBAs is facilitated by having one DBA assigned to work on integration across teams.
Working with Development
On a scheduled basis, we perform a build of the QA environment. This is where we start to rein in the Wild West show that takes place in development. We use another tool from Embarcadero called ChangeManager, which allows us to perform an incremental update to the QA environment. Since we use VSS for the SQL code, we only migrate structures to QA using ChangeManager. ChangeManager captures structural differences with a very high degree of accuracy. However, it isn’t perfect. Usually on a QA build we have to generate the differential scripts twice in order for ChangeManager to capture all the changes, and sometimes we have to do it more often than that.
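As a belt-and-braces check after ChangeManager has run, a simple comparison of our own can flag anything that slipped through. The following is only a sketch: the linked server name DEVCLUSTER\INST1 and the database name AppDB are hypothetical, and it only looks at columns, not every kind of object.

-- Spot check after a QA structure build: columns that exist in Development
-- but are missing from QA. Linked server and database names are hypothetical.
SELECT d.TABLE_NAME, d.COLUMN_NAME
FROM   [DEVCLUSTER\INST1].AppDB.INFORMATION_SCHEMA.COLUMNS AS d
       LEFT JOIN AppDB.INFORMATION_SCHEMA.COLUMNS AS q
              ON  q.TABLE_NAME  = d.TABLE_NAME
              AND q.COLUMN_NAME = d.COLUMN_NAME
WHERE  q.COLUMN_NAME IS NULL
ORDER BY d.TABLE_NAME, d.COLUMN_NAME;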
After the structures are migrated, we build the SQL code using the objects that were checked into VSS. Only procedures that have been unit tested can be checked into VSS, which ensures that the objects migrated to QA are ready for a higher degree of testing. We also use RedGate’s Data Compare utility to move lookup table data between the environments. This tool, like ChangeManager, is incredibly accurate but not perfect; we occasionally have to generate two scripts with it to ensure all the data is moved correctly. We maintain a documented list of lookup tables so that we can ensure accurate maintenance of their data through the environments. The use of all three utilities gives us a repeatable, safe process that moves a large number of structures between the environments without our having to document each and every object on each build. This has allowed us to do as many as three QA builds a week as we move towards a release. The process takes one to two hours, and usually only one or two DBAs are involved in doing the QA builds.
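A quick sanity check on the lookup data, after the Data Compare scripts have run, is to put a row count and a rough checksum side by side for each documented lookup table. Again, this is only a sketch; the table dbo.OrderStatus, the linked server DEVCLUSTER\INST1 and the database AppDB are all hypothetical names.

-- Rough spot check after a lookup-data move: row count and checksum on each side.
-- Table, linked server and database names are hypothetical.
SELECT 'QA' AS Environment,
       COUNT(*) AS RowCnt,
       CHECKSUM_AGG(BINARY_CHECKSUM(*)) AS RoughChecksum
FROM   AppDB.dbo.OrderStatus
UNION ALL
SELECT 'Development',
       COUNT(*),
       CHECKSUM_AGG(BINARY_CHECKSUM(*))
FROM   [DEVCLUSTER\INST1].AppDB.dbo.OrderStatus;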
After everything has been sufficiently tested in QA, a production build is scheduled. About three weeks prior to the scheduled release, we’re ready for a build to Staging. This is where we eliminate the last vestiges of the Wild West in preparation for the move into Production. Step one is to restore the most recent backup of the Production environment onto the Staging system. Now any scripts we generate are tested against the actual data & structures that we will see in Production when we release. The scripts are generated from QA, because that is our tested and accepted base. Embarcadero ChangeManager and RedGate Data Compare are used again to generate the scripts. Unlike in QA, we don’t run the compare repeatedly until it comes out perfect. In Staging, we run the generated script, and anywhere it fails, we edit the script itself. This usually entails rearranging for dependencies, dealing with constraint name differences (a constant battle), and so on. We then restore the Production database back onto Staging and test the script again. This is repeated until the script, with all structural and data changes, runs without error in Staging.
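Resetting Staging is the heart of that loop. What follows is only a sketch of the restore step: the database name, backup location and file paths are hypothetical, not our actual names.

-- Reset Staging from the latest Production backup before re-testing the release
-- script. Database name, backup location and file paths are hypothetical.
ALTER DATABASE AppDB SET SINGLE_USER WITH ROLLBACK IMMEDIATE;

RESTORE DATABASE AppDB
FROM DISK = N'\\backupserver\prod\AppDB_full.bak'
WITH REPLACE,
     MOVE N'AppDB_Data' TO N'S:\SQLData\AppDB.mdf',
     MOVE N'AppDB_Log'  TO N'L:\SQLLog\AppDB_log.ldf',
     STATS = 10;

ALTER DATABASE AppDB SET MULTI_USER;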
Lastly, another compare is run using the tools to ensure all the changes have been migrated. At this point the script and the Staging database are placed under what we call Release Control. More testing will occur in QA and Staging, and any changes resulting from that testing are reflected by edits to the script. Any edit to the script must be tested by completely resetting the Staging database and re-running the entire script. By the time we’re ready to release to Production, we’ve done everything we can to eliminate errors caused by differences in structure or data between what we’re developing and what’s in Production. The initial build process in Staging takes four to eight hours. It is always done by a single DBA, who relies on the various team DBAs to provide assistance.
Releasing to Production
After all that, the Production release is really quite simple: we run the script tested against Staging as part of the code release process. So far, knock wood, we’ve had no problems in the Production part of the release, thanks to our preparation work in the other environments. The release itself follows a short checklist. We bleed off all user connections and then isolate the database. Once the database is secure, we take a backup (paranoia in a DBA is a good thing). We run our script, checking for errors. Assuming there are none, we run a consistency check on the system. If none of this fails, we’ve got an updated Production system. If errors appear at any step during the production release, we’re prepared to restore from the backups to ensure continuation of the Production system.
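In T-SQL terms, the wrapper around that checklist looks roughly like the sketch below. The database name and backup path are illustrative only, and the tested release script itself runs between the backup and the consistency check.

-- 1. Bleed off connections and isolate the database.
ALTER DATABASE AppDB SET SINGLE_USER WITH ROLLBACK IMMEDIATE;

-- 2. Safety backup before any change is applied.
BACKUP DATABASE AppDB
TO DISK = N'\\backupserver\prod\AppDB_prerelease.bak'
WITH INIT, STATS = 10;

-- 3. Run the release script tested in Staging here, checking the output for errors.

-- 4. Consistency check on the updated database.
DBCC CHECKDB ('AppDB') WITH NO_INFOMSGS;

-- 5. Open the database back up to the application.
ALTER DATABASE AppDB SET MULTI_USER;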
Lessons Learned...
The pain has generally been concentrated in a few areas. The weekly dev build is both a blessing and a curse. If you’re on a team with a lot of changes, building only once a week is good because it allows you to concentrate your efforts during the week. If you’re on a team with only a few changes, the weekly build is a place of horror, because a dependency on another team may change without your knowledge. The DBA team has had to act as a communication facilitator between development teams to prevent these kinds of surprises. The QA builds, by their nature and frequency, are somewhat problematic. Getting the incremental changes in place sometimes leads to data loss, and objects can be missed. Everything that delays the QA build cuts into the QA team’s time to get its job done and slows down development, since the developers have to assist the QA department. We usually spend more time getting this move right than we do on any other process. The Staging build is relatively painless, but the length of time it takes to get everything just right can impact schedules; we’ve had to remind the project managers several times to allow for the time we need to get the build right.

The biggest win, in all the work done, has come from documenting the process and then following the document. Whenever we identified shortcomings in the process, we updated the document with the fix. The more that was written down, the easier it became to incorporate other DBAs and other development teams into the process. We’ve even taken the processes from this development effort and migrated them into other development efforts, with positive results.
There are more places where I’m sure we can improve our process. Overall, though, we’ve done four major releases in the last year, with three to four minor releases between them, and things have gone fairly smoothly (knock wood). Having said that, I’m off to do an extra backup. Just in case.