EL PROBLEMO
Like most (one would hope) development groups, we develop our database code in a specific development environment. From there we promote it to a QA environment for testing, then to Staging, and finally to Production. Since we use Subversion to store our application code, we also use it to store our database code. Storing scripts for Tables, Views, Procedures, Functions, etc. is pretty straightforward, but what about static (i.e. Lookup) data? We have quite a bit of static data here at ChannelAdvisor (.com), and with the amount of development being done it can change quite frequently. We wanted a way to back up this data in Development and Staging and version it in Subversion just like the DB object code. Relying upon a database backup would not suffice given the following requirements:
- Data should be easily human-readable (i.e. easily "diff"ed to spot exact changes between versions).
- Data should be easily importable back into SQL Server (for Disaster Recovery).
- Data should be checked into Subversion (svn) nightly.
- If possible, the process should allow for auditing distinct data changes so we can see point-in-time changes and who made the change.
- This process needs to work in Development, Staging, and Production; for the moment we can skip QA.
- Only certain Databases will be exported.
- Not all tables within an exported Database will be exported.
- Output filename should be the Table name.
In order to facilitate the first two items (readability and importability), the following need to apply:
- file should be plain text
- file should contain column-headers
- text fields should be text-qualified (to be importable)
- text-qualifier cannot be naturally occurring in the data (hence no double-quote: ")
- column delimiter can be a tab (\t)
- file will need to support Unicode character set for foreign language characters
After doing some testing for an appropriate text-qualifier, I determined that extended-ASCII character 170 (the logical "not" symbol: ¬) is best: anything above character 175 (») shows up differently depending on which application you are in, and character 175 itself shows up in one of the tables to be exported. Character 174 («) is hence just as likely to someday be used, while characters 171 - 173 (½, ¼, and ¡) may also be used in the future.
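To make this concrete, here is what an export file for a hypothetical lookup table might look like (the table and values are made up): a header row, columns separated by actual tab characters, and text fields qualified with ¬:

OrderStatusID	StatusName	Description
1	¬New¬	¬Order has been received¬
2	¬Shipped¬	¬Order has left the warehouse¬

And a quick sanity check for a candidate qualifier (again a hypothetical sketch; the table and column names are made up) is to scan the text columns of the tables to be exported and confirm the character never occurs naturally:

-- Hypothetical check (table and column names are made up): the count should be 0
SELECT COUNT(*) AS [QualifierOccurrences]
FROM [MyDB].[dbo].[OrderStatus]
WHERE [Description] LIKE N'%¬%'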
A Solution?
Initially I set up an SSIS package to do the export that was Server & DB specific; it would accept a Table name to export just that table, or it would run all tables within that DB on that Server if no Table name was passed in. This would require a total of 10 SSIS packages given the list of Servers / DBs to export. This solution worked great in that it satisfied all of the given requirements.
However, it was then noted that over time additional tables will be added to the system, and the structure of existing tables might change: new columns, or new datatypes for existing columns. Given how SSIS stores all of the column and datatype structure within each Data Flow task (most likely because you can tailor the handling of each column individually), we would need one Data Flow task per table, each with its own Flat-File Destination (i.e. we could not use a Loop Container for the Table names), since you cannot (at least not easily) dynamically change the column mappings even though you can dynamically assign the filename within the Flat-File Destination.
This was fine for the database that had 10 tables, but not for the one with 120 tables ;-). AND we would also be required to update one or more SSIS packages every time we:
- modify an existing static table,
- add a new static table, or
- remove or rename an existing static table.
In the case of removing a table or a column, the package would error, which is a clear indication that it needs to be updated. In the case of a new table, however, there would be no error, so newer tables might never get backed up if someone forgets to update the SSIS package. This is more overhead than desirable, so I started looking into other solutions. [Note: this is not to say that this is completely impossible via SSIS; it would just take more effort than I think it was worth when other options were available.]
More Requirements:
- Backup newly created tables without needing to update the data export process
- Account for column / structure changes on existing tables without the data export process failing and without needing to update the process
Woe Is Me
I then looked into using BCP as it seemed like the logical choice: it allows for a SELECT * query so it will always pick up column changes, and I can do a query to get the list of tables so I can build a dynamic SQL string to call xp_cmdshell to execute BCP. However, for some odd reason which I will forever be unaware of, BCP -- which is meant to import and EXPORT data -- does not have a facility for including column headers in an export: quite odd indeedy (yes, that IS a word). And while BCP does allow for text-qualification, it does so via Format Files, which would need to reflect the structure of the table and would hence need to be created for new tables or updated for changes to existing ones; that would bring us right back to the maintenance problem of SSIS.
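For illustration, a BCP-based export via xp_cmdshell would have looked something like this (a hypothetical sketch; the server, DB, table, and path names are made up). The output is Unicode and always picks up the current columns, but there is simply no header row:

-- Hypothetical sketch of the BCP approach (all names are made up):
-- -w = Unicode output; -T = trusted connection; no option exists for column headers
EXEC master.dbo.xp_cmdshell 'bcp "SELECT * FROM MyDB.dbo.OrderStatus" queryout E:\Exports\OrderStatus.txt -w -T -S DEVDB01'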
I then looked into using SQLCMD (which replaces OSQL in SQL 2005). SQLCMD does include column headers (good!) but does not allow for text-qualification (bad!). There are also other minor output formatting issues with SQLCMD as it is meant more for reporting than data extraction.
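A hypothetical SQLCMD attempt (again, all names are made up) gets the headers but has no way to qualify the text fields:

-- Hypothetical sketch of the SQLCMD approach (all names are made up):
-- -o = output file; -W = trim trailing spaces; headers are included,
-- but there is no text-qualification option and the formatting is report-oriented
EXEC master.dbo.xp_cmdshell 'sqlcmd -S DEVDB01 -d MyDB -Q "SELECT * FROM dbo.OrderStatus" -o E:\Exports\OrderStatus.txt -W'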
Finally, a Solution!
With few options left, I looked into using a SQL CLR procedure that does the same thing as BCP (it will output the results of any query) but includes the items from SSIS that we need, namely: column headers and text-qualification. Since the CLR will not be enabled on production machines (per DB group policy), I had OPS set up VMs to be Maintenance instances of SQL 2005. Having a dedicated maintenance VM is useful for projects needing greater flexibility on potential security issues as well as for configuring 3rd party applications on the OS. Virtual Machines are ideal for projects like this as they are not mission critical / customer-facing.
I created a Database named "Maintenance" and installed the SQL# (SQLsharp) CLR library. For this project I will be using a Function (String_Join) and a Stored Procedure (DB_BulkExport) that are available in the Free version of SQL# (http://www.SQLsharp.com/). [Please note that there is no affiliation between my employer, ChannelAdvisor, and the SQL# project.]
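To give a feel for String_Join before it appears in the procedure below, here is a minimal sketch (the query is illustrative, and the arguments simply mirror how the procedure calls it): it runs a single-column query and joins the resulting values into one delimited string.

-- Minimal sketch: String_Join runs the single-column query and joins the
-- resulting values with the given separator (arguments mirror the usage below)
SELECT Maintenance.SQL#.String_Join(
           'SELECT name FROM dbo.sysobjects WHERE xtype = ''U'' ORDER BY name',
           '], [',
           1) AS [JoinedNames]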
I have 3 tables for this process:
StaticDataExportSourceServer -- details about the environment (especially for the SVN directory) based on the calling Server
StaticDataExportSourceDB -- which Tables from which Server\Instance to export based on the environment
StaticDataExportException -- the Tables, per DB, that should NOT be exported since we look for all non-system tables
Some notes about the tables (in addition to per field comments in the DDL):
- StaticDataExportSourceServer:
- The MaintenanceServer field exists since each backup process is environment specific (i.e. the Maintenance server running in Dev only backs up Dev data, and so on)
- The Environment field is used both for separating Databases (since we might want to do different DBs in different environments) and for constructing the file-system path for the exported files
- The SVNFolderName field is used to construct the file-system path where the export file will be saved, since our existing Subversion repository structure did not always use exactly the same names as the servers; this field exists only for that mapping.
- Yes, technically speaking the Environment field could be normalized out to another table with MaintenanceServer as the PK since the value, for our purposes at least, will repeat per each MaintenanceServer value and hence there is a possibility of a data mismatch between these two fields. However, given the scope of this project, normalizing that field was not truly important given that this data will rarely, if ever, change. And, if the scope of the project ever does increase, breaking that field out and adding the appropriate JOIN in the Stored Procedure (shown below the Table DDL) would be easy enough to do.
- StaticDataExportSourceDB:
- The DBName field value is per each Environment / ServerName combination. Our servers have slightly different names across the various environments, so this lets us account for those differences as well as determine which DBs we want, since there might be a reason to do different DBs in different environments (e.g. a Database might be in Dev but not yet in QA, or in Dev and QA but not yet in Staging, etc.)
- StaticDataExportException:
- This table lists all of the non-standard exceptions (standard exceptions are statically listed in the Stored Procedure). The theory here is that DB names are unique so if a name appears in more than one environment it is the same DB and if we do not want a particular Table in a particular DB in one environment then we will not want it in the other environments as well.
- This structure is partially due to the environment (outside of being generally flexible); we have several terabytes of data in Production alone and so we spread the data across several different physical servers for load balancing. Right now this project only pulls from one of the servers in each environment since the data is replicated to a few other servers but this setup can accommodate pulling any data from any server, even if the data is the same as on another server.
- The data in the tables is the same across all instances of the Maintenance DB across all of the environments (Dev, QA, Staging, and Production). This way we do not need to worry about which environment should have which data. The entire export system here (tables, data, procs, etc.) is identical across each environment. The only difference is the name of the server in each environment and the name of the server (via @@SERVERNAME) is matched against the MaintenanceServer column in the StaticDataExportSourceServer table to determine what to do in each environment.
- Additional comments are noted in the DDL
USE [Maintenance]
GO
SET ANSI_NULLS ON
GO
SET QUOTED_IDENTIFIER ON
GO
SET ANSI_PADDING ON
GO
CREATE TABLE [dbo].[StaticDataExportException](
    [DBName] [nvarchar](128) NOT NULL,    -- PK, name of DB to back up
    [TableName] [nvarchar](128) NOT NULL, -- PK, name of Table to exclude
    CONSTRAINT [PK_StaticDataExportException] PRIMARY KEY CLUSTERED
    (
        [DBName] ASC,
        [TableName] ASC
    ) WITH (FILLFACTOR = 100) ON [Tables]
) ON [Tables]
GO
CREATE TABLE [dbo].[StaticDataExportSourceDB](
    [Environment] [varchar](50) NOT NULL,  -- PK, FK, values = Dev, Staging, QA, Production
    [ServerName] [nvarchar](128) NOT NULL, -- PK, FK, name of server\instance to back up via @@SERVERNAME
    [DBName] [nvarchar](128) NOT NULL,     -- PK, name of DB to back up
    CONSTRAINT [PK_StaticDataExportSourceDB] PRIMARY KEY CLUSTERED
    (
        [Environment] ASC,
        [ServerName] ASC,
        [DBName] ASC
    ) WITH (FILLFACTOR = 100) ON [Tables]
) ON [Tables]
GO
CREATE TABLE [dbo].[StaticDataExportSourceServer](
    [Environment] [varchar](50) NOT NULL,         -- PK, values = Dev, Staging, QA, Production
    [ServerName] [nvarchar](128) NOT NULL,        -- PK, name of server\instance to back up via @@SERVERNAME, used for Linked Server
    [SVNFolderName] [varchar](50) NOT NULL,       -- name of server to back up in the Subversion repository
    [MaintenanceServer] [nvarchar](128) NOT NULL, -- name of the Maintenance server via @@SERVERNAME
    CONSTRAINT [PK_StaticDataExportSourceServer] PRIMARY KEY CLUSTERED
    (
        [Environment] ASC,
        [ServerName] ASC
    ) WITH (FILLFACTOR = 100) ON [Tables]
) ON [Tables]
GO
ALTER TABLE [dbo].[StaticDataExportSourceDB] WITH CHECK
    ADD CONSTRAINT [FK_StaticDataExportSourceDB_StaticDataExportSourceServer]
    FOREIGN KEY([Environment], [ServerName])
    REFERENCES [dbo].[StaticDataExportSourceServer] ([Environment], [ServerName])
GO
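To show how these tables drive the process, hypothetical configuration rows might look like the following (the server, folder, and DB names are made up, loosely following the example path that appears in the procedure below):

-- Hypothetical sample configuration (all names are made up):
INSERT INTO dbo.StaticDataExportSourceServer (Environment, ServerName, SVNFolderName, MaintenanceServer)
VALUES ('Dev', N'DEVDB01', 'DevDB01', N'DEVMAINT01')

INSERT INTO dbo.StaticDataExportSourceDB (Environment, ServerName, DBName)
VALUES ('Dev', N'DEVDB01', N'Orders')

-- exclude a table that is not truly static:
INSERT INTO dbo.StaticDataExportException (DBName, TableName)
VALUES (N'Orders', N'OrderAuditLog')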
There is also a Stored Procedure, ExportStaticData, which will:
- find all Tables for the Servers\Instances listed in the StaticDataExportSourceDB table, removing any tables listed in StaticDataExportException
- export the data within those tables
- add any new files to SVN
- commit all updates to SVN
Notes:
- Linked Servers need to have access either via explicit credentials specified when creating the Linked Server, or else the account running the SQL Server process needs the appropriate connection and SELECT privileges (a minimal setup sketch appears after these notes).
- This requires the command-line version of Subversion (svn):
- You can download Subversion from http://Subversion.tigris.org/servlets/ProjectDocumentList?folderID=91&expandFolder=91&folderID=0 (this puts you in the folder path of: Subversion | Releases | Windows)
- You should log into the server running SQL Server 2005, open a command-window, and run an SVN command that requires a username and password once, so that svn can cache the credentials (encrypted); there will then be no need to specify the username and password in the Stored Procedure.
- The Maintenance Server VM is running SQL Server 2005, but the backup / export process can also connect to SQL Server 2000 servers.
- Since the CLR Stored Procedure DB_BulkExport writes directly to the file system (it can output as a regular query result, but the easiest way to get data to a file programmatically without consuming too much memory is to write each row of data to disk as it is read from the DataReader [i.e. Result Set]), you need to set the CLR Assembly to have the EXTERNAL_ACCESS permission. To do this, simply execute the following:
EXEC Maintenance.SQL#.SQLsharp_SetSecurity 2
- One nice thing about Subversion is that it only stores the differences between file revisions after the initial file is stored. So while the initial export for this project produced 270 megs worth of files, nightly changes are probably around 100 KB, so this will not overly bloat the SVN repository.
- Another nice thing about Subversion is that it can easily show the differences between checked-in versions. Using the "Compare Revisions" feature in the Repo-Browser of the TortoiseSVN client, it is really easy to see what changes were made between check-ins.
- Because of the need to efficiently find and view the differences, it is important to sort the data the same way each time. To this end I dynamically found the PK fields of each table so that I could do an ORDER BY on the SELECT *.
- The @SVNLocation variable would ideally contain the fully-qualified path to the "svn.exe" executable. Unfortunately, xp_cmdshell only allows you to have one set of double-quotes (") per execution, and those needed to be reserved for the "--message" option when doing the "commit". So for now I am relying upon the fact that installing the Subversion command-line package places the install directory into the PATH environment variable. Fortunately, none of our directory names, table names, or DB names have any spaces in them. Of course, if the table names or DB names had spaces I could strip them out, but there would not be anything I could do about spaces in directory names.
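As referenced in the first note above, a minimal Linked Server setup might look like this (a hypothetical sketch; the server name, login, and password are made-up placeholders):

-- Hypothetical minimal Linked Server setup (names and credentials are placeholders):
EXEC master.dbo.sp_addlinkedserver @server = N'DEVDB01', @srvproduct = N'SQL Server'

-- either map explicit credentials for all local logins...
EXEC master.dbo.sp_addlinkedsrvlogin @rmtsrvname = N'DEVDB01', @useself = 'FALSE',
     @locallogin = NULL, @rmtuser = N'svc_export', @rmtpassword = N'...'
-- ...or set @useself = 'TRUE' to connect as the calling login / service account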
USE [Maintenance]
GO
SET ANSI_NULLS ON
GO
SET QUOTED_IDENTIFIER ON
GO
-- Procedure Name: ExportStaticData
--
-- Description: Based on a list of Servers and DBs (in StaticDataExportSourceServer and
--              StaticDataExportSourceDB), export all tables (unless noted in
--              StaticDataExportException) for upload into Subversion (svn).
--              This procedure is meant to be run by a T-SQL job step (nightly).
--
-- Parameters:  None / Nada / Zip / Zilch
--
-- Returns:
--
-- Created By:  Solomon Rutzky
-- Created On:  04/12/2008
--
CREATE PROCEDURE [dbo].[ExportStaticData]
AS
SET NOCOUNT ON

DECLARE @ServerName SYSNAME,
        @DBName SYSNAME,
        @Environment VARCHAR(50),
        @ExportSQL NVARCHAR(4000),
        @TableIndex SMALLINT,
        @TableCount SMALLINT,
        @ExportRoot VARCHAR(500),
        @ExportFile VARCHAR(600),
        @SVNFolderName VARCHAR(50),
        @TableName SYSNAME,
        @SVNLocation VARCHAR(100),
        @SVNCommand VARCHAR(2000)

SET @SVNLocation = 'svn.exe ' -- '"C:\Program Files\Subversion\bin\svn.exe" '

IF ( OBJECT_ID('tempdb..#ExportTables') IS NULL )
BEGIN
    CREATE TABLE #ExportTables (
        RowID SMALLINT IDENTITY(1,1) PRIMARY KEY,
        TableName SYSNAME,
        PKColumns NVARCHAR(500)
    )
END

SELECT TOP 1 @Environment = sdess.Environment
FROM dbo.StaticDataExportSourceServer sdess
WHERE sdess.MaintenanceServer = @@SERVERNAME

SET @ExportRoot = 'E:\Databases\tags\Autogenned\' + @Environment + '\'

-- make sure we have the latest and greatest from the SVN repository:
SET @SVNCommand = @SVNLocation + ' update '
SET @SVNCommand = @SVNCommand + @ExportRoot

PRINT @SVNCommand -- print for inclusion in Step Output capture in Job setup
EXEC master.dbo.xp_cmdshell @SVNCommand

-- SVN Loc = Dev, Production, Staging
DECLARE cur_ExportDBs CURSOR LOCAL FAST_FORWARD FOR
    SELECT sdess.ServerName, sdesd.DBName, sdess.SVNFolderName
    FROM dbo.StaticDataExportSourceServer sdess
    INNER JOIN dbo.StaticDataExportSourceDB sdesd
            ON sdesd.Environment = sdess.Environment
           AND sdesd.ServerName = sdess.ServerName
           AND sdess.MaintenanceServer = @@SERVERNAME
    ORDER BY ServerName, DBName

OPEN cur_ExportDBs
FETCH NEXT FROM cur_ExportDBs INTO @ServerName, @DBName, @SVNFolderName

WHILE ( @@FETCH_STATUS = 0 )
BEGIN
    -- clear out the table from the previous run
    TRUNCATE TABLE #ExportTables

    -- fill the temp table with the results of the dynamic query (TableName, PKColumns):
    INSERT INTO #ExportTables (TableName, PKColumns)
    EXEC('
        SELECT obj.name, Maintenance.SQL#.String_Join(''
                   SELECT sc.name AS ColumnName
                   FROM [' + @ServerName + '].[' + @DBName + '].[dbo].sysobjects so
                   INNER JOIN [' + @ServerName + '].[' + @DBName + '].[dbo].sysindexkeys sik
                           ON so.xtype = ''''PK''''
                          AND so.parent_obj = sik.id
                          AND so.parent_obj = '' + CONVERT(NVARCHAR(20), obj.id) + ''
                   INNER JOIN [' + @ServerName + '].[' + @DBName + '].[dbo].sysindexes si
                           ON si.id = sik.id
                          AND si.indid = sik.indid
                          AND si.name = so.name
                   INNER JOIN [' + @ServerName + '].[' + @DBName + '].[dbo].syscolumns sc
                           ON sc.id = sik.id
                          AND sc.colid = sik.colid
                   ORDER BY sik.keyno
               '', ''], ['', 1) AS ''Columns''
        FROM [' + @ServerName + '].[' + @DBName + '].[dbo].sysobjects obj
        LEFT JOIN dbo.StaticDataExportException sdee
               ON sdee.DBName = ''' + @DBName + '''
              AND sdee.TableName = obj.name
        WHERE obj.xtype = ''U''
          AND sdee.TableName IS NULL
          AND obj.name NOT LIKE ''sys%'' COLLATE SQL_Latin1_General_CP1_CS_AS -- case-sensitive match, fyi
          AND obj.name NOT IN (''dtproperties'', ''MSpeer_lsns'', ''MSPeer_request'', ''MSpeer_response'',
              ''MSpub_identity_range'', ''MSreplication_subscriptions'', ''MSsubscription_agents'')
        ORDER BY obj.name
    ')
    SET @TableCount = @@ROWCOUNT

    -- cycle through list of tables:
    SET @TableIndex = 1
    WHILE ( @TableIndex <= @TableCount )
    BEGIN
        SELECT @ExportSQL = 'SELECT * FROM [' + @ServerName + '].[' + @DBName + '].[dbo].['
                                + et.TableName + ']'
                                + CASE WHEN LTRIM(et.PKColumns) <> ''
                                       THEN ' ORDER BY [' + et.PKColumns + ']'
                                       ELSE '' END,
               @TableName = et.TableName
        FROM #ExportTables et
        WHERE et.RowID = @TableIndex

        -- example location: E:\Databases\tags\Autogenned\Dev\DevDB01\Data\Orders
        SET @ExportFile = @ExportRoot + @SVNFolderName + '\Data\' + @DBName + '\' + @TableName + '.txt'
        --PRINT @ExportFile

        -- Text-Qualifier is extended-ascii char 170 ¬ (logical "not" symbol)
        EXEC SQL#.DB_BulkExport
             @ExportSQL,  -- query to get the data
             '¬',         -- text-qualifier
             0,           -- text-qualify all columns?
             'always',    -- column-header-handling (Always, Results, Never)
             'word',      -- bit-handling (word [True/False], letter [T/F], or number [1/0])
             0,           -- first row # (0 = from beginning)
             0,           -- last row # (0 = no limit)
             @ExportFile, -- output file path
             NULL,        -- field terminator (default = \t)
             NULL,        -- row terminator (default = \r\n)
             'unicode'    -- file-encoding (default = ascii; but we have NVARCHAR and NTEXT fields)

        SET @TableIndex = @TableIndex + 1
    END

    FETCH NEXT FROM cur_ExportDBs INTO @ServerName, @DBName, @SVNFolderName
END -- WHILE (@@FETCH_STATUS)

CLOSE cur_ExportDBs
DEALLOCATE cur_ExportDBs

-- add new files (for newly created tables) to the SVN repository:
SET @SVNCommand = @SVNLocation + ' add --force '
SET @SVNCommand = @SVNCommand + @ExportRoot

PRINT @SVNCommand -- print for inclusion in Step Output capture in Job setup
EXEC master.dbo.xp_cmdshell @SVNCommand

-- commit all of the file changes to the SVN repository:
SET @SVNCommand = @SVNLocation + ' commit --message "Automated Static Data Exporter for: '
SET @SVNCommand = @SVNCommand + @Environment
SET @SVNCommand = @SVNCommand + '" '
SET @SVNCommand = @SVNCommand + @ExportRoot

PRINT @SVNCommand -- print for inclusion in Step Output capture in Job setup
EXEC master.dbo.xp_cmdshell @SVNCommand

--DROP TABLE #ExportTables
This procedure is scheduled to run nightly via a SQL Server Agent job. There are only two lines of code in the T-SQL Job Step, and the second line (the PRINT command) is just to log the end-time since I am capturing Step Output to a text file via the "Advanced" tab of the Job Step.
EXEC Maintenance.dbo.ExportStaticData
PRINT CONVERT(VARCHAR(50), GETDATE(), 121)
The nightly process takes between 5 and 8 minutes to process 2 Servers, 10 Databases, and 342 tables. Again, these export files amount to close to 270 megs; and while the "commit" to Subversion will only check in the differences on the files that have changed, the export from SQL Server will always write out the entire 270 megs (and slowly growing). On a typical night at least 10 tables have changes; most tables export almost instantly, with only one taking a few minutes since it has a large amount of text in an NTEXT field.
Are We There Yet?
Well, yes and no. Fortunately it is much more Yes than No. For the most part we are done, since we can nightly export and version a dynamic set of tables without having to worry about new, removed, or updated tables. However, if we look back at Requirement #4 from the beginning--"If possible, process should allow for auditing distinct data changes so we can see point-in-time changes and who made the change"--we still have not covered that scenario, and it would be great to accomplish all of the requirements. That will have to wait until the next exciting installment, where we will look at a few different thoughts on that topic.
Conclusion
While there has been a largely negative backlash against using the CLR within SQL Server 2005 (and beyond), there really are some valid business cases for using it. This is not to say that the more trivial and common uses revolving around Regular Expressions and other string functions are not equally valid (although I personally believe that Regular Expressions should be built into the DB since, after all, we are doing searches here. Hello!). In this situation, however, I found that the CLR was the best way to accomplish the task given to me. And while it might be possible to manipulate SSIS to deal with dynamic columns, I find the solution provided here to be a very clean and elegant one. Also, if it were not for the CLR, I could just as easily have created a Console Application (a .exe to run from the command-line) and used xp_cmdshell to call it, but that would have required a process fork (shelling out) per each table to export -- either that, or the Console App would have to be designed more specifically to export a set of tables, in which case it would not be a general-use feature. By using the CLR, I am able to integrate the call to the BulkExport into a native T-SQL Procedure AND I have a general-use function that I can use in any number of projects that need to export dynamic data. So, when used properly, the CLR can be a very powerful and flexible tool for getting the most out of SQL Server 2005 (and beyond).
Copyright © May, 2008 by Solomon Rutzky