This is an excerpt from my free eBook, Brad’s Sure Guide to SQL Server 2008, which can be downloaded here.
As DBAs, one of our goals is to separate OLTP (On-Line Transaction Processing) and OLAP (On- Line Analytical Processing) activity, preferably onto different servers. This is because performing both types of activities in the same database, or even in different databases on the same server, can lead to performance problems. In order to accomplish this goal, DBAs must figure out the best way to move transactional data from OLTP servers to servers designated for OLAP activity so that the data remains more or less synchronized. This sounds simple, but can prove to be difficult in practice.
One solution is to simply copy (backup and restore) the entire OLTP database to another server where OLAP activity can be performed on it. A slightly more sophisticated solution would be to transform all the data as it is being moved from the OLTP database to another server so that it is easier to perform OLAP activity on the data. While these solutions can and do work, they often become impractical when large amounts of data are involved.
Instead of moving all of the data, over and over again between an OLTP server and an OLAP server, it is much more efficient if only the changes are moved from server to server, so that the data on the OLTP server and the OLAP server are kept synchronized. This requires fewer resources and can boost performance. The larger the databases involved, the bigger the performance gain when only changed data is moved between servers.
In SQL Server 2005 and earlier, DBAs had several options when it came to moving changes from an OLTP server to an OLAP server. These included using replication, timestamp columns, triggers, complex queries, as well as expensive third-party tools. None of these are easy to implement, and many of them use a lot of server resources, negatively affecting the performance of the OLTP server.
In SQL Server 2008 (Enterprise Edition only), Microsoft introduced a new feature called Change Data Capture (CDC). It is designed to make it much easier and less resource intensive to move changed data from an OLTP server to an OLAP server. In a nutshell, CDC captures and records INSERT, UPDATE, and DELETE activity in an OLTP database and stores it in a form that is easily consumed by an application, such as an SSIS package. This package is used to take the data and apply it to an OLAP server, allowing both servers to keep their data synchronized.
Change Data Capture Architecture
Figure 1 presents a high-level overview of the major components of the CDC architecture, and at how the various components work together. Notice that the diagram is divided into two sections.
The top section represents an OLTP server and the bottom section represents an OLAP server.
Figure 1: High-level view of Change Data Capture architecture.
Let’s start with the OLTP server, as this is where CDC does its work.
Source Tables
When SQL Server 2008 is first installed, CDC is turned off by default so the first step is to enable it at the database level. If you want to turn it on for more than one database, you must enable it for each. Next, CDC must be enabled at the table level, on a table-by-table basis. Each table that is CDC-enabled is known as a Source table. In figure 1, a single OLTP database has been enabled for CDC, and three Source tables have been enabled for CDC.
Change Data Capture Tables
When CDC has been enabled for a database, along with one or more tables, several things happen under the covers. For example, for every Source table enabled for CDC, an associated Change Data Capture table is created, which is used to store the changes made in the Source table, along with some metadata used to track the changes.
In our example, since three of our Source tables have been enabled for CDC, three CDC tables have been created. CDC tables aren’t like the tables we are used to working with. Normally, when you want to view data from a table, you run a SELECT statement directly against the table. This is not the case with CDC tables as you can’t run statements directly against them. Instead, in order to view the data from a CDC table, you have to run SELECT against a Change Data Capture query function.
Change Data Capture Query Functions
For each of the Source tables enabled for CDC, one or two CDC query functions are created, depending on how the tables were enabled:
- cdc.fn_cdc_get_all_changes_capture_instance – this query function is always created and is used to return all changes from a CDC table.
- cdc.fn_cdc_get_net_changes_capture_instance - this query function is optionally created when enabling the Source table for CDC, and can be used to only return the net changes from a CDC table.
In order to view the contents of a CDC Table, you run a SELECT statement against its associated CDC query function, and the results are displayed just as any other query.
In our example in figure 1, only one CDC query function has been created for each of the three CDC tables.
Capture Instances
When a CDC table and any related CDC query functions are created, they make up what is called a capture instance. A capture instance is created for every Source table that is enabled for CDC. Capture instances are given a name so that they can be distinguished from one another. For example, if the table named sales.products is CDC-enabled, the capture instance created is named sales_products. In turn, the name of the CDC table is referred to as sales_products, and the names of the two associated CDC query functions are: cdc.fn_cdc_get_all_changes_sales_products and cdc.fn_cdc_get_net_changes_ sales_products. In our example, since three Source tables have been CDC-enabled, three new capture instances have been created.
Capture and Cleanup Jobs
Besides the Change Data Capture tables and query functions that have been created in our example, two SQL Server Agent jobs are created: a Capture and a Cleanup Job.
The Capture job generally runs continuously and is used to move changed data to the CDC tables from the transaction log. The Cleanup job runs on a scheduled basis to remove older data from the CDC tables so that they don’t get too large. By default, data older than three days is automatically removed from CDC tables by this job.
Now that we are familiar with the components of CDC, let’s see how they work together.
How Change Data Capture Works
You will recall from figure 1 that three Source tables were CDC-enabled. At this point, when any INSERT, UPDATE, or DELETE statements are issued against any of these three Source tables, these changes are first written to the transaction log associated with the database. This is normal behavior and occurs whether or not CDC has been enabled. What is different, now that CDC has been turned on for the Source tables, is that the Capture Process (the SQL Server Agent Capture job) reads the changes from the transaction log and moves them into the appropriate CDC tables.
INSERTs and DELETEs each occupy one row of the CDC table, and UPDATEs occupy two rows: one row for the "before" data and one row for the "after" data. These changes continue to accumulate in the CDC tables until they are deleted by the Cleanup Job, when it makes its scheduled run.
At any time, the DBA can view (or extract) the change data from the CDC tables by running appropriate statements against the relevant CDC query functions. For example, let’s say that a DBA has a job that executes an SSIS package once every four hours. When this SSIS package executes, it runs a statement against the appropriate CDC query function, which allows it to extract all of the changes made to a Source table since the last time the SSIS package was executed, and then moves the changes to a database on an OLAP server. This way, the data on the OLAP server is kept synchronized with the data from the OLTP server, with only a four hour delay. Of course, this is only one example. As the DBA, you have many options on how you might want to configure the data transfer between the two servers.
Implementing Change Data Capture for a Single Table
Now that we have a basic understanding of how CDC works, let’s implement a simple example of it using a modified version of the AdventureWorks database. For this example, our goal is to enable a single table for CDC so that we can see how changes to the Source table are stored in the CDC table. In addition, we will look at how we can query the CDC table using a CDC query function to examine the changes that were captured.
The first step is to CDC-enable the AdventureWorks database. This is done by running the sys.sp_cdc_enable_db stored procedure inside the database to be CDC-enabled, as follows:
USE AdventureWorks
GO
EXEC sys.sp_cdc_enable_db
GO
Now that the AdventureWorks database is CDC-enabled, the next step is to CDC-enable each Source table for which we want to track changes. While we will only perform this for a single table for this example, you can easily repeat this same process for each table in a database that you want to CDC-enable. This is done using the sys.sp_cdc_enable_table stored procedure:
USE AdventureWorks
GO
EXEC sys.sp_cdc_enable_table
@source_schema = N’Sales’,
@source_name = N’Store’,
@role_name = N’CDCRole’
The parameters are as follows:
- @source_schema refers to the schema name of the Source table you are enabling.
- @source_name refers to the name of the Source table you are enabling.
- @role_name refers to the security role that is created when CDC is enabled. This role can be ignored, or it can be used to assign permission to specific users so they can access the changed data using CDC query functions, without having to be a member of the db_owner role. If this role does not preexist, it will be created for you.
In our example above, we have CDC-enabled a Source table called Sales.Store, and have created a new role called CDCRole. Although you can’t see it, an associated CDC table has been created, along with a CDC query function named cdc.fn_cdc_get_all_changes_Sales_Store.
Now that CDC has been enabled for the database and a table, let’s make some changes to the Source table and then use the CDC query function to see what was captured in the CDC table. The following code INSERTs a row, DELETEs a row, and UPDATEs a row in the Source table:
INSERT INTO Sales.Store
VALUES (1556,N’The Bunny Shoppe’, ‘277′, GETDATE());
DELETE FROM Sales.Store
WHERE CustomerID = 2
UPDATE Sales.Store
SET Name = ‘The Magic Shoppe’
WHERE CustomerID = 6
Now that some changes have been made to our Source table, let’s see what data the CDC table contains. As mentioned before, the only way to view the rows in a Change Data Capture Table is use the CDC query function that was created when the Source table was CDC-enabled.
The generic appearance of this function is:
cdc.fn_cdc_get_all_changes_capture_instance ( from_lsn , to_lsn , ‘<row_filter_option>’ )
The first thing to notice about the CDC query function above is that the first part of the function’s name is:
cdc.fn_cdc_get_all_changes_
And that the last part of the function name is:
capture_instance
As we discussed earlier, whenever a capture instance is created, it not only creates the related CDC table, it also creates a custom CDC query function that is based on the capture instance name. For our example, the capture instance name is:
Sales_Store
Given this, then the name of the CDC query function is:
cdc.fn_cdc_get_all_changes_Sales_Store
Keep in mind that every Source table that is CDC-enabled will have either one or two CDC query functions, and that they will be named differently, based on the capture instance name.
Our CDC query function has three parameters. The first parameter is:
from_lsn
This refers to the LSN (Log Sequence Number) that represents the low endpoint of the LSN range you want to include in your result set. A LSN is assigned to every entry made into the transaction log, in ascending order. This means that as changes are made to the Source table, and then made in the transaction log, each change is assigned an LSN. The LSN of every transaction is also recorded in the CDC table, and can be used as a way to select which changes you want to return using the CDC query function.
The second parameter is:
to_lsn
This is the LSN that represents the high endpoint of the LSN range you want included in the result set. The combination of the from_lsn and the to_lsn constitute the range of "changes" you want to return.
The third parameter is:
‘<row_filter_option>’
The values of this parameter can be either:
- All - tells the function to return all changes within the specified LSN range, but only return a single UPDATE row that contains the new values.
- All update old - also returns all changes within the specified LSN range, but includes the two UPDATE rows, one with the before data and one with the after data.
Now, you may be asking yourself, how do you find the from_lsn and the to_lsn that are needed for the function’s parameters? There are different ways this can be done, but for this example, we will keep it simple. In our case, we just want to return the oldest from_lsn and the newest to_lsn that is stored in the CDC table. In this case, two different functions are used for this purpose.
The first function is:
sys.fn_cdc_get_min_lsn(‘capture_instance_name’)
This function is used to find the begin_lsn, where:
‘capture_instance_name’
is the capture instance name of the Chance Data Capture table.
The second function:
sys.fn_cdc_get_max_lsn()
is used to find the to_lsn value. Notice that it requires no parameters.
OK, let’s put all of this together and see what rows were recorded in the Change Data Capture table, after performing the three DML statements above. To find out, let’s run the following code:
USE AdventureWorks
GO–declare variables to represent beginning and ending lsn
DECLARE @from_lsn BINARY(10), @to_lsn BINARY(10)– get the first LSN for table changes
SELECT @from_lsn = sys.fn_cdc_get_min_lsn(‘Sales_Store’)– get the last LSN for table changes
SELECT @to_lsn = sys.fn_cdc_get_max_lsn()– get all changes in the range using "all update old" parameter
SELECT * FROM cdc.fn_cdc_get_all_changes_Sales_Store(@from_lsn, @to_lsn, ‘all update old’);GO
The results returned are shown in Figure 2:
Figure 2: This is what is stored in the Change Data Capture table.
So let’s take a look at the above code and see how it works. The line:
DECLARE @from_lsn BINARY(10), @to_lsn BINARY(10)
is used to declare two variables: one represents the from_lsn value and the other represents the to_lsn value. Notice that these values use the BINARY data type.
The next two statements are used to identify the from_lsn and to_lsn values:
SELECT @from_lsn = sys.fn_cdc_get_min_lsn(‘Sales_Store’)
SELECT @to_lsn = sys.fn_cdc_get_max_lsn()
Now that we have the from_lsn and the to_lsn values, we can place them in our CDC query function:
SELECT * FROM cdc.fn_cdc_get_all_changes_Sales_Store(@from_lsn, @to_lsn, ‘all update old’);
Since I used the option "all update old" as part of the above query function, all the rows in the table were returned, included the row I INSERTED, the row I DELETED, and the row I UPDATED (both the before and after values).
As you might imagine, you could incorporate similar code inside a SSIS package that would periodically execute and get all of the changes from the CDC table, manipulate the data appropriately, and then move it into a database on an OLAP server. Of course, the code needed to accomplish this goal is more complex than what you see above, but I think you get the idea.
Summary
Change Data Capture offers a new technology in SQL Server 2008 Enterprise Edition that makes it easier than in previous versions of SQL Server to move changed data from an OLTP server to an OLAP server. Not only is setup and administration much easier than previously available technology, performance has been greatly enhanced. Change Data Capture should be seriously considered as you design new applications, or redesign older applications, whenever you need to move changed data from one database to another.