June 8, 2011 at 12:16 am
Hi All,
I have seen lots of similar posts but couldn't find one that quite matched my situation. So, here it is:
We are getting source data in Table A (after all calculations) and putting it in data warehouse Table B (this table is supposed to hold all modified rows and history). In other words, if one or more columns got updated in Table A yesterday, then we need a new row of data in Table B.
Currently, there are about 70K rows with about 80 columns; and per our user, all columns except one are updatable on their end.
So, I have a view that shows the latest and distinct data, which I am using to compare each column of each row.
Now here is the problem - it is taking far too long.
code sample -
[font="Arial Narrow"]Insert into TableB
Select A.* from TableA A
WHERE EXISTS (select B.ID
from vw_On_TableB B
where A.id = B.ID
and
( A.column2 <> B.Column2
or A.Column3 <> B.Column3
or A.....
or A.Column80 <> B.Column80
))
OR Not exists ( select 1 from vw_On_TableB B
where A.id = B.ID
)[/font]
Can you suggest what I should use to make this more efficient? Please help. Thanks in advance.
June 8, 2011 at 5:58 am
Instead of trying to build this yourself, I'd suggest you pick up one of the third party tools out there that are designed for doing exactly this. It'll make your life much easier. I'd suggest checking out Red Gate SQL Data Compare (disclosure, I work for them).
"The credit belongs to the man who is actually in the arena, whose face is marred by dust and sweat and blood"
- Theodore Roosevelt
Author of:
SQL Server Execution Plans
SQL Server Query Performance Tuning
June 9, 2011 at 9:21 pm
That definitely sounds like a good idea, but I don't think we'd be able to buy another piece of software just for that. Isn't there a T-SQL way to do this?
June 9, 2011 at 11:36 pm
You can use a hash of the row data, transfer it across with the ETL step, and use it as the basis for comparison.
In fact, I have code on my blog that automatically generates all of the required procedures etc. for you. There are some minor limitations, all explained in the posts.
This method is very fast. Our system is significantly larger than what you're describing, and it analyzes several hundred gigabytes of data and transfers changes in about half an hour for all tables.
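To make the idea concrete before the generator code below, here is a minimal sketch, assuming TableB (and vw_On_TableB) carries a row_hash column, and using the column names from the original post as placeholders:

-- Minimal sketch of hash-based change detection. Assumes TableB / vw_On_TableB
-- store a row_hash column; table and column names are placeholders.
insert into TableB (ID, Column2, Column3, /* ...Column4 through Column79... */ Column80, row_hash)
select src.ID, src.Column2, src.Column3, /* ... */ src.Column80, src.row_hash
from (
    select A.*,
           hashbytes('SHA1',
               isnull(cast(A.Column2 as varchar(255)), '') + '|' +
               isnull(cast(A.Column3 as varchar(255)), '') + '|' +
               -- ...repeat for the remaining updatable columns...
               isnull(cast(A.Column80 as varchar(255)), '') + '|') as row_hash
    from TableA A
) src
left join vw_On_TableB cur
       on cur.ID = src.ID
      and cur.row_hash = src.row_hash    -- same key and same hash = unchanged
where cur.ID is null;                    -- keep only new or changed rows

A single hash comparison per row replaces the 79 column-by-column comparisons, and the hash can be computed (and filtered on) at the source.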
June 10, 2011 at 12:08 am
By the way, we were able to further improve the performance of this solution given some knowledge of business rules.
For instance, in an invoice table, posted invoices cannot be modified, so there is no need to check to see whether they have changed.
Now, the initial solution doesn't have the greatest support for adding additional predicates. You want the predicates to be evaluated locally at the source for maximum performance. As such, the solution I posted on my blog isn't complete. I've only just realised that I never updated it. So let me go over the process again here, with the extra code required.
In brief, you need the following things for each table you want to capture:
1) A way to get the metadata for the table you are capturing, so you can generate a destination table. (This is easy if you are doing one table, but we were doing over 200, which is why all of the procedures below spit out DDL to make the objects for you based on table name parameters)
2) The destination table that will hold your captured data.
3) A way to have useful efficiency-gaining predicates evaluated at the source. For this we use a view.
4) A procedure to efficiently move data from the source view (3) to the destination table (2)
I have the code to automatically generate all this in a "config" schema.
Here's the code for (1), a function that returns the column definitions for your selected source table (passed in as the argument):
----------------------------------------------------------------------------------------------------------------------------
-- return metadata about a given table
----------------------------------------------------------------------------------------------------------------------------
ALTER function [config].[source_column_definitions] (@schema_name sysname, @table_name sysname) returns table as
return (
select
tb.table_schema,
quotename(tb.table_name) as table_name,
quotename(co.column_name) as column_name,
co.data_type +
case
when co.data_type like '%char%' or co.data_type like '%binary%' then
case
when co.character_maximum_length = -1 then '(max)'
else '(' + cast(co.character_maximum_length as varchar(4)) + ')'
end
when co.data_type in ('decimal', 'numeric') then '(' + cast(co.numeric_precision as varchar(4)) + ', ' + cast(co.numeric_scale as varchar(4)) + ')'
else ''
end as data_type,
quotename(co.column_name) + ' ' +
co.data_type +
case
when co.data_type like '%char%' or co.data_type like '%binary%' then
case
when co.character_maximum_length = -1 then '(max)'
else '(' + cast(co.character_maximum_length as varchar(4)) + ')'
end
when co.data_type in ('decimal', 'numeric') then '(' + cast(co.numeric_precision as varchar(4)) + ', ' + cast(co.numeric_scale as varchar(4)) + ')'
else ''
end +
case
when co.IS_NULLABLE = 'NO' then ' not null'
else ''
end as column_definition,
co.ordinal_position,
kc.ordinal_position as key_ordinal_position
from OLTP.natlive.information_schema.tables tb
join OLTP.natlive.information_schema.columns co on co.table_catalog = tb.table_catalog
and co.table_schema = tb.table_schema
and co.table_name = tb.table_name
left join OLTP.natlive.information_schema.table_constraints tc on tc.table_catalog = tb.table_catalog
and tc.table_schema = tb.table_schema
and tc.table_name = tb.table_name
and tc.constraint_type = 'PRIMARY KEY'
left join OLTP.natlive.information_schema.key_column_usage kc on kc.constraint_catalog = tc.constraint_catalog
and kc.constraint_schema = tc.constraint_schema
and kc.constraint_name = tc.constraint_name
and kc.column_name = co.column_name
where (co.character_maximum_length is null or co.character_maximum_length between 0 and 255)
AND co.DATA_TYPE != 'timestamp'
and tb.table_schema like @schema_name
and tb.table_name like @table_name
)
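To sanity-check the output for a single table (the table name here is just the placeholder from the original post), you can query the function directly:

-- Hypothetical check: inspect the captured metadata for one source table
select *
from config.source_column_definitions('dbo', 'TableA')
order by ordinal_position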
Here's the code for (2), a function which returns DDL, using (1), that generates a destination table:
----------------------------------------------------------------------------------------------------------------------------
-- DDL to create CDC destination table and indexes
----------------------------------------------------------------------------------------------------------------------------
ALTER function [config].[destination_table_DDL] (@source_table_name sysname) returns varchar(max) as begin
-- initialize with DDL common to all CDC tables
declare @DDL varchar(max) set @DDL =
'create table CDC.' + quotename(@source_table_name)
+ ' (row_effective_start_DT datetime not null, row_effective_end_DT datetime not null, row_hash varbinary(20) not null, '
declare @index_ddl varchar(max) set @index_ddl = ''
-- use concatenation trick to build column and index definitions
select @DDL = @DDL + t.column_definition + ', ',
@index_ddl = @index_ddl + case when t.key_ordinal_position is null then '' else t.column_name + ', ' end
from config.source_column_definitions('dbo', @source_table_name) t
order by t.key_ordinal_position asc, t.ordinal_position asc
-- remove trailing comma, close table definition and add terminator
set @DDL = left(@DDL, len(@DDL) - 1) + '); '
-- add primary key
set @DDL = @DDL
+ 'alter table CDC.' + quotename(@source_table_name) + ' add constraint PK_' + @source_table_name
+ ' primary key clustered (row_effective_start_DT, ' + left(@index_ddl, len(@index_ddl) - 1) + '); '
-- add unique index
set @DDL = @DDL
+ 'create unique index ux_' + @source_table_name + ' on CDC.' + @source_table_name
+ '(' + @index_ddl + 'row_effective_end_DT) include (row_hash, row_effective_start_DT); '
-- add foreign key to audit schema. Comment this out if there is no audit schema
set @DDL = @DDL
+ 'alter table CDC.' + quotename(@source_table_name) + ' add constraint FK_' + @source_table_name
+ '__ETL_executions foreign key (row_effective_start_DT) references CDC.ETL_executions (ETL_cutoff_DT);'
set @DDL = @DDL
+ 'create index IX_' + @source_table_name + '_dates on CDC.' + @source_table_name + '(row_effective_end_DT, row_effective_start_DT);'
return @DDL
end
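The function only returns the DDL as text, so preview it and run it yourself once it looks right; for example (table name again just a placeholder):

-- Hypothetical: preview, then execute, the destination table DDL
declare @ddl varchar(max) set @ddl = config.destination_table_DDL('TableA')
print @ddl
-- exec (@ddl)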
Here is the code that generates the view (3) on the source side:
----------------------------------------------------------------------------------------------------------------------------
-- gets the ddl for view to create on source
----------------------------------------------------------------------------------------------------------------------------
ALTER function [config].[source_view_DDL](@source_table_name sysname) returns varchar(max) as begin
declare @cols varchar(max) set @cols = ''
declare @DDL varchar(max) set @DDL =
'create view CDC.' + quotename(@source_table_name) + ' as /n' +
'select etl.cutoff_DT,/n' +
' src.*/n' +
'from (/n' +
' select hashbytes(''SHA1'', {cast_cols}) as row_hash,/n' +
'{cols}/n' +
' from natlive.dbo.' + quotename(@source_table_name) + '/n' +
' ) src/n' +
'outer apply (select top 1 cutoff_DT from BI.natlive_CDC.audit.ETL_table_loads where status_code = ''SUCC'' and table_name = ''' + @source_table_name + ''' order by cutoff_DT desc) etl/n' +
'go'
select @cols = @cols + t.column_name + ', '
from config.source_column_definitions('dbo', @source_table_name) t
order by t.key_ordinal_position asc, t.ordinal_position asc
set @cols = left(@cols, len(@cols) - 1)
set @DDL = replace(@DDL, '{cols}', @cols)
set @cols = 'isnull(cast(' + replace(@cols, ', ', ' as varchar(255)), '''') + ''|'' + isnull(cast(') + ' as varchar(255)), '''') + ''|'''
set @DDL = replace(@DDL, '{cast_cols}', @cols)
return @DDL
end
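Same pattern for the view, except the /n placeholders need swapping for real newlines before you run the printed statement on the source server:

-- Hypothetical: generate the source-side view DDL with the /n placeholders expanded
declare @view_ddl varchar(max)
set @view_ddl = replace(config.source_view_DDL('TableA'), '/n', char(13) + char(10))
print @view_ddl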
And finally (4), the procedure that gets data from the view and puts it into the destination table (don't forget to replace the /n's with actual newlines!!):
----------------------------------------------------------------------------------------------------------------------------
-- gets the ddl required to create the procedure to ETL changed data from the source to the CDC system
----------------------------------------------------------------------------------------------------------------------------
ALTER function [config].[procedure_DDL] (@source_table_name sysname, @use_openquery bit = 0) returns varchar(max) as begin
-- used for easy search and replace to allow reformat of output to match template formatting
declare @n char(2) set @n = '&'
declare @pre_exec varchar(2000)
set @pre_exec = 'insert audit.etl_table_loads (cutoff_DT, status_code, table_name, process_start_DT) values (@ETL_Cutoff_DT, ''PROC'', ''{table}'', getdate())'
declare @post_exec varchar(2000)
set @post_exec = 'update audit.etl_table_loads set status_code = ''{status}'', process_end_DT = GETDATE(), row_count = @rc where [Table_name] = ''{table}'' and cutoff_DT = @ETL_cutoff_DT'
-- initialize with template for merge
declare @DDL varchar(max) set @DDL =
'create procedure CDC.ETL_{table} (@ETL_cutoff_DT datetime, @debug_flag bit = 0) with recompile as begin /n' +
' set nocount on /n' +
' /n' +
' declare @rc int /n' +
' /n' +
' {pre_exec} /n' +
' /n' +
' begin tran /n' +
' /n' +
' begin try /n' +
' /n' +
' insert CDC.[{table}] (row_effective_start_DT, row_effective_end_DT, row_hash, {cols}) /n' +
' select @ETL_cutoff_DT, ''9999-12-30'', src.row_hash, {src_cols} /n' +
case when @use_openquery = 0 then
' from OLTP.natlive_ext.CDC.[{table}] src /n'
else
' from openquery(OLTP, ''select * from natlive_ext.CDC.[{table}]'') src /n'
end+
' left join CDC.[{table}] dst on {join_cols} /n' +
' and dst.row_effective_end_DT = ''9999-12-31'' /n' +
' and src.row_hash = dst.row_hash /n' +
' where dst.row_effective_end_DT is null /n' +
' /n' +
' set @rc = @@rowcount /n' +
' /n' +
' if (@RC > 0) begin /n' +
' update dst /n' +
' set dst.row_effective_end_DT = case /n' +
' when dst.row_effective_end_DT = ''9999-12-31'' then @ETL_cutoff_DT /n' +
' when dst.row_effective_end_Dt = ''9999-12-30'' then ''9999-12-31'' /n' +
' end /n' +
' from CDC.[{table}] dst /n' +
' join CDC.[{table}] src on {join_cols} /n' +
' and dst.row_effective_end_DT between ''9999-12-30'' and ''9999-12-31'' /n' +
' and src.row_effective_start_DT = @ETL_cutoff_DT /n' +
' /n' +
' end /n' +
' /n' +
' commit /n' +
' /n' +
' {post_exec_ok} /n' +
' /n' +
' end try begin catch /n' +
' /n' +
' rollback /n' +
'/n' +
' {post_exec_fail}/n' +
' /n' +
' end catch /n' +
'end /n' +
'GO'
declare @cols varchar(max) set @cols = ''
declare @join_cols varchar(max) set @join_cols = ''
-- build strings
select @cols = @cols + t.column_name + ', ',
@join_cols = @join_cols + case
when t.key_ordinal_position is null then ''
when t.key_ordinal_position = 1 then 'src.' + t.column_name + ' = dst.' + t.column_name
else ' and src.' + t.column_name + ' = dst.' + t.column_name
end
from config.source_column_definitions('dbo', @source_table_name) t
order by t.key_ordinal_position asc, t.ordinal_position asc
-- remove trailing comma
set @cols = left(@cols, len(@cols) - 1)
-- merge audit logic first just because
set @pre_exec = replace(@pre_exec, '{table}', @source_table_name)
set @DDL = replace(@DDL, '{pre_exec}', @pre_exec)
set @post_exec = replace(@post_exec, '{table}', @source_table_name)
set @post_exec = replace(@post_exec, '{status}', 'SUCC')
set @DDL = replace(@DDL, '{post_exec_ok}', @post_exec)
set @post_exec = replace(@post_exec, '''SUCC''', '''FAIL''')
set @DDL = replace(@DDL, '{post_exec_fail}', @post_exec)
-- merge with template
set @DDL = replace (@DDL, '{table}', @source_table_name)
set @DDL = replace(@DDL, '{cols}', @cols)
set @cols = 'src.' + replace(@cols, ', ', ', src.')
set @DDL = replace(@DDL, '{src_cols}', @cols)
set @DDL = replace(@DDL, '{join_cols}', @join_cols)
return @DDL
end
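And the same again for the ETL procedure; pass 1 for the second argument only if you end up adding extra predicates to the source view (see below):

-- Hypothetical: generate the ETL procedure DDL (@use_openquery left off here)
declare @proc_ddl varchar(max)
set @proc_ddl = replace(config.procedure_DDL('TableA', 0), '/n', char(13) + char(10))
print @proc_ddl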
If you have any business rules that can speed things up, you add them to the view definition on a case-by-case basis. For example, after generating the "base" DDL for the source view for our invoice tables, we altered the view and added a predicate to exclude posted invoices, along the lines of the sketch below.
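The invoice table, its columns and the posted flag here are entirely hypothetical, but the altered view follows the same shape the generator produces:

-- Hypothetical: the generated source view, altered by hand to push a
-- business-rule predicate (posted invoices never change) to the source.
alter view CDC.[invoices] as
select etl.cutoff_DT,
       src.*
from (
    select hashbytes('SHA1',
               isnull(cast(invoice_ID as varchar(255)), '') + '|' +
               isnull(cast(invoice_status as varchar(255)), '') + '|') as row_hash,
           invoice_ID,
           invoice_status
    from natlive.dbo.[invoices]
    where invoice_status <> 'POSTED'    -- business rule: skip posted invoices
) src
outer apply (select top 1 cutoff_DT
             from BI.natlive_CDC.audit.ETL_table_loads
             where status_code = 'SUCC' and table_name = 'invoices'
             order by cutoff_DT desc) etl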
The procedure can use openquery against the view so that any extra predicates are evaluated at the source. If you have such modifications, set the @use_openquery argument when generating the procedure; otherwise you needn't bother.
June 10, 2011 at 1:39 am
IMO the simplest solution is to use the Microsoft-provided tool, tablediff.exe. It's what it's there for.
C:\Program Files\Microsoft SQL Server\100\COM\Tablediff.exe
June 10, 2011 at 2:04 am
You may have it already in your software suite: Visual Studio 2010 has a compare tool that you could use.
MCT
MCITP Database Admin 2008
MCITP Database Dev 2008
www.jnrit.com.au/Blog.aspx
June 10, 2011 at 9:20 am
You could indeed use any of several existing (and sometimes free) out-of-the-box tools. But you want to keep history (so there will necessarily be differences that you want to keep; you're not just trying to replicate), and my way is much, much faster, particularly once tweaked based on your business rules! 😛