How to efficiently compare data in two identical tables?

  • Hi All,

    I have seen lots of similar posts but couldn't find one that quite matched my situation. So, here it is:

    We receive source data in Table A (after all calculations) and load it into the data warehouse Table B (this table is supposed to hold all modified rows and their history). In other words, if one or more columns were updated in Table A yesterday, then we need a new row of data in Table B.

    Currently, there are about 70K rows with about 80 columns, and per our users, all columns except one are updatable on their end.

    So, I have a view that shows the latest and distinct data, which I am using to compare each column of each row.

    Now here is the problem: it is taking far too long.

    Code sample:

    Insert into TableB
    Select A.*
    from TableA A
    WHERE EXISTS (select B.ID
                  from vw_On_TableB B
                  where A.ID = B.ID
                  and ( A.Column2 <> B.Column2
                     or A.Column3 <> B.Column3
                     or A.....
                     or A.Column80 <> B.Column80 ))
    OR NOT EXISTS (select 1
                   from vw_On_TableB B
                   where A.ID = B.ID)

    Can you suggest what I should use to make it more efficient? Please help. Thanks in advance.

  • Instead of trying to build this yourself, I'd suggest you pick up one of the third-party tools out there that are designed for doing exactly this. It'll make your life much easier. I'd suggest checking out Red Gate SQL Data Compare (disclosure: I work for them).

    "The credit belongs to the man who is actually in the arena, whose face is marred by dust and sweat and blood"
    - Theodore Roosevelt

    Author of:
    SQL Server Execution Plans
    SQL Server Query Performance Tuning

  • That definitely sounds like a good idea, but I don't think we'd be able to buy more software just for that. Isn't there a T-SQL way to do this?

  • You can use a hash of the row data, transfer it across with the ETL step, and use it as the basis for comparison.

    In fact, I have code on my blog that automatically generates all of the required procedures etc for you. There are some minor limitations, all explained in the posts.

    http://www.sqlservercentral.com/blogs/don_halloran/archive/2010/12/21/easy-cdc-in-5-hours-less-for-you-using-automatic-code-generation.aspx

    http://www.sqlservercentral.com/blogs/don_halloran/archive/2011/01/01/easy-cdc-part-2-automating-a-quick-initial-load-and-some-gotchas.aspx

    This method is very fast. Our system is significantly larger than what you're describing; it analyzes several hundred gigabytes of data and transfers the changes for all tables in about half an hour.
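    To illustrate the general idea (this is a minimal sketch, not the code from the blog posts; SourceTable, DestTable, ID, Col1 and Col2 are placeholder names):

    -- Hash the concatenated column values once per row, then compare one hash
    -- instead of 80 individual columns. New or changed rows get inserted.
    -- (A second step, as in the blog code, then closes off the previous version of any changed row.)
    insert into DestTable (ID, Col1, Col2, row_hash, row_effective_start_DT, row_effective_end_DT)
    select s.ID, s.Col1, s.Col2, s.row_hash, getdate(), '9999-12-31'
    from (
        select ID, Col1, Col2,
               hashbytes('SHA1',
                   isnull(cast(Col1 as varchar(255)), '') + '|' +
                   isnull(cast(Col2 as varchar(255)), '') + '|') as row_hash
        from SourceTable
    ) s
    left join DestTable d
        on  d.ID = s.ID
        and d.row_hash = s.row_hash
        and d.row_effective_end_DT = '9999-12-31'  -- only compare against the current version of the row
    where d.ID is null;                            -- no match means the row is new or has changed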

  • By the way, we were able to further improve the performance of this solution given some knowledge of business rules.

    For instance, in an invoice table, posted invoices cannot be modified, so there is no need to check to see whether they have changed.

    Now, the initial solution doesn't have the greatest support for adding additional predicates. You want the predicates to be evaluated locally at the source for maximum performance. As such, the solution I posted on my blog isn't complete. I've only just realised that I never updated it. So let me go over the process again here, with the extra code required.

    In brief, you need the following things for each table you want to capture:

    1) A way to get the metadata for the table you are capturing, so you can generate a destination table. (This is easy if you are doing one table, but we were doing over 200, which is why all of the procedures below spit out DDL to make the objects for you based on table name parameters)

    2) The destination table that will hold your captured data.

    3) A way to have useful efficiency-gaining predicates evaluated at the source. For this we use a view.

    4) A procedure to efficiently move data from the source view (3) to the destination table (2)

    I have the code to automatically generate all this in a "config" schema.

    Here's the code for (1), a function that returns the column definitions for your selected source table (the schema and table name are passed in as arguments):

    ----------------------------------------------------------------------------------------------------------------------------
    -- return metadata about a given table
    ----------------------------------------------------------------------------------------------------------------------------
    ALTER function [config].[source_column_definitions] (@schema_name sysname, @table_name sysname) returns table as
    return (
        select
            tb.table_schema,
            quotename(tb.table_name) as table_name,
            quotename(co.column_name) as column_name,
            co.data_type +
            case
                when co.data_type like '%char%' or co.data_type like '%binary%' then
                    case
                        when co.character_maximum_length = -1 then '(max)'
                        else '(' + cast(co.character_maximum_length as varchar(4)) + ')'
                    end
                when co.data_type in ('decimal', 'numeric') then '(' + cast(co.numeric_precision as varchar(4)) + ', ' + cast(co.numeric_scale as varchar(4)) + ')'
                else ''
            end as data_type,
            quotename(co.column_name) +
            co.data_type +
            case
                when co.data_type like '%char%' or co.data_type like '%binary%' then
                    case
                        when co.character_maximum_length = -1 then '(max)'
                        else '(' + cast(co.character_maximum_length as varchar(4)) + ')'
                    end
                when co.data_type in ('decimal', 'numeric') then '(' + cast(co.numeric_precision as varchar(4)) + ', ' + cast(co.numeric_scale as varchar(4)) + ')'
                else ''
            end +
            case
                when co.IS_NULLABLE = 'NO' then ' not null'
                else ''
            end as column_definition,
            co.ordinal_position,
            kc.ordinal_position as key_ordinal_position
        from OLTP.natlive.information_schema.tables tb
        join OLTP.natlive.information_schema.columns co on co.table_catalog = tb.table_catalog
            and co.table_schema = tb.table_schema
            and co.table_name = tb.table_name
        left join OLTP.natlive.information_schema.table_constraints tc on tc.table_catalog = tb.table_catalog
            and tc.table_schema = tb.table_schema
            and tc.table_name = tb.table_name
            and tc.constraint_type = 'PRIMARY KEY'
        left join OLTP.natlive.information_schema.key_column_usage kc on kc.constraint_catalog = tc.constraint_catalog
            and kc.constraint_schema = tc.constraint_schema
            and kc.constraint_name = tc.constraint_name
            and kc.column_name = co.column_name
        where (co.character_maximum_length is null or co.character_maximum_length between 0 and 255)
            and co.DATA_TYPE != 'timestamp'
            and tb.table_schema like @schema_name
            and tb.table_name like @table_name
    )
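    For example, you can check what that function returns for a single table before generating anything (the table name 'invoices' below is just a placeholder; the arguments are treated as LIKE patterns):

    -- Placeholder schema/table name; review the metadata the generators will use.
    select *
    from config.source_column_definitions('dbo', 'invoices')
    order by ordinal_position;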

    Here's the code for (2), a function that uses (1) to return the DDL for creating the destination table:

    ----------------------------------------------------------------------------------------------------------------------------
    -- DDL to create CDC destination table and indexes
    ----------------------------------------------------------------------------------------------------------------------------
    ALTER function [config].[destination_table_DDL] (@source_table_name sysname) returns varchar(max) as begin
        -- initialize with DDL common to all CDC tables
        declare @DDL varchar(max) set @DDL =
            'create table CDC.' + quotename(@source_table_name)
            + ' (row_effective_start_DT datetime not null, row_effective_end_DT datetime not null, row_hash varbinary(20) not null, '
        declare @index_ddl varchar(max) set @index_ddl = ''

        -- use concatenation trick to build column and index definitions
        select @DDL = @DDL + t.column_definition + ', ',
            @index_ddl = @index_ddl + case when t.key_ordinal_position is null then '' else t.column_name + ', ' end
        from config.source_column_definitions('dbo', @source_table_name) t
        order by t.key_ordinal_position asc, t.ordinal_position asc

        -- remove trailing comma, close table definition and add terminator
        set @DDL = left(@DDL, len(@DDL) - 1) + '); '

        -- add primary key
        set @DDL = @DDL
            + 'alter table CDC.' + quotename(@source_table_name) + ' add constraint PK_' + @source_table_name
            + ' primary key clustered (row_effective_start_DT, ' + left(@index_ddl, len(@index_ddl) - 1) + '); '

        -- add unique index
        set @DDL = @DDL
            + 'create unique index ux_' + @source_table_name + ' on CDC.' + @source_table_name
            + '(' + @index_ddl + 'row_effective_end_DT) include (row_hash, row_effective_start_DT); '

        -- add foreign key to audit schema. Comment this out if there is no audit schema
        set @DDL = @DDL
            + 'alter table CDC.' + quotename(@source_table_name) + ' add constraint FK_' + @source_table_name
            + '__ETL_executions foreign key (row_effective_start_DT) references CDC.ETL_executions (ETL_cutoff_DT);'

        set @DDL = @DDL
            + 'create index IX_' + @source_table_name + '_dates on CDC.' + @source_table_name + '(row_effective_end_DT, row_effective_start_DT);'

        return @DDL
    end
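    As a usage sketch (again with a placeholder table name), you generate the DDL, review it, and then execute it on the warehouse side; the CDC schema and the audit objects it references must already exist:

    -- 'invoices' is a placeholder source table name.
    declare @ddl varchar(max)
    set @ddl = config.destination_table_DDL('invoices')
    select @ddl   -- review the generated table, key and index statements
    exec (@ddl)   -- then run them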

    Here is the code that generates the view (3) on the source side:

    ----------------------------------------------------------------------------------------------------------------------------
    -- gets the ddl for view to create on source
    ----------------------------------------------------------------------------------------------------------------------------
    ALTER function [config].[source_view_DDL] (@source_table_name sysname) returns varchar(max) as begin
        declare @cols varchar(max) set @cols = ''
        declare @DDL varchar(max) set @DDL =
            'create view CDC.' + quotename(@source_table_name) + ' as /n' +
            'select etl.cutoff_DT,/n' +
            '       src.*/n' +
            'from (/n' +
            '       select hashbytes(''SHA1'', {cast_cols}) as row_hash,/n' +
            '{cols}/n' +
            '       from natlive.dbo.' + quotename(@source_table_name) + '/n' +
            '     ) src/n' +
            'outer apply (select top 1 cutoff_DT from BI.natlive_CDC.audit.ETL_table_loads where status_code = ''SUCC'' and table_name = ''' + @source_table_name + ''' order by cutoff_DT desc) etl/n' +
            'go'

        select @cols = @cols + t.column_name + ', '
        from config.source_column_definitions('dbo', @source_table_name) t
        order by t.key_ordinal_position asc, t.ordinal_position asc

        set @cols = left(@cols, len(@cols) - 1)
        set @DDL = replace(@DDL, '{cols}', @cols)
        set @cols = 'isnull(cast(' + replace(@cols, ', ', ' as varchar(255)), '''') + ''|'' + isnull(cast(') + ' as varchar(255)), '''') + ''|'''
        set @DDL = replace(@DDL, '{cast_cols}', @cols)
        return @DDL
    end
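    The output of this one has to be run on the source server, since that is where the view lives, so the sketch below simply returns the text for you to copy across; the /n markers become real line breaks, and 'invoices' is a placeholder:

    -- Generate the source-side view definition and swap the /n markers for newlines.
    select replace(config.source_view_DDL('invoices'), '/n', char(13) + char(10));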

    And finally (4), the function that generates the procedure to get data from the view and put it into the destination table (don't forget to replace the /n's with actual newlines!):

    ----------------------------------------------------------------------------------------------------------------------------
    -- gets the ddl required to create the procedure to ETL changed data from the source to the CDC system
    ----------------------------------------------------------------------------------------------------------------------------
    ALTER function [config].[procedure_DDL] (@source_table_name sysname, @use_openquery bit = 0) returns varchar(max) as begin
        -- used for easy search and replace to allow reformat of output to match template formatting
        declare @n char(2) set @n = '&'

        declare @pre_exec varchar(2000)
        set @pre_exec = 'insert audit.etl_table_loads (cutoff_DT, status_code, table_name, process_start_DT) values (@ETL_Cutoff_DT, ''PROC'', ''{table}'', getdate())'

        declare @post_exec varchar(2000)
        set @post_exec = 'update audit.etl_table_loads set status_code = ''{status}'', process_end_DT = GETDATE(), row_count = @rc where [Table_name] = ''{table}'' and cutoff_DT = @ETL_cutoff_DT'

        -- initialize with template for merge
        declare @DDL varchar(max) set @DDL =
            'create procedure CDC.ETL_{table} (@ETL_cutoff_DT datetime, @debug_flag bit = 0) with recompile as begin /n' +
            ' set nocount on /n' +
            ' /n' +
            ' declare @rc int /n' +
            ' /n' +
            ' {pre_exec} /n' +
            ' /n' +
            ' begin tran /n' +
            ' /n' +
            ' begin try /n' +
            ' /n' +
            ' insert CDC.[{table}] (row_effective_start_DT, row_effective_end_DT, row_hash, {cols}) /n' +
            ' select @ETL_cutoff_DT, ''9999-12-30'', src.row_hash, {src_cols} /n' +
            case when @use_openquery = 0 then
            ' from OLTP.natlive_ext.CDC.[{table}] src /n'
            else
            ' from openquery(OLTP, ''select * from natlive_ext.CDC.[{table}]'') src /n'
            end +
            ' left join CDC.[{table}] dst on {join_cols} /n' +
            ' and dst.row_effective_end_DT = ''9999-12-31'' /n' +
            ' and src.row_hash = dst.row_hash /n' +
            ' where dst.row_effective_end_DT is null /n' +
            ' /n' +
            ' set @rc = @@rowcount /n' +
            ' /n' +
            ' if (@RC > 0) begin /n' +
            ' update dst /n' +
            ' set dst.row_effective_end_DT = case /n' +
            ' when dst.row_effective_end_DT = ''9999-12-31'' then @ETL_cutoff_DT /n' +
            ' when dst.row_effective_end_Dt = ''9999-12-30'' then ''9999-12-31'' /n' +
            ' end /n' +
            ' from CDC.[{table}] dst /n' +
            ' join CDC.[{table}] src on {join_cols} /n' +
            ' and dst.row_effective_end_DT between ''9999-12-30'' and ''9999-12-31'' /n' +
            ' and src.row_effective_start_DT = @ETL_cutoff_DT /n' +
            ' /n' +
            ' end /n' +
            ' /n' +
            ' commit /n' +
            ' /n' +
            ' {post_exec_ok} /n' +
            ' /n' +
            ' end try begin catch /n' +
            ' /n' +
            ' rollback /n' +
            '/n' +
            ' {post_exec_fail}/n' +
            ' /n' +
            ' end catch /n' +
            'end /n' +
            'GO'

        declare @cols varchar(max) set @cols = ''
        declare @join_cols varchar(max) set @join_cols = ''

        -- build strings
        select @cols = @cols + t.column_name + ', ',
            @join_cols = @join_cols + case
                when t.key_ordinal_position is null then ''
                when t.key_ordinal_position = 1 then 'src.' + t.column_name + ' = dst.' + t.column_name
                else ' and src.' + t.column_name + ' = dst.' + t.column_name
            end
        from config.source_column_definitions('dbo', @source_table_name) t
        order by t.key_ordinal_position asc, t.ordinal_position asc

        -- remove trailing comma
        set @cols = left(@cols, len(@cols) - 1)

        -- merge audit logic first just because
        set @pre_exec = replace(@pre_exec, '{table}', @source_table_name)
        set @DDL = replace(@DDL, '{pre_exec}', @pre_exec)
        set @post_exec = replace(@post_exec, '{table}', @source_table_name)
        set @post_exec = replace(@post_exec, '{status}', 'SUCC')
        set @DDL = replace(@DDL, '{post_exec_ok}', @post_exec)
        set @post_exec = replace(@post_exec, '''SUCC''', '''FAIL''')
        set @DDL = replace(@DDL, '{post_exec_fail}', @post_exec)

        -- merge with template
        set @DDL = replace(@DDL, '{table}', @source_table_name)
        set @DDL = replace(@DDL, '{cols}', @cols)
        set @cols = 'src.' + replace(@cols, ', ', ', src.')
        set @DDL = replace(@DDL, '{src_cols}', @cols)
        set @DDL = replace(@DDL, '{join_cols}', @join_cols)
        return @DDL
    end
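    Putting it together for one table might look like the sketch below ('invoices' is a placeholder, and the linked-server and schema names are whatever the functions above assume in your environment):

    -- Generate the loader procedure, swap the /n markers for real newlines, then review and run it.
    declare @proc_ddl varchar(max)
    set @proc_ddl = replace(config.procedure_DDL('invoices', 0), '/n', char(13) + char(10))
    select @proc_ddl   -- the output ends with GO, so paste and run it as its own batch rather than exec()'ing it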

    If you have any business rules that can speed things up, you add them to the view definition on a case-by-case basis. For example, after generating the "base" DDL for the source view for our invoice tables, we altered the view and added a predicate to exclude posted invoices.

    The procedure can use openquery against the view so that any extra predicates are evaluated at the source. If you have such modifications, set the @use_openquery argument to 1 when generating the procedure. Otherwise you needn't bother.
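    As a rough illustration of that kind of tweak (the table and column names below are hypothetical, and the shape of the view is the one the generator above produces):

    -- Hypothetical example: posted invoices never change, so don't hash or transfer them at all.
    alter view CDC.[invoices] as
    select etl.cutoff_DT,
           src.*
    from (
          select hashbytes('SHA1',
                     isnull(cast(invoice_ID as varchar(255)), '') + '|' +
                     isnull(cast(invoice_status as varchar(255)), '') + '|') as row_hash,
                 invoice_ID, invoice_status   -- remaining columns as generated
          from natlive.dbo.invoices
          where invoice_status <> 'POSTED'    -- the business-rule predicate, evaluated at the source
         ) src
    outer apply (select top 1 cutoff_DT from BI.natlive_CDC.audit.ETL_table_loads where status_code = 'SUCC' and table_name = 'invoices' order by cutoff_DT desc) etl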

  • IMO the simplest solution is to use the Microsoft-provided tool, tablediff.exe. It's what it's there for.

    C:\Program Files\Microsoft SQL Server\100\COM\Tablediff.exe

    http://msdn.microsoft.com/en-us/library/ms162843.aspx

  • You may already have it in your software suite; Visual Studio 2010 has a compare tool that you could use.

    MCT
    MCITP Database Admin 2008
    MCITP Database Dev 2008
    www.jnrit.com.au/Blog.aspx

  • You could indeed use any of several existing (and sometimes free) out-of-the-box tools. But you want to keep history (so there will necessarily be differences that you want to keep; you're not just trying to replicate), and my way is much, much faster, particularly once tweaked based on your business rules! 😛
