How to efficiently compare data in two identical tables?

  • Hi All,

    I have seen lots of similar posts but couldn't find one that quite matched my situation. So, here it is:

    We receive source data in Table A (after all calculations) and load it into the data warehouse Table B (this table is supposed to hold all modified rows and their history). In other words, if one or more columns were updated in Table A yesterday, then we need a new row of data in Table B.

    Currently, there are about 70K rows with about 80 columns, and per our users, all columns except one are updatable on their end.

    So, I have a view that shows the latest and distinct data, which I am using to compare each column of each row.

    Now here is the problem: it is taking far too long.

    Code sample:

    Insert into TableB
    Select A.*
    from TableA A
    WHERE EXISTS (select B.ID
                  from vw_On_TableB B
                  where A.ID = B.ID
                  and ( A.Column2 <> B.Column2
                     or A.Column3 <> B.Column3
                     or A.....
                     or A.Column80 <> B.Column80 ))
    OR NOT EXISTS (select 1
                   from vw_On_TableB B
                   where A.ID = B.ID)

    Can you suggest what I should use to make it more efficient? Please help. Thanks in advance.

  • Instead of trying to build this yourself, I'd suggest you pick up one of the third-party tools out there that are designed for doing exactly this. It'll make your life much easier. I'd suggest checking out Red Gate SQL Data Compare (disclosure: I work for them).

    "The credit belongs to the man who is actually in the arena, whose face is marred by dust and sweat and blood"
    - Theodore Roosevelt

    Author of:
    SQL Server Execution Plans
    SQL Server Query Performance Tuning

  • That definitely sounds like a good idea, but I don't think we'd be able to buy more software just for that. Isn't there a T-SQL way to do this?

  • You can use a hash of the row data, transfer it across with the ETL step, and use it as the basis for comparison.

    In fact, I have code on my blog that automatically generates all of the required procedures etc for you. There are some minor limitations, all explained in the posts.

    http://www.sqlservercentral.com/blogs/don_halloran/archive/2010/12/21/easy-cdc-in-5-hours-less-for-you-using-automatic-code-generation.aspx

    http://www.sqlservercentral.com/blogs/don_halloran/archive/2011/01/01/easy-cdc-part-2-automating-a-quick-initial-load-and-some-gotchas.aspx

    This method is very fast. Our system is significantly larger than what you're describing; it analyzes several hundred gigabytes of data and transfers the changes for all tables in about half an hour.
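    To illustrate the general idea (this is a minimal sketch, not the code from the blog posts; SourceTable, DestTable, ID, Col1 and Col2 are placeholder names):

    -- Hash the concatenated column values once per row, then compare one hash
    -- instead of 80 individual columns. New or changed rows get inserted.
    -- (A second step, as in the blog code, then closes off the previous version of any changed row.)
    insert into DestTable (ID, Col1, Col2, row_hash, row_effective_start_DT, row_effective_end_DT)
    select s.ID, s.Col1, s.Col2, s.row_hash, getdate(), '9999-12-31'
    from (
        select ID, Col1, Col2,
               hashbytes('SHA1',
                   isnull(cast(Col1 as varchar(255)), '') + '|' +
                   isnull(cast(Col2 as varchar(255)), '') + '|') as row_hash
        from SourceTable
    ) s
    left join DestTable d
        on  d.ID = s.ID
        and d.row_hash = s.row_hash
        and d.row_effective_end_DT = '9999-12-31'  -- only compare against the current version of the row
    where d.ID is null;                            -- no match means the row is new or has changed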

  • By the way, we were able to further improve the performance of this solution given some knowledge of business rules.

    For instance, in an invoice table, posted invoices cannot be modified, so there is no need to check to see whether they have changed.

    Now, the initial solution doesn't have the greatest support for adding additional predicates. You want the predicates to be evaluated locally at the source for maximum performance. As such, the solution I posted on my blog isn't complete. I've only just realised that I never updated it. So let me go over the process again here, with the extra code required.

    In brief, you need the following things for each table you want to capture:

    1) A way to get the metadata for the table you are capturing, so you can generate a destination table. (This is easy if you are doing one table, but we were doing over 200, which is why all of the procedures below spit out DDL to make the objects for you based on table name parameters)

    2) The destination table that will hold your captured data.

    3) A way to have useful efficiency-gaining predicates evaluated at the source. For this we use a view.

    4) A procedure to efficiently move data from the source view (3) to the destination table (2)

    I have the code to automatically generate all this in a "config" schema.

    Here's the code for (1), a function that returns the column definitions for your selected source table (the schema and table name are passed in as arguments):

    ----------------------------------------------------------------------------------------------------------------------------
    -- return metadata about a given table
    ----------------------------------------------------------------------------------------------------------------------------
    ALTER function [config].[source_column_definitions] (@schema_name sysname, @table_name sysname) returns table as
    return (
        select
            tb.table_schema,
            quotename(tb.table_name) as table_name,
            quotename(co.column_name) as column_name,
            co.data_type +
            case
                when co.data_type like '%char%' or co.data_type like '%binary%' then
                    case
                        when co.character_maximum_length = -1 then '(max)'
                        else '(' + cast(co.character_maximum_length as varchar(4)) + ')'
                    end
                when co.data_type in ('decimal', 'numeric') then '(' + cast(co.numeric_precision as varchar(4)) + ', ' + cast(co.numeric_scale as varchar(4)) + ')'
                else ''
            end as data_type,
            quotename(co.column_name) +
            co.data_type +
            case
                when co.data_type like '%char%' or co.data_type like '%binary%' then
                    case
                        when co.character_maximum_length = -1 then '(max)'
                        else '(' + cast(co.character_maximum_length as varchar(4)) + ')'
                    end
                when co.data_type in ('decimal', 'numeric') then '(' + cast(co.numeric_precision as varchar(4)) + ', ' + cast(co.numeric_scale as varchar(4)) + ')'
                else ''
            end +
            case
                when co.IS_NULLABLE = 'NO' then ' not null'
                else ''
            end as column_definition,
            co.ordinal_position,
            kc.ordinal_position as key_ordinal_position
        from OLTP.natlive.information_schema.tables tb
        join OLTP.natlive.information_schema.columns co on co.table_catalog = tb.table_catalog
            and co.table_schema = tb.table_schema
            and co.table_name = tb.table_name
        left join OLTP.natlive.information_schema.table_constraints tc on tc.table_catalog = tb.table_catalog
            and tc.table_schema = tb.table_schema
            and tc.table_name = tb.table_name
            and tc.constraint_type = 'PRIMARY KEY'
        left join OLTP.natlive.information_schema.key_column_usage kc on kc.constraint_catalog = tc.constraint_catalog
            and kc.constraint_schema = tc.constraint_schema
            and kc.constraint_name = tc.constraint_name
            and kc.column_name = co.column_name
        where (co.character_maximum_length is null or co.character_maximum_length between 0 and 255)
            and co.DATA_TYPE != 'timestamp'
            and tb.table_schema like @schema_name
            and tb.table_name like @table_name
    )
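    For example, you can check what that function returns for a single table before generating anything (the table name 'invoices' below is just a placeholder; the arguments are treated as LIKE patterns):

    -- Placeholder schema/table name; review the metadata the generators will use.
    select *
    from config.source_column_definitions('dbo', 'invoices')
    order by ordinal_position;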

    Here's the code for (2), a function that uses (1) to return the DDL for creating the destination table:

    ----------------------------------------------------------------------------------------------------------------------------
    -- DDL to create CDC destination table and indexes
    ----------------------------------------------------------------------------------------------------------------------------
    ALTER function [config].[destination_table_DDL] (@source_table_name sysname) returns varchar(max) as begin
        -- initialize with DDL common to all CDC tables
        declare @DDL varchar(max) set @DDL =
            'create table CDC.' + quotename(@source_table_name)
            + ' (row_effective_start_DT datetime not null, row_effective_end_DT datetime not null, row_hash varbinary(20) not null, '
        declare @index_ddl varchar(max) set @index_ddl = ''

        -- use concatenation trick to build column and index definitions
        select @DDL = @DDL + t.column_definition + ', ',
            @index_ddl = @index_ddl + case when t.key_ordinal_position is null then '' else t.column_name + ', ' end
        from config.source_column_definitions('dbo', @source_table_name) t
        order by t.key_ordinal_position asc, t.ordinal_position asc

        -- remove trailing comma, close table definition and add terminator
        set @DDL = left(@DDL, len(@DDL) - 1) + '); '

        -- add primary key
        set @DDL = @DDL
            + 'alter table CDC.' + quotename(@source_table_name) + ' add constraint PK_' + @source_table_name
            + ' primary key clustered (row_effective_start_DT, ' + left(@index_ddl, len(@index_ddl) - 1) + '); '

        -- add unique index
        set @DDL = @DDL
            + 'create unique index ux_' + @source_table_name + ' on CDC.' + @source_table_name
            + '(' + @index_ddl + 'row_effective_end_DT) include (row_hash, row_effective_start_DT); '

        -- add foreign key to audit schema. Comment this out if there is no audit schema
        set @DDL = @DDL
            + 'alter table CDC.' + quotename(@source_table_name) + ' add constraint FK_' + @source_table_name
            + '__ETL_executions foreign key (row_effective_start_DT) references CDC.ETL_executions (ETL_cutoff_DT);'

        set @DDL = @DDL
            + 'create index IX_' + @source_table_name + '_dates on CDC.' + @source_table_name + '(row_effective_end_DT, row_effective_start_DT);'

        return @DDL
    end
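    As a usage sketch (again with a placeholder table name), you generate the DDL, review it, and then execute it on the warehouse side; the CDC schema and the audit objects it references must already exist:

    -- 'invoices' is a placeholder source table name.
    declare @ddl varchar(max)
    set @ddl = config.destination_table_DDL('invoices')
    select @ddl   -- review the generated table, key and index statements
    exec (@ddl)   -- then run them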

    Here is the code that generates the view (3) on the source side:

    ----------------------------------------------------------------------------------------------------------------------------
    -- gets the ddl for view to create on source
    ----------------------------------------------------------------------------------------------------------------------------
    ALTER function [config].[source_view_DDL] (@source_table_name sysname) returns varchar(max) as begin
        declare @cols varchar(max) set @cols = ''
        declare @DDL varchar(max) set @DDL =
            'create view CDC.' + quotename(@source_table_name) + ' as /n' +
            'select etl.cutoff_DT,/n' +
            '       src.*/n' +
            'from (/n' +
            '       select hashbytes(''SHA1'', {cast_cols}) as row_hash,/n' +
            '{cols}/n' +
            '       from natlive.dbo.' + quotename(@source_table_name) + '/n' +
            '     ) src/n' +
            'outer apply (select top 1 cutoff_DT from BI.natlive_CDC.audit.ETL_table_loads where status_code = ''SUCC'' and table_name = ''' + @source_table_name + ''' order by cutoff_DT desc) etl/n' +
            'go'

        select @cols = @cols + t.column_name + ', '
        from config.source_column_definitions('dbo', @source_table_name) t
        order by t.key_ordinal_position asc, t.ordinal_position asc

        set @cols = left(@cols, len(@cols) - 1)
        set @DDL = replace(@DDL, '{cols}', @cols)
        set @cols = 'isnull(cast(' + replace(@cols, ', ', ' as varchar(255)), '''') + ''|'' + isnull(cast(') + ' as varchar(255)), '''') + ''|'''
        set @DDL = replace(@DDL, '{cast_cols}', @cols)
        return @DDL
    end
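    The output of this one has to be run on the source server, since that is where the view lives, so the sketch below simply returns the text for you to copy across; the /n markers become real line breaks, and 'invoices' is a placeholder:

    -- Generate the source-side view definition and swap the /n markers for newlines.
    select replace(config.source_view_DDL('invoices'), '/n', char(13) + char(10));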

    And finally (4), the function that generates the procedure to get data from the view and put it into the destination table (don't forget to replace the /n's with actual newlines!):

    ----------------------------------------------------------------------------------------------------------------------------
    -- gets the ddl required to create the procedure to ETL changed data from the source to the CDC system
    ----------------------------------------------------------------------------------------------------------------------------
    ALTER function [config].[procedure_DDL] (@source_table_name sysname, @use_openquery bit = 0) returns varchar(max) as begin
        -- used for easy search and replace to allow reformat of output to match template formatting
        declare @n char(2) set @n = '&'

        declare @pre_exec varchar(2000)
        set @pre_exec = 'insert audit.etl_table_loads (cutoff_DT, status_code, table_name, process_start_DT) values (@ETL_Cutoff_DT, ''PROC'', ''{table}'', getdate())'

        declare @post_exec varchar(2000)
        set @post_exec = 'update audit.etl_table_loads set status_code = ''{status}'', process_end_DT = GETDATE(), row_count = @rc where [Table_name] = ''{table}'' and cutoff_DT = @ETL_cutoff_DT'

        -- initialize with template for merge
        declare @DDL varchar(max) set @DDL =
            'create procedure CDC.ETL_{table} (@ETL_cutoff_DT datetime, @debug_flag bit = 0) with recompile as begin /n' +
            ' set nocount on /n' +
            ' /n' +
            ' declare @rc int /n' +
            ' /n' +
            ' {pre_exec} /n' +
            ' /n' +
            ' begin tran /n' +
            ' /n' +
            ' begin try /n' +
            ' /n' +
            ' insert CDC.[{table}] (row_effective_start_DT, row_effective_end_DT, row_hash, {cols}) /n' +
            ' select @ETL_cutoff_DT, ''9999-12-30'', src.row_hash, {src_cols} /n' +
            case when @use_openquery = 0 then
            ' from OLTP.natlive_ext.CDC.[{table}] src /n'
            else
            ' from openquery(OLTP, ''select * from natlive_ext.CDC.[{table}]'') src /n'
            end +
            ' left join CDC.[{table}] dst on {join_cols} /n' +
            ' and dst.row_effective_end_DT = ''9999-12-31'' /n' +
            ' and src.row_hash = dst.row_hash /n' +
            ' where dst.row_effective_end_DT is null /n' +
            ' /n' +
            ' set @rc = @@rowcount /n' +
            ' /n' +
            ' if (@RC > 0) begin /n' +
            ' update dst /n' +
            ' set dst.row_effective_end_DT = case /n' +
            ' when dst.row_effective_end_DT = ''9999-12-31'' then @ETL_cutoff_DT /n' +
            ' when dst.row_effective_end_Dt = ''9999-12-30'' then ''9999-12-31'' /n' +
            ' end /n' +
            ' from CDC.[{table}] dst /n' +
            ' join CDC.[{table}] src on {join_cols} /n' +
            ' and dst.row_effective_end_DT between ''9999-12-30'' and ''9999-12-31'' /n' +
            ' and src.row_effective_start_DT = @ETL_cutoff_DT /n' +
            ' /n' +
            ' end /n' +
            ' /n' +
            ' commit /n' +
            ' /n' +
            ' {post_exec_ok} /n' +
            ' /n' +
            ' end try begin catch /n' +
            ' /n' +
            ' rollback /n' +
            '/n' +
            ' {post_exec_fail}/n' +
            ' /n' +
            ' end catch /n' +
            'end /n' +
            'GO'

        declare @cols varchar(max) set @cols = ''
        declare @join_cols varchar(max) set @join_cols = ''

        -- build strings
        select @cols = @cols + t.column_name + ', ',
            @join_cols = @join_cols + case
                when t.key_ordinal_position is null then ''
                when t.key_ordinal_position = 1 then 'src.' + t.column_name + ' = dst.' + t.column_name
                else ' and src.' + t.column_name + ' = dst.' + t.column_name
            end
        from config.source_column_definitions('dbo', @source_table_name) t
        order by t.key_ordinal_position asc, t.ordinal_position asc

        -- remove trailing comma
        set @cols = left(@cols, len(@cols) - 1)

        -- merge audit logic first just because
        set @pre_exec = replace(@pre_exec, '{table}', @source_table_name)
        set @DDL = replace(@DDL, '{pre_exec}', @pre_exec)
        set @post_exec = replace(@post_exec, '{table}', @source_table_name)
        set @post_exec = replace(@post_exec, '{status}', 'SUCC')
        set @DDL = replace(@DDL, '{post_exec_ok}', @post_exec)
        set @post_exec = replace(@post_exec, '''SUCC''', '''FAIL''')
        set @DDL = replace(@DDL, '{post_exec_fail}', @post_exec)

        -- merge with template
        set @DDL = replace(@DDL, '{table}', @source_table_name)
        set @DDL = replace(@DDL, '{cols}', @cols)
        set @cols = 'src.' + replace(@cols, ', ', ', src.')
        set @DDL = replace(@DDL, '{src_cols}', @cols)
        set @DDL = replace(@DDL, '{join_cols}', @join_cols)
        return @DDL
    end
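    Putting it together for one table might look like the sketch below ('invoices' is a placeholder, and the linked-server and schema names are whatever the functions above assume in your environment):

    -- Generate the loader procedure, swap the /n markers for real newlines, then review and run it.
    declare @proc_ddl varchar(max)
    set @proc_ddl = replace(config.procedure_DDL('invoices', 0), '/n', char(13) + char(10))
    select @proc_ddl   -- the output ends with GO, so paste and run it as its own batch rather than exec()'ing it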

    If you have any business rules that can speed things up, you add them to the view definition on a case-by-case basis. For example, after generating the "base" DDL for the source view for our invoice tables, we altered the view and added a predicate to exclude posted invoices.

    The procedure can use openquery against the view so that any extra predicates are evaluated at the source. If you have such modifications, set the @use_openquery argument to 1 when generating the procedure. Otherwise you needn't bother.
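    As a rough illustration of that kind of tweak (the table and column names below are hypothetical, and the shape of the view is the one the generator above produces):

    -- Hypothetical example: posted invoices never change, so don't hash or transfer them at all.
    alter view CDC.[invoices] as
    select etl.cutoff_DT,
           src.*
    from (
          select hashbytes('SHA1',
                     isnull(cast(invoice_ID as varchar(255)), '') + '|' +
                     isnull(cast(invoice_status as varchar(255)), '') + '|') as row_hash,
                 invoice_ID, invoice_status   -- remaining columns as generated
          from natlive.dbo.invoices
          where invoice_status <> 'POSTED'    -- the business-rule predicate, evaluated at the source
         ) src
    outer apply (select top 1 cutoff_DT from BI.natlive_CDC.audit.ETL_table_loads where status_code = 'SUCC' and table_name = 'invoices' order by cutoff_DT desc) etl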

  • IMO the simplest solution is to use the Microsoft-provided tool, tablediff.exe. It's what it's there for.

    C:\Program Files\Microsoft SQL Server\100\COM\Tablediff.exe

    http://msdn.microsoft.com/en-us/library/ms162843.aspx

  • You may already have it in your software suite; Visual Studio 2010 has a compare tool that you could use.

    MCT
    MCITP Database Admin 2008
    MCITP Database Dev 2008
    www.jnrit.com.au/Blog.aspx

  • You could indeed use any of several existing (and sometimes free) out-of-the-box tools. But you want to keep history (so there will necessarily be differences that you want to keep; you're not just trying to replicate), and my way is much, much faster, particularly once tweaked based on your business rules! 😛
