Data Compare in Two Tables

  • I have two tables 'Tbl1' and 'Tbl2' in the same database. Both tables have different number of rows and columns. Typically the number of rows is in the range 3 to 5 million.

    I want to check (and list) what rows in Tbl1 do not exist in Tbl2.

    The check will be based on 3 columns in each table with same data types; i.e. ColumnA in both tables is 'varchar(20)', ColumnB in both tables is 'datetime' (2009-01-16 21:43:35), and ColumnC in both cases is 'int'.

    In case any one of the three columns does not match then that row in Tbl1 does not exist in Tbl2. A deviation to this criteria is that a difference of approx 5 minutes above or below the value in ColumnB of Tbl2 with respect to Tbl1 is acceptable

    Could someone give me the sql code?

  • hi rahydri,

    To me what you described here is a Left Outer Join.

    select column

    from tableA

    left outer join table b

    on tableA.columnA = tableB.columnA

    where tableB.anycolumn is null

    I dont know your level of comfort with SQL queries so do you need help to write the actually query or?

    In your data is it possible that a record exists only in the second table?

    added missing where clause...

  • This sounds a lot like homework?

    [font="Times New Roman"]-- RBarryYoung[/font], [font="Times New Roman"] (302)375-0451[/font] blog: MovingSQL.com, Twitter: @RBarryYoung[font="Arial Black"]
    Proactive Performance Solutions, Inc.
    [/font]
    [font="Verdana"] "Performance is our middle name."[/font]

  • Maxim has pointed you in the right direction. Keep in mind that you can use all comparison operators in join's.

  • Hi Maxim

    thanks for the reply. replys to your questions are as under

    Yes, I would appreciate your help in writing the actual quary

    Yes, there are data rows in Tbl2 which do not exist in Tbl1

    thanks once again

  • Jack Corbett (1/19/2009)


    Maxim has pointed you in the right direction. Keep in mind that you can use all comparison operators in join's.

    jack isn't is good to use WHERE EXISTS clause instead of JOIN, as it took less CPU time as well as elasped time.

    Abhijit - http://abhijitmore.wordpress.com

  • Yes, WHERE EXISTS/NOT EXISTS can perform slightly better. Until recently I was more comfortable reading the LEFT JOIN syntax, and in this case both will work, one, IMO, the LEFT JOIN, is easier for less experienced SQL folks to read.

  • hi again!

    The left join will include all records from first table even if no records match in the second table. All fields from second table are listed as NULL values when no match was found. You can use that as a condition to keep or filter records that do not exist.

    When you join two tables you can list multiple predicates and since your columns are of the same datatype in both tables there is no need to do any conversion.

    To compare two dates you can use the datediff function. The first parameter is the part of the date you want to compare. In this case I chose ms (millisecond) for a better precision.

    select

    Tbl1.ColumnA

    from

    Tbl1 left outer join Tbl2

    on Tbl1.ColumnA = Tbl2.ColumnA

    and datediff(ms,Tbl1.ColumnB,Tbl2.ColumnB) > -300001

    and datediff(ms,Tbl1.ColumnB,Tbl2.ColumnB) < 300001

    and Tbl1.ColumnC = Tbl2.ColumnC

    where

    Tbl2.ColumnA is null

    If this is part of a homework as RBY suggests well I guess you will have to understand how to write a query like this because sooner or later you will be facing an exam and I am pretty sure you wont have access to SSC!

    Let me know if you have more questions!

  • Hi maxim

    Thanks again for the reply. I do not have to appear in any exam so no issue on that account. I am new to SQL. am learning it on my own, I want to learn it due to the job i have recently switched.

    I think I may be able to do the remaining code myself. If I can't I would request you again.

    Regards ans best wishes

  • Forget homework. This problem, AS STATED, is simply not possible to solve. The problem is how to determine that any two given records are a matched set, when the criteria quite clearly might allow one record in one table to match multiple records in the other table. Until you can guarantee the ability to generate a list of matched pairs of records, the rest of the problem is irrelevant.

    However, with the ASSUMPTION that the +/- 5 minutes scenario will NEVER cause a 1 to many situation, you generate the list of Tbl1 records that have matches in Tbl2, and then select those records in Tbl1 that aren't in that list. That last will require comparing ALL fields between Tbl1 and the list.

    Steve

    (aka smunson)

    :):):)

    rahydri (1/19/2009)


    I have two tables 'Tbl1' and 'Tbl2' in the same database. Both tables have different number of rows and columns. Typically the number of rows is in the range 3 to 5 million.

    I want to check (and list) what rows in Tbl1 do not exist in Tbl2.

    The check will be based on 3 columns in each table with same data types; i.e. ColumnA in both tables is 'varchar(20)', ColumnB in both tables is 'datetime' (2009-01-16 21:43:35), and ColumnC in both cases is 'int'.

    In case any one of the three columns does not match then that row in Tbl1 does not exist in Tbl2. A deviation to this criteria is that a difference of approx 5 minutes above or below the value in ColumnB of Tbl2 with respect to Tbl1 is acceptable

    Could someone give me the sql code?

    Steve (aka sgmunson) 🙂 🙂 🙂
    Rent Servers for Income (picks and shovels strategy)

  • Dear Smunson

    Thanks for the interest

    If you analyze the problem i formulated, I do not want to check the matched items. I want to list the rows from Tbl1 which do not exist in Tbl2. It does not matter if a row in Tbl1 exists once, twice or morer times in Tbl2. The tolerance value of +/- five minutes is acceptable to us, it does not matter how many records match in this duration.

    Regards

  • select tb1.*

    from tb1 left outer join tb2 on (tb1.key = tb2.key)

    where tb2.key is null

  • The method remains the same. One still needs to identify whether or not a given record in Tbl1 has a "match" in Tbl2 or not, and if not, output it. Try this:

    ;WITH MATCH_TWO AS (

    SELECT T1.*

    FROM TBL1 AS T1 INNER JOIN TBL2 AS T2

    ON T1.INTFIELD = T2.INTFIELD AND

    T1.VARCHARFIELD = T2.VARCHARFIELD AND

    T1.DATEFIELD BETWEEN DATEADD(minute,-5,T2.DATEFIELD) AND DATEADD(minute,5,T2.DATEFIELD)

    )

    SELECT T1.*

    FROM TBL1 AS T1 INNER JOIN MATCH_TWO AS MT

    ON T1.INTFIELD <> MT.INTFIELD AND

    T1.VARCHARFIELD <> MT.VARCHARFIELD AND'

    T1.DATEFIELD <> MT.DATEFIELD AND

    T1.OTHERFIELD1 <> MT.OTHERFIELD1 AND ...

    .

    .

    .

    This is untested, and you'll have to add EVERY field in TBL1 to the final query. I suspect there may be a better way to do this, but without any test data to work with, I have nothing to go on.

    Steve

    (aka smunson)

    :):):)

    rahydri (1/22/2009)


    Dear Smunson

    Thanks for the interest

    If you analyze the problem i formulated, I do not want to check the matched items. I want to list the rows from Tbl1 which do not exist in Tbl2. It does not matter if a row in Tbl1 exists once, twice or morer times in Tbl2. The tolerance value of +/- five minutes is acceptable to us, it does not matter how many records match in this duration.

    Regards

    Steve (aka sgmunson) 🙂 🙂 🙂
    Rent Servers for Income (picks and shovels strategy)

  • ;WITH MATCH_TWO AS (

    SELECT T1.*

    FROM TBL1 AS T1 INNER JOIN TBL2 AS T2

    ON T1.INTFIELD = T2.INTFIELD AND

    T1.VARCHARFIELD = T2.VARCHARFIELD AND

    T1.DATEFIELD BETWEEN DATEADD(minute,-5,T2.DATEFIELD) AND DATEADD(minute,5,T2.DATEFIELD)

    )

    SELECT T1.*

    FROM TBL1 AS T1 INNER JOIN MATCH_TWO AS MT

    ON T1.INTFIELD <> MT.INTFIELD AND

    T1.VARCHARFIELD <> MT.VARCHARFIELD AND'

    T1.DATEFIELD <> MT.DATEFIELD AND

    T1.OTHERFIELD1 <> MT.OTHERFIELD1 AND ...

    .

    .

    .

    The query you wrote here is basically a triangular join.

    The CTE MATCH_TWO returns the records that match between Tbl1 and Tbl2.

    The query you wrote after is a CROSS JOIN between Tbl1 and MATCH_TWO but you filter out the records where fields match in both tables.

    It will not only perform poorly on big tables but it also does not return the correct records... You will get all records from Tbl1 and a duplicate for all those that do not exist in Tbl2.

    I have test tables and data setup for this and I stand behind my query as one of the possible correct answers for the problem as described.

  • Oops... I find it difficult to code correctly when I have nothing to work with. Anyway, here's another shot at it. I'm not sure if it's any better on the performance issue, but it should at least work:

    ;WITH TABLE_ONE AS (

    SELECT ROW_NUMBER() OVER(ORDER BY INTFIELD, VARCHARFIELD, DATEFIELD) AS RN, T1.*

    FROM TBL1

    ),

    MATCH_TWO AS (

    SELECT T1.*

    FROM TABLE_ONE AS T1 INNER JOIN TBL2 AS T2

    ON T1.INTFIELD = T2.INTFIELD AND

    T1.VARCHARFIELD = T2.VARCHARFIELD AND

    T1.DATEFIELD BETWEEN DATEADD(minute,-5,T2.DATEFIELD) AND DATEADD(minute,5,T2.DATEFIELD)

    )

    SELECT *

    FROM TABLE_ONE

    WHERE RN NOT IN (SELECT RN FROM MATCH_TWO)

    Steve

    (aka smunson)

    :):):)

    Maxim Picard (1/22/2009)


    ;WITH MATCH_TWO AS (

    SELECT T1.*

    FROM TBL1 AS T1 INNER JOIN TBL2 AS T2

    ON T1.INTFIELD = T2.INTFIELD AND

    T1.VARCHARFIELD = T2.VARCHARFIELD AND

    T1.DATEFIELD BETWEEN DATEADD(minute,-5,T2.DATEFIELD) AND DATEADD(minute,5,T2.DATEFIELD)

    )

    SELECT T1.*

    FROM TBL1 AS T1 INNER JOIN MATCH_TWO AS MT

    ON T1.INTFIELD <> MT.INTFIELD AND

    T1.VARCHARFIELD <> MT.VARCHARFIELD AND'

    T1.DATEFIELD <> MT.DATEFIELD AND

    T1.OTHERFIELD1 <> MT.OTHERFIELD1 AND ...

    .

    .

    .

    The query you wrote here is basically a triangular join.

    The CTE MATCH_TWO returns the records that match between Tbl1 and Tbl2.

    The query you wrote after is a CROSS JOIN between Tbl1 and MATCH_TWO but you filter out the records where fields match in both tables.

    It will not only perform poorly on big tables but it also does not return the correct records... You will get all records from Tbl1 and a duplicate for all those that do not exist in Tbl2.

    I have test tables and data setup for this and I stand behind my query as one of the possible correct answers for the problem as described.

    Steve (aka sgmunson) 🙂 🙂 🙂
    Rent Servers for Income (picks and shovels strategy)

Viewing 15 posts - 1 through 15 (of 17 total)

You must be logged in to reply to this topic. Login to reply