data insering from staging table to final table how to avoid duplicates? please

  • I have two tables Table1(has 10 columns), Table2(has 22 columns).

    every time I insert data from table1(fresh load everyday kidn of staging data) to table2.(appends data to exist data).

    So here I would like to check before insert data to table2 from table1, need to make sure the record doesn't exist in table2 (based on 4 columns, which exist in both tables, for example col1, col2,col3,col4 in both tables).

    if already exist based on these 4 columns then don't insert otherwise it can insert? how can I do this? please advise


  • INSERT Table2(Col1, ...)
    SELECT
       Col1, ...
    FROM Table1 t1
    LEFT OUTER JOIN Table2 t2
    ON
       t1.col1=t2.col1
    AND    t1.col2=t2.col2
    AND    t1.col3=t2.col3
    AND    t1.col4=t2.col4
    WHERE  t2.Col1 IS NULL;

  • Joe Torre - Tuesday, September 12, 2017 9:23 AM


    INSERT Table2(Col1, ...)
    SELECT
       Col1, ...
    FROM Table1 t1
    LEFT OUTER JOIN Table2 t2
    ON
       t1.col1=t2.col1
    AND    t1.col2=t2.col2
    AND    t1.col3=t2.col3
    AND    t1.col4=t2.col4
    WHERE  t2.Col1 IS NULL;

    This method runs the risk of adding duplicates, unless (col1, col2, col3, col4) is a unique index in both tables (which it may well be, but better safe than sorry). It's also likely to be slower than the NOT EXISTS method, so I suggest that you at least try both & compare performance.
    INSERT Table2
    (
      Col1...
    )
    SELECT Col1...
    FROM Table1 t1
    WHERE NOT EXISTS
    (
      SELECT 1
      FROM Table2 t2
      WHERE
        t1.col1  = t2.col1
        AND t1.col2 = t2.col2
        AND t1.col3 = t2.col3
        AND t1.col4 = t2.col4
    );

    The absence of evidence is not evidence of absence.
    Martin Rees

    You can lead a horse to water, but a pencil must be lead.
    Stan Laurel

  • Phil Parkin - Tuesday, September 12, 2017 10:28 AM

    Joe Torre - Tuesday, September 12, 2017 9:23 AM


    INSERT Table2(Col1, ...)
    SELECT
       Col1, ...
    FROM Table1 t1
    LEFT OUTER JOIN Table2 t2
    ON
       t1.col1=t2.col1
    AND    t1.col2=t2.col2
    AND    t1.col3=t2.col3
    AND    t1.col4=t2.col4
    WHERE  t2.Col1 IS NULL;

    This method runs the risk of adding duplicates, unless (col1, col2, col3, col4) is a unique index in both tables (which it may well be, but better safe than sorry). It's also likely to be slower than the NOT EXISTS method, so I suggest that you at least try both & compare performance.
    INSERT Table2
    (
      Col1...
    )
    SELECT Col1...
    FROM Table1 t1
    WHERE NOT EXISTS
    (
      SELECT 1
      FROM Table2 t2
      WHERE
        t1.col1  = t2.col1
        AND t1.col2 = t2.col2
        AND t1.col3 = t2.col3
        AND t1.col4 = t2.col4
    );

    Use merge - https://docs.microsoft.com/en-us/sql/t-sql/statements/merge-transact-sql
    Since the EXISTS operator returns either TRUE or FALSE, the following code may produce duplicate rows containing NULLS for all the 4 cols(rarely possible but still, be safe). This will cause the NULL values to be ADDED each time you run this code. Lets say initially u inserted 10 rows having nulls in all the 4 cols, then again you ran this code and get those 10 rows of nulls inserted making it 20 then 30 ,40 and so on.

    EXISTS operator is not 3 valued logic i.e for (unknown) it return False and eventually converting the predicate to TRUE for NOT exists.

    First solve the problem then write the code !

  • TheCTEGuy - Saturday, September 30, 2017 6:54 AM

    Use merge - https://docs.microsoft.com/en-us/sql/t-sql/statements/merge-transact-sql
    Since the EXISTS operator returns either TRUE or FALSE, the following code may produce duplicate rows containing NULLS for all the 4 cols(rarely possible but still, be safe). This will cause the NULL values to be ADDED each time you run this code. Lets say initially u inserted 10 rows having nulls in all the 4 cols, then again you ran this code and get those 10 rows of nulls inserted making it 20 then 30 ,40 and so on.

    EXISTS operator is not 3 valued logic i.e for (unknown) it return False and eventually converting the predicate to TRUE for NOT exists.

    You have identified a rather unlikely edge case.
    Your recommended alternative has its own fair share of problems which are, in my opinion, far more likely to occur in the real world. Take a look here, for example.

    The absence of evidence is not evidence of absence.
    Martin Rees

    You can lead a horse to water, but a pencil must be lead.
    Stan Laurel

Viewing 5 posts - 1 through 4 (of 4 total)

You must be logged in to reply to this topic. Login to reply