how to find duplicates in a table as per particular column

  • Hi All,

    I am having a main table and temptable.

    Every sunday, new data will be loaded from temptable to main table. I have to make sure that, duplicates does not get loaded from temptable to maintable.

    For example, if last sunday a record gets loaded from temp to main. If this sunday also the same record is present then it means that is a duplicate.

    The duplicate is decided on below scenario

    select 'CodeChanges: ', count(*) from CodeChanges a, CodeChanges_Temp b

    where a.AccountNumber = b.AccountNumber and

    a.HexaNumber = b.HexaNumber and

    a.HexaEffDate = b.HexaEffDate and

    a.HexaId = b.HexaId and

    a.TransactionEffDate = b.TransactionEffDate and

    isnull(a.TransactionExpDate,getdate()) = isnull(b.TransactionExpDate,getdate()) and

    a.HexaCode = b.HexaCode and

    and a.CompanyCode = b.CompanyCode

    a.CategoryCode = b.CategoryCode and

    a.AdjustmentUserId = b.AdjustmentUserId

    Yesterday (Sunday) , data from temp got loaded onto maintable but with duplicates.

    There is a log which just displays number of duplicates.

    Yesterday the log displayed 8 duplicates found. I need to find out the 8 duplicates which got loaded yesterday and delete it off from main table.

    There is a column in both tables which is 'creation date and time'. Every Sunday when the load happens this column will have that day's date .

    Now i need to find out what are all the duplicates which got loaded on this sunday.

    The total rows in temp table is : 363

    No of duplicates present is : 8

    I used below query to find out the duplicates but it is returning all the 363 rows from the maintable instead of the 8 duplicates.

    Select 'CodeChanges: ', * from CodeChanges a where

    exists

    ( Select 1 from CodeChanges_Temp b where

    a.HexaNumber = b.HexaNumber and

    a.HexaEffDate = b.HexaEffDate and

    a.HexaId = b.HexaId and

    a.TransactionEffDate = b.TransactionEffDate and

    isnull(a.TransactionExpDate,getdate()) = isnull(b.TransactionExpDate,getdate()) and

    a.HexaCode = b.HexaCode and

    and a.CompanyCode = b.CompanyCode

    a.CategoryCode = b.CategoryCode and

    a.AdjustmentUserId = b.AdjustmentUserId

    b.CreationDateTime = '2015-11-01 00:00:00.000'

    )

    and a.CreationDateTime = '2015-11-01 00:00:00.000'

    Please help me in finding the duplicate records which has creation date time as '2015-11-01 00:00:00.000'

    and all the above columns mentioned in the query matches.

    Example

    Few colums only metioned below

    creationdateandtime HexaNumber HexaCode

    1. 1987-10-01 00:00:00.000 5 3

    2. 2015-11-01 00:00:00.000 5 3

    So here the second record is duplicate. This is what I am trying to find.

  • A query like this

    SELECT

    rn = ROW_NUMBER() OVER(PARTITION BY AccountNumber, HexaNumber, HexaEffDate, HexaId, TransactionEffDate, TransactionExpDate, HexaCode, CompanyCode, CategoryCode, AdjustmentUserId

    ORDER BY (SELECT NULL)),

    *

    FROM CodeChanges

    should identify your dupes. You might have to adjust the column list in the PARTITION BY clause. When you are sure that the column list is correct, "primary" values have rn = 1 and "duplicate" values have rn > 1.

    “Write the query the simplest way. If through testing it becomes clear that the performance is inadequate, consider alternative query forms.” - Gail Shaw

    For fast, accurate and documented assistance in answering your questions, please read this article.
    Understanding and using APPLY, (I) and (II) Paul White
    Hidden RBAR: Triangular Joins / The "Numbers" or "Tally" Table: What it is and how it replaces a loop Jeff Moden

  • Hi Chris,

    Thanks very much for the advise !

    I have one doubt, please clarify on the below quotes,

    You might have to adjust the column list in the PARTITION BY clause

    You mean if I have anymore columns to add in the condition, then I need to add, right ?

    When you are sure that the column list is correct, "primary" values have rn = 1 and "duplicate" values have rn > 1.

    you mean the output will show all the records which is unique as well as duplicate ?

    I just need the duplicate records.

    The count of CodeChanges table is 588067. I think its very high.

    So just wanted to confirm before running the query to avoid any database issue.

    Please advise .

  • muralikrishna2489 (11/2/2015)


    Hi Chris,

    Thanks very much for the advise !

    I have one doubt, please clarify on the below quotes,

    You might have to adjust the column list in the PARTITION BY clause

    You mean if I have anymore columns to add in the condition, then I need to add, right ?

    When you are sure that the column list is correct, "primary" values have rn = 1 and "duplicate" values have rn > 1.

    you mean the output will show all the records which is unique as well as duplicate ?

    I just need the duplicate records.

    The count of CodeChanges table is 588067. I think its very high.

    So just wanted to confirm before running the query to avoid any database issue.

    Please advise .

    The column list in the PARTITION BY clause should exactly match the column list which defines a duplicate in your table.

    “Write the query the simplest way. If through testing it becomes clear that the performance is inadequate, consider alternative query forms.” - Gail Shaw

    For fast, accurate and documented assistance in answering your questions, please read this article.
    Understanding and using APPLY, (I) and (II) Paul White
    Hidden RBAR: Triangular Joins / The "Numbers" or "Tally" Table: What it is and how it replaces a loop Jeff Moden

  • muralikrishna2489 (11/2/2015)


    you mean the output will show all the records which is unique as well as duplicate ?

    I just need the duplicate records.

    The count of CodeChanges table is 588067. I think its very high.

    So just wanted to confirm before running the query to avoid any database issue.

    Please advise .

    Take a look at Chris's note below his query. The primary rows will have rn = 1. The duplicates (as defined by the column list in the PARTITION BY) will have rn > 1. To select only your duplicates, simply add the WHERE clause to the outer query:

    WHERE rn > 1

    FYI: If you're going to delete the duplicate rows, you can delete from a CTE. Instead of SELECTing from the CTE, change it to a DELETE from the CTE. This gives you the opportunity to see what you're going to delete before you delete it. Example:

    WITH cte AS (

    SELECT rn = ROW_NUMBER() OVER(PARTITION BY AccountNumber, HexaNumber, HexaEffDate, HexaId, TransactionEffDate, TransactionExpDate, HexaCode, CompanyCode, CategoryCode, AdjustmentUserId ORDER BY (SELECT NULL)), *

    FROM CodeChanges

    )

    SELECT *

    FROM cte

    WHERE rn > 1;

    In this code, if you change the "SELECT * FROM" to a "DELETE FROM" you will delete the duplicates.

    Please make sure you understand this before running it in production.

Viewing 5 posts - 1 through 4 (of 4 total)

You must be logged in to reply to this topic. Login to reply