Removing Duplicates

  • Hi,

    I have the below query that i need to remove some duplicates, but not quite as simple as it sounds.

    What i would like to do is something like:

    If [Read_code7],[Read_code] and[Staff_number] are the same then only include one row. The issue i have is that the caseload_holder and [Event_done_at] fields might have different values in them so i cant use a simple distinct.

    Hopefully i have explained that well enough!

    SELECT

    [Patient_Count],

    [Read_code7]

    ,[Read_code]

    ,[Current_Caseload_Holder]

    ,[Staff_number]

    ,[Event_date]

    ,[Event_done_at]

    FROM table

  • Use ROW_NUMBER, have a look here http://www.sqlservercentral.com/articles/duplicate+data/102383/

    ____________________________________________________

    Deja View - The strange feeling that somewhere, sometime you've optimised this query before

    How to get the best help on a forum

    http://www.sqlservercentral.com/articles/Best+Practices/61537
  • An example that may work for you (as mark suggested in the link provided);WITH Dups AS (

    SELECT

    [Patient_Count]

    ,[Read_code7]

    ,[Read_code]

    ,[Current_Caseload_Holder]

    ,[Staff_number]

    ,[Event_date]

    ,[Event_done_at],

    ROW_NUMBER() OVER (PARTITION BY [Read_code7],[Read_code],[Staff_number] ORDER BY [Read_code7]) AS Cnt

    FROM MyTable

    ) SELECT * FROM Dups WHERE Cnt = 1

    ______________________________________________________________________________Never argue with an idiot; Theyll drag you down to their level and beat you with experience

  • Many Thanks to both for your suggestions. I have managed to make it work!

    My only question: is there anyway i can include which row it chooses? For example in some of the duplicated rows the[Current_Caseload_Holder] field is populated and then null. I would like to keep the populated field if possible?

  • Yes, you just need to change the ORDER BY clause

    ;WITH Dups AS (

    SELECT

    [Patient_Count]

    ,[Read_code7]

    ,[Read_code]

    ,[Current_Caseload_Holder]

    ,[Staff_number]

    ,[Event_date]

    ,[Event_done_at],

    ROW_NUMBER() OVER (PARTITION BY [Read_code7],[Read_code],[Staff_number] ORDER BY CASE WHEN [Current_Caseload_Holder] IS NOT NULL THEN 0 ELSE 1 END, [Read_code7]) AS Cnt

    FROM MyTable

    )

    DELETE FROM Dups WHERE Cnt > 1

    ____________________________________________________

    Deja View - The strange feeling that somewhere, sometime you've optimised this query before

    How to get the best help on a forum

    http://www.sqlservercentral.com/articles/Best+Practices/61537
  • Just add a condition Current_Caseload_Holder IS NOT NULL

    Or you can just add to your CTE and fetch records for the ones that are/are not null, using a UNION ALL

    ______________________________________________________________________________Never argue with an idiot; Theyll drag you down to their level and beat you with experience

  • MUCH SIMPLER 🙂

    Mark-101232 (10/23/2013)


    Yes, you just need to change the ORDER BY clause

    ;WITH Dups AS (

    SELECT

    [Patient_Count]

    ,[Read_code7]

    ,[Read_code]

    ,[Current_Caseload_Holder]

    ,[Staff_number]

    ,[Event_date]

    ,[Event_done_at],

    ROW_NUMBER() OVER (PARTITION BY [Read_code7],[Read_code],[Staff_number] ORDER BY CASE WHEN [Current_Caseload_Holder] IS NOT NULL THEN 0 ELSE 1 END, [Read_code7]) AS Cnt

    FROM MyTable

    )

    DELETE FROM Dups WHERE Cnt > 1

    ______________________________________________________________________________Never argue with an idiot; Theyll drag you down to their level and beat you with experience

  • Thanks guys.....worked like a treat!

Viewing 8 posts - 1 through 7 (of 7 total)

You must be logged in to reply to this topic. Login to reply