Remove Some Duplicate Rows with the Oldest InsertDate

  • I have been trying to delete the duplicate rows from a table. I want to keep only the rows that have a single entry per person, or the first row entered of a multiple record row per person.

    Using the DataEntered column as the reference for the multiple Member rows, keep only the oldest row and all rows that have only a single Member entry

    The new data can be placed into a Table Variable or a Temp Table. Does not matter.

    IF OBJECT_ID('TempDB..#Member','U') IS NOT NULL

    DROP TABLE #Member

    --------------------------------------------------------

    CREATE TABLE #Member

    (

    DateEntered DATETIME

    , MemberControlID BIGINT

    , MemberExternalID BIGINT

    , MemberFullName VARCHAR(100)

    )

    --------------------------------------------------------

    INSERT INTO #Member

    (DateEntered, MemberControlID, MemberExternalID, MemberFullName)

    SELECT '2011-12-07 12:48:43.000',20111207130252,101368889, 'Alex R Bennett' UNION ALL -- Keep This Row

    SELECT '2011-12-07 10:30:20.000',20111207115324,101377900, 'Alan G Rivers' UNION ALL ------------------ DELETE THIS ROW --

    SELECT '2011-12-07 10:29:50.000',20111207115232,101379911, 'Alan G Rivers' UNION ALL ------------------ DELETE THIS ROW --

    SELECT '2011-12-07 10:17:00.000',20111207113463,101381234, 'Alan G Rivers' UNION ALL -- Keep This Row

    SELECT '2011-12-07 08:28:10.000',20111207101327,101381237, 'Gregory L Adkinson' UNION ALL -- Keep This Row

    SELECT '2011-12-07 08:28:45.000',20111207101392,713886750, 'Gregory L Adkinson' UNION ALL -------------- DELETE THIS ROW --

    SELECT '2011-12-07 12:46:37.000',20111207130010,713558747, 'Roger E Saunders' UNION ALL -- Keep This Row

    SELECT '2011-12-07 08:21:07.000',20111207100625,713878123, 'Alice A Rogers' UNION ALL -- Keep This Row

    SELECT '2011-12-07 10:19:31.000',20111207113756,713880946, 'Mike L Carry' UNION ALL -- Keep This Row

    SELECT '2011-12-07 11:19:56.000',20111207121532,713886667, 'Albert A Michaels' UNION ALL -- Keep This Row

    SELECT '2011-12-07 12:43:21.000',20111207129702,713887835, 'Alfred B Jones' UNION ALL -- Keep This Row

    SELECT '2011-12-07 12:18:53.000',20111207127338,713887858, 'Sam C White' -- Keep This Row

    --select * from #Member ORDER BY DateEntered, MemberControlID, MemberFullName ASC

    So that I would be left with only the following records that have a distinct MemberFullName, and a DateEntered as the oldest record

    SELECT '2011-12-07 12:48:43.000',20111207130252,101368889, 'Alex R Bennett' UNION ALL

    SELECT '2011-12-07 10:17:00.000',20111207113463,101381234, 'Alan G Rivers' UNION ALL

    SELECT '2011-12-07 08:28:10.000',20111207101327,101381237, 'Gregory L Adkinson' UNION ALL

    SELECT '2011-12-07 12:46:37.000',20111207130010,713558747, 'Roger E Saunders' UNION ALL

    SELECT '2011-12-07 08:21:07.000',20111207100625,713878123, 'Alice A Rogers' UNION ALL

    SELECT '2011-12-07 10:19:31.000',20111207113756,713880946, 'Mike L Carry' UNION ALL

    SELECT '2011-12-07 11:19:56.000',20111207121532,713886667, 'Albert A Michaels' UNION ALL

    SELECT '2011-12-07 12:43:21.000',20111207129702,713887835, 'Alfred B Jones' UNION ALL

    SELECT '2011-12-07 12:18:53.000',20111207127338,713887858, 'Sam C White'

    I am really close, but I cannot seem to get this correct. I greatly appreciate any and all help, comments, assistance with this.

    Thank You in advance for all your help

    Andrew SQLDBA

  • There are several methods you could follow. Is the MemberControlID unique? If it is, the follow SQL would pull out all the MemberControlID's that you want to keep:

    Select MemberControlID

    From #Member M

    where DateEntered = (select min(DateEntered) from #Member M2

    where MemberFullName = M.MemberFullName)

    You can use the above SQL with an IN clause to pull your records:

    Select * from #Member where MemberControlID

    IN

    (

    Select MemberControlID

    From #Member M

    where DateEntered = (select min(DateEntered) from #Member M2

    where MemberFullName = M.MemberFullName)

    )

    I think that is what you are looking for.

    --------------------------------------------------------------------------
    When you realize you've dug yourself into a hole....Step 1...stop digging.

  • this will also give result that you want

    Select * from #Member except

    Select a.* from #Member a inner join #Member b

    on a.MemberFullName=b.MemberFullName where a.DateEntered>b.DateEntered

    java[/url]

  • ChazMan's approach works, bit using the nested IN subqueries results in scanning the #Member table three separate times. It can be done in a single scan using an approach like this:

    SELECT b.*

    FROM (

    SELECT MemberFullName, MIN(DateEntered) AS DateEntered

    FROM #Member

    GROUP BY MemberFullName

    ) AS a INNER JOIN

    #Member AS b ON a.MemberFullName = b.MemberFullName AND

    a.DateEntered = b.DateEntered

    Run against the provided sample data, with SET STATISTICS IO ON, this results in 1 scan and 1 logical read. This compares to 3 scans and 25 logical reads for the nested subquery approach. Doesn't really matter on a data set this small, but if you're going to end up running this regularly against a much larger table, it could make a significant difference. 🙂

  • I misspoke a bit...the subquery approach is even less performant than I first wrote, as the execution plan shows that it actually scans the #Member table 25 times. It has to do a full scan for each record in the table. That won't scale well at all. Unfortunately, salum's approach has the same issue. Both result in triangular joins, which perform very poorly with non-trivial record counts.

  • Excellent point Jon.

    --------------------------------------------------------------------------
    When you realize you've dug yourself into a hole....Step 1...stop digging.

  • Thank You Everyone

    I was close, but just could not get it correct.

    Greatly appreciate all the help

    Andrew SQLDBA

  • AndrewSQLDBA (12/8/2011)


    Thank You Everyone

    I was close, but just could not get it correct.

    Greatly appreciate all the help

    Andrew SQLDBA

    You're welcome!

  • You can very easily use the row_number() function to do this.

    with cte as (

    select row_number() over (partition by MemberFullName order by DateEntered) as nr,

    *

    from #Member

    )

    delete from cte where nr > 1

    The query plan for this shows a single table scan plus a single sort, some more 0% actions and a table delete. If the table has an index on the MemberFullName the queryplan looks even better.



    Posting Data Etiquette - Jeff Moden[/url]
    Posting Performance Based Questions - Gail Shaw[/url]
    Hidden RBAR - Jeff Moden[/url]
    Cross Tabs and Pivots - Jeff Moden[/url]
    Catch-all queries - Gail Shaw[/url]


    If you don't have time to do it right, when will you have time to do it over?

Viewing 9 posts - 1 through 8 (of 8 total)

You must be logged in to reply to this topic. Login to reply