Marking duplicate rows

  • I have this query for marking duplicate rows. It works on a small database but takes more than 2 days to run if the db is large. Is there a way of making this execute faster. Thank you

    UPDATE DSObject_table

    SET Object_isDeleted = 1

    where handle_id IN

    (

    select handle_id from DSObject_table as dso

    where exists (select Civil_Case__original_file_name, Civil_Case_Files_size,Count(handle_id)

    from DSObject_table

    where DSObject_table.Civil_Case__original_file_name = dso.Civil_Case__original_file_name

    andDSObject_table.Civil_Case_Files_size = dso.Civil_Case_Files_size

    group by DSObject_table.Civil_Case__original_file_name, DSObject_table.Civil_Case_Files_size

    having count(DSObject_table.handle_id) > 1))

    and DSObject_table.handle_id NOT IN

    (

    SELECT Min(handle_id) FROM DSObject_table AS dso

    where exists (select Civil_Case__original_file_name, Civil_Case_Files_size,Count(handle_id)

    from DSObject_table

    where DSObject_table.Civil_Case__original_file_name = dso.Civil_Case__original_file_name

    andDSObject_table.Civil_Case_Files_size = dso.Civil_Case_Files_size

    group by DSObject_table.Civil_Case__original_file_name, DSObject_table.Civil_Case_Files_size

    having count(DSObject_table.handle_id) > 1)

    group by Civil_Case__original_file_name, Civil_Case_Files_size);

  • I think if you can speed up the update by using the row_number function:

    -- display rownr's assigned to each 'block' of similar rows (only for display purpose)

    select row_number() over (partition by Civil_Case__original_file_name, Civil_Case_Files_size order by handle_id) as rownr, *

    from DSObject_table

    order by Civil_Case__original_file_name, rownr

    -- update the value using the rownr definition (see select query above)

    update DSObject_table

    set Object_isDeleted = 1

    from (select row_number() over (partition by Civil_Case__original_file_name, Civil_Case_Files_size order by handle_id) as rownr, *

    from DSObject_table) DSObject_table

    where rownr > 1

    ** Don't mistake the ‘stupidity of the crowd’ for the ‘wisdom of the group’! **
  • Thanks, to display it only took 4min 11sec to execute. What does over partition by do?

  • It basically gives each record a unique number

    see this http://technet.microsoft.com/en-us/library/ms186734.aspx

  • imran.adam (8/21/2013)


    It basically gives each record a unique number

    see this http://technet.microsoft.com/en-us/library/ms186734.aspx%5B/quote%5D

    That is the correct link but your description is incorrect.

    Using PARTITION in the context of ROW_NUMBER divides the rows into sets which means that the value returned will restart at 1. This does not mean that each row gets a unique number, it restarts the counter.

    _______________________________________________________________

    Need help? Help us help you.

    Read the article at http://www.sqlservercentral.com/articles/Best+Practices/61537/ for best practices on asking questions.

    Need to split a string? Try Jeff Modens splitter http://www.sqlservercentral.com/articles/Tally+Table/72993/.

    Cross Tabs and Pivots, Part 1 – Converting Rows to Columns - http://www.sqlservercentral.com/articles/T-SQL/63681/
    Cross Tabs and Pivots, Part 2 - Dynamic Cross Tabs - http://www.sqlservercentral.com/articles/Crosstab/65048/
    Understanding and Using APPLY (Part 1) - http://www.sqlservercentral.com/articles/APPLY/69953/
    Understanding and Using APPLY (Part 2) - http://www.sqlservercentral.com/articles/APPLY/69954/

  • Sean Lange (8/21/2013)


    imran.adam (8/21/2013)


    It basically gives each record a unique number

    see this http://technet.microsoft.com/en-us/library/ms186734.aspx%5B/quote%5D

    That is the correct link but your description is incorrect.

    Using PARTITION in the context of ROW_NUMBER divides the rows into sets which means that the value returned will restart at 1. This does not mean that each row gets a unique number, it restarts the counter.

    Yes that is what i meant, thanks for clarifying. Basically the data is split into subsets with a unique number e.g 1,2,3 then next will be 1,2 and then 1,2,3,4 and so on.....

Viewing 6 posts - 1 through 5 (of 5 total)

You must be logged in to reply to this topic. Login to reply