How to Efficiently Find and Remove Duplicates in a Large SQL Database?

  • I have a large SQL database with millions of records, and I've identified duplicate entries. What's the most efficient way to find and remove these duplicates without compromising database performance or data integrity? Are there any best practices or SQL queries that can help in this situation?

    This question addresses a common database maintenance task and invites SQL experts to share their knowledge on efficiently identifying and dealing with duplicate records in a large database.

  • Microsoft has a couple of good examples. I'd strongly suggest you use option 2.
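
    For reference, one widely used pattern for the delete itself (a sketch only, and not necessarily the "option 2" referenced above; the table and column names are illustrative since the OP posted no DDL) numbers the rows in each duplicate group and removes everything after the first copy:

    -- Sketch: assumes a dbo.Customers table where (FirstName, LastName, Email)
    -- defines a duplicate; the lowest CustomerID in each group survives.
    WITH Dupes AS
    (
        SELECT ROW_NUMBER() OVER
               (PARTITION BY FirstName, LastName, Email
                ORDER BY CustomerID) AS rn
        FROM dbo.Customers
    )
    DELETE FROM Dupes
    WHERE rn > 1;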

    But!!!

    The bigger question is, how did that happen in the first place? Cleaning it up is good. Preventing it from happening is even more important. Make sure you've got appropriate constraints in place. They not only help keep the data clean, but they can, in some circumstances, enhance performance too.
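
    As a minimal sketch of that prevention (again with illustrative names), a unique constraint on the columns that define a duplicate does the job:

    -- Once the data is clean, stop duplicates at the door.
    ALTER TABLE dbo.Customers
        ADD CONSTRAINT UQ_Customers_Name_Email
        UNIQUE (FirstName, LastName, Email);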

    "The credit belongs to the man who is actually in the arena, whose face is marred by dust and sweat and blood"
    - Theodore Roosevelt

    Author of:
    SQL Server Execution Plans
    SQL Server Query Performance Tuning

  • To add to what Grant said, I would also like to know why you think this is a "common database maintenance task". In my opinion, database maintenance tasks are things that need administrative access, not data manipulation. As a DBA, data manipulation is NOT something I should be doing. I don't own the data! The business owns the data. My job is to make sure the data is accessible, is backed up, that data access performance is acceptable, and that the data is not corrupt. If the end users decide to put duplicate data into the database, it's on them to figure out what exactly is duplicate data. I can work with them to clean up the duplicate data, but having a DBA fix bad data should be a "one-off" thing, not a "common task".

    I am not saying "don't fix the issue", but I would want to know why the system was designed to allow duplicates and whether the end users are expecting duplicates in the system. It MAY be that what you consider a duplicate is actually not a duplicate. For example, if a different user put the data in, OR the data was put in on a different date, it may NOT be a duplicate even if every other column shows it as one. A query like the sketch below can surface the candidates for that conversation.
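
    A minimal sketch of such a review query (table and columns are illustrative, not from the OP):

    -- Surface candidate duplicate groups for the data owners to judge;
    -- this only reads the data, it deletes nothing.
    SELECT FirstName, LastName, Email,
           COUNT(*)         AS copies,
           MIN(CreatedDate) AS first_seen,
           MAX(CreatedDate) AS last_seen
    FROM dbo.Customers
    GROUP BY FirstName, LastName, Email
    HAVING COUNT(*) > 1
    ORDER BY copies DESC;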

    As a DBA, I am VERY cautious about doing ANY data manipulation. Update the stored procedure? Sure, I'm on it. Add/remove an index? Yep, after I investigate and test for issues. Add constraints to prevent data duplication? I can do that. But asking me to do data manipulation and considering it a "common task" for a DBA sounds like a bad idea and not something I would do.

    The ONLY exception to me doing data manipulation is in ETL loads where I need to do some transforms on the data as it comes in. That and if I need to update some data in my admin tables, but the admin tables I created are for very specific uses and are limited in access to only the DBA group (dynamically created database backup and restore scripts for example).

    The above is all just my opinion on what you should do. 
    As with all advice you find on a random internet forum - you shouldn't blindly follow it. Always test on a test server to see if there are negative side effects before making changes to live!
    I recommend you NEVER run "random code" you found online on any system you care about UNLESS you understand and can verify the code OR you don't care if the code trashes your system.

  • Also, if the % of duplicates is higher than the % of singles, then it may be better to copy the singles into a new table and switch out the partitions.

  • DesNorton wrote:

    Also, if the % of duplicates is higher than the % of singles, then it may be better to copy the singles into a new table and switch out the partitions.

    Regarding percentages, I agree. But there is no mention of partitions in the OP's post.

    The absence of evidence is not evidence of absence
    - Martin Rees
    The absence of consumable DDL, sample data and desired results is, however, evidence of the absence of my response
    - Phil Parkin

  • Phil Parkin wrote:

    DesNorton wrote:

    Also, if the % of duplicates is higher than the % of singles, then it may be better to copy the singles into a new table and switch out the partitions.

    Regarding percentages, I agree. But there is no mention of partitions in the OP's post.

     

    You can do a full table swap using SWITCH.

    However, I made an error above: the switch can only be done into an empty table.

    So, you can swap the tables, then move the singles back.

  • Phil Parkin wrote:

    DesNorton wrote:

    Also, if the % of duplicates is higher than the % of singles, then it may be better to copy the singles into a new table and switch out the partitions.

    Regarding percentages, I agree. But there is no mention of partitions in the OP's post.

    Yes, but... a single non-partitioned table can still be the subject of a partition switch.

    Small example below:

    drop table if exists dbo.demo_partition_source;
    drop table if exists dbo.demo_partition_target;

    create table dbo.demo_partition_source
    (id int identity (1, 1) not null
    ,name varchar(200)
    , constraint PK_demo_partition_source primary key clustered
    (id
    )
    )
    ;

    create table dbo.demo_partition_target
    (id int identity (1, 1) not null
    ,name varchar(200)
    , constraint PK_demo_partition_target primary key clustered
    (id
    )
    )
    ;

    insert into dbo.demo_partition_source (name)
    values ('row 1')
    , ('row 2');

    -- before the switch: the source holds the rows, the target is empty
    select 'demo_partition_source' as tablename, *
    from dbo.demo_partition_source
    union
    select 'demo_partition_target' as tablename, *
    from dbo.demo_partition_target;

    -- SWITCH is a metadata-only move; the target must be empty
    -- and have an identical structure (including the clustered PK)
    alter table dbo.demo_partition_source switch to dbo.demo_partition_target;

    -- after the switch: the rows now live in the target
    select 'demo_partition_source' as tablename, *
    from dbo.demo_partition_source
    union
    select 'demo_partition_target' as tablename, *
    from dbo.demo_partition_target;

    drop table if exists dbo.demo_partition_source;
    drop table if exists dbo.demo_partition_target;
  • SET NOCOUNT ON;

    DECLARE @SAMPLE_SIZE BIGINT = 10000;

    DECLARE
    @MinInt INT = 10000
    , @MaxInt INT = 50000
    , @StartDate DATE = DATEADD(YY, DATEDIFF(YY, 0, GETDATE())-10, 0)
    , @EndDate DATE = GETDATE()
    , @MinAmount DECIMAL(18,2) = 100.00
    , @MaxAmount DECIMAL(18,2) = 1000.00;

    /***************************************************
    *** Create a base table with sample data
    ***************************************************/
    ;WITH H2 (N) AS ( SELECT 1 FROM (VALUES
    (1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1)
    ,(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1)
    ,(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1)
    ,(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1)
    ,(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1)
    ,(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1)
    ,(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1)
    ,(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1)
    ,(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1)
    ,(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1)
    ,(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1)
    ,(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1)
    ,(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1)
    ,(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1)
    ,(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1)
    ,(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1)
    ) AS V(N)) --16^2 or 256 rows
    , H4 (N) AS (SELECT 1 FROM H2 AS a CROSS JOIN H2 AS b) --16^4 or 65,536 rows
    , H8 (N) AS (SELECT 1 FROM H4 AS a CROSS JOIN H4 AS b) --16^8 or 4,294,967,296 rows
    , NUMS(N) AS (SELECT TOP(@SAMPLE_SIZE) ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) AS N FROM H8)
    SELECT
    [RowNum] = ISNULL(NM.N, 0)
    , [RandomInt] = ABS(CHECKSUM(NEWID())%(@MaxInt-@MinInt))+@MinInt
    , [RandomDatetime] = RAND(CHECKSUM(NEWID()))*DATEDIFF(DD,@StartDate,@EndDate)+CONVERT(DATETIME,@StartDate)
    , [RandomAmount] = CONVERT(DECIMAL(18,2),RAND(CHECKSUM(NEWID()))*(@MaxAmount-@MinAmount)+@MinAmount)
    , [RandomHex12] = RIGHT(NEWID(), 12)
    INTO dbo.OriginalTable
    FROM NUMS AS NM
    OPTION (RECOMPILE);

    ALTER TABLE dbo.OriginalTable ADD PRIMARY KEY CLUSTERED (RowNum);


    /***************************************************
    *** Create a parallel table with the same structure
    *** Also add any constraints and indexes
    ***************************************************/
    SELECT RowNum, RandomInt, RandomDatetime, RandomAmount, RandomHex12
    INTO dbo.ParallelTable
    FROM dbo.OriginalTable
    WHERE 1 = 2; -- copies the structure only, no rows

    ALTER TABLE dbo.ParallelTable ADD PRIMARY KEY CLUSTERED (RowNum);

    /***************************************************
    *** Swap the tables using a partition switch
    ***************************************************/
    ALTER TABLE dbo.OriginalTable /* PK__Original__4F4A6852AAE6A449 */
    SWITCH TO dbo.ParallelTable; /* PK__ParalleT__4F4A6852B2DC10B9 */

    /***************************************************
    *** Move the required records back to the original table
    ***************************************************/
    INSERT INTO dbo.OriginalTable (RowNum, RandomInt, RandomDatetime, RandomAmount, RandomHex12)
    SELECT RowNum, RandomInt, RandomDatetime, RandomAmount, RandomHex12
    FROM dbo.ParallelTable
    WHERE [RowNum] % 27 = 1;


    /***************************************************
    *** Check your data
    ***************************************************/
    SELECT COUNT(*) FROM dbo.OriginalTable;
    SELECT COUNT(*) FROM dbo.ParallelTable;

    /***************************************************
    *** Cleanup
    ***************************************************/
    DROP TABLE dbo.ParallelTable;
  • Frederico, Des, thank you both for the examples. Taught me something new… appreciated.


  • Piyushbhatt wrote:

    I have a large SQL database with millions of records, and I've identified duplicate entries. What's the most efficient way to find and remove these duplicates without compromising database performance or data integrity? Are there any best practices or SQL queries that can help in this situation?

    This question addresses a common database maintenance task and invites SQL experts to share their knowledge on efficiently identifying and dealing with duplicate records in a large database.

    Not really enough info here.  Does the table have any indexes and/or constraints on it?  Does it have an IDENTITY column on it?  What is the size of the table in Megabytes?  Are any FKs pointing at the table?

    And for this type of thing, it's actually not possible to avoid compromising the database performance a bit, especially if you have a MAXDOP of 0 as a default.

    As for the data integrity, you don't actually have any because your table has duplicates. 😉

    Also, what percentage of the rows need to be removed as duplicates?  How often are you doing log file backups?  What Recovery Model is the database in and can it be changed?
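
    Those log questions matter because one giant DELETE can balloon the log and hold long blocking locks. A hedged sketch of the usual batched alternative (the names are illustrative, as before):

    -- Delete duplicates in batches so each transaction stays small.
    DECLARE @BatchSize INT = 50000;

    WHILE 1 = 1
    BEGIN
        WITH Dupes AS
        (
            SELECT ROW_NUMBER() OVER
                   (PARTITION BY FirstName, LastName, Email
                    ORDER BY CustomerID) AS rn
            FROM dbo.Customers
        )
        DELETE TOP (@BatchSize) FROM Dupes
        WHERE rn > 1;

        IF @@ROWCOUNT = 0 BREAK; -- no duplicates left

        -- In FULL recovery, log backups between batches let the log reuse space.
    END;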

    And, finally, how much downtime is allowed when the table won't be able to be accessed?

    Actually, that's not the last thing... what is your plan to prevent this from happening again?

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Heh... pretty typical for things like this.  The OP asks a question and then never participates in their own thread.  While some great code has been posted, there are still a lot of unknowns that are fairly important, and the OP needs to come back with more details.

    --Jeff Moden



  • Jeff Moden wrote:

    Heh... pretty typical for things like this.  The OP asks a question and then never participates in their own thread.  While some great code has been posted, there are still a lot of unknowns that are fairly important, and the OP needs to come back with more details.

    I like the ones that ask a homework question or work related question then delete the question so nobody can find out. - lol

  • Jonathan AC Roberts wrote:

    Jeff Moden wrote:

    Heh... pretty typical for things like this.  The OP asks a question and then never participates in their own thread.  While some great code has been posted, there are still a lot of unknowns that are fairly important, and the OP needs to come back with more details.

    I like the ones that ask a homework question or work related question then delete the question so nobody can find out. - lol

    Yeah... those are great.  That's one of the reasons why I frequently quote the original post. 😉

    --Jeff Moden



  • > I like the ones that ask a homework question or work related question then delete the question so nobody can find out. - lol

    I am guessing the former, unless the OP is hard at work writing his "own" curriculum and wants to crib it from the best, based on this somewhat odd inclusion in the question:

    > This question addresses a common database maintenance task [..]

    Great thread, btw, thanks to everybody for the examples.

    >L<

     

  • As a bit of a sidebar, I've found that the cutover point between DELETEs and a copy to a new table is actually only about 21% (depending on the table, of course), meaning if more than 21% of the rows need to be DELETEd, I'll do the ol' Swap'n'Drop with a new table... especially if I can use Minimal Logging.
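
    A hedged sketch of that Swap'n'Drop (names are illustrative; under the SIMPLE or BULK_LOGGED recovery model the SELECT ... INTO is minimally logged):

    -- Copy only the surviving rows into a fresh table.
    SELECT CustomerID, FirstName, LastName, Email
    INTO dbo.Customers_Clean
    FROM (SELECT *,
                 ROW_NUMBER() OVER (PARTITION BY FirstName, LastName, Email
                                    ORDER BY CustomerID) AS rn
          FROM dbo.Customers) AS s
    WHERE s.rn = 1;

    -- Rebuild indexes, constraints, and permissions on the new table, then swap names.
    BEGIN TRANSACTION;
    EXEC sp_rename 'dbo.Customers', 'Customers_Old';
    EXEC sp_rename 'dbo.Customers_Clean', 'Customers';
    COMMIT;

    DROP TABLE dbo.Customers_Old;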

    --Jeff Moden



