How to Efficiently Find and Remove Duplicates in a Large SQL Database?

  • I have a large SQL database with millions of records, and I've identified duplicate entries. What's the most efficient way to find and remove these duplicates without compromising database performance or data integrity? Are there any best practices or SQL queries that can help in this situation?

    This question addresses a common database maintenance task and invites SQL experts to share their knowledge on efficiently identifying and dealing with duplicate records in a large database.

  • Microsoft has a couple of good examples. I'd strongly suggest you use option 2.
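
    For reference, one widely used pattern for the delete itself (a sketch only, and not necessarily the "option 2" referenced above; the table and column names are illustrative since the OP posted no DDL) numbers the rows in each duplicate group and removes everything after the first copy:

    -- Sketch: assumes a dbo.Customers table where (FirstName, LastName, Email)
    -- defines a duplicate; the lowest CustomerID in each group survives.
    WITH Dupes AS
    (
        SELECT ROW_NUMBER() OVER
               (PARTITION BY FirstName, LastName, Email
                ORDER BY CustomerID) AS rn
        FROM dbo.Customers
    )
    DELETE FROM Dupes
    WHERE rn > 1;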

    But!!!

    The bigger question is, how did that happen in the first place? Cleaning it up is good. Preventing it from happening is even more important. Make sure you've got appropriate constraints in place. They not only help keep the data clean, but they can, in some circumstances, enhance performance too.
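
    As a minimal sketch of that prevention (again with illustrative names), a unique constraint on the columns that define a duplicate does the job:

    -- Once the data is clean, stop duplicates at the door.
    ALTER TABLE dbo.Customers
        ADD CONSTRAINT UQ_Customers_Name_Email
        UNIQUE (FirstName, LastName, Email);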

    "The credit belongs to the man who is actually in the arena, whose face is marred by dust and sweat and blood"
    - Theodore Roosevelt

    Author of:
    SQL Server Execution Plans
    SQL Server Query Performance Tuning

  • To add to what Grant said, I would also like to know why you think this is a "common database maintenance task". In my opinion, database maintenance tasks are things that need administrative access, not data manipulation. As a DBA, data manipulation is NOT something I should be doing. I don't own the data! The business owns the data. My job is to make sure the data is accessible, is backed up, that data access performance is acceptable, and that the data is not corrupt. If the end users decide to put duplicate data into the database, it's on them to figure out what exactly is duplicate data. I can work with them to clean up the duplicate data, but having a DBA fix bad data should be a "one-off" thing, not a "common task".

    I am not saying "don't fix the issue", but I would want to know why the system was designed to allow duplicates and whether the end users are expecting duplicates in the system. It MAY be that what you consider a duplicate is actually not a duplicate. For example, if a different user put the data in, OR the data was put in on a different date, it may NOT be a duplicate even if every other column shows it as one. A query like the sketch below can surface the candidates for that conversation.
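
    A minimal sketch of such a review query (table and columns are illustrative, not from the OP):

    -- Surface candidate duplicate groups for the data owners to judge;
    -- this only reads the data, it deletes nothing.
    SELECT FirstName, LastName, Email,
           COUNT(*)         AS copies,
           MIN(CreatedDate) AS first_seen,
           MAX(CreatedDate) AS last_seen
    FROM dbo.Customers
    GROUP BY FirstName, LastName, Email
    HAVING COUNT(*) > 1
    ORDER BY copies DESC;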

    As a DBA, I am VERY cautious about doing ANY data manipulation. Update the stored procedure? Sure, I'm on it. Add/remove an index? Yep, after I investigate and test for issues. Add constraints to prevent data duplication? I can do that. But asking me to do data manipulation and considering it a "common task" for a DBA sounds like a bad idea and not something I would do.

    The ONLY exception to me doing data manipulation is in ETL loads where I need to do some transforms on the data as it comes in. That and if I need to update some data in my admin tables, but the admin tables I created are for very specific uses and are limited in access to only the DBA group (dynamically created database backup and restore scripts for example).

    The above is all just my opinion on what you should do. 
    As with all advice you find on a random internet forum - you shouldn't blindly follow it. Always test on a test server to see if there are negative side effects before making changes to live!
    I recommend you NEVER run "random code" you found online on any system you care about UNLESS you understand and can verify the code OR you don't care if the code trashes your system.

  • Also, if the % of duplicates is higher than the % of singles, then it may be better to copy the singles into a new table and switch out the partitions.

  • DesNorton wrote:

    Also, if the % of duplicates is higher than the % of singles, then it may be better to copy the singles into a new table and switch out the partitions.

    Regarding percentages, I agree. But there is no mention of partitions in the OP's post.

    The absence of evidence is not evidence of absence
    - Martin Rees
    The absence of consumable DDL, sample data and desired results is, however, evidence of the absence of my response
    - Phil Parkin

  • Phil Parkin wrote:

    DesNorton wrote:

    Also, if the % of duplicates is higher than the % of singles, then it may be better to copy the singles into a new table and switch out the partitions.

    Regarding percentages, I agree. But there is no mention of partitions in the OP's post.

     

    You can do a full table swap using SWITCH.

    However, I made an error above: the switch can only be done into an empty table.

    So, you can swap the tables, then move the singles back.

  • Phil Parkin wrote:

    DesNorton wrote:

    Also, if the % of duplicates is higher than the % of singles, then it may be better to copy the singles into a new table and switch out the partitions.

    Regarding percentages, I agree. But there is no mention of partitions in the OP's post.

    Yes, but... a single non-partitioned table can still be the subject of a partition switch.

    Small example below:

    drop table if exists dbo.demo_partition_source;
    drop table if exists dbo.demo_partition_target;

    create table dbo.demo_partition_source
    (id int identity (1, 1) not null
    ,name varchar(200)
    , constraint PK_demo_partition_source primary key clustered
    (id
    )
    )
    ;

    create table dbo.demo_partition_target
    (id int identity (1, 1) not null
    ,name varchar(200)
    , constraint PK_demo_partition_target primary key clustered
    (id
    )
    )
    ;

    insert into dbo.demo_partition_source (name)
    values ('row 1')
    , ('row 2');

    -- before the switch: the source holds the rows, the target is empty
    select 'demo_partition_source' as tablename, *
    from dbo.demo_partition_source
    union
    select 'demo_partition_target' as tablename, *
    from dbo.demo_partition_target;

    -- SWITCH is a metadata-only move; the target must be empty
    -- and have an identical structure (including the clustered PK)
    alter table dbo.demo_partition_source switch to dbo.demo_partition_target;

    -- after the switch: the rows now live in the target
    select 'demo_partition_source' as tablename, *
    from dbo.demo_partition_source
    union
    select 'demo_partition_target' as tablename, *
    from dbo.demo_partition_target;

    drop table if exists dbo.demo_partition_source;
    drop table if exists dbo.demo_partition_target;
  • SET NOCOUNT ON;

    DECLARE @SAMPLE_SIZE BIGINT = 10000;

    DECLARE
    @MinInt INT = 10000
    , @MaxInt INT = 50000
    , @StartDate DATE = DATEADD(YY, DATEDIFF(YY, 0, GETDATE())-10, 0)
    , @EndDate DATE = GETDATE()
    , @MinAmount DECIMAL(18,2) = 100.00
    , @MaxAmount DECIMAL(18,2) = 1000.00;

    /***************************************************
    *** Create a base table with sample data
    ***************************************************/
    ;WITH H2 (N) AS ( SELECT 1 FROM (VALUES
    (1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1)
    ,(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1)
    ,(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1)
    ,(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1)
    ,(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1)
    ,(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1)
    ,(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1)
    ,(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1)
    ,(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1)
    ,(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1)
    ,(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1)
    ,(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1)
    ,(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1)
    ,(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1)
    ,(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1)
    ,(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1)
    ) AS V(N)) --16^2 or 256 rows
    , H4 (N) AS (SELECT 1 FROM H2 AS a CROSS JOIN H2 AS b) --16^4 or 65,536 rows
    , H8 (N) AS (SELECT 1 FROM H4 AS a CROSS JOIN H4 AS b) --16^8 or 4,294,967,296 rows
    , NUMS(N) AS (SELECT TOP(@SAMPLE_SIZE) ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) AS N FROM H8)
    SELECT
    [RowNum] = ISNULL(NM.N, 0)
    , [RandomInt] = ABS(CHECKSUM(NEWID())%(@MaxInt-@MinInt))+@MinInt
    , [RandomDatetime] = RAND(CHECKSUM(NEWID()))*DATEDIFF(DD,@StartDate,@EndDate)+CONVERT(DATETIME,@StartDate)
    , [RandomAmount] = CONVERT(DECIMAL(18,2),RAND(CHECKSUM(NEWID()))*(@MaxAmount-@MinAmount)+@MinAmount)
    , [RandomHex12] = RIGHT(NEWID(), 12)
    INTO dbo.OriginalTable
    FROM NUMS AS NM
    OPTION (RECOMPILE);

    ALTER TABLE dbo.OriginalTable ADD PRIMARY KEY CLUSTERED (RowNum);


    /***************************************************
    *** Create a parallel table with the same structure
    *** Also add any constraints and indexes
    ***************************************************/
    SELECT RowNum, RandomInt, RandomDatetime, RandomAmount, RandomHex12
    INTO dbo.ParallelTable
    FROM dbo.OriginalTable
    WHERE 1 = 2; -- copies the structure only, no rows

    ALTER TABLE dbo.ParallelTable ADD PRIMARY KEY CLUSTERED (RowNum);

    /***************************************************
    *** Swap the tables using a partition switch
    ***************************************************/
    ALTER TABLE dbo.OriginalTable /* PK__Original__4F4A6852AAE6A449 */
    SWITCH TO dbo.ParallelTable; /* PK__ParalleT__4F4A6852B2DC10B9 */

    /***************************************************
    *** Move the required records back to the original table
    ***************************************************/
    INSERT INTO dbo.OriginalTable (RowNum, RandomInt, RandomDatetime, RandomAmount, RandomHex12)
    SELECT RowNum, RandomInt, RandomDatetime, RandomAmount, RandomHex12
    FROM dbo.ParallelTable
    WHERE [RowNum] % 27 = 1;


    /***************************************************
    *** Check your data
    ***************************************************/
    SELECT COUNT(*) FROM dbo.OriginalTable;
    SELECT COUNT(*) FROM dbo.ParallelTable;

    /***************************************************
    *** Cleanup
    ***************************************************/
    DROP TABLE dbo.ParallelTable;
  • Frederico, Des, thank you both for the examples. Taught me something new… appreciated.


  • Piyushbhatt wrote:

    I have a large SQL database with millions of records, and I've identified duplicate entries. What's the most efficient way to find and remove these duplicates without compromising database performance or data integrity? Are there any best practices or SQL queries that can help in this situation?

    This question addresses a common database maintenance task and invites SQL experts to share their knowledge on efficiently identifying and dealing with duplicate records in a large database.

    Not really enough info here.  Does the table have any indexes and/or constraints on it?  Does it have an IDENTITY column on it?  What is the size of the table in Megabytes?  Are any FKs pointing at the table?

    And for this type of thing, it's actually not possible to avoid compromising the database performance a bit, especially if you have a MAXDOP of 0 as a default.

    As for the data integrity, you don't actually have any because your table has duplicates. 😉

    Also, what percentage of the rows need to be removed as duplicates?  How often are you doing log file backups?  What Recovery Model is the database in and can it be changed?
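
    Those log questions matter because one giant DELETE can balloon the log and hold long blocking locks. A hedged sketch of the usual batched alternative (the names are illustrative, as before):

    -- Delete duplicates in batches so each transaction stays small.
    DECLARE @BatchSize INT = 50000;

    WHILE 1 = 1
    BEGIN
        WITH Dupes AS
        (
            SELECT ROW_NUMBER() OVER
                   (PARTITION BY FirstName, LastName, Email
                    ORDER BY CustomerID) AS rn
            FROM dbo.Customers
        )
        DELETE TOP (@BatchSize) FROM Dupes
        WHERE rn > 1;

        IF @@ROWCOUNT = 0 BREAK; -- no duplicates left

        -- In FULL recovery, log backups between batches let the log reuse space.
    END;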

    And, finally, how much downtime is allowed when the table won't be able to be accessed?

    Actually, that's not the last thing... what is your plan to prevent this from happening again?

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Heh... pretty typical for things like this.  The OP asks a question and then never participates in their own thread.  While some great code has been posted, there are still a lot of unknowns that are fairly important, and the OP needs to come back with more details.

    --Jeff Moden



  • Jeff Moden wrote:

    Heh... pretty typical for things like this.  The OP asks a question and then never participates in their own thread.  While some great code has been posted, there are still a lot of unknowns that are fairly important, and the OP needs to come back with more details.

    I like the ones that ask a homework question or work related question then delete the question so nobody can find out. - lol

  • Jonathan AC Roberts wrote:

    Jeff Moden wrote:

    Heh... pretty typical for things like this.  The OP asks a question and then never participates in their own thread.  While some great code has been posted, there are still a lot of unknowns that are fairly important, and the OP needs to come back with more details.

    I like the ones that ask a homework question or work related question then delete the question so nobody can find out. - lol

    Yeah... those are great.  That's one of the reasons why I frequently quote the original post. 😉

    --Jeff Moden



  • > I like the ones that ask a homework question or work related question then delete the question so nobody can find out. - lol

    I am guessing the former, unless the OP is hard at work writing his "own" curriculum and wants to crib it from the best, based on this somewhat odd inclusion in the question:

    > This question addresses a common database maintenance task [..]

    Great thread, btw, thanks to everybody for the examples.

    >L<

     

  • As a bit of a sidebar, I've found that the cutover point between DELETEs and a copy to a new table is actually only about 21% (depending on the table, of course), meaning if more than 21% of the rows need to be DELETEd, I'll do the ol' Swap'n'Drop with a new table... especially if I can use Minimal Logging.
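
    A hedged sketch of that Swap'n'Drop (names are illustrative; under the SIMPLE or BULK_LOGGED recovery model the SELECT ... INTO is minimally logged):

    -- Copy only the surviving rows into a fresh table.
    SELECT CustomerID, FirstName, LastName, Email
    INTO dbo.Customers_Clean
    FROM (SELECT *,
                 ROW_NUMBER() OVER (PARTITION BY FirstName, LastName, Email
                                    ORDER BY CustomerID) AS rn
          FROM dbo.Customers) AS s
    WHERE s.rn = 1;

    -- Rebuild indexes, constraints, and permissions on the new table, then swap names.
    BEGIN TRANSACTION;
    EXEC sp_rename 'dbo.Customers', 'Customers_Old';
    EXEC sp_rename 'dbo.Customers_Clean', 'Customers';
    COMMIT;

    DROP TABLE dbo.Customers_Old;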

    --Jeff Moden



