October 10, 2012 at 12:37 pm
I know this topic has been covered many times, but I cannot find the specific solution I am looking for.
We load data from an outside source, and sometimes duplicates get loaded and sometimes they don't (that's another story).
I have a basic query that will handle my duplicate delete issue without any logic added:
SELECT DISTINCT * INTO TEMP_TABLE
FROM SOURCE_TABLE
GO
TRUNCATE TABLE SOURCE_TABLE
GO
INSERT INTO SOURCE_TABLE
SELECT * FROM TEMP_TABLE
GO
DROP TABLE TEMP_TABLE
GO
I am planning to set up a job to run at off-peak times that will remove the duplicates (until I can fix the source data feed).
I want to build some logic that compares the total record count to the distinct record count. If the total is greater than the distinct count, remove the duplicates.
I attempted to build an IF statement, but it doesn't like me using multiple select statements. Here is what I tried:
IF (Select * From SourceTable) > (Select Distinct * From SourceTable)
BEGIN
SELECT DISTINCT * INTO TEMP_TABLE
FROM SOURCE_TABLE
TRUNCATE TABLE SOURCE_TABLE
INSERT INTO SOURCE_TABLE
SELECT * FROM TEMP_TABLE
DROP TABLE TEMP_TABLE
END
ERROR: Msg 116, Level 16, State 1, Line 3
Only one expression can be specified in the select list when the subquery is not introduced with EXISTS.
I've also seen some examples that set the database recovery model to bulk-logged during the duplicate removal operations and then set it back to the previous recovery model after the operation is complete. It makes sense to me, as the log file will be getting entries for all the inserting going on. Is this correct, and should I set it to bulk-logged? (A few of the tables have more than 2 million rows.)
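For reference, the recovery-model switch those examples use looked roughly like this. I haven't run this myself; MyDatabase, the FULL model, and the backup path are just placeholders for whatever your database actually uses, and whether the inserts end up minimally logged also depends on things like a TABLOCK hint and the state of the target table's indexes:
-- Switch to BULK_LOGGED for the duration of the heavy inserts
ALTER DATABASE MyDatabase SET RECOVERY BULK_LOGGED
GO
-- ... run the de-duplication work here ...
-- Switch back and take a log backup to keep the log chain healthy
ALTER DATABASE MyDatabase SET RECOVERY FULL
GO
BACKUP LOG MyDatabase TO DISK = N'C:\Backups\MyDatabase_log.trn'
GO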
If there is a better way to handle this, I am open to other ideas. Thanks!
October 10, 2012 at 1:02 pm
Not 100% sure what you are trying to do here. It sounds like you want to delete duplicates from a table? This seems like an awfully complicated way to go about it.
The error you are getting is because you are trying to compare entire result sets when what you want is the row counts.
IF (Select * From SourceTable) > (Select Distinct * From SourceTable)
That doesn't work. That is trying to compare the table to itself, which just doesn't make any sense. You should instead compare row counts (COUNT(DISTINCT *) isn't valid syntax, so the distinct count needs a derived table):
IF (SELECT COUNT(*) FROM SourceTable) > (SELECT COUNT(*) FROM (SELECT DISTINCT * FROM SourceTable) AS d)
Even though this will fix the error, it seems like you could remove the duplicates a lot more easily than copying millions of rows to another table and then copying them back. It seems a lot more efficient to just delete the duplicates in place.
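If you still want the job to skip the work when there is nothing to remove, the corrected count comparison can wrap whatever cleanup you end up using. This is just a sketch, assuming your table really is called SourceTable:
IF (SELECT COUNT(*) FROM SourceTable) >
   (SELECT COUNT(*) FROM (SELECT DISTINCT * FROM SourceTable) AS d)
BEGIN
    -- duplicates exist across the full row; run the cleanup of your choice here
    PRINT 'Duplicates found - running cleanup'
END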
_______________________________________________________________
Need help? Help us help you.
Read the article at http://www.sqlservercentral.com/articles/Best+Practices/61537/ for best practices on asking questions.
Need to split a string? Try Jeff Moden's splitter http://www.sqlservercentral.com/articles/Tally+Table/72993/.
Cross Tabs and Pivots, Part 1 – Converting Rows to Columns - http://www.sqlservercentral.com/articles/T-SQL/63681/
Cross Tabs and Pivots, Part 2 - Dynamic Cross Tabs - http://www.sqlservercentral.com/articles/Crosstab/65048/
Understanding and Using APPLY (Part 1) - http://www.sqlservercentral.com/articles/APPLY/69953/
Understanding and Using APPLY (Part 2) - http://www.sqlservercentral.com/articles/APPLY/69954/
October 10, 2012 at 1:18 pm
Sean Lange (10/10/2012)
Not 100% sure what you are trying to do here. It sounds like you want to delete duplicates from a table? This seems like an awfully complicated way to go about it. The error you are getting is because you are trying to compare entire result sets when what you want is the row counts.
IF (Select * From SourceTable) > (Select Distinct * From SourceTable)
That doesn't work. That is trying to compare the table to itself, which just doesn't make any sense. You should instead compare row counts (COUNT(DISTINCT *) isn't valid syntax, so the distinct count needs a derived table):
IF (SELECT COUNT(*) FROM SourceTable) > (SELECT COUNT(*) FROM (SELECT DISTINCT * FROM SourceTable) AS d)
Even though this will fix the error, it seems like you could remove the duplicates a lot more easily than copying millions of rows to another table and then copying them back. It seems a lot more efficient to just delete the duplicates in place.
Thanks Sean,
You are correct; all I want to do is delete the duplicate rows. Looks like I need to revisit this another way and look for a better fix using row counts or some other option.
October 10, 2012 at 1:45 pm
Here is an example of one way of deleting duplicates. There are plenty of other ways to do this.
create table #Dupes
(
IdentityValue int identity not null,
UserValue varchar(20),
AnotherValue varchar(10)
)
insert #Dupes (UserValue, AnotherValue)
select 'asdf', 'Another' union all
select 'asdf', 'Another' union all
select 'asdf', 'Another' union all
select 'qwer', 'Another' union all
select 'qwer', 'Another' union all
select 'unique', 'Another' union all
select 'another', 'Another' union all
select 'another', 'Another' union all
select 'asdf', 'Another' union all
select 'qwer', 'Another'
select * from #Dupes
;with cte as
(
select IdentityValue, UserValue, AnotherValue, ROW_NUMBER() over (partition by UserValue, AnotherValue order by IdentityValue) as RowNum
from #Dupes
)
delete cte where RowNum > 1
select * from #Dupes
drop table #Dupes
October 10, 2012 at 3:14 pm
Sean Lange (10/10/2012)
Here is an example of one way of deleting duplicates. There are plenty of other ways to do this.
create table #Dupes
(
IdentityValue int identity not null,
UserValue varchar(20),
AnotherValue varchar(10)
)
insert #Dupes (UserValue, AnotherValue)
select 'asdf', 'Another' union all
select 'asdf', 'Another' union all
select 'asdf', 'Another' union all
select 'qwer', 'Another' union all
select 'qwer', 'Another' union all
select 'unique', 'Another' union all
select 'another', 'Another' union all
select 'another', 'Another' union all
select 'asdf', 'Another' union all
select 'qwer', 'Another'
select * from #Dupes
;with cte as
(
select IdentityValue, UserValue, AnotherValue, ROW_NUMBER() over (partition by UserValue, AnotherValue order by IdentityValue) as RowNum
from #Dupes
)
delete cte where RowNum > 1
select * from #Dupes
drop table #Dupes
GOT IT! Thanks!
Here is what I ended up using:
TABLE:
CREATE TABLE [dbo].[ActivityGroup]
(
[ACTIVITYGROUP_DESC] [nvarchar](255) NULL,
[REQUEST_Id] [int] NULL,
[PullDate] [date] NULL
) ON [PRIMARY]
WITH DUPLICATES AS
(SELECT ROW_NUMBER() OVER (PARTITION BY ACTIVITYGROUP_DESC, REQUEST_Id, PullDate
ORDER BY ACTIVITYGROUP_DESC, REQUEST_Id, PullDate) AS Dups
FROM ActivityGroup)
Delete DUPLICATES where Dups > 1
This is a great way to get rid of exact duplicates! I'm sure it's less overhead than my previous approach 🙂 Many thanks!
October 10, 2012 at 3:16 pm
You are quite welcome. Thanks for letting me know that worked for you.
October 10, 2012 at 4:30 pm
Two more questions. I am new to using a CTE, so bear with me 🙂
There are tables with 20 or more columns. Can I use a wildcard like *? If not, I have no problem listing them out; just checking. (I tried replacing the column names with a *, but it didn't work.)
I wanted to check for duplicates across all columns, which I got sorted thanks to your help.
Now I have some duplicates where the ACTIVITYGROUP_DESC and REQUEST_Id are the same but the PullDate is different (due to my bad usage of the update and insert routine I wrote; it's now fixed). The table will have the same REQUEST_Id used multiple times, since it's tied to the ACTIVITYGROUP_DESC. Here is an example of the data:
ACTIVITYGROUP_DESC   REQUEST_ID   PullDate
Customer Requests    39           2012-10-05
Customer Requests    39           2012-09-27
Lab Services         39           2012-09-27
Customer Requests    40           2012-10-04
Lab Services         40           2012-10-02
Customer Requests    40           2012-10-02
Customer Requests    41           2012-09-17
Customer Requests    42           2012-10-08
Any recommendations on modifying the CTE to look at columns 1 and 2 and, when there is a duplicate, remove the row with the older PullDate in column 3? Does that make sense?
October 11, 2012 at 1:22 am
My understanding is that there's no way to use *; instead, generate a list of columns and use that.
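If listing 20+ columns by hand is the pain point, you can let the metadata views build the list for you. Something like this should work (untested, and it assumes the table is dbo.ActivityGroup):
-- Build a comma-separated, quoted column list from sys.columns
SELECT STUFF((
        SELECT ', ' + QUOTENAME(c.name)
        FROM sys.columns AS c
        WHERE c.object_id = OBJECT_ID('dbo.ActivityGroup')
        ORDER BY c.column_id
        FOR XML PATH(''), TYPE).value('.', 'nvarchar(max)'), 1, 2, '') AS ColumnList
Copy the result into the PARTITION BY (and the SELECT list if you need it) instead of typing every column.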
Modifying the CTE requires recognising how the ROW_NUMBER/PARTITION BY works.
Your code:
WITH DUPLICATES AS
(SELECT ROW_NUMBER() OVER (PARTITION BY ACTIVITYGROUP_DESC, REQUEST_Id, PullDate
ORDER BY ACTIVITYGROUP_DESC, REQUEST_Id, PullDate) AS Dups
FROM ActivityGroup)
The ROW_NUMBER() function provides an incrementing number starting at 1 for each row. PARTITION BY operates on the table as sort of an on-the-fly grouping operation such that the ROW_NUMBER() is applied to each group (i.e. restarts at 1 for each partitioned section). In your context the fields you partition on are the fields being compared to determine whether or not the record constitutes a duplicate. The ORDER BY then sorts the grouped records in some particular order. If the fields that you order on are the same as the ones you partition by then the resulting intra-group order of 'duplicates' will be effectively arbitrary.
Using your example of comparing ACTIVITYGROUP_DESC and REQUEST_ID and removing the older PullDate, something like this would work:
WITH DUPLICATES AS
(SELECT ROW_NUMBER() OVER (PARTITION BY ACTIVITYGROUP_DESC, REQUEST_ID --effectively group by these fields
ORDER BY PullDate DESC -- Sort by PullDate in descending order since you want the newest PullDate to be assigned value 1
) AS Dups
FROM ActivityGroup)
DELETE DUPLICATES WHERE Dups > 1
Any row with a Dups value greater than 1 is an older duplicate, which is exactly what the DELETE removes.
Does that help? (sometimes the explanations are clearer in my head before the attempt to articulate them).
Minor disclaimer: I haven't tested the code yet!
October 11, 2012 at 10:15 am
sam.dahl (10/11/2012)
My understanding is that there's no way to use *; instead, generate a list of columns and use that. Modifying the CTE requires recognising how the ROW_NUMBER/PARTITION BY works.
Your code:
WITH DUPLICATES AS
(SELECT ROW_NUMBER() OVER (PARTITION BY ACTIVITYGROUP_DESC, REQUEST_Id, PullDate
ORDER BY ACTIVITYGROUP_DESC, REQUEST_Id, PullDate) AS Dups
FROM ActivityGroup)
The ROW_NUMBER() function provides an incrementing number starting at 1 for each row. PARTITION BY operates on the table as sort of an on-the-fly grouping operation such that the ROW_NUMBER() is applied to each group (i.e. restarts at 1 for each partitioned section). In your context the fields you partition on are the fields being compared to determine whether or not the record constitutes a duplicate. The ORDER BY then sorts the grouped records in some particular order. If the fields that you order on are the same as the ones you partition by then the resulting intra-group order of 'duplicates' will be effectively arbitrary.
Using your example of comparing ACTIVITYGROUP_DESC and REQUEST_ID and removing the older PullDate something like this would work:
WITH DUPLICATES AS
(SELECT ROW_NUMBER() OVER (PARTITION BY ACTIVITYGROUP_DESC, REQUEST_ID --effectively group by these fields
ORDER BY PullDate DESC -- Sort by PullDate in descending order since you want the newest PullDate to be assigned value 1
) AS Dups
FROM ActivityGroup)
DELETE DUPLICATES WHERE Dups > 1
Any row with a Dups value greater than 1 is an older duplicate, which is exactly what the DELETE removes.
Does that help? (sometimes the explanations are clearer in my head before the attempt to articulate them).
Minor disclaimer: I haven't tested the code yet!
Sam,
This works perfectly (I had to fix a few syntax errors and typos, but that was no biggie).
The explanation and the comments in the code really brought it home for me. I was reading about all of this yesterday after testing the example Sean posted, and now it all makes sense (light bulb above my head now).
Gotta give thanks to Sean and Sam! This forum rocks! One day I will be able to help others and keep the cycle going 🙂
October 11, 2012 at 12:45 pm
A side note for any newbies reading this:
I was manually performing:
SELECT * FROM ActivityGroup -- gave me the total row count
SELECT DISTINCT * FROM ActivityGroup -- I thought this gave a true count of the distinct values, and in many cases it would be a good check. But I had previously made a mistake, and the third column (PullDate) was incorrectly assigned and duplicated across many rows.
Using the CTE examples provided above I was able to remove the true duplicates.
So, to check that the CTE was doing what it was supposed to, I needed to be more specific in my test query: I needed the distinct values from columns 1 and 2, excluding column 3.
Here is the simple query I used:
SELECT DISTINCT ACTIVITYGROUP_DESC, REQUEST_ID
FROM ACTIVITYGROUP
Now I am only looking at the first two columns, which gave me a better count of the actual distinct values.
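If anyone wants another quick sanity check, a GROUP BY with HAVING will list each ACTIVITYGROUP_DESC/REQUEST_ID pair that still has more than one row (just a sketch against the same table):
SELECT ACTIVITYGROUP_DESC, REQUEST_ID, COUNT(*) AS RowsPerPair
FROM ActivityGroup
GROUP BY ACTIVITYGROUP_DESC, REQUEST_ID
HAVING COUNT(*) > 1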
Hope this makes sense to anyone that reads this.
Thanks to the following post for help with DISTINCT.