Remove duplicate records based on record id, annoying GROUP BY statement

Question

Post reply

Remove duplicate records based on record id, annoying GROUP BY statement

petervdkerk

SSCrazy

Points: 2155
More actions
November 24, 2010 at 9:30 am

#227191

I want to remove duplicates from the DB. here's my code for that:
DECLARE @Count int
DECLARE @companyname nvarchar(50)
DECLARE @id int
DECLARE dublicate_cursor CURSOR FAST_FORWARD FOR
SELECT companyname, Count(*) - 1
FROM prospects
GROUP BY companyname
HAVING Count(*) > 1
OPEN dublicate_cursor
FETCH NEXT FROM dublicate_cursor INTO @companyname, @Count
WHILE @@FETCH_STATUS = 0
BEGIN
DELETE TOP(@Count) FROM prospects WHERE companyname = @companyname
FETCH NEXT FROM dublicate_cursor INTO @companyname, @Count
END
CLOSE dublicate_cursor
DEALLOCATE dublicate_cursor
As you can see, it now deletes records based on the companyname, but thats not what I want. I want to delete it based on record id.
So, Im trying to change a small part of the code:
this gives me the companynames with as a second column the amount it occurs in the DB
SELECT companyname, Count(*) - 1
FROM prospects
GROUP BY companyname
HAVING Count(*) > 1
I changed it to, the following, which gives me NO results (no error, just no results)
SELECT id,companyname, Count(companyname) - 1
FROM prospects
GROUP BY companyname,id
HAVING Count(companyname) > 1
why does it give NO results? Propably because id is unique, but how can I select it anyway?
Forum Etiquette: How to post Reporting Services problems

[/url]

Forum Etiquette: How to post data/code on a forum to get the best help - by Jeff Moden
[/url]

How to Post Performance Problems - by Gail Shaw
[/url]

spaghettidba SSC Guru Points: 105732 More actions · Answer 1

It doesn't make much sense to me.

Can you post table script and some sample data?

Take a look at the article linked in my signature line to see how you can post effectively.

-- Gianluca Sartori

spaghettidba SSC Guru Points: 105732 More actions · Answer 2

Look like this one. Duplicate post?

-- Gianluca Sartori

Jeff Moden SSC Guru Points: 1004434 More actions · Answer 3

CELKO (11/24/2010)
>> I want to remove duplicates from the DB. here's my code for that: <<
Your code is terrible and using a cursor is really bad SQL programming. What you have done is re-discover the way that we did data processing with magnetic tapes six decades ago.
Why did you use a reserved word for a local variable? Can you be more vague than "id"? What entity is it an identifier of? Remember, mag tapes count records and RDBMS uses set of rows. Rows are not anything like records.
Throw out this whole thing and use:
CREATE VIEW Duplicate_Prospects (company_name, company_occurrences)
AS
SELECT company_name, COUNT(*) AS company_occurrences
FROM Prospects
GROUP BY company_name;
The VIEW will always be current and correct as you add more data to the Prospects table. Unlike punch cards and magnetic tape files, a table can be virtual. Please read a book on basic RDBMS, so you can stop writing COBOL and assembly language in SQL and start using SQL for RDBMS.

Uh-huh... let's see the code that will delete the duplicates using that view. 😉

--Jeff Moden

RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

Change is inevitable... Change for the better is not.

Helpful Links:
How to post code problems
How to Post Performance Problems
Create a Tally Function (fnTally)

Paul_Blackwell Ten Centuries Points: 1085 More actions · Answer 4

SELECT id,companyname, Count(companyname) - 1

FROM prospects

GROUP BY companyname,id

HAVING Count(companyname) > 1

why does it give NO results? Propably because id is unique, but how can I select it anyway?

You have the companyname and id in the Group By, so you won't get a count of more than 1 on the company name.

Jeff Moden SSC Guru Points: 1004434 More actions · Answer 5

petervdkerk (11/24/2010)
I want to remove duplicates from the DB. here's my code for that:
DECLARE @Count int
DECLARE @companyname nvarchar(50)
DECLARE @id int
DECLARE dublicate_cursor CURSOR FAST_FORWARD FOR
SELECT companyname, Count(*) - 1
FROM prospects
GROUP BY companyname
HAVING Count(*) > 1
OPEN dublicate_cursor
FETCH NEXT FROM dublicate_cursor INTO @companyname, @Count
WHILE @@FETCH_STATUS = 0
BEGIN
DELETE TOP(@Count) FROM prospects WHERE companyname = @companyname
FETCH NEXT FROM dublicate_cursor INTO @companyname, @Count
END
CLOSE dublicate_cursor
DEALLOCATE dublicate_cursor
As you can see, it now deletes records based on the companyname, but thats not what I want. I want to delete it based on record id.
So, Im trying to change a small part of the code:
this gives me the companynames with as a second column the amount it occurs in the DB
SELECT companyname, Count(*) - 1
FROM prospects
GROUP BY companyname
HAVING Count(*) > 1
I changed it to, the following, which gives me NO results (no error, just no results)
SELECT id,companyname, Count(companyname) - 1
FROM prospects
GROUP BY companyname,id
HAVING Count(companyname) > 1
why does it give NO results? Propably because id is unique, but how can I select it anyway?

Um... you're deleting dupes based on a single column... companyname. My question would be, with the exception of the ID, do all the other columns have identical information in them, as well? And what kind of column is ID??? Is it an INT and does it carry the IDENTITY property? If not all the other columns have identical information in them, which row for a given companyname do you want to keep?

--Jeff Moden

RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

Change is inevitable... Change for the better is not.

Helpful Links:
How to post code problems
How to Post Performance Problems
Create a Tally Function (fnTally)

petervdkerk SSCrazy Points: 2155 More actions · Answer 6

Ok, I think we've concluded now (you guys mostly;) my code is horrible 🙂

Actually I just ripped it of some blog and used, since I didnt know cursors are that bad. Well...now I do know 🙂

To answer all questions:

- yes, im only removing duplicates on a SINGLE column value, because all other fields are the same (an earlier method has done that).

- Please find the scripts to create tables and sample data below. In the sample data not all columsn are the same, but thats ok, since those other columns may be ignored in my SQL to remove the duplicates.

- As you will see in the sample data and tables, some records occur only in "prospects" (called p from now), whilst others also have a reference in "prospectsincampaigns" (called pic from now)

What am I trying to achieve:

A duplicate record is found and removed based on exact match on value in column "companyname"

Scenario's

1 duplicate record occurs ONLY in p table and none of the duplicate records have a reference to it in pic table:

- all records, except for 1, are deleted from p table

2 duplicate record occurs in p table and SOME of those duplicate records HAVE a reference to it in pic table:

- all records that occur in p table are deleted EXCEPT the ones that have a reference in PIC table

3 duplicate records ALL have reference in pic table:

- no records are deleted

preferably the result of the quesries is the amount of records that were deleted, but if thats hard nevermind....

/****** Object: Table [dbo].[ProspectsInCampaigns] Script Date: 11/25/2010 18:36:26 ******/

SET ANSI_NULLS ON

GO

SET QUOTED_IDENTIFIER ON

GO

CREATE TABLE [dbo].[ProspectsInCampaigns](

[prospectid] [int] NOT NULL,

[campaignid] [int] NOT NULL

) ON [PRIMARY]

GO

/****** Object: Table [dbo].[prospects] Script Date: 11/25/2010 18:25:55 ******/

SET ANSI_NULLS ON

GO

SET QUOTED_IDENTIFIER ON

GO

CREATE TABLE [dbo].[prospects](

[id] [int] IDENTITY(1,1) NOT NULL,

[companyname] [nvarchar](255) NULL,

[companytype] [int] NOT NULL,

[companyaddress] [nvarchar](255) NULL,

[companyhousenr] [nvarchar](255) NULL,

[companyhousenradd] [nvarchar](255) NULL,

[companyzipcode] [nvarchar](255) NULL,

[companycity] [nvarchar](255) NULL,

CONSTRAINT [PK_intercity] PRIMARY KEY CLUSTERED

(

[id] ASC

)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]

) ON [PRIMARY]

GO

ALTER TABLE [dbo].[prospects] ADD CONSTRAINT [DF_prospects_companytype_1] DEFAULT ((0)) FOR [companytype]

GO

************************** INSERTING THE SAMPLE DATA **************************

declare @newid int

INSERT INTO prospects(companyname,companytype,companyaddress,companyzipcode) VALUES('MyCompany',2,'Mystreet','65434')

set @newid=SCOPE_IDENTITY()

INSERT INTO ProspectsInCampaigns(prospectid,campaignid) VALUES (@newid,1)

INSERT INTO prospects(companyname,companyaddress,companyhousenr,companyzipcode) VALUES('MyCompany','Mystreet','126','90210')

set @newid=SCOPE_IDENTITY()

INSERT INTO ProspectsInCampaigns(prospectid,campaignid) VALUES (@newid,1)

INSERT INTO prospects(companyname,companyhousenr,companyzipcode,companycity) VALUES('MyCompany','126','65474','New York')

set @newid=SCOPE_IDENTITY()

INSERT INTO ProspectsInCampaigns(prospectid,campaignid) VALUES (@newid,1)

--not inserted in a campaign, so should be deleted if duplicate

INSERT INTO prospects(companyname,companytype,companyaddress,companyhousenr,companyzipcode) VALUES('MyCompany',0,'','','')

INSERT INTO prospects(companyname,companytype,companyaddress,companyhousenr,companyzipcode) VALUES('MyCompany',5,'','','')

INSERT INTO prospects(companyname,companytype,companyaddress,companyhousenr,companyzipcode) VALUES('Other company',1,'Rosestreet','1','1002')

INSERT INTO prospects(companyname,companyaddress,companyhousenr,companyzipcode) VALUES('Other company','Rosestreet','1','100245')

INSERT INTO prospects(companyname,companyaddress,companyhousenr,companyzipcode) VALUES('yahoo.com','dlane 676','1','15444')

INSERT INTO prospects(companyname,companyaddress,companyhousenr,companyzipcode) VALUES('yahoo.com','dlane','1','15444')

Thanks so much for the good replies so far 🙂

LutzM SSC Guru Points: 107049 More actions · Answer 7

Here's my attempt. I'm confident there are easier solutions not requiring the multiple sort operations...

Edit: please explain the difference between this thread and the one Gianluca mentioned earlier. At the other thread you told the issue has been resolved followed by your reply over here. Sounds strange...

; WITH cte AS

(

SELECT

COUNT(*) OVER (PARTITION BY companyname,

CASE WHEN prospectid IS NULL THEN 0 ELSE 1 END) AS cnt_part, -- number of rows per companyname with and without entry in ProspectsInCampaigns

MIN(p.id) OVER (PARTITION BY companyname,

CASE WHEN prospectid IS NULL THEN 0 ELSE 1 END) min_part, -- lowest prospects.id per companyname with and without entry in ProspectsInCampaigns

COUNT(*) OVER (PARTITION BY companyname) cnt_total, -- total number of rows per companyname

p.id,

prospectid

FROM prospects p

LEFT OUTER JOIN ProspectsInCampaigns c ON p.id=c.campaignid

)

--DELETE prospects

SELECT p.*

FROM prospects p

INNER JOIN cte ON p.id=cte.id

WHERE

(prospectid IS NULL -- only rows with no reference to the ProspectsInCampaigns table

AND (cnt_part < cnt_total -- either there are still some rows left for that companyname with entries in ProspectsInCampaigns

OR (cnt_part = cnt_total AND p.id>min_part) -- or use all rows except the one with the earliest entry

)

Lutz
A pessimist is an optimist with experience.

How to get fast answers to your question[/url]
How to post performance related questions[/url]
Links for Tally Table [/url] , Cross Tabs [/url] and Dynamic Cross Tabs [/url], Delimited Split Function[/url]

mister.magoo SSC-Forever Points: 47174 More actions · Answer 8

I think it should only delete rows with id in 4, 5, 7, 9

It would help of course if we had expected results 😉

Here is my version (which is similar to Lutz' one but only recommends deleting 4,5,7 and 9)

;with dupes

as

(

select

id

,companyname

--== counter > 1 means there is at least one duplicate based on companyname

,COUNT(*) over(partition by companyname) AS counter

--== rownum = 1 is the record to keep when there are no "pic" records

,ROW_NUMBER() OVER(partition by companyname order by id) AS rownum

--== pic is the count of records in the ProspectsInCampaign table for the current "id"

,pic AS pic

--== totalpic = 0 means we can delete all but the first company id

,SUM(pic) over(partition by companyname) AS totalpic

--== minpic > 0 means all duplicate companys in prospects also have records in ProspectsInCampaigns, so don't DELETE any

,MIN(pic) over(partition by companyname) AS minpic

from prospects

outer apply (select COUNT(*) from ProspectsInCampaigns where prospectid = id) a(pic)

)

select *

from dupes

cross apply (

SELECT

COALESCE(

case

when totalpic =0 and rownum>1 then 'DELETE ME'

else null

end, --== no pic records = delete all but first (ordered by id)

case

when minpic>0 then NULL

when totalpic>0 and pic=0 then 'DELETE ME'

else null

end --== minpic>0 means don't delete any, some pic records means delete records that have no pic records (pic=0)

) AS deleteFlag ) x

where counter>1

-- uncomment the following line to select only the rows that should be deleted.

-- AND deleteFlag='DELETE ME'

IF you want to know what is being deleted, either store the results of this query in a table and use it as the basis for the DELETE or use an OUTPUT clause to insert the deleted rows into another table when you do the DELETE.

MM

select geometry::STGeomFromWKB(0x0106000000020000000103000000010000000B0000001000000000000840000000000000003DD8CCCCCCCCCC0840000000000000003DD8CCCCCCCCCC08408014AE47E17AFC3F040000000000104000CDCCCCCCCCEC3F9C999999999913408014AE47E17AFC3F9C99999999991340000000000000003D0000000000001440000000000000003D000000000000144000000000000000400400000000001040000000000000F03F100000000000084000000000000000401000000000000840000000000000003D0103000000010000000B000000000000000000143D000000000000003D009E99999999B93F000000000000003D009E99999999B93F8014AE47E17AFC3F400000000000F03F00CDCCCCCCCCEC3FA06666666666FE3F8014AE47E17AFC3FA06666666666FE3F000000000000003D1800000000000040000000000000003D18000000000000400000000000000040400000000000F03F000000000000F03F000000000000143D0000000000000040000000000000143D000000000000003D, 0);

sivaj2k Ten Centuries Points: 1188 More actions · Answer 9

Hi

Check the following code. This may be useful.

declare @test_tab table (slno int,name varchar(10))

insert into @test_tab values (1,'a')

insert into @test_tab values (1,'b')

insert into @test_tab values (1,'c')

insert into @test_tab values (2,'a')

insert into @test_tab values (2,'b')

insert into @test_tab values (2,'c')

insert into @test_tab values (3,'a')

insert into @test_tab values (4,'a')

insert into @test_tab values (5,'a')

--select * from @test_tab

declare @test_tab1 table (slno int,name varchar(10),r int)

insert into @test_tab1

select * from (

select slno,name,DENSE_RANK() over (partition by slno order by name) r

from @test_tab ) a where r = 1

select * from @test_tab1

Thanks

Siva Kumar J.

petervdkerk SSCrazy Points: 2155 More actions · Answer 10

Mister Magoo, that is indeed what the output is supposed to be (also my thanks to LutzM)!

I ran your query and it marks the correct records for deletion.

1 question:

you say I can use this output as a basis for deletion...(I dont want to copy the data to another table) but where exactly would I need to place the delete statement in that case?

Brandie Tarvin SSC Guru Points: 173081 More actions · Answer 11

It's always good to keep the original data of the table stashed in a different DB / different table just in case you have to restore data from that one table because you wrote your code wrong. I actually have a USER database where we keep data like this all the time. Easier to do than snapshots / backups because I don't lose all other production data when I restore (stuff that occurred after my fubar).

If the IDs are different, but all other columns are the same, there are easier ways to do this code than the CTE. Here's my preferred method.

SELECT *

INTO TemporaryDumpTable

FROM prospects;

--This keeps the data in case you make big mistake and need to reload the table

DELETE FROM prospects

FROM Prospects p1

INNER JOIN (Select MAX(ID) AS ID, CompanyName

FROM Prospects

GROUP BY CompanyName) p2

ON p1.CompanyName = p2.CompanyName

AND p1.ID = p2.ID;

--Deletes only the MAX id.

Now, if you have more than 1 duplicate, you might do a LEFT OUTER JOIN and delete where p1.ID <> p2.ID. That will work too. HOWEVER, this is code you should test in a dev environment before putting into production, as one misplaced symbol or JOIN statement will destroy your entire table's data.

Brandie Tarvin, MCITP Database AdministratorLiveJournal Blog: http://brandietarvin.livejournal.com/[/url]On LinkedIn!, Google+, and Twitter.Freelance Writer: ShadowrunLatchkeys: Nevermore, Latchkeys: The Bootleg War, and Latchkeys: Roscoes in the Night are now available on Nook and Kindle.

mister.magoo SSC-Forever Points: 47174 More actions · Answer 12

petervdkerk (11/27/2010)
Mister Magoo, that is indeed what the output is supposed to be (also my thanks to LutzM)!
I ran your query and it marks the correct records for deletion.
1 question:
you say I can use this output as a basis for deletion...(I dont want to copy the data to another table) but where exactly would I need to place the delete statement in that case?

Hi Peter, just like this:

;with dupes

as

(

select

id

,companyname

--== counter > 1 means there is at least one duplicate based on companyname

,COUNT(*) over(partition by companyname) AS counter

--== rownum = 1 is the record to keep when there are no "pic" records

,ROW_NUMBER() OVER(partition by companyname order by id) AS rownum

--== pic is the count of records in the ProspectsInCampaign table for the current "id"

,pic AS pic

--== totalpic = 0 means we can delete all but the first company id

,SUM(pic) over(partition by companyname) AS totalpic

--== minpic > 0 means all duplicate companys in prospects also have records in ProspectsInCampaigns, so don't DELETE any

,MIN(pic) over(partition by companyname) AS minpic

from prospects

outer apply (select COUNT(*) from ProspectsInCampaigns where prospectid = id) a(pic)

)

DELETE p

output deleted.*

from dupes

cross apply (

SELECT

COALESCE(

case

when totalpic =0 and rownum>1 then 'DELETE ME'

else null

end, --== no pic records = delete all but first (ordered by id)

case

when minpic>0 then NULL

when totalpic>0 and pic=0 then 'DELETE ME'

else null

end --== minpic>0 means don't delete any, some pic records means delete records that have no pic records (pic=0)

) AS deleteFlag ) x

join prospects p

on p.id = dupes.id

where counter>1

AND deleteFlag='DELETE ME'

Note that I included an OUTPUT clause to show you the deleted records - if you want you can insert that result set into another table as long as the table already exists.

MM

select geometry::STGeomFromWKB(0x0106000000020000000103000000010000000B0000001000000000000840000000000000003DD8CCCCCCCCCC0840000000000000003DD8CCCCCCCCCC08408014AE47E17AFC3F040000000000104000CDCCCCCCCCEC3F9C999999999913408014AE47E17AFC3F9C99999999991340000000000000003D0000000000001440000000000000003D000000000000144000000000000000400400000000001040000000000000F03F100000000000084000000000000000401000000000000840000000000000003D0103000000010000000B000000000000000000143D000000000000003D009E99999999B93F000000000000003D009E99999999B93F8014AE47E17AFC3F400000000000F03F00CDCCCCCCCCEC3FA06666666666FE3F8014AE47E17AFC3FA06666666666FE3F000000000000003D1800000000000040000000000000003D18000000000000400000000000000040400000000000F03F000000000000F03F000000000000143D0000000000000040000000000000143D000000000000003D, 0);

GSquared SSC Guru Points: 260824 More actions · Answer 13

There's a simpler way to dedupe since SQL 2005.

;with DupeCheck as

(select row_number() over (partition by company_name order by ID) as Row

from dbo.prospects)

delete from DupeCheck

where Row > 1;

If you want to store the rows that are deleted, for auditing or recovery, use an Output statement in the delete clause.

I recommend having an Output statement, and running the whole thing in a transaction that you roll back, then verifying what was deleted, then change the rollback to commit and run it again.

begin transaction;

create table #T (ID int);

;with DupeCheck as

(select row_number() over (partition by company_name order by ID) as Row

from dbo.prospects)

delete from DupeCheck

output deleted.ID into #T (ID)

where Row > 1;

select top 10 * from #T;

rollback;

Check the data you get from the Output clause, make sure you really want to get rid of those rows, and then you're good to go once you've confirmed that.

- Gus "GSquared", RSVP, OODA, MAP, NMVP, FAQ, SAT, SQL, DNA, RNA, UOI, IOU, AM, PM, AD, BC, BCE, USA, UN, CF, ROFL, LOL, ETC
Property of The Thread

"Nobody knows the age of the human race, but everyone agrees it's old enough to know better." - Anon

mister.magoo SSC-Forever Points: 47174 More actions · Answer 14

@Brandie and GSquared,

There were additional requirements in the OP's question - and while it is helpful to have your excellent input, you are not addressing the OP's problem.

MM

select geometry::STGeomFromWKB(0x0106000000020000000103000000010000000B0000001000000000000840000000000000003DD8CCCCCCCCCC0840000000000000003DD8CCCCCCCCCC08408014AE47E17AFC3F040000000000104000CDCCCCCCCCEC3F9C999999999913408014AE47E17AFC3F9C99999999991340000000000000003D0000000000001440000000000000003D000000000000144000000000000000400400000000001040000000000000F03F100000000000084000000000000000401000000000000840000000000000003D0103000000010000000B000000000000000000143D000000000000003D009E99999999B93F000000000000003D009E99999999B93F8014AE47E17AFC3F400000000000F03F00CDCCCCCCCCEC3FA06666666666FE3F8014AE47E17AFC3FA06666666666FE3F000000000000003D1800000000000040000000000000003D18000000000000400000000000000040400000000000F03F000000000000F03F000000000000143D0000000000000040000000000000143D000000000000003D, 0);