November 24, 2010 at 9:30 am
I want to remove duplicates from the DB. here's my code for that:
DECLARE @Count int
DECLARE @companyname nvarchar(50)
DECLARE @id int
DECLARE dublicate_cursor CURSOR FAST_FORWARD FOR
SELECT companyname, Count(*) - 1
FROM prospects
GROUP BY companyname
HAVING Count(*) > 1
OPEN dublicate_cursor
FETCH NEXT FROM dublicate_cursor INTO @companyname, @Count
WHILE @@FETCH_STATUS = 0
BEGIN
DELETE TOP(@Count) FROM prospects WHERE companyname = @companyname
FETCH NEXT FROM dublicate_cursor INTO @companyname, @Count
END
CLOSE dublicate_cursor
DEALLOCATE dublicate_cursor
As you can see, it now deletes records based on the companyname, but thats not what I want. I want to delete it based on record id.
So, Im trying to change a small part of the code:
this gives me the companynames with as a second column the amount it occurs in the DB
SELECT companyname, Count(*) - 1
FROM prospects
GROUP BY companyname
HAVING Count(*) > 1
I changed it to, the following, which gives me NO results (no error, just no results)
SELECT id,companyname, Count(companyname) - 1
FROM prospects
GROUP BY companyname,id
HAVING Count(companyname) > 1
why does it give NO results? Propably because id is unique, but how can I select it anyway?
November 24, 2010 at 9:48 am
It doesn't make much sense to me.
Can you post table script and some sample data?
Take a look at the article linked in my signature line to see how you can post effectively.
-- Gianluca Sartori
November 24, 2010 at 3:11 pm
CELKO (11/24/2010)
>> I want to remove duplicates from the DB. here's my code for that: <<Your code is terrible and using a cursor is really bad SQL programming. What you have done is re-discover the way that we did data processing with magnetic tapes six decades ago.
Why did you use a reserved word for a local variable? Can you be more vague than "id"? What entity is it an identifier of? Remember, mag tapes count records and RDBMS uses set of rows. Rows are not anything like records.
Throw out this whole thing and use:
CREATE VIEW Duplicate_Prospects (company_name, company_occurrences)
AS
SELECT company_name, COUNT(*) AS company_occurrences
FROM Prospects
GROUP BY company_name;
The VIEW will always be current and correct as you add more data to the Prospects table. Unlike punch cards and magnetic tape files, a table can be virtual. Please read a book on basic RDBMS, so you can stop writing COBOL and assembly language in SQL and start using SQL for RDBMS.
Uh-huh... let's see the code that will delete the duplicates using that view. 😉
--Jeff Moden
Change is inevitable... Change for the better is not.
November 24, 2010 at 4:21 pm
SELECT id,companyname, Count(companyname) - 1
FROM prospects
GROUP BY companyname,id
HAVING Count(companyname) > 1
why does it give NO results? Propably because id is unique, but how can I select it anyway?
You have the companyname and id in the Group By, so you won't get a count of more than 1 on the company name.
November 24, 2010 at 7:46 pm
petervdkerk (11/24/2010)
I want to remove duplicates from the DB. here's my code for that:DECLARE @Count int
DECLARE @companyname nvarchar(50)
DECLARE @id int
DECLARE dublicate_cursor CURSOR FAST_FORWARD FOR
SELECT companyname, Count(*) - 1
FROM prospects
GROUP BY companyname
HAVING Count(*) > 1
OPEN dublicate_cursor
FETCH NEXT FROM dublicate_cursor INTO @companyname, @Count
WHILE @@FETCH_STATUS = 0
BEGIN
DELETE TOP(@Count) FROM prospects WHERE companyname = @companyname
FETCH NEXT FROM dublicate_cursor INTO @companyname, @Count
END
CLOSE dublicate_cursor
DEALLOCATE dublicate_cursor
As you can see, it now deletes records based on the companyname, but thats not what I want. I want to delete it based on record id.
So, Im trying to change a small part of the code:
this gives me the companynames with as a second column the amount it occurs in the DB
SELECT companyname, Count(*) - 1
FROM prospects
GROUP BY companyname
HAVING Count(*) > 1
I changed it to, the following, which gives me NO results (no error, just no results)
SELECT id,companyname, Count(companyname) - 1
FROM prospects
GROUP BY companyname,id
HAVING Count(companyname) > 1
why does it give NO results? Propably because id is unique, but how can I select it anyway?
Um... you're deleting dupes based on a single column... companyname. My question would be, with the exception of the ID, do all the other columns have identical information in them, as well? And what kind of column is ID??? Is it an INT and does it carry the IDENTITY property? If not all the other columns have identical information in them, which row for a given companyname do you want to keep?
--Jeff Moden
Change is inevitable... Change for the better is not.
November 25, 2010 at 10:46 am
Ok, I think we've concluded now (you guys mostly;) my code is horrible 🙂
Actually I just ripped it of some blog and used, since I didnt know cursors are that bad. Well...now I do know 🙂
To answer all questions:
- yes, im only removing duplicates on a SINGLE column value, because all other fields are the same (an earlier method has done that).
- Please find the scripts to create tables and sample data below. In the sample data not all columsn are the same, but thats ok, since those other columns may be ignored in my SQL to remove the duplicates.
- As you will see in the sample data and tables, some records occur only in "prospects" (called p from now), whilst others also have a reference in "prospectsincampaigns" (called pic from now)
What am I trying to achieve:
A duplicate record is found and removed based on exact match on value in column "companyname"
Scenario's
1 duplicate record occurs ONLY in p table and none of the duplicate records have a reference to it in pic table:
- all records, except for 1, are deleted from p table
2 duplicate record occurs in p table and SOME of those duplicate records HAVE a reference to it in pic table:
- all records that occur in p table are deleted EXCEPT the ones that have a reference in PIC table
3 duplicate records ALL have reference in pic table:
- no records are deleted
preferably the result of the quesries is the amount of records that were deleted, but if thats hard nevermind....
/****** Object: Table [dbo].[ProspectsInCampaigns] Script Date: 11/25/2010 18:36:26 ******/
SET ANSI_NULLS ON
GO
SET QUOTED_IDENTIFIER ON
GO
CREATE TABLE [dbo].[ProspectsInCampaigns](
[prospectid] [int] NOT NULL,
[campaignid] [int] NOT NULL
) ON [PRIMARY]
GO
/****** Object: Table [dbo].[prospects] Script Date: 11/25/2010 18:25:55 ******/
SET ANSI_NULLS ON
GO
SET QUOTED_IDENTIFIER ON
GO
CREATE TABLE [dbo].[prospects](
[id] [int] IDENTITY(1,1) NOT NULL,
[companyname] [nvarchar](255) NULL,
[companytype] [int] NOT NULL,
[companyaddress] [nvarchar](255) NULL,
[companyhousenr] [nvarchar](255) NULL,
[companyhousenradd] [nvarchar](255) NULL,
[companyzipcode] [nvarchar](255) NULL,
[companycity] [nvarchar](255) NULL,
CONSTRAINT [PK_intercity] PRIMARY KEY CLUSTERED
(
[id] ASC
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
) ON [PRIMARY]
GO
ALTER TABLE [dbo].[prospects] ADD CONSTRAINT [DF_prospects_companytype_1] DEFAULT ((0)) FOR [companytype]
GO
************************** INSERTING THE SAMPLE DATA **************************
declare @newid int
INSERT INTO prospects(companyname,companytype,companyaddress,companyzipcode) VALUES('MyCompany',2,'Mystreet','65434')
set @newid=SCOPE_IDENTITY()
INSERT INTO ProspectsInCampaigns(prospectid,campaignid) VALUES (@newid,1)
INSERT INTO prospects(companyname,companyaddress,companyhousenr,companyzipcode) VALUES('MyCompany','Mystreet','126','90210')
set @newid=SCOPE_IDENTITY()
INSERT INTO ProspectsInCampaigns(prospectid,campaignid) VALUES (@newid,1)
INSERT INTO prospects(companyname,companyhousenr,companyzipcode,companycity) VALUES('MyCompany','126','65474','New York')
set @newid=SCOPE_IDENTITY()
INSERT INTO ProspectsInCampaigns(prospectid,campaignid) VALUES (@newid,1)
--not inserted in a campaign, so should be deleted if duplicate
INSERT INTO prospects(companyname,companytype,companyaddress,companyhousenr,companyzipcode) VALUES('MyCompany',0,'','','')
INSERT INTO prospects(companyname,companytype,companyaddress,companyhousenr,companyzipcode) VALUES('MyCompany',5,'','','')
INSERT INTO prospects(companyname,companytype,companyaddress,companyhousenr,companyzipcode) VALUES('Other company',1,'Rosestreet','1','1002')
INSERT INTO prospects(companyname,companyaddress,companyhousenr,companyzipcode) VALUES('Other company','Rosestreet','1','100245')
INSERT INTO prospects(companyname,companyaddress,companyhousenr,companyzipcode) VALUES('yahoo.com','dlane 676','1','15444')
INSERT INTO prospects(companyname,companyaddress,companyhousenr,companyzipcode) VALUES('yahoo.com','dlane','1','15444')
Thanks so much for the good replies so far 🙂
November 25, 2010 at 2:14 pm
Here's my attempt. I'm confident there are easier solutions not requiring the multiple sort operations...
Edit: please explain the difference between this thread and the one Gianluca mentioned earlier. At the other thread you told the issue has been resolved followed by your reply over here. Sounds strange...
; WITH cte AS
(
SELECT
COUNT(*) OVER (PARTITION BY companyname,
CASE WHEN prospectid IS NULL THEN 0 ELSE 1 END) AS cnt_part, -- number of rows per companyname with and without entry in ProspectsInCampaigns
MIN(p.id) OVER (PARTITION BY companyname,
CASE WHEN prospectid IS NULL THEN 0 ELSE 1 END) min_part, -- lowest prospects.id per companyname with and without entry in ProspectsInCampaigns
COUNT(*) OVER (PARTITION BY companyname) cnt_total, -- total number of rows per companyname
p.id,
prospectid
FROM prospects p
LEFT OUTER JOIN ProspectsInCampaigns c ON p.id=c.campaignid
)
--DELETE prospects
SELECT p.*
FROM prospects p
INNER JOIN cte ON p.id=cte.id
WHERE
(prospectid IS NULL -- only rows with no reference to the ProspectsInCampaigns table
AND (cnt_part < cnt_total -- either there are still some rows left for that companyname with entries in ProspectsInCampaigns
OR (cnt_part = cnt_total AND p.id>min_part) -- or use all rows except the one with the earliest entry
)
)
November 25, 2010 at 4:33 pm
I think it should only delete rows with id in 4, 5, 7, 9
It would help of course if we had expected results 😉
Here is my version (which is similar to Lutz' one but only recommends deleting 4,5,7 and 9)
;with dupes
as
(
select
id
,companyname
--== counter > 1 means there is at least one duplicate based on companyname
,COUNT(*) over(partition by companyname) AS counter
--== rownum = 1 is the record to keep when there are no "pic" records
,ROW_NUMBER() OVER(partition by companyname order by id) AS rownum
--== pic is the count of records in the ProspectsInCampaign table for the current "id"
,pic AS pic
--== totalpic = 0 means we can delete all but the first company id
,SUM(pic) over(partition by companyname) AS totalpic
--== minpic > 0 means all duplicate companys in prospects also have records in ProspectsInCampaigns, so don't DELETE any
,MIN(pic) over(partition by companyname) AS minpic
from prospects
outer apply (select COUNT(*) from ProspectsInCampaigns where prospectid = id) a(pic)
)
select *
from dupes
cross apply (
SELECT
COALESCE(
case
when totalpic =0 and rownum>1 then 'DELETE ME'
else null
end, --== no pic records = delete all but first (ordered by id)
case
when minpic>0 then NULL
when totalpic>0 and pic=0 then 'DELETE ME'
else null
end --== minpic>0 means don't delete any, some pic records means delete records that have no pic records (pic=0)
) AS deleteFlag ) x
where counter>1
-- uncomment the following line to select only the rows that should be deleted.
-- AND deleteFlag='DELETE ME'
IF you want to know what is being deleted, either store the results of this query in a table and use it as the basis for the DELETE or use an OUTPUT clause to insert the deleted rows into another table when you do the DELETE.
MM
select geometry::STGeomFromWKB(0x0106000000020000000103000000010000000B0000001000000000000840000000000000003DD8CCCCCCCCCC0840000000000000003DD8CCCCCCCCCC08408014AE47E17AFC3F040000000000104000CDCCCCCCCCEC3F9C999999999913408014AE47E17AFC3F9C99999999991340000000000000003D0000000000001440000000000000003D000000000000144000000000000000400400000000001040000000000000F03F100000000000084000000000000000401000000000000840000000000000003D0103000000010000000B000000000000000000143D000000000000003D009E99999999B93F000000000000003D009E99999999B93F8014AE47E17AFC3F400000000000F03F00CDCCCCCCCCEC3FA06666666666FE3F8014AE47E17AFC3FA06666666666FE3F000000000000003D1800000000000040000000000000003D18000000000000400000000000000040400000000000F03F000000000000F03F000000000000143D0000000000000040000000000000143D000000000000003D, 0);
November 26, 2010 at 6:44 am
Hi
Check the following code. This may be useful.
declare @test_tab table (slno int,name varchar(10))
insert into @test_tab values (1,'a')
insert into @test_tab values (1,'b')
insert into @test_tab values (1,'c')
insert into @test_tab values (2,'a')
insert into @test_tab values (2,'b')
insert into @test_tab values (2,'c')
insert into @test_tab values (3,'a')
insert into @test_tab values (4,'a')
insert into @test_tab values (5,'a')
--select * from @test_tab
declare @test_tab1 table (slno int,name varchar(10),r int)
insert into @test_tab1
select * from (
select slno,name,DENSE_RANK() over (partition by slno order by name) r
from @test_tab ) a where r = 1
select * from @test_tab1
Thanks
Siva Kumar J.
November 27, 2010 at 7:09 am
Mister Magoo, that is indeed what the output is supposed to be (also my thanks to LutzM)!
I ran your query and it marks the correct records for deletion.
1 question:
you say I can use this output as a basis for deletion...(I dont want to copy the data to another table) but where exactly would I need to place the delete statement in that case?
November 29, 2010 at 5:13 am
It's always good to keep the original data of the table stashed in a different DB / different table just in case you have to restore data from that one table because you wrote your code wrong. I actually have a USER database where we keep data like this all the time. Easier to do than snapshots / backups because I don't lose all other production data when I restore (stuff that occurred after my fubar).
If the IDs are different, but all other columns are the same, there are easier ways to do this code than the CTE. Here's my preferred method.
SELECT *
INTO TemporaryDumpTable
FROM prospects;
--This keeps the data in case you make big mistake and need to reload the table
DELETE FROM prospects
FROM Prospects p1
INNER JOIN (Select MAX(ID) AS ID, CompanyName
FROM Prospects
GROUP BY CompanyName) p2
ON p1.CompanyName = p2.CompanyName
AND p1.ID = p2.ID;
--Deletes only the MAX id.
Now, if you have more than 1 duplicate, you might do a LEFT OUTER JOIN and delete where p1.ID <> p2.ID. That will work too. HOWEVER, this is code you should test in a dev environment before putting into production, as one misplaced symbol or JOIN statement will destroy your entire table's data.
November 29, 2010 at 8:10 am
petervdkerk (11/27/2010)
Mister Magoo, that is indeed what the output is supposed to be (also my thanks to LutzM)!I ran your query and it marks the correct records for deletion.
1 question:
you say I can use this output as a basis for deletion...(I dont want to copy the data to another table) but where exactly would I need to place the delete statement in that case?
Hi Peter, just like this:
;with dupes
as
(
select
id
,companyname
--== counter > 1 means there is at least one duplicate based on companyname
,COUNT(*) over(partition by companyname) AS counter
--== rownum = 1 is the record to keep when there are no "pic" records
,ROW_NUMBER() OVER(partition by companyname order by id) AS rownum
--== pic is the count of records in the ProspectsInCampaign table for the current "id"
,pic AS pic
--== totalpic = 0 means we can delete all but the first company id
,SUM(pic) over(partition by companyname) AS totalpic
--== minpic > 0 means all duplicate companys in prospects also have records in ProspectsInCampaigns, so don't DELETE any
,MIN(pic) over(partition by companyname) AS minpic
from prospects
outer apply (select COUNT(*) from ProspectsInCampaigns where prospectid = id) a(pic)
)
DELETE p
output deleted.*
from dupes
cross apply (
SELECT
COALESCE(
case
when totalpic =0 and rownum>1 then 'DELETE ME'
else null
end, --== no pic records = delete all but first (ordered by id)
case
when minpic>0 then NULL
when totalpic>0 and pic=0 then 'DELETE ME'
else null
end --== minpic>0 means don't delete any, some pic records means delete records that have no pic records (pic=0)
) AS deleteFlag ) x
join prospects p
on p.id = dupes.id
where counter>1
AND deleteFlag='DELETE ME'
Note that I included an OUTPUT clause to show you the deleted records - if you want you can insert that result set into another table as long as the table already exists.
MM
select geometry::STGeomFromWKB(0x0106000000020000000103000000010000000B0000001000000000000840000000000000003DD8CCCCCCCCCC0840000000000000003DD8CCCCCCCCCC08408014AE47E17AFC3F040000000000104000CDCCCCCCCCEC3F9C999999999913408014AE47E17AFC3F9C99999999991340000000000000003D0000000000001440000000000000003D000000000000144000000000000000400400000000001040000000000000F03F100000000000084000000000000000401000000000000840000000000000003D0103000000010000000B000000000000000000143D000000000000003D009E99999999B93F000000000000003D009E99999999B93F8014AE47E17AFC3F400000000000F03F00CDCCCCCCCCEC3FA06666666666FE3F8014AE47E17AFC3FA06666666666FE3F000000000000003D1800000000000040000000000000003D18000000000000400000000000000040400000000000F03F000000000000F03F000000000000143D0000000000000040000000000000143D000000000000003D, 0);
November 29, 2010 at 2:16 pm
There's a simpler way to dedupe since SQL 2005.
;with DupeCheck as
(select row_number() over (partition by company_name order by ID) as Row
from dbo.prospects)
delete from DupeCheck
where Row > 1;
If you want to store the rows that are deleted, for auditing or recovery, use an Output statement in the delete clause.
I recommend having an Output statement, and running the whole thing in a transaction that you roll back, then verifying what was deleted, then change the rollback to commit and run it again.
begin transaction;
create table #T (ID int);
;with DupeCheck as
(select row_number() over (partition by company_name order by ID) as Row
from dbo.prospects)
delete from DupeCheck
output deleted.ID into #T (ID)
where Row > 1;
select top 10 * from #T;
rollback;
Check the data you get from the Output clause, make sure you really want to get rid of those rows, and then you're good to go once you've confirmed that.
- Gus "GSquared", RSVP, OODA, MAP, NMVP, FAQ, SAT, SQL, DNA, RNA, UOI, IOU, AM, PM, AD, BC, BCE, USA, UN, CF, ROFL, LOL, ETC
Property of The Thread
"Nobody knows the age of the human race, but everyone agrees it's old enough to know better." - Anon
November 29, 2010 at 2:59 pm
@Brandie and GSquared,
There were additional requirements in the OP's question - and while it is helpful to have your excellent input, you are not addressing the OP's problem.
MM
select geometry::STGeomFromWKB(0x0106000000020000000103000000010000000B0000001000000000000840000000000000003DD8CCCCCCCCCC0840000000000000003DD8CCCCCCCCCC08408014AE47E17AFC3F040000000000104000CDCCCCCCCCEC3F9C999999999913408014AE47E17AFC3F9C99999999991340000000000000003D0000000000001440000000000000003D000000000000144000000000000000400400000000001040000000000000F03F100000000000084000000000000000401000000000000840000000000000003D0103000000010000000B000000000000000000143D000000000000003D009E99999999B93F000000000000003D009E99999999B93F8014AE47E17AFC3F400000000000F03F00CDCCCCCCCCEC3FA06666666666FE3F8014AE47E17AFC3FA06666666666FE3F000000000000003D1800000000000040000000000000003D18000000000000400000000000000040400000000000F03F000000000000F03F000000000000143D0000000000000040000000000000143D000000000000003D, 0);
Viewing 15 posts - 1 through 15 (of 33 total)
You must be logged in to reply to this topic. Login to reply