February 2, 2016 at 12:33 pm
This will delete all of the records that have more than 1 occurrence, including the original.
February 2, 2016 at 12:49 pm
John Donaldson (2/2/2016)
This will delete all of the records that have more than 1 occurrence, including the original.
I guess you didn't try the code. Here, let me give you the opportunity to run it by giving you sample data and the code all in one.
CREATE TABLE Duplicates(
CustID INT,
Col1 INT,
Col2 VARCHAR(100));
INSERT INTO Duplicates
VALUES (1, 100, 'ABC'),
(2, 200, 'ABC'),
(3, 300, 'BBB'),
(3, 300, 'BBB'),
(3, 300, 'BBB'),
(4, 200, 'CCC'),
(5, 400, 'EEE'),
(5, 400, 'EEE'),
(5, 400, 'EEE'),
(6, 500, 'DDA');
SELECT * FROM Duplicates ORDER BY Col1, col2;
SET NOCOUNT ON
SET ROWCOUNT 1
WHILE 1 = 1
BEGIN
DELETE
FROM Duplicates
WHERE Col1 IN
(SELECT Col1
FROM Duplicates
GROUP BY Col1
HAVING COUNT(*) > 1)
IF @@Rowcount = 0
BREAK ;
END
SET ROWCOUNT 0
SELECT * FROM Duplicates ORDER BY Col1, col2;
GO
--Clean my DB
DROP TABLE Duplicates;
This technique is bad because it relies on a deprecated feature (SET ROWCOUNT limiting the rows affected by DML) and deletes one row at a time, but it actually does what it promises to do.
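To see concretely that the loop leaves one row per group, here is a minimal sketch of the same row-at-a-time idea, assuming Python with SQLite rather than SQL Server (a LIMIT 1 subquery on rowid stands in for the deprecated SET ROWCOUNT 1):

```python
import sqlite3

# Same sample data as the T-SQL above.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Duplicates(CustID INT, Col1 INT, Col2 TEXT);
    INSERT INTO Duplicates VALUES
        (1, 100, 'ABC'), (2, 200, 'ABC'),
        (3, 300, 'BBB'), (3, 300, 'BBB'), (3, 300, 'BBB'),
        (4, 200, 'CCC'),
        (5, 400, 'EEE'), (5, 400, 'EEE'), (5, 400, 'EEE'),
        (6, 500, 'DDA');
""")

while True:
    # Delete exactly one row from any Col1 group that still has duplicates.
    cur = conn.execute("""
        DELETE FROM Duplicates
        WHERE rowid = (SELECT rowid FROM Duplicates
                       WHERE Col1 IN (SELECT Col1 FROM Duplicates
                                      GROUP BY Col1
                                      HAVING COUNT(*) > 1)
                       LIMIT 1)
    """)
    if cur.rowcount == 0:
        break  # no group has more than one row left

rows = conn.execute(
    "SELECT Col1, COUNT(*) FROM Duplicates GROUP BY Col1").fetchall()
print(rows)  # every Col1 value now appears exactly once
```

As in the T-SQL, once a group is down to one row the HAVING clause excludes it, so the last "original" in each group survives.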
February 2, 2016 at 1:20 pm
Luis Cazares (2/1/2016)
The problem is that you're missing an important option that SQL Server offers: the ability to modify data through CTEs and views, as long as they follow certain rules (check updatable views). With that option, the code becomes easier to read and generate. Here's an example of your first case.
--CASE 1:
CREATE TABLE Duplicates(
RowID INT IDENTITY (1,1),
Col1 INT,
Col2 VARCHAR(100));
INSERT INTO Duplicates
VALUES (100, 'ABC'),
(200, 'ABC'),
(300, 'BBB'),
(300, 'BBB'),
(200, 'CCC'),
(400, 'EEE'),
(400, 'EEE'),
(400, 'EEE'),
(500, 'DDA');
--Validate the data before the delete
SELECT * FROM Duplicates ORDER BY Col1, col2;
/* --Original code (rearranged)
DELETE FROM Duplicates
FROM (SELECT *,
ROW_NUMBER() OVER (Partition by col1,col2 order by col1) as RankCol1
FROM Duplicates) As T
WHERE Duplicates.RowID IN (SELECT T.RowID WHERE T.RankCol1 >1)
*/
WITH CTE AS(
SELECT *, ROW_NUMBER() OVER (Partition by col1,col2 --Use the columns that define what is a duplicate row
order by RowID --Ability to define which row will be left
) as RowNum
FROM Duplicates
)
DELETE FROM CTE
WHERE RowNum > 1;
--Validate the data after the delete
SELECT * FROM Duplicates ORDER BY Col1, col2;
GO
--Clean my DB
DROP TABLE Duplicates;
And here's the second case you stated. I just needed to include the CustID column in the PARTITION BY to establish that it's also part of the duplicates definition.
CREATE TABLE Duplicates(
CustID INT,
Col1 INT,
Col2 VARCHAR(100));
INSERT INTO Duplicates
VALUES (1, 100, 'ABC'),
(2, 200, 'ABC'),
(3, 300, 'BBB'),
(3, 300, 'BBB'),
(3, 300, 'BBB'),
(4, 200, 'CCC'),
(5, 400, 'EEE'),
(5, 400, 'EEE'),
(5, 400, 'EEE'),
(6, 500, 'DDA');
SELECT * FROM Duplicates ORDER BY Col1, col2;
WITH CTE AS(
SELECT *, ROW_NUMBER() OVER (Partition by CustID, col1,col2 order by col1) as RowNum
FROM Duplicates
)
DELETE FROM CTE
WHERE RowNum > 1;
SELECT * FROM Duplicates ORDER BY Col1, col2;
GO
--Clean my DB
DROP TABLE Duplicates;
No loops, no RBAR, no deprecated features, no multiple FROM clauses in the delete statement, just simple SQL code that can even be ported to a different RDBMS.
Is this the best solution? It depends.
With large data sets and few duplicates, a solution as simple as this might not perform well enough, and depending on various factors the code might need changes. For most scenarios, though, this method should work just fine.
Thank you, Luis, for posting an approach that makes sense, is flexible enough to include multiple columns of multiple types, and is performant.
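For readers without a SQL Server instance handy, the ROW_NUMBER() partitioning quoted above can be sketched in SQLite via Python. Note this is a port under assumptions: SQLite cannot DELETE through a CTE the way SQL Server can, so the windowed query feeds a rowid list instead, but the duplicate-defining logic is identical.

```python
import sqlite3

# Case 1 data from the thread: RowID is a surrogate key.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Duplicates(RowID INTEGER PRIMARY KEY, Col1 INT, Col2 TEXT);
    INSERT INTO Duplicates(Col1, Col2) VALUES
        (100, 'ABC'), (200, 'ABC'),
        (300, 'BBB'), (300, 'BBB'),
        (200, 'CCC'),
        (400, 'EEE'), (400, 'EEE'), (400, 'EEE'),
        (500, 'DDA');
""")

conn.execute("""
    DELETE FROM Duplicates
    WHERE RowID IN (
        SELECT RowID
        FROM (SELECT RowID,
                     ROW_NUMBER() OVER (
                         PARTITION BY Col1, Col2  -- columns that define a duplicate
                         ORDER BY RowID           -- which row is kept (lowest RowID)
                     ) AS RowNum
              FROM Duplicates)
        WHERE RowNum > 1)
""")

remaining = conn.execute(
    "SELECT Col1, Col2 FROM Duplicates ORDER BY Col1, Col2").fetchall()
print(remaining)  # one row per distinct (Col1, Col2) pair
```

Swapping the PARTITION BY column list is all it takes to change what counts as a duplicate, mirroring the second case in the post.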
February 2, 2016 at 1:23 pm
Luis Cazares (2/2/2016)
o.kaesmann (2/2/2016)
My favorite method to clean up duplicated rows is this (I used the example code from above):
That's certainly interesting. I've never heard of %%physloc%% before, and for a moment I thought it wasn't even T-SQL code. I'm not sure I'd like to use it in production code, but it's something worth some research.
It's a physical row locator. It's been around a while. Paul did a post on it at http://www.sqlskills.com/blogs/paul/sql-server-2008-new-undocumented-physical-row-locator-function/. Yes, it's real.
February 2, 2016 at 1:31 pm
Ed Wagner (2/2/2016)
Luis Cazares (2/2/2016)
o.kaesmann (2/2/2016)
My favorite method to clean up duplicated rows is this (I used the example code from above):
That's certainly interesting. I've never heard of %%physloc%% before, and for a moment I thought it wasn't even T-SQL code. I'm not sure I'd like to use it in production code, but it's something worth some research.
It's a physical row locator. It's been around a while. Paul did a post on it at http://www.sqlskills.com/blogs/paul/sql-server-2008-new-undocumented-physical-row-locator-function/. Yes, it's real.
Yes, I tested it after finding Paul's post. The question is: would you use it in production code?
February 2, 2016 at 1:47 pm
Luis Cazares (2/2/2016)
Ed Wagner (2/2/2016)
Luis Cazares (2/2/2016)
o.kaesmann (2/2/2016)
My favorite method to clean up duplicated rows is this (I used the example code from above):
That's certainly interesting. I've never heard of %%physloc%% before, and for a moment I thought it wasn't even T-SQL code. I'm not sure I'd like to use it in production code, but it's something worth some research.
It's a physical row locator. It's been around a while. Paul did a post on it at http://www.sqlskills.com/blogs/paul/sql-server-2008-new-undocumented-physical-row-locator-function/. Yes, it's real.
Yes, I tested it after finding Paul's post. The question is: would you use it in production code?
I don't see a use for production code, but maybe for troubleshooting internals.
February 2, 2016 at 10:50 pm
I do use it in a production environment, because we get data from other systems that contains duplicated data (don't ask ....), and it works fine.
In the past I've worked on Oracle and learned to eliminate duplicates by using ROWID. I searched for something equivalent in MS-SQL and found this physloc thing.
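The Oracle ROWID pattern described here has a close analogue in SQLite's implicit rowid, which can be sketched via Python. This is only a loose parallel (%%physloc%% is a physical locator, SQLite's rowid a logical key), but the dedup shape is the same: keep the smallest rowid per duplicate group, delete the rest.

```python
import sqlite3

# Fully duplicated rows: no column distinguishes the copies,
# so only a row locator can.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Duplicates(CustID INT, Col1 INT, Col2 TEXT);
    INSERT INTO Duplicates VALUES
        (3, 300, 'BBB'), (3, 300, 'BBB'), (3, 300, 'BBB'),
        (5, 400, 'EEE'), (5, 400, 'EEE'),
        (6, 500, 'DDA');
""")

# Keep the lowest rowid in each (CustID, Col1, Col2) group.
conn.execute("""
    DELETE FROM Duplicates
    WHERE rowid NOT IN (SELECT MIN(rowid)
                        FROM Duplicates
                        GROUP BY CustID, Col1, Col2)
""")

rows = conn.execute("SELECT CustID, Col1, Col2 FROM Duplicates").fetchall()
print(rows)
```

In Oracle the same statement works with ROWID; in SQL Server, %%physloc%% is undocumented, which is the crux of the production-use question in this thread.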
February 3, 2016 at 2:09 am
Not sure if I am missing something here, but why can't the following be used?
delete Customers
where CustID NOT in (
select min(CustID)
from Customers
group by CustName
)
EDIT: I have subsequently realised (following responses) that CustID is not a PK or covered by any other unique constraint, so I'm aware this will not work in this instance.
February 3, 2016 at 5:24 am
SQLian (2/3/2016)
Not sure if I am missing something here, but why can't the following be used?
delete Customers
where CustID NOT in (
select min(CustID)
from Customers
group by CustName
)
Because that will delete all the non-duplicate customers as well.
February 3, 2016 at 7:24 am
o.kaesmann (2/2/2016)
I do use it in a production environment, because we get data from other systems that contains duplicated data (don't ask ....), and it works fine.
In the past I've worked on Oracle and learned to eliminate duplicates by using ROWID. I searched for something equivalent in MS-SQL and found this physloc thing.
I believe that Oracle's ROWID is the fastest mechanism to access a specific row in a table. Do you know if the same is true of MS-SQL?
February 3, 2016 at 7:37 am
Brandie Tarvin (2/3/2016)
SQLian (2/3/2016)
Not sure if I am missing something here, but why can't the following be used?
delete Customers
where CustID NOT in (
select min(CustID)
from Customers
group by CustName
)
Because that will delete all the non-duplicate customers as well.
No, it won't. But it won't work with rows that are complete duplicates or with tables with composite keys.
February 3, 2016 at 7:58 am
Yes, my bad as I failed to read the original question properly. I assumed wrongly that CustId was a PK.
I'll return to my cave.
February 3, 2016 at 9:07 am
SQLian (2/3/2016)
Yes, my bad as I failed to read the original question properly. I wrongly assumed that CustId was a PK. I'll return to my cave.
Your approach is certainly better than the one proposed in the article. It just won't work in certain cases, but it's not bad per se.
February 3, 2016 at 9:50 am
Luis Cazares (2/3/2016)
Brandie Tarvin (2/3/2016)
SQLian (2/3/2016)
Not sure if I am missing something here, but why can't the following be used?
delete [FROM] Customers
where CustID NOT in (
select min(CustID)
from Customers
group by CustName
)
--Brandie added FROM clause
Because that will delete all the non-duplicate customers as well.
No, it won't. But it won't work with rows that are complete duplicates or with tables with composite keys.
Maybe I'm misreading, but SELECT MIN(CustID) would select the minimum customer ID (say 1) and then delete everything else, wouldn't it?
If not, what am I missing?
February 3, 2016 at 9:56 am
Brandie Tarvin (2/3/2016)
Luis Cazares (2/3/2016)
Brandie Tarvin (2/3/2016)
SQLian (2/3/2016)
Not sure if I am missing something here, but why can't the following be used?
delete [FROM] Customers
where CustID NOT in (
select min(CustID)
from Customers
group by CustName
)
--Brandie added FROM clause
Because that will delete all the non-duplicate customers as well.
No, it won't. But it won't work with rows that are complete duplicates or with tables with composite keys.
Maybe I'm misreading, but SELECT MIN(CustID) would select the minimum customer ID (say 1) and then delete everything else, wouldn't it?
If not, what am I missing?
You're missing the GROUP BY CustName. It will leave one CustID for each CustName. For unique names it won't do anything, and for duplicate names it will delete the higher CustIDs.
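A quick sketch of that behavior, assuming Python with SQLite and the hypothetical Customers/CustName schema from this exchange (CustID must be unique, as the thread established):

```python
import sqlite3

# 'Bob' is duplicated; 'Alice' and 'Carol' are unique names.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Customers(CustID INTEGER PRIMARY KEY, CustName TEXT);
    INSERT INTO Customers VALUES
        (1, 'Alice'), (2, 'Bob'), (3, 'Bob'), (4, 'Carol'), (5, 'Bob');
""")

# GROUP BY CustName keeps one MIN(CustID) per name, so unique names
# always survive and only the higher CustIDs of duplicated names go.
conn.execute("""
    DELETE FROM Customers
    WHERE CustID NOT IN (SELECT MIN(CustID)
                         FROM Customers
                         GROUP BY CustName)
""")

survivors = conn.execute(
    "SELECT CustID, CustName FROM Customers ORDER BY CustID").fetchall()
print(survivors)  # [(1, 'Alice'), (2, 'Bob'), (4, 'Carol')]
```

This confirms both points above: nothing beyond the duplicates is deleted, but the approach needs a unique CustID and fails on fully duplicated rows.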
Viewing 15 posts - 46 through 60 (of 76 total)