February 1, 2016 at 1:45 pm
Thanks!
When I CAST the text field as VARCHAR(MAX), I was able to locate the duplicate text field values.
WITH cte AS (
SELECT a.*,
ROW_NUMBER() OVER(PARTITION BY [varcharfield],CAST([textfield] as VARCHAR(MAX)) ORDER BY CAST([textfield] as VARCHAR(MAX))) rrn
FROM dbo.[table_with_dups] a
)
SELECT * FROM cte WHERE rrn > 1
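For anyone who wants to experiment with this duplicate-finding pattern outside SQL Server, here is a minimal sketch in SQLite (3.25+ for window functions) driven from Python. The table and data are made up for the demo, and SQLite has no legacy text type, so no CAST is needed — the point is just the PARTITION BY / rrn > 1 pattern:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE table_with_dups (varcharfield TEXT, textfield TEXT);
INSERT INTO table_with_dups VALUES
  ('A', 'long text one'),
  ('A', 'long text one'),   -- duplicate of the row above
  ('B', 'long text two');
""")

# Same pattern as the CTE above: number the rows inside each
# (varcharfield, textfield) group; any row numbered > 1 is a duplicate.
dups = conn.execute("""
WITH cte AS (
    SELECT varcharfield, textfield,
           ROW_NUMBER() OVER (PARTITION BY varcharfield, textfield
                              ORDER BY textfield) AS rrn
    FROM table_with_dups
)
SELECT varcharfield, textfield, rrn FROM cte WHERE rrn > 1
""").fetchall()

print(dups)  # the one extra 'A' row
```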
February 1, 2016 at 2:07 pm
The author's posted solution works, and it DOESN'T DELETE ALL duplicates (if there are 3 duplicate rows, it deletes 2 and leaves 1). So it works. Also, I'd guess the ROW_NUMBER() solutions people are posting work only when we have a unique identifier in the table. But having a unique identifier obviously means no two rows can appear as duplicates (no matter if all the other columns are the same). The post is about deleting duplicates when there is no unique identifier (i.e. all column values are the same for more than one row).
February 1, 2016 at 2:21 pm
venkataprasanth (2/1/2016)
The author's posted solution works, and it DOESN'T DELETE ALL duplicates (if there are 3 duplicate rows, it deletes 2 and leaves 1). So it works. Also, I'd guess the ROW_NUMBER() solutions people are posting work only when we have a unique identifier in the table. But having a unique identifier obviously means no two rows can appear as duplicates (no matter if all the other columns are the same). The post is about deleting duplicates when there is no unique identifier (i.e. all column values are the same for more than one row).
What do you mean by having a unique identifier? The ROW_NUMBER option lets you define what constitutes a duplicate (one column, some columns, or all of them). The method shown in the article is bad in that it needs the table to have a single column that defines duplication, and the worst part is that it deletes one row at a time using a deprecated option.
I'm sorry that Ginger Keys got such a harsh reaction, but the alternatives are certainly much better.
February 1, 2016 at 2:36 pm
Luis Cazares (2/1/2016)
venkataprasanth (2/1/2016)
..but the alternatives are certainly much better.
I unnecessarily second the motion. :Whistling:
February 1, 2016 at 2:49 pm
This is what I mean by the unique identifier in a ROW_NUMBER() methodology:
CASE 1:
CREATE TABLE Duplicates(RowID INT IDENTITY (1,1), Col1 INT, Col2 VARCHAR(100))
INSERT INTO Duplicates VALUES (100, 'ABC'),(200, 'ABC'),(300, 'BBB'),(300, 'BBB'),(200, 'CCC'),(400, 'EEE'),(400, 'EEE'),(400, 'EEE'),(500, 'DDA')
DELETE FROM Duplicates FROM (SELECT *,ROW_NUMBER() OVER (Partition by col1,col2 order by col1) as RankCol1 FROM Duplicates) As T
WHERE Duplicates.RowID IN (SELECT T.RowID WHERE T.RankCol1 >1)
RowID is the anchor column for deciding which of the rows that are duplicated across the rest of the columns can be deleted from the table.
CASE 2: NOTE that RowID is replaced by a CustID column, which is not a unique identifier.
CREATE TABLE Duplicates(CustID INT, Col1 INT, Col2 VARCHAR(100))
INSERT INTO Duplicates VALUES (1, 100, 'ABC'),(2, 200, 'ABC'),(3, 300, 'BBB'),(3, 300, 'BBB'),(3, 300, 'BBB'),(4, 200, 'CCC'),(5, 400, 'EEE'),(5, 400, 'EEE'),(5, 400, 'EEE'),(6, 500, 'DDA')
Can you achieve the delete functionality using ROW_NUMBER() in the above example? I am guessing you cannot. You can post if you have an answer.
So my point is that CASE 1 is not a table with duplicate rows, because there is a unique identifier (even though Col1 and Col2 have duplicate values), but CASE 2 is.
You can delete the rows with duplicated Col1 and Col2 values from CASE 1 using ROW_NUMBER(), but you cannot do this for CASE 2.
February 1, 2016 at 3:19 pm
The problem is that you're missing an important option that SQL Server offers: the possibility to modify data through CTEs and views, as long as they follow certain rules (check updatable views).
With that option, the code results in something easier to read and generate. Here's an example of your first case.
--CASE 1:
CREATE TABLE Duplicates(
RowID INT IDENTITY (1,1),
Col1 INT,
Col2 VARCHAR(100));
INSERT INTO Duplicates
VALUES (100, 'ABC'),
(200, 'ABC'),
(300, 'BBB'),
(300, 'BBB'),
(200, 'CCC'),
(400, 'EEE'),
(400, 'EEE'),
(400, 'EEE'),
(500, 'DDA');
--Validate the data before the delete
SELECT * FROM Duplicates ORDER BY Col1, col2;
/* --Original code (rearranged)
DELETE FROM Duplicates
FROM (SELECT *,
ROW_NUMBER() OVER (Partition by col1,col2 order by col1) as RankCol1
FROM Duplicates) As T
WHERE Duplicates.RowID IN (SELECT T.RowID WHERE T.RankCol1 >1)
*/
WITH CTE AS(
SELECT *, ROW_NUMBER() OVER (Partition by col1,col2 --Use the columns that define what is a duplicate row
order by RowID --Ability to define which row will be left
) as RowNum
FROM Duplicates
)
DELETE FROM CTE
WHERE RowNum > 1;
--Validate the data after the delete
SELECT * FROM Duplicates ORDER BY Col1, col2;
GO
--Clean my DB
DROP TABLE Duplicates;
And here's the second case you stated. I just needed to include the CustID column in the PARTITION BY to establish that it's also part of the duplicates definition.
CREATE TABLE Duplicates(
CustID INT,
Col1 INT,
Col2 VARCHAR(100));
INSERT INTO Duplicates
VALUES (1, 100, 'ABC'),
(2, 200, 'ABC'),
(3, 300, 'BBB'),
(3, 300, 'BBB'),
(3, 300, 'BBB'),
(4, 200, 'CCC'),
(5, 400, 'EEE'),
(5, 400, 'EEE'),
(5, 400, 'EEE'),
(6, 500, 'DDA');
SELECT * FROM Duplicates ORDER BY Col1, col2;
WITH CTE AS(
SELECT *, ROW_NUMBER() OVER (Partition by CustID, col1,col2 order by col1) as RowNum
FROM Duplicates
)
DELETE FROM CTE
WHERE RowNum > 1;
SELECT * FROM Duplicates ORDER BY Col1, col2;
GO
--Clean my DB
DROP TABLE Duplicates;
No Loops, no RBAR, no deprecated features, no multiple FROM in the delete statement, just simple SQL code that can even be ported to a different RDBMS.
Is this the best solution? It depends.
With large data sets and few duplicates, a solution as simple as this might not perform well enough, and depending on factors such as table size, indexing, and the proportion of duplicates, the code may need adjustments. For most scenarios, though, this method should work just fine.
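The portability claim above is easy to check. Here is a sketch of the same set-based ROW_NUMBER() delete against SQLite from Python, using the CASE 2 data; SQLite cannot DELETE through a CTE the way SQL Server can, so the CTE selects the rowids of the extra copies instead:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Duplicates (CustID INT, Col1 INT, Col2 TEXT);
INSERT INTO Duplicates VALUES
  (1, 100, 'ABC'), (2, 200, 'ABC'),
  (3, 300, 'BBB'), (3, 300, 'BBB'), (3, 300, 'BBB'),
  (4, 200, 'CCC'),
  (5, 400, 'EEE'), (5, 400, 'EEE'), (5, 400, 'EEE'),
  (6, 500, 'DDA');
""")

# Number the rows inside each (CustID, Col1, Col2) group and
# delete every row numbered > 1 -- one set-based statement, no loop.
conn.execute("""
WITH cte AS (
    SELECT rowid AS rid,
           ROW_NUMBER() OVER (PARTITION BY CustID, Col1, Col2
                              ORDER BY rowid) AS rn
    FROM Duplicates
)
DELETE FROM Duplicates
WHERE rowid IN (SELECT rid FROM cte WHERE rn > 1)
""")

rows = conn.execute(
    "SELECT CustID, Col1, Col2 FROM Duplicates ORDER BY CustID").fetchall()
print(rows)  # six distinct rows remain
```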
February 1, 2016 at 3:41 pm
I was not aware that you could do DML operations on CTEs; I learned that today. This works well too, and is certainly better than looping. Thank you for posting your answer.
February 2, 2016 at 1:46 am
My favorite method to clean up duplicated rows is this (I used the example code from above):
CREATE TABLE Duplicates(
CustID INT,
Col1 INT,
Col2 VARCHAR(100));
INSERT INTO Duplicates
VALUES (1, 100, 'ABC'),
(2, 200, 'ABC'),
(3, 300, 'BBB'),
(3, 300, 'BBB'),
(3, 300, 'BBB'),
(4, 200, 'CCC'),
(5, 400, 'EEE'),
(5, 400, 'EEE'),
(5, 400, 'EEE'),
(6, 500, 'DDA');
SELECT * FROM Duplicates ORDER BY Col1, col2;
DELETE FROM Duplicates
WHERE %%physloc%% NOT IN (SELECT MAX(%%physloc%%)
                          FROM Duplicates
                          GROUP BY CustID, Col1, Col2);
SELECT * FROM Duplicates ORDER BY Col1, col2;
GO
--Clean my DB
DROP TABLE Duplicates;
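%%physloc%% is an undocumented SQL Server marker for a row's physical location. The same keep-one-row-per-group pattern can be sketched portably with SQLite's implicit rowid standing in for %%physloc%% (same made-up data as the post above):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Duplicates (CustID INT, Col1 INT, Col2 TEXT);
INSERT INTO Duplicates VALUES
  (1, 100, 'ABC'), (2, 200, 'ABC'),
  (3, 300, 'BBB'), (3, 300, 'BBB'), (3, 300, 'BBB'),
  (4, 200, 'CCC'),
  (5, 400, 'EEE'), (5, 400, 'EEE'), (5, 400, 'EEE'),
  (6, 500, 'DDA');
""")

# rowid plays the role %%physloc%% plays in the T-SQL version:
# keep only the row with the highest rowid in each group.
conn.execute("""
DELETE FROM Duplicates
WHERE rowid NOT IN (SELECT MAX(rowid)
                    FROM Duplicates
                    GROUP BY CustID, Col1, Col2)
""")

remaining = conn.execute("SELECT COUNT(*) FROM Duplicates").fetchone()[0]
print(remaining)  # 6
```

One caveat with the original: %%physloc%% can change whenever rows move (rebuilds, page splits), so it should never be stored or relied on outside a single statement like this one.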
February 2, 2016 at 1:55 am
The article on deleting duplicates worked well for me. But now, every time I try to create the PK, a new duplicate record is created. I'm not sure what's going on; maybe it's a technical issue in SQL 2005?
February 2, 2016 at 7:57 am
I also thought it would delete all the duplicated records. However, when I tested it, the code did work. The key is SET ROWCOUNT 1, which means that all subsequent queries affect only one row.
When I looked up setting ROWCOUNT there was a note stating "Avoid using SET ROWCOUNT with DELETE, INSERT, and UPDATE statements in new development work, and plan to modify applications that currently use it. For a similar behavior, use the TOP syntax. For more information, see TOP (Transact-SQL)."
An advantage of using the ROW_NUMBER() OVER option is that it is set-based rather than working on only one row at a time, and it requires less code.
February 2, 2016 at 8:07 am
o.kaesmann (2/2/2016)
My favorite method to clean up duplicated rows is this (I used the example code from above):
That's certainly interesting. I've never heard of %%physloc%% before, and for a moment I thought it wasn't even T-SQL code. I'm not sure I'd like to use it in production code, but it's something worth researching.
February 2, 2016 at 8:58 am
But this method eliminates all duplicate records; that is wrong.
February 2, 2016 at 9:27 am
marquito_61 (2/2/2016)
But this method eliminates all duplicate records; that is wrong.
As far as I know, none of the code posted in the article or the forum eliminates all rows that have duplicates; they all remove the extra copies and leave the desired rows.
February 2, 2016 at 10:22 am
It seems that the author is using SET ROWCOUNT 1 to limit the DELETE to affecting just one record at a time. The MSDN documentation for SET ROWCOUNT contains this warning:
Important
Using SET ROWCOUNT will not affect DELETE, INSERT, and UPDATE statements in a future release of SQL Server. Avoid using SET ROWCOUNT with DELETE, INSERT, and UPDATE statements in new development work, and plan to modify applications that currently use it. For a similar behavior, use the TOP syntax.
In light of this fact, you may want to come up with a better method for removing duplicates.
February 2, 2016 at 11:03 am
andy_111 (1/31/2016)
Not really useful article.
Not really a useful comment, either. Please explain why you think it's not useful.
--Jeff Moden
Change is inevitable... Change for the better is not.
Viewing 15 posts - 31 through 45 (of 76 total)