Remove duplicate record based on criteria

Question

Remove duplicate record based on criteria

micang

SSC Eights!

Points: 878
More actions
January 30, 2018 at 1:58 pm

#328731

Hi All,
EDIT: I was convinced I was connecting to a 2016 instance but just now found out it's in-fact a 2008 R2 instance.
I have the following data set below - I have identified duplicates which is purely based on the column 'DuplicatingValue'.

Information:
There will only ever be 2 duplicated records per 'DuplicatingValue'.
There is no unique identifier in the data set.
Requirements\rules:
Referring to the sample data below:
Within a duplicate set - if the values are 'Red' for one record and 'Green' for the other record, then delete the 'Red' record and keep the 'Green' record ('DuplicatingValues' 1 and 4 in the sample below).
For any other combination in the duplicate set, do nothing - therefore in this example for 'DuplicatingValues' 2 and 3, do nothing.
CREATE TABLE [dbo].[CentralSample]( [Value1] [varchar](50) NULL, [DuplicatingValue] [varchar](50) NULL ) ON [PRIMARY] GO INSERT [dbo].[CentralSample] ([Value1], [DuplicatingValue]) VALUES (N'Green', N'1') GO INSERT [dbo].[CentralSample] ([Value1], [DuplicatingValue]) VALUES (N'Red', N'1') GO INSERT [dbo].[CentralSample] ([Value1], [DuplicatingValue]) VALUES (N'Red', N'2') GO INSERT [dbo].[CentralSample] ([Value1], [DuplicatingValue]) VALUES (N'Red', N'2') GO INSERT [dbo].[CentralSample] ([Value1], [DuplicatingValue]) VALUES (N'Green', N'3') GO INSERT [dbo].[CentralSample] ([Value1], [DuplicatingValue]) VALUES (N'Green', N'3') GO INSERT [dbo].[CentralSample] ([Value1], [DuplicatingValue]) VALUES (N'Green', N'4') GO INSERT [dbo].[CentralSample] ([Value1], [DuplicatingValue]) VALUES (N'Red', N'4') GO
The result expected from the above sample data once the duplicates are removed, would be:
select 'Green' as Value1,1 as DuplicatingValue union all select 'Red',2 union all select 'Red',2 union all select 'Green',3 union all select 'Green',3 union all select 'Green',4

Value1 DuplicatingValue
Green 1
Red 2
Red 2
Green 3
Green 3
Green 4
I managed to identify the duplicates, but I have not figured out how to remove the unwanted records.
Thanks in advance for any help in pointing me in the right direction.
Regards,
Michael

Viewing 9 posts - 1 through 8 (of 8 total)

You must be logged in to reply to this topic. Login to reply

pietlinden SSC Guru Points: 63517 More actions · Answer 1

I think this is close... back up your table or wrap this in a transaction and roll it back when testing.. The INTERSECT returns the DuplicatingValue where there is at least one "Green" record and at least one "Red" record..

DELETE FROM CentralSample WHERE DuplicatingValue IN ( SELECT DuplicatingValue FROM CentralSample WHERE Value1 = 'Green' INTERSECT SELECT DuplicatingValue FROM CentralSample WHERE Value1 = 'Red');;

micang SSC Eights! Points: 878 More actions · Answer 2

pietlinden, thank you very much for the reply, this will lead me in the right direction, appreciate your time.

Regards,
Michael

Luis Cazares SSC Guru Points: 183706 More actions · Answer 3

Here's a different option that might be better, but you should still test for performance.
WITH CTE AS( SELECT *, ROW_NUMBER() OVER(PARTITION BY cs.DuplicatingValue ORDER BY cs.Value1) rn, RANK() OVER(PARTITION BY cs.DuplicatingValue ORDER BY cs.Value1) rnk FROM dbo.CentralSample AS cs ) DELETE FROM CTE WHERE CTE.rn = 2 AND CTE.rnk = 2 AND CTE.Value1 = N'Red'

Luis C.
General Disclaimer:
Are you seriously taking the advice and code from someone from the internet without testing it? Do you at least understand it? Or can it easily kill your server?

How to post data/code on a forum to get the best help: Option 1 / Option 2

Luis Cazares SSC Guru Points: 183706 More actions · Answer 4

And yet another option:
WITH CTE AS( SELECT *, LAG(cs.Value1) OVER(PARTITION BY cs.DuplicatingValue ORDER BY cs.Value1 ) prev_Value1 FROM dbo.CentralSample AS cs ) DELETE FROM CTE WHERE CTE.prev_Value1 = N'Green' AND CTE.Value1 = N'Red'

Luis C.
General Disclaimer:
Are you seriously taking the advice and code from someone from the internet without testing it? Do you at least understand it? Or can it easily kill your server?

How to post data/code on a forum to get the best help: Option 1 / Option 2

micang SSC Eights! Points: 878 More actions · Answer 5

Luis,
Thanks so much for the 2 options, I can't use the 2nd option because I was convinced I was connecting to a 2016 instance but just now found out it's in-fact a 2008 R2 instance.
I really appreciate your time.

Thanks
Michael

John Mitchell-245523 SSC Guru Points: 148809 More actions · Answer 6

Yes, or this, which is very similar. It relies on Red and Green being the only two possible values of Value1. I think Piet's solution deleted too many rows.

WITH Duplicates AS ( SELECT DuplicatingValue , Value1 , MAX(Value1) OVER (PARTITION BY DuplicatingValue) AS MaxValue , MIN(Value1) OVER (PARTITION BY DuplicatingValue) AS MinValue FROM CentralSample ) DELETE FROM Duplicates WHERE MaxValue > MinValue AND Value1 = 'Red';

micang SSC Eights! Points: 878 More actions · Answer 7

John Mitchell-245523 - Wednesday, January 31, 2018 8:34 AM
Yes, or this, which is very similar. It relies on Red and Green being the only two possible values of Value1. I think Piet's solution deleted too many rows.
WITH Duplicates AS ( SELECT DuplicatingValue , Value1 , MAX(Value1) OVER (PARTITION BY DuplicatingValue) AS MaxValue , MIN(Value1) OVER (PARTITION BY DuplicatingValue) AS MinValue FROM CentralSample ) DELETE FROM Duplicates WHERE MaxValue > MinValue AND Value1 = 'Red';

Thanks John - yes, you are correct, Piet's was deleting all the rows that have a combination of red and green - but it helped me in identifying the records.
Your code gives the correct results - thank you very much for your input too.

Regards,
Michael

jcelko212 32090 SSCrazy Eights Points: 9303 More actions · Answer 8

micang - Tuesday, January 30, 2018 1:58 PM

I have the following data set below - I have identified duplicates which is purely based on the column 'DuplicatingValue'.

>> There will only ever be 2 duplicated records [sic] per 'DuplicatingValue'. <<

Rows are not record [sic] s. There is a huge and significant difference between the two concepts.

>> There is no unique identifier in the data set. <<

without a key, you cannot have a table. This is by definition, so nothing you are doing applies to an SQL problem. You have a deck of punch cards 🙁

I have an article coming out on Coddâ€™s deGreen of duplication, which was his way of solving this sort of problem you might want to read it.

>> Within a duplicate set - if the are 'Red' for one record [sic] and 'Green' for the other record [sic], then delete the 'Red' record [sic] and keep the 'Green' record [sic] ('DuplicatingValues' 1 and 4 in the sample below). For any other combination in the duplicate set, do nothing - therefore in this example for 'DuplicatingValues' 2 and 3, do nothing. <<

Dr. Codd had what is called the information principle in his 12 rules of RDBMS. If I understand what youâ€™re saying is that what you are calling â€œduplicating_valuesâ€ is the identifier for one of these sets.

Another problem with your posting is that you have no idea what the ISO 11179 naming rules are, and that a table must have a key. Then why is everything a variable string? You also donâ€™t know syntax for the insert into statement. Letâ€™s do some corrections on the DDL you did post (by the way, at least you tried to post DDL; so many people here donâ€™t even bother with that courtesy)

CREATE TABLE Central_Samples
(something_color VARCHAR(5) NOT NULL
CHECK (something_color IN (â€˜Redâ€™, â€˜Greenâ€™),
set_nbr INTEGER NOT NULL
CHECK (set_nbr > 0),
PRIMARY KEY (something_color, set_nbr)
);

I want you to think about this. If you actually had a primary key, there would be no need to remove the duplicates. You could have prevented them. Would you buy a house from a carpenter who tells you that he didnâ€™t finish the roof but donâ€™t worry about it. You can fix it later!

INSERT INTO Central_Samples
('Green', 1), ('Red', 1),
('Red', 2), ('Red', 2), --- dups
('Green', 3), ('Green', 3), --- dups
('Green', 4), ('Red', 4);

>> The result expected from the above sample data once the duplicates are removed, would be:

'Green',1
'Red',2
'Red',2
'Green',3
'Green',3
'Green',4
<<

no, thatâ€™s not what the results would be. You still have duplicated rows! Try this:

Green    1
Red    1
Red    2
Green    3
Green    4

This will find the sets that have both red and green. You can use it in the delete statement, if you wish.

SELECT set_nbr
FROM Central_Samples
GROUP BY set_nbr
HAVING MIN(something_color) = â€˜Greenâ€™
AND MAX(something_color)= â€˜Redâ€™;

Please post DDL and follow ANSI/ISO standards when asking for help.