January 30, 2018 at 1:58 pm
Hi All,
EDIT: I was convinced I was connecting to a 2016 instance but just now found out it's in-fact a 2008 R2 instance.
I have the following data set below - I have identified duplicates which is purely based on the column 'DuplicatingValue'.
Information:
There will only ever be 2 duplicated records per 'DuplicatingValue'.
There is no unique identifier in the data set.
Requirements\rules:
Referring to the sample data below:
Within a duplicate set - if the values are 'Red' for one record and 'Green' for the other record, then delete the 'Red' record and keep the 'Green' record ('DuplicatingValues' 1 and 4 in the sample below).
For any other combination in the duplicate set, do nothing - therefore in this example for 'DuplicatingValues' 2 and 3, do nothing.
CREATE TABLE [dbo].[CentralSample](
[Value1] [varchar](50) NULL,
[DuplicatingValue] [varchar](50) NULL
) ON [PRIMARY]
GO
INSERT [dbo].[CentralSample] ([Value1], [DuplicatingValue]) VALUES (N'Green', N'1')
GO
INSERT [dbo].[CentralSample] ([Value1], [DuplicatingValue]) VALUES (N'Red', N'1')
GO
INSERT [dbo].[CentralSample] ([Value1], [DuplicatingValue]) VALUES (N'Red', N'2')
GO
INSERT [dbo].[CentralSample] ([Value1], [DuplicatingValue]) VALUES (N'Red', N'2')
GO
INSERT [dbo].[CentralSample] ([Value1], [DuplicatingValue]) VALUES (N'Green', N'3')
GO
INSERT [dbo].[CentralSample] ([Value1], [DuplicatingValue]) VALUES (N'Green', N'3')
GO
INSERT [dbo].[CentralSample] ([Value1], [DuplicatingValue]) VALUES (N'Green', N'4')
GO
INSERT [dbo].[CentralSample] ([Value1], [DuplicatingValue]) VALUES (N'Red', N'4')
GO
The result expected from the above sample data once the duplicates are removed, would be:
select 'Green' as Value1,1 as DuplicatingValue
union all
select 'Red',2
union all
select 'Red',2
union all
select 'Green',3
union all
select 'Green',3
union all
select 'Green',4
Value1 | DuplicatingValue |
Green | 1 |
Red | 2 |
Red | 2 |
Green | 3 |
Green | 3 |
Green | 4 |
I managed to identify the duplicates, but I have not figured out how to remove the unwanted records.
Thanks in advance for any help in pointing me in the right direction.
Regards,
Michael
January 30, 2018 at 8:31 pm
I think this is close... back up your table or wrap this in a transaction and roll it back when testing.. The INTERSECT returns the DuplicatingValue where there is at least one "Green" record and at least one "Red" record..
DELETE FROM CentralSample
;
WHERE DuplicatingValue IN (
SELECT DuplicatingValue
FROM CentralSample
WHERE Value1 = 'Green'
INTERSECT
SELECT DuplicatingValue
FROM CentralSample
WHERE Value1 = 'Red');
January 31, 2018 at 7:26 am
pietlinden, thank you very much for the reply, this will lead me in the right direction, appreciate your time.
Regards,
Michael
January 31, 2018 at 8:10 am
Here's a different option that might be better, but you should still test for performance.
WITH CTE AS(
SELECT *,
ROW_NUMBER() OVER(PARTITION BY cs.DuplicatingValue ORDER BY cs.Value1) rn,
RANK() OVER(PARTITION BY cs.DuplicatingValue ORDER BY cs.Value1) rnk
FROM dbo.CentralSample AS cs
)
DELETE
FROM CTE
WHERE CTE.rn = 2
AND CTE.rnk = 2
AND CTE.Value1 = N'Red'
January 31, 2018 at 8:15 am
And yet another option:
WITH CTE AS(
SELECT *,
LAG(cs.Value1) OVER(PARTITION BY cs.DuplicatingValue ORDER BY cs.Value1 ) prev_Value1
FROM dbo.CentralSample AS cs
)
DELETE
FROM CTE
WHERE CTE.prev_Value1 = N'Green'
AND CTE.Value1 = N'Red'
January 31, 2018 at 8:28 am
Luis,
Thanks so much for the 2 options, I can't use the 2nd option because I was convinced I was connecting to a 2016 instance but just now found out it's in-fact a 2008 R2 instance.
I really appreciate your time.
Thanks
Michael
January 31, 2018 at 8:34 am
Yes, or this, which is very similar. It relies on Red and Green being the only two possible values of Value1. I think Piet's solution deleted too many rows.
WITH Duplicates AS (
SELECT
DuplicatingValue
, Value1
, MAX(Value1) OVER (PARTITION BY DuplicatingValue) AS MaxValue
, MIN(Value1) OVER (PARTITION BY DuplicatingValue) AS MinValue
FROM CentralSample
)
DELETE FROM Duplicates
WHERE MaxValue > MinValue
AND Value1 = 'Red';
January 31, 2018 at 8:49 am
John Mitchell-245523 - Wednesday, January 31, 2018 8:34 AMYes, or this, which is very similar. It relies on Red and Green being the only two possible values of Value1. I think Piet's solution deleted too many rows.
WITH Duplicates AS (
SELECT
DuplicatingValue
, Value1
, MAX(Value1) OVER (PARTITION BY DuplicatingValue) AS MaxValue
, MIN(Value1) OVER (PARTITION BY DuplicatingValue) AS MinValue
FROM CentralSample
)
DELETE FROM Duplicates
WHERE MaxValue > MinValue
AND Value1 = 'Red';
Thanks John - yes, you are correct, Piet's was deleting all the rows that have a combination of red and green - but it helped me in identifying the records.
Your code gives the correct results - thank you very much for your input too.
Regards,
Michael
January 31, 2018 at 3:19 pm
I have the following data set below - I have identified duplicates which is purely based on the column 'DuplicatingValue'.
>> There will only ever be 2 duplicated records [sic] per 'DuplicatingValue'. <<
Rows are not record [sic] s. There is a huge and significant difference between the two concepts.
>> There is no unique identifier in the data set. <<
without a key, you cannot have a table. This is by definition, so nothing you are doing applies to an SQL problem. You have a deck of punch cards 🙁
I have an article coming out on Codd’s deGreen of duplication, which was his way of solving this sort of problem you might want to read it.
>> Within a duplicate set - if the are 'Red' for one record [sic] and 'Green' for the other record [sic], then delete the 'Red' record [sic] and keep the 'Green' record [sic] ('DuplicatingValues' 1 and 4 in the sample below). For any other combination in the duplicate set, do nothing - therefore in this example for 'DuplicatingValues' 2 and 3, do nothing. <<
Dr. Codd had what is called the information principle in his 12 rules of RDBMS. If I understand what you’re saying is that what you are calling “duplicating_values†is the identifier for one of these sets.
Another problem with your posting is that you have no idea what the ISO 11179 naming rules are, and that a table must have a key. Then why is everything a variable string? You also don’t know syntax for the insert into statement. Let’s do some corrections on the DDL you did post (by the way, at least you tried to post DDL; so many people here don’t even bother with that courtesy)
CREATE TABLE Central_Samples
(something_color VARCHAR(5) NOT NULL
CHECK (something_color IN (‘Red’, ‘Green’),
set_nbr INTEGER NOT NULL
CHECK (set_nbr > 0),
PRIMARY KEY (something_color, set_nbr)
);
I want you to think about this. If you actually had a primary key, there would be no need to remove the duplicates. You could have prevented them. Would you buy a house from a carpenter who tells you that he didn’t finish the roof but don’t worry about it. You can fix it later!
INSERT INTO Central_Samples
('Green', 1), ('Red', 1),
('Red', 2), ('Red', 2), --- dups
('Green', 3), ('Green', 3), --- dups
('Green', 4), ('Red', 4);
>> The result expected from the above sample data once the duplicates are removed, would be:
'Green',1
'Red',2
'Red',2
'Green',3
'Green',3
'Green',4
<<
no, that’s not what the results would be. You still have duplicated rows! Try this:
Green 1
Red 1
Red 2
Green 3
Green 4
This will find the sets that have both red and green. You can use it in the delete statement, if you wish.
SELECT set_nbr
FROM Central_Samples
GROUP BY set_nbr
HAVING MIN(something_color) = ‘Green’
AND MAX(something_color)= ‘Red’;
Please post DDL and follow ANSI/ISO standards when asking for help.
Viewing 9 posts - 1 through 8 (of 8 total)
You must be logged in to reply to this topic. Login to reply