November 6, 2014 at 7:44 am
I'm looking at various methods for deleting duplicate rows. Among the alternatives, one works just fine but gives me results that make me go, "huh?".
Consider this script:
declare @t table (a int, b int, c int, d int, e int)
insert into @t (a, b, c, d, e) values
(1, 2, 3, 4, 5),
(3, 4, 2, 3, 4),
(1, 2, 3, 4, 5)
select a,b,c,d,e, rn = row_number() over (
partition by a,b,c,d,e
order by a,b,c,d,e)
from @t;
with cte as (
select a,b,c,d,e, rn = row_number() over (
partition by a,b,c,d,e
order by a,b,c,d,e)
from @t
)
delete
from cte
output deleted.*
where rn > 1;
The code works -- that is, the duplicate row is deleted. However the output clause returns:
abcdern
123451
So....why? Why does the output clause show that the row with rn=1 was deleted, when the where clause stipulates rn > 1?
Gerald Britton, Pluralsight courses
November 6, 2014 at 8:20 am
I presume this is just an 'out of interest' type question, because when you're deleting exact duplicates from a heap, there is no real row number? You could not use this information to relate back to the original data and match the duplicate rows one to one.
If you add another dupe row, you'll see that the deleted row numbers are 1 and 2. So it seems that the SQL engine removes the correct rows and then somehow loses track of the row which remains. Looks a bit strange, but not really a problem.
The absence of evidence is not evidence of absence
- Martin Rees
The absence of consumable DDL, sample data and desired results is, however, evidence of the absence of my response
- Phil Parkin
November 6, 2014 at 8:31 am
Phil Parkin (11/6/2014)
I presume this is just an 'out of interest' type question, because when you're deleting exact duplicates from a heap, there is no real row number?
The row number is generated by the ROW_NUMBER() function and the sort order is specified. AFAIK there is no "real" row number in a clustered index either. The closest you get is the CI key which maps to RID which is just a file/page/rowid.
Even if I change the table definition to:
declare @t table (id int identity(1,1) primary key, a int, b int, c int, d int, e int)
to make it a CI, the output clause still returns incorrect results.
Gerald Britton, Pluralsight courses
November 6, 2014 at 8:46 am
g.britton (11/6/2014)
Phil Parkin (11/6/2014)
I presume this is just an 'out of interest' type question, because when you're deleting exact duplicates from a heap, there is no real row number?The row number is generated by the ROW_NUMBER() function and the sort order is specified. AFAIK there is no "real" row number in a clustered index either. The closest you get is the CI key which maps to RID which is just a file/page/rowid.
Even if I change the table definition to:
declare @t table (id int identity(1,1) primary key, a int, b int, c int, d int, e int)
to make it a CI, the output clause still returns incorrect results.
Well, using your revised table definition, the delete should be done like this:
delete t
output deleted.*
from @t t
join cte on t.Id = cte.id
where rn > 1;
And everything works.
The absence of evidence is not evidence of absence
- Martin Rees
The absence of consumable DDL, sample data and desired results is, however, evidence of the absence of my response
- Phil Parkin
November 6, 2014 at 9:51 am
My guess is it's just some quirk with the output clause and ROW_NUMBER(), because the expected row gets kept even when the OUTPUT seems to indicate otherwise:
declare @t table (a int, b int, c int, d int, e int, f int)
insert into @t (a, b, c, d, e, f) values
(1, 2, 3, 4, 5, 99),
(3, 4, 2, 3, 4, 98),
(1, 2, 3, 4, 5, 96),
(1, 2, 3, 4, 5, 97)
select a,b,c,d,e, rn = row_number() over (
partition by a,b,c,d,e
order by a,b,c,d,e)
from @t;
with cte as (
select a,b,c,d,e, rn = row_number() over (
partition by a,b,c,d,e
order by a,b,c,d,e,f)
from @t
)
delete
from cte
output deleted.*
where rn > 1;
select * from @t
SQL DBA,SQL Server MVP(07, 08, 09) "It's a dog-eat-dog world, and I'm wearing Milk-Bone underwear." "Norm", on "Cheers". Also from "Cheers", from "Carla": "You need to know 3 things about Tortelli men: Tortelli men draw women like flies; Tortelli men treat women like flies; Tortelli men's brains are in their flies".
November 6, 2014 at 9:53 am
well, isn't that interesting!
Don't know about the "should be done" bit though. My version doesn't need a self-join. It's just weird that SQL loses track of what it just deleted.
Gerald Britton, Pluralsight courses
November 6, 2014 at 10:02 am
g.britton (11/6/2014)
well, isn't that interesting!Don't know about the "should be done" bit though. My version doesn't need a self-join. It's just weird that SQL loses track of what it just deleted.
You're missing something.
The issue has to do with using the row_number() against the DELETED "magic table". If you were to add in the ID number like you mentioned and control which one to delete you will see it's wiping out the expected row. That said the ROW_NUMBER value isn't a physical value in the table, so DELETED doesn't have it to return. As a result the query attempt to provide based on what it knows from the CTE, but only find one row (in the delete), ergo RN is 1.
Confirm the results this way:
declare @t table (a int, b int, c int, d int, e int, id int identity(1,1))--<-- adding in the identity
insert into @t (a, b, c, d, e) values
(1, 2, 3, 4, 5),
(3, 4, 2, 3, 4),
(1, 2, 3, 4, 5)
select a,b,c,d,e,ID, row_number() over (
partition by a,b,c,d,e
order by a,b,c,d,e, ID) as RN --<-- making SURE that the "second" duplicate gets deleted
from @t;
with cte as (
select a,b,c,d,e,ID, row_number() over (
partition by a,b,c,d,e
order by a,b,c,d,e, ID) as RN
from @t
)
delete
from cte
output deleted.*
where rn > 1;
You will see that ID = 3 is the one that is gone. The ID is reliable since it's a physical column in the table which is what exists in the DELETED pseudo table
----------------------------------------------------------------------------------
Your lack of planning does not constitute an emergency on my part...unless you're my manager...or a director and above...or a really loud-spoken end-user..All right - what was my emergency again?
November 6, 2014 at 10:17 am
Quick thought, there are actually two row_number instances in the plan, one used for the delete and another for the output. As only one row is outputted, the row number is of course 1.
😎
[Table Scan]-->[Sort]--[Segment]-->[Sequence Project]-->[Filter]-->[Table Delete]-->[Sort]-->[Segment]-->[Sequence Project]-->[DELETE]
|<---------- 1st row number -------->| delete |<--------- 2nd row number ---------->| output
November 6, 2014 at 10:53 am
Good observation, though the output becomes misleading. I never doubted that the "right" row was deleted. I just found the results from the output clause to be confusing. I've lowered that to "misleading" 🙂
Gerald Britton, Pluralsight courses
November 6, 2014 at 11:19 am
I actually encountered this exact behaviour last weekend. Got me a bit worried!
My explanation would be that the CTE is never materialised as such, so the rn value is never stored, and the ROW_NUMBER() clause is re-evaluated in the deleted table.
November 6, 2014 at 11:29 am
Gazareth (11/6/2014)
I actually encountered this exact behaviour last weekend. Got me a bit worried!My explanation would be that the CTE is never materialised as such, so the rn value is never stored, and the ROW_NUMBER() clause is re-evaluated in the deleted table.
That's in a way correct, there are two row number (segment-sequence project) operators in the execution plan, one for the delete and one for the output.
😎
Viewing 11 posts - 1 through 10 (of 10 total)
You must be logged in to reply to this topic. Login to reply