eliminating duplicates

Question

eliminating duplicates

Viewing 15 posts - 76 through 90 (of 137 total)

You must be logged in to reply to this topic. Login to reply

tosscrosby SSCoach Points: 18045 More actions · Answer 1

As far as a framework, Jeff Moden posts this link in his signature (and I believe others do as well).

It makes it very clear what should be posted to receive quicker, more accurate responses. As far as SQL2000 vs. SQL2005, sometimes it's critical as there are changes/additions to functionality depending on version, correct? Sometimes posts are posted to the wrong forum and if people are thinking it's 2005, that's what most - if not all, responses will be geared toward. I don't like wasting my time trying to do something one way and find out, due to version, it's not going to work. And yes, sometimes, shouting at posters is the only way to be HEARD. The important thing is that people are giving their time, and talent, freely here to try and help someone out. Sometimes the OP also has to help themselves....IMHO :rolleyes:

http://www.sqlservercentral.com/articles/Best+Practices/61537/

-- You can't be late until you show up.

d_sysuk SSC Eights! Points: 809 More actions · Answer 2

To the lovely Shawn: d_sysuk: dude you just don't get it do you?

--> Stop posting, we're fed up:..had enough of this yesterday.

Advice: STOP READING POST !

I like many others do not live on USA timezones, picked this question up my time today.

Had a look at this only for a couple of hours, but this highlights an interesting issue - almost valid of a MS connect feedback scenario - "To allow row_number deletes to apply - not to delete ALL duplicate entrys" ..., where it would be achieve a single SQL 2005 statement version :

delete from dbo.my_tab

from (select row_number() over (order by z.acctproc_id, z.[name], z.acct_id) row,

acctproc_id, name, acct_id

from my_tab z) SOURCE

JOIN

( select a.acctproc_id, a.[name], a.acct_id,

-- a.row,

c.count_row,

max(a.row) row_to_delete

from ( select z.acctproc_id, z.[name], z.acct_id,

row_number() over (order by z.acctproc_id, z.[name], z.acct_id) row

from dbo.my_tab z) a

join

(select b.acctproc_id, b.name, b.acct_id,count(*) count_row

from dbo.my_tab b

group by b.acctproc_id, b.name, b.acct_id

) c

on c.acctproc_id = a.acctproc_id

and c.name = a.name

and (c.acct_id = a.acct_id or c.acct_id is null and a.acct_id is null)

group by a.acctproc_id, a.name, a.acct_id, c.count_row) FILTEREDSET ON

--SOURCE.acctproc_id = FILTEREDSET.acctproc_id and

--SOURCE.[name] = FILTEREDSET.[name]

--and (SOURCE.acct_id = FILTEREDSET.acct_id

-- or Source.acct_id is null and filteredset.acct_id is null)

--and

SOURCE.row= FILTEREDSET.row_to_delete

to get the results we expect.

Where it is using the actual row_numer to perform the inline delete.

Shaun McGuile SSCarpal Tunnel Points: 4141 More actions · Answer 3

d_sysuk: I am not in the USA, I'm ahead of them which could mean 16 hours behind 😀 as the earth spins. that aside your SQL2005 solution looks quite minimal in size.

implying 2005 > 2000 and not just numerically 😉

but original question was SQL2000 :Whistling:

😀 Shaun not Shawn 😀

Hiding under a desk from SSIS Implemenation Work :crazy:

Ron McCullough SSC Guru Points: 63877 More actions · Answer 4

per rbarryyoung:

Except that that is NOT what was wanted. What was wanted was:
Eliminate One row from each distinct Group.
With the Following restrictions:
No intermediate tables!
and
No additional Identity columns
SQL Server 2000 (not 2005)

This will delete all non duplicated entries and the first row of a duplicated entry - leaving all other duplicated rows in the table.

;with numbered as(SELECT rowno=row_number() over

(partition by [your column] order by [your column name] ), [your column name] from [your table] )

delete from numbered where rowno =1

If everything seems to be going well, you have obviously overlooked something.

Ron

Please help us, help you -before posting a question please read[/url]
Before posting a performance problem please read[/url]

Jeff Moden SSC Guru Points: 1004311 More actions · Answer 5

bitbucket (7/1/2008)
per rbarryyoung:
Except that that is NOT what was wanted. What was wanted was:
Eliminate One row from each distinct Group.
With the Following restrictions:
No intermediate tables!
and
No additional Identity columns
SQL Server 2000 (not 2005)
This will delete all non duplicated entries and the first row of a duplicated entry - leaving all other duplicated rows in the table.
;with numbered as(SELECT rowno=row_number() over
(partition by [your column] order by [your column name] ), [your column name] from [your table] )
delete from numbered where rowno =1

Heh... RTFS! That won't work in SQL Server 2000 as requested!

--Jeff Moden

RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

Change is inevitable... Change for the better is not.

Helpful Links:
How to post code problems
How to Post Performance Problems
Create a Tally Function (fnTally)

Jeff Moden SSC Guru Points: 1004311 More actions · Answer 6

d_sysuk (7/1/2008)
Well going to the first post, its clear the the post is most unclear in its requirements. Shouting at posters wont help under any circumstances or imposing random conditions , e.g. its now SQL 2000, not SQL 2005 etc...

The whole problem is that you won't stop... you and a lot of other folks keep writing code that doesn't even come close to the specs and it's going to be confusing for other folks that may read this thread.

We've tried to tell everyone to stop because they're not reading the problem but they keep on posting any way... yourself included. So, I don't blame anyone for yelling because you... won't... listen... :hehe:

--Jeff Moden

RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

Change is inevitable... Change for the better is not.

Helpful Links:
How to post code problems
How to Post Performance Problems
Create a Tally Function (fnTally)

RBarryYoung SSC Guru Points: 143329 More actions · Answer 7

Chris Morris (7/1/2008)
Oh all right then. No funny stuff either.
DECLARE @ItemID INT
DECLARE @FirstCol varChar(5), @SecondCol INT, @ThirdCol Char(1)
SET @ItemID = 1
UPDATE #Testing SET ThirdCol = CASE @ItemID WHEN 1 THEN 'D' ELSE ThirdCol END,
@ItemID = CASE WHEN @FirstCol = FirstCol AND @SecondCol = SecondCol AND @ThirdCol = ThirdCol THEN @ItemID+1 ELSE 1 END,
@FirstCol = FirstCol, @SecondCol = SecondCol, @ThirdCol = ThirdCol
DELETE FROM #Testing WHERE ThirdCol = 'D'

Very nice, chris. Though don't forget that you need a covering Clustered Index to make this work.

[font="Times New Roman"]-- RBarryYoung[/font], [font="Times New Roman"] (302)375-0451[/font] blog: MovingSQL.com, Twitter: @RBarryYoung [font="Arial Black"]
Proactive Performance Solutions, Inc. [/font][font="Verdana"] "Performance is our middle name."[/font]

Ron McCullough SSC Guru Points: 63877 More actions · Answer 8

Jeff as usual you are CORRECT ... I must have gotten carried away reading all these posts ... tell me one thing what the heck is RTFS and if it can not be said in mixed company send a PM....

If everything seems to be going well, you have obviously overlooked something.

Ron

Please help us, help you -before posting a question please read[/url]
Before posting a performance problem please read[/url]

Shaun McGuile SSCarpal Tunnel Points: 4141 More actions · Answer 9

RTFS = Read The (some not nice word beginning with F, usually emphasized) Spec - if I'm not mistaken 😀

I could be wrong

Hiding under a desk from SSIS Implemenation Work :crazy:

d_sysuk SSC Eights! Points: 809 More actions · Answer 10

Yes the solution I posted, like others would work on SQL 2005, and not SQL 2000.

Sometimes, we'll look at things not in the exact way a poster wants, but then downgrade a script.

Something that has been missed from this nice analysis:

I was looking at this problem in more detail, and what some seemed to miss was positional updates is just not possible in SQL Server.

Something that you CAN do in ORACLE !

I.e. every row has a unique identifier, which means if we wanted a table full of duplicates, we could delete one copy of a duplicate within a delete statement.

The problem being a delete x,y,z from will delete every instance of that record which matchs, and not the first one from a set with no way of positioning the update..(which would probly make the partition idea also fail for this case) but alas this fine point has been missed, and the opportunity to improve something in the future.

In terms of delete the records from the table for a SQL 2000 solution, what I would have suggested would have been turn the problem on its head for this one..

Just SELECT out the data required at a view layer... would have been something like this (creating a view, as rqd):

-- this requires some hard-coding with count=N

-- Get records which have 0 duplicates

--select acctproc_id, name, acct_id ,'U' duplicate

-- from dbo.my_tab

-- group by acctproc_id, name, acct_id

-- having count(*) = 1

--union

-- Get records which have 1 DUPLICATE !

-- output 1 Row, remove 1 dup row

(select acctproc_id, name, acct_id ,'D' duplicate

from dbo.my_tab

group by acctproc_id, name, acct_id

having count(*) =2 )

union

-- GRab this set twice, as we have 2 dups..

(select acctproc_id, name, acct_id ,'D' duplicate

from dbo.my_tab

group by acctproc_id, name, acct_id

having count(*) = 3)

union

(select acctproc_id, name, acct_id ,'D' duplicate

from dbo.my_tab

group by acctproc_id, name, acct_id

having count(*) =3)

-- ETC Rqd..

... Yes, requries repeating syntax for max number of dup items expected.

The view then act as a temporary interface layer for the application, and tidy the data up in the background as required.

ChrisM@Work SSC Guru Points: 186127 More actions · Answer 11

rbarryyoung (7/1/2008)
Though don't forget that you need a covering Clustered Index to make this work.

Thanks for pointing that out Barry. This solution will only work if the data is ordered such that identical rows are sequential^*, or there's a clustered covering index.

* Probably - there are threads arguing that this may not always be the case.

^{“Write the query the simplest way. If through testing it becomes clear that the performance is inadequate, consider alternative query forms.” - Gail Shaw}

For fast, accurate and documented assistance in answering your questions, please read this article.
Understanding and using APPLY, (I) and (II) Paul White
Hidden RBAR: Triangular Joins / The "Numbers" or "Tally" Table: What it is and how it replaces a loop Jeff Moden

karthik M One Orange Chip Points: 29139 More actions · Answer 12

Jeff,

Heh... this all reminds me of another of my favorite 4 letter acronyms... RTFS!!!!!

what do you mean by RTFS ?

karthik

ChrisM@Work SSC Guru Points: 186127 More actions · Answer 13

karthikeyan (7/2/2008)
Jeff,
Heh... this all reminds me of another of my favorite 4 letter acronyms... RTFS!!!!!
what do you mean by RTFS ?

Read the friggin' spec 😛

^{“Write the query the simplest way. If through testing it becomes clear that the performance is inadequate, consider alternative query forms.” - Gail Shaw}

For fast, accurate and documented assistance in answering your questions, please read this article.
Understanding and using APPLY, (I) and (II) Paul White
Hidden RBAR: Triangular Joins / The "Numbers" or "Tally" Table: What it is and how it replaces a loop Jeff Moden

Shaun McGuile SSCarpal Tunnel Points: 4141 More actions · Answer 14

Read

The

Final

Spec.

or

Read

The

Full

Spec.

are the polite suggestions I can immediately think of 😀

Hiding under a desk from SSIS Implemenation Work :crazy:

Saritha-450752 Say Hey Kid Points: 690 More actions · Answer 15

Try this

set rowcount 1

select 1

while @@rowcount > 0

delete Emp1 where 1 < (select count(*) from Emp1 a2

where Emp1.Eno = a2.Eno

and Emp1.Ename = a2.Ename

)

set rowcount 0