Which of the two approaches below is more efficient?

Question

DBTeam

SSC Eights!

Points: 956
November 24, 2009 at 3:03 am

#218036

Hi All,
Which of the two approaches below is more efficient for a database of 50M records
with 30+ fields?
with EmailRecords as (
select
row_number() over (partition by Email order by rowid desc) as RowNumber ,
[Company],[webAddress] ,[Prefix] ,[Contactname] ,[FirstName] ,[MiddleName] ,
[LastName] ,[Title] ,[Address] ,[Address1] ,[Address2] ,[Address3] ,[City] ,
[State] ,[Pincode] ,[STDcode] ,[Phone] ,[Phone1] ,[Phone2] ,[Phone3] ,
[FaxNumber] ,[Mobile] ,[Email] ,[Industry] ,[Product Code] ,[Revenue] ,
[Experience] ,[Dateofbirth] ,[dob] ,[age] ,[martialstatus] ,[Keyskills] ,
[education] ,[category] ,[Dealer]
from dbo.table2
)
insert into dbo.[NewTable]
select [Company],[webAddress] ,[Prefix] ,[Contactname] ,[FirstName] ,[MiddleName] ,
[LastName] ,[Title] ,[Address] ,[Address1] ,[Address2] ,[Address3] ,[City] ,
[State] ,[Pincode] ,[STDcode] ,[Phone] ,[Phone1] ,[Phone2] ,[Phone3] ,
[FaxNumber] ,[Mobile] ,[Email] ,[Industry] ,[Product Code] ,[Revenue] ,
[Experience] ,[Dateofbirth] ,[dob] ,[age] ,[martialstatus] ,[Keyskills] ,
[education] ,[category] ,[Dealer]
from EmailRecords
where RowNumber = 1;
---------------------------------------------------------
DELETE FROM table2
WHERE rowid IN
(
SELECT a.rowid
FROM table2 a,table2 b
WHERE a.rowid!= b.rowid
and a.rowid< b.rowid
and a.[Email]= b.[Email]
)

spaghettidba SSC Guru Points: 105732 · Answer 1

The two queries do different things, but I guess you are trying to detect duplicate records.

Copying the "distinct" records into a new table means moving a lot of data, so I don't think it will be very efficient. Detecting duplicates with ROW_NUMBER() requires a scan, a segment and a sort. What will you then do with the non-duplicates in the new table? At some point you will have to put them back into the original table.

I think deleting duplicates is the way to go, but I would not re-invent the wheel:

Choose your favourite technique
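
For illustration only (the technique originally linked here is unknown), one widely used option is to delete through a CTE that numbers the rows within each Email group. This sketch reuses the table and column names from the question, assumes rowid is unique, and keeps the row with the highest rowid for each Email:

```sql
-- Sketch only: assumes dbo.table2 has a unique rowid column, as in the question.
-- Number the rows within each Email group, highest rowid first,
-- then delete every row except the first one of each group.
with EmailRecords as (
    select
        row_number() over (partition by Email order by rowid desc) as RowNumber
    from dbo.table2
)
delete from EmailRecords
where RowNumber > 1;
```

Note that PARTITION BY places all rows with a NULL Email in one group, so only one of them survives, whereas the self-join DELETE in the question never matches NULL emails and leaves them all in place. Run as a single statement, this is also one large transaction on a table of this size.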

Edited: a piece of the post was deleted. Strange: it never happened before.

-- Gianluca Sartori

Lynn Pettis SSC Guru Points: 442470 · Answer 2

It also depends on how many duplicate records vs. unique records you have in the table. How many unique rows (based on email) are there in the table?
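
A quick way to get those numbers might look like the following. It is a rough sketch that uses the table and column names from the question; the column aliases are made up for the example.

```sql
-- Sketch only: table and column names are taken from the question.
select
    count(*)                         as TotalRows,
    count(distinct Email)            as UniqueEmails,  -- NULL emails are not counted here
    count(*) - count(distinct Email) as SurplusRows    -- every row with a NULL Email is included here
from dbo.table2;
```

SurplusRows is only an approximation of the number of duplicates if the Email column contains NULLs.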

DBTeam SSC Eights! Points: 956 · Answer 3

November 24, 2009 at 7:14 am

#1083435

5 million records are duplicates.

Lynn Pettis SSC Guru Points: 442470 · Answer 4

November 24, 2009 at 7:22 am

#1083439

Just to be sure, is rowid unique?
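
If there is no primary key or unique constraint on the column to confirm it, a check along these lines could answer the question. It is a minimal sketch using the names from the original post; the Occurrences alias is made up for the example.

```sql
-- Sketch only: returns any rowid value that occurs more than once.
-- An empty result means rowid is unique in dbo.table2.
select rowid, count(*) as Occurrences
from dbo.table2
group by rowid
having count(*) > 1;
```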

DBTeam SSC Eights! Points: 956 · Answer 5

November 24, 2009 at 9:56 pm

#1083804

Yes, rowid is unique.

Lynn Pettis SSC Guru Points: 442470 · Answer 6

Okay, here is one way to delete five million rows of data from a table with thirty-seven million rows of data.

declare @Batch int;
set @Batch = 10000; -- Batch size to delete. Set this to whatever size you feel works best.

while @Batch > 0
begin
    -- Number the rows within each Email group, highest rowid first.
    with EmailRecords as (
        select
            row_number() over (partition by Email order by rowid desc) as RowNumber,
            RowID
        from
            dbo.table2
    )
    -- Delete up to @Batch rows that are not the first row of their Email group.
    delete top (@Batch)
    from
        dbo.table2
    from
        dbo.table2 t2
        inner join EmailRecords er
            on (t2.RowID = er.RowID)
    where
        er.RowNumber > 1;

    set @Batch = @@ROWCOUNT; -- Capture how many rows were deleted; the loop ends when this reaches 0

    -- backup log [yourDBName] ...
    -- Here would be a good place to back up your transaction log
    -- should your database be using the FULL or BULK_LOGGED recovery model.
    -- This will keep the log from growing excessively and keep your log chain
    -- intact.
end

Here is an article I wrote that discusses this process:

http://www.sqlservercentral.com/articles/T-SQL/67898/

In addition to the article, you should also read its discussion thread.
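
Once the loop has finished, a check like the one below should come back empty. It is a small sketch using the same table and column names as the code above; the Occurrences alias is made up for the example.

```sql
-- Sketch only: lists Email values that still occur more than once.
-- GROUP BY puts all NULL emails into a single group, matching the
-- PARTITION BY used by the delete loop, so an empty result means
-- no duplicates remain.
select Email, count(*) as Occurrences
from dbo.table2
group by Email
having count(*) > 1;
```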

Edit: Fix broken link. (Thank you Jeff.)

Jeff Moden SSC Guru Points: 1004704 · Answer 7

Lynn... the link you posted is broken...

--Jeff Moden

RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

Change is inevitable... Change for the better is not.


Lynn Pettis SSC Guru Points: 442470 · Answer 8

November 26, 2009 at 1:21 am

#1084475

Thanks Jeff. I fixed the link. 😉