November 24, 2009 at 3:03 am
Hi All,
Which one is the efficient Way from the below two for 50M database
having 30+ fields
with EmailRecords as (
select
row_number() over (partition by Email order by rowid desc) as RowNumber ,
[Company],[webAddress] ,[Prefix] ,[Contactname] ,[FirstName] ,[MiddleName] ,
[LastName] ,[Title] ,[Address] ,[Address1] ,[Address2] ,[Address3] ,[City] ,
[State] ,[Pincode] ,[STDcode] ,[Phone] ,[Phone1] ,[Phone2] ,[Phone3] ,
[FaxNumber] ,[Mobile] ,[Email] ,[Industry] ,[Product Code] ,[Revenue] ,
[Experience] ,[Dateofbirth] ,[dob] ,[age] ,[martialstatus] ,[Keyskills] ,
[education] ,[category] ,[Dealer]
from dbo.table2
)
insert into dbo.[NewTable]
select [Company],[webAddress] ,[Prefix] ,[Contactname] ,[FirstName] ,[MiddleName] ,
[LastName] ,[Title] ,[Address] ,[Address1] ,[Address2] ,[Address3] ,[City] ,
[State] ,[Pincode] ,[STDcode] ,[Phone] ,[Phone1] ,[Phone2] ,[Phone3] ,
[FaxNumber] ,[Mobile] ,[Email] ,[Industry] ,[Product Code] ,[Revenue] ,
[Experience] ,[Dateofbirth] ,[dob] ,[age] ,[martialstatus] ,[Keyskills] ,
[education] ,[category] ,[Dealer]
from EmailRecords
where RowNumber = 1;
---------------------------------------------------------
DELETE FROM table2
WHERE rowid IN
(
SELECT a.rowid
FROM table2 a,table2 b
WHERE a.rowid!= b.rowid
and a.rowid< b.rowid
and a.[Email]= b.[Email]
)
November 24, 2009 at 4:41 am
The two queries do different things, but I guess you are trying to detect duplicate records.
Putting "distinct" records into a new table involves copying a lot of data, so I don't think it will be very efficient. Detecting duplicates with a ROW_NUMBER() function requires a scan, a segment and a sort. What are you doing then with the non-duplicates in the new table? At some point you will have to put them back into the original table.
I think deleting duplicates is the way to go, but I would not re-invent the wheel:
Choose your favourite technique
Edited: a piece of the post was deleted. Strange: it never happened before.
-- Gianluca Sartori
November 24, 2009 at 6:42 am
It also depends on how many duplicate records vs uniqie records you have in the table. How many unique rows (based on email) are there in the table?
November 24, 2009 at 7:14 am
5 Million records are duplicates
November 24, 2009 at 7:22 am
Just to be sure, is rowid unique?
November 24, 2009 at 9:56 pm
Ya row ID is unique
November 25, 2009 at 12:04 am
Okay, here is one way to delete five million rows of data from a table with thirty seven million rows of data.
declare @Batch int;
set @Batch = 10000; -- Batch size to delete. Set this to what ever size you feel works best.
while @Batch > 0
begin
with EmailRecords as (
select
row_number() over (partition by Email order by rowid desc) as RowNumber ,
RowID
from
dbo.table2
)
delete top (@Batch)
from
dbo.table2
from
dbo.table2 t2
inner join EmailRecords er
on (t2.RowID = er.RowID)
where
er.RowNumber > 1;
set @Batch = @@ROWCOUNT; -- Capture how many rows were deleted
-- backup log [yourDBName] ...
-- Here would be a good place to backup your transaction log
-- should your database be using the FULL or BULK_LOGGED recovery model.
-- This will keep the log from growing excessively andd keep your log chain
-- intact.
end
Here is an article I wrote that discusses this process:
http://www.sqlservercentral.com/articles/T-SQL/67898/
In addition to the article, you should also read the discussion thread as well.
Edit: Fix broken link. (Thank you Jeff.)
November 25, 2009 at 8:35 pm
Lynn... the click link you posted is broken...
--Jeff Moden
Change is inevitable... Change for the better is not.
November 26, 2009 at 1:21 am
Thanks Jeff. I fixed the link. 😉
Viewing 9 posts - 1 through 8 (of 8 total)
You must be logged in to reply to this topic. Login to reply