November 20, 2009 at 2:50 am
Hi All,
We have one 37-million-row table with 20+ columns holding person information. Almost 7 million of the records have duplicate emails, but the other fields are different.
Below is the approach we have in mind; please suggest something better.
Insert into New
select * from [37million]
where Rid in (select max(Rid) from [37million] group by email)
Please suggest something better than this.
Thanks in advance.
November 20, 2009 at 2:59 am
You'll need to select the distinct addresses and count where the same address appears more than once...
select email
from [37million]
group by email
having count(*) > 1
Note: this will identify all the emails appearing more than once. You will need an extra line of filtering when deleting or inserting, or else you'll move all the qualifying rows, not only the duplicates.
Use your RID to identify the rows to move...
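Something along these lines (a rough, untested sketch; it assumes the Rid and email columns from your post and keeps the highest Rid per address, as your original query does):

select t.Rid, t.email
from [37million] t
where exists (select 1
              from [37million] x
              where x.email = t.email
                and x.Rid > t.Rid); -- every row returned is an extra copy, safe to delete or to skip when inserting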
--------------------------------------------------
This thing is addressing problems that don't exist. It's solutionism at its worst. We are dumbing down machines that are inherently superior. - Gilfoyle
November 20, 2009 at 3:19 am
Deleting duplicates in place will take a long time, I guess,
so I want to create a fresh table with one record per email, including the non-duplicated ones. Once they are inserted, I'll drop the old table.
Please suggest the best way to do that.
November 20, 2009 at 6:45 am
Insert into New
select * from [37million]
where Rid in (select max(Rid) from [37million] group by email)
or
DELETE FROM [Table]
WHERE [RID] IN
(
    SELECT a.[RID]
    FROM [Table] a
    JOIN [Table] b
      ON a.[Email] = b.[Email]
     AND a.[RID] < b.[RID] -- the != comparison is redundant once < is applied; this keeps the highest RID per email
)
Please tell me which of the above two is better.
November 20, 2009 at 6:59 am
Can you provide partial DDL for the table (the primary key and e-mail columns) and some sample data? Also, how do you want to identify which record of a duplicate e-mail address to keep?
November 20, 2009 at 7:07 am
Sample data is:
Id FirstName LastName Address1 Address2 City State Zip .......
20+ columns, ending with Emailaddress
All have datatype nvarchar(255) except the Id column.
Any of the duplicate email records is OK to keep; there is no priority like that.
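For reference, the table is roughly this shape (hypothetical DDL reconstructed from the description above; the actual column names and the Id type may differ):

create table dbo.[37MillionRecTable] (
    Id           int           not null, -- the only non-nvarchar column
    FirstName    nvarchar(255) null,
    LastName     nvarchar(255) null,
    Address1     nvarchar(255) null,
    Address2     nvarchar(255) null,
    City         nvarchar(255) null,
    State        nvarchar(255) null,
    Zip          nvarchar(255) null,
    -- ... remaining person columns, 20+ in total ...
    Emailaddress nvarchar(255) null
);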
November 20, 2009 at 7:30 am
with EmailRecords as (
    select
        row_number() over (partition by Emailaddress order by Id desc) as RowNumber, -- assuming the highest Id is the most recent
        Id,
        FirstName,
        LastName[,...]
    from
        dbo.[37MillionRecTable]
)
insert into dbo.[NewTable]
select
    Id,
    FirstName[,...] -- note: do not include RowNumber
from
    EmailRecords
where
    RowNumber = 1;
/*
After running this, you can drop the original table then rename [NewTable] to the original table name.
*/
The above code is untested and truly incomplete, as you didn't really provide us with anything to allow testing.
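If it helps, the final swap could look like this (a sketch only; it assumes nothing else, such as foreign keys or views, references the old table):

drop table dbo.[37MillionRecTable];
exec sp_rename 'dbo.NewTable', '37MillionRecTable';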
November 20, 2009 at 1:45 pm
How many duplicates do you have?
This might do it for you. Set @rcnt to the number of duplicates.
I know this is effectively a cursor, but I do come from the Oracle world.
You will need to change it for your tables, but I just used this to delete over 600 duplicates in 600,000 records. You could also loop on @@ROWCOUNT instead of counting down @rcnt.
declare @rcnt int;
select @rcnt = 700;

while @rcnt > 0
begin
    delete top (1)
    from tmp_industry_stock
    where customer in (select customer
                       from tmp_industry_stock
                       group by Country, Sector, industry, Customer
                       having count(*) > 1);

    select @rcnt = @rcnt - 1;
end
November 23, 2009 at 12:49 pm
LOL, I had the same issue (~100M rows)... BUT do you have a proper way to say WHICH of those records are the VALID ones? It might not be the MAX(ID); take that into consideration...
The solution I used was to delete the duplicates in batches, along the lines sketched below (not sure what servers other people have, but mine would just crawl if I tried to move such a big amount in one transaction on a live system, and the log file would just blow up). For starters, try to remove the duplicates within every X million records (note that duplicate emails won't be too far apart, ID-wise: people have a tendency to double-register within a short period of time, so the copies sit close together anyway; this assumes no index on the email column). When you're done with this type of removal, either go bigger (XX million) or try the whole table. Watch your performance indicators before deciding to go bigger 🙂
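An untested sketch of such a batched delete (it assumes SQL Server 2005+ for DELETE TOP and the Rid/Email columns from the earlier posts; the batch size is a made-up number to tune against your log and IO):

while 1 = 1
begin
    delete top (100000) t
    from [37million] as t
    where exists (select 1
                  from [37million] as b
                  where b.Email = t.Email
                    and b.Rid > t.Rid); -- keeps the highest Rid per email

    if @@rowcount = 0 break; -- stop once no duplicates remain
end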
Vasc
November 24, 2009 at 7:35 am
bwilliams:
1) Welcome to SQL Server and the SQLServerCentral forums.
2) PLEASE PLEASE PLEASE do yourself a favor and learn set-based operations. The code you posted would be an unbelievable DOG on SQL Server, massively inefficient compared to how SQL Server best operates. You are in a different world now, and when in Rome... 🙂 A set-based version of your delete is sketched below.
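Something like this (untested, and only a sketch; it reuses the ROW_NUMBER technique from Lynn's post above, with an arbitrary order within each group since any surviving row will do):

with dupes as (
    select row_number() over (partition by Country, Sector, industry, Customer
                              order by Customer) as rn
    from tmp_industry_stock
)
delete from dupes
where rn > 1; -- one pass removes every extra copy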
Best,
Kevin G. Boles
SQL Server Consultant
SQL MVP 2007-2012
TheSQLGuru on googles mail service