January 6, 2011 at 12:28 pm
Hello,
I have a table with 150,000 records. The application was developed to catch the users' entries and ensure there would not be any duplicates.
Well that didn't work.
As I was reviewing the table, I noticed there are many duplicates. Somehow the users found a way to introduce duplicates. Therefore, I think this needs to be addressed at the database/table level and not with the developer's code.
I need to add a primary key on GuestID and a unique index.
CREATE TABLE [dbo].[tblguest](
[GuestID] [int] IDENTITY(1,1) NOT NULL, --needs to be a primary key
[FirstName] [nvarchar](255) NULL, --needs to be nvarchar(50)
[LastName] [nvarchar](255) NULL, --needs to be nvarchar(50)
[BookDate] [datetime] NULL,
[Telephone] [nvarchar](20) NULL,
[EmailAddress] [nvarchar](255) NULL,
[Birthdate] [datetime] NULL,
[Address] [nvarchar](255) NULL,
[EmployeeID] [int] NULL,
[City] [nvarchar](255) NULL,
[State] [nvarchar](2) NULL,
[Zip] [nvarchar](15) NULL,
[TransDate] [datetime] NULL,
[TransUser] [nvarchar](20) NULL
)
I then need to create a unique nonclustered index on FirstName, LastName, EmailAddress and Telephone. However, some existing rows are duplicated across those four fields.
What would be the best approach to get rid of the duplicates and then create the unique index?
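For reference, the end state I'm after looks roughly like this (just a sketch; the constraint and index names are placeholders, and I know the duplicates have to go before the unique index will build):

ALTER TABLE dbo.tblguest
    ADD CONSTRAINT PK_tblguest_GuestID PRIMARY KEY CLUSTERED (GuestID);

-- Placeholder name; this is the index that will fail until the duplicates are removed
CREATE UNIQUE NONCLUSTERED INDEX UX_tblguest_Name_Email_Phone
    ON dbo.tblguest (FirstName, LastName, EmailAddress, Telephone);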
This is a production table.
Thanks.
Tony
Things will work out. Get back up, change some parameters and recode.
January 6, 2011 at 12:36 pm
WebTechie38 (1/6/2011)
Hello, What would be the best approach to get rid of the duplicates and then create the unique index?
I'd start by getting a straight count of the pure duplicates (every column is equivalent) and of the rows that are duplicates on just the key columns (name, e-mail address, phone). If the counts match, you only have pure duplicates, which are easier to correct because you just need to keep a single record and remove the others (my personal preference is to always keep the one with the lowest ID).
If the counts are unequal, you'll have to use the pure-duplicate list to clean up what you can, then pull out the key-only duplicates and have the records reviewed manually to determine which are accurate. You could shortcut this by only keeping the most recent one, but I'd have sign-off in triplicate from the business before doing that.
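Something like this would give you the two counts to compare (just a sketch; I'm treating the four columns you want the unique index on as the duplicate key):

-- Extra rows that are pure duplicates (every column except GuestID matches)
SELECT SUM(cnt - 1) AS PureDuplicateRows
FROM (
    SELECT COUNT(*) AS cnt
    FROM dbo.tblguest
    GROUP BY FirstName, LastName, BookDate, Telephone, EmailAddress,
             Birthdate, [Address], EmployeeID, City, [State], Zip,
             TransDate, TransUser
    HAVING COUNT(*) > 1
) AS d;

-- Extra rows that are duplicates on just the proposed key columns
SELECT SUM(cnt - 1) AS KeyDuplicateRows
FROM (
    SELECT COUNT(*) AS cnt
    FROM dbo.tblguest
    GROUP BY FirstName, LastName, EmailAddress, Telephone
    HAVING COUNT(*) > 1
) AS d;

If the two numbers match, every duplicate is a pure duplicate and the cleanup is mechanical.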
Never stop learning, even if it hurts. Ego bruises are practically mandatory as you learn unless you've never risked enough to make a mistake.
For better assistance in answering your questions | Forum Netiquette
For index/tuning help, follow these directions. | Tally Tables
Twitter: @AnyWayDBA
January 6, 2011 at 12:47 pm
Well, you have to get rid of the duplicates before creating the unique index. Here's how I'd do it:
WITH CTE AS(
SELECT [GuestID], [FirstName], [LastName], [BookDate],
[Telephone], [EmailAddress], [Birthdate], [Address],
[EmployeeID], [City], [State], [Zip], [TransDate], [TransUser],
-- RN = 1 is the row kept in each duplicate group (earliest TransDate)
RN = ROW_NUMBER() OVER (PARTITION BY firstname, lastname, emailaddress, telephone ORDER BY TransDate)
FROM tblguest
)
DELETE FROM CTE
WHERE RN > 1;
Edit: change the ROW_NUMBER() ORDER BY to TransDate DESC to keep the newest row; keep it as is to keep the earliest.
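Since it's a production table, you may want to preview what would be removed before running the DELETE; swapping the DELETE for a SELECT does that, and the GROUP BY afterwards confirms the cleanup (same CTE, just shown as a check):

-- Preview: these are the rows the DELETE would remove
WITH CTE AS(
    SELECT GuestID, FirstName, LastName, EmailAddress, Telephone, TransDate,
           RN = ROW_NUMBER() OVER (PARTITION BY FirstName, LastName, EmailAddress, Telephone
                                   ORDER BY TransDate)
    FROM tblguest
)
SELECT *
FROM CTE
WHERE RN > 1;

-- After the DELETE, this should return no rows
SELECT FirstName, LastName, EmailAddress, Telephone, COUNT(*) AS cnt
FROM tblguest
GROUP BY FirstName, LastName, EmailAddress, Telephone
HAVING COUNT(*) > 1;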
Wayne
Microsoft Certified Master: SQL Server 2008
Author - SQL Server T-SQL Recipes
January 6, 2011 at 7:06 pm
January 7, 2011 at 12:57 am
luckysql.kinda (1/6/2011)
MS SQL Corruption: "Msg 2627 - Violation of PRIMARY KEY constraint"
Um, no. This is not a corrupt database, it's a table that has duplicates that need removing.
Gail Shaw
Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability
January 9, 2011 at 12:59 pm
Thanks Wayne,
I followed your suggestion and was able to clean the table up.
I've heard that CTEs are something to avoid, but in this instance it worked fine.
All cleaned up and I've created the appropriate indices.
Thanks again.
Things will work out. Get back up, change some parameters and recode.