unique Index for existing table with duplicates

  • Hello,

    I have a table with 150,000 records. The application was developed to catch the users entries and insure there would not be any duplicates.

    Well that didn't work.

    As I was reviewing the table, I noticed there are many duplicates. Somehow the users found a way to introduce duplicates. Therefore, I think this needs to be addressed at the database/table level and not with the developer's code.

    I need to add a primary key for guestid and a unique index.

    CREATE TABLE [dbo].[tblguest](

    [GuestID] [int] IDENTITY(1,1) NOT NULL, --needs to be a primary key

    [FirstName] [nvarchar](255) NULL, --needs to be nvarchar(50)

    [LastName] [nvarchar](255) NULL, --needs to be nvarchar(50)

    [BookDate] [datetime] NULL,

    [Telephone] [nvarchar](20) NULL,

    [EmailAddress] [nvarchar](255) NULL,

    [Birthdate] [datetime] NULL,

    [Address] [nvarchar](255) NULL,

    [EmployeeID] [int] NULL,

    [City] [nvarchar](255) NULL,

    [State] [nvarchar](2) NULL,

    [Zip] [nvarchar](15) NULL,

    [TransDate] [datetime] NULL,

    [TransUser] [nvarchar](20) NULL

    I then need to create a unique nonclustered index on firstname, lastname, emailaddress and phone. However, there are some duplicates of these four fields.

    What would be the best approach to make to create the unique index and get rid of the duplicates?

    This is a production table.

    Thanks.

    Tony

    Things will work out.  Get back up, change some parameters and recode.

  • WebTechie38 (1/6/2011)


    Hello,

    What would be the best approach to make to create the unique index and get rid of the duplicates?

    I'd start with getting a straight count on the number of pure duplicates (all columns are equivalent), and the number of address only duplicates. If they match, you only have pure duplicates, which can be corrected easier because then you just need to reassociate the data to a single record and remove the other (my personal preference is to always keep the lowest ID one).

    If they are un-equal, you'll have to use the pure duplicate list to clean up what you can, then pull out the address only ones and have a manual review on the records to determine accuracy. You could shortcut this by only using the most recent one, but I'd have signoff in triplicate from business before you did that.


    - Craig Farrell

    Never stop learning, even if it hurts. Ego bruises are practically mandatory as you learn unless you've never risked enough to make a mistake.

    For better assistance in answering your questions[/url] | Forum Netiquette
    For index/tuning help, follow these directions.[/url] |Tally Tables[/url]

    Twitter: @AnyWayDBA

  • Well, you have to get rid of the duplicates before doing the unique index. Here's how I'd do it:

    WITH CTE AS(

    SELECT [GuestID], [FirstName], [LastName], [BookDate],

    [Telephone], [EmailAddress], [Birthdate], [Address],

    [EmployeeID], [City], [State], [Zip], [TransDate], [TransUser],

    RN = ROW_NUMBER() OVER (PARTITION BY firstname, lastname, emailaddress, telephone ORDER BY TransDate)

    FROM tblguest

    )

    DELETE FROM CTE

    WHERE RN > 1;

    Edit: change the ROW_NUMBER() ORDER BY TO TransDate DESC to keep the newest; keep it as is to keep the first.

    Wayne
    Microsoft Certified Master: SQL Server 2008
    Author - SQL Server T-SQL Recipes


    If you can't explain to another person how the code that you're copying from the internet works, then DON'T USE IT on a production system! After all, you will be the one supporting it!
    Links:
    For better assistance in answering your questions
    Performance Problems
    Common date/time routines
    Understanding and Using APPLY Part 1 & Part 2

  • Um, no. This is not a corrupt database, it's a table that has duplicates that need removing.

    Gail Shaw
    Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
    SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability

    We walk in the dark places no others will enter
    We stand on the bridge and no one may pass
  • Thanks Wayne,

    I followed your suggestion and was able to clean the table up.

    I've heard that CTEs are things to avoid. However, in this instance it worked fine.

    All cleaned up and I've created the appropriate indices.

    Thanks again.

    Things will work out.  Get back up, change some parameters and recode.

Viewing 6 posts - 1 through 5 (of 5 total)

You must be logged in to reply to this topic. Login to reply