Delete duplicate records

  • I need to determine and then delete duplicate records in one table (Order) in which there are about 50 columns.

    Only OrderBy, OrderDate and OrderCity are the same will be deleted (but only keep one record).

    How to code it?

    For example, the following records 3), 6) need to be deleted.

    OrderBy OrderDate OrderCity

    1) A 02/22/2011 NEW YORK

    2) A 02/22/2011 LONDON

    3) A 02/22/2011 NEW YORK

    4) B 02/22/2011 NEW YORK

    5) A 05/05/2011 NEW YORK

    6) A 02/22/2011 NEW YORK

  • How do you determine which one to keep? note that SQL doesn't order records, so the order you see from a SELECT ORDRBY FROM MYTABLE isn't necessarily what you'll always get.

  • And more importantly do you have a primary key on this table? This is more challenging if you don't.

    _______________________________________________________________

    Need help? Help us help you.

    Read the article at http://www.sqlservercentral.com/articles/Best+Practices/61537/ for best practices on asking questions.

    Need to split a string? Try Jeff Modens splitter http://www.sqlservercentral.com/articles/Tally+Table/72993/.

    Cross Tabs and Pivots, Part 1 – Converting Rows to Columns - http://www.sqlservercentral.com/articles/T-SQL/63681/
    Cross Tabs and Pivots, Part 2 - Dynamic Cross Tabs - http://www.sqlservercentral.com/articles/Crosstab/65048/
    Understanding and Using APPLY (Part 1) - http://www.sqlservercentral.com/articles/APPLY/69953/
    Understanding and Using APPLY (Part 2) - http://www.sqlservercentral.com/articles/APPLY/69954/

  • I dare say that in this case you are probably missing part of the reason for apparent dupes. Do you have an OrderDetail table? Or does the Order table contain all of the pertinent order item detail?

    Each of your "duplicate" entries in this table may not really be duplicates because you have not identified that each record may refer to a different piece of that order. Also, does this table account for refunds or postponed fulfillment?

    Jason...AKA CirqueDeSQLeil
    _______________________________________________
    I have given a name to my pain...MCM SQL Server, MVP
    SQL RNNR
    Posting Performance Based Questions - Gail Shaw[/url]
    Learn Extended Events

  • Yes, there is a primary key on ID column.

    I want to keep min(ID)

  • This will do what you are asking for:

    create table #List

    (

    ListID int identity not null,

    OrderBy char(1),

    OrderDate datetime,

    OrderCity varchar(20)

    )

    insert #List (OrderBy, OrderDate, OrderCity)

    values

    ('A', '02/22/2011', 'NEW YORK'),

    ('A', '02/22/2011', 'LONDON'),

    ('A', '02/22/2011', 'NEW YORK'),

    ('B', '02/22/2011', 'NEW YORK'),

    ('A', '05/05/2011', 'NEW YORK'),

    ('A', '02/22/2011', 'NEW YORK')

    delete #List where ListID not in

    (

    select Min(ListID) as ListIDToKeep

    from #List

    group by OrderBy, OrderDate, OrderCity

    )

    select * from #List

    drop table #List

    Note that I created consumable ddl and data so others can quickly and easily look at this.

    I agree that this is probably not really want you to do but I will leave that decision up to you. It is pretty unusual to have true duplicates in an order table. As suggested there is probably an OrderDetail table or something that holds the line items for the order. If so, this will either fail if there is an actual foreign key on the detail table OR it will create orphaned detail records.

    _______________________________________________________________

    Need help? Help us help you.

    Read the article at http://www.sqlservercentral.com/articles/Best+Practices/61537/ for best practices on asking questions.

    Need to split a string? Try Jeff Modens splitter http://www.sqlservercentral.com/articles/Tally+Table/72993/.

    Cross Tabs and Pivots, Part 1 – Converting Rows to Columns - http://www.sqlservercentral.com/articles/T-SQL/63681/
    Cross Tabs and Pivots, Part 2 - Dynamic Cross Tabs - http://www.sqlservercentral.com/articles/Crosstab/65048/
    Understanding and Using APPLY (Part 1) - http://www.sqlservercentral.com/articles/APPLY/69953/
    Understanding and Using APPLY (Part 2) - http://www.sqlservercentral.com/articles/APPLY/69954/

  • Can you do something simple like this or am I missing the boat on this one?

    CREATE TABLE #temp (OrderBy varchar(1), OrderDate datetime, OrderCity varchar(15))

    GO

    INSERT INTO #temp (OrderBy , OrderDate , OrderCity) VALUES ('A','02/22/2011','NEW YORK')

    INSERT INTO #temp (OrderBy , OrderDate , OrderCity) VALUES ('A','02/22/2011','LONDON')

    INSERT INTO #temp (OrderBy , OrderDate , OrderCity) VALUES ('A','02/22/2011','NEW YORK')

    INSERT INTO #temp (OrderBy , OrderDate , OrderCity) VALUES ('B','02/22/2011','NEW YORK')

    INSERT INTO #temp (OrderBy , OrderDate , OrderCity) VALUES ('A','05/05/2011','NEW YORK')

    INSERT INTO #temp (OrderBy , OrderDate , OrderCity) VALUES ('A','02/22/2011','NEW YORK')

    DELETE T FROM

    (SELECT ROW_NUMBER() OVER(PARTITION BY OrderBy, OrderDate, OrderCity

    ORDER BY OrderBy, OrderDate, OrderCity) As RowNumber,* FROM #Temp )

    T

    WHERE RowNumber > 1

    SELECT * FROM #temp

    DROP TABLE #temp

    Output:

    A2011-02-22 00:00:00.000LONDON

    A2011-02-22 00:00:00.000NEW YORK

    B2011-02-22 00:00:00.000NEW YORK

    A2011-05-05 00:00:00.000NEW YORK

    ______________________________________________________________________________Never argue with an idiot; Theyll drag you down to their level and beat you with experience

  • Sorry Sean, you had posted your reply while I was typing in mine, didn't mean to duplicate the effort! 😉

    ______________________________________________________________________________Never argue with an idiot; Theyll drag you down to their level and beat you with experience

  • MyDoggieJessie (9/23/2011)


    Sorry Sean, you had posted your reply while I was typing in mine, didn't mean to duplicate the effort! 😉

    No problem. Your code accomplished the exact same thing but does it a completely different way, once again proving that there are as many ways to do something as there people that want it done. 😉

    _______________________________________________________________

    Need help? Help us help you.

    Read the article at http://www.sqlservercentral.com/articles/Best+Practices/61537/ for best practices on asking questions.

    Need to split a string? Try Jeff Modens splitter http://www.sqlservercentral.com/articles/Tally+Table/72993/.

    Cross Tabs and Pivots, Part 1 – Converting Rows to Columns - http://www.sqlservercentral.com/articles/T-SQL/63681/
    Cross Tabs and Pivots, Part 2 - Dynamic Cross Tabs - http://www.sqlservercentral.com/articles/Crosstab/65048/
    Understanding and Using APPLY (Part 1) - http://www.sqlservercentral.com/articles/APPLY/69953/
    Understanding and Using APPLY (Part 2) - http://www.sqlservercentral.com/articles/APPLY/69954/

  • adonetok (9/23/2011)


    Yes, there is a primary key on ID column.

    I want to keep min(ID)

    I think the best thing you could do is provide DDL for the tables involved. Otherwise, solutions provided could be far from what you truly need.

    Jason...AKA CirqueDeSQLeil
    _______________________________________________
    I have given a name to my pain...MCM SQL Server, MVP
    SQL RNNR
    Posting Performance Based Questions - Gail Shaw[/url]
    Learn Extended Events

  • Thank you for help.

    I think that both ways will work.

  • Firstly, here is a script which should list all tables and records which are duplicated, PK is removed from this comparison.

    So the script will dynamically group by all the columns excluding the pk.

    Be warned, aggregating a 2 million row table with 20 columns takes loooong.

    DECLARE @sql varchar(max), @table varchar(100)

    DECLARE table_cur CURSOR

    FOR SELECT table_name

    FROM INFORMATION_SCHEMA.TABLES t

    LEFT OUTER JOIN sys.columns c ON c.object_id = OBJECT_ID(t.table_name)

    AND c.is_identity = 1 -- GET ALL IDENTITY COLUMNS

    LEFT OUTER JOIN sys.indexes i ON i.object_id = OBJECT_ID(t.table_name)

    AND i.is_unique = 1 -- GET ALL UNIQUE INDEXES

    LEFT OUTER JOIN sysobjects O ON o.id = OBJECT_ID(t.table_name)

    AND o.xtype = 'PK'

    WHERE c.object_id IS NULL -- ie an identity column doesn't exist

    AND i.object_id IS NULL -- ie a unique index doesn't exist

    AND o.id IS null

    AND t.table_type = 'BASE TABLE'

    AND t.TABLE_name NOT LIKE '%temp%' -- These would be uninteresting.

    AND t.TABLE_NAME NOT LIKE '%tmp%'

    ORDER BY table_name DESC

    OPEN table_cur

    FETCH NEXT FROM table_cur INTO @table

    WHILE @@FETCH_STATUS = 0

    BEGIN

    SET @sql = 'COUNT(*) '+@table

    SELECT @sql = RTRIM(@sql) + ', [' + c.column_name + ']'

    FROM INFORMATION_SCHEMA.COLUMNS c

    WHERE table_name = +RTRIM(@table)

    SET @sql = 'SELECT ' + RTRIM(@sql) + ' FROM [' + RTRIM(@table) + '] '

    +'GROUP BY ' + REPLACE(@sql, 'COUNT(*) '+@table+',', '') +' HAVING COUNT(*) > 1'

    PRINT @sql

    EXEC ( @sql

    )

    FETCH NEXT FROM table_cur INTO @table

    END

    CLOSE table_cur

    DEALLOCATE table_cur

    You can probably then do something like

    select into HoldingTable min(pk),col1,col2,col3,col4 from MyTable group by col1,col2,col3,col4

    alter MyTable name to myTableBackup.

    drop mytable

    select into mytable * from HoldingTable

    --------------

    You might need to drop the pk before dropping the table. Here is a script which will give you a list of all the fk's originating from that pk, which need to be dropped first before dropping the pk, then recreate the table and then re enable all the constraints.

    I suggest you run this and save the entire script somewhere.

    I ever so strongly suggest you do this on a QA environment first and prove it to a fellow dev before playing with production data.

    ------------

    DECLARE @PKTableName VARCHAR(100),

    @PKName varchar(100),

    @FKName varchar(100),

    @sql varchar(max),

    @PKcolumnName varchar(30),

    @table VARCHAR(30),

    @FKColumnName VARCHAR(100)

    SET @PKTableName = 'MyTable'

    set @PKName = ( SELECT name

    FROM sys.indexes

    WHERE OBJECT_NAME(object_id) = @PKTableName

    AND is_primary_key = 1

    )

    set @PKcolumnName = ( SELECT name

    FROM sys.columns

    WHERE OBJECT_NAME(object_id) = @PKTableName

    AND is_identity = 1

    )

    IF EXISTS ( SELECT *

    FROM sys.indexes

    WHERE object_ID(@PKTableName) = object_id

    AND index_id = 0 ) -- IF a heap

    begin

    IF EXISTS ( SELECT *

    FROM sys.tables

    WHERE object_id = OBJECT_ID(N'[dbo].[FKAgainstTableList]') )

    BEGIN

    DROP TABLE FKAgainstTableList

    END

    SELECT OBJECT_NAME(sys.foreign_key_columns.parent_object_id) [Table],

    sys.columns.name [FKColumnName],

    sys.foreign_keys.name [FKName]

    INTO FKAgainstTableList

    FROM sys.foreign_keys

    INNER JOIN sys.foreign_key_columns ON sys.foreign_keys.object_id = sys.foreign_key_columns.constraint_object_id

    INNER JOIN sys.columns ON sys.columns.object_id = sys.foreign_keys.parent_object_id

    AND sys.columns.column_id = sys.foreign_key_columns.parent_column_id

    WHERE OBJECT_NAME(sys.foreign_keys.referenced_object_id) = @PKTableName

    DECLARE table_cur1 CURSOR

    FOR SELECT *

    FROM FKAgainstTableList

    -------------------------------Disable constraint on FK Tables

    OPEN table_cur1

    FETCH NEXT FROM table_cur1 INTO @table, @FKColumnName, @FKName

    WHILE@@FETCH_STATUS = 0

    BEGIN

    SET @sql = 'ALTER TABLE ' + @table + ' DROP CONSTRAINT '

    + @FKName

    PRINT @sql

    FETCH NEXT FROM table_cur1 INTO @table, @FKColumnName, @FKName

    END

    CLOSE table_cur1

    DEALLOCATE table_cur1

    --------------------------------DROP AND recreate CLUSTERED pk

    IF EXISTS ( SELECT 1

    FROM sys.indexes

    WHERE object_id = OBJECT_ID(@PKTableName)

    AND name = @PKName )

    BEGIN

    SET @sql = 'ALTER TABLE ' + @PKTableName + ' DROP CONSTRAINT '

    + @PKName

    PRINT @sql

    END

    SET @sql = 'ALTER TABLE ' + @PKTableName + ' ADD CONSTRAINT '

    + @PKName + ' PRIMARY KEY CLUSTERED (' + @PKcolumnName

    + ' ASC)

    WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, SORT_IN_TEMPDB = OFF, IGNORE_DUP_KEY = OFF, ONLINE = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON, FILLFACTOR = 80) ON [PRIMARY]'

    PRINT ( @sql )

    --------------------------------Enable FK constraints on FK tables.

    DECLARE table_cur2 CURSOR

    FOR SELECT *

    FROM FKAgainstTableList

    OPEN table_cur2

    FETCH NEXT FROM table_cur2 INTO @table, @FKColumnName, @FKName

    WHILE@@FETCH_STATUS = 0

    BEGIN

    SET @sql = 'ALTER TABLE ' + @table

    + ' WITH NOCHECK ADD CONSTRAINT ' + @FKName

    + ' FOREIGN KEY([' + @FKColumnName + '])

    REFERENCES [' + @PKTableName + '] ([' + @PKcolumnName + '])'

    PRINT ( @sql )

    SET @sql = 'ALTER TABLE ' + @table + ' CHECK CONSTRAINT '

    + @FKName

    PRINT ( @sql )

    FETCH NEXT FROM table_cur2 INTO @table, @FKColumnName, @FKName

    END

    CLOSE table_cur2

    DEALLOCATE table_cur2

    DROP TABLE FKAgainstTableList

    end

    Catch-all queries done right [/url]
    Gail Shaw's Performance Blog[/url]

Viewing 12 posts - 1 through 11 (of 11 total)

You must be logged in to reply to this topic. Login to reply