Delete duplicate rows from ANY table.

  • ben.brugman (8/3/2015)


    This weekend I have been away from the keyboard. And as often the best idea's occure when away from the keyboard. E.g. on my bike (pushbike) or under the shower. This weekend I came up with the following:

    select * into #D from D1

    Delete D1

    insert into D1 select distinct * from #D

    This works for most tables.

    But does not work for tables which have an identity and not for tables with the larger datatypes like XML.

    Maybe I should take more showers.;-)

    Thanks all for you time and attention.

    Ben

    Both of those conditions are "detectable", and the IDENTITY problem can be gone around by SET IDENTITY INSERT ON. The problem with large data types is detectable, but probably not so easy to go around.

    Steve (aka sgmunson) πŸ™‚ πŸ™‚ πŸ™‚
    Rent Servers for Income (picks and shovels strategy)

  • sgmunson (8/3/2015)


    ben.brugman (8/3/2015)


    This weekend I have been away from the keyboard. And as often the best idea's occure when away from the keyboard. E.g. on my bike (pushbike) or under the shower. This weekend I came up with the following:

    select * into #D from D1

    Delete D1

    insert into D1 select distinct * from #D

    This works for most tables.

    But does not work for tables which have an identity and not for tables with the larger datatypes like XML.

    Maybe I should take more showers.;-)

    Thanks all for you time and attention.

    Ben

    Both of those conditions are "detectable", and the IDENTITY problem can be gone around by SET IDENTITY INSERT ON. The problem with large data types is detectable, but probably not so easy to go around.

    Using SET IDENTITY INSERT ON won't help as it needs an explicit column list. Also, identity columns will very likely prevent duplicates to be detected unless the identity value was reset.

    Computed columns will also create a problem as you can't insert values into them.

    Why won't you use distinct in the select into to reduce writes? Why don't you use the option I gave you to reduce them even more?

    Luis C.
    General Disclaimer:
    Are you seriously taking the advice and code from someone from the internet without testing it? Do you at least understand it? Or can it easily kill your server?

    How to post data/code on a forum to get the best help: Option 1 / Option 2
  • ben.brugman (8/3/2015)


    Enforcing constriants often prevents actions. Most of the time it prevents 'illegal' actions. But sometimes it prevents actions, which are functional.

    In general I am in favor to enforce constraints in the database whenever feasable.

    Ben

    If a constraint prevents actions that are valid, it's a design fault. Can you give an example on a constraint that prevents actions which are functional?

    Luis C.
    General Disclaimer:
    Are you seriously taking the advice and code from someone from the internet without testing it? Do you at least understand it? Or can it easily kill your server?

    How to post data/code on a forum to get the best help: Option 1 / Option 2
  • Luis Cazares (8/3/2015)


    If a constraint prevents actions that are valid, it's a design fault. Can you give an example on a constraint that prevents actions which are functional?

    We have clients, each client exists in real live only once. In the model each client exists only once and almost all data is clustered on the client entity ID (artificial key).

    Sometimes the real life and the database do not agree, offcourse the real life is in error.

    So sometimes a client gets a second entry in the database. This should not happen, but it does. We consider this as an exception.

    Because all data is organised around clients and the clustering is around clients we like to keep the implementation where the data from the client is clustered. Most constraints are set around this client and often an extra item from a row (which makes the row unique for this client).

    (This design is not linked to the Delete duplicate rows question, the duplicated rows are in test/trial/development situations only.)

    So a lot of data is client orientated. In tables refering eachother several levels deep.

    So what if it is detected that a client has two distinct sets of data. (Or even more).

    Then we want to merge the two clients into a single client.

    What we want to do is updating the client Entity ID of one of the clients into the other client Entity ID.

    If you have a tables refering to each other this update is not allowed. Supose you have 4 levels. You can not alter the toplevel because it has children. You can not alter the bottem level because the top does not exist yet.

    We have solved this conundrum, because of identities, unique constraints, references this was not easy.

    Alternatives would be:

    On 'reading' the data first resolve the clients entity ID into more than one ID and work with that. Which would give a performance penalty (and more complex code).

    Not implementing all types of constraints. (Not nice).

    Switching off constraints during the action. (Not nice).

    Ben

    The above is a simplified version of the real life version of the problem.

    I am prepared to anwser any questions about the design. (And defend the design, because it is succesfull). Within the forum I am restricted in my anwsers.

  • ben.brugman (8/3/2015)


    Luis Cazares (8/3/2015)


    If a constraint prevents actions that are valid, it's a design fault. Can you give an example on a constraint that prevents actions which are functional?

    We have clients, each client exists in real live only once. In the model each client exists only once and almost all data is clustered on the client entity ID (artificial key).

    Sometimes the real life and the database do not agree, offcourse the real life is in error.

    So sometimes a client gets a second entry in the database. This should not happen, but it does. We consider this as an exception.

    Because all data is organised around clients and the clustering is around clients we like to keep the implementation where the data from the client is clustered. Most constraints are set around this client and often an extra item from a row (which makes the row unique for this client).

    (This design is not linked to the Delete duplicate rows question, the duplicated rows are in test/trial/development situations only.)

    So a lot of data is client orientated. In tables refering eachother several levels deep.

    So what if it is detected that a client has two distinct sets of data. (Or even more).

    Then we want to merge the two clients into a single client.

    What we want to do is updating the client Entity ID of one of the clients into the other client Entity ID.

    If you have a tables refering to each other this update is not allowed. Supose you have 4 levels. You can not alter the toplevel because it has children. You can not alter the bottem level because the top does not exist yet.

    We have solved this conundrum, because of identities, unique constraints, references this was not easy.

    Alternatives would be:

    On 'reading' the data first resolve the clients entity ID into more than one ID and work with that. Which would give a performance penalty (and more complex code).

    Not implementing all types of constraints. (Not nice).

    Switching off constraints during the action. (Not nice).

    Ben

    The above is a simplified version of the real life version of the problem.

    I am prepared to anwser any questions about the design. (And defend the design, because it is succesfull). Within the forum I am restricted in my anwsers.

    Entering the "same" client information happens all the time and merging accounts is very very common. We do this kind of thing all the time and don't have nearly this level of challenge. Merging accounts does not mean you have to create a third account and move all data from account 1 and account 2 into account 3. Just pick one of the two and move the data to that account.

    In your example of being unable to move data it is easily solved by moving the bottom most level to one of the existing accounts. Do the same thing all the way the chain and all you have left is an empty Client Master row. Delete that row and they are merged.

    Not saying there aren't real world situation where this isn't challenging but your example is not one where the constraints prevented a legitimate data change.

    _______________________________________________________________

    Need help? Help us help you.

    Read the article at http://www.sqlservercentral.com/articles/Best+Practices/61537/ for best practices on asking questions.

    Need to split a string? Try Jeff Modens splitter http://www.sqlservercentral.com/articles/Tally+Table/72993/.

    Cross Tabs and Pivots, Part 1 – Converting Rows to Columns - http://www.sqlservercentral.com/articles/T-SQL/63681/
    Cross Tabs and Pivots, Part 2 - Dynamic Cross Tabs - http://www.sqlservercentral.com/articles/Crosstab/65048/
    Understanding and Using APPLY (Part 1) - http://www.sqlservercentral.com/articles/APPLY/69953/
    Understanding and Using APPLY (Part 2) - http://www.sqlservercentral.com/articles/APPLY/69954/

  • ben.brugman (8/3/2015)


    This weekend I have been away from the keyboard. And as often the best idea's occure when away from the keyboard. E.g. on my bike (pushbike) or under the shower. This weekend I came up with the following:

    select * into #D from D1

    Delete D1

    insert into D1 select distinct * from #D

    This works for most tables.

    But does not work for tables which have an identity and not for tables with the larger datatypes like XML.

    Maybe I should take more showers.;-)

    Thanks all for you time and attention.

    Ben

    You can cancel the identity property by using a union or union all:

    select top (0) * into #D from D1 union all select top (0) * from d1

    insert into D1 select distinct * from #D

    SQL DBA,SQL Server MVP(07, 08, 09) "It's a dog-eat-dog world, and I'm wearing Milk-Bone underwear." "Norm", on "Cheers". Also from "Cheers", from "Carla": "You need to know 3 things about Tortelli men: Tortelli men draw women like flies; Tortelli men treat women like flies; Tortelli men's brains are in their flies".

  • ben.brugman (8/3/2015)


    This weekend I have been away from the keyboard. And as often the best idea's occure when away from the keyboard. E.g. on my bike (pushbike) or under the shower. This weekend I came up with the following:

    select * into #D from D1

    Delete D1

    insert into D1 select distinct * from #D

    This works for most tables.

    But does not work for tables which have an identity and not for tables with the larger datatypes like XML.

    Maybe I should take more showers.;-)

    Thanks all for you time and attention.

    Ben

    Also wouldn't work if you have a TIMESTAMP column. Actually, the SELECT * INTO would work but the INSERT INTO would fail.

    There is a way to make this idea work but there are some rules to make it happen.

  • Sean Lange (8/3/2015)

    In your example of being unable to move data it is easily solved by moving the bottom most level to one of the existing accounts. Do the same thing all the way the chain and all you have left is an empty Client Master row. Delete that row and they are merged.

    This would violate the constraint between the bottom most level and one level up. (There is a constraint which includes the client entity ID, there).

    If you have a tables refering to each other this update is not allowed. Supose you have 4 levels. You can not alter the toplevel because it has children. You can not alter the bottem level because the top does not exist yet.

    Sorry this should read were the next levels up do not not exist. Top does exist but all the levels below the top do not exist (yet). Sorry.

    Ben

  • ScottPletcher (8/3/2015)


    ben.brugman (8/3/2015)


    This weekend I have been away from the keyboard. And as often the best idea's occure when away from the keyboard. E.g. on my bike (pushbike) or under the shower. This weekend I came up with the following:

    select * into #D from D1

    Delete D1

    insert into D1 select distinct * from #D

    This works for most tables.

    But does not work for tables which have an identity and not for tables with the larger datatypes like XML.

    Maybe I should take more showers.;-)

    Thanks all for you time and attention.

    Ben

    You can cancel the identity property by using a union or union all:

    select top (0) * into #D from D1 union all select top (0) * from d1

    insert into D1 select distinct * from #D

    Nice tip, thanks Scott.

    β€œWrite the query the simplest way. If through testing it becomes clear that the performance is inadequate, consider alternative query forms.” - Gail Shaw

    For fast, accurate and documented assistance in answering your questions, please read this article.
    Understanding and using APPLY, (I) and (II) Paul White
    Hidden RBAR: Triangular Joins / The "Numbers" or "Tally" Table: What it is and how it replaces a loop Jeff Moden

  • ChrisM@Work (8/4/2015)


    ScottPletcher (8/3/2015)


    You can cancel the identity property by using a union or union all:

    select top (0) * into #D from D1 union all select top (0) * from d1

    insert into D1 select distinct * from #D

    Nice tip, thanks Scott.

    But the code results in a

    'An explicit value for the identity column in table 'D1' can only be specified when a column list is used and IDENTITY_INSERT is ON.'

    Ben

    See:

    exec sp_drop #D

    exec sp_drop D1

    go

    Create table D1 (

    text varchar(300),

    number1 int,

    OK INT IDENTITY(1,1))

    insert into D1 values('A', 123)

    insert into D1 values('B', 234)

    insert into D1 values('C', 345)

    insert into D1 values('A', 234)

    insert into D1 values('b', 234)

    select * into #D from D1 union select * from d1

    insert into D1 select distinct * from #D -- This results in a 'can only be specified when a column list' error

    Ben

  • ben.brugman (8/4/2015)


    Sorry this should read were the next levels up do not not exist. Top does exist but all the levels below the top do not exist (yet).

    That's really really easy then, probably the best case scenario. Merge the duplicate top levels based on whatever rules you have for this situation then delete the one with no children.

    If both have children then follow this for each table in parent -> child order:

    If both client entities do not exist, create the missing one.

    Update both rows with same information except client id, based on your rules for merging each table.

    Once all are done, delete the redundant client entity in child -> parent order.

    You can shortcut this a bit once you reach a table with no child rows but it's probably simpler to do it this way to cover all cases.

  • ben.brugman (8/4/2015)


    ChrisM@Work (8/4/2015)


    ScottPletcher (8/3/2015)


    You can cancel the identity property by using a union or union all:

    select top (0) * into #D from D1 union all select top (0) * from d1

    insert into D1 select distinct * from #D

    Nice tip, thanks Scott.

    But the code results in a

    'An explicit value for the identity column in table 'D1' can only be specified when a column list is used and IDENTITY_INSERT is ON.'

    Ben

    See:

    exec sp_drop #D

    exec sp_drop D1

    go

    Create table D1 (

    text varchar(300),

    number1 int,

    OK INT IDENTITY(1,1))

    insert into D1 values('A', 123)

    insert into D1 values('B', 234)

    insert into D1 values('C', 345)

    insert into D1 values('A', 234)

    insert into D1 values('b', 234)

    select * into #D from D1 union select * from d1

    insert into D1 select distinct * from #D -- This results in a 'can only be specified when a column list' error

    Ben

    D1 has 3 columns including an identity column, so inserting * from #D which is 3 columns will try to insert into the identity column.

    Either of these should work, but you're back to the problem of determining column names:

    insert into D1 select distinct text, number1 from #D (will not preserve original values of OK column but should remove duplicates)

    or

    set identity_insert D1 on

    insert into D1 (text, number1, OK) select text, number1, OK from #D (won't remove duplicates whether or not distinct is specified)

    set identity_insert D1 off

  • Gazareth (8/4/2015)

    Either of these should work, but you're back to the problem of determining column names:

    Yes back to the problem of determining the column names. And then de solution becomes 'far' worse than a number of other presented solutions. (A row_number partinioned by the the column names is than superiour in all (or most) aspects).

    But because it is rare for a table with an identity column to have duplicates I am happy enough with the solution which does not cover identity situations.

    Ben

  • Gazareth (8/4/2015)That's really really easy then, probably the best case scenario. Merge the duplicate top levels based on whatever rules you have for this situation then delete the one with no children.

    If both have children then follow this for each table in parent -> child order:

    Starting at the top you can not merge the 'second' level to the other client because there is the constraint with the third level. So the second level can not be 'moved' if there is still a third level.

    My presentation of the simplified problem was maybe not extensive enough. Sorry for that, but I thought that I was clear enough that each level has constraints to the tables above including the client entity ID and has constraints with the table below including the client entity ID. Sorry if that was not clear.

    As said we have solved the problem, but is was not simple. Even the simplified version has a number of these caveats. And the real version has more. (For example that a merge action should be reversible.

    Ben

  • Starting at the top you can not merge the 'second' level to the other client because there is the constraint with the third level. So the second level can not be 'moved' if there is still a third level.

    The first example given stated no children, so there's no way you can violate a constraint in a child table.

    For cases where there are children, you can avoid violating constraints by deleting in child -> parent order as I specified.

    When I say 'merge' I mean the non-key properties, e.g. if you have 2 addresses for a client, one for each entity but with a difference in spelling or spacing, you'd need some logic to determine which value to keep.

    At the intermediate step of the clean-up, you should have 2 identical rows that only differ in the client entity value.

    My presentation of the simplified problem was maybe not extensive enough. Sorry for that, but I thought that I was clear enough that each level has constraints to the tables above including the client entity ID and has constraints with the table below including the client entity ID. Sorry if that was not clear.

    It's perfectly clear, and a fairly common problem.

    Perhaps my explanation of a solution wasn't clear enough. With some demo tables & data it would be easy enough to write a worked solution for you.

Viewing 15 posts - 31 through 45 (of 56 total)

You must be logged in to reply to this topic. Login to reply