Just For Fun: An Impossible Delete

  • You weren't the only one to make either of those mistakes. I've seen several contributors get stuck in the same tar pits over the last 7 pages of comments.

    --Andrew

  • RBarry,

    Thanks for a great article and a wonderful thread of comments. Others have said it, but I'll gladly reinforce the idea that the techniques and the generalized information are the point here, and that goal has been more than met. Yes, the use of a tally table pushes up against the constraint of no temp table, which may be read as no supplemental table, but so what? And an RBAR solution that grinds and grinds forever in a multi-GB table would maybe still have been OK in the '70s, because in 1970 who knew what a megabyte was, never mind a gigabyte?

    Speaking of memory constraints, this reminds me of hearing, a few years ago, one of the old-timers who in the 1950s helped design the MICR (Magnetic Ink Character Recognition) cheque encoding system at Stanford Research Institute and Bank of America. He said they had to take great care to be efficient in their hand-built machine code so everything would fit into 8K. Oh, and the "K" was for kilobits, not kilobytes.

  • Thanks for the feedback, John.

    [font="Times New Roman"]-- RBarryYoung[/font], [font="Times New Roman"] (302)375-0451[/font] blog: MovingSQL.com, Twitter: @RBarryYoung[font="Arial Black"]
    Proactive Performance Solutions, Inc.
    [/font]
    [font="Verdana"] "Performance is our middle name."[/font]

  • OK. So I couldn't let go and just had to try to come up with a query to do this. To restate the problem: we want a list of the second and subsequent dups for each row; all others, including singletons, would be deleted. The overall process would be pretty much as Barry proposed:

    --Use the given table as a temp-in-place to hold the desired final result set, with each row somehow distinguished from the original data. I chose to make the age negative, ignoring the possibility of age = 0.

    --Remove all of the original data. (I don't show that code here.)

    --Flip the distinguishing factor (the sign of "age") so that the result set has the characteristics of the original data. (I don't show that code here either.)

    Of course, the core question is how, in that first step, to create the list of duplicates, not including the "originals". If we have a Tally table, then this code will do the trick, letting the db engine hold the intermediate result internally as a subquery rather than in unused bits of the original table. (I used the create/insert code from another post above -- this thread is a collaboration, right?):

    create table #tab (lname varchar(5), age smallint, sex char(1))

    insert into #tab (lname, age, sex) select 'ABC', 24, 'M'
    insert into #tab (lname, age, sex) select 'ABC', 24, 'M'
    insert into #tab (lname, age, sex) select 'LMN', 27, 'M'
    insert into #tab (lname, age, sex) select 'LMN', 27, 'M'
    insert into #tab (lname, age, sex) select 'LMN', 27, 'M'
    insert into #tab (lname, age, sex) select 'PQRS', 25, 'F'
    insert into #tab (lname, age, sex) select 'XYZ', 24, 'M'
    insert into #tab (lname, age, sex) select 'XYZ', 25, 'M'

    select lname
          ,age * -1 as negAge
          ,sex
      from (select T.lname
                  ,T.age
                  ,T.sex
                  ,count(*) as Cnt
              from #tab T
             group by T.lname, T.age, T.sex
            having count(*) > 1) as CountList
      join Tally
        on CountList.Cnt - 1 >= Tally.SerialNumber

    Now I'll confess this is the first time I've tried using the Tally table method for anything, so I just made one in my sandbox db without an index of any sort, and I suspect a 3,000,000-row table scan is why this query seems to take 8 seconds no matter how many rows I've poured into #TAB. And one more thing: this was a wonderful learning experience. I wasn't trying so much to improve on anyone's solution, but I found that just trying to find an alternative approach helped me understand the concept of a tally table. The hour or so spent was well worth it just for that.
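    For completeness, the two steps I didn't show could be sketched like this (assuming the select above is wrapped in an insert into #tab, and that no legitimate row has a negative or zero age):

```sql
-- Step 2: remove the original data (all original rows still have a positive age,
-- while the rows we just inserted carry a negated age)
delete from #tab where age > 0

-- Step 3: flip the sign back so the surviving duplicates
-- have the characteristics of the original data
update #tab set age = age * -1
```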

  • john.arnott (8/6/2008)


    I wasn't trying so much to improve on anyone's solution, but found that just trying to find an alternative approach helped me understand the concept of a tally table. The hour or so spent was well worth it just for that.

    Gotta love it when some unintended good comes about. 🙂

    I don't know if you need more on the Tally table, but if you do, I try to explain a bit in the following article. And, nice job using the Tally table for this solution!

    http://www.sqlservercentral.com/articles/TSQL/62867/

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Hallo. Regarding the original puzzle, why not return the desired result directly, like this?

    -- insert into SOURCE
    select [Name], Age, case Sex when 'M' then 'm' else 'f' end
      from (select *, count(*) as cnt
              from SOURCE
             group by Name, Age, Sex
            having 1 < count(*)) x
      join Tally
        on cnt > Tally.iValue

    -- Marek

  • Marek Skotnica (8/7/2008)


    Hallo. Regarding the original puzzle, why not return the desired result directly, like this?

    -- insert into SOURCE
    select [Name], Age, case Sex when 'M' then 'm' else 'f' end
      from (select *, count(*) as cnt
              from SOURCE
             group by Name, Age, Sex
            having 1 < count(*)) x
      join Tally
        on cnt > Tally.iValue

    Because that was not the problem. The problem was to change the existing table's contents, which is considerably harder than just returning a dataset.

    -- RBarryYoung

  • john.arnott (8/6/2008)


    Now I'll confess this is the first time I've tried using the Tally table method for anything, so I just made one in my sandbox db without an index of any sort, and I suspect a 3,000,000-row table scan is why this query seems to take 8 seconds no matter how many rows I've poured into #TAB.

    John,

    In the original article, RBarry used a filter in the select:

    Where Tally.Number Between 2 and 127

    Adding that and putting a clustered index on your tally table might improve your response time. (Next chance I get, I'm going to re-read Jeff's article on tally tables.) 🙂
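    For reference, a minimal tally table with a clustered index might be built like this (the column name and row count are illustrative; the thread variously calls the column SerialNumber, iValue, and Number):

```sql
-- Build a Tally table of sequential integers; the clustered primary key
-- lets the join seek on SerialNumber instead of scanning millions of rows.
create table Tally (SerialNumber int not null primary key clustered)

insert into Tally (SerialNumber)
select top (1000000) row_number() over (order by (select null))
  from sys.all_columns a
 cross join sys.all_columns b
```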

    RBarry,

    I think Marek was just suggesting some minor tweaks to John's subselect:

    1) Changing case on gender instead of flipping the sign of age, and

    2) joining on count > tally.number instead of count-1 >= tally.number.

    I'll say it again: Great article!

    I'll bet you never thought it would generate so much discussion!

  • Andy DBA (8/7/2008)


    RBarry,

    ...

    I'll say it again: Great article!

    I'll bet you never thought it would generate so much discussion!

    Thanks, Andy. And yeah, I thought that hardly anyone would even be interested in this, let alone want to discuss it.

    -- RBarryYoung

  • Andy DBA,

    Adding the clustered index to my TALLY table did the trick. As I suspected it would, the query returns instantly now without bothering with a range filter. Shame on me for not taking the 30 seconds to do that originally.

  • These are important ideas, even more so today as enterprise data keeps growing. My concern with the solution is that you are inserting rows into the table, and I don't see the difference between this and using an intermediate table.

    I believe you should be able to do this without doing the insert if you just update the available bits instead.

    There are many times, with a disk-full condition, that you would not be able to do the insert but would still need to remove the dupes.

    Cheers,

    John R. Hanson

    Chief Applications Architect

    DesignMind Business Solutions

  • John - you missed the whole point - it's to keep the duplicates 🙂

    Hiding under a desk from SSIS Implementation Work :crazy:

  • John R. Hanson (8/21/2008)


    These are important ideas, even more so today as enterprise data keeps growing. My concern with the solution is that you are inserting rows into the table, and I don't see the difference between this and using an intermediate table.

    As I said in the article, the rules posed by the original poster did seem a little arbitrary. I freely admit that direct application of this puzzle to everyday situations today might strain credulity a bit. In the (remote?) past it was somewhat different: space issues were very important, but structural issues were absolutely overwhelming. It was much easier to temporarily expand and then shrink an existing table (the shrink did have to be done off-line) than to create a whole new table in a managed database environment.

    I believe you should be able to do this without doing the insert if you just update the available bits instead.

    I could not figure out how to do this, John, but I would be very interested in seeing such a solution.

    -- RBarryYoung

  • To Shaun,

    I don't believe I missed the whole point.

    I may have mistyped.

    The general idea is that in a constrained environment, there are ways of updating the extra bits of data, rather than adding additional bits, rows, or tables.

    After using the updated extra bits to help update/delete the target records in a solution set, one would also have to delete the extra bits of information to return the table to its proper state.

    I am suggesting that one can use only manipulation of the extra bits available, and not resort to adding additional rows, to arrive at a solution.

    In the same way the author (who got my point) had to delete added rows, one would have to delete any added metadata in the unused bits, as addressed above.

    Cheers,

    John R. Hanson

  • John R. Hanson (8/21/2008)


    These are important ideas, even more so today as enterprise data keeps growing. My concern with the solution is that you are inserting rows into the table, and I don't see the difference between this and using an intermediate table.

    There are many times, with a disk-full condition, that you would not be able to do the insert but would still need to remove the dupes.

    I absolutely agree...

    I believe you should be able to do this without doing the insert if you just update the available bits instead.

    Would absolutely love to see that code... got any ideas on how you might be able to pull that off? 🙂

    --Jeff Moden



Viewing 15 posts - 76 through 90 (of 156 total)
