Just For Fun: An Impossible Delete

  • Very cool article, my hat is off to you. It's been a long time, but I've used a technique like your recovered key space before. I had an initial problem with your first thing about the two phone numbers as I would have put a constraint on the field so that there could only be two records per employee, which would have complicated matters slightly.

    We need to get Steve to add a Pingback because I'm definitely linking to this article from my blog.

    Oh, and you might find the Four Yorkshiremen, if memory serves, is actually (technically) NOT Monty Python: it was actually first on At Last, The 1946 Show. However, that show was written by Cleese and Chapman with occasional appearances of other soon-to-be Pythons such as Idle and I think Palin. Unfortunately most of that series was lost, I think there are only five episodes out on a two DVD set. Even though the video is really bad, it's still worth watching. I was overjoyed to find them available.

    -----
    [font="Arial"]Knowledge is of two kinds. We know a subject ourselves or we know where we can find information upon it. --Samuel Johnson[/font]

  • Wayne West (8/5/2008)


    Very cool article, my hat is off to you.

    Thanks, Wayne.

    Oh, and you might find the Four Yorkshiremen, if memory serves, is actually (technically) NOT Monty Python: it was actually first on At Last, The 1946 Show. However, that show was written by Cleese and Chapman with occasional appearances of other soon-to-be Pythons such as Idle and I think Palin. Unfortunately most of that series was lost, I think there are only five episodes out on a two DVD set. Even though the video is really bad, it's still worth watching. I was overjoyed to find them available.

    Huh, now that's another thing I learned today!

    [font="Times New Roman"]-- RBarryYoung[/font], [font="Times New Roman"] (302)375-0451[/font] blog: MovingSQL.com, Twitter: @RBarryYoung[font="Arial Black"]
    Proactive Performance Solutions, Inc.
    [/font]
    [font="Verdana"] "Performance is our middle name."[/font]

  • Such a great article. I really like how you took the time to explain things. I came onto the computer scene when issues like that were still talked about as recent history, but not worried about. I never worried about every last bit anyway.

    The only thing that was missing for me (because I wanted to try out one of your queries) was the sample data for the main table and you provided that in an earlier comment. So, now you know you helped out at least 2 people. 😉

    Finally, I think humor is very important, even in technical articles. So, I wanted to give you public cudos for my favorite line in your article: "Now one byte may not seem like a lot today, but in the 70's an entire family of programmers could live in the space provided by a single byte."

  • Thanks for the feedback JJ. I think that (appropiate) humor is an important part of work and life too. Glad you liked the joke!

    😀

    [font="Times New Roman"]-- RBarryYoung[/font], [font="Times New Roman"] (302)375-0451[/font] blog: MovingSQL.com, Twitter: @RBarryYoung[font="Arial Black"]
    Proactive Performance Solutions, Inc.
    [/font]
    [font="Verdana"] "Performance is our middle name."[/font]

  • Andy DBA (8/5/2008)


    Also, the concept of storing a tally table would be a pretty hard sell. "So you're suggesting we have a permanent table of nothing but consectutive numbers? REALLY? Well, how high should we count? I say we use long ints for this table of yours and we'll take the storage cost out of your salary."

    11,000 is my magic number... covers VARCHAR(8000) and a smidge more than 30 years of days for mortgages. For anything else in 2k5, a Cross Join CTE does well up to 121 million.

    Anyway, as reluctant as I am to suggest this, (please don't hate me, Jeff) if one assumes duplicate rows are grouped together, a RBAR solution to this interesting problem is likely what would have been adopted back then. Of course you can forget about concurrency if you're RBAR grinding through GBs of data.

    Well, maybe just a little... 😛

    Please don't get me wrong. I totally understand and promote set based solutions over RBAR and have used tally tables to great advantage. It's just interesting to see hybrid solutions combining 70s constraints, goals, and techiques with modern ones. We certainly have it better than we used to, but we're doing a lot more now, too.

    I absolutely agree... some of the best code I've ever seen has come about by folks "messing around" with code and hybrid solutions. For example, the fastest "missing sequence" code I've ever seen uses a couple of correlated sub-queries which is a form of RBAR... but the constraints it used make it all nasty fast. 🙂

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Jeff Moden (8/5/2008)


    ...For example, the fastest "missing sequence" code I've ever seen uses a couple of correlated sub-queries which is a form of RBAR... but the constraints it used make it all nasty fast. 🙂

    I'm sure I've seen this floating around here somewhere, but can't find it atm. Link please?

    Regards,

    Jacob

  • Jacob Luebbers (8/5/2008)


    Jeff Moden (8/5/2008)


    ...For example, the fastest "missing sequence" code I've ever seen uses a couple of correlated sub-queries which is a form of RBAR... but the constraints it used make it all nasty fast. 🙂

    I'm sure I've seen this floating around here somewhere, but can't find it atm. Link please?

    Regards,

    Jacob

    No link necessary... here's the test data... as usual, comments in the code explain what's going on...

    --===== Setup for speed and to prevent blocking

    SET NOCOUNT ON

    SET TRANSACTION ISOLATION LEVEL READ UNCOMMITTED

    --=============================================================================

    -- Create an experimental table to simulate the table being examined

    -- Again... 6 million rows...

    --=============================================================================

    --===== Create the experimental temp table and populate with Serial #'s on the fly

    -- This works because SysColumns always has at least 4000 entries

    -- even in a new database and 4000*4000 = 16,000,000

    SELECT TOP 6000000 SerialNumber = IDENTITY(INT, 1, 1)

    INTO #JbmTest

    FROM Master.dbo.SysColumns sc1,

    Master.dbo.SysColumns sc2

    -- --===== Like any good table, our experimental table needs a Primary Key

    ALTER TABLE #JbmTest

    ADD PRIMARY KEY CLUSTERED (SerialNumber)

    -- This deletes a "monster" range just to see how it's handled.

    DELETE #JbmTest

    WHERE SerialNumber BETWEEN 5000000 AND 5500000

    -- This deletes every third row in the first 1000 rows

    DELETE #JbmTest

    WHERE SerialNumber %3 = 0

    AND SerialNumber <= 1000

    ... and here's the little pearl that does the trick so nicely... if you think 10 seconds is a long time, remember that there's SIX MILLION rows in the test data. 😉

    --=============================================================================

    -- Test the code

    --=============================================================================

    SET STATISTICS IO ON

    SET STATISTICS TIME ON

    --===== Calculated Gaps

    SELECT GapStart = (SELECT ISNULL(MAX(b.SerialNumber),0)+1

    FROM #JbmTest b

    WHERE b.SerialNumber < a.SerialNumber),

    GapEnd = SerialNumber - 1

    FROM #JbmTest a

    WHERE a.SerialNumber - 1 NOT IN (SELECT SerialNumber FROM #JbmTest)

    AND a.SerialNumber - 1 > 0

    SET STATISTICS IO OFF

    SET STATISTICS TIME OFF

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Nice - just ran a quick comparison (dual Xeon 2.8, 6GB RAM, SQL Server 2005 Ent. SP2) - your method vs Alex Kozac's original (http://msdn.microsoft.com/en-us/library/aa175780.aspx) vs Goce Smilevski's UDF-based improvement (http://www.sqlservercentral.com/articles/Advanced+Querying/anefficientsetbasedsolutionforislandsandgaps/1619/[/url]):

    --...SNIP... Jeff's test table setup, changed to use a permanent instead of temp table

    -- Goce's UDFs

    create FUNCTION dbo.udfFindGapStarts()

    RETURNS @res_tbl TABLE (

    row_num int identity(1, 1) NOT NULL PRIMARY KEY,

    gap_start int NOT NULL)

    AS

    BEGIN

    INSERT INTO @res_tbl (gap_start)

    SELECT SerialNumber + 1

    FROM JbmTest AS g1

    WHERE NOT EXISTS (SELECT SerialNumber FROM JbmTest AS g2

    WHERE (g2.SerialNumber = g1.SerialNumber + 1))

    ORDER BY SerialNumber + 1

    OPTION (MERGE JOIN)

    RETURN

    END

    create FUNCTION dbo.udfFindGapEnds()

    RETURNS @res_tbl TABLE (

    row_num int identity(0, 1) NOT NULL PRIMARY KEY,

    gap_end int NOT NULL)

    AS

    BEGIN

    INSERT INTO @res_tbl (gap_end)

    SELECT SerialNumber - 1

    FROM JbmTest AS g1

    WHERE NOT EXISTS (SELECT SerialNumber FROM JbmTest AS g2

    WHERE (g2.SerialNumber = g1.SerialNumber - 1))

    ORDER BY SerialNumber - 1

    OPTION (MERGE JOIN)

    RETURN

    END

    --=============================================================================

    -- Test the code

    --=============================================================================

    SET STATISTICS IO ON

    SET STATISTICS TIME ON

    -- GAPS - Jeff Moden

    SELECT GapStart = (SELECT ISNULL(MAX(b.SerialNumber),0)+1

    FROM JbmTest b

    WHERE b.SerialNumber < a.SerialNumber),

    GapEnd = SerialNumber - 1

    FROM JbmTest a

    WHERE a.SerialNumber - 1 NOT IN (SELECT SerialNumber FROM JbmTest)

    AND a.SerialNumber - 1 > 0

    -- GAPS - Goce Smilevski

    SELECT

    t1.gap_start, t2.gap_end

    FROM

    dbo.udfFindGapStarts() AS t1

    INNER JOIN dbo.udfFindGapEnds() AS t2

    ON (t2.row_num = t1.row_num)

    OPTION

    (MERGE JOIN)

    -- GAPS - Alexander Kozak

    SELECT

    t1.col1 as startOfGroup,

    MIN(t2.col1) AS endOfGroup

    FROM

    (SELECT

    col1 = SerialNumber+1

    FROM JbmTest tbl1

    WHERE NOT EXISTS(SELECT * FROM JbmTest tbl2 WHERE tbl2.SerialNumber = tbl1.SerialNumber + 1)

    AND SerialNumber <> (SELECT MAX(SerialNumber) FROM JbmTest)

    ) t1

    INNER JOIN

    (SELECT

    col1 = SerialNumber-1

    FROM JbmTest tbl1

    WHERE NOT EXISTS (SELECT * FROM JbmTest tbl2 WHERE tbl1.SerialNumber = tbl2.SerialNumber + 1)

    AND SerialNumber <> (SELECT MIN(SerialNumber) FROM JbmTest)

    ) t2 ON t1.col1 <= t2.col1

    GROUP BY t1.col1

    SET STATISTICS IO OFF

    SET STATISTICS TIME OFF

    Results:

    Jeff Moden: 7 secs, 18743 logical reads

    Goce Smilevski: 11 secs, 36822 logical reads

    Alex Kozac: killed it after 20 mins, no result

    Regards,

    Jacob

  • Very cool, Jacob! Thanks for taking the time to test it and other methods, as well! 🙂

    I wish I could take the credit for the code, but I can't. I got the code off of some website about a million years ago and wish I could find it again so I could give the lad a hefty pat on the back.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Great article.....

  • create table #tab (lname varchar(5), age smallint, sex char(1))

    insert into #tab (lname, age, sex) select 'ABC', 24, 'M'

    insert into #tab (lname, age, sex) select 'ABC', 24, 'M'

    insert into #tab (lname, age, sex) select 'LMN', 27, 'M'

    insert into #tab (lname, age, sex) select 'LMN', 27, 'M'

    insert into #tab (lname, age, sex) select 'LMN', 27, 'M'

    insert into #tab (lname, age, sex) select 'PQRS', 25, 'F'

    insert into #tab (lname, age, sex) select 'XYZ', 24, 'M'

    insert into #tab (lname, age, sex) select 'XYZ', 25, 'M'

    select * from #tab

    set rowcount 0

    while exists(select lname, count(*) from #tab group by lname having count(*) > 1)

    begin

    set rowcount 1

    delete from #tab where lname in (select lname from #tab group by lname having count(*) > 1)

    set rowcount 0

    end

    select * from #tab

  • Bill; one of the rules is "No Loops". "WHILE..." is a loop.

    [font="Times New Roman"]-- RBarryYoung[/font], [font="Times New Roman"] (302)375-0451[/font] blog: MovingSQL.com, Twitter: @RBarryYoung[font="Arial Black"]
    Proactive Performance Solutions, Inc.
    [/font]
    [font="Verdana"] "Performance is our middle name."[/font]

  • Admittedly, I didn't comb through the rules. I just threw together the code to perform the "impossible delete". 🙂

  • rbarryyoung (8/6/2008)


    Bill; one of the rules is "No Loops". "WHILE..." is a loop.

    Another [implicit] rule is that the solution should return the desired result. While I don't understand the purpose driving the original design problem, this "solution" fails to deliver correct results.

    Andrew

    --Andrew

  • You're right, sorry. I misread the article to mean ditch the dupes.

    My bad, in a hurry while trying to do actual work. 🙂

Viewing 15 posts - 61 through 75 (of 156 total)

You must be logged in to reply to this topic. Login to reply