Need help on performance and integrity

  • AnzioBake (10/29/2008)


    I tested the solutions against the test data set that I created. The table contained 1 million records, with sequences ranging from 1 to 35 rows.

    My solution returned 670,000+ rows in the result set.

    Thanks Anzio... do you have code to generate that nice test set? Sure would appreciate it.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Here you go... I am sure I do not have to explain the logic and how I biased the results to get duplicates.

    Set NoCount on

    Declare @Cnt int
          , @RowKey varchar(1000)
          , @Locale varchar(10)
          , @Code   char(1)

    Set @Cnt = 1

    While @Cnt <= 1000000
    Begin
        -- Map one set of random IDs onto deliberately skewed ranges to force duplicates
        Insert into SSCData( RowKey, Locale, Code)
        Select
            RowKey =
                Case
                    When RowKeyId > 0  and RowKeyID <= 10 Then 1
                    When RowKeyId > 10 and RowKeyID <= 25 Then 1
                    When RowKeyId > 25 and RowKeyID <= 45 Then 1
                    When RowKeyId > 45 and RowKeyID <= 60 Then 1
                    When RowKeyId > 60 and RowKeyID <= 70 Then 1
                End
          , Locale =
                Case
                    When LocaleID > 0  And LocaleID <= 15 Then 'en'
                    When LocaleID > 15 And LocaleID <= 24 Then 'de'
                    When LocaleID > 24 And LocaleID <= 30 Then 'zh'
                End
          , Code =
                Case
                    When CodeId > 0 And CodeID <= 7  Then 'A'
                    When CodeID > 7 And CodeID <= 10 Then 'C'
                End
        From (
            -- One row of random IDs per loop iteration
            Select
                RowKeyID = Ceiling( Rand() * 70 )
              , LocaleID = Ceiling( Rand() * 30 )
              , CodeId   = Ceiling( Rand() * 10 )
        ) as a

        -- Progress message every 10,000 rows
        If @Cnt % 10000 = 0
            Print Convert( Varchar(9), @Cnt) + ' inserted'

        Select @Cnt = @Cnt + 1
    End

  • I seem to have problems posting code with CREATE TABLE scripts.

    CREATE TABLE SSCData
    (
        RowID  int identity(1,1) PRIMARY KEY CLUSTERED,
        RowKey varchar(1000),
        Locale varchar(10),
        Code   char(1)
    )

    In the event you are wondering why the loop:

    I have found that when doing a set-based query with RAND(), SQL applies the same value to all rows, which is obviously not what I want... hence the row-based approach here.
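    For what it's worth, NEWID() is evaluated once per row, unlike RAND(), which is evaluated once per statement, so ABS(CHECKSUM(NEWID())) can drive a fully set-based generator. Below is a minimal sketch of that idea (my own illustration, not the generator above, and it does not reproduce the exact bias thresholds):

    -- Minimal sketch: per-row randomness via NEWID(), so no loop is needed.
    -- Any sufficiently large row source will do; sys.all_columns cross-joined to itself
    -- is used here purely as a convenient source of 1,000,000+ rows.
    INSERT INTO SSCData (RowKey, Locale, Code)
    SELECT TOP (1000000)
           ABS(CHECKSUM(NEWID())) % 70 + 1                                     -- RowKey: 1 to 70
         , CASE ABS(CHECKSUM(NEWID())) % 3
               WHEN 0 THEN 'en' WHEN 1 THEN 'de' ELSE 'zh' END                 -- Locale: rough thirds
         , CASE WHEN ABS(CHECKSUM(NEWID())) % 10 < 7 THEN 'A' ELSE 'C' END     -- Code: roughly 70% 'A'
    FROM master.sys.all_columns ac1
    CROSS JOIN master.sys.all_columns ac2

    Whether the set-based version actually beats the loop would of course need testing on the same box.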

  • Garadin (10/29/2008)


    Although, the only thing I'm also concerned about is data that has concentrated duplicates, like the example below, which I guess would affect performance more.

    Due to the nature of the way my solution works (I haven't looked at Anzio's or Steve's well enough to tell if theirs are the same), these should not affect the processing time. Did you get a chance to test the performance of any of these methods last night?

    Not yet, Seth, due to the problem with my laptop. I'll probably try again later when I get home, if my laptop is working. I'll keep you guys posted.

  • Not that it matters at this point, but I have to recant my earlier belief that a set-based solution to this problem was possible. Set-based coding can produce the set of all duplicates (valid or invalid) within each range of 'A' codes, but there it stops.

    Stated simply, the set of valid duplicates (V) contains all those duplicates which do not span any row belonging to another valid duplicate (V). Since set (V) is defined in reference to itself, any solution will have to expand the set iteratively.

    __________________________________________________

    Against stupidity the gods themselves contend in vain. -- Friedrich Schiller
    Stop, children, what's that sound? Everybody look what's going down. -- Stephen Stills

  • Bhovious

    I am not entirely sure what you are trying to say.

    By my definition, iteration does NOT imply "non-set-based" or RBAR (as Jeff would call it).

    My solution employs a loop, but the query inside the loop is still set based... ?

    So how do you define my solution?

    (Just wanting to understand your comment)

  • Not that you asked me... but 😀

    IMO, *just about* any solution involving a loop (or a correlated subquery, or a few other things) is not set based. Now, yours might be *closer* to set based than, say, mine, which doesn't even attempt to resemble a set-based solution and just lays smack dab on top of a cursor; but being *almost* set based isn't set based. Performing set-based operations on single rows is still RBAR. That's not to say that mine will perform better or vice versa. We'll only know that when Dan runs his tests. Speaking of which... Dan, any updates on that topic?

    Seth Phelabaum


    Consistency is only a virtue if you're not a screwup. 😉

    Links: How to Post Sample Data :: Running Totals :: Tally Table :: Cross Tabs/Pivots :: String Concatenation

  • AnzioBake (10/30/2008)


    By my definition, iteration does NOT imply "non-set-based" or RBAR (as Jeff would call it).

    My solution employs a loop, but the query inside the loop is still set based... ?

    So how do you define my solution?

    You are correct that iteration does not imply "non set-based" logic because, in fact, all SQL statements are loops.

    However, when you EXPLICITLY define a loop, that is not set-based.

    Set-based solutions are IMPLICIT loops.

  • Guys, first let me make clear that I was pursuing a set-based solution purely out of curiosity for mental stimulation (like some people do crosswords or Sudoku puzzles). I conceded up front that even if such a solution was possible, the performance might be questionable.

    With that out of the way, let me address Anzio's question. As I understand it, a true set-based solution should be answerable by a single query (which could contain CTEs and subqueries), but which derives its answers at each step based on common characteristics of the set. It doesn't depend on sequencing or on explicitly coded loops (including recursive CTEs). For example, in this problem, I started by finding the set of break points with the following CTE:

    ;with cte1 as -- identify breaks
    (
        select t1.RowID, t1.code
        from #temptable t1
        where t1.code <> (select top 1 code from #temptable where RowID < t1.rowID order by rowID desc)
    )

    This is a set-based solution because it does not matter in what order the rows appear in #temptable. Having RowID as the primary key might make it run faster, but it doesn't affect the result at all. (Yes, I know that where #temptable contains only code 'A' rows, this produces no rows. It's just the starting point and a simple example.)

    In Anzio's solution, the first two steps ("Find All the individual sequences" and "Identify all Duplicates in sequences") are set based single queries. [This is what I meant by "Set-based coding can produce the set of all duplicates (valid or invalid) within each range of 'A' codes."] I reached the same point in my CTE solution at the bottom.

    But then comes "Construct The Result Set" ... which involves a While loop, which is procedural in nature. ["but there it stops."] At this point, we have to "iterate" through this loop, row by row, gradually inserting more and more rows into the result set. Furthermore we have to use the result set in subsequent loops to determine the next duplicate (Inner join #result as r). [This is what I meant by saying the set of valid duplicates (V) is defined in reference to itself.]

    I hoped to be able to produce a query that would do the job without the while loop.

    Instead I finally realized that it was an impossible task, because a single query cannot know whether a duplicate (other than the first) is valid without knowing the max(RowID) of any "previous" valid duplicates. Doing things sequentially is unavoidable. Seth reached that conclusion much sooner than I did, but being stubborn, I didn't take his word for it 🙂

    Even if I had succeeded, it wouldn't mean that there was anything "wrong" about Anzio's solution. My only criteria for judging a solution are (1) Does it produce the correct result? (2) Does it run quickly and efficiently? and sometimes (3) Is it flexible for future use?

    My boss would always add (4) Was it delivered quickly? 😉

    So much for thinking I said something simply. Congratulations to all of y'all for working that one out. Maybe one day Dan will be able to tell us all what exactly is being measured and why this unusual scheme is needed. In any event, it was fun.

    Best regards,

    Bob

    =================================================================

    ;with cte1 as -- identify breaks
    (
        select t1.RowID, t1.code
        from #temptable t1
        where t1.code <> (select top 1 code from #temptable where RowID < t1.rowID order by rowID desc)
    )
    ,cte1a as -- include first and last rows for ranging purposes
    (
        select t1.RowID, t1.code
        from #temptable t1
        where RowID = (select min(RowID) from #temptable t2)
        union all
        select * from cte1
        union all
        select max(RowID) + 1, '~'
        from #temptable t1
    )
    ,cte2 as -- identify ranges
    (
        select c1.code, c1.RowID as ARange, isnull(min(c1a.RowID), max(c1.RowID)) as maxRow
        from cte1a c1
        left join cte1a c1a on c1a.code <> c1.code and c1a.RowID > c1.RowID
        where c1.code = 'A'
        group by c1.code, c1.RowID
    )
    ,cte3 as -- associate individual rows with their ranges
    (
        select ARange, t.*
        from #tempTable t
        join cte2 c on t.rowID between ARange and MaxRow
        where t.code = 'A'
    )
    ,cte4 as -- identify "raw" duplicates within ranges
    (
        select c1.*, min(c2.rowID) as preMin, max(c2.rowID) as preMax
        from cte3 c1
        join cte3 c2 on c2.ARange = c1.ARange
                    and c2.rowKey = c1.rowKey
                    and c2.locale = c1.locale
                    and c2.code   = c1.code
                    and c2.rowID  < c1.rowID
        group by c1.ARange, c1.rowID, c1.rowKey, c1.locale, c1.code
    )
    select * from cte4 c
    -- at this point you have to go to a loop

    __________________________________________________

    Against stupidity the gods themselves contend in vain. -- Friedrich Schiller
    Stop, children, what's that sound? Everybody look what's going down. -- Stephen Stills

  • ;with cte1 as -- identify breaks
    (
        select t1.RowID, t1.code
        from #temptable t1
        where t1.code <> (select top 1 code from #temptable where RowID < t1.rowID order by rowID desc)
    )

    This is a set-based solution because it does not matter in what order the rows appear in #temptable.

    There's that weird one again. In theory, that should be a recursive CTE, but I think the optimizer outsmarts us all on that one and turns it into a derived table first.
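    For comparison only, the same previous-row lookup can be phrased with OUTER APPLY (my own sketch, not code from this thread); on SQL Server 2005 and later it typically compiles to much the same plan as the correlated TOP 1 subquery:

    -- Sketch: find the "breaks" by comparing each row's code to the previous row's code.
    select t1.RowID, t1.code
    from #temptable t1
    outer apply ( select top 1 t2.code
                  from #temptable t2
                  where t2.RowID < t1.RowID
                  order by t2.RowID desc ) as prev
    where t1.code <> prev.code   -- first row drops out because prev.code is NULL, as in the original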

    Seth Phelabaum


    Consistency is only a virtue if you're not a screwup. 😉

    Links: How to Post Sample Data :: Running Totals :: Tally Table :: Cross Tabs/Pivots :: String Concatenation

  • I gave up on the solution I was working on, as I realized that the UPDATE statements I was using were unable to take into account the update they had just made; in other words, they would update the first record, but that update doesn't exist as far as the rest of the record updates are concerned, and thus my methodology had zero chance of working. Also, I'm still convinced that no one has actually provided a COMPLETE (even if entirely procedural) statement of the problem AND its solution in such a way as to be COMPLETELY UNAMBIGUOUS. That latter part is almost always very difficult, regardless of the problem at hand, and this problem is particularly difficult to describe.

    Therefore, I'm requesting that Dan Segalles please describe the actual real-world problem this query is intended to solve. There might well be a completely different way to solve the problem, as opposed to the solution at hand, which seems rather significantly contrived.

    Steve

    (aka smunson)

    :):):)

    Steve (aka sgmunson) 🙂 🙂 🙂
    Rent Servers for Income (picks and shovels strategy)

  • Seth:

    ;with cte1 as -- identify breaks
    (
        select t1.RowID, t1.code
        from #temptable t1
        where t1.code <> (select top 1 code from #temptable where RowID < t1.rowID order by rowID desc)
    )

    This is only going to be fast if RowID is the primary key. If #temptable were not indexed, it would crawl. I'm sending you a separate note about this.
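    If #temptable ever had to be queried without that primary key, a clustered index on RowID should restore the seek behaviour; a one-line sketch, assuming the table and column names above:

    CREATE CLUSTERED INDEX IX_temptable_RowID ON #temptable (RowID)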

    By the way, please tell me how to create a box containing text from a prior message.

    Bob

    __________________________________________________

    Against stupidity the gods themselves contend in vain. -- Friedrich Schiller
    Stop, children, what's that sound? Everybody look what's going down. -- Stephen Stills

  • The box is from the QUOTE tags. I responded to your message. I know it's fast, that's what's weird about it. It *should* be slow... but it's not.

    Seth Phelabaum


    Consistency is only a virtue if you're not a screwup. 😉

    Links: How to Post Sample Data :: Running Totals :: Tally Table :: Cross Tabs/Pivots :: String Concatenation

  • Just to play devil's advocate...

    At the lowest level, even "set based" queries are handled at a row-by-row level 😀

    Think about updating running totals with the

    @var = col = @var + cnt

    syntax

    (goes and hides in nuclear bunker under mountain)

  • AnzioBake (10/30/2008)


    Just to play devil's advocate...

    At the lowest level, even "set based" queries are handled at a row-by-row level 😀

    Think about updating running totals with the

    @var = col = @var + cnt

    syntax

    (goes and hides in nuclear bunker under mountain)

    You're right to go hide (this may start an avalanche 🙂)

    That being said:

    - it's probably more correct to say that the DB engine tends to implement internal loops to process set-based operations. Row by row, it identifies what needs to happen; however, the actual updates (or the write-back) happen page by page or extent by extent.

    - the running total logic isn't technically set-based, since it requires controlling the processing order (one of the cardinal rules about current "set-based" processing is that it's based on SIMPLE (a.k.a. unordered) sets).

    - the part where it ACTS like a set-based operation is that the actual writes/commits occur in a single operation (so things are updated whole pages and/or extents at a time, and not so much row by row).

    And that's the big difference. The "single write/commit" part is where set-based updates really shine.
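    For anyone who hasn't seen the syntax being referred to, here is a minimal sketch of the "quirky update" running total (the table and data are my own illustration, not anything from this thread):

    -- Sketch of the @var = col = @var + expression running-total technique.
    CREATE TABLE #RunningTotal
    (
        RowID        int IDENTITY(1,1) PRIMARY KEY CLUSTERED,
        Amt          int NOT NULL,
        RunningTotal int NULL
    )

    INSERT INTO #RunningTotal (Amt) SELECT 10
    INSERT INTO #RunningTotal (Amt) SELECT 20
    INSERT INTO #RunningTotal (Amt) SELECT 30

    DECLARE @Total int
    SET @Total = 0

    -- The engine walks the clustered index and carries @Total from row to row,
    -- which is exactly why this depends on processing order rather than on a simple set.
    UPDATE #RunningTotal WITH (TABLOCKX)
    SET @Total = RunningTotal = @Total + Amt
    OPTION (MAXDOP 1)

    SELECT RowID, Amt, RunningTotal
    FROM #RunningTotal
    ORDER BY RowID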

    ----------------------------------------------------------------------------------
    Your lack of planning does not constitute an emergency on my part...unless you're my manager...or a director and above...or a really loud-spoken end-user..All right - what was my emergency again?
