October 29, 2008 at 6:31 am
AnzioBake (10/29/2008)
I tested the solutions against the test data set that I created. The table contained 1 million records, with sequences ranging from 1 to 35 rows. My solution returned 670,000+ rows in the result set.
Thanks Anzio... do you have code to generate that nice test set? Sure would appreciate it.
--Jeff Moden
Change is inevitable... Change for the better is not.
October 29, 2008 at 6:38 am
Here you go... I am sure I do not have to explain the logic and how I biased the results to get duplicates:
Set NoCount On

Declare @Cnt    int
      , @RowKey varchar(1000) -- note: these three variables are not used below
      , @Locale varchar(10)
      , @Code   char(1)

Set @Cnt = 1

While @Cnt <= 1000000
Begin
    Insert Into SSCData (RowKey, Locale, Code)
    Select RowKey =
             Case -- uneven bucket widths bias the key distribution toward duplicates
               When RowKeyID >  0 And RowKeyID <= 10 Then 1
               When RowKeyID > 10 And RowKeyID <= 25 Then 2
               When RowKeyID > 25 And RowKeyID <= 45 Then 3
               When RowKeyID > 45 And RowKeyID <= 60 Then 4
               When RowKeyID > 60 And RowKeyID <= 70 Then 5
             End
         , Locale =
             Case
               When LocaleID >  0 And LocaleID <= 15 Then 'en'
               When LocaleID > 15 And LocaleID <= 24 Then 'de'
               When LocaleID > 24 And LocaleID <= 30 Then 'zh'
             End
         , Code =
             Case
               When CodeID > 0 And CodeID <=  7 Then 'A'
               When CodeID > 7 And CodeID <= 10 Then 'C'
             End
    From (Select RowKeyID = Ceiling(Rand() * 70)
               , LocaleID = Ceiling(Rand() * 30)
               , CodeID   = Ceiling(Rand() * 10)
         ) as a

    If @Cnt % 10000 = 0
        Print Convert(varchar(9), @Cnt) + ' inserted'

    Select @Cnt = @Cnt + 1
End
October 29, 2008 at 6:41 am
I seem to have problems loading code with create table scripts
CREATE TABLE SSCData
(
RowID int identity(1,1 ) PRIMARY KEY CLUSTERED,
RowKey varchar(1000),
Locale varchar(10),
Code char(1)
)
In the event you are wondering why the loop:
I have found that when doing a set-based query with RAND(), SQL Server applies the same value to all rows, which is obviously not what I want... hence the row-based approach here.
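(A common workaround, for reference: unlike RAND(), NEWID() is evaluated once per row, so CHECKSUM(NEWID()) gives per-row randomness in a single set-based statement. This is a sketch only, assuming the same 1-70 / 1-30 / 1-10 ranges as the loop above; the numbers-CTE trick over sys.all_objects is one common way to manufacture a million rows:)

```sql
-- Sketch: set-based per-row randomness via CHECKSUM(NEWID()).
;WITH Nums AS
(
    SELECT TOP (1000000)
           n = ROW_NUMBER() OVER (ORDER BY (SELECT NULL))
    FROM sys.all_objects a
    CROSS JOIN sys.all_objects b
)
SELECT RowKeyID = ABS(CHECKSUM(NEWID())) % 70 + 1, -- 1..70, new value per row
       LocaleID = ABS(CHECKSUM(NEWID())) % 30 + 1, -- 1..30
       CodeID   = ABS(CHECKSUM(NEWID())) % 10 + 1  -- 1..10
FROM Nums;
```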
October 29, 2008 at 9:07 am
Garadin (10/29/2008)
Although, the only thing I'm also concerned about is the data that has concentrated duplicates like the example below which would affect performance more I guess.
Due to the nature of the way my solution works (haven't looked at Anzio or Steve's well enough to tell if theirs are the same), these should not affect the processing time. Did you get a chance to test the performance of any of these methods last night?
Not yet Seth, due to the problem on my laptop. Probably I'll try again later when I get home if my laptop is working. I'll keep you guys posted.
October 29, 2008 at 7:29 pm
Not that it matters at this point, but I have to recant my earlier belief that a set-based solution to this problem was possible. Set-based coding can produce the set of all duplicates (valid or invalid) within each range of 'A' codes, but there it stops.
Stated simply, the set of valid duplicates (V) contains all those duplicates which do not span any row belonging to another valid duplicate in (V). Since set (V) is defined in reference to itself, any solution will have to iteratively expand the set.
__________________________________________________
Against stupidity the gods themselves contend in vain. -- Friedrich Schiller
Stop, children, what's that sound? Everybody look what's going down. -- Stephen Stills
October 30, 2008 at 1:38 am
Bhovious
I am not entirely sure what you are trying to say.
By my definition, iteration does NOT imply "non-set-based" or RBAR (as Jeff would call it).
My solution employs a loop, but the query inside the loop is still set-based... ?????
So how do you define my solution?
(Just wanting to understand your comment.)
October 30, 2008 at 6:31 am
Not that you asked me... but 😀
IMO, *just about* any solution involving a loop (or a correlated subquery, or a few other things) is not set based. Now, yours might be *closer* to set based than, say... mine, which doesn't even attempt to resemble a set-based solution and just sits smack dab on top of a cursor; but being *almost* set based isn't set based. Performing set-based operations on single rows is still RBAR. That's not to say that mine will perform better or vice versa. We'll only know that when Dan runs his tests. Speaking of which... Dan, any updates on that topic?
October 30, 2008 at 7:39 am
AnzioBake (10/30/2008)
By my definition, iteration does NOT imply "non-set-based" or RBAR (as Jeff would call it). My solution employs a loop, but the query inside the loop is still set-based... ?????
So how do you define my solution?
You are correct that iteration does not imply "non set-based" logic because, in fact, all SQL statements are loops.
However, when you EXPLICITLY define a loop, that loop is not set-based.
Set-based solutions are IMPLICIT loops.
October 30, 2008 at 7:45 am
Guys, first let me make clear that I was pursuing a set-based solution purely out of curiosity for mental stimulation (like some people do crosswords or Sudoku puzzles). I conceded up front that even if such a solution was possible, the performance might be questionable.
With that out of the way, let me address Anzio's question. As I understand it, a true set-based solution should be answerable by a single query (which could contain CTEs and subqueries), but which derives its answers at each step based on common characteristics of the set. It doesn't depend on sequencing or use of explicitly coded loops (including recursive CTEs). For example, in this problem, I started by finding the set of break points with the following CTE:
;with cte1 as -- identify breaks
(select t1.RowID,t1.code
from #temptable t1
where t1.code <> (select top 1 code from #temptable where RowID < t1.rowID order by rowID desc)
)
This is a set-based solution because it does not matter in what order the rows appear in #temptable. Having RowID as the primary key might make it run faster, but it doesn't affect the result at all. (Yes, I know that where #temptable has only code "A"s, this produces no rows. It's just the starting point and a simple example.)
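(For reference: on SQL Server 2012 and later, the same "previous row" lookup can be written with the LAG window function instead of the correlated TOP 1 subquery. This is a sketch against the same #temptable, not part of the original solution; like cte1, it drops the first row, since its previous code is NULL:)

```sql
-- Sketch: break detection with LAG (SQL Server 2012+).
;WITH prev AS
(
    SELECT RowID,
           code,
           LAG(code) OVER (ORDER BY RowID) AS prevCode -- code of the preceding row
    FROM #temptable
)
SELECT RowID, code
FROM prev
WHERE code <> prevCode; -- first row excluded: code <> NULL is not true
```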
In Anzio's solution, the first two steps ("Find All the individual sequences" and "Identify all Duplicates in sequences") are set based single queries. [This is what I meant by "Set-based coding can produce the set of all duplicates (valid or invalid) within each range of 'A' codes."] I reached the same point in my CTE solution at the bottom.
But then comes "Construct The Result Set" ... which involves a While loop, which is procedural in nature. ["but there it stops."] At this point, we have to "iterate" through this loop, row by row, gradually inserting more and more rows into the result set. Furthermore we have to use the result set in subsequent loops to determine the next duplicate (Inner join #result as r). [This is what I meant by saying the set of valid duplicates (V) is defined in reference to itself.]
I hoped to be able to produce a query that would do the job without the while loop.
Instead I finally realized that it was an impossible task, because a single query cannot know whether a duplicate (other than the first) is valid without knowing the max(rowid) of any "previous" valid duplicates. Doing things sequentially is unavoidable. Seth reached that conclusion much sooner than I did, but being stubborn, I didn't take his word for it 🙂
Even if I had succeeded, it wouldn't mean that there was anything "wrong" about Anzio's solution. My only criteria for judging a solution are (1) Does it produce the correct result? (2) Does it run quickly and efficiently? and sometimes (3) Is it flexible for future use?
My boss would always add (4) Was it delivered quickly? 😉
So much for thinking I said something simply. Congratulations to all of y'all for working that one out. Maybe one day Dan will be able to tell us all what exactly is being measured and why this unusual scheme. In any event, it was fun.
Best regards,
Bob
=================================================================
;with cte1 as -- identify breaks
(select t1.RowID,t1.code
from #temptable t1
where t1.code <> (select top 1 code from #temptable where RowID < t1.rowID order by rowID desc)
)
,cte1a as -- include first and last rows for ranging purposes
(select t1.RowID, t1.code
from #temptable t1
where RowID = (select min(RowID) from #temptable t2)
union all
select * from cte1
union all
select max(RowID)+1, '~'
from #temptable t1
)
,cte2 as -- identify ranges
(select c1.code,c1.RowID as ARange,isnull(min(c1a.RowID),max(c1.RowID)) as maxRow
from cte1a c1
left join cte1a c1a on c1a.code <> c1.code and c1a.RowID > c1.RowID
where c1.code = 'A'
group by c1.code,c1.RowID
)
,cte3 as -- associate individual rows with their ranges
(select ARange,t.*
from #tempTable t
join cte2 c on t.rowID between ARange and MaxRow
where t.code = 'A'
)
,cte4 as -- identify "raw" duplicates within ranges
(select c1.*,min(c2.rowID) as preMin,max(c2.rowID) as preMax
from cte3 c1
join cte3 c2 on c2.ARange = c1.ARange
and c2.rowKey = c1.rowKey
and c2.locale = c1.locale
and c2.code = c1.code
and c2.rowID < c1.rowID
group by c1.ARange, c1.rowID,c1.rowKey,c1.locale,c1.code
)
select * from cte4 c
-- at this point you have to go to a loop
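(A hedged sketch of the kind of loop that comment refers to. The names #dups, standing in for the materialized cte4 output, and #result, for the accepted duplicates, are illustrative placeholders, not the original code:)

```sql
-- Sketch only: each pass accepts, per 'A' range, the earliest remaining
-- duplicate whose first occurrence (preMin) lies past the last row already
-- claimed in that range -- the self-referential expansion described earlier.
DECLARE @Rows int = 1;

WHILE @Rows > 0
BEGIN
    INSERT INTO #result (ARange, rowID)
    SELECT d.ARange, MIN(d.rowID)
    FROM #dups d
    WHERE d.preMin > ISNULL((SELECT MAX(r.rowID)
                             FROM #result r
                             WHERE r.ARange = d.ARange), 0)
      AND NOT EXISTS (SELECT 1 FROM #result r2 WHERE r2.rowID = d.rowID)
    GROUP BY d.ARange;

    SET @Rows = @@ROWCOUNT; -- stop when a pass accepts nothing new
END
```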
__________________________________________________
Against stupidity the gods themselves contend in vain. -- Friedrich Schiller
Stop, children, what's that sound? Everybody look what's going down. -- Stephen Stills
October 30, 2008 at 8:02 am
;with cte1 as -- identify breaks
(select t1.RowID,t1.code
from #temptable t1
where t1.code <> (select top 1 code from #temptable where RowID < t1.rowID order by rowID desc)
)
This is a set-based solution because it does not matter in what order the rows appear in #temptable.
There's that weird one again. In theory, that should be a recursive CTE, but I think the optimizer outsmarts us all on that one and turns it into a derived table first.
October 30, 2008 at 8:16 am
I gave up on the solution I was working on, as I realized that the UPDATE statements I was using were unable to take into account the update they had just made. In other words, they would update the first record, but that update doesn't exist as far as the rest of the record updates are concerned, and thus my methodology had zero chance of working. Also, I'm still convinced that no one has actually provided a COMPLETE (even if entirely procedural) statement of the problem AND its solution in such a way as to be COMPLETELY UNAMBIGUOUS. That latter part is almost always very difficult, regardless of the problem at hand, and this problem is particularly difficult to describe.
Therefore, I'm requesting that Dan Segalles please describe the actual real-world problem this query is intended to solve. There might well be a completely different way to solve the problem, as opposed to the solution at hand, which seems rather significantly contrived.
Steve
(aka smunson)
:):):)
October 30, 2008 at 9:32 am
Seth:
;with cte1 as -- identify breaks
(select t1.RowID,t1.code
from #temptable t1
where t1.code <> (select top 1 code from #temptable where RowID < t1.rowID order by rowID desc)
)
This is only going to be fast if RowID is the primary key. If #temptable were not indexed, it would crawl. I'm sending you a separate note about this.
By the way, please tell me how to create a box containing text from a prior message.
Bob
__________________________________________________
Against stupidity the gods themselves contend in vain. -- Friedrich Schiller
Stop, children, what's that sound? Everybody look what's going down. -- Stephen Stills
October 30, 2008 at 9:58 am
The box is from the QUOTE tags. I responded to your message. I know it's fast, that's what's weird about it. It *should* be slow... but it's not.
October 30, 2008 at 10:38 am
Just to play devil's advocate...
At the lowest level, even "set-based" queries are handled at a row-by-row level 😀
Think about updating running totals with the
@Var = Col = @Var + Cnt
syntax
(goes and hides in nuclear bunker under mountain)
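(For reference, the syntax Anzio is alluding to is the "quirky update" running-total pattern, roughly like the sketch below. Table and column names are illustrative, and the technique relies on undocumented behavior: a clustered index in the desired order, a table lock, and no parallelism:)

```sql
-- Illustrative only: the classic "quirky update" running total.
DECLARE @Running int = 0;

UPDATE t
SET @Running = t.RunningTotal = @Running + t.Cnt -- assign column and variable in one step
FROM #Totals t WITH (TABLOCKX)
OPTION (MAXDOP 1); -- parallelism would break the assumed row order
```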
October 30, 2008 at 10:57 am
AnzioBake (10/30/2008)
Just to play devil's advocate... At the lowest level, even "set-based" queries are handled at a row-by-row level 😀
Think about updating running totals with the
@Var = Col = @Var + Cnt
syntax
(goes and hides in nuclear bunker under mountain)
You're right to go hide (this may start an avalanche 🙂)
That being said:
- It's probably more correct to say that the DB engine tends to implement internal loops to process set-based operations: row by row, it identifies what needs to happen; however, the actual updates (the write-back) happen page by page or extent by extent.
- the running total logic isn't technically set-based, since it requires controlling the processing order (one of the cardinal rules about current "set-based" processing is that it's based on SIMPLE (a.k.a. unordered) sets).
- The part where it ACTS like a set-based operation is that the actual writes/commits occur in a single operation (so things are updated in whole pages and/or extents at a time, and not so much row by row).
And that's the big difference. The "single write/commit" part is where set-based updates really shine.
----------------------------------------------------------------------------------
Your lack of planning does not constitute an emergency on my part...unless you're my manager...or a director and above...or a really loud-spoken end-user..All right - what was my emergency again?
Viewing 15 posts - 76 through 90 (of 115 total)