October 29, 2008 at 6:31 am
AnzioBake (10/29/2008)
I tested the solutions against the test data set that I created. The table contained 1 million records, with sequences ranging from 1 to 35 rows. My solution returned 670,000+ rows in the result set.
Thanks Anzio... do you have code to generate that nice test set? Sure would appreciate it.
--Jeff Moden
Change is inevitable... Change for the better is not.
October 29, 2008 at 6:38 am
Here you go... I am sure I do not have to explain the logic and how I biased the results to get duplicates:
Set NoCount On

Declare @Cnt    int
      , @RowKey varchar(1000) -- note: these three variables are not used below
      , @Locale varchar(10)
      , @Code   char(1)

Set @Cnt = 1

While @Cnt <= 1000000
Begin
    Insert Into SSCData (RowKey, Locale, Code)
    Select RowKey =
             Case -- uneven bucket widths bias the key distribution toward duplicates
               When RowKeyID >  0 And RowKeyID <= 10 Then 1
               When RowKeyID > 10 And RowKeyID <= 25 Then 2
               When RowKeyID > 25 And RowKeyID <= 45 Then 3
               When RowKeyID > 45 And RowKeyID <= 60 Then 4
               When RowKeyID > 60 And RowKeyID <= 70 Then 5
             End
         , Locale =
             Case
               When LocaleID >  0 And LocaleID <= 15 Then 'en'
               When LocaleID > 15 And LocaleID <= 24 Then 'de'
               When LocaleID > 24 And LocaleID <= 30 Then 'zh'
             End
         , Code =
             Case
               When CodeID > 0 And CodeID <=  7 Then 'A'
               When CodeID > 7 And CodeID <= 10 Then 'C'
             End
    From (Select RowKeyID = Ceiling(Rand() * 70)
               , LocaleID = Ceiling(Rand() * 30)
               , CodeID   = Ceiling(Rand() * 10)
         ) as a

    If @Cnt % 10000 = 0
        Print Convert(varchar(9), @Cnt) + ' inserted'

    Select @Cnt = @Cnt + 1
End
October 29, 2008 at 6:41 am
I seem to have problems loading code with create table scripts
CREATE TABLE SSCData
(
RowID int identity(1,1 ) PRIMARY KEY CLUSTERED,
RowKey varchar(1000),
Locale varchar(10),
Code char(1)
)
In the event you are wondering why the loop:
I have found that when doing a set-based query with RAND(), SQL Server applies the same value to all rows, which is obviously not what I want... hence the row-based approach here.
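(A common workaround, for reference: unlike RAND(), NEWID() is evaluated once per row, so CHECKSUM(NEWID()) gives per-row randomness in a single set-based statement. This is a sketch only, assuming the same 1-70 / 1-30 / 1-10 ranges as the loop above; the numbers-CTE trick over sys.all_objects is one common way to manufacture a million rows:)

```sql
-- Sketch: set-based per-row randomness via CHECKSUM(NEWID()).
;WITH Nums AS
(
    SELECT TOP (1000000)
           n = ROW_NUMBER() OVER (ORDER BY (SELECT NULL))
    FROM sys.all_objects a
    CROSS JOIN sys.all_objects b
)
SELECT RowKeyID = ABS(CHECKSUM(NEWID())) % 70 + 1, -- 1..70, new value per row
       LocaleID = ABS(CHECKSUM(NEWID())) % 30 + 1, -- 1..30
       CodeID   = ABS(CHECKSUM(NEWID())) % 10 + 1  -- 1..10
FROM Nums;
```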
October 29, 2008 at 9:07 am
Garadin (10/29/2008)
Although, the only thing I'm also concerned about is the data that has concentrated duplicates like the example below which would affect performance more I guess.
Due to the nature of the way my solution works (haven't looked at Anzio or Steve's well enough to tell if theirs are the same), these should not affect the processing time. Did you get a chance to test the performance of any of these methods last night?
Not yet Seth, due to the problem on my laptop. Probably I'll try again later when I get home if my laptop is working. I'll keep you guys posted.
October 29, 2008 at 7:29 pm
Not that it matters at this point, but I have to recant my earlier belief that a set-based solution to this problem was possible. Set-based coding can produce the set of all duplicates (valid or invalid) within each range of 'A' codes, but there it stops.
Stated simply, the set of valid duplicates (V) contains all those duplicates which do not span any row belonging to another valid duplicate in (V). Since set (V) is defined in reference to itself, any solution will have to iteratively expand the set.
__________________________________________________
Against stupidity the gods themselves contend in vain. -- Friedrich Schiller
Stop, children, what's that sound? Everybody look what's going down. -- Stephen Stills
October 30, 2008 at 1:38 am
Bhovious
I am not entirely sure what you are trying to say.
By my definition, iteration does NOT imply "non-set-based" or RBAR (as Jeff would call it).
My solution employs a loop, but the query inside the loop is still set-based... ?????
So how do you define my solution?
(Just wanting to understand your comment.)
October 30, 2008 at 6:31 am
Not that you asked me... but 😀
IMO, *just about* any solution involving a loop (or a correlated subquery, or a few other things) is not set based. Now, yours might be *closer* to set based than, say... mine, which doesn't even attempt to resemble a set-based solution and just sits smack dab on top of a cursor; but being *almost* set based isn't set based. Performing set-based operations on single rows is still RBAR. That's not to say that mine will perform better or vice versa. We'll only know that when Dan runs his tests. Speaking of which... Dan, any updates on that topic?
October 30, 2008 at 7:39 am
AnzioBake (10/30/2008)
By my definition, iteration does NOT imply "non-set-based" or RBAR (as Jeff would call it). My solution employs a loop, but the query inside the loop is still set-based... ?????
So how do you define my solution?
You are correct that iteration does not imply "non set-based" logic because, in fact, all SQL statements are loops.
However, when you EXPLICITLY define a loop, that loop is not set-based.
Set-based solutions are IMPLICIT loops.
October 30, 2008 at 7:45 am
Guys, first let me make clear that I was pursuing a set-based solution purely out of curiosity for mental stimulation (like some people do crosswords or Sudoku puzzles). I conceded up front that even if such a solution was possible, the performance might be questionable.
With that out of the way, let me address Anzio's question. As I understand it, a true set-based solution should be answerable by a single query (which could contain CTEs and subqueries), but which derives its answers at each step based on common characteristics of the set. It doesn't depend on sequencing or use of explicitly coded loops (including recursive CTEs). For example, in this problem, I started by finding the set of break points with the following CTE:
;with cte1 as -- identify breaks
(select t1.RowID,t1.code
from #temptable t1
where t1.code <> (select top 1 code from #temptable where RowID < t1.rowID order by rowID desc)
)
This is a set-based solution because it does not matter in what order the rows appear in #temptable. Having RowID as the primary key might make it run faster, but it doesn't affect the result at all. (Yes, I know that where #temptable has only code "A"s, this produces no rows. It's just the starting point and a simple example.)
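(For reference: on SQL Server 2012 and later, the same "previous row" lookup can be written with the LAG window function instead of the correlated TOP 1 subquery. This is a sketch against the same #temptable, not part of the original solution; like cte1, it drops the first row, since its previous code is NULL:)

```sql
-- Sketch: break detection with LAG (SQL Server 2012+).
;WITH prev AS
(
    SELECT RowID,
           code,
           LAG(code) OVER (ORDER BY RowID) AS prevCode -- code of the preceding row
    FROM #temptable
)
SELECT RowID, code
FROM prev
WHERE code <> prevCode; -- first row excluded: code <> NULL is not true
```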
In Anzio's solution, the first two steps ("Find All the individual sequences" and "Identify all Duplicates in sequences") are set based single queries. [This is what I meant by "Set-based coding can produce the set of all duplicates (valid or invalid) within each range of 'A' codes."] I reached the same point in my CTE solution at the bottom.
But then comes "Construct The Result Set" ... which involves a While loop, which is procedural in nature. ["but there it stops."] At this point, we have to "iterate" through this loop, row by row, gradually inserting more and more rows into the result set. Furthermore we have to use the result set in subsequent loops to determine the next duplicate (Inner join #result as r). [This is what I meant by saying the set of valid duplicates (V) is defined in reference to itself.]
I hoped to be able to produce a query that would do the job without the while loop.
Instead I finally realized that it was an impossible task, because a single query cannot know whether a duplicate (other than the first) is valid without knowing the max(rowid) of any "previous" valid duplicates. Doing things sequentially is unavoidable. Seth reached that conclusion much sooner than I did, but being stubborn, I didn't take his word for it 🙂
Even if I had succeeded, it wouldn't mean that there was anything "wrong" about Anzio's solution. My only criteria for judging a solution are (1) Does it produce the correct result? (2) Does it run quickly and efficiently? and sometimes (3) Is it flexible for future use?
My boss would always add (4) Was it delivered quickly? 😉
So much for thinking I said something simply. Congratulations to all of y'all for working that one out. Maybe one day Dan will be able to tell us all what exactly is being measured and why this unusual scheme. In any event, it was fun.
Best regards,
Bob
=================================================================
;with cte1 as -- identify breaks
(select t1.RowID,t1.code
from #temptable t1
where t1.code <> (select top 1 code from #temptable where RowID < t1.rowID order by rowID desc)
)
,cte1a as -- include first and last rows for ranging purposes
(select t1.RowID, t1.code
from #temptable t1
where RowID = (select min(RowID) from #temptable t2)
union all
select * from cte1
union all
select max(RowID)+1, '~'
from #temptable t1
)
,cte2 as -- identify ranges
(select c1.code,c1.RowID as ARange,isnull(min(c1a.RowID),max(c1.RowID)) as maxRow
from cte1a c1
left join cte1a c1a on c1a.code <> c1.code and c1a.RowID > c1.RowID
where c1.code = 'A'
group by c1.code,c1.RowID
)
,cte3 as -- associate individual rows with their ranges
(select ARange,t.*
from #tempTable t
join cte2 c on t.rowID between ARange and MaxRow
where t.code = 'A'
)
,cte4 as -- identify "raw" duplicates within ranges
(select c1.*,min(c2.rowID) as preMin,max(c2.rowID) as preMax
from cte3 c1
join cte3 c2 on c2.ARange = c1.ARange
and c2.rowKey = c1.rowKey
and c2.locale = c1.locale
and c2.code = c1.code
and c2.rowID < c1.rowID
group by c1.ARange, c1.rowID,c1.rowKey,c1.locale,c1.code
)
select * from cte4 c
-- at this point you have to go to a loop
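(A hedged sketch of the kind of loop that comment refers to. The names #dups, standing in for the materialized cte4 output, and #result, for the accepted duplicates, are illustrative placeholders, not the original code:)

```sql
-- Sketch only: each pass accepts, per 'A' range, the earliest remaining
-- duplicate whose first occurrence (preMin) lies past the last row already
-- claimed in that range -- the self-referential expansion described earlier.
DECLARE @Rows int = 1;

WHILE @Rows > 0
BEGIN
    INSERT INTO #result (ARange, rowID)
    SELECT d.ARange, MIN(d.rowID)
    FROM #dups d
    WHERE d.preMin > ISNULL((SELECT MAX(r.rowID)
                             FROM #result r
                             WHERE r.ARange = d.ARange), 0)
      AND NOT EXISTS (SELECT 1 FROM #result r2 WHERE r2.rowID = d.rowID)
    GROUP BY d.ARange;

    SET @Rows = @@ROWCOUNT; -- stop when a pass accepts nothing new
END
```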
__________________________________________________
Against stupidity the gods themselves contend in vain. -- Friedrich Schiller
Stop, children, what's that sound? Everybody look what's going down. -- Stephen Stills
October 30, 2008 at 8:02 am
;with cte1 as -- identify breaks
(select t1.RowID,t1.code
from #temptable t1
where t1.code <> (select top 1 code from #temptable where RowID < t1.rowID order by rowID desc)
)
This is a set-based solution because it does not matter in what order the rows appear in #temptable.
There's that weird one again. In theory, that should be a recursive CTE, but I think the optimizer outsmarts us all on that one and turns it into a derived table first.
October 30, 2008 at 8:16 am
I gave up on the solution I was working on, as I realized that the UPDATE statements I was using were unable to take into account the update they had just made. In other words, they would update the first record, but that update doesn't exist as far as the rest of the record updates are concerned, and thus my methodology had zero chance of working. Also, I'm still convinced that no one has actually provided a COMPLETE (even if entirely procedural) statement of the problem AND its solution in such a way as to be COMPLETELY UNAMBIGUOUS. That latter part is almost always very difficult, regardless of the problem at hand, and this problem is particularly difficult to describe.
Therefore, I'm requesting that Dan Segalles please describe the actual real-world problem this query is intended to solve. There might well be a completely different way to solve the problem, as opposed to the solution at hand, which seems rather significantly contrived.
Steve
(aka smunson)
:):):)
October 30, 2008 at 9:32 am
Seth:
;with cte1 as -- identify breaks
(select t1.RowID,t1.code
from #temptable t1
where t1.code <> (select top 1 code from #temptable where RowID < t1.rowID order by rowID desc)
)
This is only going to be fast if RowID is the primary key. If #temptable were not indexed, it would crawl. I'm sending you a separate note about this.
By the way, please tell me how to create a box containing text from a prior message.
Bob
__________________________________________________
Against stupidity the gods themselves contend in vain. -- Friedrich Schiller
Stop, children, what's that sound? Everybody look what's going down. -- Stephen Stills
October 30, 2008 at 9:58 am
The box is from the QUOTE tags. I responded to your message. I know it's fast, that's what's weird about it. It *should* be slow... but it's not.
October 30, 2008 at 10:38 am
Just to play devil's advocate...
At the lowest level, even "set-based" queries are handled at a row-by-row level 😀
Think about updating running totals with the
@Var = Col = @Var + Cnt
syntax
(goes and hides in nuclear bunker under mountain)
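(For reference, the syntax Anzio is alluding to is the "quirky update" running-total pattern, roughly like the sketch below. Table and column names are illustrative, and the technique relies on undocumented behavior: a clustered index in the desired order, a table lock, and no parallelism:)

```sql
-- Illustrative only: the classic "quirky update" running total.
DECLARE @Running int = 0;

UPDATE t
SET @Running = t.RunningTotal = @Running + t.Cnt -- assign column and variable in one step
FROM #Totals t WITH (TABLOCKX)
OPTION (MAXDOP 1); -- parallelism would break the assumed row order
```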
October 30, 2008 at 10:57 am
AnzioBake (10/30/2008)
Just to play devil's advocate... At the lowest level, even "set-based" queries are handled at a row-by-row level 😀
Think about updating running totals with the
@Var = Col = @Var + Cnt
syntax
(goes and hides in nuclear bunker under mountain)
You're right to go hide (this may start an avalanche 🙂)
That being said:
- It's probably more correct to say that the DB engine tends to implement internal loops to process set-based operations: row by row, it identifies what needs to happen; however, the actual updates (the write-back) happen page by page or extent by extent.
- the running total logic isn't technically set-based, since it requires controlling the processing order (one of the cardinal rules about current "set-based" processing is that it's based on SIMPLE (a.k.a. unordered) sets).
- The part where it ACTS like a set-based operation is that the actual writes/commits occur in a single operation (so things are updated in whole pages and/or extents at a time, and not so much row by row).
And that's the big difference. The "single write/commit" part is where set-based updates really shine.
----------------------------------------------------------------------------------
Your lack of planning does not constitute an emergency on my part...unless you're my manager...or a director and above...or a really loud-spoken end-user..All right - what was my emergency again?
Viewing 15 posts - 76 through 90 (of 115 total)