October 22, 2007 at 5:43 pm
Sergiy (10/22/2007)
Do you know the difference between table scan and clustered index scan?
Sure do (you really don't want to go here, but feel free to continue). Do you know the difference between two and one?
October 22, 2007 at 5:44 pm
David McFarland (10/22/2007)
Sergiy (10/22/2007)
Do you know the difference between table scan and clustered index scan?Sure do (you really don't want to go here, but feel free to continue). Do you know the difference between two and one?
No.
I want to know you version of the difference between table scan and clustered index scan.
Very interesting.
_____________
Code for TallyGenerator
October 22, 2007 at 5:48 pm
Sergiy (10/22/2007)
CustomerID unlikely to be used for ordering records in any report.It's some internal ID nobody cares about.
In real life you need to sort by CustomerLastName (which is in another table) or CustomerFullName (which is computed on fly), or by aggregated amounts.
So, your OVER() "advantage" is quite useless for real life tasks.
Not to mention it's just occasional.
I'll happily demonstrate that Over handles this just fine, but from what I've seen from your other responses, it would be a waste of time to do so. Tell you what, if I prove you wrong (and I'll happily let Jeff be the judge, with all parties agreeing to the test conditions), will you openly apologize to every single poster on this board you've ever denigrated? I wonder what the maximum post size is on the new boards, and hope it can handle that.
October 22, 2007 at 6:14 pm
Sergiy (10/22/2007)
No.I want to know you version of the difference between table scan and clustered index scan.
Very interesting.
What would you like to know? If you would like some help on the internals, I'd highly recommend any of Kalen Delaney's books on the subject. They're quite good, and I've read each of them, including the 2005 version, several times. Want to borrow mine? The clustered index scan is basically the same thing, as the leaf level nodes are simply the data. In most cases (documentation out there says all, but I've seen exceptions, due to storage device strategies), the Clustered Index Scan will be ever so slightly faster than a Table scan. For our tests, it added about a third of a second to each of our queries. This is a silly sidetrack, as it has nothing to do with anything we are testing. I'm not the one who added the clustered index in the first place, and the results for OVER will perform the same as the INNER JOIN whether we have a Clustered Index Scan, or remove it and do a table scan. Talk to Jeff if you don't like the clustered index (although I'll happily prove you wrong here as well, if you want). I can only assume that when you've lost face, you look for anything at all to go "Ha!". Sorry, but this isn't a good one.
The real issue is that you claim that Over is slower than your method. It's not. You claimed that it was because the Over would create two derived tables, one for each aggregate line. It doesn't. In fact, I can add Over lines all over the place, and it's still going to sing. Is this the point where you start asking bizarre questions about the midnight event, or are you ready to acknowledge that you were mistaken?
Yeah, I know, I laughed as well.
October 22, 2007 at 6:32 pm
Matt... all good points and thanks for the experiments you did... Man, I gotta take the time to install the Dev version of 2k5... then I can run my own tests and play with indexes and the like.
Funny thing, though... to me (I could be wrong), if you want shear speed without much thought about tuning, it looks like the Inner Join is the winner here.
Know of any caveats I should be aware of during a 2k5 installation?
--Jeff Moden
Change is inevitable... Change for the better is not.
October 22, 2007 at 6:53 pm
Jeff Moden (10/22/2007)Funny thing, though... to me (I could be wrong), if you want shear speed without much thought about tuning, it looks like the Inner Join is the winner here.
Based on what evidence?
I get the same exact times between the two methods when no thought goes into tuning. I get the same exact times when configured like a typical table would be in a production environment. I'm not seeing the evidence that Inner Join wins anything, and as the query gets added to, it's going to remain a tie, or actually beat out the Inner Join method.
Come up with as complicated of a real world scenario as you want, with aggregates all over the place. I'll happily test it.
What's the purpose of publishing a test, comments on it, execution plans demonstrating that Sergiy was wrong in the behavior of Over, the results, etc. if they're ignored?
Seriously, I have always considered you a fair arbiter, simply based on posting history, and had never once questioned where you'd come down when faced with actual evidence. Is that actually the case?
October 22, 2007 at 7:12 pm
David McFarland (10/22/2007)
The clustered index scan is basically the same thing, as the leaf level nodes are simply the data.
Then what's is "not-basical" difference you aware of?
This is a silly sidetrack, as it has nothing to do with anything we are testing.
I'm not the one who added the clustered index in the first place
You are the one who pointed that it was Clustered Index Scan, not Table Scan as I said.
So, what does Clustered Index Scan actually scan?
The real issue is that you claim that Over is slower than your method. It's not.
Sure?
What about this?
David McFarland (10/19/2007)
Sergiy
Probably 2k5 is more comfortable for procedural language programmers, but for those who understand relational model and use to operate with datasets it's useless.While I'm on board with the "set based over CLR whenever possible" crowd, I have to take exception to this. If you want to claim that anything set-based that can be done in 2005 can be accomplished with 2000, you'll get only a minor argument from me, but that does't mean it's useless. There is nothing I can do in C++ that I can't do in Assembler, but that doesn't make C++ useless.
For instance, you'll take Over() back out of SQL Server over my cold dead body.
What is a single advantage of OVER() for someone who knows how to write T-SQL statements?
So far you're trying to prove that in some conditions in some specific cases OVER() is not that worse than pure T-SQL solution.
Where are those advantages you are so excited about?
I can add Over lines all over the place, and it's still going to sing.
Wanna try?
_____________
Code for TallyGenerator
October 22, 2007 at 7:59 pm
Sergiy (10/22/2007)
No, I pointed out that it was one, 1, uno, einz, a single clustered index scan. Without the Clustered Index, it would have been one, 1, uno, einz, a single table scan. You claimed differently.
Sure?
What about this?
David McFarland (10/19/2007)
Sergiy
Probably 2k5 is more comfortable for procedural language programmers, but for those who understand relational model and use to operate with datasets it's useless.While I'm on board with the "set based over CLR whenever possible" crowd, I have to take exception to this. If you want to claim that anything set-based that can be done in 2005 can be accomplished with 2000, you'll get only a minor argument from me, but that does't mean it's useless. There is nothing I can do in C++ that I can't do in Assembler, but that doesn't make C++ useless.
For instance, you'll take Over() back out of SQL Server over my cold dead body.
What does that have to do with anything? I see no where in that text where I claim that Over() will be slower than other methods.
What is a single advantage of OVER() for someone who knows how to write T-SQL statements?
I already stated that upthread. Feel free to actually read the thread if you'd like to know.
So far you're trying to prove that in some conditions in some specific cases OVER() is not that worse than pure T-SQL solution.
Not at all, see above.
October 22, 2007 at 8:21 pm
Jeff - no major caveats. Just for S & G's, a few weeks ago I did a "full" sloppy install (zone alarm and avg still running, upgrade right over Express, install everything - take defaults), even after VS2005 installed...and it still ran with no issues. It's really more of a disk swap exercise, and visiting windows update.
This may be a factor of it being the upgrade of Express - but it has a tendency to turn on "auto-closing" the DB's, so just be careful with that one.
=========================================
David stole my thunder a bit , but I think he's on to something. The bottom line is - we were comparing apples and oranges. Some times it helps to look at something in its most basic form - when you decompose the OVER PARTITION to its most basic form - it's coming up with an execution plan like the standard running sum query (double-assignments, increment, etc...), with the multiple nested loop syntaxes, etc... Meaning - it's not a grouped sum, it's a group/partition WITH A SUM.
With that in mind - you wouldn't use a running sum syntax to perform this task since it doesn't use anything "running". It's merely percentages of total group. We'd have to come up with another test to compare the two.
I'll now duck out of the line of fire before more the hostilities start up again:)
----------------------------------------------------------------------------------
Your lack of planning does not constitute an emergency on my part...unless you're my manager...or a director and above...or a really loud-spoken end-user..All right - what was my emergency again?
October 22, 2007 at 9:00 pm
Thanks, Matt...
I'd be doing a full install on a box that already has the Developers Edition of 2k on it and I'd be installing the Developer's edition of 2k5... do I have to worry about it trying to automatically upgrade or change that 2k installation at all? So far as Windows goes... it's XP Pro sp2 and auto-update for Windows is always on.
Heh... so far as the basics you mentioned go... that's precisely why I took out the ORDER BY's... the tests are not geared to producing a million row report (who'd read it? :w00t: )... they're geared towards processing a million rows as fast as you can π
Cool how the way the hostilities have gone way up... I even have someone questioning my integrity, now... :hehe:
If you're bailing off this thread (and, I wouldn't blame you if you did), lemme just say that I always get a real kick working these problems with you... you even read my mind about testing... it's always a pleasure.
--Jeff Moden
Change is inevitable... Change for the better is not.
October 22, 2007 at 9:09 pm
oh I'll keep an eye on it - and I'll jump on when I can. I just got something dumped in my lap I now have to evaluate and possibly rebuild, so we'll see how much time is left for playing....
Always fun to play with new toys:)
----------------------------------------------------------------------------------
Your lack of planning does not constitute an emergency on my part...unless you're my manager...or a director and above...or a really loud-spoken end-user..All right - what was my emergency again?
October 22, 2007 at 9:09 pm
David McFarland (10/22/2007)
Seriously, I have always considered you a fair arbiter, simply based on posting history, and had never once questioned where you'd come down when faced with actual evidence. Is that actually the case?
Ya know... for someone who bitched as much as you did about personal attacks, you're really pretty good at it. :Whistling: Remember the post to my friend Greg? You could have said nothing... instead, it was you who attacked. And there's a strong bit of innuendo in your statement above... are YOU questioning my integrity? Or just insinuating? And all because I happen to not be arbitrating for your position? Sheesh!
Let's get back to the code because that's all that really matters...
You've brought a good point to the table, David... the OVER(PARTITION) does appear to do an inherent sort (I can't verify, I don't have 2k5, so I take your word for it)... and that's great for reports of all kinds...
...but... who's going to create a report with a million rows? The purpose of the testing I'm interested in is not how fast you can create a million row report... the purpose is how fast each method can process a million rows of data.
I agree... if you don't want to even think about tuning, go ahead and use OVER(PARTITION). If you intend to build reports, go ahead, use OVER(PARTITION). If you want to process millions of rows of data about twice as fast, you just might want to consider exercising a little "old" T-SQL by using the Inner Join method with the covering index.
You don't believe that covering indexes are a representation of what a real production system would have. We're not arguing the fact that you don't use covering indexes in production, but, contrary to your beliefs, many of us do use such indexes in production when we want to double the performance of mega row code... it's part of "tuning" and you know that.
Now, instead of questioning my personal integrity for doing tests that don't happen to meet your particular needs, embrace the spirit of the tests... Show me the code, the indexes, or any combination of methods where you can process a million rows of data for the current given problem (which I believe you originally posted, by the way) using OVER(PARTITION) that can beat Serqiy's Inner Join method in the presence of the tuning/coving index that's been listed. THEN we can test the best way to do (gasp) million row reportsβ¦
--Jeff Moden
Change is inevitable... Change for the better is not.
October 23, 2007 at 8:13 am
You know... I was just wondering....
Have you guys ever stopped and considered exactly what you're testing?
Or how the tests are done? (which coincides with the 'what' btw)
No, I haven't read the entire 14 pages of this thread.. perhaps half.
Are you aware that you are all right in some things, and you're also all wrong about some things..
Mostly it's about comparing apples with oranges. Noone can win π
Seems all is left now is arguments for arguments sake?
Would be a shame, there's much more to explore.
/Kenneth
October 24, 2007 at 7:59 pm
Heh... dunno about anyone else, but as stated in my last... I'm just testing for duration comparisons of simple parallel methods, Ken... what are you going on about?
--Jeff Moden
Change is inevitable... Change for the better is not.
October 25, 2007 at 2:06 am
Ah, just small details that tends to get lost in the overall excitement...
Like first, when performance was measured by just execting something and as I assumed, also returning the results. This is just a competition in who has the fastest box to render the result.
Though that was indeed noted, so it was changed to inserting into a temptable instead.
However, then it became a competiotin about who has the fastest diskdrives.. π
Ofc, one could say that when running the 'competitors' back to back, then the relative results between could serve as an indicator of which is 'best', but still...
Personally, I don't think that timings is that good at all when deciding what is most efficient anyway.
No harm doing it, but there's so much more to what makes a query 'good' or 'bad'.
IMO, the plans generated are more useful to understand how the optimizer resolves the query, and also to find out what potentionals there may be to improve upon it. Both by syntax, and also what happens with different index strategies.
I've only seen one post with actual plans. (remember, haven't scanned the entire thread).
There they was identical, and indeed, they were. However, there's a lot of 'tweaking' one can do, like remove the ORDER BY. Sorts are always very costly in terms of performance, so we like to avoid them if we can. Now, that was only unnecessary in the example that selected into temptable. Assuming that was a requirement, ORDER BY won't guarantee how rows are inserted anyway. π
OTOH, if you're in the situation that you do need to have absolute control over a 'generated' sequence and store that in a table, then Sergiy's 'absolutely useless' OVER() clause is one of the very few tools that would allow you to do just that.
And no, 'INSERT myTmpTblWithIdentity SELECT foo FROM bar ORDER BY foo will *NOT* guarantee that rows will be inserted according to the ORDER BY.
Though, the 'old' way that is supported (if I understood that correctly) is if you do a construct like:
SELECT (inline identity col), col1, col2... INTO tmpTbl FROM otherTable ORDER BY col1
So, Sergiy was wrong about saying OVER() is totally useless π
Though if we go back to the examples that was competing, remove the ORDER BY from both, then the inner join version has a very much more efficient plan. (and shorter time)
But IF the ORDER BY is deemed necessary for the task given, then it's a tie between the two.
Also as noted, though not yet followed up on(?) is what different index strategies would to to the statements in terms of changing the plans. From what I found, the OVER() version has most to gain from playing around with where the clustered index is, or if we have a covered index, and exactly how that would look... Try it, and note how sort and join operators in the plans change or go away.
Yes, that's also part of a tuning process, as you mentioned earlier, Jeff.
Anyhow, bottom line...
Timings isn't everything. Don't forget the plans. Imo plans are better tools than timings, because timings are very much dependant on hardware. (ie the same plan may have different times depending on what else goes on in the system at that particular time)
When feeling too excited, get a bucket of water, apply to head, and chill out a bit π
Then come back and be constructive (none mentoined, none forgotten)
just my .02, though....
/Kenneth
Viewing 15 posts - 121 through 135 (of 172 total)
You must be logged in to reply to this topic. Login to reply