December 6, 2011 at 8:38 am
Ninja's_RGR'us (12/6/2011)
Because the table variable is so small, I figured that without the overhead of statistics, and with minimal disk I/O, I might gain some time. It's an interesting thought, but in very large databases (several billion rows per table) I have always gotten the best performance by reducing overhead, even shifting it to smaller tables that I create on the fly. My perception was that it would work here too, especially if I wanted the main table with 420 million rows to join with this small table. It was just an educated guess and certainly open to debate and improvement.
We're not talking about @t vs #t anymore. This is divide and conquer. A very different concept that has more than proven its worth.
I don't work with anything as large as 420M rows. But from all the tests I've done, stats have never proven to be bad or detrimental to performance. They either helped the query or cost only a few ms to create with no apparent gain. That being said, I know they help protect from bad plans, so that 1-2 ms is, imho, well worth it.
Then again, I don't work on queries that run 1M times a day, nor on 10 GB datasets, so my view might be skewed because of that.
Exactly my point. That is why such a large spread between execution times, with and without the OPTIMIZE FOR hint, was such a rare opportunity for me to see what influence the query's construct would have on the statistics. Or, to put it more directly: why would SS choose to loop 420 million times through the FileSummaries table without the hint? It made no sense. One option was to recalculate the statistics, but that would have destroyed the circumstances, and with them the opportunity to see the impact of the query construct under those conditions without using the OPTIMIZE FOR hint. My conclusion was that the bad query didn't give the query optimizer sufficient unambiguous information about the query's intent and it relied on warped statistics. The good query must have been less ambiguous, then. That's open for interpretation.
Hence the case study.
Thanks.
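To make the @t vs. #t statistics point above concrete, here is a minimal sketch; the FileDate column, the period columns, and the sample values are hypothetical stand-ins, not taken from the actual case.

-- Contrast cardinality estimates for a temp table vs. a table variable.
CREATE TABLE #t (PeriodStart datetime NOT NULL, PeriodEnd datetime NOT NULL);
DECLARE @t TABLE (PeriodStart datetime NOT NULL, PeriodEnd datetime NOT NULL);

INSERT #t VALUES ('20110101', '20110201'), ('20110201', '20110301');
INSERT @t VALUES ('20110101', '20110201'), ('20110201', '20110301');

-- With the actual execution plan enabled, the #t join is estimated from
-- auto-created column statistics on the temp table, while the @t side is
-- estimated at 1 row because table variables carry no column statistics.
SELECT COUNT(*)
FROM dbo.FileSummaries AS f
JOIN #t AS p ON f.FileDate >= p.PeriodStart AND f.FileDate < p.PeriodEnd;

SELECT COUNT(*)
FROM dbo.FileSummaries AS f
JOIN @t AS p ON f.FileDate >= p.PeriodStart AND f.FileDate < p.PeriodEnd;

DROP TABLE #t;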
December 6, 2011 at 9:03 am
for me to see what influence the query's construct would have on the statistics
Didn't you mean the reverse? The QP has absolutely nothing to do with changing / editing stats. The furthest I can stretch this is to say that the QO will kick off a stats update, but that's about it.
why would SS choose to loop 420 million times through the FileSummaries table without the hint
Because it calculated that this was the fastest / easiest way to access the data. Now, anybody with two eyes can see that this is clearly wrong in this case.
So what assumption / info is wrong in the QO's calculation? This is the place / POV from where I'd start digging.
The good query must have been less ambiguous, then
I'm 99.999999999999% sure that's not the case. As Grant pointed out, you had a full optimization cycle. That means it did NOT time out. In other words, the server was satisfied that it had found a good enough plan and stopped trying to find a better one. Again, SS was clearly wrong there. So, once again: what assumption was wrong in that process?
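For anyone who wants to check this on their own plans, the optimization level and any early-abort reason are exposed as attributes in the plan XML. A rough sketch, not specific to this case:

-- Read StatementOptmLevel / StatementOptmEarlyAbortReason from cached plans.
-- A full optimization shows 'FULL' with no abort reason; a timeout typically
-- shows 'TimeOut' in the early_abort_reason column.
WITH XMLNAMESPACES (DEFAULT 'http://schemas.microsoft.com/sqlserver/2004/07/showplan')
SELECT TOP (20)
       st.text AS query_text,
       qp.query_plan.value('(//StmtSimple/@StatementOptmLevel)[1]', 'varchar(25)') AS optimization_level,
       qp.query_plan.value('(//StmtSimple/@StatementOptmEarlyAbortReason)[1]', 'varchar(50)') AS early_abort_reason
FROM sys.dm_exec_query_stats AS qs
CROSS APPLY sys.dm_exec_query_plan(qs.plan_handle) AS qp
CROSS APPLY sys.dm_exec_sql_text(qs.sql_handle) AS st
ORDER BY qs.total_elapsed_time DESC;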
about the query's intent and it relied on warped statistics.
Depends on your definition of intent. SS returned the correct results so the intent was clearly understood.
Warped statistics, weird distributions, stale stats: they all come back to the same problem. Bad estimates, wrong assumptions, and a sub-optimal access path.
This is the first plan I've seen with 1 trillion rows (anywhere, actually). I'm starting to think this is a case of bad parameter sniffing on super-mega steroids because of the amount of data in place (it doesn't take long to go from 420M rows to 1 trillion; that's only a 2,400-to-1 bad estimate, which I see all the time with @t and complex-ish reports).
What your hint does is tell the optimizer to "ignore the stats you think are good" and use this specific distribution instead.
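For reference, this is roughly the shape of the hint being discussed; the parameter name, the FileDate column, and the date values are hypothetical, and only the FileSummaries table name comes from the thread.

-- Compile the plan for the supplied literal instead of the sniffed/unknown value.
DECLARE @PeriodStart datetime = '20110101';

SELECT COUNT(*)
FROM dbo.FileSummaries AS f
WHERE f.FileDate >= @PeriodStart
OPTION (OPTIMIZE FOR (@PeriodStart = '20110601'));

-- OPTIMIZE FOR UNKNOWN is the other variant: it ignores the sniffed value and
-- estimates from the statistics' overall density instead of a specific value.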
The intent that is not directly obvious in the query is that your join on the dates (which, written the way it is, is almost a triangular join) is actually closer to an equi-join where only 0, 1, or 2 rows can be returned. As the query is currently written, there is nothing explicitly helping the optimizer guess at that (like TOP 1 with CROSS APPLY; see the sketch below). What the use of the @t does is allow the server to use the minimum value as the estimate (1)... which helps you tremendously here.
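A sketch of that kind of rewrite, with hypothetical column names; the only point is to show the TOP 1 / CROSS APPLY shape that states the 0-1-2 row intent explicitly.

-- @t stands in for the small table variable from the thread; columns are hypothetical.
DECLARE @t TABLE (PeriodStart datetime NOT NULL, PeriodEnd datetime NOT NULL);

-- For each period row, fetch at most one matching row; the TOP (1) tells the
-- optimizer directly that the per-row result is tiny.
SELECT p.PeriodStart, f.FileDate
FROM @t AS p
CROSS APPLY (SELECT TOP (1) fs.FileDate
             FROM dbo.FileSummaries AS fs
             WHERE fs.FileDate >= p.PeriodStart
               AND fs.FileDate <  p.PeriodEnd
             ORDER BY fs.FileDate) AS f;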
Of course you could have hit a bug / edge case.
This has definitely piqued my curiosity. Can you post the actual contents of the table variable and some sample data for the other table so we can create a local env?
This might be a case where creating the stats with STATS_STREAM would be useful, assuming you can pull the stream out.
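For what it's worth, STATS_STREAM is undocumented and unsupported, so treat this strictly as a sketch of the idea; the statistics name is hypothetical and the binary stream itself is elided.

-- Export the statistics blob on the source system (undocumented option):
DBCC SHOW_STATISTICS ('dbo.FileSummaries', 'IX_FileSummaries_FileDate') WITH STATS_STREAM;

-- Apply it on a test system; the 0x... stream comes from the output above and is
-- not reproduced here. ROWCOUNT / PAGECOUNT let you mimic the source table's size.
-- UPDATE STATISTICS dbo.FileSummaries (IX_FileSummaries_FileDate)
-- WITH STATS_STREAM = 0x..., ROWCOUNT = 420000000, PAGECOUNT = ...;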