February 23, 2016 at 2:40 am
Check that "auto create statistics" is enabled for the database.
Inner joins eliminate rows with NULL values in the join columns:
inner join PaleoData.Tax.PodrobnostiTmp P2
on p1.Family_13000 = p2.Family_13000
so you don't need to check that p1.Family_13000 and p2.Family_13000 are not null. Ordinarily, providing SQL Server with as much information as possible is a good thing, but there are exceptions.
Edit: also, check the result operator of the execution plan for warnings. Right-click on it and open the property sheet to double-check - warnings don't always appear on the result operator as the little yellow triangle.
February 23, 2016 at 3:18 am
Auto Create Statistics and Auto Update Statistics are both on; Auto Update Statistics Asynchronously is off. I tried creating an index on all the affected fields, but naturally it was too big. So I made an index on just the two fields in the query posted above, and it made the response instant. Then I removed the index again to see if NOT specifying the Nulls in the WHERE clause would make a difference, and the response is still instant, even with the index removed. Going to have to slog on to another one that is dragging.
I also tried removing the Null specifications in the WHERE clauses, but I get different results. What I'm trying for is a self-join of one instance that has A not Null and B not Null to all instances that have A not Null and B Null, in order to fill in the Null Bs from the one non-Null B.
Hope that makes some sense - not sure if I explained it properly. In any case, when I execute this:
select count(1)
from PaleoData.Tax.PodrobnostiTmp P1
inner join PaleoData.Tax.PodrobnostiTmp P2
on p1.Family_13000 = p2.Family_13000
where P1.Family_13000 is not null and p1.Suborder_11000 is not null
and p2.Family_13000 is not null and p2.Suborder_11000 is null
I get a count of zero, and when I execute this:
select count(1)
from PaleoData.Tax.PodrobnostiTmp P1
inner join PaleoData.Tax.PodrobnostiTmp P2
on p1.Family_13000 = p2.Family_13000
where p2.Family_13000 is not null and p2.Suborder_11000 is null
I get a count of 8,292,172, from a table of 70K records.
February 23, 2016 at 3:26 am
Your two queries are logically different - sorry if I was ambiguous. What I meant was to remove the NULL checks on the join columns only, like this:
SELECT COUNT(1)
FROM PaleoData.Tax.PodrobnostiTmp P1
INNER JOIN PaleoData.Tax.PodrobnostiTmp P2
ON p1.Family_13000 = p2.Family_13000
WHERE p1.Suborder_11000 IS NOT NULL
AND p2.Suborder_11000 IS NULL
February 23, 2016 at 4:36 am
ChrisM@Work (2/23/2016)
Your two queries are logically different - sorry if I was ambiguous. What I meant was to remove the NULL checks on the join columns only, like this:
SELECT COUNT(1)
FROM PaleoData.Tax.PodrobnostiTmp P1
INNER JOIN PaleoData.Tax.PodrobnostiTmp P2
ON p1.Family_13000 = p2.Family_13000
WHERE p1.Suborder_11000 IS NOT NULL
AND p2.Suborder_11000 IS NULL
Yeah, sorry about that - I'm getting cross-eyed from messing with this for so long. When I use the correct WHERE clause, as you have here, the results are the same.
February 23, 2016 at 7:02 am
I wouldn't have expected the EXISTS to perform any better, since that's exactly what the IF (SELECT COUNT...)>0 gets converted to by the optimizer, but it's always good to be thorough 🙂
In fact, based on what you've said so far, it's probably exactly that optimization that is causing the problem.
When you run an EXISTS (or the IF (SELECT COUNT...)>0 construct that gets converted to an EXISTS), SQL Server implicitly introduces a row goal of 1 into the plan.
Row goals basically tell SQL Server to work out a plan that is optimal for returning that number of rows; you can introduce them explicitly using TOP N or the FAST N hint.
Normally, as you'd expect, a smaller row goal will mean a query plan that can execute more quickly, since SQL Server can modify the plan to be optimal for returning fewer rows than the entire result set.
However, a row goal means that SQL Server scales down its estimates of how many rows have to come out of the other operators, and sometimes those scaled estimates are very, very wrong. That leads to some unexpected behavior, where a query with TOP N performs worse than the same query without TOP N (or a query wrapped in an EXISTS check performs worse than the same query returning its full result set).
There's an investigation of one way that can happen here, for example: http://blogs.msdn.com/b/bartd/archive/2012/03/14/row-goals-gone-rogue.aspx.
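To make that concrete, here's a rough sketch against your table (the columns are just borrowed from your earlier query; the trace flag line is only for experimenting, and assumes a build that supports it):
-- Both of these carry an implicit row goal of 1, so the plan is costed for finding one row quickly:
IF EXISTS (SELECT 1
FROM PaleoData.Tax.PodrobnostiTmp P1
INNER JOIN PaleoData.Tax.PodrobnostiTmp P2
ON P1.Family_13000 = P2.Family_13000
WHERE P1.Suborder_11000 IS NOT NULL AND P2.Suborder_11000 IS NULL)
PRINT 'At least one match exists';
-- The same row goal introduced explicitly:
SELECT TOP (1) P1.Family_13000
FROM PaleoData.Tax.PodrobnostiTmp P1
INNER JOIN PaleoData.Tax.PodrobnostiTmp P2
ON P1.Family_13000 = P2.Family_13000
WHERE P1.Suborder_11000 IS NOT NULL AND P2.Suborder_11000 IS NULL;
-- For comparison, trace flag 4138 tells the optimizer to ignore row goals (QUERYTRACEON requires sysadmin):
-- ... same query ... OPTION (QUERYTRACEON 4138);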
The quick workaround would be the one mentioned earlier, or variants on that theme: Pipe the output of the SELECT COUNT(*) query to a variable, and then check that variable for >0; basically anything that avoids the "clever" optimization introduced by SQL Server recognizing you're doing an existence check.
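For instance, something along these lines (using the Family/Suborder pair from your earlier query as a stand-in for whichever columns are actually being checked):
DECLARE @cnt int;
SELECT @cnt = COUNT(1)
FROM PaleoData.Tax.PodrobnostiTmp P1
INNER JOIN PaleoData.Tax.PodrobnostiTmp P2
ON P1.Family_13000 = P2.Family_13000
WHERE P1.Suborder_11000 IS NOT NULL
AND P2.Suborder_11000 IS NULL;
IF @cnt > 0
BEGIN
-- whatever the original IF block does goes here
PRINT 'Matches found';
END;
Because the count lands in a variable first, the optimizer never sees an existence check and doesn't apply the row goal.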
Long-term I'd prefer to figure out what about the plan with a row goal of 1 is so much worse; if it's just that the poor modified estimates from the row goal lead it to choose a different type of join, then you might be able to work around the problem with a join hint.
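Purely as a sketch - whether a hash join is actually the right shape depends on what the good plan looks like:
SELECT COUNT(1)
FROM PaleoData.Tax.PodrobnostiTmp P1
INNER HASH JOIN PaleoData.Tax.PodrobnostiTmp P2
ON P1.Family_13000 = P2.Family_13000
WHERE P1.Suborder_11000 IS NOT NULL
AND P2.Suborder_11000 IS NULL;
-- or leave the join keyword alone and append OPTION (HASH JOIN) to the whole query instead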
Even more preferable, as you and Chris have been working towards, is a modification of the query so that it performs acceptably within the EXISTS check.
Is it possible for you to post the actual plans for the stand-alone query and the one wrapped in IF(...)>0 (or the estimated plan for the latter if it takes too long to run)?
Cheers!
February 23, 2016 at 8:32 am
I wound up putting a single-field index on every column that is used in this circus, about twenty of them. That has made the performance acceptable. I'll put up some query plans later; right now I'm glad to have busted the logjam and am trying to get some work done with it.
February 23, 2016 at 11:28 am
Jacob, I think this article that you posted does, in fact, pinpoint the trouble: http://blogs.msdn.com/b/bartd/archive/2012/03/14/row-goals-gone-rogue.aspx
It matches what I recall, in that the most troublesome queries were those that returned zero matches.
Interesting reading there - always something new to learn.
February 23, 2016 at 1:25 pm
Here is one query plan that takes forever and returns a count of 0.
February 23, 2016 at 2:34 pm
Ah, interesting. Estimating 1240 rows and getting an actual 25 million is certainly problematic 🙂
I must have misunderstood you earlier, though. I thought the stand-alone COUNT queries were performing well, and it was only when they were wrapped in an EXISTS check or the equivalent IF (SELECT COUNT...)>0 that they performed poorly. This one seems to perform poorly and have grossly incorrect estimates as a stand-alone COUNT query.
It drastically underestimates the number of rows in the initial seek on PodrobnostiTmp for rows where SubPhylum_5000 is not null, and then also underestimates the matches in the seek on IX_Podrobnosti_Class_7000 (although most of that difference is because it gets executed 6000 times instead of once, due to that first underestimate).
This particular plan, given that it's not wrapped in an EXISTS or equivalent and that the root of most of the estimation error is from a seek for values that are not null, seems like it might be suffering from a more mundane issue with inaccurate statistics.
Cheers!
February 23, 2016 at 3:30 pm
I think this will identify whether or not such a row exists, and it will probably run much more quickly, since no massive join is involved, although a huge sort will be needed.
You should pick the most common initial search column -- whether it be Class_7000 or another column -- and cluster the table on that column.
Also/instead, you should create nonclustered indexes on common combinations of columns, in order of importance, such as, for the query below, (Class_7000, SubPhylum_5000). With that in place, I strongly suspect you'd get an almost instant response to the query below:
IF EXISTS(
SELECT 1 AS row_count
FROM PaleoData.Tax.PodrobnostiTmp pt
WHERE pt.Class_7000 IS NOT NULL
GROUP BY pt.Class_7000
HAVING MAX(CASE WHEN pt.SubPhylum_5000 IS NOT NULL THEN 1 ELSE 0 END) = 1 AND
MAX(CASE WHEN pt.SubPhylum_5000 IS NULL THEN 1 ELSE 0 END) = 1
)
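For illustration only - the index names here are made up, and if the table already has a clustered index you'd need to drop that one first:
CREATE CLUSTERED INDEX CL_PodrobnostiTmp_Class_7000
ON PaleoData.Tax.PodrobnostiTmp (Class_7000);
CREATE NONCLUSTERED INDEX IX_PodrobnostiTmp_Class_SubPhylum
ON PaleoData.Tax.PodrobnostiTmp (Class_7000, SubPhylum_5000);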
February 24, 2016 at 12:43 am
Jacob Wilkins (2/23/2016)
Ah, interesting. Estimating 1240 rows and getting an actual 25 million is certainly problematic 🙂
I must have misunderstood you earlier, though. I thought the stand-alone COUNT queries were performing well, and it was only when they were wrapped in an EXISTS check or the equivalent IF (SELECT COUNT...)>0 that they performed poorly. This one seems to perform poorly and have grossly incorrect estimates as a stand-alone COUNT query.
It drastically underestimates the number of rows in the initial seek on PodrobnostiTmp for rows where SubPhylum_5000 is not null, and then also underestimates the matches in the seek on IX_Podrobnosti_Class_7000 (although most of that difference is because it gets executed 6000 times instead of once, due to that first underestimate).
This particular plan, given that it's not wrapped in an EXISTS or equivalent and that the root of most of the estimation error is from a seek for values that are not null, seems like it might be suffering from a more mundane issue with inaccurate statistics.
Cheers!
No, you had it right. Since I've been continually messing with this, trying various suggestions and my own ideas, some changes have had an effect on overall performance. Initially, every query that caused problems did so only inside an EXISTS clause. When I highlighted and executed the inner COUNT query, performance was still not stellar, but acceptable (a second or two - not good enough for production, but adequate for my short-term needs). This one behaved poorly as an independent COUNT after I put indexes on the fields, when the overall performance had bumped up to something tolerable. No idea why. Attached is another plan, that runs quickly and returns zero, but still illustrates the wildly inaccurate estimates.
February 24, 2016 at 12:49 am
ScottPletcher (2/23/2016)
I think this will identify whether or not such a row exists, and it will probably run much more quickly, since no massive join is involved, although a huge sort will be needed.
You should pick the most common initial search column -- whether it be Class_7000 or another column -- and cluster the table on that column.
Also/instead, you should create nonclustered indexes on common combinations of columns, in order of importance, such as, for the query below, (Class_7000, SubPhylum_5000). With that in place, I strongly suspect you'd get an almost instant response to the query below:
IF EXISTS(
SELECT 1 AS row_count
FROM PaleoData.Tax.PodrobnostiTmp pt
WHERE pt.Class_7000 IS NOT NULL
GROUP BY pt.Class_7000
HAVING MAX(CASE WHEN pt.SubPhylum_5000 IS NOT NULL THEN 1 ELSE 0 END) = 1 AND
MAX(CASE WHEN pt.SubPhylum_5000 IS NULL THEN 1 ELSE 0 END) = 1
)
Thanks, I'll give that form a try. There is no column that would be appropriate for clustering as you suggest, since all columns are used at roughly the same rate. This is not a production table, with patterns of behavior, but a one-time conversion mechanism. All columns are populated, all columns are scanned, additional information is built from those scans, and the final result will be exported back to the production system, as soon as I get all the errors out of the data. At that point, all this will be discarded. My purpose here was not overall good performance for users, but a minimum acceptable level of performance for myself (minutes instead of hours) so that I could get the conversion completed.
February 24, 2016 at 9:45 am
pdanes (2/24/2016)
Jacob Wilkins (2/23/2016)
Ah, interesting. Estimating 1240 rows and getting an actual 25 million is certainly problematic 🙂
I must have misunderstood you earlier, though. I thought the stand-alone COUNT queries were performing well, and it was only when they were wrapped in an EXISTS check or the equivalent IF (SELECT COUNT...)>0 that they performed poorly. This one seems to perform poorly and have grossly incorrect estimates as a stand-alone COUNT query.
It drastically underestimates the number of rows in the initial seek on PodrobnostiTmp for rows where SubPhylum_5000 is not null, and then also underestimates the matches in the seek on IX_Podrobnosti_Class_7000 (although most of that difference is because it gets executed 6000 times instead of once, due to that first underestimate).
This particular plan, given that it's not wrapped in an EXISTS or equivalent and that the root of most of the estimation error is from a seek for values that are not null, seems like it might be suffering from a more mundane issue with inaccurate statistics.
Cheers!
No, you had it right. Since I've been continually messing with this, trying various suggestions and my own ideas, some changes have had an effect on overall performance. Initially, every query that caused problems did so only inside an EXISTS clause. When I highlighted and executed the inner COUNT query, performance was still not stellar, but acceptable (a second or two - not good enough for production, but adequate for my short-term needs). This one behaved poorly as an independent COUNT after I put indexes on the fields, when the overall performance had bumped up to something tolerable. No idea why. Attached is another plan, that runs quickly and returns zero, but still illustrates the wildly inaccurate estimates.
Thanks for the additional plan!
While the new plan has the same pattern of underestimates as the last one, it's on a wholly different scale. Dealing with an extra 50 thousand rows is bad, but it's not quite like dealing with an extra 25 million rows. Of course, this one also has the wrinkle of being converted to an EXISTS check, which affects how the plan works. Is this one also fast (relatively) when run as a stand-alone COUNT?
It's really looking more and more like the root of the problem lies largely with the statistics not correctly answering the question of how many non-NULL values are in those columns.
What's the output from the following command?
DBCC SHOW_STATISTICS ('PaleoData.Tax.PodrobnostiTmp','IX_Podrobnosti_SubPhylum_5000')
Switching the code to something more efficient will also help, of course. Scott's suggestion is one nice way of working towards that.
Another pattern that can perform better/worse/the same as that depending on available indexes and the optimizer's choices (helpful, eh?) is this:
IF EXISTS
(
SELECT Class_7000
FROM PaleoData.Tax.PodrobnostiTmp
WHERE Class_7000 IS NOT NULL
AND
SubPhylum_5000 IS NULL
INTERSECT
SELECT Class_7000
FROM PaleoData.Tax.PodrobnostiTmp
WHERE Class_7000 IS NOT NULL
AND
SubPhylum_5000 IS NOT NULL
)
I've found that the optimizer will sometimes choose to leverage indexes with this pattern that it won't with the aggregate pattern, in which case this can be a better choice. Outside of that specific situation, though, the aggregate approach should be faster since it only has to make one pass through the data.
Cheers!
February 24, 2016 at 12:08 pm
Here are the outputs from the SHOW_STATISTICS command:
Name                           Updated              Rows   Rows Sampled  Steps  Density  Average key length  String Index  Filter Expression  Unfiltered Rows
IX_Podrobnosti_SubPhylum_5000  Feb 23 2016 10:51PM  70715  13285         1      0        0                   YES           NULL               70715

All density  Average Length  Columns
1            0               SubPhylum_5000

RANGE_HI_KEY  RANGE_ROWS  EQ_ROWS  DISTINCT_RANGE_ROWS  AVG_RANGE_ROWS
NULL          0           70715    0                    1
Sorry about the ugly output - if there is a tidy way to post table output, I don't know how to do it.
All the queries run reasonably quickly now. Not great, but tolerable. For what I'm doing here, a few seconds each instead of a few milliseconds each is something I can live with. Minutes each instead of seconds each would not be, since I have several hundred such queries in the procedure. But I have all the garbage out of the data now, and am kind of on the 'home stretch'. Another day or two and I should be done with the whole thing, so there is probably little point in working on further speed enhancements. The suggestions from everyone here have gotten me where I needed, and what I learned in the process will unquestionably be valuable in the future.
Many thanks to all.
February 24, 2016 at 12:35 pm
pdanes (2/24/2016)
Here are the outputs from the SHOW_STATISTICS command:
Name                           Updated              Rows   Rows Sampled  Steps  Density  Average key length  String Index  Filter Expression  Unfiltered Rows
IX_Podrobnosti_SubPhylum_5000  Feb 23 2016 10:51PM  70715  13285         1      0        0                   YES           NULL               70715

All density  Average Length  Columns
1            0               SubPhylum_5000

RANGE_HI_KEY  RANGE_ROWS  EQ_ROWS  DISTINCT_RANGE_ROWS  AVG_RANGE_ROWS
NULL          0           70715    0                    1
Sorry about the ugly output - if there is a tidy way to post table output, I don't know how to do it.
All the queries run reasonably quickly now. Not great, but tolerable. For what I'm doing here, a few seconds each instead of a few milliseconds each is something I can live with. Minutes each instead of seconds each would not be, since I have several hundred such queries in the procedure. But I have all the garbage out of the data now, and am kind of on the 'home stretch'. Another day or two and I should be done with the whole thing, so there is probably little point in working on further speed enhancements. The suggestions from everyone here have gotten me where I needed, and what I learned in the process will unquestionably be valuable in the future.
Many thanks to all.
Thanks!
The easiest way is to use Results to Text instead of Results to Grid in SSMS. Then in your SSMS options, under "Query Results"->"SQL Server"->"Results to Text", set the Output Format to Column Aligned, and make sure headers are included. Then, so long as you have a fixed-width font set for Text Results, Results to Text output will format quite nicely in a "plain" code block here.
That output suggests the problem is indeed stats. The stats for that index indicate that every row has a NULL value for SubPhylum_5000. That's why it assumes the minimum for estimated rows coming out of that seek, 1; of course, the actual value is over 6000.
Since it looks like the stats were updated recently, I can only guess that either the stats update was sampled and missed the non-NULL values, or there were no non-NULL values in that column at the time of the stats update (e.g., maybe the table gets repopulated regularly with all NULLs in that column, and only later updated with some non-NULL values, but without enough data modifications to trigger a stats update).
Whenever the system can afford it, I'd try updating the stats with a full scan, and then trying some of the problematic queries with an OPTION (RECOMPILE) to force a new query plan (query plans don't always get recompiled after a statistics update).
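Something like this, for example, run when the system can spare the scan (the follow-up query is just one of the problematic patterns as a stand-in):
UPDATE STATISTICS PaleoData.Tax.PodrobnostiTmp IX_Podrobnosti_SubPhylum_5000 WITH FULLSCAN;
-- or refresh everything on the table:
-- UPDATE STATISTICS PaleoData.Tax.PodrobnostiTmp WITH FULLSCAN;
SELECT COUNT(1)
FROM PaleoData.Tax.PodrobnostiTmp
WHERE SubPhylum_5000 IS NOT NULL
OPTION (RECOMPILE);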
At any rate, I'm glad you're seeing light at the end of the tunnel! 🙂
Cheers!