December 7, 2016 at 4:35 pm
I have a large aggregation query on a large dataset that run forever and never return, but if I run the main part of the query by itself it returns in 5 minutes. I can explain it better with an example:
SELECT
*
FROM
(
SELECT
key,
SUM(value)
FROM
some_table
GROUP BY
key
) s
If I run just the code inside the virtual table, it runs in 5 min, but as soon as I simply nest it and select * from it, then it run indefinitely. Of course I need to do more than select *. The internal query does a bunch of aggregations, then the external query does additional calculations with the values.
I did however try literally just changing it to * to troubleshoot and I still have the same performance problem.
Any ideas what could be causing this?
December 7, 2016 at 7:01 pm
You must have the derived table joined to something else.
It may cause all sorts of issues which I don't see the reason to list.
If you post the whole query and DDL for the objects involved the guess work would by much less.
Alternatively, you may create a temp table, populate with subquery data and join it to the rest of the big query.
Don't forget to index it appropriately.
_____________
Code for TallyGenerator
December 7, 2016 at 8:09 pm
Look at the estimated query plans for each and see if the differences tell a story.
Also run sp_whoisactive during the execution and see various things you can see with that.
Best,
Kevin G. Boles
SQL Server Consultant
SQL MVP 2007-2012
TheSQLGuru on googles mail service
December 8, 2016 at 6:16 am
Actually I did have it joined to a different table at first, but I removed that thinking that the join was the problem.
When I do an estimated execution plan, with and without the additional table join, there are no additional recommended indexes.
I've seen this type of behavior before and never understood it. I think last time I ended up using the temp table method you suggested, but never understood why it was needed. If nothing else I guess I'll try that again, but this is a result set with hundreds of columns and millions of rows, so seems silly to put it into a table just to pull it back out again, but we'll see.
For a result set this large, would a temp table or a table variable be better?
December 8, 2016 at 6:35 am
rgp151 (12/7/2016)
I have a large aggregation query on a large dataset that run forever and never return, but if I run the main part of the query by itself it returns in 5 minutes. I can explain it better with an example:
SELECT
*
FROM
(
SELECT
key,
SUM(value)
FROM
some_table
GROUP BY
key
) s
If I run just the code inside the virtual table, it runs in 5 min, but as soon as I simply nest it and select * from it, then it run indefinitely. Of course I need to do more than select *. The internal query does a bunch of aggregations, then the external query does additional calculations with the values.
I did however try literally just changing it to * to troubleshoot and I still have the same performance problem.
Any ideas what could be causing this?
Can you post (as a .sqlplan attachment) the estimated plan for this, please -
Cheers
For fast, accurate and documented assistance in answering your questions, please read this article.
Understanding and using APPLY, (I) and (II) Paul White
Hidden RBAR: Triangular Joins / The "Numbers" or "Tally" Table: What it is and how it replaces a loop Jeff Moden
December 8, 2016 at 9:19 am
Here is a plan where its just the query encapsulated in SELECT * FROM ()
Remember, if I run just the part inside the outer nesting it runs in under 5 minutes. I just left it running last night as described above and it ran for 23 hours without finishing.
Thanks
Edit: I also included an execution plan without the encapsulation. This is when the query runs in <5 minutes. I can't see a difference between the two.
December 8, 2016 at 9:38 am
rgp151 (12/8/2016)
Here is a plan where its just the query encapsulated in SELECT * FROM ()Remember, if I run just the part inside the outer nesting it runs in under 5 minutes. I just left it running last night as described above and it ran for 23 hours without finishing.
Thanks
Edit: I also included an execution plan without the encapsulation. This is when the query runs in <5 minutes. I can't see a difference between the two.
Thanks.
First thought - comment out the ten lines which look like this:
COUNT(DISTINCT CASE WHEN
For fast, accurate and documented assistance in answering your questions, please read this article.
Understanding and using APPLY, (I) and (II) Paul White
Hidden RBAR: Triangular Joins / The "Numbers" or "Tally" Table: What it is and how it replaces a loop Jeff Moden
December 8, 2016 at 9:41 am
I'm sorry, but what you have posted up goes WAY beyond free support on a forum, at least for me. That is a metric butt-ton of ugly operations that could take many hours of effort to decipher and improve.
Best,
Kevin G. Boles
SQL Server Consultant
SQL MVP 2007-2012
TheSQLGuru on googles mail service
December 8, 2016 at 9:47 am
Yeah, I know its a lot. My main question is just what would cause performance to degrade so much just going from:
SELECT
x,
y,
z
FROM
t
to
SELECT
*
FROM
(
SELECT
x,
y,
z
FROM
t
) t
It should essentially be the same query with the same results either way. I don't know why simply nesting the query as a virtual table degrades the performance so much.
December 8, 2016 at 9:50 am
rgp151 (12/8/2016)
Yeah, I know its a lot. My main question is just what would cause performance to degrade so much just going from:SELECT
x,
y,
z
FROM
t
to
SELECT
*
FROM
(
SELECT
x,
y,
z
FROM
t
) t
It should essentially be the same query with the same results either way. I don't know why simply nesting the query as a virtual table degrades the performance so much.
You might be hitting a tipping point. The easiest to diagnose is optimiser timeout.
Did you try my suggestion? Just comment out the ten lines and rerun the simple query. Harvest the actual plan to post here. I have a cunning plan <<cough>>
For fast, accurate and documented assistance in answering your questions, please read this article.
Understanding and using APPLY, (I) and (II) Paul White
Hidden RBAR: Triangular Joins / The "Numbers" or "Tally" Table: What it is and how it replaces a loop Jeff Moden
December 8, 2016 at 10:05 am
The description doesn't begin to describe the real problem.
Without at least having the full query and actual execution plans, there's no real advice that can be given here. I agree that the COUNT DISTINCT might be the ones giving problems, but those are present on both plans.
Have you tried using temp tables?
December 8, 2016 at 12:55 pm
Yeah, I can't figure this out. But it is the case that if I simply insert the result of the nested query into a temp table first, then select from the temp table, the whole thing completes in a few minutes.
I don't get it. It doesn't make any sense and I wish I knew what was going on, but for now, at least putting into a temp table works.
December 8, 2016 at 1:46 pm
rgp151 (12/8/2016)
Yeah, I can't figure this out. But it is the case that if I simply insert the result of the nested query into a temp table first, then select from the temp table, the whole thing completes in a few minutes.I don't get it. It doesn't make any sense and I wish I knew what was going on, but for now, at least putting into a temp table works.
It may be frustrating, but often it's best to materialize something that otherwise can "get lost in the plan", so to speak. I can't begin to tell you how often I've had CTEs that pull a rather small number of rows end up choking the rest of the query until I put the results into a temp table instead and join to that temp table instead. It just seems sometimes that there are some queries that the optimizer simply can not fathom a good row-count guess for, until it's "written in stone", so to speak.
Steve (aka sgmunson) 🙂 🙂 🙂
Rent Servers for Income (picks and shovels strategy)
December 8, 2016 at 2:05 pm
rgp151 (12/8/2016)
Yeah, I can't figure this out. But it is the case that if I simply insert the result of the nested query into a temp table first, then select from the temp table, the whole thing completes in a few minutes.I don't get it. It doesn't make any sense and I wish I knew what was going on, but for now, at least putting into a temp table works.
That's usually because the query becomes too complex to create an optimal plan. It then chokes and cries for help. Are you sure that you can't simplify the query?
December 9, 2016 at 7:45 am
rgp151 (12/8/2016)
Here is a plan where its just the query encapsulated in SELECT * FROM ()Remember, if I run just the part inside the outer nesting it runs in under 5 minutes. I just left it running last night as described above and it ran for 23 hours without finishing.
Thanks
Edit: I also included an execution plan without the encapsulation. This is when the query runs in <5 minutes. I can't see a difference between the two.
"Showplan Comparison" indicates differences in the result operator alone (the SELECT operator). I suspect that one of these plans might not be the plan which was executed when you performed your timings. Also, does the non-encapsulated query have OPTION (RECOMPILE)?
In any case, with so many copies of a 5.5 million row table littering the plan, it's not going to win any races.
The reason I suggested commenting out the COUNT(DISTINCT... parts is that the remaining columns in the SELECT could probably be evaluated with a couple of simple aggregations and a single scan of the table - although you might have to perform a preaggregation to provoke the optimiser into generating such a plan. I'd tackle the COUNT(DISTINCT... parts as a separate exercise then merge the results, but make sure you really, REALLY need the DISTINCT, because it's hurting.
For fast, accurate and documented assistance in answering your questions, please read this article.
Understanding and using APPLY, (I) and (II) Paul White
Hidden RBAR: Triangular Joins / The "Numbers" or "Tally" Table: What it is and how it replaces a loop Jeff Moden
Viewing 15 posts - 1 through 14 (of 14 total)
You must be logged in to reply to this topic. Login to reply