Efficient way to count distinct

Question

Efficient way to count distinct

April 11, 2022 at 9:40 pm

This was removed by the editor as SPAM

Viewing 7 posts - 16 through 21 (of 21 total)

You must be logged in to reply to this topic. Login to reply

ollyjolly SSC Eights! Points: 833 More actions · Answer 1

Jeff Moden - Thursday, July 5, 2018 8:48 AM
ScottPletcher - Thursday, July 5, 2018 8:30 AM
Before loading the #Paid table, cluster it on all the GROUP BY columns:
Yr
,YrMnth
,Market
,AType
,Groups
,Code
,Classification
,CType
,ServCategory
,DiaCategory
That will avoid sorting when you run the main queries to get distinct values.
Not a bad idea but folks need to be careful here. The table contains 60+ million rows, according to the OP. That a lot of stuff to have in TempDB (which is where the Op is operating from). If you simply add a clustered index, it will take twice as much TempDB space because the heap won't be dropped until the CI is rebuilt. It would be much better, if possible, to do the initial build and population of this table in a minimally logged fashion with the CI in place. If this causes explosive growth on TempDB, the DBAs may come looking for the OP with a bat. 😉
The real key here is why in the hell is anyone looking for such a multi-year report, who the hell is going to review or use the 2 million row result set, and why are they recalculating history that will not change every time they run it? If this is a recurring task and it's actually an important task, what they really need to do is accelerate the process for either warehousing the historical aggregates or accelerate the process of getting the Indexed View in place.

To answer why the report is being generated for multiple years is to perform a year over year comparison to identify changes. The result set will be exported as a text file and linked to a BI software. The historical data can change between runs as there are retro changes that occur. I agree with the need to get this into the hands of IT to create an indexed view, but am trying to create a solution in the interim.

Another thought I had was maybe using row_number partitioned by all 10 fields that I am trying to join to create a unique key that could be used to reduce the join from 10 fields to one field. Not sure if this will work but I am attempting it as I was not able to figure out the syntax for the recursive CTE..

ScottPletcher SSC Guru Points: 101126 More actions · Answer 2

Jeff Moden - Thursday, July 5, 2018 8:48 AM
ScottPletcher - Thursday, July 5, 2018 8:30 AM
Before loading the #Paid table, cluster it on all the GROUP BY columns:
Yr
,YrMnth
,Market
,AType
,Groups
,Code
,Classification
,CType
,ServCategory
,DiaCategory
That will avoid sorting when you run the main queries to get distinct values.
Not a bad idea but folks need to be careful here. The table contains 60+ million rows, according to the OP. That a lot of stuff to have in TempDB (which is where the Op is operating from). If you simply add a clustered index, it will take twice as much TempDB space because the heap won't be dropped until the CI is rebuilt. It would be much better, if possible, to do the initial build and population of this table in a minimally logged fashion with the CI in place. If this causes explosive growth on TempDB, the DBAs may come looking for the OP with a bat. 😉
The real key here is why in the hell is anyone looking for such a multi-year report, who the hell is going to review or use the 2 million row result set, and why are they recalculating history that will not change every time they run it? If this is a recurring task and it's actually an important task, what they really need to do is accelerate the process for either warehousing the historical aggregates or accelerate the process of getting the Indexed View in place.

No, not true. If you create the clus index before loading the table -- as I explicitly specified very clearly -- there is no heap to convert. It's still just a single load. The data has to be sorted, but it has to be sorted at some point anyway to get the desired grouping.

SQL DBA,SQL Server MVP(07, 08, 09) "It's a dog-eat-dog world, and I'm wearing Milk-Bone underwear." "Norm", on "Cheers". Also from "Cheers", from "Carla": "You need to know 3 things about Tortelli men: Tortelli men draw women like flies; Tortelli men treat women like flies; Tortelli men's brains are in their flies".

ollyjolly SSC Eights! Points: 833 More actions · Answer 3

ScottPletcher - Thursday, July 5, 2018 9:02 AM

ollyjolly - Thursday, July 5, 2018 8:56 AM
ScottPletcher - Thursday, July 5, 2018 8:30 AM
Before loading the #Paid table, cluster it on all the GROUP BY columns:
Yr
,YrMnth
,Market
,AType
,Groups
,Code
,Classification
,CType
,ServCategory
,DiaCategory
That will avoid sorting when you run the main queries to get distinct values.
/quote]
I do have the following index after the #Paid insert statement. Here is the index.
CREATE CLUSTERED INDEX Indx_Paid ON #Paid (Code, Classification, ServCategory, DiaCategory, Groups, YrMnth, Yr, Market, CType, AType)

The order of the fields is based on number of distinct values in each field. The Code column has 253 distinct values followed by the Classification column which has 90 distinct values.....all the way to the AType column which only contains 2 distinct values.

But that's not how you're GROUPing them. You'll want the clus index to exactly match the grouping so that the query avoids a sort (another sort -- the rows will of course have to be sorted once when they are loaded, but that is true for any clus index).

Ok that makes sense. Change the order of the fields to this or the reserve order?
CREATE CLUSTERED INDEX Indx_Paid ON #Paid (DiaCategory, ServCategory, , CType, Classification, Code, Groups, AType, Market, YrMnth, Yr,)

ScottPletcher SSC Guru Points: 101126 More actions · Answer 4

ollyjolly - Thursday, July 5, 2018 9:14 AM

ScottPletcher - Thursday, July 5, 2018 9:02 AM
ollyjolly - Thursday, July 5, 2018 8:56 AM
ScottPletcher - Thursday, July 5, 2018 8:30 AM
Before loading the #Paid table, cluster it on all the GROUP BY columns:
Yr
,YrMnth
,Market
,AType
,Groups
,Code
,Classification
,CType
,ServCategory
,DiaCategory
That will avoid sorting when you run the main queries to get distinct values.
/quote]
I do have the following index after the #Paid insert statement. Here is the index.
CREATE CLUSTERED INDEX Indx_Paid ON #Paid (Code, Classification, ServCategory, DiaCategory, Groups, YrMnth, Yr, Market, CType, AType)

The order of the fields is based on number of distinct values in each field. The Code column has 253 distinct values followed by the Classification column which has 90 distinct values.....all the way to the AType column which only contains 2 distinct values.
But that's not how you're GROUPing them. You'll want the clus index to exactly match the grouping so that the query avoids a sort (another sort -- the rows will of course have to be sorted once when they are loaded, but that is true for any clus index).

Ok that makes sense. Change the order of the fields to this or the reserve order?
CREATE CLUSTERED INDEX Indx_Paid ON #Paid (DiaCategory, ServCategory, , CType, Classification, Code, Groups, AType, Market, YrMnth, Yr,)

Exactly the same as the GROUPing order. I listed the order in my original comment.

And create the clus index before loading the table. Huge waste of resources to load it as a heap and then create a clus index.

SQL DBA,SQL Server MVP(07, 08, 09) "It's a dog-eat-dog world, and I'm wearing Milk-Bone underwear." "Norm", on "Cheers". Also from "Cheers", from "Carla": "You need to know 3 things about Tortelli men: Tortelli men draw women like flies; Tortelli men treat women like flies; Tortelli men's brains are in their flies".

ollyjolly SSC Eights! Points: 833 More actions · Answer 5

ScottPletcher - Thursday, July 5, 2018 9:23 AM

ollyjolly - Thursday, July 5, 2018 9:14 AM
ScottPletcher - Thursday, July 5, 2018 9:02 AM
ollyjolly - Thursday, July 5, 2018 8:56 AM
ScottPletcher - Thursday, July 5, 2018 8:30 AM
Before loading the #Paid table, cluster it on all the GROUP BY columns:
Yr
,YrMnth
,Market
,AType
,Groups
,Code
,Classification
,CType
,ServCategory
,DiaCategory
That will avoid sorting when you run the main queries to get distinct values.
/quote]
I do have the following index after the #Paid insert statement. Here is the index.
CREATE CLUSTERED INDEX Indx_Paid ON #Paid (Code, Classification, ServCategory, DiaCategory, Groups, YrMnth, Yr, Market, CType, AType)

The order of the fields is based on number of distinct values in each field. The Code column has 253 distinct values followed by the Classification column which has 90 distinct values.....all the way to the AType column which only contains 2 distinct values.
But that's not how you're GROUPing them. You'll want the clus index to exactly match the grouping so that the query avoids a sort (another sort -- the rows will of course have to be sorted once when they are loaded, but that is true for any clus index).
Ok that makes sense. Change the order of the fields to this or the reserve order?
CREATE CLUSTERED INDEX Indx_Paid ON #Paid (DiaCategory, ServCategory, , CType, Classification, Code, Groups, AType, Market, YrMnth, Yr,)

Exactly the same as the GROUPing order. I listed the order in my original comment.

And create the clus index before loading the table. Huge waste of resources to load it as a heap and then create a clus index.

I just was unsure if the order in the index should be left to right or right to left. Assuming the order should be from left to right I will reverse the order of the index I posted. I will also ensure the index is added before loading the table. Thank you for your insight.

Jeff Moden SSC Guru Points: 1004414 More actions · Answer 6

ScottPletcher - Thursday, July 5, 2018 9:09 AM
Jeff Moden - Thursday, July 5, 2018 8:48 AM
ScottPletcher - Thursday, July 5, 2018 8:30 AM
Before loading the #Paid table, cluster it on all the GROUP BY columns:
Yr
,YrMnth
,Market
,AType
,Groups
,Code
,Classification
,CType
,ServCategory
,DiaCategory
That will avoid sorting when you run the main queries to get distinct values.
Not a bad idea but folks need to be careful here. The table contains 60+ million rows, according to the OP. That a lot of stuff to have in TempDB (which is where the Op is operating from). If you simply add a clustered index, it will take twice as much TempDB space because the heap won't be dropped until the CI is rebuilt. It would be much better, if possible, to do the initial build and population of this table in a minimally logged fashion with the CI in place. If this causes explosive growth on TempDB, the DBAs may come looking for the OP with a bat. 😉
The real key here is why in the hell is anyone looking for such a multi-year report, who the hell is going to review or use the 2 million row result set, and why are they recalculating history that will not change every time they run it? If this is a recurring task and it's actually an important task, what they really need to do is accelerate the process for either warehousing the historical aggregates or accelerate the process of getting the Indexed View in place.
No, not true. If you create the clus index before loading the table -- as I explicitly specified very clearly -- there is no heap to convert. It's still just a single load. The data has to be sorted, but it has to be sorted at some point anyway to get the desired grouping.

Ah... my apologies. I did miss the word BEFORE. Totally agree with you on the build-then-load philosophy here.

--Jeff Moden

RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

Change is inevitable... Change for the better is not.

Helpful Links:
How to post code problems
How to Post Performance Problems
Create a Tally Function (fnTally)