May 30, 2021 at 9:18 pm
Jonathan AC Roberts wrote:Well that didn't demonstrate any performance improvement from using group by.
It explained where the performance improvement comes from.
If you want to see it - just rewrite any of your DISTINCT queries running against a significant dataset to use GROUP BY instead of DISTINCT. You may not even need to set STATISTICS ON to notice the difference.
I have tried that before and not found group by to be faster.
May 30, 2021 at 11:42 pm
The table does not have an index that I can use to speed up the join
This is why I previously posted what I did. There are sometimes "tricks" that can be done to speed things up with existing indexes but we don't have a prayer of helping because we don't know what indexes ARE available and we know nothing about the table, especially the datatypes of the columns involved nor even what the blasted PK of the table is. 😉
Of course, it could still end up with no resolution but, right now, the answer is a sure "NO" because we don't actually know a thing about the table except that it's fairly large.
--Jeff Moden
Change is inevitable... Change for the better is not.
June 3, 2021 at 4:07 am
Sergiy wrote:Jonathan AC Roberts wrote:Well that didn't demonstrate any performance improvement from using group by.
It explained where the performance improvement comes from.
If you want to see it - just rewrite any of your DISTINCT queries running against a significant dataset to use GROUP BY instead of DISTINCT. You may not even need to set STATISTICS ON to notice the difference.
I have tried that before and not found group by to be faster.
Sure. And the earth is flat, as I've been told recently - with pretty much the same level of confidence.
Here is a simple script for you to execute:
CREATE TABLE #Test (N int)
CREATE INDEX TestN ON #Test(N)
INSERT INTO #Test
select N%10
FROM service.dbo.TallyGenerator (1, 10000000, null, default)
SET STATISTICS TIME ON
select distinct str(N, 9,2)
FROM #Test
SELECT str(N, 9,2)
FROM #Test
group by N
SET STATISTICS TIME OFF
DROP TABLE #Test
Compare CPU times for both of the queries.
On my machine DISTINCT costs from 8 to 10 times more than GROUP BY.
What about yours?
_____________
Code for TallyGenerator
June 3, 2021 at 9:57 am
Jonathan AC Roberts wrote:Sergiy wrote:Jonathan AC Roberts wrote:Well that didn't demonstrate any performance improvement from using group by.
It explained where the performance improvement comes from.
If you want to see it - just rewrite any of your DISTINCT queries running against a significant dataset to use GROUP BY instead of DISTINCT. You may not even need to set STATISTICS ON to notice the difference.
I have tried that before and not found group by to be faster.
Sure. And the earth is flat, as I've been told recently - with pretty much the same level of confidence.
Here is a simple script for you to execute:
CREATE TABLE #Test (N int)
CREATE INDEX TestN ON #Test(N)
INSERT INTO #Test
select N%10
FROM service.dbo.TallyGenerator (1, 10000000, null, default)
SET STATISTICS TIME ON
select distinct str(N, 9,2)
FROM #Test
SELECT str(N, 9,2)
FROM #Test
group by N
SET STATISTICS TIME OFF
DROP TABLE #TestCompare CPU times for both of the queries.
On my machine DISTINCT costs from 8 to 10 times more than GROUP BY.
What about yours?
Yes, since your previous reply I did a bit of research and found this article by Aaron Bertrand: https://sqlperformance.com/2017/01/t-sql-queries/surprises-assumptions-group-by-distinct
It wasn't 8 to 10 times faster but 6 times more CPU and about double the elapsed time.
SQL Server parse and compile time:
CPU time = 0 ms, elapsed time = 2 ms.
(10 rows affected)
SQL Server Execution Times:
CPU time = 561 ms, elapsed time = 166 ms.
(10 rows affected)
SQL Server Execution Times:
CPU time = 94 ms, elapsed time = 90 ms.
The solution you gave won't work as you have group by LEFT(p.plan) which isn't valid SQL:
group by p.acct, p.processdate,
m.product + m.productcode, LEFT(p.plan)
You should be grouping by this:
group by p.acct, p.processdate, m.product + m.productcode,
CASE WHEN LEFT(p.plan_id,1) = 'C' then 'cash'
WHEN LEFT(p.plan_id,1) = 'B' then 'BT'
ELSE p.plan_id
END
I tried with an example that's more similar to the OP's SQL
select N%10 N,left(convert(varchar(36),NEWID()),3) x
INTO #Test
FROM service.dbo.TallyGenerator (1, 10000000, null, default)
CREATE INDEX TestN ON #Test(N)
SET STATISTICS TIME ON
GO
SELECT DISTINCT
N,
CASE WHEN LEFT(x,1) = 'C' then 'cash'
WHEN LEFT(x,1) = 'B' then 'BT'
ELSE x
END AS balancetype
INTO #x
FROM #Test
GO
SELECT N,
CASE WHEN LEFT(x,1) = 'C' then 'cash'
WHEN LEFT(x,1) = 'B' then 'BT'
ELSE x
END AS balancetype
INTO #y
FROM #Test
GROUP BY N,CASE WHEN LEFT(x,1) = 'C' then 'cash'
WHEN LEFT(x,1) = 'B' then 'BT'
ELSE x
END
GO
SET STATISTICS TIME OFF
go
DROP TABLE #Test
DROP TABLE #x
DROP TABLE #y
There is no difference in the execution time in this case.
June 3, 2021 at 12:14 pm
There is no difference in the execution time in this case.
In this case - yes.
Because GROUP BY is applied to the output of the SELECT - making it equal to DISTINCT.
But your claim was :
I have tried that before and not found group by to be faster.
without specifying it's for some particular cases only.
My code proved that at least in some cases your claim is wrong.
So, my call would be - always choose GROUP BY before DISTINCT.
In worst case GROUP BY will be equally fast (or equally slow) as DISTINCT, in all other cases it will be significantly faster.
Would you argue that?
_____________
Code for TallyGenerator
June 3, 2021 at 1:00 pm
Jonathan AC Roberts wrote:There is no difference in the execution time in this case.
In this case - yes. Because GROUP BY is applied to the output of the SELECT - making it equal to DISTINCT.
But your claim was :
I have tried that before and not found group by to be faster.
without specifying it's for some particular cases only.
My code proved that at least in some cases your claim is wrong.
So, my call would be - always choose GROUP BY before DISTINCT. In worst case GROUP BY will be equally fast (or equally slow) as DISTINCT, in all other cases it will be significantly faster.
Would you argue that?
Yes, it is a bit surprising (to me anyway) that group by is sometimes more efficient than distinct.
For group by to be more efficient than distinct you need a calculation on a column that would be performed many times on distinct and fewer on group by, for this to be true the calculation on the column would not have to be in the group by clause. For example, in the code you provided if you changed:
SELECT str(N, 9,2)
FROM #Test
group by N
to
SELECT str(N, 9,2)
FROM #Test
group by str(N, 9,2)
Then the performance is the same as:
SELECT DISTINCT str(N, 9,2)
FROM #Test
In most SQL statements the calculated columns would be put in the group by. So your code is longer and less self-documenting than just using distinct. So I wouldn't always choose GROUP BY before DISTINCT, but each to his own.
June 3, 2021 at 4:01 pm
For the much more straightforward approach:
SELECT DISTINCT N
vs
SELECT N
...
GROUP BY N
my machine shows a slight edge for DISTINCT. That's what I've seen in normal usage, too. For a straight list of columns, DISTINCT is usually better, but often only slightly.
For more complex expressions / lists, GROUP BY is often better.
There's no "one rule fits every query" here.
SQL DBA,SQL Server MVP(07, 08, 09) "It's a dog-eat-dog world, and I'm wearing Milk-Bone underwear." "Norm", on "Cheers". Also from "Cheers", from "Carla": "You need to know 3 things about Tortelli men: Tortelli men draw women like flies; Tortelli men treat women like flies; Tortelli men's brains are in their flies".
June 5, 2021 at 2:26 am
That's what I've seen in normal usage, too
In my years working on SQL Server I extremely rarely saw a database where I could not improve performance for at least 10 times fold. Usually, the rate of improvement would by 50..100 times or more.
Would it be something we could name a "normal usage"?
Well, the statistics suggest it's pretty normal.
Should we stick with such "normality"? Well, I definitely don't want to follow that pattern.
If you're thinking in the paradigm of distinct values SELECTed by a query - your GROUP BY will not faster than DISTINCT.
But if you start thinking what records from what source you need to group together than things might become quite different. You might even think of aggregating records from a big table before joining them to another big table - I bet the machine(s) would really appreciate it. They might even spare your life after the uprising. Think about it. 🙂
_____________
Code for TallyGenerator
June 8, 2021 at 5:50 am
Came across an actual query which gave me an idea about another version of the test script for DISTINCT vs. GROUP BY.
create table #BalanceType (
id int identity(1,1) primary key nonclustered,
TypeName nvarchar(50) UNIQUE CLUSTERED )
insert into #BalanceType(TypeName)
select 'C' union select 'B' union select 'other' union select 'D' union select 'E'
select *, convert(binary(2), NEWID()), convert(int, convert(binary(2), NEWID())), convert(int, convert(binary(2), NEWID())) % 5 + 1
from #BalanceType BT
inner join service.dbo.TallyGenerator (1, 1000, null, default) T ON BT.id = N % 5 + 1
select N%10 N, N % 5 + 1 TypeID
INTO #Accountbalance
FROM service.dbo.TallyGenerator (1, 10000000, null, default) T
inner join #BalanceType BT on BT.id = N % 5 + 1
CREATE INDEX TestN ON #Accountbalance(N)
SET STATISTICS TIME ON
GO
SELECT DISTINCT
N, balancetype
FROM #Accountbalance AB
inner join (select id,
CASE WHEN LEFT(TypeName,1) = 'C' then 'cash'
WHEN LEFT(TypeName,1) = 'B' then 'BT'
ELSE TypeName
END AS balancetype
from #BalanceType) BT ON BT.id = AB.TypeID
GO
SELECT N, balancetype
FROM #Accountbalance AB
inner join (select id,
CASE WHEN LEFT(TypeName,1) = 'C' then 'cash'
WHEN LEFT(TypeName,1) = 'B' then 'BT'
ELSE TypeName
END AS balancetype
from #BalanceType) BT ON BT.id = AB.TypeID
GROUP BY N, balancetype
GO
SET STATISTICS TIME OFF
go
DROP TABLE #Accountbalance
DROP table #BalanceType
GROUP BY is times faster on my laptop.
_____________
Code for TallyGenerator
June 8, 2021 at 7:55 am
If you do not know the composition of the (clustered) indexes of the huge table, how can you suggest any significant performance improvement? It will do a full table scan anyway if you are not using at least the first part(s) of one of the (clustered) indexes in your search conditions.
Gregory Liénard
Microsoft Data Engineer
https://seopageoptimizer.com/: analyses billions of webpages
https://keyboost.com/: analyses trillions of links
June 8, 2021 at 9:22 am
Came across an actual query which gave me an idea about another version of the test script for DISTINCT vs. GROUP BY.
I tried it and they were about the same:
SQL Server parse and compile time:
CPU time = 0 ms, elapsed time = 0 ms.
(10 rows affected)
SQL Server Execution Times:
CPU time = 750 ms, elapsed time = 244 ms.
SQL Server parse and compile time:
CPU time = 0 ms, elapsed time = 0 ms.
(10 rows affected)
SQL Server Execution Times:
CPU time = 812 ms, elapsed time = 219 ms.
With the same execution plan:
June 8, 2021 at 2:37 pm
Came across an actual query which gave me an idea about another version of the test script for DISTINCT vs. GROUP BY.
create table #BalanceType (
id int identity(1,1) primary key nonclustered,
TypeName nvarchar(50) UNIQUE CLUSTERED )
insert into #BalanceType(TypeName)
select 'C' union select 'B' union select 'other' union select 'D' union select 'E'
select *, convert(binary(2), NEWID()), convert(int, convert(binary(2), NEWID())), convert(int, convert(binary(2), NEWID())) % 5 + 1
from #BalanceType BT
inner join service.dbo.TallyGenerator (1, 1000, null, default) T ON BT.id = N % 5 + 1
select N%10 N, N % 5 + 1 TypeID
INTO #Accountbalance
FROM service.dbo.TallyGenerator (1, 10000000, null, default) T
inner join #BalanceType BT on BT.id = N % 5 + 1
CREATE INDEX TestN ON #Accountbalance(N)
SET STATISTICS TIME ON
GO
SELECT DISTINCT
N, balancetype
FROM #Accountbalance AB
inner join (select id,
CASE WHEN LEFT(TypeName,1) = 'C' then 'cash'
WHEN LEFT(TypeName,1) = 'B' then 'BT'
ELSE TypeName
END AS balancetype
from #BalanceType) BT ON BT.id = AB.TypeID
GO
SELECT N, balancetype
FROM #Accountbalance AB
inner join (select id,
CASE WHEN LEFT(TypeName,1) = 'C' then 'cash'
WHEN LEFT(TypeName,1) = 'B' then 'BT'
ELSE TypeName
END AS balancetype
from #BalanceType) BT ON BT.id = AB.TypeID
GROUP BY N, balancetype
GO
SET STATISTICS TIME OFF
go
DROP TABLE #Accountbalance
DROP table #BalanceTypeGROUP BY is times faster on my laptop.
It's not SQL's fault if you write sloppy code. It's extremely inefficient/wasteful to create a nonclus index on #Accountbalance, which SQL can't use anyway. Therefore, I changed it to clus. After that, I get identical plans with much more efficient MERGE joins than the joins in the original plans.
Even if you don't want to / can't correct the indexes, you should add a DISTINCT to let SQL know it can remove dup rows earlier:
SELECT N, balancetype
FROM #Accountbalance AB
inner join (select distinct id, --<<-- add DISTINCT here yourself
CASE WHEN LEFT(TypeName,1) = 'C' then 'cash'
WHEN LEFT(TypeName,1) = 'B' then 'BT'
ELSE TypeName
END AS balancetype
from #BalanceType) BT ON BT.id = AB.TypeID
GROUP BY N, balancetype
SQL DBA,SQL Server MVP(07, 08, 09) "It's a dog-eat-dog world, and I'm wearing Milk-Bone underwear." "Norm", on "Cheers". Also from "Cheers", from "Carla": "You need to know 3 things about Tortelli men: Tortelli men draw women like flies; Tortelli men treat women like flies; Tortelli men's brains are in their flies".
Viewing 12 posts - 16 through 26 (of 26 total)
You must be logged in to reply to this topic. Login to reply