April 8, 2014 at 12:09 am
Comments posted to this topic are about the item Divide and Conquer - Performance Tuning
April 8, 2014 at 5:43 am
I wonder if a checksum could be an alternative to the two-step process suggested by the article?
Something to this effect:
; WITH CTE
as (
SELECT worker_id,
CHECKSUM_AGG(CHECKSUM(rate_code)) as chk
FROM #worker_rate
GROUP BY worker_id
)
SELECT worker_id,
DENSE_RANK() OVER (ORDER BY chk) as SNO
FROM CTE
April 8, 2014 at 6:53 am
That would indeed be much faster and simpler if it weren't for the possibility of duplicate checksums.
This link http://ask.sqlservercentral.com/questions/7253/limitations-w-using-checksum-for-comparing-data-wh.html mentions this very problem, and the only way around it is additional processing, which adds to the complexity.
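The risk isn't hypothetical: CHECKSUM produces a 32-bit value, so the birthday bound makes a collision somewhere in a large table quite likely. A rough sketch in Python (the function name is illustrative, not from any library):

```python
import math

def collision_probability(n_groups: int, hash_bits: int = 32) -> float:
    """Approximate birthday-bound probability that at least two of
    n_groups distinct inputs share the same hash value."""
    space = 2.0 ** hash_bits
    # 1 - exp(-n(n-1)/(2m)) is the standard birthday approximation
    return 1.0 - math.exp(-n_groups * (n_groups - 1) / (2.0 * space))

# With ~100k distinct worker groups and a 32-bit checksum, a collision
# somewhere in the data is already more likely than not:
print(collision_probability(100_000))
```

So any CHECKSUM_AGG-based grouping needs a verification pass to be safe, which is exactly the extra processing the linked thread describes.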
April 8, 2014 at 7:33 am
Thank you for the answer, Claire.
I came to a similar conclusion after a bit of googling, but you were faster.
April 8, 2014 at 8:49 am
A little off topic...However....
How does one get SQLPrep or is it even available for public consumption?
I do not see a link at http://www.csqls.com/
Thanks
April 8, 2014 at 9:48 am
It's not available to buy yet, but if you go to the website, click the contact us link, and ask to be an alpha tester, you might get to try it early.
April 8, 2014 at 7:52 pm
Claire,
Interesting article. I have indeed found cases where divide and conquer can be helpful to solve a really tricky query form. But I'm afraid I'm going to need to point out a few fallacies in your arguments.
1. First off, you're comparing apples to oranges when you compare the timing of building a delimited character string of the rates against that of an encoded integer value. The string concatenation is always going to be slower.
2. Your method of creating the delimited list of rate codes can be improved. The reason it spent so much time in those nasty sorts was (probably) the use of DISTINCT.
SELECT grp=COUNT(*), worker_id, combo=STUFF(
(
SELECT ',' + rate_code
FROM #worker_rate b
WHERE a.worker_id = b.worker_id
ORDER BY rate_code
FOR XML PATH('')
), 1, 1, '')
FROM #worker_rate a
GROUP BY worker_id
ORDER BY 1, 2;
3. You can do the same thing as your encoding approach in a single set-based query.
WITH lu AS
(
SELECT rate_code
,combo=POWER(2, ROW_NUMBER() OVER (ORDER BY rate_code)-1)
FROM
(
SELECT rate_code
FROM #worker_rate
GROUP BY rate_code
) a
)
SELECT grp=COUNT(*), worker_id, combo=SUM(combo)
FROM #worker_rate a
JOIN lu b ON a.rate_code = b.rate_code
GROUP BY worker_id
ORDER BY 1, 2;
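The trick in the lu CTE is that POWER(2, ROW_NUMBER() - 1) assigns each distinct rate code a power-of-two weight, so the SUM per worker is a unique bitmask fingerprint of that worker's set of codes. Outside the database, the same idea can be sketched in Python (sample data is made up for illustration):

```python
# (worker_id, rate_code) pairs, standing in for #worker_rate
rows = [(1, 'a'), (1, 'b'), (2, 'a'), (2, 'b'), (3, 'a'), (3, 'b'), (3, 'c')]

# Equivalent of the lu CTE: POWER(2, ROW_NUMBER() OVER (ORDER BY rate_code) - 1)
weights = {code: 2 ** i for i, code in enumerate(sorted({c for _, c in rows}))}

# Equivalent of SUM(combo) ... GROUP BY worker_id
combos = {}
for worker_id, code in rows:
    combos[worker_id] = combos.get(worker_id, 0) + weights[code]

# workers 1 and 2 share combo 3 (a+b); worker 3 gets 7 (a+b+c)
print(combos)
```

Because each weight is a distinct bit, two workers get the same sum if and only if they hold exactly the same set of rate codes, which is what makes the integer encoding a safe grouping key (unlike a checksum).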
The timing results I got for this (at 10M rows) were:
Method ElapsedMS
Original concat of rate codes as chars 136593
Improved concat of rate codes as chars 106220
'Optimal' Example of Divide and Conquer 29016
Divide and Conquer but in one step 24006
Note that all of this was pre-coffee, so I may have made an error somewhere, but you can check me yourself. Here's the test harness I used.
create table #worker_rate (
worker_id int
,rate_code nvarchar(10) collate database_default
,primary key (rate_code, worker_id)
)
declare @rows int = 10000000 --How many rows of data to create
--Create test data
;with tt(x) as (
select row_number() over (order by (select null)) from
(values (0),(0),(0),(0),(0),(0),(0),(0),(0),(0)) x10 (n),
(values (0),(0),(0),(0),(0),(0),(0),(0),(0),(0)) x100 (n),
(values (0),(0),(0),(0),(0),(0),(0),(0),(0),(0)) x1000 (n),
(values (0),(0),(0),(0),(0),(0),(0),(0),(0),(0)) x10000 (n),
(values (0),(0),(0),(0),(0),(0),(0),(0),(0),(0)) x100000 (n),
(values (0),(0),(0),(0),(0),(0),(0),(0),(0),(0)) x1000000 (n),
(values (0),(0),(0),(0),(0),(0),(0),(0),(0),(0)) x10000000 (n)
)
insert into #worker_rate (worker_id, rate_code)
select top(@rows) tt.x, rc.code
from tt
inner join (values ('a'),('b'),('c')) rc(code)
on tt.x % 4 <> 0
or rc.code <> 'c'
DECLARE @StartDT DATETIME = GETDATE();
--Original example
;with c as (
select distinct w.worker_id, r.combo
from #worker_rate w
cross apply (
select combo = stuff((select ',' + r.rate_code
from #worker_rate r
where w.worker_id = r.worker_id
order by r.rate_code
for xml path('')),1,1,'')
) r
)
select grp = dense_rank() over (order by c.combo), c.worker_id, c.combo
from c
order by 1, 2;
SELECT [Method]='Original concat of rate codes as chars'
,[ElapsedMS]=DATEDIFF(millisecond, @StartDT, GETDATE());
SELECT @StartDT = GETDATE();
SELECT grp=COUNT(*), worker_id, combo=STUFF(
(
SELECT ',' + rate_code
FROM #worker_rate b
WHERE a.worker_id = b.worker_id
ORDER BY rate_code
FOR XML PATH('')
), 1, 1, '')
FROM #worker_rate a
GROUP BY worker_id
ORDER BY 1, 2;
SELECT [Method]='Improved concat of rate codes as chars'
,[ElapsedMS]=DATEDIFF(millisecond, @StartDT, GETDATE());
SELECT @StartDT = GETDATE();
--Optimal example
create table #lookup (
rate_code nvarchar(10) collate database_default primary key
,combo smallint
)
insert into #lookup (rate_code,combo)
select q.rate_code, power(2, q.combo-1)
from (
select distinct rate_code, combo=dense_rank() over (order by rate_code)
from #worker_rate
) q
select grp=dense_rank() over (order by combo), worker_id, combo
from (
select p.worker_id, combo=sum(l.combo)
from #worker_rate p
inner join #lookup l on l.rate_code = p.rate_code
group by p.worker_id
) q
order by 1, 2
SELECT [Method]='''Optimal'' Example of Divide and Conquer'
,[ElapsedMS]=DATEDIFF(millisecond, @StartDT, GETDATE());
SELECT @StartDT = GETDATE();
WITH lu AS
(
SELECT rate_code
,combo=POWER(2, ROW_NUMBER() OVER (ORDER BY rate_code)-1)
FROM
(
SELECT rate_code
FROM #worker_rate
GROUP BY rate_code
) a
)
SELECT grp=COUNT(*), worker_id, combo=SUM(combo)
FROM #worker_rate a
JOIN lu b ON a.rate_code = b.rate_code
GROUP BY worker_id
ORDER BY 1, 2;
SELECT [Method]='Divide and Conquer but in one step'
,[ElapsedMS]=DATEDIFF(millisecond, @StartDT, GETDATE());
SELECT @StartDT = GETDATE();
GO
drop table #lookup
drop table #worker_rate
My thought question: Have you ever been told that your query runs too fast?
My advice:
INDEXing a poor-performing query is like putting sugar on cat food. Yeah, it probably tastes better but are you sure you want to eat it?
The path of least resistance can be a slippery slope. Take care that fixing your fixes of fixes doesn't snowball and end up costing you more than fixing the root cause would have in the first place.
Need to UNPIVOT? Why not CROSS APPLY VALUES instead?
Since random numbers are too important to be left to chance, let's generate some!
Learn to understand recursive CTEs by example.
April 9, 2014 at 4:42 am
Since there are very few rate codes present, I thought of a way you can have your cake (return delimited strings) and eat it too (get relatively good performance).
1. The all in one query.
WITH lu AS
(
SELECT rate_code
,combo=POWER(2, ROW_NUMBER() OVER (ORDER BY rate_code)-1)
FROM
(
SELECT rate_code
FROM #worker_rate
GROUP BY rate_code
) a
),
UNIQUEnTuples (n, Tuples, ID) AS (
SELECT combo, CAST(rate_code AS VARCHAR(8000)), rate_code
FROM lu
UNION ALL
SELECT n.n+t.combo, CAST(t.rate_code AS VARCHAR(8000)) + ',' + n.Tuples, rate_code
FROM UNIQUEnTuples n
CROSS APPLY (
SELECT combo, rate_code
FROM lu t
WHERE t.rate_code < n.ID) t
)
SELECT grp, worker_id, combo=Tuples
FROM
(
SELECT grp=COUNT(*), worker_id, combo=SUM(combo)
FROM #worker_rate a
JOIN lu b ON a.rate_code = b.rate_code
GROUP BY worker_id
) a
JOIN UNIQUEnTuples b ON a.combo = b.n
ORDER BY 1, 2;
2. Divide and conquer (the query above split into 3 steps)
WITH lu AS
(
SELECT rate_code
,combo=POWER(2, ROW_NUMBER() OVER (ORDER BY rate_code)-1)
FROM
(
SELECT rate_code
FROM #worker_rate
GROUP BY rate_code
) a
)
SELECT rate_code, combo, combo2=CAST(NULL AS VARCHAR(8000))
INTO #lookup2
FROM lu;
WITH UNIQUEnTuples (n, Tuples, ID) AS
(
SELECT combo, CAST(rate_code AS VARCHAR(8000)), rate_code
FROM #lookup2
UNION ALL
SELECT n.n+t.combo, CAST(t.rate_code AS VARCHAR(8000)) + ',' + n.Tuples, rate_code
FROM UNIQUEnTuples n
CROSS APPLY (
SELECT combo, rate_code
FROM #lookup2 t
WHERE t.rate_code < n.ID) t
)
MERGE #lookup2 t
USING UNIQUEnTuples s
ON t.combo = s.n
WHEN MATCHED THEN
UPDATE SET combo2 = s.Tuples
WHEN NOT MATCHED THEN
INSERT (rate_code, combo, combo2)
VALUES (s.Tuples, s.n, s.Tuples);
SELECT grp, worker_id, combo=combo2
FROM
(
SELECT grp=COUNT(*), worker_id, combo=SUM(combo)
FROM #worker_rate a
JOIN #lookup2 b ON a.rate_code = b.rate_code
GROUP BY worker_id
) a
JOIN #lookup2 b ON a.combo = b.combo
ORDER BY 1, 2;
Timings (just my 4 queries at 10M rows):
Method ElapsedMS
Improved concat of rate codes as chars 105503
Divide and Conquer but in one step 24816
Use nTuples to create strings 46926
Divide and Conquer 30913
And the test harness:
create table #worker_rate (
worker_id int
,rate_code nvarchar(10) collate database_default
,primary key (rate_code, worker_id)
)
declare @rows int = 10000000 --How many rows of data to create
--Create test data
;with tt(x) as (
select row_number() over (order by (select null)) from
(values (0),(0),(0),(0),(0),(0),(0),(0),(0),(0)) x10 (n),
(values (0),(0),(0),(0),(0),(0),(0),(0),(0),(0)) x100 (n),
(values (0),(0),(0),(0),(0),(0),(0),(0),(0),(0)) x1000 (n),
(values (0),(0),(0),(0),(0),(0),(0),(0),(0),(0)) x10000 (n),
(values (0),(0),(0),(0),(0),(0),(0),(0),(0),(0)) x100000 (n),
(values (0),(0),(0),(0),(0),(0),(0),(0),(0),(0)) x1000000 (n),
(values (0),(0),(0),(0),(0),(0),(0),(0),(0),(0)) x10000000 (n)
)
insert into #worker_rate (worker_id, rate_code)
select top(@rows) tt.x, rc.code
from tt
inner join (values ('a'),('b'),('c')) rc(code)
on tt.x % 4 <> 0
or rc.code <> 'c'
DECLARE @StartDT DATETIME = GETDATE();
--Original example
--;with c as (
-- select distinct w.worker_id, r.combo
-- from #worker_rate w
-- cross apply (
-- select combo = stuff((select ',' + r.rate_code
-- from #worker_rate r
-- where w.worker_id = r.worker_id
-- order by r.rate_code
-- for xml path('')),1,1,'')
-- ) r
--)
--select grp = dense_rank() over (order by c.combo), c.worker_id, c.combo
--from c
--order by 1, 2;
--SELECT [Method]='Original concat of rate codes as chars'
-- ,[ElapsedMS]=DATEDIFF(millisecond, @StartDT, GETDATE());
--SELECT @StartDT = GETDATE();
SELECT grp=COUNT(*), worker_id, combo=STUFF(
(
SELECT ',' + rate_code
FROM #worker_rate b
WHERE a.worker_id = b.worker_id
ORDER BY rate_code
FOR XML PATH('')
), 1, 1, '')
FROM #worker_rate a
GROUP BY worker_id
ORDER BY 1, 2;
SELECT [Method]='Improved concat of rate codes as chars'
,[ElapsedMS]=DATEDIFF(millisecond, @StartDT, GETDATE());
SELECT @StartDT = GETDATE();
--Optimal example
--create table #lookup (
-- rate_code nvarchar(10) collate database_default primary key
-- ,combo smallint
--)
--insert into #lookup (rate_code,combo)
--select q.rate_code, power(2, q.combo-1)
--from (
-- select distinct rate_code, combo=dense_rank() over (order by rate_code)
-- from #worker_rate
--) q
--select grp=dense_rank() over (order by combo), worker_id, combo
--from (
-- select p.worker_id, combo=sum(l.combo)
-- from #worker_rate p
-- inner join #lookup l on l.rate_code = p.rate_code
-- group by p.worker_id
--) q
--order by 1, 2
--SELECT [Method]='''Optimal'' Example of Divide and Conquer'
-- ,[ElapsedMS]=DATEDIFF(millisecond, @StartDT, GETDATE());
--SELECT @StartDT = GETDATE();
WITH lu AS
(
SELECT rate_code
,combo=POWER(2, ROW_NUMBER() OVER (ORDER BY rate_code)-1)
FROM
(
SELECT rate_code
FROM #worker_rate
GROUP BY rate_code
) a
)
SELECT grp=COUNT(*), worker_id, combo=SUM(combo)
FROM #worker_rate a
JOIN lu b ON a.rate_code = b.rate_code
GROUP BY worker_id
ORDER BY 1, 2;
SELECT [Method]='Divide and Conquer but in one step'
,[ElapsedMS]=DATEDIFF(millisecond, @StartDT, GETDATE());
SELECT @StartDT = GETDATE();
WITH lu AS
(
SELECT rate_code
,combo=POWER(2, ROW_NUMBER() OVER (ORDER BY rate_code)-1)
FROM
(
SELECT rate_code
FROM #worker_rate
GROUP BY rate_code
) a
),
UNIQUEnTuples (n, Tuples, ID) AS (
SELECT combo, CAST(rate_code AS VARCHAR(8000)), rate_code
FROM lu
UNION ALL
SELECT n.n+t.combo, CAST(t.rate_code AS VARCHAR(8000)) + ',' + n.Tuples, rate_code
FROM UNIQUEnTuples n
CROSS APPLY (
SELECT combo, rate_code
FROM lu t
WHERE t.rate_code < n.ID) t
)
SELECT grp, worker_id, combo=Tuples
FROM
(
SELECT grp=COUNT(*), worker_id, combo=SUM(combo)
FROM #worker_rate a
JOIN lu b ON a.rate_code = b.rate_code
GROUP BY worker_id
) a
JOIN UNIQUEnTuples b ON a.combo = b.n
ORDER BY 1, 2;
SELECT [Method]='Use nTuples to create strings'
,[ElapsedMS]=DATEDIFF(millisecond, @StartDT, GETDATE());
SELECT @StartDT = GETDATE();
WITH lu AS
(
SELECT rate_code
,combo=POWER(2, ROW_NUMBER() OVER (ORDER BY rate_code)-1)
FROM
(
SELECT rate_code
FROM #worker_rate
GROUP BY rate_code
) a
)
SELECT rate_code, combo, combo2=CAST(NULL AS VARCHAR(8000))
INTO #lookup2
FROM lu;
WITH UNIQUEnTuples (n, Tuples, ID) AS
(
SELECT combo, CAST(rate_code AS VARCHAR(8000)), rate_code
FROM #lookup2
UNION ALL
SELECT n.n+t.combo, CAST(t.rate_code AS VARCHAR(8000)) + ',' + n.Tuples, rate_code
FROM UNIQUEnTuples n
CROSS APPLY (
SELECT combo, rate_code
FROM #lookup2 t
WHERE t.rate_code < n.ID) t
)
MERGE #lookup2 t
USING UNIQUEnTuples s
ON t.combo = s.n
WHEN MATCHED THEN
UPDATE SET combo2 = s.Tuples
WHEN NOT MATCHED THEN
INSERT (rate_code, combo, combo2)
VALUES (s.Tuples, s.n, s.Tuples);
SELECT grp, worker_id, combo=combo2
FROM
(
SELECT grp=COUNT(*), worker_id, combo=SUM(combo)
FROM #worker_rate a
JOIN #lookup2 b ON a.rate_code = b.rate_code
GROUP BY worker_id
) a
JOIN #lookup2 b ON a.combo = b.combo
ORDER BY 1, 2;
SELECT [Method]='Divide and Conquer'
,[ElapsedMS]=DATEDIFF(millisecond, @StartDT, GETDATE());
SELECT @StartDT = GETDATE();
GO
--drop table #lookup;
drop table #lookup2;
drop table #worker_rate;
Note that neither of these new queries will do well if you have scores of rate codes.
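The reason is the recursive UNIQUEnTuples CTE: it materialises every non-empty combination of rate codes, and with k distinct codes that is 2**k - 1 lookup rows. A quick back-of-the-envelope check:

```python
# Number of non-empty subsets (i.e. rows UNIQUEnTuples must build)
# for k distinct rate codes: 2**k - 1. Fine for a handful of codes,
# hopeless once you have scores of them.
for k in (3, 10, 20, 30):
    print(k, 2 ** k - 1)
```

Three codes need only 7 rows, but 30 codes would need over a billion, so the string-building approach only pays off when the code list stays small.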
Links to explain UNIQUEnTuples and how I made it faster.
April 11, 2014 at 3:41 pm
Sorry for the late reply, but it's been a very busy week for me. I finally got time to look at these posts, and it's always great to see people thinking about how to improve solutions, as that's where you get a broader spectrum of knowledge.
In response to Post #1559754.
1. I would prefer apples, pineapples or a tin of tropical fruit and I know which one I prefer landing on my head powered by gravity alone 😉
2. You spotted one of the further optimisations I left out as the article was getting quite large. It's not the DISTINCT that was causing the issue, though; it was the ranking function over duplicate rows, and by moving it to the outer query, as you showed, you gain a little more performance. Well spotted.
The lookup table query changes to the following to get this performance increase.
insert into #lookup (rate_code,combo)
select q.rate_code, power(2, row_number() over (order by q.rate_code) - 1)
from (select distinct rate_code from #worker_rate) q
3. While it's useful to be able to produce the results in a single query, you lose that important index, so when the number of rate codes increases the performance drops, as you pointed out in your second post. I just wish MS would add a feature to post-attach indexes to temporary worktables, doing away with the need to split the query.
So starting with 1 million rows your single query approach on my server (different server this time so we'll compare both on the same server) with the same three rate codes came out at
1,842ms CPU and 1,177ms elapsed time
The article example in its original form took
2,356ms CPU and 1,175ms elapsed time (after adding the two stages)
The article example with the enhanced lookup table code took
2,013ms CPU and 1,195ms elapsed time (after adding the two stages)
Increasing the number of rate codes (which reduces the number of worker ids, as we are still only taking the top 1 million rows), your single query takes
966ms CPU and 320ms elapsed time
The enhanced article dual stage query takes
889ms CPU and 397ms elapsed time
So great spot with ROW_NUMBER instead of DENSE_RANK, but not quite enough of a performance boost to warrant switching to the single query, in my opinion. The result output also became non-contiguous, which causes a problem if that matters. You are returning the count of the rates for each worker instead of the unique group showing that each worker in that group has the same rate codes:
grp, which was 1, 2, 3, changed to 3, 13, 14, and two workers with different rate codes but the same number of them will have the same group code.
It's always very important to check the output of any optimisation. Luckily this utility does that for me, so it's quick to catch, as you can see in the attached image (if it gets attached).