August 14, 2015 at 7:50 am
Hi,
Fairly new to managing large datasets - we have a large table - 16m rows.
We load around 800k rows each month and store data on a monthly basis.
We want to display high-level, aggregated metrics to users, but we find that our queries are quite slow.
Would you recommend performance tuning, or creating a metrics table, for this kind of project?
Nick
August 14, 2015 at 8:06 am
nick.latocha (8/14/2015)
Hi, Fairly new to managing large datasets - we have a large table - 16m rows.
We load around 800k rows each month and store data on a monthly basis.
We want to display high-level, aggregated metrics to users, but we find that our queries are quite slow.
Would you recommend performance tuning, or creating a metrics table, for this kind of project?
Nick
If the data is only changing once a month as you suggest, then it's no contest: precalculate the aggregated metrics into a permanent table, and recalculate each time the source data is refreshed.
As an aside, 16 million rows is trivial, and with adequate indexing etc. you should get results back in a couple of seconds or less (finger in the air).
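To illustrate the idea, something along these lines would do it. This is only a sketch - the table and column names here (dbo.MonthlyMetrics, dbo.YourBigTable, reportYear, reportMonth, volume, cost) are placeholders for whatever your schema actually uses:
-- Sketch only: table and column names are placeholders
CREATE TABLE dbo.MonthlyMetrics
(
    reportYear   smallint       NOT NULL,
    reportMonth  tinyint        NOT NULL,
    totalVolume  bigint         NOT NULL,
    totalCost    decimal(18,2)  NOT NULL,
    CONSTRAINT PK_MonthlyMetrics PRIMARY KEY CLUSTERED (reportYear, reportMonth)
);
-- Rebuild the metrics as the last step of the monthly load
TRUNCATE TABLE dbo.MonthlyMetrics;
INSERT INTO dbo.MonthlyMetrics (reportYear, reportMonth, totalVolume, totalCost)
SELECT reportYear, reportMonth, SUM(volume), SUM(cost)
FROM dbo.YourBigTable
GROUP BY reportYear, reportMonth;
The reports then read dbo.MonthlyMetrics directly and never have to touch the 16 million row table at all.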
For fast, accurate and documented assistance in answering your questions, please read this article.
Understanding and using APPLY, (I) and (II) Paul White
Hidden RBAR: Triangular Joins / The "Numbers" or "Tally" Table: What it is and how it replaces a loop Jeff Moden
August 14, 2015 at 8:08 am
Probably both.
If the data changes once a month and is queried often, then you can add a job after it's loaded to pre-aggregate into tables for later querying, and tune the queries as well. 16 million rows isn't particularly large, so I'd definitely suggest tuning as well as pre-aggregating, for when the data volumes grow into something large.
Gail Shaw
Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability
August 14, 2015 at 3:23 pm
If you (almost) always process data by month (or by other date ranges), then cluster the table on the related date column. You should see far fewer performance issues then.
Edit: Agree that 16M rows isn't a priori a lot, but if you are constantly scanning the entire 16M rows, it could add up very quickly.
SQL DBA,SQL Server MVP(07, 08, 09) "It's a dog-eat-dog world, and I'm wearing Milk-Bone underwear." "Norm", on "Cheers". Also from "Cheers", from "Carla": "You need to know 3 things about Tortelli men: Tortelli men draw women like flies; Tortelli men treat women like flies; Tortelli men's brains are in their flies".
August 16, 2015 at 4:27 am
This could be where we are going wrong. We have clustered the table on the unique ID column. Let me try this out in our development environment.
August 16, 2015 at 11:30 am
nick.latocha (8/16/2015)
This could be where we are going wrong. We have clustered the table on the unique ID column. Let me try this out in our development environment.
You really need to post the DDL for the table and the related constraints and indexes because having a clustered index on a unique ID column is likely not a problem. In fact, it may be essential to keeping the performance up during the inserts. But, "It Depends" on what else is happening. Is there anything else in the data that would uniquely identify each row?
--Jeff Moden
Change is inevitable... Change for the better is not.
August 17, 2015 at 1:14 am
Attached.
Main table, queue table, indexes and execution plan.
Takes around 16 seconds to execute.
Thanks in advance
August 18, 2015 at 4:18 pm
This is so typical of these types of situations. As usual, there is a vastly better clustering key available than identity.
The (first) clustering key for this table should be exceptionDateTime, period. In this case, that's probably the only key you need, although ( exceptionDateTime, productArea ) could perform better depending on your specific query requirements. If you prefer, you can add identity to the end to make it unique (which will of course be required if you want to classify it as a "primary key" rather than just a clustering index).
You should then review the nonclustered index to see if it's still needed -- my guess is it's not worth it to maintain that as well.
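As a rough sketch, the change would look something like this - I'm guessing the identity column is called exceptionID and the existing primary key constraint is PK_exception_Main, so substitute the actual names from your posted DDL:
-- Sketch only: exceptionID and PK_exception_Main are guesses at your actual names
ALTER TABLE cts.exception_Main DROP CONSTRAINT PK_exception_Main;
-- Cluster on the date column; appending the identity keeps the key unique
CREATE UNIQUE CLUSTERED INDEX CIX_exception_Main
    ON cts.exception_Main (exceptionDateTime, exceptionID);
-- The identity can still back the primary key, just nonclustered now
ALTER TABLE cts.exception_Main
    ADD CONSTRAINT PK_exception_Main PRIMARY KEY NONCLUSTERED (exceptionID);
On a 16M row table the rebuild will take some time and log space, so do it in a maintenance window.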
SQL DBA,SQL Server MVP(07, 08, 09) "It's a dog-eat-dog world, and I'm wearing Milk-Bone underwear." "Norm", on "Cheers". Also from "Cheers", from "Carla": "You need to know 3 things about Tortelli men: Tortelli men draw women like flies; Tortelli men treat women like flies; Tortelli men's brains are in their flies".
August 20, 2015 at 9:22 am
Thanks - this definitely improved things. The query alone is now much faster; however, the join with the exception table is very slow.
SELECT m.reportMonth
,m.reportYear
,ex.service
,sum(m.volume) AS Vol
,SUM(m.cost) AS effort
FROM [cts].[exception_Main] m WITH (NOLOCK)
LEFT JOIN cts.map_Exception ex ON m.queueID = ex.queueID
WHERE m.exceptionDateTime >= GETDATE() - 365
AND m.productArea = ('FXMM')
AND m.reportable = 'Y'
GROUP BY m.reportMonth
,m.reportYear
,ex.service
Even with a PK/FK relationship, the join slows things down massively. My solution was to aggregate the "main" data into a temp table first, then join the exception table afterwards. Is this the best way, or should it be possible to do it using the original query?
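For reference, this is roughly what I did (simplified - #MainAgg is just an illustrative name):
-- Roughly the approach described above; #MainAgg is illustrative
SELECT m.reportMonth, m.reportYear, m.queueID,
    SUM(m.volume) AS Vol, SUM(m.cost) AS effort
INTO #MainAgg
FROM cts.exception_Main m WITH (NOLOCK)
WHERE m.exceptionDateTime >= GETDATE() - 365
    AND m.productArea = 'FXMM'
    AND m.reportable = 'Y'
GROUP BY m.reportMonth, m.reportYear, m.queueID;
-- Join the small mapping table to the pre-aggregated rows
SELECT a.reportMonth, a.reportYear, ex.service,
    SUM(a.Vol) AS Vol, SUM(a.effort) AS effort
FROM #MainAgg a
LEFT JOIN cts.map_Exception ex ON a.queueID = ex.queueID
GROUP BY a.reportMonth, a.reportYear, ex.service;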
August 20, 2015 at 9:32 am
-- create this index
CREATE INDEX ix_Helper ON [cts].[exception_Main]
(reportYear, reportMonth, exceptionDateTime, productArea)
INCLUDE (volume, cost, queueID, reportable)
-- run this query, capture the actual execution plan and post back
SELECT m.reportMonth
,m.reportYear
,ex.service
,sum(m.volume) AS Vol
,SUM(m.cost) AS effort
FROM [cts].[exception_Main] m WITH (NOLOCK)
LEFT LOOP JOIN cts.map_Exception ex -- ##
ON m.queueID = ex.queueID
WHERE m.exceptionDateTime >= GETDATE() - 365
AND m.productArea = ('FXMM')
AND m.reportable = 'Y'
GROUP BY m.reportYear -- ##
,m.reportMonth -- ##
,ex.service
For fast, accurate and documented assistance in answering your questions, please read this article.
Understanding and using APPLY, (I) and (II) Paul White
Hidden RBAR: Triangular Joins / The "Numbers" or "Tally" Table: What it is and how it replaces a loop Jeff Moden
August 20, 2015 at 10:13 am
See attached. It looks like it's not using the new index, and is instead using the previous clustered index.
August 20, 2015 at 10:24 am
Interesting. Edit: The only reason that particular join should be slow is the large number of lookups, since the map_exception table only has a few hundred rows. End-Edit.
At any rate, first create this index (or not - it should have only a minor effect anyway, but this may be one of those cases where a nonclustered index on the same column as the clustered index is worthwhile).
Then test the query below (I went back to the original query; naturally, make any adjustments you need to, and the "with(nolock)" is entirely at your discretion). No guarantees, but I think it should perform better by grouping before doing the lookup.
[Btw, you should increase the FILLFACTOR on the map_Exception table as well, and also very likely on the exception_Main table; a rebuild sketch follows the query below. As with the clustering index keys, there's no true good "default" value for fill factor - it should be specifically chosen, index by index.]
CREATE UNIQUE NONCLUSTERED INDEX [IX_map_Exception]
ON cts.map_Exception ( queueID, service )
WITH ( FILLFACTOR = 99 );
SELECT
main.reportMonth
,main.reportYear
,ex.service
,main.zzzCustomer
,sum(main.Vol) AS Vol
,SUM(main.effort) as effort
FROM (
SELECT
em.reportMonth
,em.reportYear
,em.queueID
,em.zzzCustomer
,sum(em.volume) AS Vol
,SUM(em.cost) as effort
FROM [cts].[exception_Main] em with(nolock)
WHERE em.exceptionDateTime >= GETDATE()-365
and em.productArea =('FXMM')
and em.reportable = 'Y'
GROUP BY
em.reportMonth
,em.reportYear
,em.queueID
,em.zzzCustomer
) AS main
LEFT JOIN cts.map_Exception ex with(nolock) ON main.queueID = ex.queueID
GROUP BY
main.reportMonth
,main.reportYear
,ex.service
,main.zzzCustomer
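And the fill factor rebuild mentioned above would look something like this - the 95 is purely an example value; pick each index's fill factor based on how the data actually arrives:
-- Example only: 95 is an arbitrary value, choose per index
ALTER INDEX ALL ON cts.map_Exception
    REBUILD WITH (FILLFACTOR = 95);
ALTER INDEX ALL ON cts.exception_Main
    REBUILD WITH (FILLFACTOR = 95);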
SQL DBA,SQL Server MVP(07, 08, 09) "It's a dog-eat-dog world, and I'm wearing Milk-Bone underwear." "Norm", on "Cheers". Also from "Cheers", from "Carla": "You need to know 3 things about Tortelli men: Tortelli men draw women like flies; Tortelli men treat women like flies; Tortelli men's brains are in their flies".
August 21, 2015 at 3:36 am
Good catch, Scott. I'd write it like this, which is more or less the same:
SELECT m.reportMonth,
m.reportYear,
ex.[service],
SUM(m.Vol) AS Vol,
SUM(m.Effort) AS effort
FROM (
SELECT reportMonth,
reportYear,
queueID,
SUM(volume) AS Vol,
SUM(cost) AS Effort
FROM [cts].[exception_Main] -- 14.5M rows
WHERE exceptionDateTime >= GETDATE() - 365
AND productArea = ('FXMM')
AND reportable = 'Y'
GROUP BY
reportYear,
reportMonth,
queueID
) m
LEFT JOIN cts.map_Exception ex -- 420 rows
ON m.queueID = ex.queueID
GROUP BY
m.reportYear,
m.reportMonth,
ex.[service]
The inner select can be accelerated with this index:
CREATE INDEX ix_Helper ON [cts].[exception_Main]
(productArea, reportable, reportYear, reportMonth, queueID)
INCLUDE (volume, cost, exceptionDateTime)
The rationale for the index goes like this:
Once the start point has been found in the index using a seek, scan until the end, filtering exceptionDateTime with a residual predicate. Because productArea and reportable are constants, the data comes off in the correct order to support a streaming aggregate.
A simple test harness shows that the theory works and it's mighty quick too:
-- set up sample data
SELECT
ID,
reportYear = YEAR(ReportDate),
reportMonth = MONTH(ReportDate),
reportable = CASE WHEN ABS(CHECKSUM(NEWID()))%2 = 1 THEN 'Y' ELSE 'N' END,
queueID = ABS(CHECKSUM(NEWID()))%420,
exceptionDateTime = DATEADD(DAY,ABS(CHECKSUM(NEWID()))%3,ReportDate),
productArea = CASE WHEN ABS(CHECKSUM(NEWID()))%2 = 1 THEN 'FXMM' ELSE 'N' END,
volume = ABS(CHECKSUM(NEWID()))%20,
cost = ABS(CHECKSUM(NEWID()))%30
INTO #Sample
FROM (
SELECT ID, ReportDate = DATEADD(MINUTE, 0-ID/20,GETDATE())
FROM (
SELECT TOP(14500000)
ID = ROW_NUMBER() OVER(ORDER BY (SELECT NULL))
FROM SYS.COLUMNS a, SYS.COLUMNS b, SYS.COLUMNS c, SYS.COLUMNS d, SYS.COLUMNS e
) d
) e
CREATE UNIQUE CLUSTERED INDEX ucx_Sample ON #Sample (ID)
CREATE INDEX ix_Helper ON #Sample
(productArea, reportable, reportYear, reportMonth, queueID)
INCLUDE (volume, cost, exceptionDateTime)
-- run the (inner) query
SELECT
reportMonth,
reportYear,
queueID,
SUM(volume) AS Vol,
SUM(cost) AS Effort
FROM #Sample
WHERE exceptionDateTime >= GETDATE() - 365
AND productArea = ('FXMM')
AND reportable = 'Y'
GROUP BY
reportYear,
reportMonth,
queueID
For fast, accurate and documented assistance in answering your questions, please read this article.
Understanding and using APPLY, (I) and (II) Paul White
Hidden RBAR: Triangular Joins / The "Numbers" or "Tally" Table: What it is and how it replaces a loop Jeff Moden
August 21, 2015 at 4:43 am
Wow!
That's a dramatic increase! Thanks so much
August 21, 2015 at 8:19 am
ChrisM@Work (8/21/2015)
The inner select can be accelerated with this index:
CREATE INDEX ix_Helper ON [cts].[exception_Main]
(productArea, reportable, reportYear, reportMonth, queueID)
INCLUDE (volume, cost, exceptionDateTime)
But with the best clustered index now on the table, you don't need this extra, fairly large index. You should get as good a response time - possibly even better, depending on the details - from a clustered index on exceptionDateTime. That's the huge advantage - edit: of the best clustered index - that you don't have to create gazillions of covering indexes for so many queries.
SQL DBA,SQL Server MVP(07, 08, 09) "It's a dog-eat-dog world, and I'm wearing Milk-Bone underwear." "Norm", on "Cheers". Also from "Cheers", from "Carla": "You need to know 3 things about Tortelli men: Tortelli men draw women like flies; Tortelli men treat women like flies; Tortelli men's brains are in their flies".