May 27, 2021 at 10:14 pm
Hi everyone,
I am working on a join that takes days to run, the code is below (it has been edited for security reasons):
with
BCMasterydaily as
(select *
from dbo.BCMasterydaily
where acctno = @controlacctnum
AND Processdate between startdate and enddate
and consol = 'a'
)
SELECT DISTINCT p.acct,
p.processdate,
m.product + m.productcode as pirprod,
CASE WHEN LEFT(p.plan_id,1) = 'C' then 'cash'
WHEN LEFT(p.plan_id,1) = 'B' then 'BT'
ELSE p.plan_id
END AS balancetype
FROM BCMasterydaily m
INNER JOIN dbo.BCPDailybalmasterydaily p
on p.acctno = m.acctno and
p.processdate = m.processdate
The issue is with table dbo.BCPDailybalmasterydaily, which has billions of records. The table does not have an index that I can use to speed up the join, and I have tried using a WHERE IN clause so there is no Cartesian join. Does anyone have any tricks or tips for dealing with joins on huge tables?
Thanks
May 28, 2021 at 1:25 am
The table does not have an index that I can use to speed up the join, and I have tried using a WHERE IN clause so there is no Cartesian join. Does anyone have any tricks or tips for dealing with joins on huge tables?
Are you allowed to create at least non-clustered indexes on the tables? If not, I don't think you can speed that up very much.
May 28, 2021 at 2:58 am
Best guess, without any other details, is to cluster the dbo.BCPDailybalmasterydaily table on ( processdate, acctno ). If those columns are not unique and there's an identity column on that table, include the $IDENTITY column as the last column in the clustering key so that you can make the index UNIQUE.
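Just as a sketch (the key columns come from your query; the id column is a hypothetical stand-in for whatever identity column the table actually has), the index would look something like this:
-- hypothetical: "id" stands in for the table's actual IDENTITY column
-- ONLINE = ON needs Enterprise edition; drop that option otherwise
CREATE UNIQUE CLUSTERED INDEX CIX_BCPDailybalmasterydaily
    ON dbo.BCPDailybalmasterydaily (processdate, acctno, id)
    WITH (SORT_IN_TEMPDB = ON, ONLINE = ON);
On a table with billions of rows you'd normally do this in a maintenance window, or rebuild the table via the copy-and-swap approach discussed later in this thread.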
SQL DBA,SQL Server MVP(07, 08, 09) "It's a dog-eat-dog world, and I'm wearing Milk-Bone underwear." "Norm", on "Cheers". Also from "Cheers", from "Carla": "You need to know 3 things about Tortelli men: Tortelli men draw women like flies; Tortelli men treat women like flies; Tortelli men's brains are in their flies".
May 28, 2021 at 3:03 am
What's the purpose of the CTE?
Also, giving the CTE the same name as the table does not seem to be the brightest idea ever.
Clustered index on ProcessDate is a must on such tables.
Also, the WHERE clause you use in the CTE (except for consol = 'a') must be applied to both tables. What you have in the JOIN condition is not enough.
_____________
Code for TallyGenerator
May 28, 2021 at 3:17 am
Can you share a link on how to do that? It is a bank's database; do you think they'll let that happen?
May 28, 2021 at 3:47 am
(1) Does the original table have an IDENTITY column? Or an added_date column?
(2) Is the original table compressed (page or row compressed)?
(3) Have you checked to see if compression would be beneficial for this table? (A quick way to check (2) and (3) is sketched below.)
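For (2) and (3), a quick check is something like this (just a sketch, using the big table's name from your query):
-- current compression setting of each partition
SELECT p.index_id, p.partition_number, p.data_compression_desc
FROM sys.partitions AS p
WHERE p.object_id = OBJECT_ID('dbo.BCPDailybalmasterydaily');
-- estimated space savings from PAGE compression
EXEC sys.sp_estimate_data_compression_savings
    @schema_name = 'dbo',
    @object_name = 'BCPDailybalmasterydaily',
    @index_id = NULL,
    @partition_number = NULL,
    @data_compression = 'PAGE';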
Overall you'd almost certainly want to create a new table and load the new table from the original table, in batches. You'll need some column(s) in the original table that uniquely identify rows, so that you can do the load in batches.
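As a rough sketch only -- it assumes a hypothetical id IDENTITY column that uniquely identifies rows, and an empty copy dbo.BCPDailybalmasterydaily_new already created with the new clustered index:
DECLARE @from  bigint = 0,
        @batch bigint = 1000000,
        @max   bigint;
SELECT @max = MAX(id) FROM dbo.BCPDailybalmasterydaily;
WHILE @from <= @max
BEGIN
    -- if id is also an IDENTITY column in the new table, SET IDENTITY_INSERT ... ON first,
    -- or define it as a plain bigint in the copy
    INSERT INTO dbo.BCPDailybalmasterydaily_new WITH (TABLOCK)
    SELECT *
    FROM dbo.BCPDailybalmasterydaily
    WHERE id > @from AND id <= @from + @batch;
    SET @from += @batch;  -- modest batch sizes keep the transaction log manageable
END;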
"It is a banks Database, do you think they’ll let that happen?"
I have no idea, you'll have to ask them. If they won't make changes to the table, it will always be extremely slow to process against, since SQL will always have to fully scan the entire table.
SQL DBA,SQL Server MVP(07, 08, 09) "It's a dog-eat-dog world, and I'm wearing Milk-Bone underwear." "Norm", on "Cheers". Also from "Cheers", from "Carla": "You need to know 3 things about Tortelli men: Tortelli men draw women like flies; Tortelli men treat women like flies; Tortelli men's brains are in their flies".
May 28, 2021 at 1:52 pm
I suspect that there's some pretty nasty accidental many-to-many joins based on the account number and date. Take a look at the article at the 2nd link in my signature line below (the one about how to post performance problems) and post those things so that we can help you solve this issue. Otherwise, we're just guessing.
--Jeff Moden
Change is inevitable... Change for the better is not.
May 28, 2021 at 11:36 pm
I would attempt to speed the query up with the use of temporary tables, try this for starters:
IF OBJECT_ID('tempdb..#BCMasterydaily', 'U') IS NOT NULL
DROP TABLE #BCMasterydaily
SELECT DISTINCT
acctno,
processdate,
product,
productcode
INTO #BCMasterydaily
FROM dbo.BCMasterydaily
WHERE acctno = @controlacctnum
AND Processdate BETWEEN startdate AND enddate
AND consol = 'a'
CREATE CLUSTERED INDEX IX_#BCMasterydaily_1 ON #BCMasterydaily(acctno, processdate)
SELECT DISTINCT
p.acct,
p.processdate,
m.product + m.productcode AS pirprod,
CASE WHEN LEFT(p.plan_id, 1) = 'C' THEN 'cash'
WHEN LEFT(p.plan_id, 1) = 'B' THEN 'BT'
ELSE p.plan_id
END AS balancetype
FROM #BCMasterydaily m
INNER JOIN dbo.BCPDailybalmasterydaily p
ON p.acctno = m.acctno
AND p.processdate = m.processdate;
May 29, 2021 at 5:05 am
Temp tables won't change much in terms of overall performance. Maybe even make it a bit worse.
I'd suggest following changes to the query:
SELECT p.acct, p.processdate,
m.product + m.productcode as pirprod,
CASE WHEN LEFT(p.plan_id,1) = 'C' then 'cash'
WHEN LEFT(p.plan_id,1) = 'B' then 'BT'
ELSE p.plan_id
END AS balancetype
FROM dbo.BCMasterydaily m
INNER JOIN dbo.BCPDailybalmasterydaily p
on p.acctno = @controlacctnum and p.processdate = m.processdate
where M.acctno = @controlacctnum
AND M.Processdate between @startdate and @enddate
and M.consol = 'a'
AND p.Processdate between @startdate and @enddate
group by p.acct, p.processdate,
m.product + m.productcode,
CASE WHEN LEFT(p.plan_id,1) = 'C' then 'cash'
WHEN LEFT(p.plan_id,1) = 'B' then 'BT'
ELSE p.plan_id
END
The important points are these:
- Presence or absence of the CTE won't affect the query. I removed it because I don't see any benefit of having it in the query.
- DISTINCT is a pretty nasty beast. There is no excuse for lazy programming using it, even if "everyone does it". Use GROUP BY instead, as in the query above.
_____________
Code for TallyGenerator
May 29, 2021 at 5:16 am
Can you share a link on how to do that? It is a bank's database; do you think they'll let that happen?
What kind of link are you asking for?
CREATE INDEX on docs.microsoft.com?
With such big tables, creating a clustered index is not so trivial.
You need to create an empty copy of the existing table, say dbo.BCMasterydaily_C.
Then you crawl-copy the data from dbo.BCMasterydaily to dbo.BCMasterydaily_C.
When the copy table catches up with the source, you lock them both in a transaction, copy the last inserted records to "_C" (using TABLOCKX), drop the original table and rename dbo.BCMasterydaily_C to dbo.BCMasterydaily. If there are still no errors - COMMIT. Otherwise ROLLBACK and investigate what went wrong.
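A rough sketch of that final step, assuming the crawl-copy is keyed on an ever-increasing id column (test the drop/rename-inside-a-transaction behaviour on a non-production copy first):
DECLARE @last_copied_id bigint;
BEGIN TRANSACTION;
    -- exclusive table locks block writers while we copy the tail and swap the names
    SELECT @last_copied_id = MAX(id) FROM dbo.BCMasterydaily_C WITH (TABLOCKX);
    INSERT INTO dbo.BCMasterydaily_C
    SELECT *
    FROM dbo.BCMasterydaily WITH (TABLOCKX)
    WHERE id > @last_copied_id;
    DROP TABLE dbo.BCMasterydaily;
    EXEC sp_rename 'dbo.BCMasterydaily_C', 'BCMasterydaily';
COMMIT TRANSACTION;  -- or ROLLBACK if anything above raised an error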
_____________
Code for TallyGenerator
May 29, 2021 at 10:52 am
The important points are these:
- DISTINCT is a pretty nasty beast. There is no excuse for lazy programming using it, even if "everyone does it".
I don't see why/how GROUP BY would be more efficient than DISTINCT. Group by involves more processing than distinct as it allows for aggregate calculations so, if anything, it should be less efficient than distinct.
May 29, 2021 at 7:35 pm
Thanks everyone,
It was a colleague's code he requested help on. I've sent him the link for clustered indexes online and also sent him your code, @SSC Guru.
Also, why is it not smart to give a CTE the same name as a table?
and I don't use this:
IF OBJECT_ID('tempdb..#BCMasterydaily', 'U') IS NOT NULL
DROP TABLE #BCMasterydaily
I use DROP TABLE IF EXISTS #BCMasterydaily;
I find it much easier to read.
May 30, 2021 at 6:32 am
Sergiy wrote: The important points are these:
- DISTINCT is a pretty nasty beast. There is no excuse for lazy programming using it, even if "everyone does it".
I don't see why/how GROUP BY would be more efficient than DISTINCT. Group by involves more processing than distinct as it allows for aggregate calculations so, if anything, it should be less efficient than distinct.
GROUP BY is applied to the source data sets in queries, with possible use of statistics, indexing, etc.
Calculations, aggregations and conversions apply to already grouped rows, so there are no repetitive calculations of the same values, as would happen with DISTINCT.
And allowing for calculations does not mean executing calculations. There is no processing time required for allowing anything.
DISTINCT is applied to the newly generated output recordset of a query, after all the processing is done on the expanded recordset.
To illustrate the difference I offer you this simple piece of code:
CREATE TABLE #T (Col1 int)
INSERT INTO #T (Col1)
SELECT 1
union ALL
SELECT 2
union ALL
SELECT 1
-- GROUP BY: rows are grouped first, so NEWID() is evaluated once per group (returns 2 rows)
SELECT NEWID(), COL1
FROM #T
GROUP BY Col1
-- DISTINCT: NEWID() is evaluated for every row before de-duplication, so all 3 rows come back
SELECT DISTINCT NEWID(), COL1
FROM #T
DROP TABLE #T
_____________
Code for TallyGenerator
May 30, 2021 at 8:21 am
Jonathan AC Roberts wrote:
Sergiy wrote: The important points are these:
- DISTINCT is a pretty nasty beast. There is no excuse for lazy programming using it, even if "everyone does it".
I don't see why/how GROUP BY would be more efficient than DISTINCT. Group by involves more processing than distinct as it allows for aggregate calculations so, if anything, it should be less efficient than distinct.
GROUP BY is applied to the source data sets in queries, with possible use of statistics, indexing, etc.
Calculations, aggregations and conversions apply to already grouped rows, so there are no repetitive calculations of the same values, as would happen with DISTINCT. And allowing for calculations does not mean executing calculations. There is no processing time required for allowing anything.
DISTINCT is applied to the newly generated output recordset of a query, after all the processing is done on the expanded recordset.
To illustrate the difference I offer you this simple piece of code:
CREATE TABLE #T (Col1 int)
INSERT INTO #T (Col1)
SELECT 1
union ALL
SELECT 2
union ALL
SELECT 1
-- GROUP BY: rows are grouped first, so NEWID() is evaluated once per group (returns 2 rows)
SELECT NEWID(), COL1
FROM #T
GROUP BY Col1
-- DISTINCT: NEWID() is evaluated for every row before de-duplication, so all 3 rows come back
SELECT DISTINCT NEWID(), COL1
FROM #T
DROP TABLE #T
Well that didn't demonstrate any performance improvement from using group by.
May 30, 2021 at 6:46 pm
Well that didn't demonstrate any performance improvement from using group by.
It explained where the performance improvement comes from.
If you want to see it, just rewrite any of your DISTINCT queries running against a significant dataset to use GROUP BY instead of DISTINCT. You may not even need to SET STATISTICS TIME ON to notice the difference.
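For example, using columns from the query earlier in the thread (just to show the shape of the rewrite):
-- DISTINCT: the expression is computed for every row, then de-duplicated
SELECT DISTINCT m.product + m.productcode AS pirprod
FROM dbo.BCMasterydaily m;
-- GROUP BY on the base columns: rows are grouped first,
-- and the expression is computed once per group
SELECT m.product + m.productcode AS pirprod
FROM dbo.BCMasterydaily m
GROUP BY m.product, m.productcode;
(The two forms only return identical results if different (product, productcode) pairs can't concatenate to the same string.)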
_____________
Code for TallyGenerator