May 27, 2021 at 10:14 pm
Hi everyone,
I am working on a join that takes days to run, the code is below (it has been edited for security reasons):
with
BCMasterydaily as
(select *
from dbo.BCMasterydaily
where acctno = @controlacctnum
AND Processdate between startdate and enddate
and consol = 'a'
)
SELECT DISTINCT p.acct,
p.processdate,
m.product + m.productcode as pirprod,
CASE WHEN LEFT(p.plan_id,1) = 'C' then 'cash'
WHEN LEFT(p.plan_id,1) = 'B' then 'BT'
ELSE p.plan_id
END AS balancetype
FROM BCMasterydaily m
INNER JOIN dbo.BCPDailybalmasterydaily p
on p.acctno = m.acctno and
p.processdate = m.processdate
The issue is with table dbo.BCPDailybalmasterydaily, which has billions of records. The table does not have an index that I can use to speed up the join, and I have tried using a WHERE IN clause so there is no Cartesian join. Does anyone have any tricks or tips for dealing with joins on huge tables?
Thanks
May 28, 2021 at 1:25 am
The table does not have an index that I can use to speed up the join, and I have tried using a WHERE IN clause so there is no Cartesian join. Does anyone have any tricks or tips for dealing with joins on huge tables?
Are you allowed to create at least non-clustered indexes on the tables? If not, I don't think you can speed that up very much.
May 28, 2021 at 2:58 am
Best guess, without any other details, is to cluster the dbo.BCPDailybalmasterydaily table on ( processdate, acctno ). If those columns are not unique and there's an identity column on that table, include the $IDENTITY column as the last column in the clustering key so that you can make the index UNIQUE.
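Just as a sketch (the key columns come from your query; the id column is a hypothetical stand-in for whatever identity column the table actually has), the index would look something like this:
-- hypothetical: "id" stands in for the table's actual IDENTITY column
-- ONLINE = ON needs Enterprise edition; drop that option otherwise
CREATE UNIQUE CLUSTERED INDEX CIX_BCPDailybalmasterydaily
    ON dbo.BCPDailybalmasterydaily (processdate, acctno, id)
    WITH (SORT_IN_TEMPDB = ON, ONLINE = ON);
On a table with billions of rows you'd normally do this in a maintenance window, or rebuild the table via the copy-and-swap approach discussed later in this thread.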
SQL DBA,SQL Server MVP(07, 08, 09) "It's a dog-eat-dog world, and I'm wearing Milk-Bone underwear." "Norm", on "Cheers". Also from "Cheers", from "Carla": "You need to know 3 things about Tortelli men: Tortelli men draw women like flies; Tortelli men treat women like flies; Tortelli men's brains are in their flies".
May 28, 2021 at 3:03 am
What's the purpose of the CTE?
Also, giving the CTE the same name as the table does not seem to be the brightest idea ever.
Clustered index on ProcessDate is a must on such tables.
Also, the WHERE clause you use in the CTE (except for consol = 'a') must be applied to both tables. What you have in the JOIN condition is not enough.
_____________
Code for TallyGenerator
May 28, 2021 at 3:17 am
Can you share a link on how to do that? It is a bank's database; do you think they'll let that happen?
May 28, 2021 at 3:47 am
(1) Does the original table have an IDENTITY column? Or an added_date column?
(2) Is the original table compressed (page or row compressed)?
(3) Have you checked to see if compression would be beneficial for this table? (A quick way to check (2) and (3) is sketched below.)
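For (2) and (3), a quick check is something like this (just a sketch, using the big table's name from your query):
-- current compression setting of each partition
SELECT p.index_id, p.partition_number, p.data_compression_desc
FROM sys.partitions AS p
WHERE p.object_id = OBJECT_ID('dbo.BCPDailybalmasterydaily');
-- estimated space savings from PAGE compression
EXEC sys.sp_estimate_data_compression_savings
    @schema_name = 'dbo',
    @object_name = 'BCPDailybalmasterydaily',
    @index_id = NULL,
    @partition_number = NULL,
    @data_compression = 'PAGE';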
Overall you'd almost certainly want to create a new table and load the new table from the original table, in batches. You'll need some column(s) in the original table that uniquely identify rows, so that you can do the load in batches.
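As a rough sketch only -- it assumes a hypothetical id IDENTITY column that uniquely identifies rows, and an empty copy dbo.BCPDailybalmasterydaily_new already created with the new clustered index:
DECLARE @from  bigint = 0,
        @batch bigint = 1000000,
        @max   bigint;
SELECT @max = MAX(id) FROM dbo.BCPDailybalmasterydaily;
WHILE @from <= @max
BEGIN
    -- if id is also an IDENTITY column in the new table, SET IDENTITY_INSERT ... ON first,
    -- or define it as a plain bigint in the copy
    INSERT INTO dbo.BCPDailybalmasterydaily_new WITH (TABLOCK)
    SELECT *
    FROM dbo.BCPDailybalmasterydaily
    WHERE id > @from AND id <= @from + @batch;
    SET @from += @batch;  -- modest batch sizes keep the transaction log manageable
END;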
"It is a banks Database, do you think they’ll let that happen?"
I have no idea, you'll have to ask them. If they won't make changes to the table, it will always be extremely slow to process against, since SQL will always have to fully scan the entire table.
SQL DBA,SQL Server MVP(07, 08, 09) "It's a dog-eat-dog world, and I'm wearing Milk-Bone underwear." "Norm", on "Cheers". Also from "Cheers", from "Carla": "You need to know 3 things about Tortelli men: Tortelli men draw women like flies; Tortelli men treat women like flies; Tortelli men's brains are in their flies".
May 28, 2021 at 1:52 pm
I suspect that there's some pretty nasty accidental many-to-many joins based on the account number and date. Take a look at the article at the 2nd link in my signature line below (the one about how to post performance problems) and post those things so that we can help you solve this issue. Otherwise, we're just guessing.
--Jeff Moden
Change is inevitable... Change for the better is not.
May 28, 2021 at 11:36 pm
I would attempt to speed the query up with the use of temporary tables, try this for starters:
IF OBJECT_ID('tempdb..#BCMasterydaily', 'U') IS NOT NULL
DROP TABLE #BCMasterydaily
SELECT DISTINCT
acctno,
processdate,
product,
productcode
INTO #BCMasterydaily
FROM dbo.BCMasterydaily
WHERE acctno = @controlacctnum
AND Processdate BETWEEN startdate AND enddate
AND consol = 'a'
CREATE CLUSTERED INDEX IX_#BCMasterydaily_1 ON #BCMasterydaily(acctno, processdate)
SELECT DISTINCT
p.acct,
p.processdate,
m.product + m.productcode AS pirprod,
CASE WHEN LEFT(p.plan_id, 1) = 'C' THEN 'cash'
WHEN LEFT(p.plan_id, 1) = 'B' THEN 'BT'
ELSE p.plan_id
END AS balancetype
FROM #BCMasterydaily m
INNER JOIN dbo.BCPDailybalmasterydaily p
ON p.acctno = m.acctno
AND p.processdate = m.processdate;
May 29, 2021 at 5:05 am
Temp tables won't change much in terms of overall performance. Maybe even make it a bit worse.
I'd suggest following changes to the query:
SELECT p.acct, p.processdate,
m.product + m.productcode as pirprod,
CASE WHEN LEFT(p.plan_id,1) = 'C' then 'cash'
WHEN LEFT(p.plan_id,1) = 'B' then 'BT'
ELSE p.plan_id
END AS balancetype
FROM dbo.BCMasterydaily m
INNER JOIN dbo.BCPDailybalmasterydaily p
on p.acctno = @controlacctnum and p.processdate = m.processdate
where M.acctno = @controlacctnum
AND M.Processdate between @startdate and @enddate
and M.consol = 'a'
AND p.Processdate between @startdate and @enddate
group by p.acct, p.processdate,
m.product + m.productcode,
CASE WHEN LEFT(p.plan_id,1) = 'C' then 'cash'
WHEN LEFT(p.plan_id,1) = 'B' then 'BT'
ELSE p.plan_id
END
The important points are these:
- Presence or absence of the CTE won't affect the query. I removed it because I don't see any benefit of having it in the query.
- DISTINCT is a pretty nasty beast. There is no excuse for lazy programming using it, even if "everyone does it". Use GROUP BY instead, as in the query above.
_____________
Code for TallyGenerator
May 29, 2021 at 5:16 am
Can you share a link on how to do that? It is a bank's database; do you think they'll let that happen?
What kind of link are you asking for?
CREATE INDEX on docs.microsoft.com?
With such big tables, creating a clustered index is not so trivial.
You need to create an empty copy of the existing table, say dbo.BCMasterydaily_C.
Then you crawl-copy the data from dbo.BCMasterydaily to dbo.BCMasterydaily_C.
When the copy table catches up with the source, you lock them both in a transaction, copy the last inserted records to "_C" (using TABLOCKX), drop the original table and rename dbo.BCMasterydaily_C to dbo.BCMasterydaily. If there are still no errors - COMMIT. Otherwise ROLLBACK and investigate what went wrong.
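A rough sketch of that final step, assuming the crawl-copy is keyed on an ever-increasing id column (test the drop/rename-inside-a-transaction behaviour on a non-production copy first):
DECLARE @last_copied_id bigint;
BEGIN TRANSACTION;
    -- exclusive table locks block writers while we copy the tail and swap the names
    SELECT @last_copied_id = MAX(id) FROM dbo.BCMasterydaily_C WITH (TABLOCKX);
    INSERT INTO dbo.BCMasterydaily_C
    SELECT *
    FROM dbo.BCMasterydaily WITH (TABLOCKX)
    WHERE id > @last_copied_id;
    DROP TABLE dbo.BCMasterydaily;
    EXEC sp_rename 'dbo.BCMasterydaily_C', 'BCMasterydaily';
COMMIT TRANSACTION;  -- or ROLLBACK if anything above raised an error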
_____________
Code for TallyGenerator
May 29, 2021 at 10:52 am
The important points are these:
- DISTINCT is a pretty nasty beast. There is no excuse for lazy programming using it, even if "everyone does it".
I don't see why/how GROUP BY would be more efficient than DISTINCT. Group by involves more processing than distinct as it allows for aggregate calculations so, if anything, it should be less efficient than distinct.
May 29, 2021 at 7:35 pm
Thanks everyone,
It was a colleague's code he requested help on. I've sent him the link for clustered indexes online and also sent him your code, @SSC Guru.
Also, why is it not smart to give a CTE the same name as a table?
and I don't use this:
IF OBJECT_ID('tempdb..#BCMasterydaily', 'U') IS NOT NULL
DROP TABLE #BCMasterydaily
I use DROP TABLE IF EXISTS #BCMasterydaily;
I find it much easier to read.
May 30, 2021 at 6:32 am
Sergiy wrote: The important points are these:
- DISTINCT is a pretty nasty beast. There is no excuse for lazy programming using it, even if "everyone does it".
I don't see why/how GROUP BY would be more efficient than DISTINCT. Group by involves more processing than distinct as it allows for aggregate calculations so, if anything, it should be less efficient than distinct.
GROUP BY is applied to the source data sets in queries, with possible use of statistics, indexing, etc.
Calculations, aggregations and conversions apply to already grouped rows, so there are no repetitive calculations of the same values, as would happen with DISTINCT.
And allowing for calculations does not mean executing calculations. There is no processing time required for allowing anything.
DISTINCT is applied to the newly generated output recordset of a query, after all the processing is done on the expanded recordset.
To illustrate the difference I offer you this simple piece of code:
CREATE TABLE #T (Col1 int)
INSERT INTO #T (Col1)
SELECT 1
union ALL
SELECT 2
union ALL
SELECT 1
-- GROUP BY: rows are grouped first, so NEWID() is evaluated once per group (returns 2 rows)
SELECT NEWID(), COL1
FROM #T
GROUP BY Col1
-- DISTINCT: NEWID() is evaluated for every row before de-duplication, so all 3 rows come back
SELECT DISTINCT NEWID(), COL1
FROM #T
DROP TABLE #T
_____________
Code for TallyGenerator
May 30, 2021 at 8:21 am
Jonathan AC Roberts wrote:
Sergiy wrote: The important points are these:
- DISTINCT is a pretty nasty beast. There is no excuse for lazy programming using it, even if "everyone does it".
I don't see why/how GROUP BY would be more efficient than DISTINCT. Group by involves more processing than distinct as it allows for aggregate calculations so, if anything, it should be less efficient than distinct.
GROUP BY is applied to the source data sets in queries, with possible use of statistics, indexing, etc.
Calculations, aggregations and conversions apply to already grouped rows, so there are no repetitive calculations of the same values, as would happen with DISTINCT. And allowing for calculations does not mean executing calculations. There is no processing time required for allowing anything.
DISTINCT is applied to the newly generated output recordset of a query, after all the processing is done on the expanded recordset.
To illustrate the difference I offer you this simple piece of code:
CREATE TABLE #T (Col1 int)
INSERT INTO #T (Col1)
SELECT 1
union ALL
SELECT 2
union ALL
SELECT 1
-- GROUP BY: rows are grouped first, so NEWID() is evaluated once per group (returns 2 rows)
SELECT NEWID(), COL1
FROM #T
GROUP BY Col1
-- DISTINCT: NEWID() is evaluated for every row before de-duplication, so all 3 rows come back
SELECT DISTINCT NEWID(), COL1
FROM #T
DROP TABLE #T
Well that didn't demonstrate any performance improvement from using group by.
May 30, 2021 at 6:46 pm
Well that didn't demonstrate any performance improvement from using group by.
It explained where the performance improvement comes from.
If you want to see it, just rewrite any of your DISTINCT queries running against a significant dataset to use GROUP BY instead of DISTINCT. You may not even need to SET STATISTICS TIME ON to notice the difference.
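For example, using columns from the query earlier in the thread (just to show the shape of the rewrite):
-- DISTINCT: the expression is computed for every row, then de-duplicated
SELECT DISTINCT m.product + m.productcode AS pirprod
FROM dbo.BCMasterydaily m;
-- GROUP BY on the base columns: rows are grouped first,
-- and the expression is computed once per group
SELECT m.product + m.productcode AS pirprod
FROM dbo.BCMasterydaily m
GROUP BY m.product, m.productcode;
(The two forms only return identical results if different (product, productcode) pairs can't concatenate to the same string.)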
_____________
Code for TallyGenerator