Speed up join

  • Hi everyone,

    I am working on a join that takes days to run. The code is below (it has been edited for security reasons):

    with BCMasterydaily as
    (
        select *
        from dbo.BCMasterydaily
        where acctno = @controlacctnum
          and Processdate between startdate and enddate
          and consol = 'a'
    )
    SELECT DISTINCT p.acct,
        p.processdate,
        m.product + m.productcode as pirprod,
        CASE WHEN LEFT(p.plan_id,1) = 'C' then 'cash'
             WHEN LEFT(p.plan_id,1) = 'B' then 'BT'
             ELSE p.plan_id
        END AS balancetype
    FROM BCMasterydaily m
    INNER JOIN dbo.BCPDailybalmasterydaily p
        on p.acctno = m.acctno
       and p.processdate = m.processdate


    The issue is with the table dbo.BCPDailybalmasterydaily, which has billions of records. The table does not have an index I can use to speed up the join, and I have tried using a WHERE IN clause so there is no Cartesian join. Does anyone have any tricks or tips for dealing with joins on huge tables?

    Thanks

    The table does not have an index I can use to speed up the join, and I have tried using a WHERE IN clause so there is no Cartesian join. Does anyone have any tricks or tips for dealing with joins on huge tables?

    Are you allowed to create at least non-clustered indexes on the tables? If not, I don't think you can speed that up very much.
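
    For example, a sketch of such an index (the index name is made up; the key columns come from the join in the original query, and INCLUDE covers the other columns it selects):

    CREATE NONCLUSTERED INDEX IX_BCPDailybalmasterydaily_acctno_processdate
        ON dbo.BCPDailybalmasterydaily ( acctno, processdate )
        INCLUDE ( acct, plan_id );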

    Best guess, without any other details, is to cluster the dbo.BCPDailybalmasterydaily table on ( processdate, acctno ). If those columns are not unique and there's an identity column on that table, include the $IDENTITY column as the last column in the clustering key so that you make the index UNIQUE.
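
    In DDL terms that would look something like the sketch below. The column name id is hypothetical; $IDENTITY works in a select list, but in CREATE INDEX you have to name the actual identity column:

    -- "id" stands in for whatever the table's identity column is really called
    CREATE UNIQUE CLUSTERED INDEX CL_BCPDailybalmasterydaily
        ON dbo.BCPDailybalmasterydaily ( processdate, acctno, id );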

    SQL DBA,SQL Server MVP(07, 08, 09) "It's a dog-eat-dog world, and I'm wearing Milk-Bone underwear." "Norm", on "Cheers". Also from "Cheers", from "Carla": "You need to know 3 things about Tortelli men: Tortelli men draw women like flies; Tortelli men treat women like flies; Tortelli men's brains are in their flies".

  • What's the purpose of the CTE?

    Also, giving the CTE the same name as the table does not seem to be the brightest idea ever.

    Clustered index on ProcessDate is a must on such tables.

    Also, the WHERE clause you use in the CTE (except for consol = 'a') must be applied to both tables. What you have in the JOIN condition is not enough.


    _____________
    Code for TallyGenerator

    Can you share a link on how to do that? It is a bank's database; do you think they'll let that happen?

    (1) Does the original table have an IDENTITY column?  Or an added_date column?

    (2) Is the original table compressed (page or row compressed)?

    (3) Have you checked to see if compression would be beneficial for this table?

    Overall you'd almost certainly want to create a new table and load it from the original table in batches. You'll need some column(s) in the original table that uniquely identify rows, so that you can do the load in batches.
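
    A hedged sketch of such a batch load, assuming the new table (here dbo.BCPDailybalmasterydaily_New, a made-up name) already exists with the desired clustered index, and that rows are uniquely keyed by an identity column named id (also hypothetical):

    DECLARE @from bigint = 0, @batch bigint = 1000000, @max_id bigint;
    SELECT @max_id = MAX(id) FROM dbo.BCPDailybalmasterydaily;

    WHILE @from < @max_id
    BEGIN
        -- TABLOCK on the target allows minimal logging under the
        -- SIMPLE or BULK_LOGGED recovery model
        INSERT INTO dbo.BCPDailybalmasterydaily_New WITH (TABLOCK)
        SELECT *
        FROM dbo.BCPDailybalmasterydaily
        WHERE id > @from AND id <= @from + @batch;

        SET @from += @batch;
    END;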

    "It is a banks Database, do you think they’ll let that happen?"

    I have no idea; you'll have to ask them.  If they won't make changes to the table, it will always be extremely slow to process against, since SQL will always have to fully scan the entire table.


  • @scott745618,

    I suspect that there are some pretty nasty accidental many-to-many joins based on the account number and date.  Take a look at the article at the 2nd link in my signature line below (the one about how to post performance problems) and post those things so that we can help you solve this issue.  Otherwise, we're just guessing.
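
    A quick way to check for that, as a sketch against the join columns from the original query:

    -- more than one row per (acctno, processdate) on either side of the
    -- join means the join multiplies rows
    SELECT acctno, processdate, COUNT(*) AS cnt
    FROM dbo.BCPDailybalmasterydaily
    GROUP BY acctno, processdate
    HAVING COUNT(*) > 1;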


    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

    I would attempt to speed the query up with the use of temporary tables; try this for starters:

    IF OBJECT_ID('tempdb..#BCMasterydaily', 'U') IS NOT NULL
        DROP TABLE #BCMasterydaily

    SELECT DISTINCT
        acctno,
        processdate,
        product,
        productcode
    INTO #BCMasterydaily
    FROM dbo.BCMasterydaily
    WHERE acctno = @controlacctnum
    AND Processdate BETWEEN startdate AND enddate
    AND consol = 'a'

    CREATE CLUSTERED INDEX IX_#BCMasterydaily_1 ON #BCMasterydaily (acctno, processdate)

    SELECT DISTINCT
    p.acct,
    p.processdate,
    m.product + m.productcode AS pirprod,
    CASE WHEN LEFT(p.plan_id, 1) = 'C' THEN 'cash'
    WHEN LEFT(p.plan_id, 1) = 'B' THEN 'BT'
    ELSE p.plan_id
    END AS balancetype
    FROM #BCMasterydaily m
    INNER JOIN dbo.BCPDailybalmasterydaily p
    ON p.acctno = m.acctno
    AND p.processdate = m.processdate;


    Temp tables won't change much in terms of overall performance. They may even make it a bit worse.

    I'd suggest following changes to the query:

    SELECT p.acct, p.processdate,
        m.product + m.productcode as pirprod,
        CASE WHEN LEFT(p.plan_id,1) = 'C' then 'cash'
             WHEN LEFT(p.plan_id,1) = 'B' then 'BT'
             ELSE p.plan_id
        END AS balancetype
    FROM dbo.BCMasterydaily m
    INNER JOIN dbo.BCPDailybalmasterydaily p
        on p.acctno = @controlacctnum and p.processdate = m.processdate
    where M.acctno = @controlacctnum
    AND M.Processdate between @startdate and @enddate
    and M.consol = 'a'
    AND p.Processdate between @startdate and @enddate
    group by p.acct, p.processdate,
        m.product + m.productcode,
        CASE WHEN LEFT(p.plan_id,1) = 'C' then 'cash'
             WHEN LEFT(p.plan_id,1) = 'B' then 'BT'
             ELSE p.plan_id
        END

    The important points are these:

    • p.acctno = @controlacctnum in the JOIN will help use statistics more effectively
    • AND p.Processdate between @startdate and @enddate may force selecting only the relevant chunk of data (I suspect a very small part of the total set) before proceeding with the HASH JOIN.
    • DISTINCT is a pretty nasty beast. There is no excuse for the lazy programming of using it, even if "everyone does it".

    Presence or absence of the CTE won't affect the query. I removed it because I don't see any benefit of having it in the query.


  • Scott745618 wrote:

    Can you share a link of how to do that? It is a banks Database, do you think they’ll let that happen?

    What kind of link are you asking for?

    CREATE INDEX on docs.microsoft.com?

    With such big tables, creating a clustered index is not so trivial.

    You need to create an empty copy of the existing table, say dbo.BCMasterydaily_C.

    Then you crawl-copy data from dbo.BCMasterydaily to dbo.BCMasterydaily_C.

    When the copy table catches up with the source, you lock them both in a transaction, copy the last inserted records to "_C" (using TABLOCKX), drop the original table, and rename dbo.BCMasterydaily_C to dbo.BCMasterydaily. If there are still no errors, COMMIT. Otherwise ROLLBACK and investigate what's gone wrong.
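
    The final cutover could look something like this sketch (the table names follow the post above, and it assumes the crawl batches were keyed on a hypothetical identity column id):

    BEGIN TRANSACTION;
        -- exclusive locks stop writers while we copy the tail and swap
        INSERT INTO dbo.BCMasterydaily_C WITH (TABLOCKX)
        SELECT *
        FROM dbo.BCMasterydaily WITH (TABLOCKX)
        WHERE id > (SELECT MAX(id) FROM dbo.BCMasterydaily_C);

        DROP TABLE dbo.BCMasterydaily;
        EXEC sp_rename 'dbo.BCMasterydaily_C', 'BCMasterydaily';
    COMMIT TRANSACTION;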


    • Sergiy wrote:

      The important points are these:

• DISTINCT is a pretty nasty beast. There is no excuse for the lazy programming of using it, even if "everyone does it".

    I don't see why/how GROUP BY would be more efficient than DISTINCT. GROUP BY involves more processing than DISTINCT, as it allows for aggregate calculations, so if anything it should be less efficient than DISTINCT.

  • Thanks everyone,

    It was a colleague's code he requested help on. I've sent him the link for clustered indexes online and also wrote him your code @SSC Guru.

    Also, why is it not smart to give a CTE the same name as a table?

    and I don't use this:

    IF OBJECT_ID('tempdb..#BCMasterydaily', 'U') IS NOT NULL 
    DROP TABLE #BCMasterydaily

    I use DROP TABLE IF EXISTS #BCMasterydaily;

    I find it much easier to read.

  • Jonathan AC Roberts wrote:

    • Sergiy wrote:

      The important points are these:

• DISTINCT is a pretty nasty beast. There is no excuse for the lazy programming of using it, even if "everyone does it".

    I don't see why/how GROUP BY would be more efficient than DISTINCT. GROUP BY involves more processing than DISTINCT, as it allows for aggregate calculations, so if anything it should be less efficient than DISTINCT.

    GROUP BY is applied to the source data sets in queries, with possible use of statistics, indexing, etc.

    Calculations, aggregations, and conversions apply to already grouped rows, so there are no repetitive calculations of the same values, as there would be with DISTINCT.

    And allowing for calculations does not mean executing calculations. There is no processing time required for allowing anything.

    DISTINCT is applied to the newly generated output recordset of a query, after all the processing is done on the expanded recordset.

    To illustrate the difference I offer you this simple piece of code:

    CREATE TABLE #T (Col1 int)
    INSERT INTO #T (Col1)
    SELECT 1
    UNION ALL
    SELECT 2
    UNION ALL
    SELECT 1

    -- returns 2 rows: NEWID() runs once per group
    SELECT NEWID(), Col1
    FROM #T
    GROUP BY Col1

    -- returns 3 rows: NEWID() runs once per source row,
    -- so DISTINCT never sees two equal rows to collapse
    SELECT DISTINCT NEWID(), Col1
    FROM #T

    DROP TABLE #T


  • Sergiy wrote:

    GROUP BY is applied to the source data sets in queries, with possible use of statistics, indexing, etc.

    To illustrate the difference I offer you this simple piece of code: …

    Well, that didn't demonstrate any performance improvement from using GROUP BY.

  • Jonathan AC Roberts wrote:

    Well, that didn't demonstrate any performance improvement from using GROUP BY.

    It explained where the performance improvement comes from.

    If you want to see it, just rewrite any of your DISTINCT queries running against a significant dataset to use GROUP BY instead of DISTINCT. You may not even need SET STATISTICS TIME ON to notice the difference.
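
    For example, a minimal comparison sketch (the table and column names are taken from the original post; with a select list this simple the plans may well be identical, and the gap shows up once per-row expressions are involved):

    SET STATISTICS TIME ON;
    SET STATISTICS IO ON;

    -- the two forms return the same rows; compare CPU time and logical reads
    SELECT DISTINCT acctno FROM dbo.BCPDailybalmasterydaily;

    SELECT acctno FROM dbo.BCPDailybalmasterydaily GROUP BY acctno;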

