purging old data and outer joins

  • 500k rows in 10 minutes on production. Query plan (.sqlplan) attached

    SELECT TOP (500000) X.ADDRESS_ID
    FROM ( SELECT ADDRESS_ID FROM dbo.ADDRESS
    EXCEPT SELECT ADDRESS_ID1 FROM EARLY_PAYOFF_INFO EPI1
    EXCEPT SELECT ADDRESS_ID1 FROM EARLY_PAYOFF_INFO EPI2
    EXCEPT SELECT ADDRESS_ID2 FROM EARLY_PAYOFF_INFO EPI3
    EXCEPT SELECT ADDRESS_ID FROM CHECK_REQUEST_DEFAULT_ADDRESS
    EXCEPT SELECT PREV_LIENHOLDER_MAIL_TO_ADDRESS_ID FROM DIRECT_LENDING_SERVICE
    EXCEPT SELECT ADDRESS_ID FROM PACKET
    EXCEPT SELECT ADDRESS_ID FROM WORK_QUEUE_ITEM
    EXCEPT SELECT ADDRESS_ID FROM LEGAL_ENTITY
    EXCEPT SELECT ADDRESS_ID FROM SERVICE_REQUEST
    EXCEPT SELECT ADDRESS_ID FROM LIENHOLDER) X(ADDRESS_ID);

  • Can you try it in order from most rows to least? See Jeff's follow-up post to mine. The idea is to reduce the set as much as possible, as early as possible.

    There are no special teachers of virtue, because virtue is taught by the whole community.
    --Plato

  • Sergiy (2/14/2016)


    Jeff Moden (2/14/2016)


    Sergiy (2/14/2016)


    I wonder: how long would the SELECT part of the query take?

    The DELETE part would probably take just as long while not deleting anything, and would probably delete the same number of rows... none. Mission accomplished! πŸ˜€

    Why do you think no rows are going to be deleted?

    I could be wrong, but it looks as if all it takes is one address ID in the other tables that matches the first address ID in the ADDRESS table, and nothing will be deleted.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Jeff Moden (2/15/2016)


    Sergiy (2/14/2016)


    Jeff Moden (2/14/2016)


    Sergiy (2/14/2016)


    I wonder: how long would the SELECT part of the query take?

    The DELETE part would probably take just as long while not deleting anything, and would probably delete the same number of rows... none. Mission accomplished! πŸ˜€

    Why do you think no rows are going to be deleted?

    I could be wrong, but it looks as if all it takes is one address ID in the other tables that matches the first address ID in the ADDRESS table, and nothing will be deleted.

    No, you're not wrong. (as always, b...y h..l!) πŸ™‚

    But look at the data.

    500 mil addresses stored in the database.

    USPS delivers to 154 mil addresses.

    So the client is three times bigger than the US.

    Who's bigger?

    China?

    I doubt the Chinese would outsource a database of national scale to India.

    India?

    IndianRock said they are an outsourcing company, so it's not India.

    Only one conclusion is left: this is not normalised data storage.

    Every new case adds a record of a new client, its lender, associated legal entity, etc.

    Each of those records has an address, which is recorded as an ad-hoc set of strings.

    In an attempt to simulate data normalisation, the developers put all addresses into a single table, but each of those 9 addresses creates a new record in the table no matter what.

    The lending agency has most likely not moved once over the course of the last 10 years, but the application adds a new record for its address every time it is mentioned in a new case.

    That's the only way I can explain such a number of addresses in the car dealership database.

    Now, what happens when a case expires?

    They remove it from the database.

    And they have to remove all the records associated with that case.

    Because they are all one-off, and they cannot be reused in other cases.

    Logically, cases are removed in chronological order, the oldest first.

    The oldest cases created the addresses with the lowest ADDRESS_ID values.

    When you remove them, the minimum used ADDRESS_ID moves up, and my DELETE will find some work to do.

    Yes, I admit, there might be some exceptions, some jumps in the queue.

    Some addresses may be late for deletion by a week or two.

    But they'll catch up pretty soon, I'm sure.

    IndianRock - how far is my guess from the actual state of affairs?

    _____________
    Code for TallyGenerator

  • Orlando Colamatteo (2/15/2016)


    Can you try it in order from most rows to least? See Jeff's follow-up post to mine. The idea is to reduce the set as much as possible, as early as possible.

    You'd think the optimiser would rearrange the join order, but it doesn't seem to, and the natural order in the query does have an influence, as you'd expect.

    I set up some sample data like this:

    SELECT TOP 15000000 Address_ID = ID
    INTO #Address
    FROM dbo.SomeTableWithAUniqueIDColumn
    ORDER BY NEWID();

    CREATE UNIQUE CLUSTERED INDEX ucx_Stuff ON #Address (Address_ID);

    with another two tables having the same number of rows captured at random, plus another four tables with a million rows each, captured at random (from 21 million rows in the permanent table).

    Then I ran these two queries and recorded the best of three executions for each of the three conditions:

    IF OBJECT_ID(N'tempdb..#Temp1') IS NOT NULL DROP TABLE #Temp1;

    SELECT Address_ID
    INTO #Temp1 FROM #Address                -- 15,000,000
    EXCEPT SELECT Address_ID FROM #Address01 -- 15,000,000
    EXCEPT SELECT Address_ID FROM #Address02 -- 15,000,000
    EXCEPT SELECT Address_ID FROM #Address03 --  1,000,000
    EXCEPT SELECT Address_ID FROM #Address04 --  1,000,000
    EXCEPT SELECT Address_ID FROM #Address05 --  1,000,000
    EXCEPT SELECT Address_ID FROM #Address06 --  1,000,000
    -- no hint:             00:00:37
    -- OPTION (MERGE JOIN): 00:00:13
    -- OPTION (HASH JOIN):  00:00:42

    IF OBJECT_ID(N'tempdb..#Temp2') IS NOT NULL DROP TABLE #Temp2;

    SELECT Address_ID
    INTO #Temp2 FROM #Address                -- 15,000,000
    EXCEPT SELECT Address_ID FROM #Address06 --  1,000,000
    EXCEPT SELECT Address_ID FROM #Address05 --  1,000,000
    EXCEPT SELECT Address_ID FROM #Address04 --  1,000,000
    EXCEPT SELECT Address_ID FROM #Address03 --  1,000,000
    EXCEPT SELECT Address_ID FROM #Address02 -- 15,000,000
    EXCEPT SELECT Address_ID FROM #Address01 -- 15,000,000
    -- no hint:             00:00:46
    -- OPTION (MERGE JOIN): 00:00:16
    -- OPTION (HASH JOIN):  00:01:05

    Note that both of these queries, even without the hints, joined the tables in the same order that they appear in the query and used hash joins for the big tables and nested loops for the smaller tables.

    I tested with different sizes of tables and the results were consistent: without hints, the optimiser joined the tables in the same order in which they appeared in the query, and joining in descending order of size always gave the best performance. Also, a plan utilising merge joins never arose without the merge join hint, and with the hint it always performed better than the alternatives.
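    If the current table sizes aren't known offhand, approximate row counts for ordering the EXCEPT branches from largest to smallest can be read from the catalog views. A minimal sketch for SQL Server (the counts are approximate, which is fine for ordering purposes):

    ```sql
    -- Approximate row count per user table, largest first.
    -- index_id 0 = heap, 1 = clustered index; summing over partitions
    -- gives the table-level count without scanning the data.
    SELECT t.name      AS table_name,
           SUM(p.rows) AS approx_rows
    FROM sys.tables t
    JOIN sys.partitions p
      ON p.object_id = t.object_id
     AND p.index_id IN (0, 1)
    GROUP BY t.name
    ORDER BY approx_rows DESC;
    ```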

    β€œWrite the query the simplest way. If through testing it becomes clear that the performance is inadequate, consider alternative query forms.” - Gail Shaw

    For fast, accurate and documented assistance in answering your questions, please read this article.
    Understanding and using APPLY, (I) and (II) Paul White
    Hidden RBAR: Triangular Joins / The "Numbers" or "Tally" Table: What it is and how it replaces a loop Jeff Moden

  • Plan attached for tables ordered large to small in the "except" lines, with hint.

    -- 50k in QA, tables ordered large to small, OPTION (MERGE JOIN): 02:34
    -- 50k in QA, tables ordered large to small, no hint: 36 minutes
    -- original table order in prod: 500k in 10 minutes

    SELECT TOP (50000) X.ADDRESS_ID
    FROM ( SELECT ADDRESS_ID FROM dbo.ADDRESS
    EXCEPT SELECT ADDRESS_ID FROM LEGAL_ENTITY
    EXCEPT SELECT ADDRESS_ID FROM LIENHOLDER
    EXCEPT SELECT ADDRESS_ID FROM SERVICE_REQUEST
    EXCEPT SELECT ADDRESS_ID FROM WORK_QUEUE_ITEM
    EXCEPT SELECT ADDRESS_ID FROM PACKET
    EXCEPT SELECT PREV_LIENHOLDER_MAIL_TO_ADDRESS_ID FROM DIRECT_LENDING_SERVICE
    EXCEPT SELECT ADDRESS_ID1 FROM EARLY_PAYOFF_INFO EPI1
    EXCEPT SELECT ADDRESS_ID1 FROM EARLY_PAYOFF_INFO EPI2
    EXCEPT SELECT ADDRESS_ID2 FROM EARLY_PAYOFF_INFO EPI3
    EXCEPT SELECT ADDRESS_ID FROM CHECK_REQUEST_DEFAULT_ADDRESS) X(ADDRESS_ID)
    OPTION (MERGE JOIN);

  • ChrisM@Work (2/16/2016)


    Orlando Colamatteo (2/15/2016)


    Can you try it in order from most rows to least? See Jeff's follow-up post to mine. The idea is to reduce the set as much as possible, as early as possible.

    You'd think the optimiser would rearrange the join order, but it doesn't seem to, and the natural order in the query does have an influence, as you'd expect.

    My understanding is that order matters when using EXCEPT. In this case it does not (logically, as long as we have ADDRESS first), but the QO cannot know our intent.

    IF OBJECT_ID(N'tempdb..#a') IS NOT NULL DROP TABLE #a;
    IF OBJECT_ID(N'tempdb..#b') IS NOT NULL DROP TABLE #b;
    GO

    CREATE TABLE #a (item VARCHAR(100));
    CREATE TABLE #b (item VARCHAR(100));

    INSERT INTO #a (item) VALUES ('Soup');
    INSERT INTO #b (item) VALUES ('Soup');
    INSERT INTO #b (item) VALUES ('Crackers');

    SELECT item FROM #a
    EXCEPT
    SELECT item FROM #b; -- returns no rows

    SELECT item FROM #b
    EXCEPT
    SELECT item FROM #a; -- returns 'Crackers'

    There are no special teachers of virtue, because virtue is taught by the whole community.
    --Plato

  • IndianRock,

    Is it possible to record the unique ADDRESS_IDs from the deleted records of, say, SERVICE_REQUEST when you clean it up?

    The best way would be to use FOR DELETE triggers on the tables involved to copy the ADDRESS_IDs from the deleted records into a table AddressID_ToDelete (ADDRESS_ID int PRIMARY KEY).

    But it's not the only method you can use, of course.
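    A minimal sketch of such a FOR DELETE trigger for one of the tables, assuming the AddressID_ToDelete table described above (the trigger name is illustrative, and the NOT EXISTS check keeps the primary key from failing on repeated captures):

    ```sql
    CREATE TRIGGER trg_ServiceRequest_CaptureAddressIDs
    ON dbo.SERVICE_REQUEST
    FOR DELETE
    AS
    BEGIN
        SET NOCOUNT ON;
        -- Record each deleted row's ADDRESS_ID once, for later address cleanup.
        INSERT INTO dbo.AddressID_ToDelete (ADDRESS_ID)
        SELECT DISTINCT d.ADDRESS_ID
        FROM deleted d
        WHERE d.ADDRESS_ID IS NOT NULL
          AND NOT EXISTS (SELECT 1 FROM dbo.AddressID_ToDelete t
                          WHERE t.ADDRESS_ID = d.ADDRESS_ID);
    END;
    ```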

    _____________
    Code for TallyGenerator

  • Try this one:

    SELECT A.ADDRESS_ID
    INTO #ToDelete
    FROM dbo.ADDRESS A
    WHERE NOT EXISTS (SELECT * FROM LEGAL_ENTITY L
                      WHERE L.ADDRESS_ID = A.ADDRESS_ID);

    IF EXISTS (SELECT * FROM #ToDelete)
        DELETE A
        FROM #ToDelete A
        INNER JOIN SERVICE_REQUEST B ON B.ADDRESS_ID = A.ADDRESS_ID;

    IF EXISTS (SELECT * FROM #ToDelete)
        DELETE A
        FROM #ToDelete A
        INNER JOIN LIENHOLDER B ON B.ADDRESS_ID = A.ADDRESS_ID;

    IF EXISTS (SELECT * FROM #ToDelete)
        DELETE A
        FROM #ToDelete A
        INNER JOIN WORK_QUEUE_ITEM B ON B.ADDRESS_ID = A.ADDRESS_ID;

    IF EXISTS (SELECT * FROM #ToDelete)
        DELETE A
        FROM #ToDelete A
        INNER JOIN PACKET B ON B.ADDRESS_ID = A.ADDRESS_ID;

    IF EXISTS (SELECT * FROM #ToDelete)
        DELETE A
        FROM #ToDelete A
        INNER JOIN DIRECT_LENDING_SERVICE B ON B.PREV_LIENHOLDER_MAIL_TO_ADDRESS_ID = A.ADDRESS_ID;

    IF EXISTS (SELECT * FROM #ToDelete)
        DELETE A
        FROM #ToDelete A
        INNER JOIN CHECK_REQUEST_DEFAULT_ADDRESS B ON B.ADDRESS_ID = A.ADDRESS_ID;

    IF EXISTS (SELECT * FROM #ToDelete)
        DELETE A
        FROM #ToDelete A
        INNER JOIN EARLY_PAYOFF_INFO B ON B.ADDRESS_ID1 = A.ADDRESS_ID OR B.ADDRESS_ID2 = A.ADDRESS_ID OR B.ADDRESS_ID3 = A.ADDRESS_ID;

    IF EXISTS (SELECT * FROM #ToDelete)
        DELETE B
        FROM #ToDelete A
        INNER JOIN dbo.ADDRESS B ON B.ADDRESS_ID = A.ADDRESS_ID;

    It replaces 9 "fat joins" with just one.

    The rest will operate on the thousands of rows remaining in #ToDelete instead of the millions of rows in the ADDRESS table.

    If you find a way to record the candidates for deletion (see my previous post), you'll be able to get rid of that last "fat join".

    You might wish to limit the number of rows to be deleted by the last statement.

    But most likely, after a couple of runs, the number of records remaining in #ToDelete will never go over a few hundred, so it would not matter.
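    If that limit does turn out to be needed, the final statement can be batched with TOP. A sketch of one common pattern (the batch size of 50,000 is illustrative):

    ```sql
    -- Delete the surviving candidates in batches to keep the
    -- transaction log growth and lock footprint small.
    DECLARE @rc INT = 1;
    WHILE @rc > 0
    BEGIN
        DELETE TOP (50000) B
        FROM #ToDelete A
        INNER JOIN dbo.ADDRESS B ON B.ADDRESS_ID = A.ADDRESS_ID;

        SET @rc = @@ROWCOUNT;
    END;
    ```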

    Try it on QA and see how it goes.

    _____________
    Code for TallyGenerator
