purging old data and outer joins

  • You're correct that the Address table may not be worth bothering with as far as purging is concerned. There are ten other tables like these which have more rows eligible for deletion.

    TableName______________________Rows____________Records deletable
    PERSISTENT_OBJECT_REFERENCE____1,093,267,410______150,501,810
    EVENT__________________________895,659,484________162,105,470

  • Sergiy (2/14/2016)


    In this case the best solution would be simply to stop the job.

    Removing 1% of the records as an ultimate goal does not make much sense. Achieving it won't make any difference at all.

    The purging would do more damage (in terms of resource consumption, locking, etc.) than any benefit it could add.

    To this, +1 Billion! Sergiy beat me to it. If you also consider the size of the other tables, it's much less than 1/2% of the data in the combined tables. It's just not worth the fuss.

    I agree... the best fix is to simply stop the purge job, in this case.

    If you absolutely insist on deleting the rows, then learn to use EXCEPT between the Address table and the other tables (starting with the largest of the other tables). You'd be left with a list of ALL the AddressIDs that aren't used in any of the other tables, and it'll only cost you one scan per column in the other tables.

    The DELETEs may still take a while if you have DRI in place for all those other tables.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Indianrock (2/14/2016)


    You're correct that the Address table may not be worth bothering with as far as purging is concerned. There are ten other tables like these which have more rows eligible for deletion.

    TableName______________________Rows____________Records deletable
    PERSISTENT_OBJECT_REFERENCE____1,093,267,410______150,501,810
    EVENT__________________________895,659,484________162,105,470

    Entities like "Event" or "Object Reference" should be referencing other entities' IDs, not being referenced by others.

    When you need to clean up the Event table, just select N of the oldest records which are beyond the retention period (Event Date is a clustered key, of course, so it's a fast-running query), remove them from wherever those Events might be mentioned, and then delete from Event.

    You should not be deleting events just because they are not mentioned in other tables.
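
    A minimal sketch of that retention-based purge, for illustration only. The @Cutoff value and the EVENT_ID, EVENT_DATE, and EVENT_NOTE names are assumptions, not the actual schema; substitute the real key, date column, and referencing tables.

    DECLARE @BatchSize INT = 10000;
    DECLARE @Cutoff DATETIME = DATEADD(YEAR, -2, GETDATE()); -- assumed retention period

    -- Oldest out-of-retention events: a range scan on the clustered key.
    SELECT TOP (@BatchSize) EVENT_ID
    INTO #OldEvents
    FROM dbo.EVENT
    WHERE EVENT_DATE < @Cutoff
    ORDER BY EVENT_DATE;

    -- Remove mentions of these events first (repeat per referencing table)...
    DELETE N
    FROM dbo.EVENT_NOTE N
    JOIN #OldEvents O ON O.EVENT_ID = N.EVENT_ID;

    -- ...then delete the events themselves.
    DELETE E
    FROM dbo.EVENT E
    JOIN #OldEvents O ON O.EVENT_ID = E.EVENT_ID;

    DROP TABLE #OldEvents;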

    _____________
    Code for TallyGenerator

  • By "Except" were you referring to this alternative approach to all of the outer joins? ( sql code block below )? Right now the original query with outer joins took almost 40 minutes in production to build the temp table of 100k records, but after that Hugo's approach to deleting is going quickly enough for our purposes.

    Starting at Feb 14 2016 3:03PM
    temp table for Address has 100000 at Feb 14 2016 3:40PM
    Counter down to 99000 and deleted this loop=1000 at Feb 14 2016 3:40PM
    Counter down to 98000 and deleted this loop=1000 at Feb 14 2016 3:40PM
    Counter down to 97000 and deleted this loop=1000 at Feb 14 2016 3:40PM
    Counter down to 96000 and deleted this loop=1000 at Feb 14 2016 3:40PM
    Counter down to 95000 and deleted this loop=1000 at Feb 14 2016 3:40PM
    Counter down to 94000 and deleted this loop=1000 at Feb 14 2016 3:41PM
    Counter down to 93000 and deleted this loop=1000 at Feb 14 2016 3:41PM
    Counter down to 92000 and deleted this loop=1000 at Feb 14 2016 3:41PM
    Counter down to 91000 and deleted this loop=1000 at Feb 14 2016 3:41PM
    Counter down to 90000 and deleted this loop=1000 at Feb 14 2016 3:41PM
    Counter down to 89000 and deleted this loop=1000 at Feb 14 2016 3:41PM
    Counter down to 88000 and deleted this loop=1000 at Feb 14 2016 3:41PM
    Counter down to 87000 and deleted this loop=1000 at Feb 14 2016 3:41PM
    Counter down to 86000 and deleted this loop=1000 at Feb 14 2016 3:41PM

    INSERT INTO #tempdata
    SELECT TOP (500000) ADDRESS_ID
    FROM dbo.ADDRESS
    WHERE NOT EXISTS (SELECT * FROM EARLY_PAYOFF_INFO EPI1 WHERE EPI1.ADDRESS_ID = ADDRESS.ADDRESS_ID)
      AND NOT EXISTS (SELECT * FROM EARLY_PAYOFF_INFO EPI2 WHERE EPI2.ADDRESS_ID1 = ADDRESS.ADDRESS_ID)
      AND NOT EXISTS (SELECT * FROM EARLY_PAYOFF_INFO EPI3 WHERE EPI3.ADDRESS_ID2 = ADDRESS.ADDRESS_ID)
      AND NOT EXISTS (SELECT * FROM CHECK_REQUEST_DEFAULT_ADDRESS WHERE CHECK_REQUEST_DEFAULT_ADDRESS.ADDRESS_ID = ADDRESS.ADDRESS_ID)
      AND NOT EXISTS (SELECT * FROM DIRECT_LENDING_SERVICE WHERE DIRECT_LENDING_SERVICE.PREV_LIENHOLDER_MAIL_TO_ADDRESS_ID = ADDRESS.ADDRESS_ID)
      AND NOT EXISTS (SELECT * FROM PACKET WHERE PACKET.ADDRESS_ID = ADDRESS.ADDRESS_ID)
      AND NOT EXISTS (SELECT * FROM WORK_QUEUE_ITEM WHERE WORK_QUEUE_ITEM.ADDRESS_ID = ADDRESS.ADDRESS_ID)
      AND NOT EXISTS (SELECT * FROM LEGAL_ENTITY WHERE LEGAL_ENTITY.ADDRESS_ID = ADDRESS.ADDRESS_ID)
      AND NOT EXISTS (SELECT * FROM SERVICE_REQUEST WHERE SERVICE_REQUEST.ADDRESS_ID = ADDRESS.ADDRESS_ID)
      AND NOT EXISTS (SELECT * FROM LIENHOLDER WHERE LIENHOLDER.ADDRESS_ID = ADDRESS.ADDRESS_ID)
    ORDER BY ADDRESS_ID;

  • Jeff Moden (2/14/2016)


    If you absolutely insist on deleting the rows, then learn to use EXCEPT between the Address table and the other tables (starting with the largest of the other tables). You'd be left with a list of ALL the AddressIDs that aren't used in any of the other tables, and it'll only cost you one scan per column in the other tables.

    This is exactly what I was trying to accomplish with the CTE approach I submitted earlier, and the plans look very similar: 3 scans and the rest seeks. Using EXCEPT may be a bit more persuasive with the QO; I'm interested to see how it stacks up.

    This is what it looks like, Indianrock:

    SELECT TOP (500000) X.ADDRESS_ID
    FROM ( SELECT ADDRESS_ID FROM dbo.ADDRESS
           EXCEPT SELECT ADDRESS_ID FROM EARLY_PAYOFF_INFO EPI1
           EXCEPT SELECT ADDRESS_ID1 FROM EARLY_PAYOFF_INFO EPI2
           EXCEPT SELECT ADDRESS_ID2 FROM EARLY_PAYOFF_INFO EPI3
           EXCEPT SELECT ADDRESS_ID FROM CHECK_REQUEST_DEFAULT_ADDRESS
           EXCEPT SELECT PREV_LIENHOLDER_MAIL_TO_ADDRESS_ID FROM DIRECT_LENDING_SERVICE
           EXCEPT SELECT ADDRESS_ID FROM PACKET
           EXCEPT SELECT ADDRESS_ID FROM WORK_QUEUE_ITEM
           EXCEPT SELECT ADDRESS_ID FROM LEGAL_ENTITY
           EXCEPT SELECT ADDRESS_ID FROM SERVICE_REQUEST
           EXCEPT SELECT ADDRESS_ID FROM LIENHOLDER ) X (ADDRESS_ID);

    There are no special teachers of virtue, because virtue is taught by the whole community.
    --Plato

  • Sergiy (2/14/2016)


    In this case the best solution would be simply to stop the job.

    What an Agile thing to propose.

    There are no special teachers of virtue, because virtue is taught by the whole community.
    --Plato

  • Indianrock (2/13/2016)


    Thanks. I added the ORDER BY because the actual purge script is loading records into a temp table (address_id), and someone suggested that deleting in that order would be better than random.

    Deleting by batch only makes sense when the batches cover contiguous blocks of records from the clustered index, with no records in the range excluded from the deletion.

    Otherwise it will be random anyway. Those 1,000 records you delete in a batch come from different pages, totally unrelated to each other, scattered in unpredictable patterns over different ranges.

    A Key Lookup is performed, and RBAR deletions happen behind the scenes.

    A cursor should not be much slower for such deletions.

    But the locking overhead with single-row deletions should be almost non-existent.
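
    For illustration, a minimal sketch of such a single-row cursor delete, assuming the candidate IDs are already staged in #tempdata (the cursor options and the BIGINT type are assumptions):

    DECLARE @Id BIGINT;
    DECLARE del_cur CURSOR LOCAL FAST_FORWARD FOR
        SELECT ADDRESS_ID FROM #tempdata;

    OPEN del_cur;
    FETCH NEXT FROM del_cur INTO @Id;
    WHILE @@FETCH_STATUS = 0
    BEGIN
        -- One row per statement: tiny transactions, minimal locking footprint.
        DELETE FROM dbo.ADDRESS WHERE ADDRESS_ID = @Id;
        FETCH NEXT FROM del_cur INTO @Id;
    END;

    CLOSE del_cur;
    DEALLOCATE del_cur;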

    _____________
    Code for TallyGenerator

  • Orlando Colamatteo (2/14/2016)


    Sergiy (2/14/2016)


    In this case the best solution would be simply to stop the job.

    What an Agile thing to propose.

    I don't think so.

    "Agile thing" is to quickly bake a solution for a task which should not be implemented at all.

    Agility especially required to quickly bake another "brilliant and effective" solution while uselessness of the previous one is being discovered.

    _____________
    Code for TallyGenerator

  • Sergiy (2/14/2016)


    Orlando Colamatteo (2/14/2016)


    Sergiy (2/14/2016)


    In this case the best solution would be simply to stop the job.

    What an Agile thing to propose.

    I don't think so.

    "Agile thing" is to quickly bake a solution for a task which should not be implemented at all.

    Agility especially required to quickly bake another "brilliant and effective" solution while uselessness of the previous one is being discovered.

    Not quite, but close. An important component of being Agile is saying no to work that does not hold true business value. In this case, someone has asked Indianrock to do this work, so it is likely too late. Who knows what could have been had you been in the Product Owner role.

    There are no special teachers of virtue, because virtue is taught by the whole community.
    --Plato

  • Orlando Colamatteo (2/14/2016)


    Jeff Moden (2/14/2016)


    If you absolutely insist on deleting the rows, then learn to use EXCEPT between the Address table and the other tables (starting with the largest of the other tables). You'd be left with a list of ALL the AddressIDs that aren't used in any of the other tables, and it'll only cost you one scan per column in the other tables.

    This is exactly what I was trying to accomplish with the CTE approach I submitted earlier, and the plans look very similar: 3 scans and the rest seeks. Using EXCEPT may be a bit more persuasive with the QO; I'm interested to see how it stacks up.

    This is what it looks like, Indianrock:

    SELECT TOP (500000) X.ADDRESS_ID
    FROM ( SELECT ADDRESS_ID FROM dbo.ADDRESS
           EXCEPT SELECT ADDRESS_ID FROM EARLY_PAYOFF_INFO EPI1
           EXCEPT SELECT ADDRESS_ID1 FROM EARLY_PAYOFF_INFO EPI2
           EXCEPT SELECT ADDRESS_ID2 FROM EARLY_PAYOFF_INFO EPI3
           EXCEPT SELECT ADDRESS_ID FROM CHECK_REQUEST_DEFAULT_ADDRESS
           EXCEPT SELECT PREV_LIENHOLDER_MAIL_TO_ADDRESS_ID FROM DIRECT_LENDING_SERVICE
           EXCEPT SELECT ADDRESS_ID FROM PACKET
           EXCEPT SELECT ADDRESS_ID FROM WORK_QUEUE_ITEM
           EXCEPT SELECT ADDRESS_ID FROM LEGAL_ENTITY
           EXCEPT SELECT ADDRESS_ID FROM SERVICE_REQUEST
           EXCEPT SELECT ADDRESS_ID FROM LIENHOLDER ) X (ADDRESS_ID);

    Mostly. I was thinking more along the lines of ...

    DECLARE @BatchSize INT;
    SELECT @BatchSize = 50000; --or whatever

    SELECT BatchNumber = (ROW_NUMBER() OVER (ORDER BY ADDRESS_ID)-1)/@BatchSize
          ,ADDRESS_ID
    INTO #WhatToDelete
    FROM ( SELECT ADDRESS_ID FROM dbo.ADDRESS
           EXCEPT SELECT ADDRESS_ID FROM LEGAL_ENTITY --This one first because it's the largest
           EXCEPT SELECT ADDRESS_ID FROM EARLY_PAYOFF_INFO EPI1
           EXCEPT SELECT ADDRESS_ID1 FROM EARLY_PAYOFF_INFO EPI2
           EXCEPT SELECT ADDRESS_ID2 FROM EARLY_PAYOFF_INFO EPI3
           EXCEPT SELECT ADDRESS_ID FROM CHECK_REQUEST_DEFAULT_ADDRESS
           EXCEPT SELECT PREV_LIENHOLDER_MAIL_TO_ADDRESS_ID FROM DIRECT_LENDING_SERVICE
           EXCEPT SELECT ADDRESS_ID FROM PACKET
           EXCEPT SELECT ADDRESS_ID FROM WORK_QUEUE_ITEM
           EXCEPT SELECT ADDRESS_ID FROM SERVICE_REQUEST
           EXCEPT SELECT ADDRESS_ID FROM LIENHOLDER
         ) d (ADDRESS_ID);

    Then you'd have a delete loop based on the batch number.
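
    A minimal sketch of what that delete loop might look like, driven by the BatchNumber column staged above (the WAITFOR throttle is an optional addition, not part of the original suggestion):

    DECLARE @Batch INT = 0, @MaxBatch INT;
    SELECT @MaxBatch = MAX(BatchNumber) FROM #WhatToDelete;

    WHILE @Batch <= @MaxBatch
    BEGIN
        DELETE A
        FROM dbo.ADDRESS A
        JOIN #WhatToDelete D ON D.ADDRESS_ID = A.ADDRESS_ID
        WHERE D.BatchNumber = @Batch;

        SET @Batch += 1;
        WAITFOR DELAY '00:00:01'; -- optional: let the log and other sessions breathe
    END;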

    Of course, part of the reason to not do any deletes is the same reason why the above or any delete might cause a problem... nothing has examined the process well enough to know if any given Address_ID is about to be used in one of the other tables. There has to be either a process guarantee or an expected temporal guarantee, or we might be deleting data that shouldn't be.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Since ADDRESS_ID on dbo.ADDRESS is a clustered PK, you'd better delete only those entries which make up a contiguous range of unused addresses.

    Which means: ADDRESS_ID is less than any ADDRESS_ID in any of the referencing tables.

    Yes, it will leave some unused addresses in the table, but it would be a really minor trade-off.

    As we agreed, less than 1% of extra records won't have any influence on the overall performance.

    So, find the minimal ADDRESS_ID used in any referencing table and delete anything with a lower ADDRESS_ID:

    DELETE TOP (10000) FROM dbo.ADDRESS
    WHERE ADDRESS_ID < (SELECT MIN(ADDRESS_ID) FROM (
        SELECT MIN(ADDRESS_ID) FROM LEGAL_ENTITY -- MIN() on an indexed column is a TOP (1) seek
        UNION ALL SELECT MIN(ADDRESS_ID) FROM EARLY_PAYOFF_INFO
        UNION ALL SELECT MIN(ADDRESS_ID1) FROM EARLY_PAYOFF_INFO
        UNION ALL SELECT MIN(ADDRESS_ID2) FROM EARLY_PAYOFF_INFO
        UNION ALL SELECT MIN(ADDRESS_ID) FROM CHECK_REQUEST_DEFAULT_ADDRESS
        UNION ALL SELECT MIN(PREV_LIENHOLDER_MAIL_TO_ADDRESS_ID) FROM DIRECT_LENDING_SERVICE
        UNION ALL SELECT MIN(ADDRESS_ID) FROM PACKET
        UNION ALL SELECT MIN(ADDRESS_ID) FROM WORK_QUEUE_ITEM
        UNION ALL SELECT MIN(ADDRESS_ID) FROM SERVICE_REQUEST
        UNION ALL SELECT MIN(ADDRESS_ID) FROM LIENHOLDER
    ) A (ADDRESS_ID));

    I wonder how long the SELECT part of the query would take?

    _____________
    Code for TallyGenerator

  • Sergiy (2/14/2016)


    I wonder how long the SELECT part of the query would take?

    The DELETE part would probably take as long as not doing any deletes and would probably delete the same number of rows... none. Mission accomplished! 😀

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Jeff Moden (2/14/2016)


    Sergiy (2/14/2016)


    I wonder how long the SELECT part of the query would take?

    The DELETE part would probably take as long as not doing any deletes and would probably delete the same number of rows... none. Mission accomplished! 😀

    Why do you think no rows are going to be deleted?

    _____________
    Code for TallyGenerator

  • Sergiy (2/14/2016)


    Indianrock (2/14/2016)


    You're correct that the Address table may not be worth bothering with as far as purging is concerned. There are ten other tables like these which have more rows eligible for deletion.

    TableName______________________Rows____________Records deletable
    PERSISTENT_OBJECT_REFERENCE____1,093,267,410______150,501,810
    EVENT__________________________895,659,484________162,105,470

    Entities like "Event" or "Object Reference" should be referencing other entities' IDs, not being referenced by others.

    When you need to clean up the Event table, just select N of the oldest records which are beyond the retention period (Event Date is a clustered key, of course, so it's a fast-running query), remove them from wherever those Events might be mentioned, and then delete from Event.

    You should not be deleting events just because they are not mentioned in other tables.

    Each table has its own defined purge logic. I just mentioned a couple of the others beyond Address because they have much more purge work to be done if I want to just ignore Address for now.

  • "nothing has examined the process well enough to know if any given Address_ID is about to be used in one of the other tables. There has to be either a process guarantee or an expected temporal guarantee or we might be deleting data that shouldn't be. "

    Well now we're getting off into the realm of flux capacitors. 🙂

    However I do feel more like I do now than I did when I first posted this.
