How to bulk delete SPECIFIC data from a DB respecting ALL CONSTRAINTS!

  • Hi all, my company has a centralized database that contains data belonging to many different locations. I have now been tasked with coming up with a way to take a copy of this database and then 'DELETE' ALL data that does not belong to a specific group of locations.
    Basically I need to generate scripts that can delete data in the correct hierarchical order based upon FKs etc., with a 'WHERE' clause like this:

    WHERE [dbo].[tbl_VPOS_Locations].LocationID NOT IN (0, 199, 200, 346, 370)

    So all data that DOES NOT belong to one of the location IDs in the 'WHERE' clause is deleted.

    I have searched this forum, and numerous others, and found many fragments of what I need to do; from them I have so far managed to achieve the following (too many people to thank individually, but massive thanks to you all!):

    1. I have a script that generates ALL of the 'DELETE' statements based upon the FK relationships between my table '[dbo].[tbl_VPOS_Locations]' (via the column 'LocationID') and all of the related tables. I end up with a MASSIVE list of delete statements that look like this:


    DELETE FROM {TARGET TABLE}
    FROM {TARGET TABLE}
    INNER JOIN {JOIN 1} ON {JOIN 2}.{COL}={JOIN 1}.{COL}
    INNER JOIN {JOIN 3} ON {JOIN 2}.{COL}={JOIN 3}.{COL}
    WHERE [dbo].[tbl_VPOS_Locations].LocationID NOT IN (0, 199, 200, 346, 370)

    I do see that there are multiple deletes being carried out against the same table, BUT with different JOINs, which I am assuming is correct as this should cover all data scenarios.
    Now when I run this script it does actually work, to a point, BUT it takes well over 36 hours, which is way outside the specified time limit of 7 hours MAX (1 working day of downtime for the specified location group).

    I am desperate to find out any ways to massively improve this process.

    Things that I have already tried:

    1. I set the DB recovery model to SIMPLE - certainly prevented the LOG from getting TOO big.
    2. I tried disabling ALL indexes before doing the 'DELETE' - hoping that might speed things up, but it actually appears detrimental, as the JOINs most likely rely on the indexes to some extent.
    3. Tried creating temporary tables without any constraints and inserting into them ONLY the required data, with the intention of dropping the original tables and then renaming the new ones and adding constraints - this seemed to work until I realized that my original multi-delete statements, which I had changed into inserts, were inserting the same data multiple times.
    4. Excluded ALL duplicate 'DELETE FROM {Table}' statements regardless of JOINs, so each delete executed only once - improved the time taken BUT did not delete ALL data as required and caused many FK issues when trying to delete other data.
    5. Excluded some of the larger tables with the intention of addressing those manually afterwards - improved time in conjunction with item 4 BUT same problem with the other deletes and FK issues.

    My best combination from above was items 4 & 5 together, which got the run down to 11 hours BUT with too many failures, each of which would require rerunning the script.
    I am now going a bit code blind looking at this problem, so if anybody can suggest ANY better ways of achieving what I need to do, or can point me to any software that would help, I would be extremely grateful!!

  • Do you have an actual execution plan from a long-running delete that you could post here, please?

    John

  • I do not at the moment unfortunately. I will have to restore my DB backup and try running a section of the script in isolation to generate one. What is the best way to attach an execution plan to this post?

  • Save it as a .sqlplan file and attach it to your post.

    John

  • You will need indexes on the columns you are joining on to get good performance.
    When you delete rows from a table that has FKs on other tables pointing to it, the deletes can be slow, because SQL Server performs referential checks against those other tables to verify that the rows being deleted are not referenced by them. An index on each referencing FK column will therefore help. And/or you can set the constraints on the referencing tables to NOCHECK while the delete is running, then re-enable them after the delete:
    ALTER TABLE [dbo].[MyTableWithFkOn] NOCHECK CONSTRAINT ALL
    ALTER TABLE [dbo].[MyTableWithFkOn] CHECK CONSTRAINT ALL
    Also wrap the code in a TRY...CATCH block and run all the deletes in a transaction. In the CATCH block you should roll back the transaction and set the tables back to CHECK CONSTRAINT.
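    A minimal sketch of that pattern (assuming a hypothetical table [dbo].[MyTableWithFkOn] and the location filter from your first post):

    BEGIN TRY
        BEGIN TRANSACTION;

        -- stop FK validation while the deletes run
        ALTER TABLE [dbo].[MyTableWithFkOn] NOCHECK CONSTRAINT ALL;

        DELETE FROM [dbo].[MyTableWithFkOn]
        WHERE LocationID NOT IN (0, 199, 200, 346, 370);

        -- WITH CHECK re-validates the data and marks the constraints as trusted again
        ALTER TABLE [dbo].[MyTableWithFkOn] WITH CHECK CHECK CONSTRAINT ALL;

        COMMIT TRANSACTION;
    END TRY
    BEGIN CATCH
        IF @@TRANCOUNT > 0 ROLLBACK TRANSACTION;
        ALTER TABLE [dbo].[MyTableWithFkOn] WITH CHECK CHECK CONSTRAINT ALL;
        RAISERROR('Delete failed; transaction rolled back.', 16, 1);
    END CATCH;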

  • Would it be easier to copy the data you want to keep out to another location?  If those 5 locations are small relative to the overall data set, inserting the good data into a new set of tables would tend to be a LOT faster.
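    For example, something along these lines per table (the table names here are hypothetical):

    -- keep only the wanted locations; SELECT INTO is minimally logged under SIMPLE recovery
    SELECT *
    INTO dbo.tbl_VPOS_Sales_keep
    FROM dbo.tbl_VPOS_Sales
    WHERE LocationID IN (0, 199, 200, 346, 370);

    -- then, once verified:
    -- DROP TABLE dbo.tbl_VPOS_Sales;
    -- EXEC sp_rename 'dbo.tbl_VPOS_Sales_keep', 'tbl_VPOS_Sales';
    -- ...and re-create the PK, indexes and FK constraints on the renamed table.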

    ----------------------------------------------------------------------------------
    Your lack of planning does not constitute an emergency on my part...unless you're my manager...or a director and above...or a really loud-spoken end-user..All right - what was my emergency again?

  • Thanks for your replies. When I finally got to look at a few execution plans, I noticed there were quite a few index scans going on that were not very performant. I have now tried a slightly different approach.
    I have created a script that generates individual delete statements based on the primary key of each row that I need to delete.
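    They look along these lines (table and key names here are made up for illustration):

    DELETE FROM [dbo].[tbl_VPOS_Sales] WHERE SaleID = 1234567;
    GO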


    As you can see this generates a very nice execution plan indeed, HOWEVER there are over 20 million of these single-line delete statements.
    Despite the quantity of statements to execute, I would still expect these deletes to be about as fast as they could possibly be, but that is not the case.
    It is still extremely slow. New things I have tried:

    DROPPED ALL foreign key constraints on ALL tables BEFORE running the scripts.
    DISABLED all NONCLUSTERED indexes on ALL tables BEFORE running the scripts.
    The generated delete statements are executed via the SQLCMD utility, with the output written to a text file.
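    (For reference, the DISABLE statements can be generated along these lines - a sketch using sys.indexes, not my exact script:)

    SELECT 'ALTER INDEX ' + QUOTENAME(i.name)
         + ' ON ' + QUOTENAME(SCHEMA_NAME(t.schema_id)) + '.' + QUOTENAME(t.name)
         + ' DISABLE;'
    FROM sys.indexes i
    INNER JOIN sys.tables t ON t.object_id = i.object_id
    WHERE i.type = 2            -- nonclustered only; disabling a clustered index makes the table inaccessible
      AND i.is_disabled = 0;
    -- after the deletes, the same pattern with REBUILD instead of DISABLE re-enables them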

    This whole thing started with a script that generates the hierarchical relationships between an initial table, its FK tables, and then their FK tables, in descending order, so that all related data is deleted cleanly.
    The statements generated from this script are then used to generate all of the individual delete statements, hence the large number of them.
    Something else I have noticed, which is obviously having an impact, is that my FK dependency script generates multiple selects against the SAME table where the only difference is the tables being joined; in some cases these slightly differing queries return some of the same data, so my delete statements are being duplicated. I think that is a very different problem, though.
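    (The dependency walk itself is roughly this shape - a simplified sketch against sys.foreign_keys; self-referencing FKs are skipped, and genuine FK cycles would need extra handling:)

    ;WITH fk_tree AS
    (
        SELECT t.object_id, 0 AS lvl
        FROM sys.tables t
        WHERE t.name = 'tbl_VPOS_Locations'

        UNION ALL

        SELECT fk.parent_object_id, ft.lvl + 1
        FROM sys.foreign_keys fk
        INNER JOIN fk_tree ft ON fk.referenced_object_id = ft.object_id
        WHERE fk.parent_object_id <> fk.referenced_object_id
    )
    SELECT OBJECT_SCHEMA_NAME(object_id) AS schema_name,
           OBJECT_NAME(object_id)        AS table_name,
           MAX(lvl)                      AS depth
    FROM fk_tree
    GROUP BY object_id
    ORDER BY depth DESC       -- deepest children first = safe delete order
    OPTION (MAXRECURSION 100);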

    Now the other thing is the hardware that this whole process is being run on;
    SQL 2008
    i5-5200 CPU @ 2.2GHz
    16GB RAM
    Windows 10

    Yes, I know, I should have checked this information first, but initially I only had remote access with very limited permissions, so I could not be sure what this info was; I now have admin access.
    I can see that this hardware WILL have an impact on performance, but would it really have such a detrimental effect?
    So far, doing PK deletes, this has been running for 26 hours and is not even halfway through! I mentioned earlier that my time window for doing this is very small.

    Short of gaining access to a massively equipped server is there anything that I am obviously missing?
    Shouldn't PK deletes be about as fast as you can get?
    Should I be grouping the delete statements in some way? 
    There is currently a 'GO' generated for EVERY delete statement - is that problematic?
    Should there only be a 'GO' after x number of deletes and batched together?

    I have considered the option of creating temporary tables, inserting into them only the data that I want to keep, dropping the original tables and renaming the new tables back to the original names.
    My questions about this are:
    Do I create the temporary tables with the same constraints, i.e. PK, indexes, etc.? (But they can't have the same names, can they?)
    As mentioned above, I can't really use a PK as there is a very high risk of the exact same data being inserted again by my scripts; should I create the tables without any constraints and only add them after the tables have been renamed?
    Is it feasible to delete duplicate records from a table that has no constraints or keys? (See the sketch below.)
    Is a SELECT INTO faster than a PK delete?
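    On the duplicates question, I'm guessing something like ROW_NUMBER() over the would-be key columns is the way to go - is this sketch (table and column names made up) sound?

    ;WITH d AS
    (
        SELECT *,
               ROW_NUMBER() OVER (PARTITION BY Col1, Col2, Col3   -- the would-be key
                                  ORDER BY (SELECT NULL)) AS rn
        FROM dbo.tbl_Staging
    )
    DELETE FROM d
    WHERE rn > 1;   -- keeps one copy of each duplicate set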

    Sorry for all the questions, but I somehow feel too committed to the path I have chosen to start from scratch, UNLESS there is no alternative.

  • If you are deleting from a table based on a query joining several tables, then some of the indexes will help performance. So you shouldn't be disabling all non-clustered indexes.

  • If you're deleting 20 million rows one at a time, of course it's going to be slow.  Do it in one go - that's what RDBMSs are designed for.  If you're worried about holding locks for too long, or lock escalations, or the size of your transaction log, then by all means split it into batches.  And please don't drop your foreign key constraints - they're there to ensure the integrity of your data.  If you get rid of them as soon as they become inconvenient, then you lose another benefit of an RDBMS.
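    For example, something like this per table (a sketch - the table name is a placeholder, and each batch commits separately so the log space can be reused under SIMPLE recovery):

    DECLARE @rows int = 1;
    WHILE @rows > 0
    BEGIN
        DELETE TOP (50000)
        FROM dbo.tbl_VPOS_SomeChild
        WHERE LocationID NOT IN (0, 199, 200, 346, 370);
        SET @rows = @@ROWCOUNT;
    END;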

    John

  • Jonathan AC Roberts - Friday, November 16, 2018 4:31 AM

    If you are deleting from a table based on a query joining several tables, then some of the indexes will help performance. So you shouldn't be disabling all non-clustered indexes.

    Thank you for your reply, but the delete statements have no joins in them at all. They are generated from a script that has the required joins and then just spits out the resulting delete statements based upon the primary key fields required. Admittedly, disabling the indexes was just something I tried, thinking it might speed things up if the indexes did not have to be maintained during the deletes. I think I may have misunderstood how deletes affect indexes...

  • John Mitchell-245523 - Friday, November 16, 2018 4:41 AM

    If you're deleting 20 million rows one at a time, of course it's going to be slow.  Do it in one go - that's what RDBMSs are designed for.  If you're worried about holding locks for too long, or lock escalations, or the size of your transaction log, then by all means split it into batches.  And please don't drop your foreign key constraints - they're there to ensure the integrity of your data.  If you get rid of them as soon as they become inconvenient, then you lose another benefit of an RDBMS.

    John

    Thank you for your reply. Regarding the FKs, it is not that they are inconvenient, and I know they are very much required for data integrity, but I have only removed them while the delete is in progress. They are recreated at the end of the process, and any anomalies found during recreation are captured to be corrected, so the DB's data integrity remains intact, if not better than it was at the very start thanks to those corrections.

  • Understood, but if it were me I'd rather prevent any anomalies than try to capture and correct them after the fact.  In the worst case, there could be hundreds or even millions of them.  Do you have reason to believe that their existence has a significant effect on the performance of the deletes?

    John

  • scout7-500377 - Friday, November 16, 2018 4:55 AM

    Thank you for your reply, but the delete statements have no joins in them at all. They are generated from a script that has the required joins and then just spits out the resulting delete statements based upon the primary key fields required. Admittedly, disabling the indexes was just something I tried, thinking it might speed things up if the indexes did not have to be maintained during the deletes. I think I may have misunderstood how deletes affect indexes...

    In your original post you say you are deleting with statements like this:
    DELETE FROM {TARGET TABLE}
    FROM {TARGET TABLE}
    INNER JOIN {JOIN 1} ON {JOIN 2}.{COL}={JOIN 1}.{COL}
    INNER JOIN {JOIN 3} ON {JOIN 2}.{COL}={JOIN 3}.{COL}
    WHERE [dbo].[tbl_VPOS_Locations].LocationID NOT IN (0, 199, 200, 346, 370)

    Now you say "the delete statements have no joins in them at all"
    So how are you deleting from the tables? Do you store the keys in a temporary table? I don't understand how you are deleting the rows.

  • Jonathan AC Roberts - Friday, November 16, 2018 5:47 AM

    In your original post you say you are deleting with statements like this:
    DELETE FROM {TARGET TABLE}
    FROM {TARGET TABLE}
    INNER JOIN {JOIN 1} ON {JOIN 2}.{COL}={JOIN 1}.{COL}
    INNER JOIN {JOIN 3} ON {JOIN 2}.{COL}={JOIN 3}.{COL}
    WHERE [dbo].[tbl_VPOS_Locations].LocationID NOT IN (0, 199, 200, 346, 370)

    Now you say "the delete statements have no joins in them at all"
    So how are you deleting from the tables? Do you store the keys in a temporary table? I don't understand how you are deleting the rows.

    Because if you read my update from earlier today you will see that I changed my approach

    " I have now tried a slightly different approach. I have created a script that generates individual delete statements based on the primary key of the row that I need to delete."

  • scout7-500377 - Friday, November 16, 2018 6:08 AM

    Because if you read my update from earlier today you will see that I changed my approach

    " I have now tried a slightly different approach. I have created a script that generates individual delete statements based on the primary key of the row that I need to delete."

    "I have created a script that generates individual delete statements based on the primary key of the row that I need to delete."
    So you are generating a delete statement for every row you are deleting??
    Instead, just insert the primary key values that you need to delete into temporary tables (one temporary table for each table you want to delete from) and then delete from each table with a join to its related temporary table.
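    i.e. roughly this shape per table (a sketch - the table, key and join names here are hypothetical):

    -- capture the keys once, using the same joins your generator already produces
    SELECT s.SaleID
    INTO #keys_tbl_VPOS_Sales
    FROM dbo.tbl_VPOS_Sales s
    INNER JOIN dbo.tbl_VPOS_Locations l ON l.LocationID = s.LocationID
    WHERE l.LocationID NOT IN (0, 199, 200, 346, 370);

    CREATE CLUSTERED INDEX cx ON #keys_tbl_VPOS_Sales (SaleID);   -- helps the delete join

    DELETE s
    FROM dbo.tbl_VPOS_Sales s
    INNER JOIN #keys_tbl_VPOS_Sales k ON k.SaleID = s.SaleID;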
