Unexpected Behavior With TOP clause in a DELETE

  • Thank you for taking the time respond, but this thread has kind of gotten way off topic. I originally just wanted to know why the optimizer wasn't picking up the value assigned as the default variable in the TOP filter. The performance problem gets fixed as soon as SQL uses an appropriate value.

    But, to answer some of your questions:

    CELKO (12/6/2016)


    I would like you to think about the proprietary syntax and what it really means. Essentially, although you probably do not know know it, you are doing a nineteen fifties tape file operation in SQL. The ANSI/ISO standard model for how this stuff works as we go to a FROM clause and construct a temporary working table that will exist only for the duration of the statement. Officially, this means that all you deletions would have been on this temporary working table! Which would be useless. But the old Sybase model from 3+ decades ago, tries to map back to the base table in the delete clause. Starting to see all kinds of implementation problems with this?

    No, not really. See the responses below.

    Unfortunately you have a TOP () operator. This proprietary and totally nonrelational operation requires a sort, hence the order by clause. Sorts are a killer in a relational system. You cannot do parallelization, map and reduce, or other modern optimizations. But remember the original Sybase version of this product was built on UNIX filesystems.

    The ORDER BY is by the clustered key, so the sort should cost nothing since it's already sorted. All this operation has to do is delete data. It doesn't need to map reduce. It doesn't need to scale across servers. I just needs to delete lots of data without making the server unusable. If it were otherwise, then those things would definitely be a concern. That being the case, I don't see a problem using the TOP operator if it servers the purpose.

    I also see you used A and B as table aliases. This is because tape drives and later floppy disk drives were named in alphabetical order. You are basically mimicking old tape filesystems.

    Not a good assumption. In reality, the aliases are meaningful abbreviations. I just used A and B in the post because it was convenient.

    The one that scares me is your magic generic "id"; this implies they are issued sequentially in some meaningful way. Unfortunately, the SQL Server world, they are too often IDENTITY columns. This is totally nonrelational and simply mimics the records on a tape file.

    In reality, the IDs are GUIDs, not sequential IDs. Before anybody goes off on a tangent, I know all the argument against using them. I didn't design the system, I'm just trying to delete lots of data from it and I have to deal with it in it's current state. Because they are guids, I can't do a range select from them like I could with integers, hence the TOP and ORDER BY.

    DELETE A

    FROM #TableA AS A

    INNER JOIN

    (SELECT TOP(@ non_rdms_filter_nbr) generic_id

    FROM #TableB

    ORDER BY generic_id) AS B

    ON B.generic_id = A.generic_id;

    In ANSI/ISO standard SQL. We would write something like this:

    DELETE FROM Alpha

    WHERE EXISTS

    (SELECT *

    FROM Beta

    WHERE Beta.generic_id = Alpha.generic_id

    AND Beta.generic_id = ???);

    I tried many different ways of structuring the delete, including the latter way you suggest. The only one that I could get to perform reasonably was using a TOP. It is entirely possible that I simply implemented the delete badly when I tried using this method. Be that as it may, TOP works and is falls within the performance criteria that we need it to, and that is more important at this point than whether or not it follows ANSI/ISO SQL standards. I have no objection to doing things the ANSI/ISO SQL way if I can, but I don't see the point in spending large amounts of time trying to get it there if something else works. Sometimes, good enough it just that.

  • In reality, the IDs are GUIDs, not sequential IDs. Before anybody goes off on a tangent, I know all the argument against using them. I didn't design the system, I'm just trying to delete lots of data from it and I have to deal with it in it's current state. Because they are guids, I can't do a range select from them like I could with integers, hence the TOP and ORDER BY.

    That's actually not true. My method of doing ID >= @i AND ID < @i + 5000 Will work just fine against GUIDs. Just look at the character range of them and use LIKE with as many characters as you need and you will still get seeks. 0-9 and A-F are the allowable characters.

    Best,
    Kevin G. Boles
    SQL Server Consultant
    SQL MVP 2007-2012
    TheSQLGuru on googles mail service

  • Interesting. I've never seen that method of range seeking on GUIDs and I'm not quite sure I follow what you've posted. Can you post a short example? These GUIDs are not sequential so there could be large gaps between them or lots of them clustered together.

  • Yes, gaps and cluster-groups with lots of rows can happen, just like with identities. But a TOP operator will limit the rows and make sure you never do so many that you escalate to a table lock. And if you do a range that has few to no rows it will finish in a flash and you can advance to the next range. Something like this works, and gets index seeks without a convert. Note I am doing this from memory and am jet lagged. :hehe:

    where myGUIDField >= '00000000-0000-0000-0000-000000000000'

    and myGUIDField < '00000000-0000-0000-0000-000FFFFFFFFF'

    where myGUIDField >= '00000000-0000-0000-0000-000FFFFFFFFF'

    and myGUIDField < '00000000-0000-0000-0000-00FFFFFFFFFF'

    Best,
    Kevin G. Boles
    SQL Server Consultant
    SQL MVP 2007-2012
    TheSQLGuru on googles mail service

  • drew.allen (12/5/2016)


    I also notice that your deleting from tableA, but filtering TableB, so you're going to keep getting the same 10000 rows from TableB every time through the loop.

    Have you tried the following version

    DECLARE @TopFilter INT = 10000;

    DELETE TOP(@TopFilter) A

    FROM #TableA A

    JOIN #TableB

    ON B.ID = A.ID;

    -- You don't need to specify an order which would require a sort.

    Drew

    Just to close up this thread, this option was better than what I was doing. I replaced @TopFilter with a hard coded value since I didn't see a lot of point using a variable for @TopFilter and then having to put an OPTIMIZE FOR clause in there with a hard coded value for SQL to pick a decent plan. It was measurably faster than what I was doing before.

    I also did try the GUID range option but I couldn't quickly find any kind of increment that made sense. Even just changing one character at a time, 90% of all the data was in one range, so I'm just going to stick with the solution above.

Viewing 5 posts - 16 through 19 (of 19 total)

You must be logged in to reply to this topic. Login to reply