December 5, 2016 at 7:50 am
I recently encountered some unexpected behavior, or at least behavior not expected by me, using the TOP clause in a DELETE statement. I'm working on SQL 2012 Standard. I was writing a procedure that would delete data in smallish chunks from a large table that needed to be cleaned up while the system remained online, so I was doing it inside a WHILE loop using the TOP clause. It was generally working as expected: each chunk was deleted within about 10 seconds, which allowed other queries to run before the next chunk was deleted, so nothing was getting blocked. What wasn't working well was the performance. It was dreadfully slow, and at first I couldn't figure out why. After closer examination, it became evident that SQL was using a very poor execution plan. Below is a script that will reproduce the results. Run this first block to set up the temp tables:
SELECT TOP 1000000
    IDENTITY(INT, 1, 1) AS ID
INTO #TableA
FROM master.dbo.SysColumns sc1,
     master.dbo.SysColumns sc2;

ALTER TABLE #TableA
    ADD CONSTRAINT pkTableA
    PRIMARY KEY CLUSTERED (ID)
    WITH FILLFACTOR = 100;

SELECT TOP 1000000
    IDENTITY(INT, 1, 2) AS ID
INTO #TableB
FROM master.dbo.SysColumns sc1,
     master.dbo.SysColumns sc2;

ALTER TABLE #TableB
    ADD CONSTRAINT pkTableB
    PRIMARY KEY CLUSTERED (ID)
    WITH FILLFACTOR = 100;
Now paste the following queries into the same window and look at the difference in the execution plans.
DECLARE @TopFilter INT = 10000;
DELETE A
FROM #TableA A
JOIN (
SELECT TOP(@TopFilter) ID
FROM #TableB
ORDER BY ID
) AS B
ON B.ID = A.ID;
DELETE A
FROM #TableA A
JOIN (
SELECT TOP(@TopFilter) ID
FROM #TableB
ORDER BY ID
) AS B
ON B.ID = A.ID
OPTION( OPTIMIZE FOR ( @TopFilter = 100000 ) );
Without the OPTIMIZE FOR hint (i.e., in the first query), the optimizer assumes that @TopFilter is 100, even though the variable was declared with an initial value, and picks a nested loops join instead of a merge join. When this code ran against large tables in a WHILE loop that subsequently deletes the rows from the second table that were just deleted from the first, the difference in execution plans turned out to be about an extra 4.5 hours of run time. Am I missing something, or is the optimizer not behaving as it should here? Having to use an OPTION (OPTIMIZE FOR ...) clause seems like it should be unnecessary, but maybe I'm missing something.
December 5, 2016 at 8:16 am
I don't think that this behavior is because of the TOP clause. I think that this is just the way that SQL Server works with variables. When SQL Server creates a query plan, it ignores the statements that assign values to variables. When you later use those variables in other statements, SQL Server doesn't know what the values are, so it uses some rules of thumb to decide how many rows will be affected. Unfortunately, most of the time those rules have nothing to do with reality. Below is another example of this behavior (check the estimated number of rows for both queries):
declare @BeginDate datetime = '20170101'
select *
from sys.objects
where create_date > @BeginDate
go
select *
from sys.objects
where create_date > '20170101'
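If you need SQL Server to use the actual value of the variable, one option is to add OPTION (RECOMPILE) to the statement; the statement is then compiled at execution time, so the optimizer can see the runtime value. A minimal sketch against the same example:

declare @BeginDate datetime = '20170101'

select *
from sys.objects
where create_date > @BeginDate
option (recompile) -- compiled at run time, so the estimate is based on the actual value of @BeginDate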
Adi
--------------------------------------------------------------
To know how to ask questions and increase the chances of getting answers:
http://www.sqlservercentral.com/articles/Best+Practices/61537/
For better answers on performance questions, click on the following...
http://www.sqlservercentral.com/articles/SQLServerCentral/66909/
December 5, 2016 at 9:31 am
I would just hard-code the number of rows I want to delete per batch. Or use OPTION (RECOMPILE) if you do this with a variable.
Best,
Kevin G. Boles
SQL Server Consultant
SQL MVP 2007-2012
TheSQLGuru on googles mail service
December 5, 2016 at 10:04 am
I understand that I could replace the variable, but the reason it was there in the first place is that there were multiple tables of roughly the same size involved in the procedure, so I used @TopFilter to control the delete batch size for all of them with a single value. I'm also not sure how OPTION (RECOMPILE) would work, for two reasons: it doesn't seem that SQL will read the value of the variable even if I recompile, and since the DELETE statement is inside a WHILE loop, wouldn't it get expensive to recompile on every iteration of the loop?
December 5, 2016 at 10:25 am
I also notice that you're deleting from TableA but filtering on TableB, so you're going to keep getting the same 10,000 rows from TableB every time through the loop.
Have you tried the following version?
DECLARE @TopFilter INT = 10000;
DELETE TOP(@TopFilter) A
FROM #TableA A
JOIN #TableB B
ON B.ID = A.ID;
-- You don't need to specify an order, which would require a sort.
Drew
J. Drew Allen
Business Intelligence Analyst
Philadelphia, PA
December 5, 2016 at 10:50 am
Recurs1on (12/5/2016)
I understand that I could replace the variable, but the reason it was there in the first place is that there were multiple tables of roughly the same size involved in the procedure, so I used @TopFilter to control the delete batch size for all of them with a single value. I'm also not sure how OPTION (RECOMPILE) would work, for two reasons: it doesn't seem that SQL will read the value of the variable even if I recompile, and since the DELETE statement is inside a WHILE loop, wouldn't it get expensive to recompile on every iteration of the loop?
1) removed
2) Speaking of blocking, this type of operation REALLY needs to be done against an index that gives you a SEEK plan ALWAYS and NEVER escalates to a table lock (obviously unless you have a system with no concurrent activity).
3) Another issue with TOP @var is you don't know when you need to stop and back up the transaction log. That is another thing to watch out for, obviously - filling it up and/or causing it to grow uncontrolled.
4) I have done this type of operation at least dozens of times at clients over the years. I don't think I have ever used a driver table like you are with a join. I have used some form of key (like the identity here, a date field, etc.) to seek into my range and a simple variable to drive the loop. Then I have a very simple WHERE clause such as:
WHERE ID >= @i AND ID < @i + 50000 AND (other requisite filters)
Just iterate @i as appropriate.
I bake a WAITFOR DELAY ... into each loop to give the server headroom. I also check tlog size if necessary and trigger a tlog backup when appropriate.
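A rough sketch of that loop shape (the table name, key column, chunk size, and delay below are just placeholders, not the OP's actual objects):

DECLARE @i INT = 1,
        @maxID INT,
        @chunk INT = 50000;

SELECT @maxID = MAX(ID) FROM dbo.BigTable;  -- upper bound for the key range

WHILE @i <= @maxID
BEGIN
    DELETE FROM dbo.BigTable
    WHERE ID >= @i
      AND ID < @i + @chunk;     -- seek on the clustered key; add other requisite filters here

    SET @i = @i + @chunk;       -- iterate @i to the next range

    WAITFOR DELAY '00:00:05';   -- give the server some headroom
    -- check tlog usage here and trigger a log backup if needed
END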
Best,
Kevin G. Boles
SQL Server Consultant
SQL MVP 2007-2012
TheSQLGuru on googles mail service
December 5, 2016 at 11:20 am
drew.allen (12/5/2016)
I also notice that you're deleting from TableA but filtering on TableB, so you're going to keep getting the same 10,000 rows from TableB every time through the loop. Have you tried the following version?
DECLARE @TopFilter INT = 10000;
DELETE TOP(@TopFilter) A
FROM #TableA A
JOIN #TableB B
ON B.ID = A.ID;
-- You don't need to specify an order, which would require a sort.
Drew
I need to delete from Table A all rows that exist in Table B. There is another statement inside the WHILE loop that deletes from Table B the rows that were just deleted from Table A, which I did not include because it doesn't change the behavior of the query.
December 5, 2016 at 11:29 am
2) Speaking of blocking, this type of operation REALLY needs to be done against an index that gives you a SEEK plan ALWAYS and NEVER escalates to a table lock (obviously unless you have a system with no concurrent activity).
Both tables do contain appropriate indexes so that when the OPTIMIZE FOR clause is used SQL uses a MERGE JOIN.
3) Another issue with TOP @var is you don't know when you need to stop and back up the transaction log. That is another thing to watch out for, obviously - filling it up and/or causing it to grow uncontrolled.
Yes, this is always a concern, but we have automatic log backups on a schedule and are making sure that there will be sufficient space available for them to hold all the data that will go into them.
4) I have done this type of operation at least dozens of times at clients over the years. I don't think I have ever used a driver table like you are with a join. I have used some form of key (like the identity here, a date field, etc.) to seek into my range and a simple variable to drive the loop. Then I have a very simple WHERE clause such as:
I do understand that driving a delete with a control table like this is somewhat odd, but the reason is that the criteria for determining which data should be deleted are both expensive to evaluate and not part of the PK, nor in fact even in the tables that contain the data I'm deleting. However, even though they are expensive to query, the criteria are consistent across all the tables that need data removed, so I'm gathering them once, indexing them in the driver tables, and then using those to control the actual deletes. If you have a better way of doing this, I would love to simplify it.
December 5, 2016 at 11:40 am
drew.allen (12/5/2016)
I also notice that you're deleting from TableA but filtering on TableB, so you're going to keep getting the same 10,000 rows from TableB every time through the loop. Have you tried the following version?
DECLARE @TopFilter INT = 10000;
DELETE TOP(@TopFilter) A
FROM #TableA A
JOIN #TableB B
ON B.ID = A.ID;
-- You don't need to specify an order, which would require a sort.
Drew
Sorry, there's another reason that I neglected to put in the previous answer. Both tables can be very large, and while all the columns from B's PK will always be in A, A may have other columns and/or they may not be in the same order. That is not usually the case, but it can happen. Also, if I do it the way you suggest, I will always be joining to a full-sized Table B. I was attempting to make the join as small as possible by limiting B and then deleting the rows that had already been taken care of. However, you bring up an interesting option and I'll have to test it to see if the performance is acceptable. It may very well work fine that way.
December 5, 2016 at 11:43 am
Recurs1on (12/5/2016)
drew.allen (12/5/2016)
I also notice that you're deleting from TableA but filtering on TableB, so you're going to keep getting the same 10,000 rows from TableB every time through the loop. Have you tried the following version?
DECLARE @TopFilter INT = 10000;
DELETE TOP(@TopFilter) A
FROM #TableA A
JOIN #TableB B
ON B.ID = A.ID;
-- You don't need to specify an order, which would require a sort.
Drew
I need to delete from Table A all rows that exist in Table B. There is another statement inside the WHILE loop that deletes from Table B the rows that were just deleted from Table A, which I did not include because it doesn't change the behavior of the query.
Still, sorts are expensive, so getting rid of the sort should greatly improve the performance of the query. Also, if the only reason for deleting the rows from Table B is to drive the loop, this approach would make that unnecessary, further improving performance.
Drew
J. Drew Allen
Business Intelligence Analyst
Philadelphia, PA
December 5, 2016 at 12:23 pm
It sounds like you have a pretty good handle on a complex situation. Kudos for that honestly.
As Drew mentioned, SORTS are painful (despite apparently getting you a MERGE JOIN here), so they could be suboptimal. I would definitely try without that.
I would also try not deleting from the driver table. That's a LOT of work. Use my key-value driver trick to range through the B table. Seek there and on the main A table and you never miss or double-hit records.
Best,
Kevin G. Boles
SQL Server Consultant
SQL MVP 2007-2012
TheSQLGuru on googles mail service
December 6, 2016 at 7:26 am
Thanks for the suggestions guys. I'll try them out and see if either of them makes a difference.
December 6, 2016 at 12:53 pm
I would like you to think about the proprietary syntax and what it really means. Essentially, although you probably do not know it, you are doing a nineteen-fifties tape file operation in SQL. In the ANSI/ISO standard model for how this stuff works, we go to the FROM clause and construct a temporary working table that exists only for the duration of the statement. Officially, this means that all your deletions would have been done on this temporary working table! Which would be useless. But the old Sybase model from 3+ decades ago tries to map back to the base table in the DELETE clause. Starting to see all kinds of implementation problems with this?
Unfortunately you have a TOP () operator. This proprietary and totally nonrelational operation requires a sort, hence the order by clause. Sorts are a killer in a relational system. You cannot do parallelization, map and reduce, or other modern optimizations. But remember the original Sybase version of this product was built on UNIX filesystems.
I also see you used A and B as table aliases. This is because tape drives and later floppy disk drives were named in alphabetical order. You are basically mimicking old tape filesystems.
The one that scares me is your magic generic "id"; this implies the values are issued sequentially in some meaningful way. Unfortunately, in the SQL Server world, they are too often IDENTITY columns. This is totally nonrelational and simply mimics the records on a tape file.
DELETE A
FROM #TableA AS A
INNER JOIN
(SELECT TOP(@non_rdms_filter_nbr) generic_id
FROM #TableB
ORDER BY generic_id) AS B
ON B.generic_id = A.generic_id;
In ANSI/ISO standard SQL, we would write something like this:
DELETE FROM Alpha
WHERE EXISTS
(SELECT *
FROM Beta
WHERE Beta.generic_id = Alpha.generic_id
AND Beta.generic_id = ???);
This is only a skeleton, created without any DDL or specs, but you get the idea. One of my heuristics over the last three decades has been, "when your queries start to get slow as you scale up, then the DDL was probably wrong."
Books in Celko Series for Morgan-Kaufmann Publishing
Analytics and OLAP in SQL
Data and Databases: Concepts in Practice
Data, Measurements and Standards in SQL
SQL for Smarties
SQL Programming Style
SQL Puzzles and Answers
Thinking in Sets
Trees and Hierarchies in SQL
December 6, 2016 at 7:33 pm
CELKO:
Unfortunately you have a TOP () operator. This proprietary and totally nonrelational operation requires a sort, hence the order by clause. Sorts are a killer in a relational system. You cannot do parallelization, map and reduce, or other modern optimizations.
1) Please provide links to proof of any of those statements for SQL Server.
2) STOP SPOUTING YOUR 30-YEAR-OLD KNOWLEDGE ABOUT NON-SQL-SERVER-STUFF ON THIS FORUM!!!!
Best,
Kevin G. Boles
SQL Server Consultant
SQL MVP 2007-2012
TheSQLGuru on googles mail service
December 7, 2016 at 8:39 am
CELKO (12/6/2016)
Unfortunately you have a TOP () operator. This proprietary and totally nonrelational operation requires a sort, hence the order by clause.
Actually, that's not true. The ORDER BY clause is only required if you want the results to be deterministic. In situations such as this, where you are only using TOP() to create batches to break the process into smaller pieces, you don't care about determinism, because you're processing all of the records, and it doesn't matter which order you are processing them in.
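So for batching like this you can drop the ORDER BY entirely and simply loop until nothing is left to delete. A rough sketch using the tables from the original repro:

DECLARE @TopFilter INT = 10000;

WHILE 1 = 1
BEGIN
    DELETE TOP (@TopFilter) A
    FROM #TableA A
    JOIN #TableB B
        ON B.ID = A.ID;        -- no ORDER BY, so no sort is required

    IF @@ROWCOUNT = 0 BREAK;   -- stop once no matching rows remain
END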
Drew
J. Drew Allen
Business Intelligence Analyst
Philadelphia, PA