April 4, 2012 at 2:46 am
Hello,
As I am relatively new to SQL and still puzzling over some of the mysteries it offers up at times, I wonder if someone could explain in simple terms why I get the performance differential with a DELETE using different indexing strategies on both tables concerned.
Now I know that indexing tables can have an enormous effect on the performance of all sorts of SQL operations, but it's the following scenario that confuses me somewhat.
here's what I'm doing:
I have two tables, one called TempSales2 and the other called TempSales. Both tables have a column of type INT called RecID and I'm using this to DELETE rows from TempSales2 where the same RecID value does NOT exist in TempSales.
To achieve this, the following is the statement I am using:
DELETE FROM TempSales2
WHERE NOT EXISTS (SELECT 1 FROM TempSales
WHERE RecID = dbo.TempSales2.RecID);Now, originally I had only one index for either table and that was a UNIQUE index on the RecID column of the TempSales table. My reason for this was because I believed that the sub-query would greatly benefit if the RecID column value being passed from the DELETE statement from the TempSales2 table could be located much faster if the table being searched (I.e. TempSales) were indexed on its RecID column. My understanding is therefore that the DELETE statement is reading through an unordered table and executing the sub-query each time it reads a row and then the ReciD value is then compared within the sub-query. So, it's the sub-query that has to be fast.
In addition, I figured that as the table being deleted from (I.e. TempSales2) was not searched or ordered AND that having an index would slow deletion, that this table should remain unindexed.
The tables being searched/deleted had approximately 36 million rows each. The scenario above took some 6 hours and 2 minutes to run. I had deliberately engineered the data so that all rows in the TempSales2 table should be deleted - and they were.
A former colleague who was our resident SQL expert at the time suggested that I should also index the RecID column on the table being deleted from as this would speed up things considerably. I therefore reset all data and tried again this time adding a UNIQUE index to the suggested column.
The result? Well, this time the task completed in just over 10 minutes! A massive improvement in performance, but one I cannot understand given my novice status.
If anyone can help me with this - please in simple terms - I would much appreciate it.
People often tell me that it's not necessary to understand why, only to "do". But I'm somewhat anal in this regard and like to understand why things work the way they do.
Also, if there's a better way to perform this DELETE, then by all means let me know. My code is only the way I know how to achieve what I need, so please feel free to show me something better.
Regards
Steve
April 4, 2012 at 5:44 am
Without setting up the tests, I can't tell you why. But... I can tell you where to look.
Check out both execution plans. In fact, you don't have to rerun the full set of tests. Just reset the structures and update the statistics and get estimated execution plans. They will tell you how the indexes are being used for each structure and that will tell you why it's working the way it is.
Suspicion: There's a JOIN type being used that results in mulitple executions of a scan in the long running version of the query, but I can't be sure without seeing the execution plan.
"The credit belongs to the man who is actually in the arena, whose face is marred by dust and sweat and blood"
- Theodore Roosevelt
Author of:
SQL Server Execution Plans
SQL Server Query Performance Tuning
April 5, 2012 at 2:47 am
I will try to explain, even having no time to set up a test.
A delete can use indexes to find the deletion key, especially when it is a small number of records to delete.
Your query has 2 aspects:
1. A select part to determine which records to delete
2. The deletion part.
The select part is hugely improved by the unique index on TempSales2, as the optimizer now doesn't even need to access the table. All the information is now available in 2 unique indexes.
The deletion part is also improved by the unique index on TempSales2. (How much depends on the number of deletions).
Your idea that an index on a table might slow down a delete (or any modification to the table) is partly true, but in general that is more true for indexes on other than the deletion key columns. It is difficult to give a general guideline. Look at the following considerations:
Other indexes must be maintained during the operation which slows down the query and, even worse, makes the indexes grow in size.
The alternative is to drop or disable the indexes before the operation and to recreate them afterwards. This is also slowing down the process and there are more things to consider. For small numbers of modifications I usually choose to leave all the indexes on and will usually create an extra index on the update or delete key. But that is also possible because I will drop the indexes once a month anyway when the regular monthy ETL-processes are run.
One more remark:
I always try to avoid "where not exists"-query's as they are difficult to troubleshoot. Instead I prefer to use the CTE-format:
WITH deletes (RecID)
AS
(SELECT t2.RecID
FROM TempSales2 t2
LEFT OUTER JOIN TempSales t
ON T2.RecID=T.RecID
WHERE t.RecID IS NULL)
DELETE FROM deletes;
This way I can test the inner query and directly use it in the DELETE statement.
Hope this helps.
Rudolf van der Heide
Data warehouse architect
Currently working in a MSSQLServer environment
15 years experience in design and development of data warehouses and ETL,
and in ETL and data warehouse performance tuning, in both MSSQLServer and Oracle.
April 5, 2012 at 3:43 am
Thank you for the excellent advice.
I am especially pleased with your CTE-based technique as an alternative to my DELETE statement as originally posted. This is precisely the sort of elegant suggestion that is really helpful for me as it adds useful knowledge.
April 5, 2012 at 5:08 am
Thanks for the tip Rudolf, never knew that the delete from the CTE would affect the underlying table but the code as posted as a note earlier would not work due to a 4405 error. The CTE does not know which table the deletes are to be made in even though only the TempSales table is refferred to in the select list. The error suggests that the CTE is being used as a function or view which can only be edited if only one underlying table is involved.
with deletes(ID) as
(
select TempSales.ID
from TempSales
left outer join TempSalesToKeep
on TempSales.ID = TempSalesToKeep.ID
where TempSalesToKeep.ID IS NULL
)
delete from deletes
/*
Msg 4405, Level 16, State 1, Line 1
View or function 'deletes' is not updatable because the modification affects multiple base tables.
*/
go
with deletes(ID) as
(
select TempSales.ID
from TempSales
where TempSales.ID = 3
)
delete from deletes
/*
Row with ID 3 is deleted
*/
Fitz
April 5, 2012 at 5:30 am
Thanks for your reply Fitz. That's what I got for posting without testing. That means that CTE for deletes have not much value. I can rewrite it to make it work, but this way it is not very useful.
with deletes(ID) as
(
select Temp_Sales.ID
from Temp_Sales
WHERE temp_sales.id
IN (select Temp_Sales.ID
from Temp_Sales
left outer join Temp_Sales2
on Temp_Sales.ID = Temp_Sales2.ID
where Temp_Sales2.ID IS NULL)
)
delete from deletes
I can use a simple delete instead:
delete
from TempSales
WHERE TempSales.id IN
( select TempSales.ID
from TempSales
left outer join TempSales2
on TempSales.ID = TempSales2.ID
where TempSales2.ID IS NULL
)
I still prefer this query above the original one, because it is easier to test and understand than with the WERE NOT EXISTS construction.
The CTE works fine for updates as long as you update the columns from only one base table.
Rudolf van der Heide
Data warehouse architect
Currently working in a MSSQLServer environment
15 years experience in design and development of data warehouses and ETL,
and in ETL and data warehouse performance tuning, in both MSSQLServer and Oracle.
April 5, 2012 at 4:30 pm
Rudolf,
Gail Shaw wrote and excellent series of articles covering IN vs. Exists, NOT EXISTS vs. LEFT JOIN.
You should take a look:
http://sqlinthewild.co.za/index.php/2010/03/23/left-outer-join-vs-not-exists/
Todd Fifield
April 5, 2012 at 6:20 pm
A super simple answer as to "why" the index on the deleted table made such a difference is that because a correlated subquery with an equality predicate will be translated into something that more resembles an inner join than most anything else.
--Jeff Moden
Change is inevitable... Change for the better is not.
Viewing 8 posts - 1 through 7 (of 7 total)
You must be logged in to reply to this topic. Login to reply