March 13, 2008 at 8:44 am
Okay, this question came up at my workplace due to an ETL we're doing and in part because of the recent article about conditional WHERE Clauses & Functions. I wanted to see what y'all think.
Our ETL prepends an A or an X to an integer ID depending on what system the ID comes from. When I went to update a different field for only those records with an "A" on the ID, I thought it would be faster to do a WHERE Substring(ID,1,1) < 'X' in the Code instead of an = 'A'.
So, here's what I did (creating 2 temp tables so that you can follow along):
Create Table #MyTable1 (ID int identity(1,1) Not NULL, TestValue char(1));
Create Table #MyTable2 (ID int identity(1,1) Not NULL, TestValue char(1));
Declare @MyCnt1 int, @MyCnt2 int;
Set @MyCnt1 = 1000;
Set @MyCnt2 = 1000;
While @MyCnt1 > 0
Begin
Insert into #MyTable1 (TestValue)
(Select 'A');
Set @MyCnt1 = @MyCnt1 -1;
End
While @MyCnt2 > 0
Begin
Insert into #MyTable2 (TestValue)
(Select 'B');
Set @MyCnt2 = @MyCnt2 -1;
End
Select * from #MyTable1;
Select * from #MyTable2;
--All the above just sets up test data.
Select Top 1000 'A' + Convert(varchar(54),ID) as ID
into ##MyTemp
from #MyTable1
UNION ALL
Select Top 1000 'X' + Convert(varchar(54),ID) as ID
from #MyTable2
Order by ID;
--Creates a new Temp table with an ID that has a character on it
Select *
from ##MyTemp
where Substring(ID,1,1) < 'X'
--Tests the execution plan & time of this WHERE statement
Select *
from ##MyTemp
where Substring(ID,1,1) = 'A';
--Tests the execution plan & time of this WHERE statement
I tested this theory with 1000 rows from each database and it turns out I was wrong. The execution plans are exactly the same (both turn up a table scan), but the query with the < 'X' takes 2 miliseconds and the query with the = 'A' takes so little time, nothing registers in SSMS except 00:00:00.
Any thoughts on why this is?
Thanks,
March 13, 2008 at 8:53 am
Why the function? If that ETL ID is indexed - then:
...WHERE ID<'X'....
Should give you what you want, but would probably leverage an index (assuming it's selective enough or it's the clustered index).
Personally - I kind of think that the source identifier should be its own field. It really doesn't buy you anything (and costs you plenty) to concatenate the two.
----------------------------------------------------------------------------------
Your lack of planning does not constitute an emergency on my part...unless you're my manager...or a director and above...or a really loud-spoken end-user..All right - what was my emergency again?
March 13, 2008 at 9:04 am
This isn't my code. But I think I'll make that suggestion to the person whose code it is. I agree that the Source is causing problems.
An you make a good point about the index. I'll check into this and see if it actually is indexed or not.
March 13, 2008 at 9:23 am
Matt's correct, why the function? If that column does contain an index then it will not be used. Simply put, the function will use all rows in the column (a table scan) to find the ones that satisfy that condition. I'm thinking that a LIKE 'A%' could be slightly better as now the optimizer will have some value as a guide when performing the lookup. YOu may improve from table scan to index scan or even an index seek.
By the way, no only look at the query plan but also SET STATISTICS IO ON and include client statistics. This will also give you an indication of how many packets are sent/received and any physical reads.
DAB
March 13, 2008 at 9:27 am
You both have good points. But for the sake of argument, let's say the column isn't indexed and can't be for whatever reason. Why does an Equals seem to work better than a Less Than or Greater Than?
Or is this dependant on how the database is set up?
March 13, 2008 at 9:33 am
Equal is an exact match while greater than or less than is a ranged match. Without looking at the query plan my guess is that the EQUALS operation just performs a scan. Since the requires a range there is probably an additional sort operation.
DAB
March 13, 2008 at 9:54 am
I'm not sure differences that small are significant. I just ran it against something a bit bigger ( okay, a 2M row table instead of a 2K row table), and a few runs back and for and the numbers were just flip/flopping. Sometimes one was better, sometimes the other was.
The other two solutions were quite a bit faster though. The < was fastest, followed by LIKE, and then the two function calls.
It may have something to do with data distribution too.
----------------------------------------------------------------------------------
Your lack of planning does not constitute an emergency on my part...unless you're my manager...or a director and above...or a really loud-spoken end-user..All right - what was my emergency again?
March 13, 2008 at 10:54 am
If your numbers look like "A123456" and "X123457" and you want just the "A" rows AND you want it to scream...
WHERE LEFT(somecol,1) = 'A'
Substring takes a comparitively long time compared to Left. Another way might be (and, it may even be faster)...
WHERE SomeCol LIKE 'A%'
And, to answer your question, it seems that inequalities always take less than equalities... dunno why for sure... it's part of the engine...
--Jeff Moden
Change is inevitable... Change for the better is not.
March 13, 2008 at 10:56 am
p.s. A thousand rows isn't really enough to show the timing difference on something like this. Try it on a million rows more than once.
--Jeff Moden
Change is inevitable... Change for the better is not.
March 13, 2008 at 11:19 am
All right - you wanted it - you go it.... a 2 Million row table.(sorry - 1M was still a bit wimpy on this test).
Add the following expressions to this base query
Select @fun=ID --using assignment to get around the display slowdowns
from #MyTemp
where....
and you get:
===============================
Substring(ID,1,1)< X
===============================
SQL Server Execution Times:
CPU time = 1172 ms, elapsed time = 587 ms.
===============================
Substring(ID,1,1)=A
===============================
SQL Server Execution Times:
CPU time = 1156 ms, elapsed time = 578 ms.
===============================
Left(id,1)=A
===============================
SQL Server Execution Times:
CPU time = 1157 ms, elapsed time = 579 ms.
===============================
ID< X
===============================
SQL Server Execution Times:
CPU time = 781 ms, elapsed time = 401 ms.
===============================
ID like A%
==============================
SQL Server Execution Times:
CPU time = 937 ms, elapsed time = 470 ms.
----------------------------------------------------------------------------------
Your lack of planning does not constitute an emergency on my part...unless you're my manager...or a director and above...or a really loud-spoken end-user..All right - what was my emergency again?
March 13, 2008 at 11:20 am
Jeff,
My initial run was against 10 million records (again, not my code) and after about 16 + hours, I killed it. Hence me trying against 1000 rows to test for possible code changes. I didn't want to wait a whole 'nother day before I found code changes to suggest to the person I'm peer-reviewing.
I just threw in a nonclustered index (table didn't have any) and now it's starting to move a lot better. I'm going to try the Left() suggestion and see if that makes more of a difference.
Thanks, everyone. I appreciate the help.
March 13, 2008 at 7:53 pm
You're not trying to return the output to the screen are you? And, according to Matt's fine testing, ID<'X' is the fastest. I don't think a NCI will help the LEFT example I gave.
--Jeff Moden
Change is inevitable... Change for the better is not.
March 14, 2008 at 9:28 am
Another possible solution (if you can) is a computed column with the first character of the ID and an index on that. Or, if you can't modify the table, an indexed view on that.
- Gus "GSquared", RSVP, OODA, MAP, NMVP, FAQ, SAT, SQL, DNA, RNA, UOI, IOU, AM, PM, AD, BC, BCE, USA, UN, CF, ROFL, LOL, ETC
Property of The Thread
"Nobody knows the age of the human race, but everyone agrees it's old enough to know better." - Anon
March 14, 2008 at 11:37 am
Wow. I just found the main problem with the query.
I gave my co-worker a couple of CTEs which pull into temp tables. Unbeknownst to me, he decided to alter them by adding 23 columns and a join to a new table. Of course, he's been testing all this with a limited data set, so when I go to test it using all existing data, it runs forever.
DOH! Not that the other stuff wasn't important too, but now that I've realized the CTE has been thrown out of whack from what it was designed to do, I can at least go back and fix that first.
Then maybe I can go back to concentrating on this finiky WHERE clause.
Viewing 14 posts - 1 through 13 (of 13 total)
You must be logged in to reply to this topic. Login to reply