December 10, 2010 at 10:25 am
Hi,
I'm creating a stored procedure that accepts a comma-delimited string of record IDs (e.g. '1,2,3') as a parameter.
I'm considering two approaches: 1) using a WHERE...IN clause and 2) using a JOIN. I've illustrated both approaches below. @ProductTypeIDs is the input parameter (of type nvarchar) containing the IDs (i.e. '1,2,3'), and fn_IntListToTbl is a function that unpacks the string into a table.
Example 1. WHERE...IN
SELECT
ProductID,
ProductName,
ProductTypeID,
ProductDescription
FROM
Products
WHERE
ProductTypeID IN (SELECT number FROM fn_IntListToTbl(@ProductTypeIDs))
Example 2. JOIN
SELECT
ProductID,
ProductName,
ProductTypeID,
ProductDescription
FROM
Products INNER JOIN fn_IntListToTbl(@ProductTypeIDs) i ON ProductTypeID = i.number
I'm seeking advice on which of the two approaches is likely to be best (in terms of performance and being least error-prone). I know that may be a bit subjective, but I'm wondering if there are hard-and-fast rules or gotchas I should be aware of (e.g. is the WHERE...IN likely to degrade as the size of the string grows?). I did some tests on which performed best in terms of execution time (caches etc. were cleared before each iteration). In my tests the WHERE...IN was faster than the JOIN by an average of 6%.
In the final version of my stored procedure I'm going to have to return data from more than one table. I did a couple more tests using the faster of the approaches above (i.e. WHERE...IN), which I've illustrated below. This time the Products table is joined to the ProductTypes table using the key ProductTypeID. In Example 3 the WHERE...IN clause is executed against the larger Products table (i.e. the table containing the ProductTypeID FK), whereas in Example 4 it's executed against the smaller ProductTypes table (i.e. the table containing the ProductTypeID PK).
Example 3.
SELECT
Products.ProductID,
Products.ProductName,
Products.ProductTypeID,
Products.ProductDescription
FROM
ProductTypes INNER JOIN
Products ON ProductTypes.ProductTypeID = Products.ProductTypeID
WHERE
Products.ProductTypeID IN (SELECT number FROM fn_IDS_IntListToTbl(@ProductTypeIDs))
Example 4.
SELECT
Products.ProductID,
Products.ProductName,
Products.ProductTypeID,
Products.ProductDescription
FROM
ProductTypes INNER JOIN
Products ON ProductTypes.ProductTypeID = Products.ProductTypeID
WHERE
ProductTypes.ProductTypeID IN (SELECT number FROM fn_IDS_IntListToTbl(@ProductTypeIDs))
In my tests Example 3 was faster by an average of 50% over Example 4 - a big difference. Is there an obvious explanation why this is the case?
Finally I compared the situation using JOINs. In Example 5 the JOIN is to the FK field in the larger Products table whereas in Example 6 it's to the PK field in the smaller ProductTypes table.
Example 5.
SELECT
Products.ProductID,
Products.ProductName,
Products.ProductTypeID,
Products.ProductDescription
FROM
ProductTypes INNER JOIN
Products ON ProductTypes.ProductTypeID = Products.ProductTypeID INNER JOIN
fn_IDS_IntListToTbl(@ProductTypeIDs) i ON Products.ProductTypeID = i.number
Example 6.
SELECT
Products.ProductID,
Products.ProductName,
Products.ProductTypeID,
Products.ProductDescription
FROM
ProductTypes INNER JOIN
Products ON ProductTypes.ProductTypeID = Products.ProductTypeID INNER JOIN
fn_IDS_IntListToTbl(@ProductTypeIDs) i ON ProductTypes.ProductTypeID = i.number
In these tests Example 6 was faster than Example 5 by an average of 12%. However, both methods were considerably slower than Example 3 (by more than 38%). The final standings in terms of speed between the four (equivalent?) methods were: Example 3 (fastest), Example 6, Example 5, Example 4 (slowest).
I'm a bit confused. I started off thinking that the WHERE...IN method was quickest, with Example 3 being the fastest, but then Example 4 was also the slowest. Example 6 suggests it's better to join against the smaller PK table (compared to the larger FK one in Example 5), but in my mind that conflicts with the previous result, where the WHERE...IN clause executed against the smaller table (Example 4) was the slowest.
Just wondering if anybody could provide any insight/tips?
Thanks
December 10, 2010 at 1:01 pm
Just on the IN vs the JOIN:
http://sqlinthewild.co.za/index.php/2010/04/27/in-exists-and-join-a-roundup/
As for the rest, look at the exec plan and see what's happening.
Gail Shaw
Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability
December 10, 2010 at 1:49 pm
I tend to avoid comma-delimited lists for parameters these days, but when I do use them, I often find it useful to insert them into a temp table as the PK, then join to that.
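Something like this, roughly (a sketch only; the table and column names here are assumed from the rest of the thread, with fn_IntListToTbl returning a single INT column called number):

```sql
-- Sketch: materialize the split values into a temp table keyed on the
-- value, then join to it. A temp table (unlike a table variable) can
-- carry statistics, which helps the optimizer.
CREATE TABLE #Ids (number INT NOT NULL PRIMARY KEY);

INSERT INTO #Ids (number)
SELECT number
FROM dbo.fn_IntListToTbl(@ProductTypeIDs);

SELECT p.ProductID, p.ProductName, p.ProductTypeID, p.ProductDescription
FROM Products p
INNER JOIN #Ids i
        ON p.ProductTypeID = i.number;
```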
- Gus "GSquared", RSVP, OODA, MAP, NMVP, FAQ, SAT, SQL, DNA, RNA, UOI, IOU, AM, PM, AD, BC, BCE, USA, UN, CF, ROFL, LOL, ETC
Property of The Thread
"Nobody knows the age of the human race, but everyone agrees it's old enough to know better." - Anon
December 14, 2010 at 4:59 am
Hi Gail,
Thanks for the reply (much appreciated and sorry for the slight delay in getting back to you - I've only just now had a chance to return to this).
Thanks for the pointer to the article - it looks like a more comprehensive examination than mine so I'll certainly look at it in more detail.
Regarding gleaning stuff from the execution plan I'll have to check out what I should be looking for.
Thanks again
December 14, 2010 at 5:05 am
Hi GSquared,
Again, thanks for the reply.
I know comma-delimited lists for parameters aren't ideal, but I don't know of an alternative (in 2005). The function I used in my example does as you suggest (i.e. it inserts each value into a table with a single column called number, against which I can perform joins).
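For illustration, a function of that shape might look something like this (this is just a sketch of the general pattern, not my actual fn_IntListToTbl):

```sql
-- Illustrative sketch only: a multi-statement TVF that walks the string
-- with CHARINDEX and inserts each value into a "number" column.
CREATE FUNCTION dbo.fn_IntListToTbl (@List NVARCHAR(MAX))
RETURNS @Result TABLE (number INT NOT NULL)
AS
BEGIN
    DECLARE @Pos INT, @Next INT;
    SET @List = @List + ',';                 -- add a trailing delimiter
    SET @Pos  = 1;
    SET @Next = CHARINDEX(',', @List, @Pos);
    WHILE @Next > 0
    BEGIN
        INSERT INTO @Result (number)
        VALUES (CAST(SUBSTRING(@List, @Pos, @Next - @Pos) AS INT));
        SET @Pos  = @Next + 1;
        SET @Next = CHARINDEX(',', @List, @Pos);
    END;
    RETURN;
END;
```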
Thanks
December 15, 2010 at 7:59 am
Makes sense.
- Gus "GSquared", RSVP, OODA, MAP, NMVP, FAQ, SAT, SQL, DNA, RNA, UOI, IOU, AM, PM, AD, BC, BCE, USA, UN, CF, ROFL, LOL, ETC
December 15, 2010 at 3:41 pm
landingnormand (12/14/2010)
Hi GSquared. Again, thanks for the reply.
I know comma-delimited lists for parameters aren't ideal, but I don't know of an alternative (in 2005). The function I used in my example does as you suggest (i.e. it inserts each value into a table with a single column called number, against which I can perform joins).
Thanks
Just for clarification: is the function inserting into a table declared inside the function? If so, that is different from populating a table (variable or temp) from the results of the function, and thus will probably not perform very well.
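The distinction I mean, sketched (table and function names assumed from this thread):

```sql
-- (a) Querying the multi-statement TVF directly: the rows live in a
--     table variable declared inside the function, which the optimizer
--     has no statistics for.
SELECT p.ProductID
FROM Products p
INNER JOIN dbo.fn_IntListToTbl(@ProductTypeIDs) i
        ON p.ProductTypeID = i.number;

-- (b) Populating a temp table from the function's results first:
--     the temp table can have statistics and indexes.
SELECT number
INTO #Ids
FROM dbo.fn_IntListToTbl(@ProductTypeIDs);

SELECT p.ProductID
FROM Products p
INNER JOIN #Ids i
        ON p.ProductTypeID = i.number;
```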
December 15, 2010 at 10:10 pm
Lamprey13 (12/15/2010)
landingnormand (12/14/2010)
Hi GSquared. Again, thanks for the reply.
I know comma-delimited lists for parameters aren't ideal, but I don't know of an alternative (in 2005). The function I used in my example does as you suggest (i.e. it inserts each value into a table with a single column called number, against which I can perform joins).
Thanks
Just for clarification: is the function inserting into a table declared inside the function? If so, that is different from populating a table (variable or temp) from the results of the function, and thus will probably not perform very well.
I'm pretty sure you will find the performance issue is caused by the function in the WHERE clause. The function is evaluated for EVERY record produced by the JOIN, not by the SELECT.
In real terms, any query is internally processed in this order:
1. The FROM clause
2. The WHERE clause - and this includes your funky function 🙂
3. GROUP BY
4. HAVING
5. Finally we get to the SELECT clause - heaven forbid you should have the same function in here, because it will be evaluated all over again.
6. And lastly, the ORDER BY clause.
Source
Inside Microsoft T-SQL Querying by Itzik Ben-Gan
Chapter 1 will open your eyes to the 'real' way queries are processed. In SQL 2005 they introduced two excellent constructs that reduce the need for, and the performance overhead of, embedded functions: the APPLY keyword and common table expressions (CTEs). I would suggest you read up on these and then review the code you posted.
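For example, a sketch of both constructs against the tables from this thread (names assumed from the earlier posts; this is illustrative, not tested against your schema):

```sql
-- CTE sketch: split the list once up front, then join to it, rather
-- than calling the function in the WHERE clause.
WITH Ids AS
(
    SELECT number
    FROM dbo.fn_IntListToTbl(@ProductTypeIDs)
)
SELECT p.ProductID, p.ProductName, p.ProductTypeID, p.ProductDescription
FROM Products p
INNER JOIN Ids i
        ON p.ProductTypeID = i.number;

-- CROSS APPLY sketch: APPLY invokes a table-valued expression for each
-- outer row, which is how 2005 lets you use TVFs in the FROM clause
-- with correlated arguments.
SELECT p.ProductID, s.number
FROM Products p
CROSS APPLY dbo.fn_IntListToTbl(@ProductTypeIDs) s
WHERE p.ProductTypeID = s.number;
```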
Come into the light and enjoy the joys of set-based SQL...
CodeOn
😛
PEBCAK!
December 19, 2010 at 4:34 pm
I know this is an older post but could you post the fn_IntListToTbl, please? Thanks.
--Jeff Moden
Change is inevitable... Change for the better is not.
December 20, 2010 at 8:43 am
I read through this and the first thing I thought of was using fn_Split to break up the list; I wonder if his fn_IntListToTbl is some derivative of that.
I also appreciated seeing the processing order of a query. I was working on a project a few months ago where one of the JOINs in the view was tied to a constant. I had concluded that it only acted as a WHERE clause; now I understand the thinking behind it.
December 20, 2010 at 9:04 am
Mad Myche (12/20/2010)
I read through this and the first thing I thought of was using fn_Split to break up the list; I wonder if his fn_IntListToTbl is some derivative of that. I also appreciated seeing the processing order of a query. I was working on a project a few months ago where one of the JOINs in the view was tied to a constant. I had concluded that it only acted as a WHERE clause; now I understand the thinking behind it.
Just to be sure, fn_Split isn't a function provided by SQL Server. If it has either a recursive CTE or a WHILE loop in it, you should post it so we can show how to get around the performance problems associated with such things in a split function.
--Jeff Moden
December 20, 2010 at 9:23 am
I believe the function we are using is the one described in the MSDN article "Sql 2000: Building the Fn_Split() Function", but due to the rules here I can't paste it. I have seen the code listed identically on other DB sites; I just don't know what the linking rules are here. It is WHILE-loop based, obviously, as we are still a 2000 shop.
I revisited the JOIN vs WHERE question and I did see a performance increase using the JOIN to limit the records in the second query here:
SELECT c.Cat, p.CartDesc
FROM Products P
INNER JOIN Products2Categories x ON p.ProductId = x.ptcProdId
INNER JOIN ProductCats c ON x.ptcCatId = c.CatId
INNER JOIN ProductCats r ON c.ParentId = r.CatId
WHERE r.CatId = 4
ORDER BY c.Cat, p.CartDesc
SELECT c.Cat, p.CartDesc
FROM Products P
INNER JOIN Products2Categories x ON p.ProductId = x.ptcProdId
INNER JOIN ProductCats c ON x.ptcCatId = c.CatId AND 4 = c.ParentID
ORDER BY c.Cat, p.CartDesc
December 20, 2010 at 9:12 pm
Mad Myche (12/20/2010)
I believe the function we are using is the one described in the MSDN article "Sql 2000: Building the Fn_Split() Function", but due to the rules here I can't paste it. I have seen the code listed identically on other DB sites; I just don't know what the linking rules are here. It is WHILE-loop based, obviously, as we are still a 2000 shop. I revisited the JOIN vs WHERE question and I did see a performance increase using the JOIN to limit the records in the second query here:
SELECT c.Cat, p.CartDesc
FROM Products P
INNER JOIN Products2Categories x ON p.ProductId = x.ptcProdId
INNER JOIN ProductCats c ON x.ptcCatId = c.CatId
INNER JOIN ProductCats r ON c.ParentId = r.CatId
WHERE r.CatId = 4
ORDER BY c.Cat, p.CartDesc
SELECT c.Cat, p.CartDesc
FROM Products P
INNER JOIN Products2Categories x ON p.ProductId = x.ptcProdId
INNER JOIN ProductCats c ON x.ptcCatId = c.CatId AND 4 = c.ParentID
ORDER BY c.Cat, p.CartDesc
Thanks... no need for a WHILE loop, though, even in SQL Server 2000. Take a look at the following article. When I first learned about this technique it opened up a whole new world for me, and I tried to "pass it forward" with that article...
http://www.sqlservercentral.com/articles/T-SQL/62867/
Several optimizations have been made to the splitter code in that article (which also means the article needs an update... someday). I'll be back with a function that uses the optimizations I speak of, but you really should read the article.
--Jeff Moden
December 20, 2010 at 10:27 pm
Ok... since I don't subscribe to SQL Server Magazine, I can't get to the actual code listings, but from the "hints" left in the link you provided I cobbled together a non-RBAR function that uses a Tally table instead of a loop.
If you haven't yet read the Tally table article I provided, you should. But to get you "out of the woods" for testing, here's how to build a permanent Tally table in SQL Server 2000. It's very quick...
--===================================================================
-- Create a Tally table from 1 to 11000
--===================================================================
--===== Create and populate the Tally table on the fly.
-- SELECT...INTO with the IDENTITY function creates the
-- N column as NOT NULL so we can put a Primary Key on it
SELECT TOP 11000
IDENTITY(INT,1,1) AS N
INTO dbo.Tally
FROM Master.dbo.SysColumns sc1
CROSS JOIN Master.dbo.SysColumns sc2
;
--===== Add a CLUSTERED Primary Key to maximize performance
ALTER TABLE dbo.Tally
ADD CONSTRAINT PK_Tally_N
PRIMARY KEY CLUSTERED (N) WITH FILLFACTOR = 100
;
--===== Allow the general public to use it
GRANT SELECT ON dbo.Tally TO PUBLIC
;
GO
I don't care for the "Hungarian notation", nor the column names, nor the zero-based IDX used in the article, but just to give you what I think is a direct replacement so you don't have to change any of your other code, I followed that example. Here's the code for an inline table-valued function, like in the article, that does a split of up to 7999 characters without a loop, plus a small test so you can verify that it works.
CREATE FUNCTION dbo.fn_Split
(
@sText VARCHAR(7999), --Missing one because we need to add 1 delimiter
@sDelim VARCHAR(1) --Just to keep the same datatype
)
RETURNS TABLE AS
RETURN
(
SELECT IDX = t.N - LEN(REPLACE(LEFT(@sText, t.N), @sDelim, '')),
Value = SUBSTRING(@sText, t.N, CHARINDEX(@sDelim, @sText + @sDelim, t.N) - t.N)
FROM dbo.Tally t
WHERE t.N BETWEEN 1 AND DATALENGTH(@sText) + 1 --DATALENGTH allows trailing spaces to be delimiters
AND SUBSTRING(@sDelim + @sText, t.N, 1) = @sDelim
)
GO
SELECT * FROM dbo.fn_Split('1,10,100,1000,10000,100000,1000000,A,B,C,D,E,Jeff Moden',',')
--Jeff Moden
December 21, 2010 at 9:16 am
Thank you for the great information. I created the Tally table (mine took 93,516 ms) and created that function. The performance gain was on the order of ~350%. I then trimmed the Tally table down to 8000 rows, and that gave me an extra 100% performance.
Running the two versions back to back placed 82% of the query cost on the loop method and the remaining 18% on the Tally method. This was consistent across parameters varying from 25 characters out to 7950.