Using a Variable for an IN Predicate

Introduction

This is an old subject but, because of the number of posts that still appear every month about the subject here on SSC, another article seems appropriate as a reminder.

The posts where someone wants to use a variable for an IN type qualifier in a query are quite common. They want to use a comma delimited string for an IN. The string would typically be something passed in from a user interface. Something on the order of:

CREATE PROCEDURE SomeProc
( @StartDate   DATETIME
, @EndDate     DATETIME
, @String      VARCHAR(50)
)
AS
SELECT SomeColumns
FROM   SomeTable
WHERE  SomeDate BETWEEN @StartDate And @EndDate
AND    SomeColumn IN (@String)

Of course, this is something SQL Server can't do (yet). We've all wished for something like this (don't lie - I know we have ALL wished for it!

So how do we do this? We could use Dynamic SQL, which can be messy to maintain, can be a source of "SQL Injection" and really isn't necessary for this type of thing.

Wait a minute! An array is a SET of like data! SQL Server deals with sets! What do we call these SET things.... Hmm, let's see... Oh yeah - they're called tables. What we want to do is split up the array into a table of individual elements and use that for our qualifying condition. Let's see how we can do that.

References and Sources

There have been many good articles here on SSC on arrays and splitting up strings into tables. The most definitive, probably, is one by Jeff Moden:

Passing Parameters as (almost) 1, 2, and 3 Dimensional Arrays

Another excellent article by Alex Grinberg:

The ARRAY In SQL Server 2000

This one goes into how to do it with XML. XML has its uses, but really isn't necessary for this very simple operation. It also puts puts a load on network traffic and SQL Server itself.

Then the latest and greatest from Jeff Moden on splitting a string into a table. A very good read - dust bunnies and all:

Tally OH! An Improved SQL 8K "CSV Splitter" Function

I will be using the DelimitedSplit8K from this article in my examples below.

Test Data

For the purpose of this article I'm using a 2005 version of Adventureworks, since most people have it and can test the queries.

How a Hard-Coded "WHERE IN" Works

Let's say there is an application that passes 2 dates, a start and end date, and a list of sales people and wants the total sales and number of orders for each sales person in the date range. The following code will emulate a such a stored procedure:

DECLARE
 @StartDate DATETIME
, @EndDate DATETIME
, @SalesPeople VARCHAR(255)
SET @StartDate = '2001-07-01'
SET @EndDate = '2001-07-31'
SET @SalesPeople = '275,279'
SELECT
 SO.SalesPersonID, C.FirstName + ' ' + C.LastName AS SalesPerson
, COUNT(*) AS Orders
, SUM(SO.SubTotal) AS TotalAmount
FROM Sales.SalesOrderHeader SO
INNER JOIN HumanResources.Employee E ON
 SO.SalesPersonID = E.EmployeeID
INNER JOIN Person.Contact C ON
 E.ContactID = C.ContactID
WHERE SO.OrderDate BETWEEN @StartDate And @EndDate
AND SO.SalesPersonID IN (275, 279) -- The IN condition
GROUP BY
 SO.SalesPersonID, C.FirstName, C.LastName

It produces a result set like:

Here is the meat part of the query plan:

There is an index seek in SalesOrderHeader on SalesPersonID. Here are the details of that seek:

The IN part of the WHERE predicate generates an OR condition in the query plan. You would see exactly the same plan if the WHERE clause were:

WHERE SO.OrderDate BETWEEN @StartDate And @EndDate
AND (SO.SalesPersonID = 275
  OR SO.SalesPersonID = 279)

The only reason I bring this up is that once the parameter array has been split up into a table, the query plan will change. It will no longer generate an OR type condition.

Splitting and Using the "Array" Parameter as a Table

So how do we split our array into a table and use it? Here is a procedure that will do just that:

IF  EXISTS (SELECT * FROM sys.objects 
WHERE object_id = OBJECT_ID(N'[dbo].[SalesPersonSumByDates]') 
AND type in (N'P', N'PC'))
DROP PROCEDURE [dbo].[SalesPersonSumByDates]
GO
CREATE PROCEDURE SalesPersonSumByDates
( @StartDate   DATETIME
, @EndDate     DATETIME
, @SalesPeople VARCHAR(50)
)
AS
-- Create a temp table of Sales Persons.
-- See Jeff Moden's Tally OH article for DelimitedSplit8K
-- It returns VARCHAR so we convert it to INT to match SalesPersonID.
SELECT CONVERT(INT, Item) AS SalesPersonID
INTO #Sls
FROM dbo.DelimitedSplit8K(@SalesPeople, ',')
-- The query for the report.
SELECT  
  SO.SalesPersonID, C.FirstName + ' ' + C.LastName AS SalesPerson
, COUNT(*) AS Orders
, SUM(SO.SubTotal) AS TotalAmount
FROM Sales.SalesOrderHeader SO
INNER JOIN HumanResources.Employee E ON
SO.SalesPersonID = E.EmployeeID
INNER JOIN Person.Contact C ON
E.ContactID = C.ContactID
WHERE SO.OrderDate BETWEEN @StartDate And @EndDate
-- The dynamic IN predicate.
AND   SO.SalesPersonID IN (SELECT SalesPersonID FROM #Sls)
GROUP BY
  SO.SalesPersonID, C.FirstName, C.LastName
RETURN

Notice that I'm still using an IN - it's just a SELECT statement rather than a hard coded IN condition.

Now run the procedure with this code:

EXEC SalesPersonSumByDates
  @StartDate   = '2001-07-01'
, @EndDate     = '2001-07-31'
, @SalesPeople = '275,279'

Notice that it produces exactly the same results as before. The query plan will look a little different since there are no hard coded scalar values for an OR. It really produces a JOIN.

I used a temp table rather than a table variable. Why? I've found that table variables used in a JOIN can be slow since the optimizer considers that there is only 1 row and the table has no statistics. This can make for a bad query plan, especially if there are a large number of elements in the "array" parameter. Here is the meat part of this plan:

And the Seek predicate used:

There is only 1 Seek Key and it's based on the temp table #Sls.

Here is another version using EXISTS that produces the exact same query plan on my SQL Server:

SELECT
  SO.SalesPersonID, C.FirstName + ' ' + C.LastName AS SalesPerson
, COUNT(*) AS Orders
, SUM(SO.SubTotal) AS TotalAmount
FROM Sales.SalesOrderHeader SO
INNER JOIN HumanResources.Employee E ON
SO.SalesPersonID = E.EmployeeID
INNER JOIN Person.Contact C ON
E.ContactID = C.ContactID
WHERE SO.OrderDate BETWEEN @StartDate And @EndDate
AND EXISTS  -- Uses EXISTS instead of IN.
  (SELECT 1 FROM #Sls S
   WHERE S.SalesPersonID = SO.SalesPersonID)
GROUP BY
  SO.SalesPersonID, C.FirstName, C.LastName

This can also be done with a direct JOIN of the temp table to the other tables. In this case you have to be careful that the user didn't enter the same parameter twice or the results will be doubled for that parameter - e.g. @SalesPeople = '275,279,275,279,279'. The user interface isn't likely to do this, but I've found it's better to assume that you can be passed some garbage.

Get around this by putting DISTINCT in the temp table creation part:

SELECT DISTINCT CONVERT(INT, Item) AS SalesPersonID -- Only 1 of each
INTO #Sls
FROM dbo.DelimitedSplit8K(@SalesPeople, ',')
SELECT
  SO.SalesPersonID, C.FirstName + ' ' + C.LastName AS SalesPerson
, COUNT(*) AS Orders
, SUM(SO.SubTotal) AS TotalAmount
FROM Sales.SalesOrderHeader SO
INNER JOIN #Sls S ON    -- INNER JOIN
S.SalesPersonID = SO.SalesPersonID
INNER JOIN HumanResources.Employee E ON
SO.SalesPersonID = E.EmployeeID
INNER JOIN Person.Contact C ON
E.ContactID = C.ContactID
WHERE SO.OrderDate BETWEEN @StartDate And @EndDate
GROUP BY
  SO.SalesPersonID, C.FirstName, C.LastName

The INNER JOIN produces a slightly different plan on my SQL 2008, but runs in about the same amount of time as IN and EXISTS. Here is the plan for the JOIN version:

The plan is a little more involved in that the first JOIN is from #Sls to Employee and from there to SalesOrderHeader. It then gets resolved in the Hash Match:

It's not as simple and straightforward as the other two plans and it takes on the average 10 ms. longer to run. You would have to try each one for any particular application to see what works best.

This could also probably be done with a CROSS APPLY, but I've never done it for this type of application.

Splitting and Using the "Array" Parameter as a CTE

If you have SQL 2005 or later, you can skip the creation of a temp table and simply use a CTE. Here is the IN version using a CTE instead of a temp table:

-- Use CTE instead of table.
; WITH CTESplit AS
( SELECT CONVERT(INT, Item) AS SalesPersonID
  FROM dbo.DelimitedSplit8K(@SalesPeople, ',')
)
SELECT
  SO.SalesPersonID, C.FirstName + ' ' + C.LastName AS SalesPerson
, COUNT(*) AS Orders
, SUM(SO.SubTotal) AS TotalAmount
FROM Sales.SalesOrderHeader SO
INNER JOIN HumanResources.Employee E ON
SO.SalesPersonID = E.EmployeeID
INNER JOIN Person.Contact C ON
E.ContactID = C.ContactID
WHERE SO.OrderDate BETWEEN @StartDate And @EndDate
AND   SO.SalesPersonID IN (SELECT SalesPersonID FROM CTESplit)
GROUP BY
  SO.SalesPersonID, C.FirstName, C.LastName

The CTE actually shaved 20 ms. off of my execution time. It may scale upwards better than the temp table, but you should try both and see for yourself.

I specifically didn't go into using the new table-value parameters for 2008. I'm a little embarrassed because I haven't had the chance to actually use one yet in a production environment so I'm far from an expert on them.

In short, a dynamic IN predicate is fairly easy to handle in SQL Server. Just convert it to what SQL handles best - a table.

Todd Fifield

Using a Variable for an IN Predicate

Introduction

References and Sources

Test Data

How a Hard-Coded "WHERE IN" Works

Splitting and Using the "Array" Parameter as a Table

Splitting and Using the "Array" Parameter as a CTE

Rate

Share

Categories

Share

Rate

Using a Variable for an IN Predicate

Introduction

References and Sources

Test Data

How a Hard-Coded "WHERE IN" Works

Splitting and Using the "Array" Parameter as a Table

Splitting and Using the "Array" Parameter as a CTE

Rate

Share

Categories

Recent comments

Share

Rate

Related content

A Normalization Primer

Reaching the Outer Limits

Working with email addresses in SQL Server!

Querying Responses

Storing Checkbox Responses as Integers

Cookies on SQLServerCentral