Help avoiding temp table with string of IDs passed into stored procedure

  • I was wondering if anyone has a way of using a string of IDs passed into a stored procedure as a VARCHAR parameter ('1,2,100,1020,') in an IN clause without parsing the list into a temp table or table variable. Here's the situation: I've got a stored procedure that is called all the time and works with some large tables (100+ million rows). One of the parameters passed in is a list of IDs for the large table, and that list can contain anywhere from 1 to ~100 IDs.

    Currently, we are using a function to parse the list of IDs into a temp table and then joining to the temp table in the query:

    CREATE PROCEDURE [dbo].[GetStuff] (
        @IdList varchar(max)
    )
    AS
    SET NOCOUNT ON;
    SET TRANSACTION ISOLATION LEVEL READ UNCOMMITTED;

    CREATE TABLE #IdList (
        [LtID] int NOT NULL PRIMARY KEY
    );

    INSERT INTO #IdList (LtID) EXEC dbo.spGetTableFromString @IdList, ',';

    SELECT A bunch of things...
    FROM [dbo].[LargeTable] LT
    INNER JOIN #IdList IDS ON IDS.LtID = LT.LtID
    INNER JOIN [dbo].[LargeTable2] LT2 ON LT2.LtID = IDS.LtID;

    DROP TABLE #IdList;
    GO

    The problem we're running into is that since this proc gets called so often, we sometimes run into tempdb contention that slows things down. In my testing (unfortunately I don't have a good way of generating a production load), swapping the #table for an @table didn't make any difference, which makes sense to me given that both are allocated in tempdb.
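
    As an aside, a rough way to confirm that the waits really are tempdb allocation contention (just a sketch using the standard DMVs, nothing specific to this proc) is to watch for PAGELATCH waits on tempdb pages while the load is running:

    SELECT wt.session_id, wt.wait_type, wt.wait_duration_ms, wt.resource_description
    FROM sys.dm_os_waiting_tasks wt
    WHERE wt.wait_type LIKE 'PAGELATCH%'
      AND wt.resource_description LIKE '2:%'; -- tempdb is database_id 2; 2:1:1, 2:1:2 and 2:1:3 are the PFS/GAM/SGAM allocation pages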

    One approach that I tried was that since the SELECT query is pretty simple, I moved it to dynamic SQL:

    CREATE PROCEDURE [dbo].[GetStuff] (
        @IdList varchar(max)
    )
    AS
    SET NOCOUNT ON;
    SET TRANSACTION ISOLATION LEVEL READ UNCOMMITTED;

    EXEC ('SELECT A bunch of things... '
        + 'FROM [dbo].[LargeTable] LT '
        + 'INNER JOIN [dbo].[LargeTable2] LT2 ON LT2.LtID = LT.LtID AND '
        + ' LT2.LtID IN (' + @IdList + ') ');
    GO

    The problem I had there is that it creates an ad hoc plan for the query and only reuses it if the same list of parameters is passed in, so I get a higher CPU cost because it compiles a plan on almost every call, and it also bloats the plan cache since the parameter list is almost always different. I feel like we'd probably be trading one problem for another.
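
    The bloat itself is easy to measure; a quick sketch (standard plan-cache DMV, nothing specific to this proc) that counts the single-use ad hoc plans and the memory they hold:

    SELECT COUNT(*) AS single_use_adhoc_plans,
           SUM(CAST(size_in_bytes AS bigint)) / 1048576 AS approx_size_mb
    FROM sys.dm_exec_cached_plans
    WHERE objtype = 'Adhoc'
      AND usecounts = 1;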

    Is there an approach that I haven't considered that may help get the best of both worlds, avoiding or minimizing tempDB contention but also not having to compile a new plan every time the proc is run?

  • The bigger issue is that the way you are doing this, it is now wide open to SQL injection. Do you need to use varchar(max)? Can you fit your list inside 8,000 characters? If so, take a look at the link in my signature about splitting strings.

    _______________________________________________________________

    Need help? Help us help you.

    Read the article at http://www.sqlservercentral.com/articles/Best+Practices/61537/ for best practices on asking questions.

    Need to split a string? Try Jeff Moden's splitter http://www.sqlservercentral.com/articles/Tally+Table/72993/.

    Cross Tabs and Pivots, Part 1 – Converting Rows to Columns - http://www.sqlservercentral.com/articles/T-SQL/63681/
    Cross Tabs and Pivots, Part 2 - Dynamic Cross Tabs - http://www.sqlservercentral.com/articles/Crosstab/65048/
    Understanding and Using APPLY (Part 1) - http://www.sqlservercentral.com/articles/APPLY/69953/
    Understanding and Using APPLY (Part 2) - http://www.sqlservercentral.com/articles/APPLY/69954/

  • It is almost always better to use a temp table or dynamic SQL for this than to try to do something else. Both of those will/should get you the optimal plan for each execution (caveated with some temp table statistics anomalies covered by Paul White in a SQL Blog post). Plan caching, CPU burn for compilations, etc. are almost always WAY secondary issues compared to the DISASTROUSLY bad query plans and performance you could get if you have data value skew and don't have statistics available every time you pass in a collection of values. I.e., if you use a table variable here you can get crushed with bad plans because of the lack of statistics on the values.
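
    One partial mitigation, offered only as a sketch and not something from this thread: OPTION (RECOMPILE) on the statement lets the optimizer see the table variable's actual row count at execution time (still no column statistics), at the cost of a compile per call:

    DECLARE @Ids TABLE (LtID int NOT NULL PRIMARY KEY);
    -- ...populate @Ids from the incoming string with whatever splitter you already use...

    SELECT A bunch of things...
    FROM [dbo].[LargeTable] LT
    INNER JOIN @Ids IDS ON IDS.LtID = LT.LtID
    OPTION (RECOMPILE); -- the real @Ids row count is visible when this statement compiles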

    Best,
    Kevin G. Boles
    SQL Server Consultant
    SQL MVP 2007-2012
    TheSQLGuru on googles mail service

  • First of all, thanks for the responses. I don't think SQL Injection is a big problem given that the only time this procedure is called is by an application that passes in the list of IDs.

    As for the original question, I've been testing all 3 approaches (temp table, table variable, and dynamic SQL) using a copy of a prod DB and a large variety of executions. From what I've seen, anywhere from 1 to ~1000 IDs get passed into the proc, meaning a varchar(8000) wouldn't cut it given that the IDs being passed in are generally 7-8 digit INTs separated by commas.

    So far, I haven't run into any scenario where the lack of stats on the @table variable has caused SQL to create a bad plan. I've tried having the first call pass in 1 ID, followed by the rest (including some that passed in ~1000 IDs), and the plan that is generated works well for all scenarios. I've also arranged it so that, after clearing the plan cache, I do a call with a large list first, followed by the rest, and it too works well. In either case the same plan is used, since the optimizer treats the table variable as if it has 1 record.

    For every scenario I tried, the @table variable proc used much less CPU, did fewer reads, and took far less time than the #temp table version. Compared to the dynamic SQL version it did about the same number of reads (although usually a couple more), used far less CPU, and took far less time.

    After looking at the plans for each of the three scenarios, I can see that the difference in performance isn't because of any inherent advantage of one approach over another, but because the plan used by the @table variable is more efficient: with both of the other approaches, the estimated vs. actual rows is way off (160K estimated vs. 264 actual), leading to a less efficient plan. I've tried updating statistics on all the tables involved and recompiling, but it still creates the same plan because of the overestimation. As a test I added some query hints and was able to get the #temp table and dynamic SQL procs to use the same plan as the @table variable, and when that happens performance between all 3 is similar, but I'm not going to put query hints in the production proc.

    Any idea what else could cause the bad estimation? The query is pretty close to as simple as what I put above, although it does do a couple of left joins to some small tables after the inner joins between the large tables.

  • Sorry - not sure how to help with this one, at least not via a forum thread. 🙁

    Best,
    Kevin G. Boles
    SQL Server Consultant
    SQL MVP 2007-2012
    TheSQLGuru on googles mail service

  • ErikMN (9/30/2014)


    I don't think SQL Injection is a big problem given that the only time this procedure is called is by an application that passes in the list of IDs.

    Today that might be true, but what about next month when the app is converted to a web application? It is easy enough to protect against SQL injection that not doing it is irresponsible.


  • On the issue of your problem I have to agree with Kevin. This sounds like a much bigger topic than a forum is designed to deal with.


  • Can we assume the large tables have indexes keyed on the IDs that are being passed in?

    If so, and the IDs are in a (fairly) tight range, you could also try adding a hi and lo check to the joins that reflect the min/max ID value in the table.
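
    A rough sketch of that idea, reusing the #IdList shape from the original proc (the variable names here are made up):

    DECLARE @MinId int, @MaxId int;
    SELECT @MinId = MIN(LtID), @MaxId = MAX(LtID) FROM #IdList;

    SELECT A bunch of things...
    FROM [dbo].[LargeTable] LT
    INNER JOIN #IdList IDS ON IDS.LtID = LT.LtID
    INNER JOIN [dbo].[LargeTable2] LT2 ON LT2.LtID = IDS.LtID
    WHERE LT.LtID BETWEEN @MinId AND @MaxId
      AND LT2.LtID BETWEEN @MinId AND @MaxId; -- the extra range predicates give the optimizer a narrow key range to seek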

    SQL DBA,SQL Server MVP(07, 08, 09) "It's a dog-eat-dog world, and I'm wearing Milk-Bone underwear." "Norm", on "Cheers". Also from "Cheers", from "Carla": "You need to know 3 things about Tortelli men: Tortelli men draw women like flies; Tortelli men treat women like flies; Tortelli men's brains are in their flies".

  • As far as I understand, the plan you get with the table variable is the most efficient one. Usually you get the best plan when you use literals in queries, without variables or parameters. If this plan is better than the plan with dynamic SQL, that could mean the QO overestimates the number of rows for the predicate LT2.LtID IN (1,2,3...). Since the estimation for a table variable is always 1 row, this wrong estimation helps the final cardinality estimation because it is wrong in the other direction, i.e. two wrongs make a right. In the case of large tables a table variable is usually used as a filter (inner join), especially when the number of rows for operators dealing with the large table is overestimated.

    The plan with the temp table, on the other hand, keeps the estimations from the first procedure execution. So, if you first called it with 500 IDs, all subsequent invocations will have the same estimation regardless of the number of items in the comma-separated list.

    And again, even if you would get the best plan with dynamic SQL, as Sean and Kevin warned, do not underestimate the risk of SQL injection. Better to try to tune not only the SP but the application too. Do you really need a lot of rows from a SELECT between two large tables in an SP that is called this frequently? What does your application do with the result of this SP? If you have performance issues, it's worth considering the application (re)design too. Dynamic SQL is powerful, but it should come only after you have (unsuccessfully) tried all the other options.

    HTH.

    ___________________________
    Do Not Optimize for Exceptions!

  • Heh... have you done an actual test to see which is better? You have to remember that even supposed "actual" execution plans are estimates in many ways.

    Execution plans help you figure out what needs to be done for performance tuning and shouldn't be used as some form of performance "measurement".

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Jeff Moden (10/2/2014)


    Heh... have you done an actual test to see which is better? You have to remember that even supposed "actual" execution plans are estimates in many ways.

    Execution plans help you figure out what needs to be done for performance tuning and shouldn't be used as some form of performance "measurement".

    First of all, by all means get rid of the MAX as your input parameter. That alone will kill performance if you need to split it.

    I don't know all the particulars of your tables, but I'd have a look at this approach. It uses Jeff's DelimitedSplit8K iTVF to perform the split and a correlated subquery to enforce the match. This approach also injection-proofs your code because the split input parameter is being treated like a table of data, which it should be, and not a string.

    SELECT A bunch of things...
    FROM dbo.LargeTable lt
    INNER JOIN dbo.LargeTable2 lt2 ON lt2.LtID = lt.LtID
    WHERE EXISTS (SELECT 1
                  FROM DelimitedSplit8K(@IdList, ',') s
                  WHERE s.Item = lt.LtID);

    If you aren't familiar with DelimitedSplit8K, it's well worth the time to read up on it at http://www.sqlservercentral.com/articles/Tally+Table/72993/. It will change the way you think about data and change your expectations of performance.

    Another tip is that if you don't need dbo.LargeTable2, then getting rid of it will help. But like I said, I don't know all the specifics involved.

    Lastly, you'll definitely want to set up different approaches and performance test each one. You're going to want to eke all the performance you can out of your code.

  • For passing large numbers of IDs, you might want to consider passing them as XML with some short tags so that you can pass more than 8K worth of IDs. Then just split the XML, which can be quite quick when you don't need to first convert a CSV to XML.
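
    A minimal sketch of that, assuming the application builds the XML itself (the <i> element name is just an example):

    DECLARE @IdListXml xml = N'<i>1</i><i>2</i><i>100</i><i>1020</i>';

    SELECT x.i.value('.', 'int') AS LtID
    FROM @IdListXml.nodes('/i') AS x(i);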

    --Jeff Moden

  • Would it help if a permanent table was used? You could dump the IDs along with a common GUID and, when you are done, delete all rows with that GUID.

    INSERT INTO idList (OwnerID, ID)
    SELECT @GUID, id
    FROM "the list of id's"

    ---------------

    SELECT b.col1, ...
    FROM idList l
    INNER JOIN BigTable b ON b.ID = l.ID
    WHERE l.OwnerID = @GUID

    ---------------

    DELETE FROM idList WHERE OwnerID = @GUID
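
    To make that concrete, a sketch that assumes the DelimitedSplit8K splitter mentioned earlier for the "list of id's" part and a fresh NEWID() per call (names are illustrative):

    DECLARE @GUID uniqueidentifier = NEWID();

    INSERT INTO dbo.idList (OwnerID, ID)
    SELECT @GUID, s.Item
    FROM dbo.DelimitedSplit8K(@IdList, ',') s;

    The cleanup DELETE needs to run even if the SELECT fails, so it is probably worth wrapping the three steps in TRY...CATCH.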

  • The following approach assumes the same list of IDs as a comma-separated set. It does not involve a temp table or dynamic SQL, and should keep a fixed execution plan. It also properly handles spaces, a trailing comma, empty set members, and duplicate members. The end result is a table variable which you can leverage for an inner join (sketched after the sample output below).

    declare @IdList varchar(max) = '1, 2, 100,,1020,1020,,';
    declare @IdList2 xml = '<id>' + replace(replace(@IdList, ' ', ''), ',', '</id><id>') + '</id>';
    declare @IdList3 table (id int not null primary key with (ignore_dup_key = on));

    insert into @IdList3 (id)
    select id
    from (select items.id.value('.', 'varchar(9)') as id
          from @IdList2.nodes('id') as items(id)) x
    where id > '';

    select * from @IdList3;

    id
    -----------
    1
    2
    100
    1020
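
    From there, the join back to the large tables is the same shape as the original proc, e.g.:

    SELECT A bunch of things...
    FROM dbo.LargeTable lt
    INNER JOIN @IdList3 ids ON ids.id = lt.LtID;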

    "Do not seek to follow in the footsteps of the wise. Instead, seek what they sought." - Matsuo Basho

  • Try your XML method under stress and see how it holds up. Or with lots of rows.

    There are some very nice threads here on SSC.com about string parsing into tables. Search DelimitedSplit8K

    Best,
    Kevin G. Boles
    SQL Server Consultant
    SQL MVP 2007-2012
    TheSQLGuru on googles mail service
