Memory-optimized queries using table scan instead of seek

  • I've been having some trouble getting a single-column char(5) field to reliably use an index seek instead of a table scan. The production table in this case contains 25 million rows. As impressive as it is to scan 25 million rows in 35 seconds, the query should run much faster.

    Here's a partial table description:

    CREATE TABLE [dbo].[Summaries_MO]
    (
        [SummaryId] [int] IDENTITY(1,1) NOT NULL,
        [zipcode] [char](5) COLLATE Latin1_General_100_BIN2 NOT NULL,
        [Golf] [bit] NULL,
        [Homeowner] [bit] NULL,
        [IncomeCode] [char](1) COLLATE SQL_Latin1_General_CP1_CI_AS NULL,
        [Pets] [bit] NULL,
        CONSTRAINT [Summaries_MO_primaryKey] PRIMARY KEY NONCLUSTERED HASH
        (
            [SummaryId]
        ) WITH (BUCKET_COUNT = 33554432),
        INDEX [ixZIP] NONCLUSTERED
        (
            [zipcode] ASC
        )
    ) WITH (MEMORY_OPTIMIZED = ON, DURABILITY = SCHEMA_AND_DATA)

    Typically, this table is accessed with a query that includes:

    SELECT ...
    FROM SummaryTable
    WHERE ZipCode IN (SELECT ZipCode FROM @ZipCodesForMO)

    This query insists on using a table scan. I've tried WITH (FORCESEEK) for example, but that just makes the query fail.

    As I've investigated this issue I also tried:

    SELECT * FROM Summaries WHERE ZipCode IN ('xxxxx', 'xxxxx', 'xxxxx')

    When I run this query with 64 or fewer (actual, valid) ZIP codes, the query uses a table seek.

    But when I give it 65 or more ZIP codes it uses a table scan.

    To summarize: the production query with the table variable always uses a table scan, and the hard-coded query switches to a table scan once I specify 65 or more ZIP codes.

    Frankly, I'm wondering if the collation of the indexed column (Latin1_General_100_BIN2) is somehow the problem. I'll likely try converting the ZIP codes to an integer to see what happens.

    But I'd rather know what's going on than simply try things randomly.
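
    The definition of @ZipCodesForMO isn't shown in the thread; a minimal sketch of what such a table variable might look like (the column name, type and collation here are assumptions, chosen to match the indexed column):

    -- Hypothetical declaration of the @ZipCodesForMO table variable used in the
    -- production query above; the real definition isn't posted. Matching the
    -- char(5) type and BIN2 collation of the indexed column avoids implicit
    -- conversions against the memory-optimized index.
    DECLARE @ZipCodesForMO TABLE
    (
        ZipCode char(5) COLLATE Latin1_General_100_BIN2 NOT NULL PRIMARY KEY
    );

    INSERT INTO @ZipCodesForMO (ZipCode)
    VALUES ('33101'), ('33109'), ('33139');   -- illustrative ZIP codes only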

  • My suspicion is that this has more to do with selectivity, that is, the number of duplicate key values in the nonclustered index.

    😎

    What is the output of this query?

    SELECT
        COUNT(DISTINCT SMO.zipcode) AS KEY_COUNT
       ,COUNT(*) AS TOTAL_COUNT
       ,(COUNT(*) + 0.0) / (COUNT(DISTINCT SMO.zipcode) + 0.0) AS KEY_RATIO
    FROM dbo.Summaries_MO SMO;

  • As you might expect, the key count is simply the number of ZIP codes in our data:

    KEY_COUNT   TOTAL_COUNT   KEY_RATIO
    38245       25932222      678.05522290495489

    I haven't added additional columns to the index, since memory-optimized indexes are always covering; in other words, I didn't think there would be a benefit per se (and the index space would grow).

    Selecting 65 versus 64 out of 38,000 ZIP codes is still selecting a pretty small subset of the total. I've also reproduced my 65 versus 64 hard-coded ZipCode IN ('xxxxx', 'yyyyy') test on my development server, which includes only Florida ZIP codes.

    I'll be eager to read more.

  • Now I've added a second column to the index, but I still see my 64/65 split:

    WHERE ZipCode IN ('64 values', '...')
      AND IncomeCode IN ('A','B','C','D','E','F')

    my query does one index seek, but

    WHERE ZipCode IN ('65 values', '...')
      AND IncomeCode IN ('A','B','C','D','E','F')

    runs 65 index seeks, which took three seconds to run the first time, before "it" was cached.
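
    For reference, a sketch of how that two-column range index might be declared (illustrative only, not the poster's actual DDL; SQL Server 2014 doesn't allow ALTER on memory-optimized tables, so the index has to be declared when the table is re-created, and as far as I recall 2014 also requires index key character columns on memory-optimized tables to be NOT NULL and use a *_BIN2 collation, hence the changes to IncomeCode below):

    CREATE TABLE [dbo].[Summaries_MO]
    (
        [SummaryId] [int] IDENTITY(1,1) NOT NULL,
        [zipcode] [char](5) COLLATE Latin1_General_100_BIN2 NOT NULL,
        [IncomeCode] [char](1) COLLATE Latin1_General_100_BIN2 NOT NULL,  -- assumed NOT NULL + BIN2
        -- (remaining columns as in the original DDL)
        CONSTRAINT [Summaries_MO_primaryKey] PRIMARY KEY NONCLUSTERED HASH
            ([SummaryId]) WITH (BUCKET_COUNT = 33554432),
        INDEX [ixZIP_Income] NONCLUSTERED ([zipcode] ASC, [IncomeCode] ASC)
    ) WITH (MEMORY_OPTIMIZED = ON, DURABILITY = SCHEMA_AND_DATA);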

  • I've reproduced the 64/65 split using a small set of 1,000,000 rows with much higher duplication on the key than in your set, roughly 4 times higher. The difference between the two in execution time is around 2.5x.

    Using a table variable is 3-4 times slower than the scan and 6-7 times slower than the seek, even though both of those will use an index seek.

    The interesting thing is that once past 64 entries in the IN clause, the optimizer introduces a constant table scan and a sort operator, but with fewer than 65 it uses a loop over the constant values. This is something to look into ;-)

    😎

    Edit: added missing text.
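
    For anyone who wants to poke at the 64/65 behaviour themselves, a rough sketch of that kind of repro (illustrative, not the actual script from this thread; the table and column names are invented, and a plain disk-based table is used for simplicity):

    -- Build ~1,000,000 rows spread over a few hundred distinct keys, then compare
    -- the plans for an IN list of 64 values versus 65 values.
    CREATE TABLE dbo.SeekScanRepro
    (
        Id      int IDENTITY(1,1) NOT NULL PRIMARY KEY,
        ZipCode char(5) COLLATE Latin1_General_100_BIN2 NOT NULL
    );

    INSERT INTO dbo.SeekScanRepro (ZipCode)
    SELECT RIGHT('00000' + CAST(n.N % 370 AS varchar(5)), 5)   -- ~370 distinct keys
    FROM (
        SELECT TOP (1000000)
               ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) AS N
        FROM sys.all_columns AS a
        CROSS JOIN sys.all_columns AS b
    ) AS n;

    CREATE NONCLUSTERED INDEX ixZip ON dbo.SeekScanRepro (ZipCode);

    -- With "Include Actual Execution Plan" on, run the query twice: once with 64
    -- hard-coded values in the IN list and once with 65, e.g.
    -- SELECT * FROM dbo.SeekScanRepro WHERE ZipCode IN ('00001', '00002', /* ... */ '00065');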

  • It is best practice to bring the ZIP codes in with an INNER JOIN instead of listing them in the WHERE clause, because the IN operator consumes more memory.
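
    A minimal sketch of that rewrite (illustrative; it assumes @ZipCodesForMO exposes a single ZipCode column with the same type and collation as the indexed column):

    -- Join to the list of ZIP codes instead of using IN (SELECT ...).
    SELECT S.[SummaryId], S.[zipcode], S.[Golf], S.[Homeowner], S.[IncomeCode], S.[Pets]
    FROM dbo.Summaries_MO AS S
    INNER JOIN @ZipCodesForMO AS Z
            ON Z.ZipCode = S.zipcode;   -- same data type and collation on both sides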

  • There are only 38k distinct values in a data set of 25 million. Add to that the fact that you're selecting a broad swath of those values. Assuming an even distribution, you have ~657 rows for each key value. Selecting 65 of them means you're pulling 42k rows. That's about 1/50th of the total. The optimizer just doesn't have enough information to go on, so it's going to use the All Density value and that's going to result in scans. That's not necessarily a bad thing since it's likely to be a range scan, but it's not going to be able to use a simple seek for operations like this. The number of pages being accessed alone implies that a scan is probably the better way to go.

    "The credit belongs to the man who is actually in the arena, whose face is marred by dust and sweat and blood"
    - Theodore Roosevelt

    Author of:
    SQL Server Execution Plans
    SQL Server Query Performance Tuning
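
    For readers who want to see the All Density figure referred to above, something along these lines shows it (a sketch only; it assumes the statistics object carries the index name ixZIP, and note that statistics on memory-optimized tables in SQL Server 2014 are not updated automatically):

    -- Refresh and inspect the density vector for the zipcode index.
    -- 2014 requires FULLSCAN, NORECOMPUTE when updating stats on memory-optimized tables.
    UPDATE STATISTICS dbo.Summaries_MO ixZIP WITH FULLSCAN, NORECOMPUTE;

    DBCC SHOW_STATISTICS ('dbo.Summaries_MO', 'ixZIP') WITH DENSITY_VECTOR;

    -- All density for zipcode = 1 / 38245 distinct values, so the per-key row
    -- estimate is roughly 25932222 * (1 / 38245) ~ 678 rows, matching KEY_RATIO above.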

  • Grant Fritchey (9/21/2015)


    There are only 38k distinct values in a data set of 25 million. Add to that the fact that you're selecting a broad swath of those values. Assuming an even distribution, you have ~657 rows for each key value. Selecting 65 of them means you're pulling 42k rows. That's about 1/50th of the total. The optimizer just doesn't have enough information to go on, so it's going to use the All Density value and that's going to result in scans. That's not necessarily a bad thing since it's likely to be a range scan, but it's not going to be able to use a simple seek for operations like this. The number of pages being accessed alone implies that a scan is probably the better way to go.

    Grant, it looks like there is a hard limit on the number of entries in an "IN" list where the optimizer goes from a seek to a scan. When this happens the look up values (in clause) go from being a loop through the constants to constant table scan. I've reproduced this behavior with many times more rows per key value pulling 150K rows with a seek and then suddenly, by adding one more entry to the lookup list, it goes for a scan.

    😎

  • Eirikur Eiriksson (9/21/2015)


    Grant Fritchey (9/21/2015)


    There are only 38k distinct values in a data set of 25 million. Add to that the fact that you're selecting a broad swath of those values. Assuming an even distribution, you have ~657 rows for each key value. Selecting 65 of them means you're pulling 42k rows. That's about 1/50th of the total. The optimizer just doesn't have enough information to go on, so it's going to use the All Density value and that's going to result in scans. That's not necessarily a bad thing since it's likely to be a range scan, but it's not going to be able to use a simple seek for operations like this. The number of pages being accessed alone implies that a scan is probably the better way to go.

    Grant, it looks like there is a hard limit on the number of entries in an "IN" list where the optimizer goes from a seek to a scan. When this happens the look up values (in clause) go from being a loop through the constants to constant table scan. I've reproduced this behavior with many times more rows per key value pulling 150K rows with a seek and then suddenly, by adding one more entry to the lookup list, it goes for a scan.

    😎

    Weird that it's a hard coded value, but not that surprising in terms of behavior.

    "The credit belongs to the man who is actually in the arena, whose face is marred by dust and sweat and blood"
    - Theodore Roosevelt

    Author of:
    SQL Server Execution Plans
    SQL Server Query Performance Tuning

  • I've tried a JOIN as well.

    The issue appears to be only with the memory-optimized table, resulting in the M-O table running slower than the disk-based table.

    When it uses an index seek, the M-O table is faster.

  • Grant Fritchey (9/21/2015)


    There are only 38k distinct values in a data set of 25 million. ... ... The number of pages being accessed alone implies that a scan is probably the better way to go.

    Unfortunately SQL Server's choice is not the correct solution for the memory-optimized version of the table, going from sub-1 second query time to 35 seconds as it scans all 25 million rows.

    It might be okay if the query against the memory-optimized table would accept my hints, but so far it won't run at all if I include them.
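
    For reference, the kind of hinted query being described might look something like this (a sketch only; whether the memory-optimized table accepts the hint is exactly the open question here, and @ZipCodesForMO is the table variable from the original query):

    -- Attempting to force the range index on the memory-optimized table.
    SELECT S.[SummaryId], S.[zipcode]
    FROM dbo.Summaries_MO AS S WITH (INDEX (ixZIP))
    WHERE S.zipcode IN (SELECT Z.ZipCode FROM @ZipCodesForMO AS Z);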

  • There are a couple of references to the 64/65 split in this article by Paul White. I'm sure I've seen something more comprehensive somewhere, but the Google-force eludes me today.

    β€œWrite the query the simplest way. If through testing it becomes clear that the performance is inadequate, consider alternative query forms.” - Gail Shaw

    For fast, accurate and documented assistance in answering your questions, please read this article.
    Understanding and using APPLY, (I) and (II) Paul White
    Hidden RBAR: Triangular Joins / The "Numbers" or "Tally" Table: What it is and how it replaces a loop Jeff Moden

  • ChrisM@Work (9/22/2015)


    There's a couple of references to the 64/65 split in this article by Paul White. I'm sure I've seen something more comprehensive somewhere but the Google - force eludes me today.

    Thanks Chris!

    😎

  • I'm afraid that this one was a Homer Simpson [doh!]

    My issue turned out to be that one side of my comparison was nchar and the other side was char.

    I had been "playing with" the BIN collations versus the BIN2 collations (I'm sticking with BIN2), but in the process, somehow, I ended up with nchar on one side and char on the other. SQL Server silently performed an implicit conversion (no warnings or errors of any kind), and that in turn prevented the index from being used for a seek.
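
    A minimal sketch of the kind of mismatch described (illustrative, not the actual code from the thread): nchar outranks char in data type precedence, so when the two sides differ it is the indexed char(5) column that gets implicitly converted, and the index can no longer be used for a seek.

    -- The lookup values are nchar while the indexed column is char(5) BIN2:
    DECLARE @ZipCodesForMO TABLE (ZipCode nchar(5) NOT NULL PRIMARY KEY);

    -- zipcode is implicitly converted to nchar here, which rules out an index seek.
    SELECT S.[SummaryId]
    FROM dbo.Summaries_MO AS S
    WHERE S.zipcode IN (SELECT Z.ZipCode FROM @ZipCodesForMO AS Z);

    -- Fix: keep both sides char(5) with the same BIN2 collation, e.g.
    -- DECLARE @ZipCodesForMO TABLE
    --     (ZipCode char(5) COLLATE Latin1_General_100_BIN2 NOT NULL PRIMARY KEY);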
