September 19, 2015 at 10:07 pm
I've been having some trouble getting a single-column "char(5)" index to reliably use an index seek instead of a table scan. The production table in this case contains 25 million rows. As impressive as it is to scan 25 million rows in 35 seconds, the query should run much faster.
Here's a partial table description:
CREATE TABLE [dbo].[Summaries_MO]
(
    [SummaryId] [int] IDENTITY(1,1) NOT NULL,
    [zipcode] [char](5) COLLATE Latin1_General_100_BIN2 NOT NULL,
    [Golf] [bit] NULL,
    [Homeowner] [bit] NULL,
    [IncomeCode] [char](1) COLLATE SQL_Latin1_General_CP1_CI_AS NULL,
    [Pets] [bit] NULL,
    CONSTRAINT [Summaries_MO_primaryKey] PRIMARY KEY NONCLUSTERED HASH
    (
        [SummaryId]
    ) WITH (BUCKET_COUNT = 33554432),
    INDEX [ixZIP] NONCLUSTERED
    (
        [zipcode] ASC
    )
) WITH (MEMORY_OPTIMIZED = ON, DURABILITY = SCHEMA_AND_DATA);
Typically, this table is accessed with a query that includes:
SELECT ...
FROM dbo.Summaries_MO
WHERE zipcode IN (SELECT ZipCode FROM @ZipCodesForMO)
This query insists on using a table scan. I've tried WITH (FORCESEEK), for example, but that just makes the query fail.
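For reference, the hinted form that fails probably looked something like this (a sketch only, since the exact query and the definition of @ZipCodesForMO weren't posted; the column list and aliases here are assumptions):
-- Sketch of the attempted hint (not the original query verbatim); as noted above,
-- this errors out against the memory-optimized table instead of producing a seek.
SELECT S.SummaryId, S.zipcode, S.Golf, S.Homeowner, S.IncomeCode, S.Pets
FROM dbo.Summaries_MO AS S WITH (FORCESEEK)
WHERE S.zipcode IN (SELECT Z.ZipCode FROM @ZipCodesForMO AS Z);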
As I've investigated this issue I also tried:
SELECT * FROM Summaries WHERE ZipCode IN ('xxxxx', 'xxxxx', 'xxxxx')
When I run this query with 64 or fewer (actual, valid) ZIP codes, the query uses a table seek.
But when I give it 65 or more ZIP codes it uses a table scan.
To summarize: the production query (the one with the table-variable subquery) always uses a table scan, and the hard-coded version switches to a table scan at 65 or more ZIP codes.
Frankly, I'm wondering if the collation of the indexed column (Latin1_General_100_BIN2) is somehow the problem. I'll likely try converting the ZIP codes to integers to see what happens.
But I'd rather know what's going on than simply try things randomly.
September 20, 2015 at 1:31 am
My suspicion is that this has more to do with the selectivity, that is the number of duplicate key values in the nonclustered index.
What is the output of this query?
SELECT
    COUNT(DISTINCT SMO.zipcode) AS KEY_COUNT
   ,COUNT(*) AS TOTAL_COUNT
   ,(COUNT(*) + 0.0) / (COUNT(DISTINCT SMO.zipcode) + 0.0) AS KEY_RATIO
FROM dbo.Summaries_MO AS SMO;
September 20, 2015 at 6:34 am
As you might expect, the key count is simply the number of ZIP codes in our data:
KEY_COUNT TOTAL_COUNT KEY_RATIO
38245        25932222       678.05522290495489
I haven't added additional columns to the index, since memory-optimized indexes are always covering; in other words, I didn't think there would be a benefit per se (and the index space would grow).
Selecting 65 versus 64 out of 38,000 ZIP codes is still selecting a pretty small subset of the total. I've also reproduced my 65 versus 64 hard-coded ZipCode IN ('xxxxx', 'yyyyy')
test on my development server, which includes only Florida ZIP codes.
I'll be eager to read more.
September 20, 2015 at 7:14 am
Now I've added a second column to the index, but I still see the same 64/65 split. With
WHERE ZipCode IN ('64 values', '...') AND
      IncomeCode IN ('A','B','C','D','E','F')
my query does a single index seek, but with
WHERE ZipCode IN ('65 values', '...') AND
      IncomeCode IN ('A','B','C','D','E','F')
it runs 65 index seeks, which took three seconds the first time it ran, before "it" was cached.
September 20, 2015 at 9:57 am
I've reproduced the 64/65 split using a small set of 1,000,000 rows with much higher duplication on the key than in your set, roughly 4 times higher. The difference in execution time between the two is around 2.5x.
Using a table variable is 3-4 times slower than the scan and 6-7 times slower than the seek, even though both of those will use an index seek.
The interesting thing is that once past 64 entries in the IN list, the optimizer introduces a Constant Scan and a Sort operator, but with fewer than 65 it uses a loop over the constant values. This is something to look into ;-)
Edit: added missing text.
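For anyone who wants to poke at this, a rough repro along these lines shows the plan change (a sketch, not the exact test harness used above; the table and index names are made up):
-- Disk-based test table with heavily duplicated 5-character keys.
CREATE TABLE dbo.ZipTest
(
    Id      int IDENTITY(1,1) NOT NULL PRIMARY KEY CLUSTERED,
    zipcode char(5) NOT NULL
);

CREATE NONCLUSTERED INDEX ixZipTest ON dbo.ZipTest (zipcode);

-- ~1,000,000 rows spread over ~400 keys, i.e. roughly 2,500 rows per key.
INSERT INTO dbo.ZipTest (zipcode)
SELECT TOP (1000000)
       RIGHT('00000' + CAST(ABS(CHECKSUM(NEWID())) % 400 AS varchar(10)), 5)
FROM sys.all_columns AS a
CROSS JOIN sys.all_columns AS b;

-- Compare the actual execution plans for 64 versus 65 literals in the IN list;
-- the 65-value form is where the Constant Scan + Sort showed up in the tests above.
SELECT COUNT(*) FROM dbo.ZipTest WHERE zipcode IN ('00001', '00002' /* ...64 values in total... */);
SELECT COUNT(*) FROM dbo.ZipTest WHERE zipcode IN ('00001', '00002' /* ...65 values in total... */);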
September 21, 2015 at 7:03 am
It's best practice to join to the ZIP code list with an INNER JOIN instead of putting it in the WHERE clause, because the IN operator can consume more memory than a join.
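For what it's worth, the join form would look roughly like this (a sketch; the declaration of @ZipCodesForMO was never posted, so the char(5)/collation choice below is an assumption that simply mirrors the table column):
DECLARE @ZipCodesForMO TABLE
(
    ZipCode char(5) COLLATE Latin1_General_100_BIN2 NOT NULL PRIMARY KEY
);

INSERT INTO @ZipCodesForMO (ZipCode) VALUES ('63101'), ('63102'), ('63103');

-- Same filtering as the IN (SELECT ...) form, expressed as an inner join.
SELECT S.SummaryId, S.zipcode, S.IncomeCode
FROM dbo.Summaries_MO AS S
INNER JOIN @ZipCodesForMO AS Z
        ON Z.ZipCode = S.zipcode;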
September 21, 2015 at 7:35 am
There are only 38k distinct values in a data set of 25 million. Add to that the fact that you're selecting a broad swath of those values. Assuming an even distribution, you have roughly 678 rows for each key value. Selecting 65 of them means you're pulling around 44k rows. That's only a small slice of the 25 million, but the optimizer just doesn't have enough information to go on, so it's going to use the All Density value and that's going to result in scans. That's not necessarily a bad thing since it's likely to be a range scan, but it's not going to be able to use a simple seek for operations like this. The number of pages being accessed alone implies that a scan is probably the better way to go.
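If you want to see the density figure the optimizer is working from, something like this should show it (standard DBCC syntax; note that on SQL Server 2014 memory-optimized tables, statistics aren't updated automatically, so a manual update may be needed first):
-- Refresh and inspect the statistics behind the zipcode index. With 38,245
-- distinct values, the "All density" for zipcode should be roughly 1/38245,
-- i.e. about 0.000026.
UPDATE STATISTICS dbo.Summaries_MO ixZIP WITH FULLSCAN, NORECOMPUTE;
DBCC SHOW_STATISTICS ('dbo.Summaries_MO', 'ixZIP') WITH DENSITY_VECTOR;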
"The credit belongs to the man who is actually in the arena, whose face is marred by dust and sweat and blood"
- Theodore Roosevelt
Author of:
SQL Server Execution Plans
SQL Server Query Performance Tuning
September 21, 2015 at 8:34 am
Grant Fritchey (9/21/2015)
There are only 38k distinct values in a data set of 25 million. Add to that the fact that you're selecting a broad swath of those values. Assuming an even distribution, you have roughly 678 rows for each key value. Selecting 65 of them means you're pulling around 44k rows. That's only a small slice of the 25 million, but the optimizer just doesn't have enough information to go on, so it's going to use the All Density value and that's going to result in scans. That's not necessarily a bad thing since it's likely to be a range scan, but it's not going to be able to use a simple seek for operations like this. The number of pages being accessed alone implies that a scan is probably the better way to go.
Grant, it looks like there is a hard limit on the number of entries in an "IN" list where the optimizer goes from a seek to a scan. When this happens, the lookup values in the IN clause go from a loop over the constants to a Constant Scan. I've reproduced this behavior with many times more rows per key value, pulling 150K rows with a seek, and then suddenly, by adding one more entry to the lookup list, it goes for a scan.
September 21, 2015 at 8:45 am
Eirikur Eiriksson (9/21/2015)
Grant Fritchey (9/21/2015)
There are only 38k distinct values in a data set of 25 million. Add to that the fact that you're selecting a broad swath of those values. Assuming an even distribution, you have roughly 678 rows for each key value. Selecting 65 of them means you're pulling around 44k rows. That's only a small slice of the 25 million, but the optimizer just doesn't have enough information to go on, so it's going to use the All Density value and that's going to result in scans. That's not necessarily a bad thing since it's likely to be a range scan, but it's not going to be able to use a simple seek for operations like this. The number of pages being accessed alone implies that a scan is probably the better way to go.
Grant, it looks like there is a hard limit on the number of entries in an "IN" list where the optimizer goes from a seek to a scan. When this happens, the lookup values in the IN clause go from a loop over the constants to a Constant Scan. I've reproduced this behavior with many times more rows per key value, pulling 150K rows with a seek, and then suddenly, by adding one more entry to the lookup list, it goes for a scan.
Weird that it's a hard-coded value, but not that surprising in terms of behavior.
"The credit belongs to the man who is actually in the arena, whose face is marred by dust and sweat and blood"
- Theodore Roosevelt
Author of:
SQL Server Execution Plans
SQL Server Query Performance Tuning
September 22, 2015 at 4:05 am
I've tried a JOIN as well.
The issue appears to be only with the memory-optimized table: with the scan, the M-O table runs slower than the disk-based table.
When it uses an index seek, the M-O table is faster.
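A quick way to put numbers on that comparison (a sketch; Summaries_MO_disk is a hypothetical disk-based copy of the table, not something from this thread):
SET STATISTICS TIME ON;

-- Compare elapsed/CPU time for the same predicate against each version of the table.
SELECT COUNT(*) FROM dbo.Summaries_MO      WHERE zipcode IN ('33101', '33102', '33103');
SELECT COUNT(*) FROM dbo.Summaries_MO_disk WHERE zipcode IN ('33101', '33102', '33103');

SET STATISTICS TIME OFF;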
September 22, 2015 at 4:11 am
Grant Fritchey (9/21/2015)
There are only 38k distinct values in a data set of 25 million. ... ... The number of pages being accessed alone implies that a scan is probably the better way to go.
Unfortunately, SQL Server's choice is not the right one for the memory-optimized version of the table: the query goes from sub-second to 35 seconds as it scans all 25 million rows.
It might be okay if the query against the memory-optimized table would accept my hints, but so far it won't run at all if I include them.
September 22, 2015 at 4:11 am
There are a couple of references to the 64/65 split in this article by Paul White. I'm sure I've seen something more comprehensive somewhere, but the Google-force eludes me today.
September 22, 2015 at 4:19 am
ChrisM@Work (9/22/2015)
There are a couple of references to the 64/65 split in this article by Paul White. I'm sure I've seen something more comprehensive somewhere, but the Google-force eludes me today.
Thanks Chris!
October 13, 2015 at 10:42 am
I'm afraid this one was a Homer Simpson moment [doh!].
My issue turned out to be that one side of my comparison was nchar
and the other side was char.
I had been "playing with" the BIN collations versus the BIN2 collations (I'm sticking with BIN2), but in the process I somehow ended up comparing nchar to char. SQL Server silently did an implicit conversion (i.e. no warnings or errors of any kind), which in turn disabled the index.
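For anyone hitting the same thing, a sketch of the mismatch and the fix (the variable names are just illustrative):
DECLARE @zipN nchar(5) = N'63101';   -- nchar parameter vs. the table's char(5) column
SELECT COUNT(*)
FROM dbo.Summaries_MO
WHERE zipcode = @zipN;               -- the char column gets implicitly converted to nchar,
                                     -- which can keep the optimizer from seeking on ixZIP

DECLARE @zip char(5) = '63101';      -- match the column's type (and collation)
SELECT COUNT(*)
FROM dbo.Summaries_MO
WHERE zipcode = @zip;                -- straightforward seek on ixZIP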