September 19, 2015 at 10:07 pm
I've been having some trouble getting a single-column "char(5)" index to reliably use an index seek instead of a table scan. The production table in this case contains 25 million rows. As impressive as it is to scan 25 million rows in 35 seconds, the query should run much faster.
Here's a partial table description:
CREATE TABLE [dbo].[Summaries_MO]
(
    [SummaryId] [int] IDENTITY(1,1) NOT NULL,
    [zipcode] [char](5) COLLATE Latin1_General_100_BIN2 NOT NULL,
    [Golf] [bit] NULL,
    [Homeowner] [bit] NULL,
    [IncomeCode] [char](1) COLLATE SQL_Latin1_General_CP1_CI_AS NULL,
    [Pets] [bit] NULL,
    CONSTRAINT [Summaries_MO_primaryKey] PRIMARY KEY NONCLUSTERED HASH
    (
        [SummaryId]
    ) WITH (BUCKET_COUNT = 33554432),
    INDEX [ixZIP] NONCLUSTERED
    (
        [zipcode] ASC
    )
) WITH (MEMORY_OPTIMIZED = ON, DURABILITY = SCHEMA_AND_DATA);
Typically, this table is accessed with a query that includes:
SELECT ...
FROM dbo.Summaries_MO
WHERE zipcode IN (SELECT ZipCode FROM @ZipCodesForMO)
This query insists on using a table scan. I've tried WITH (FORCESEEK), for example, but that just makes the query fail.
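For reference, the hinted form that fails probably looked something like this (a sketch only, since the exact query and the definition of @ZipCodesForMO weren't posted; the column list and aliases here are assumptions):
-- Sketch of the attempted hint (not the original query verbatim); as noted above,
-- this errors out against the memory-optimized table instead of producing a seek.
SELECT S.SummaryId, S.zipcode, S.Golf, S.Homeowner, S.IncomeCode, S.Pets
FROM dbo.Summaries_MO AS S WITH (FORCESEEK)
WHERE S.zipcode IN (SELECT Z.ZipCode FROM @ZipCodesForMO AS Z);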
As I've investigated this issue I also tried:
SELECT * FROM Summaries WHERE ZipCode IN ('xxxxx', 'xxxxx', 'xxxxx')
When I run this query with 64 or fewer (actual, valid) ZIP codes, the query uses a table seek.
But when I give it 65 or more ZIP codes it uses a table scan.
To summarize: the production query (the one with the table-variable subquery) always uses a table scan, and the hard-coded version switches to a table scan at 65 or more ZIP codes.
Frankly, I'm wondering if the collation of the indexed column (Latin1_General_100_BIN2) is somehow the problem. I'll likely try converting the ZIP codes to integers to see what happens.
But I'd rather know what's going on than simply try things randomly.
September 20, 2015 at 1:31 am
My suspicion is that this has more to do with the selectivity, that is the number of duplicate key values in the nonclustered index.
What is the output of this query?
SELECT
    COUNT(DISTINCT SMO.zipcode) AS KEY_COUNT
   ,COUNT(*) AS TOTAL_COUNT
   ,(COUNT(*) + 0.0) / (COUNT(DISTINCT SMO.zipcode) + 0.0) AS KEY_RATIO
FROM dbo.Summaries_MO AS SMO;
September 20, 2015 at 6:34 am
As you might expect, the key count is simply the number of ZIP codes in our data:
KEY_COUNT TOTAL_COUNT KEY_RATIO
38245        25932222       678.05522290495489
I haven't added additional columns to the index, since memory-optimized indexes are always covering; in other words, I didn't think there would be a benefit per se (and the index space would grow).
Selecting 65 versus 64 out of 38,000 ZIP codes is still selecting a pretty small subset of the total. I've also reproduced my 65 versus 64 hard-coded ZipCode IN ('xxxxx', 'yyyyy')
test on my development server, which includes only Florida ZIP codes.
I'll be eager to read more.
September 20, 2015 at 7:14 am
Now I've added a second column to the index, but I still see the same 64/65 split. With
WHERE ZipCode IN ('64 values', '...') AND
      IncomeCode IN ('A','B','C','D','E','F')
my query does a single index seek, but with
WHERE ZipCode IN ('65 values', '...') AND
      IncomeCode IN ('A','B','C','D','E','F')
it runs 65 index seeks, which took three seconds the first time it ran, before "it" was cached.
September 20, 2015 at 9:57 am
I've reproduced the 64/65 split using a small set of 1,000,000 rows with much higher duplication on the key than in your set, roughly 4 times higher. The difference in execution time between the two is around 2.5x.
Using a table variable is 3-4 times slower than the scan and 6-7 times slower than the seek, even though both of those will use an index seek.
The interesting thing is that once past 64 entries in the IN list, the optimizer introduces a Constant Scan and a Sort operator, but with fewer than 65 it uses a loop over the constant values. This is something to look into ;-)
Edit: added missing text.
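For anyone who wants to poke at this, a rough repro along these lines shows the plan change (a sketch, not the exact test harness used above; the table and index names are made up):
-- Disk-based test table with heavily duplicated 5-character keys.
CREATE TABLE dbo.ZipTest
(
    Id      int IDENTITY(1,1) NOT NULL PRIMARY KEY CLUSTERED,
    zipcode char(5) NOT NULL
);

CREATE NONCLUSTERED INDEX ixZipTest ON dbo.ZipTest (zipcode);

-- ~1,000,000 rows spread over ~400 keys, i.e. roughly 2,500 rows per key.
INSERT INTO dbo.ZipTest (zipcode)
SELECT TOP (1000000)
       RIGHT('00000' + CAST(ABS(CHECKSUM(NEWID())) % 400 AS varchar(10)), 5)
FROM sys.all_columns AS a
CROSS JOIN sys.all_columns AS b;

-- Compare the actual execution plans for 64 versus 65 literals in the IN list;
-- the 65-value form is where the Constant Scan + Sort showed up in the tests above.
SELECT COUNT(*) FROM dbo.ZipTest WHERE zipcode IN ('00001', '00002' /* ...64 values in total... */);
SELECT COUNT(*) FROM dbo.ZipTest WHERE zipcode IN ('00001', '00002' /* ...65 values in total... */);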
September 21, 2015 at 7:03 am
It's best practice to join to the ZIP code list with an INNER JOIN instead of putting it in the WHERE clause, because the IN operator can consume more memory than a join.
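For what it's worth, the join form would look roughly like this (a sketch; the declaration of @ZipCodesForMO was never posted, so the char(5)/collation choice below is an assumption that simply mirrors the table column):
DECLARE @ZipCodesForMO TABLE
(
    ZipCode char(5) COLLATE Latin1_General_100_BIN2 NOT NULL PRIMARY KEY
);

INSERT INTO @ZipCodesForMO (ZipCode) VALUES ('63101'), ('63102'), ('63103');

-- Same filtering as the IN (SELECT ...) form, expressed as an inner join.
SELECT S.SummaryId, S.zipcode, S.IncomeCode
FROM dbo.Summaries_MO AS S
INNER JOIN @ZipCodesForMO AS Z
        ON Z.ZipCode = S.zipcode;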
September 21, 2015 at 7:35 am
There are only 38k distinct values in a data set of 25 million. Add to that the fact that you're selecting a broad swath of those values. Assuming an even distribution, you have roughly 678 rows for each key value. Selecting 65 of them means you're pulling around 44k rows. That's only a small slice of the 25 million, but the optimizer just doesn't have enough information to go on, so it's going to use the All Density value and that's going to result in scans. That's not necessarily a bad thing since it's likely to be a range scan, but it's not going to be able to use a simple seek for operations like this. The number of pages being accessed alone implies that a scan is probably the better way to go.
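If you want to see the density figure the optimizer is working from, something like this should show it (standard DBCC syntax; note that on SQL Server 2014 memory-optimized tables, statistics aren't updated automatically, so a manual update may be needed first):
-- Refresh and inspect the statistics behind the zipcode index. With 38,245
-- distinct values, the "All density" for zipcode should be roughly 1/38245,
-- i.e. about 0.000026.
UPDATE STATISTICS dbo.Summaries_MO ixZIP WITH FULLSCAN, NORECOMPUTE;
DBCC SHOW_STATISTICS ('dbo.Summaries_MO', 'ixZIP') WITH DENSITY_VECTOR;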
"The credit belongs to the man who is actually in the arena, whose face is marred by dust and sweat and blood"
- Theodore Roosevelt
Author of:
SQL Server Execution Plans
SQL Server Query Performance Tuning
September 21, 2015 at 8:34 am
Grant Fritchey (9/21/2015)
There are only 38k distinct values in a data set of 25 million. Add to that the fact that you're selecting a broad swath of those values. Assuming an even distribution, you have roughly 678 rows for each key value. Selecting 65 of them means you're pulling around 44k rows. That's only a small slice of the 25 million, but the optimizer just doesn't have enough information to go on, so it's going to use the All Density value and that's going to result in scans. That's not necessarily a bad thing since it's likely to be a range scan, but it's not going to be able to use a simple seek for operations like this. The number of pages being accessed alone implies that a scan is probably the better way to go.
Grant, it looks like there is a hard limit on the number of entries in an "IN" list where the optimizer goes from a seek to a scan. When this happens, the lookup values in the IN clause go from a loop over the constants to a Constant Scan. I've reproduced this behavior with many times more rows per key value, pulling 150K rows with a seek, and then suddenly, by adding one more entry to the lookup list, it goes for a scan.
September 21, 2015 at 8:45 am
Eirikur Eiriksson (9/21/2015)
Grant Fritchey (9/21/2015)
There are only 38k distinct values in a data set of 25 million. Add to that the fact that you're selecting a broad swath of those values. Assuming an even distribution, you have roughly 678 rows for each key value. Selecting 65 of them means you're pulling around 44k rows. That's only a small slice of the 25 million, but the optimizer just doesn't have enough information to go on, so it's going to use the All Density value and that's going to result in scans. That's not necessarily a bad thing since it's likely to be a range scan, but it's not going to be able to use a simple seek for operations like this. The number of pages being accessed alone implies that a scan is probably the better way to go.
Grant, it looks like there is a hard limit on the number of entries in an "IN" list where the optimizer goes from a seek to a scan. When this happens, the lookup values in the IN clause go from a loop over the constants to a Constant Scan. I've reproduced this behavior with many times more rows per key value, pulling 150K rows with a seek, and then suddenly, by adding one more entry to the lookup list, it goes for a scan.
Weird that it's a hard-coded value, but not that surprising in terms of behavior.
"The credit belongs to the man who is actually in the arena, whose face is marred by dust and sweat and blood"
- Theodore Roosevelt
Author of:
SQL Server Execution Plans
SQL Server Query Performance Tuning
September 22, 2015 at 4:05 am
I've tried a JOIN as well.
The issue appears to be only with the memory-optimized table: with the scan, the M-O table runs slower than the disk-based table.
When it uses an index seek, the M-O table is faster.
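A quick way to put numbers on that comparison (a sketch; Summaries_MO_disk is a hypothetical disk-based copy of the table, not something from this thread):
SET STATISTICS TIME ON;

-- Compare elapsed/CPU time for the same predicate against each version of the table.
SELECT COUNT(*) FROM dbo.Summaries_MO      WHERE zipcode IN ('33101', '33102', '33103');
SELECT COUNT(*) FROM dbo.Summaries_MO_disk WHERE zipcode IN ('33101', '33102', '33103');

SET STATISTICS TIME OFF;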
September 22, 2015 at 4:11 am
Grant Fritchey (9/21/2015)
There are only 38k distinct values in a data set of 25 million. ... ... The number of pages being accessed alone implies that a scan is probably the better way to go.
Unfortunately, SQL Server's choice is not the right one for the memory-optimized version of the table: the query goes from sub-second to 35 seconds as it scans all 25 million rows.
It might be okay if the query against the memory-optimized table would accept my hints, but so far it won't run at all if I include them.
September 22, 2015 at 4:11 am
There are a couple of references to the 64/65 split in this article by Paul White. I'm sure I've seen something more comprehensive somewhere, but the Google-force eludes me today.
September 22, 2015 at 4:19 am
ChrisM@Work (9/22/2015)
There are a couple of references to the 64/65 split in this article by Paul White. I'm sure I've seen something more comprehensive somewhere, but the Google-force eludes me today.
Thanks Chris!
October 13, 2015 at 10:42 am
I'm afraid this one was a Homer Simpson moment [doh!].
My issue turned out to be that one side of my comparison was nchar
and the other side was char.
I had been "playing with" the BIN collations versus the BIN2 collations (I'm sticking with BIN2), but in the process I somehow ended up comparing nchar to char. SQL Server silently did an implicit conversion (i.e. no warnings or errors of any kind), which in turn disabled the index.
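For anyone hitting the same thing, a sketch of the mismatch and the fix (the variable names are just illustrative):
DECLARE @zipN nchar(5) = N'63101';   -- nchar parameter vs. the table's char(5) column
SELECT COUNT(*)
FROM dbo.Summaries_MO
WHERE zipcode = @zipN;               -- the char column gets implicitly converted to nchar,
                                     -- which can keep the optimizer from seeking on ixZIP

DECLARE @zip char(5) = '63101';      -- match the column's type (and collation)
SELECT COUNT(*)
FROM dbo.Summaries_MO
WHERE zipcode = @zip;                -- straightforward seek on ixZIP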