Correct index strategy for BETWEEN

Question

Post reply

Correct index strategy for BETWEEN

Jakub.Janda

Default port

Points: 1439
More actions
March 23, 2015 at 9:40 am

#315974

Two tables:
GeoIpLocation (Id, City, Country, Latitude, Longitude; PK: Id; 2m rows)
GeoIpBlock (IpFrom, IpTo, GeoIpLocation_Id; PK: IdFrom, IdTo; 100k rows)
Simple concept: get location for given IP. So:
SELECTTOP 1
GeoIpLocation.Country,
GeoIpLocation.City
FROMGeoIPLocation
INNER JOIN GeoIPBlock
ON GeoIPBlock.GeoIpLocation_Id = GeoIPLocation.Id
WHERE@nIpNumber BETWEEN GeoIPBlock.IpFrom AND GeoIPBlock.IpTo
Result: disaster.
The between operator uses the index on PK on GeoIpBlock, which results on half the table for first part of between and half the table for the second part of between. A bit better query is with FORCESCAN on GeoIPBlock, but it still runs slow, as it scans a lot of records.
Question: Is there a better way to index this kind of data?

Viewing 15 posts - 1 through 15 (of 18 total)

You must be logged in to reply to this topic. Login to reply

roger.price-1150775 Default port Points: 1458 More actions · Answer 1

It would be good to post the table and current index creation scripts.

This may not be the final solution but consider working on the query in two parts.

SELECT TOP 1 @ID = GeoIpLocation_Id FROM GeoIPBlock WHERE @nIpNumber BETWEEN GeoIPBlock.IpFrom AND GeoIPBlock.IpTo

SELECTTOP 1

GeoIpLocation.Country,

GeoIpLocation.City

FROMGeoIPLocation

WHERE GeoIPLocation.Id = @ID

The first part looks like it will result in a table scan but if you alter it and there is an index on IPfrom thenyou may be able to get closer

WHERE GeoIPBlock.IpFrom >= @nIpNumber AND GeoIPBlock.IpTo <= @nIpNumber

Try create a non clustered index on IPFrom and IpTo and enure the ID column is in that index. That way it should see to the IPFrom, scan until it hits the first row meeting the IPTo and grab the ID, all in the one index structure.

CREATE NONCLUSTERED INDEX IPFromTo ON GeoIPBlock(IpFrom, IPTo, ID)

The second part of the query should be quick as it's looking for one row based on @ID.

serg-52 SSCrazy Eights Points: 9913 More actions · Answer 2

serg-52

SSCrazy Eights

Points: 9913

March 24, 2015 at 3:39 am

#1785896

Have you tried index on IpTo ?

Phil Parkin SSC Guru Points: 247227 More actions · Answer 3

We really need to see the DDL for the table and its indexes.

Also, can you explain what you mean by this:

..which results on half the table for first part of between and half the table for the second part of between

Grant Fritchey SSC Guru Points: 399379 More actions · Answer 4

Please post at least the execution plan so we can see how the optimizer is resolving the query based on your structures.

"The credit belongs to the man who is actually in the arena, whose face is marred by dust and sweat and blood"
- Theodore Roosevelt

Author of:
SQL Server Execution Plans
SQL Server Query Performance Tuning

Jakub.Janda Default port Points: 1439 More actions · Answer 5

Ok, I am sorry; here are the execution plans. The first one is without any hints, the second is with FORCESCAN on GeoIpBlock. FORCE ORDER hint produces the same plan, btw.

The first query takes on average about 400ms, the second one about 200 ms.

Phil Parkin SSC Guru Points: 247227 More actions · Answer 6

Actual rows and Est rows are very different. Are your statistics up to date?

ChrisM@Work SSC Guru Points: 186127 More actions · Answer 7

Phil Parkin (3/24/2015)
Actual rows and Est rows are very different. Are your statistics up to date?

Agreed. We don't have index definitions yet either:

SELECT TOP 1

l.fkTblCountry,

l.sCity

FROM tblGeoIPLocation l

INNER JOIN tblGeoIPBlock b

ON b.fkTblGeoIPLocation = l.pkTblGeoIPLocation

WHERE @nIpNumber BETWEEN b.nIpFrom AND b.nIpTo

Try these indexes -

tblGeoIPLocation: index on pkTblGeoIPLocation include fkTblCountry, sCity

tblGeoIPBlock: index on nIpFrom, nIpTo, fkTblGeoIPLocation

^{“Write the query the simplest way. If through testing it becomes clear that the performance is inadequate, consider alternative query forms.” - Gail Shaw}

For fast, accurate and documented assistance in answering your questions, please read this article.
Understanding and using APPLY, (I) and (II) Paul White
Hidden RBAR: Triangular Joins / The "Numbers" or "Tally" Table: What it is and how it replaces a loop Jeff Moden

Grant Fritchey SSC Guru Points: 399379 More actions · Answer 8

Phil Parkin (3/24/2015)
Actual rows and Est rows are very different. Are your statistics up to date?

Yeah, estimated 1, actual 87000, that's a slight disparity. No wonder it assumed a scan would work well. It only thought it was returning a single row. I suspect you have auto_updatestats disabled. You might want to turn that on and get some statistics maintenance in place.

Without the statistics, the optimizer will make very bad choices. Also, the fact that the plan shows a runtime value, but not a compile time value for @nlpNumber, I suspect that's a local variable, not a parameter. Once you have statistics, you may be better off with a parameter so that the optimizer can use specific values to retrieve the data.

As to your index, I'd make it a compound key on nIpFrom and nIpTo with an INCLUDE of [Alzasoft].[dbo].[tblGeoIPLocation].pkTblGeoIPLocation, [Alzasoft].[dbo].[tblGeoIPLocation].fkTblCountry, [Alzasoft].[dbo].[tblGeoIPLocation].sCity. At least as an experiment to test things out. You may get index intersection on this, so two indexes, one for nIpFrom and one for nIpTo might work better. Testing is your friend.

I didn't look at the plans with hints. You're going after hints when the statistics and the structure is off. That's not how hints should be used. They should be the last thing tried when we've exhausted all the standard approaches.

"The credit belongs to the man who is actually in the arena, whose face is marred by dust and sweat and blood"
- Theodore Roosevelt

Author of:
SQL Server Execution Plans
SQL Server Query Performance Tuning

ChrisM@Work SSC Guru Points: 186127 More actions · Answer 9

Hey Grant, in the Plan1 execution plan there's a NLIJ with "Expr1004" in the Outer References list. Any idea what the "Expr1004" is for and where it was constructed?

^{“Write the query the simplest way. If through testing it becomes clear that the performance is inadequate, consider alternative query forms.” - Gail Shaw}

For fast, accurate and documented assistance in answering your questions, please read this article.
Understanding and using APPLY, (I) and (II) Paul White
Hidden RBAR: Triangular Joins / The "Numbers" or "Tally" Table: What it is and how it replaces a loop Jeff Moden

Jakub.Janda Default port Points: 1439 More actions · Answer 10

Auto update for statistics is not disabled; they are even manually updated in a nightly maintenance job, however, the query does not want to use the PK index. I guess it has something to do with a fact that it's not a <column> BETWEEN <value1> AND <value2>, but it's <value> BETWEEN <column1> AND <column2>.

Grant Fritchey SSC Guru Points: 399379 More actions · Answer 11

ChrisM@Work (3/24/2015)
Hey Grant, in the Plan1 execution plan there's a NLIJ with "Expr1004" in the Outer References list. Any idea what the "Expr1004" is for and where it was constructed?

I'm not sure. It doesn't appear elsewhere in the plan. I searched the XML.

"The credit belongs to the man who is actually in the arena, whose face is marred by dust and sweat and blood"
- Theodore Roosevelt

Author of:
SQL Server Execution Plans
SQL Server Query Performance Tuning

Grant Fritchey SSC Guru Points: 399379 More actions · Answer 12

Jakub.Janda (3/24/2015)
Auto update for statistics is not disabled; they are even manually updated in a nightly maintenance job, however, the query does not want to use the PK index. I guess it has something to do with a fact that it's not a <column> BETWEEN <value1> AND <value2>, but it's <value> BETWEEN <column1> AND <column2>.

Looking at the plan again, you'll note that the BETWEEN operation is being satisfied by the [Alzasoft].[dbo].[tblGeoIPBlock].[IX_tblGeoIPBlock_Location] index. The problem is the scan of the other table, tblGeoIPLocation. And that's because the statistics are radically off there. The estimated number of rows is 1. It's scanning 87000. That is the problem. Not the BETWEEN operation or the other index. That part of your structure is working fine. Try UPDATE STATISTICS WITH FULL SCAN on that primary key. See if that changes things. Your statistics are WAY off in there.

"The credit belongs to the man who is actually in the arena, whose face is marred by dust and sweat and blood"
- Theodore Roosevelt

Author of:
SQL Server Execution Plans
SQL Server Query Performance Tuning

roger.price-1150775 Default port Points: 1458 More actions · Answer 13

Here is an alternative to the between approach but it depends on what the data looks like in GepIPBlock.

If GepIPBlock.IpFrom and IPTo represent range of ip's or subnets like 123.123.123.1 to 123.123.123.255 then you may be able to take a @nIPNumber like 123.123.123.54, strip the .54 off it and search GEOIpBlock where GepIPBlock.IpFrom LIKE '123.123.123.*' which should use a index seek.

If GeopIPBlock has multiple rows fro the same subnet 123.123.123.??? then you could strip one character off the @nIPNumber at a time until you hit the row. This will be a small number of seeks but may be quicker than a scan.

roger.price-1150775 Default port Points: 1458 More actions · Answer 14

As per previous post, This test reduces reads to 3 Logical reads.....

DROP TABLE TEST

GO

CREATE TABLE TEST (ID INT IDENTITY(1,1), IPFROM VARCHAR(120), IPTO VARCHAR(120))

GO

INSERT INTO TEST(IPFROM) VALUES (1)

GO -- Add one row to test table

INSERT INTO TEST(IPFROM) SELECT IPFROM FROM TEST

GO 15 -- Add loads more rows

SELECT COUNT(1) FROM TEST

-- CONVERT Identity Column to something that looks like an IP Range

UPDATE TEST SET IPFROM = (ID*100)+100000000000, IPTO = (ID*100)+100000000099

UPDATE TEST SET IPFROM = SUBSTRING(Ipfrom,1,3)+'.'+SUBSTRING(ipfrom,4,3)+'.'+SUBSTRING(ipfrom,7,3)+'.'+SUBSTRING(ipfrom,10,3)

UPDATE TEST SET IPTO = SUBSTRING(Ipto,1,3)+'.'+SUBSTRING(ipto,4,3)+'.'+SUBSTRING(ipto,7,3)+'.'+SUBSTRING(ipto,10,3)

SELECT TOP 1000 * FROM TEST ORDER BY ID DESC

CREATE NONCLUSTERED INDEX TEST_IPFROM ON TEST(ipfrom, ipto, id)

set statistics io on

select * from test where '100.003.276.001' BETWEEN ipfrom and ipto

select * from test where ipfrom LIKE '100.003.276.%' AND '100.003.276.001' BETWEEN ipfrom and ipto