TSQL Lookup using BETWEEN (Ip Addresses)

  • Jeff Moden (4/12/2010)


    On the subject of a non-clustered index in a similar order at work... it caused very frequent timeouts but, admittedly, that was on a table with a fair number of inserts. Still, putting the lower cardinality column at the end of the index still allowed seeks for the SELECTs without the extra page splits (extent splits in this case).

    Hmm. Not sure I really understand what you are trying to say here.

    But it sounds like you are talking about the order of the columns in the non-clustered index. Are you saying that the order should be IpNumber, CountryCode just because IpNumber has much higher cardinality than CountryCode?

    Since the index in this case is used to find (the very few) rows with CountryCode=null, it is vital that the index has CountryCode in the first column.

    But I am probably misunderstanding you. Perhaps you want to explain a bit more what you mean?

    /SG

  • r5d4 (4/12/2010)


    Anyone got any ideas to improve on this?

    My tables are below. For reference, my IPcountrylookup table has 108726 rows

    and I've recorded 22515504 unique IPs so far.

    I realise I'm a bit late here - and the original question has been well answered, but I'd like to mention just a couple of general points, if I may 🙂

    The lookup table seems to have one row per assigned IP range - so I am going to assume you have downloaded one of the publicly-available lists (the latest list I have contains 101,517 records). The record format can be trimmed, for the purpose of country lookup, by collapsing adjacent ranges for the same country. There are a large number of contiguous ranges per country that are represented by a number of records in the source data.

    I like to limit this process so that each class A address always has its own record, for reasons I'll come to in a second. Collapsing adjacent records results in 43,911 records - less than half the size. This sort of thing can become important where a large number of records must be looked up - such as when updating existing records based on a new version of the IP-to-country mapping.

    When dealing with large volumes (which you might be, in future, who knows) the limitation of using BETWEEN to find the country becomes clearer: the join on an inequality means that only a Loop Join can be used. With larger sets, we would probably look to do a HASH join - but that requires at least one equality predicate.

    That is why I like to collapse the IP ranges only as far as retaining at least one record per class A address - I can then use the class A part of the address as a sort of hash function - and the equality comparison then allows a HASH join. I store just the class A part of the address as a computed column in the look up table (not persisted, so it takes no space) and then create an index on it, and include the from & to ip range data. In your case the UPDATE query would look like this:

    UPDATE MIP
    SET countryCode = ICM.country_code
    FROM my.ipAddresses MIP
    JOIN dbo.IPtoCountryMap ICM
        ON MIP.ipNumber / 16777216 = ICM.classA -- initial 'hash' seek
        AND MIP.ipNumber BETWEEN ICM.from_ip and ICM.to_ip -- detail match after the hash match
    WHERE MIP.countryCode IS NULL;

    As a final, very small point - the index on countryCode could be filtered, since you're only ever interested in finding NULL country codes with it:

    CREATE NONCLUSTERED INDEX [IX my.ipAddresses countryCode IS NULL]
    ON my.ipAddresses (countryCode)
    INCLUDE (ipNumber)
    WHERE countryCode IS NULL;

  • Just in case anyone finds it useful, here's the code I used to compact the records:

    -- Compacted version of the full allocated-IPs table
    CREATE TABLE dbo.IPtoCountryMap
    (
        from_ip BIGINT NOT NULL,
        to_ip BIGINT NOT NULL,
        country_code CHAR(2) NOT NULL,
        classA AS ISNULL(CONVERT(TINYINT, from_ip / 16777216), 0)

        CONSTRAINT [PK dbo.IPtoCountryMap from_ip]
            PRIMARY KEY CLUSTERED (from_ip ASC)
    );
    GO

    WITH Ranges AS
    (
        -- Find each IP assignment record that starts a range
        -- (no other range ends just before this one starts)
        -- Limit the range search to a single class A range
        SELECT IC.from_ip,
               IC.to_ip,
               IC.country_code,
               root_id = ROW_NUMBER() OVER (ORDER BY IC.from_ip),
               recursion = 0
        FROM dbo.IpToCountry IC
        WHERE NOT EXISTS
        (
            -- No immediately preceding record within
            -- the same class A address
            SELECT *
            FROM dbo.IpToCountry IC2
            WHERE IC2.to_ip + 1 = IC.from_ip
            AND IC2.to_ip / 16777216 = IC.from_ip / 16777216
        )
        UNION ALL
        -- Recursive part finds all the IP assignments
        -- that follow on from the anchor record (no gap)
        -- within the same class A range
        SELECT IC.from_ip,
               IC.to_ip,
               IC.country_code,
               R.root_id,
               R.recursion + 1
        FROM Ranges R
        JOIN dbo.IpToCountry IC
            ON IC.from_ip = R.to_ip + 1
            AND IC.to_ip / 16777216 = R.from_ip / 16777216
    ),
    Groups AS
    (
        -- Identify sequential assignments
        -- using magic
        SELECT R.root_id,
               R.country_code,
               R.from_ip,
               R.to_ip,
               group_id = ROW_NUMBER() OVER (
                              PARTITION BY R.root_id
                              ORDER BY R.country_code, R.recursion)
                          - R.recursion
        FROM Ranges R
    ),
    ClassA_Ranges AS
    (
        -- Compact sequential groups into one record
        SELECT G.country_code,
               from_ip = MIN(G.from_ip),
               to_ip = MAX(G.to_ip)
        FROM Groups G
        GROUP BY G.root_id, G.country_code, G.group_id
    )
    INSERT dbo.IPtoCountryMap (from_ip, to_ip, country_code)
    SELECT CAR.from_ip,
           CAR.to_ip,
           CAR.country_code
    FROM ClassA_Ranges CAR
    OPTION (MAXRECURSION 0);
    GO

    -- Index on the computed column
    CREATE NONCLUSTERED INDEX [IX dbo.IPtoCountryMap classA (country_code, to_ip)]
    ON dbo.IPtoCountryMap (classA)
    INCLUDE (country_code, to_ip);
    GO

    The example 1,000 rows aren't enough to produce a hash join with the indexes shown, but you can force it.

  • Interesting concept.

    Unfortunately, if I am not missing something, it will not really work correctly.

    Consider the following consecutive rows in the IpToCountryMap table:

    from_ip to_ip country_code classA

    0x06000000 0x07394B1F US 6

    0x0A000000 0x0AFFFFFF NL 10

    If you try to look up IP number 0x07123456 with your code, the lookup will fail even though that value is within the US range above.

    The initial hash seek will look for an entry with classA=7, which will not be found at all, so the lookup will fail.

    Furthermore, I am not so sure that this method is any faster than the method I suggested, using range seeks limited to a certain blocksize.

    The problem with your method is that there are several classA-values that have a lot of entries, for example classA=212 has over 4500 different entries in the table.

    The hash search will only help us find the correct slot, finding the correct entry then requires a linear search.

    So, looking up ip-addresses that have classA=212 will require a linear search of on average 2000 rows. This will not be good for performance.

    To see how bad the problem is I ran the following query on my dataset:

    select classA, COUNT(*) as rows
    from IPtoCountryMap
    group by classA
    having count(*)>100
    order by 1

    result:

    classA  rows
    24      107
    62      2202
    64      253
    66      340
    69      117
    80      764
    81      569
    130     130
    134     120
    137     112
    138     118
    139     115
    146     147
    147     128
    158     114
    159     158
    161     122
    164     122
    192     3671
    193     1995
    194     4996
    195     6611
    196     134
    198     754
    199     528
    200     468
    202     908
    203     365
    204     482
    205     141
    206     169
    207     164
    209     216
    210     101
    212     4512
    213     2026
    216     394
    217     1711

    To improve performance you would need to use more than 8 bits for your hash lookup so each slot contains fewer entries, but this will make the problem of entries that span more than one hash value even worse.

    /SG

  • Stefan_G (4/14/2010)


    Unfortunately, if I am not missing something, it will not really work correctly.

    Consider the following consecutive rows in the IpToCountryMap table:

    Your US IP address range is 6.0.0.0 to 7.57.75.31 when decoded.

    This is not valid (in the source) since it spans a class A address, and the end of the range contains a class D element that is not 255.

    It would have to be (a minimum of) two ranges in the source file 6.0.0.0 to 6.255.255.255 and 7.0.0.0 to 7.57.75.255, for example.

    That's why I go to such great lengths to avoid compacting ranges over a class A address - it would break the rules 😉
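    For reference, here is a small sketch of the dotted-quad decoding being used in this exchange, assuming the usual mapping where ipNumber = a*16777216 + b*65536 + c*256 + d for an address a.b.c.d:

    DECLARE @ip BIGINT = CAST(0x07394B1F AS BIGINT);  -- the to_ip of the disputed US row
    SELECT CAST(@ip / 16777216    AS VARCHAR(3)) + '.' +
           CAST(@ip / 65536 % 256 AS VARCHAR(3)) + '.' +
           CAST(@ip / 256 % 256   AS VARCHAR(3)) + '.' +
           CAST(@ip % 256         AS VARCHAR(3)) AS dotted_quad;  -- returns 7.57.75.31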

    Furthermore, I am not so sure that this method is any faster than the method I suggested, using range seeks limited to a certain blocksize.

    You can see how a hash join might beat a loop join for larger inputs though? The 'hash' method doesn't produce a hash join until the row count gets big enough, but the loop join is pretty darn fast too - maybe comparable with yours, not sure since that isn't the case I wrote it for.

    I'm quite happy to test both - but I'm not sure exactly how your idea works, or what block size I should choose - can you briefly explain?

    The problem with your method is that there are several classA-values that have a lot of entries, for example classA=212 has over 4500 different entries in the table.

    Class A = 212 has 1028 entries in the compacted table on my test rig - much smaller than the 43,911 in the table as a whole.

    I feel confident that the class A hash will hold up well in testing.

    The hash search will only help us find the correct slot, finding the correct entry then requires a linear search.

    What makes you so sure that SQL Server uses a linear search when applying a residual predicate after a hash probe? 😀

    I know the class-A level hash isn't perfect, but I believe it helps (I tried a class AB hash, but it's impossible to guarantee the source data has a row per AB; it is, however, guaranteed not to span class A records per entry).

    So, looking up ip-addresses that have classA=212 will require a linear search of on average 2000 rows. This will not be good for performance.

    Hey I never claimed my idea was better than yours 😛

    I said the question had been well answered - I'm just exploring an alternative for large inputs. I think I mentioned that 🙁

    If you want to contribute positively here, you are more than welcome - and I'll play too.

    If you're going to get all defensive and stuff, then you're not :w00t:

    Paul

    edit: to fix that quote tag

  • Paul White NZ (4/14/2010)


    Hey I never claimed my idea was better than yours 😛

    I said the question had been well answered - I'm just exploring an alternative for large inputs. I think I mentioned that 🙁

    If you want to contribute positively here, you are more than welcome - and I'll play too.

    If you're going to get all defensive and stuff, then you're not :w00t:

    I apologize if you felt attacked by my previous post.

    I am not trying to be defensive about anything, I am just discussing the relative merits of two different methods to solve a quite interesting problem.

    I think exploring alternative methods is truly interesting and educational.

    It is obvious that my raw data is different from yours. The row with 6.0.0.0 to 7.57.75.31 is not made up; it is actually present in my dataset.

    I have no idea of the actual quality of my raw data though - it is just something I downloaded from a random site after googling for it.

    There are definitely no hard feelings from me here.

    /SG

  • Stefan_G (4/14/2010)


    I am not trying to be defensive about anything, I am just discussing the relative merits of two different methods to solve a quite interesting problem.

    Great! Because I agree - it is very interesting.

    It is obvious that my raw data is different from yours. The row with 6.0.0.0 to 7.57.75.31 is not made up; it is actually present in my dataset. I have no idea of the actual quality of my raw data though - it is just something I downloaded from a random site after googling for it.

    Hmmm - that is interesting.

    Hey look, I really need to go now (it's 2:20am) but I am going to look at it again tomorrow.

    I still can't get my head around a suitable block size for your earlier method (one that is guaranteed to work all the time) but it is late and I am past my peak for today 😀

    If you get a minute, could you give it some thought?

    Thanks!

    Paul

  • Paul White NZ (4/14/2010)


    I still can't get my head around a suitable block size for your earlier method (one that is guaranteed to work all the time) but it is late and I am past my peak for today 😀

    Here is a script that tests the two different methods (hash based and range based)

    -- Setup

    -- Generate numbers between @start and @end
    CREATE function [dbo].[GetRange](@start int, @end int) returns table as
    return
        select top (@end-@start+1) (row_number() over(order by (select null)))-1+@start as n
        from sys.columns c1
        cross join sys.columns c2
        cross join sys.columns c3
    GO

    -- Create a table to hold the ipnumbers we want to look up
    IF EXISTS (SELECT * FROM sys.objects WHERE object_id = OBJECT_ID(N'[my].[ipAddresses]') AND type in (N'U'))
        DROP TABLE [my].[ipAddresses]
    GO

    CREATE TABLE [my].[ipAddresses](
        [ipNumber] [bigint] NOT NULL,
        [countryCode] [char](2) NULL,
        [c2] [char](2) NULL,
        CONSTRAINT [PK_ipAddresses_1] PRIMARY KEY CLUSTERED
        (
            [ipNumber] ASC
        )
    ) ON [PRIMARY]

    -- Create a dataset with random ip-addresses in the 217.0.0.0 area - this range is very common in Europe.
    -- Run the following statement manually a few times until you have 10000 rows.
    -- If you get a PK violation, just try again.
    insert into my.ipaddresses (ipnumber)
    select top 1000 cast(217 as bigint)*0x1000000+(abs(CHECKSUM(newid()))%0x1000000)
    from sys.all_columns c1, sys.all_columns c2

    select COUNT(*) from my.ipAddresses

    -- Add some more completely random addresses
    insert into my.ipaddresses (ipnumber)
    select top 10000 CHECKSUM(newid())
    from sys.all_columns c1, sys.all_columns c2

    -- Make sure there are no blocks in the lookup data that span a 64K boundary
    -- Split any block that spans a 64K boundary
    -- Change the constant 0x10000 if you want to try another blocksize
    IF EXISTS (SELECT * FROM sys.objects WHERE object_id = OBJECT_ID(N'IPCountryMap2') AND type in (N'U'))
        DROP TABLE IPCountryMap2
    GO

    select
        case when from_ip > n*0x10000 then from_ip else n*0x10000 end as from_ip,
        case when to_ip < (n+1)*0x10000-1 then to_ip else (n+1)*0x10000-1 end as to_ip,
        country_code
    into IPCountryMap2
    from IPtoCountryMap cross apply dbo.GetRange(from_ip/0x10000, to_ip/0x10000)

    CREATE CLUSTERED INDEX CL ON IPCountryMap2(from_ip)

    -- Perform update of all rows using the hash-method
    SET STATISTICS TIME ON

    update my.ipAddresses
    set c2=(
        select country_code
        from IPtoCountryMap
        where classA = ipNumber/0x1000000
        and from_ip <= ipNumber
        and to_ip >= ipNumber
    )
    GO
    --SQL Server Execution Times:
    -- CPU time = 4649 ms, elapsed time = 4696 ms.

    -- Perform update of all rows using the range-method
    SET STATISTICS TIME ON

    update my.ipAddresses
    set countryCode=(
        select country_code
        from IPCountryMap2
        where from_ip <= ipNumber
        and from_ip > ipNumber - 0x10000 -- The same blocksize as above
        and to_ip >= ipNumber
    )
    --SQL Server Execution Times:
    -- CPU time = 250 ms, elapsed time = 263 ms

    -- Verify that the two methods give the same result
    select CAST(ipNumber as binary(4)), c2, countryCode
    from my.ipAddresses
    where isnull(c2,'')<>isnull(countryCode,'')
    -- Should return nothing, but with my dataset I get lots of rows here where the hash-lookup
    -- failed because of rows that span multiple ClassA entries.

    As you can see for this test the hash-method took 4649 ms and the range-method took 250 ms.

    So for this test the range-method is much faster.

    Note that I have deliberately chosen to work with a range of ip-numbers that cause problems for the hash-method.

    The block where ClassA=217 contains 1711 rows in my dataset, which means that the amount of linear scanning needed is very large.

    The dataset I used can be found here: http://www.canadiancontent.net/tech/download/IP-Country_mapping_Database.html

    (I have no idea of the quality of this data)

    On the other hand, I believe that ClassA=217 is a very common range for European ip-numbers (my current address is in that range, for example), so I also think it is a fairly realistic test.

    Some notes:

    * It is very important to split up the ranges so there are no ranges in the table with a size larger than 64K. This is done by the code that creates IPCountryMap2

    * The hash-method does not use an actual HASH join on my machine.

    * If desired it is possible to use a different blocksize than 64K - just change the constant 0x10000 to something else in the script to try different sizes. Smaller blocks give a much larger lookup table but should increase performance slightly.

    /SG

  • Paul White NZ (4/14/2010)


    Your US IP address range is 6.0.0.0 to 7.57.75.31 when decoded.

    This is not valid (in the source) since it spans a class A address, and the end of the range contains a class D element that is not 255.

    It would have to be (a minimum of) two ranges in the source file 6.0.0.0 to 6.255.255.255 and 7.0.0.0 to 7.57.75.255, for example.

    That's why I go to such great lengths to avoid compacting ranges over a class A address - it would break the rules 😉

    Just a few more comments before I shut up.

    I think the concept of Class A-E of ip-addresses is actually obsolete. Nowadays they use something called Classless Inter-Domain Routing instead. See: http://en.wikipedia.org/wiki/Classless_Inter-Domain_Routing

    See http://en.wikipedia.org/wiki/Classful_network for more info about the old system.

    If I understand this correctly this means that there is nothing strange about an ip-range that spans several different Class A networks - or several Class B or Class C networks for that matter.

    What you call a ClassA address is really just the most significant 8 bits of the ip-number.

    There is no special reason why you should only use 8 bits in the hash-method. You could just as easily use 12 bits or 16 bits - all you have to do is to split the rows in the lookup table just like I did in the range-method.

    By increasing the number of bits you can get a smaller amount of rows in each hash-bucket and increase the lookup performance.
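    To make that concrete, here is a small sketch of how the bucket key changes with the number of bits used (the constants are just powers of two; the sample address is an arbitrary value, not from either dataset):

    DECLARE @ipNumber BIGINT = CAST(0xD9123456 AS BIGINT);  -- 217.18.52.86, an arbitrary example
    SELECT @ipNumber / 0x1000000 AS bucket_8bit,   -- 217  (the 'classA' value)
           @ipNumber / 0x100000  AS bucket_12bit,  -- 3473
           @ipNumber / 0x10000   AS bucket_16bit;  -- 55570 (the 64K block used in the range-method)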

    I believe that if one uses the same blocksize the two methods will have almost identical performance.

    /SG

  • Stefan_G (4/14/2010)


    Here is a script that tests the two different methods (hash based and range based)

    Thanks - that clarifies the details of the 64K range idea nicely 🙂

    Of course, I have some observations...(not criticisms)

    The number of IP addresses is too small to benefit from a hash join - 20,000 IPs just isn't enough (compared to the size of the lookup table) to make the loop expensive enough. Putting half the sample IPs in the 'difficult' range is a bit unfair too.

    The correlated subquery form of the update statement makes it impossible to use a hash join - a hashable query equivalent would look something like:

    UPDATE M
    SET c2 = CM.country_code
    FROM my.ipAddresses M
    JOIN IPtoCountryMap CM
        ON classA = ipNumber / 0x1000000
        AND M.ipNumber BETWEEN CM.from_ip AND CM.to_ip;

    Though this does highlight a problem - the result of the hash join needs a distinct before the update, since SQL Server has no way of knowing that ranges never overlap.
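    One way to express that (only a sketch, not necessarily the form timed below) is to collapse the join result to a single row per IP address before the update; MAX is safe as a tie-breaker here only because the ranges never actually overlap:

    UPDATE M
    SET c2 = X.country_code
    FROM my.ipAddresses M
    JOIN (
        -- collapse the hash-join result to at most one row per IP address
        SELECT A.ipNumber, MAX(CM.country_code) AS country_code
        FROM my.ipAddresses A
        JOIN IPtoCountryMap CM
            ON CM.classA = A.ipNumber / 0x1000000
            AND A.ipNumber BETWEEN CM.from_ip AND CM.to_ip
        GROUP BY A.ipNumber
    ) X
        ON X.ipNumber = M.ipNumber;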

    The results I got when running your test rig with this alternative construction were 2312ms for the hash, and 250ms for the loop-range.

    Paul

  • Stefan_G (4/14/2010)


    I think the concept of Class A-E of ip-addresses is actually obsolete.

    Agreed. I was using 'class A' as a quick way to refer to the first octet - nothing more.

    If I understand this correctly this means that there is nothing strange about an ip-range that spans several different Class A networks

    Sure. The data set I have just happens to be split that way.

    There is no special reason why you should only use 8 bits in the hash-method. You could just as easily use 12 bits or 16 bits - all you have to do is to split the rows in the lookup table just like I did in the range-method.

    Yep. The important point about the hash idea is that there needs to be at least one equality predicate - I'm not wedded to the 'high 8-bits' idea - it was simply convenient. The general point is that I think that pre-processing the country map ranges to allow an equality is likely to be optimal for large input sets.

    I believe that if one uses the same blocksize the two methods will have almost identical performance.

    This is the part that intrigues me: The loop-block method has only inequalities, forcing a loop join. My idea is to test the assertion that a hash will be more efficient for large numbers of IPs.

    Paul

  • I just thought of another even more efficient way of solving this problem.

    The problem with both your hash-method and my range-method is that they use a fast method to find a group of possibly matching values, then a linear search in the group to find the single value that really matches.

    The only reason my method is faster in my tests is because my groups are smaller. The hash-method could be modified to give smaller groups which would probably make it slightly faster, but at the cost of generating a much larger lookup-table.

    But, there is a simple way to make sure that only a single index-seek is required for each lookup.

    The idea is to build a table that splits the full range of ip-numbers into parts. Some parts will have a defined country while others will have XX to indicate that they are invalid or at least unknown.

    So, I want to transform a table like this:

    from_ip     to_ip       country_code
    33996344    33996351    GB
    50331648    83886079    US
    94585424    94585439    SE

    Into a table like this:

    from_ip     to_ip       country_code
    00000000    33996343    XX
    33996344    33996351    GB
    33996352    50331647    XX
    50331648    83886079    US
    83886080    94585423    XX
    94585424    94585439    SE

    With a table like that it is very fast to find the country code for an ip-number with a single indexed lookup on from_ip. No linear searching is needed, and no splitting of the original table in smaller blocks is necessary.

    Size of the lookup-table after applying the different transforms:

    new_method block_split original

    45163 58734 38270

    So, the new method generates fewer extra rows than the block-split method.

    This script demonstrates the new method:

    --
    -- Build the expanded table
    --
    IF EXISTS (SELECT * FROM sys.objects WHERE object_id = OBJECT_ID(N'IPCountryMap3') AND type in (N'U'))
        DROP TABLE IPCountryMap3
    go

    with t1 as (
        select
            from_ip,
            to_ip,
            country_code,
            isnull((select min(m.from_ip) from iptocountrymap m where m.from_ip > m1.from_ip),0xffffffff) as next_ip
        from iptocountrymap m1
    )
    select *
    into IPCountryMap3
    from (
        -- all original rows
        select from_ip, to_ip, country_code
        from iptocountrymap
        union all
        -- fill gaps between original rows and after the last row
        select to_ip+1, next_ip-1, 'XX'
        from t1
        where next_ip > to_ip+1
        union all
        -- a single row that covers the range from 0 to the first valid address
        select 0, min(from_ip)-1, 'XX'
        from iptocountrymap
    ) t

    -- create an index used for lookup
    create index ix1 on IPCountryMap3(from_ip) include(country_code)

    --
    -- Use the expanded table to perform the update
    --
    update my.ipAddresses
    set countryCode=(select top 1 country_code from IPCountryMap3 icm where icm.from_ip <= ipNumber order by icm.from_ip desc)
    -- SQL Server Execution Times:
    -- CPU time = 249 ms, elapsed time = 269 ms.

    --
    -- Test the performance of the join without an update to make the difference more obvious
    --

    -- old block-range method
    select MAX(country_code) from (
        select ipNumber,
               (select country_code from IPCountryMap2 icm
                where from_ip <= ipNumber and from_ip > ipNumber - 0x10000 and to_ip >= ipNumber) as country_code
        from my.ipAddresses a
    ) t
    -- SQL Server Execution Times:
    -- CPU time = 187 ms, elapsed time = 190 ms.

    -- new single lookup method
    select max(country_code) from (
        select ipNumber,
               (select top 1 country_code from IPCountryMap3 icm
                where icm.from_ip <= a.ipNumber order by icm.from_ip desc) as country_code
        from my.ipAddresses a
    ) t
    -- SQL Server Execution Times:
    -- CPU time = 93 ms, elapsed time = 140 ms.

    As you can see, the new method is even more efficient than any of the old methods.

    Putting half the sample IPs in the 'difficult' range is a bit unfair too.

    Maybe, but there are many "difficult" ranges, and they are the ranges that are actively used today to define new addresses. I think that if you actually look at addresses used on the internet by normal end users, you will find that most ip-numbers that are not from the US come from "difficult" ranges.

    But of course, it depends on what you are doing. If the majority of your users are from the US it is possible that most numbers will come from "simple" ranges.

    I actually think I am being nice by putting half the numbers in "easy" ranges. 🙂

    Also, of course, I know that the hash-method would work perfectly in easy ranges. It is only in difficult ranges it has any problems. So I need to do it this way to prove my point. 😉

    /SG

  • OK, I obviously cannot stop thinking about this problem.

    I just thought of another really simple and very fast solution.

    Just write the update statement like this:

    update a
    set CountryCode = case when a.ipNumber <= t.to_ip then t.country_code else 'XX' end
    from my.ipAddresses a
    cross apply (select top 1 * from ipcountry2 icm where icm.from_ip <= a.ipNumber order by icm.from_ip desc) t

    This version works immediately - no need for any preprocessing of the ipcountry table.

    This version is also just as fast as the fastest of all the other methods.

    Amazing that a single problem can be solved in so many different ways.

    Also amazing that the simplest solution is actually also the fastest.

    I also find it interesting that the CROSS APPLY operator can be used in this way. In BOL they only talk about it for calling Table-Valued functions.

    /SG

  • Stefan_G (4/15/2010)


    I just thought of another even more efficient way of solving this problem.

    You don't even need to go to the trouble of creating the invalid/unknown groups - just use an OUTER APPLY instead of CROSS APPLY.
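    As a rough sketch of that idea (an interpretation only, not Paul's posted code, and it assumes the un-filled iptocountrymap table from the earlier posts), the CROSS APPLY query above could become:

    UPDATE a
    SET countryCode = CASE WHEN a.ipNumber <= t.to_ip THEN t.country_code ELSE 'XX' END
    FROM my.ipAddresses a
    OUTER APPLY (
        SELECT TOP (1) icm.to_ip, icm.country_code
        FROM iptocountrymap icm
        WHERE icm.from_ip <= a.ipNumber
        ORDER BY icm.from_ip DESC
    ) t;
    -- Addresses below the first range, or inside a gap, fall through to 'XX' (they could equally be left NULL).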

    Both APPLY solutions are pretty optimal for the original problem - no question about that. The only real issue with using APPLY here, is that it is correlated, and so forces the join order, and the use of nested loops to do the join. This becomes less than optimal for large sets of IP addresses, since we end up performing a look-up (no matter how efficient) per IP address.

    The IP address set I have been playing around with contains one million randomly-generated addresses. Being random was the fairest way I could think of to test this.

    Surprisingly, the original loop-join method turns out to be not so bad. Rather than the IP Address table driving the join, the optimiser flips things around so that the allocated ranges drive a loop join against the (now much larger) IP address table. It actually helps a fair bit to remove the 'from_ip > ipNumber - 0x10000' optimisation as soon as this flip-around occurs.

    The hash-method could be modified to give smaller groups which would probably make it slightly faster, but at the cost of generating a much larger lookup-table.

    This turns out to be true. I found that a group size of 16384 worked well, generating a look-up table with 250,203 rows (roughly double that of the original).
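    For what it's worth, a table like that could be produced by reusing Stefan's splitting query with the smaller constant. This is just a sketch under that assumption; the IPCM name simply matches the statistics output below, and the actual build Paul used is not shown:

    SELECT CASE WHEN from_ip > n*0x4000 THEN from_ip ELSE n*0x4000 END AS from_ip,
           CASE WHEN to_ip < (n+1)*0x4000-1 THEN to_ip ELSE (n+1)*0x4000-1 END AS to_ip,
           country_code
    INTO IPCM
    FROM IPtoCountryMap
    CROSS APPLY dbo.GetRange(from_ip/0x4000, to_ip/0x4000);  -- 0x4000 = 16384, the smaller group size

    CREATE CLUSTERED INDEX CL ON IPCM (from_ip);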

    Testing with the million-row table, I tried MERGE and HASH joins, the fastest version of the loop-join method (flipped) and the APPLY. You'll have to take it on faith at this stage that I did everything I could think of with indexes and code optimisations to get the best out of all the methods.

    Results:

    -- Merge
    Table 'IPCM'. Scan count 1, logical reads 966
    Table 'ipAddresses'. Scan count 1, logical reads 2238
    Table 'Worktable'. Scan count 6540, logical reads 131981
    CPU time = 1188 ms, elapsed time = 1370 ms.

    -- Hash
    Table 'IPCM'. Scan count 1, logical reads 874
    Table 'ipAddresses'. Scan count 1, logical reads 2238
    CPU time = 968 ms, elapsed time = 1136 ms.

    -- Loop
    Table 'IPCM'. Scan count 1, logical reads 874
    Table 'ipAddresses'. Scan count 250203, logical reads 1051465
    CPU time = 2140 ms, elapsed time = 2239 ms.

    -- Apply
    Table 'ipAddresses'. Scan count 1, logical reads 2238
    Table 'IPCM'. Scan count 999752, logical reads 2999256
    CPU time = 4000 ms, elapsed time = 4124 ms.

    So, it's not like the CPU or elapsed time differences are hugely dramatic - but check out the logical reads!

    I also find it interesting that the CROSS APPLY operator can be used in this way. In BOL they only talk about it for calling Table-Valued functions.

    Heh. So I hope you read part one of my article on Monday: http://www.sqlservercentral.com/articles/APPLY/69953/

    Anyway, this whole problem continues to fascinate me. I'm not 100% happy with any of the methods found so far - though the APPLY is clearly optimal for small sets, and the grouped mapping table is an excellent improvement for the hash/merge approach. It bugs me that there is no easy way to perform a proper interval-join in SQL Server.

    Could you possibly post the actual SQL used in your tests?

    You don't have to post a full repro script, but it would be interesting to see the actual queries that you were timing.

    I am not sure if you were performing an update, or if you were just testing the join with some other kind of query.

    Thanks

    /SG
