April 13, 2010 at 1:01 am
Jeff Moden (4/12/2010)
On the subject of a non-clustered index in a similar order at work... it caused very frequent timeouts but, admittedly, that was on a table with a fair number of inserts. Still, putting the lower-cardinality column at the end of the index allowed seeks for the SELECTs without the extra page splits (extent splits in this case).
Hmm. Not sure I really understand what you are trying to say here.
But it sounds like you are talking about the order of the columns in the non-clustered index. Are you saying that the order should be IpNumber, CountryCode just because IpNumber has much higher cardinality than CountryCode?
Since the index in this case is used to find (the very few) rows with CountryCode=null, it is vital that the index has CountryCode as its first column.
But I am probably misunderstanding you. Perhaps you could explain a bit more what you mean?
/SG
April 14, 2010 at 2:41 am
r5d4 (4/12/2010)
Anyone got any ideas to improve on this? My tables are below. For reference, my IPcountrylookup table has 108726 rows
and I've recorded 22515504 unique IPs so far.
I realise I'm a bit late here - and the original question has been well answered, but I'd like to mention just a couple of general points, if I may 🙂
The lookup table seems to have one row per assigned IP range - so I am going to assume you have downloaded one of the publicly-available lists (the latest list I have contains 101,517 records). The record format can be trimmed, for the purpose of country lookup, by collapsing adjacent ranges for the same country - the source data contains a large number of contiguous ranges for the same country, each stored as a separate record.
I like to limit this process so that each class A address always has its own record, for reasons I'll come to in a second. Collapsing adjacent records results in 43,911 records - less than half the size. This sort of thing can become important where a large number of records must be looked up - such as when updating existing records based on a new version of the IP-to-country mapping.
When dealing with large volumes (which you might be, in future, who knows) the limitation of using BETWEEN to find the country becomes clearer: the join on an inequality means that only a Loop Join can be used. With larger sets, we would probably look to do a HASH join - but that requires at least one equality predicate.
That is why I like to collapse the IP ranges only as far as retaining at least one record per class A address - I can then use the class A part of the address as a sort of hash function - and the equality comparison then allows a HASH join. I store just the class A part of the address as a computed column in the look up table (not persisted, so it takes no space) and then create an index on it, and include the from & to ip range data. In your case the UPDATE query would look like this:
UPDATE MIP
SET countryCode = ICM.country_code
FROM my.ipAddresses MIP
JOIN dbo.IPtoCountryMap ICM
ON MIP.ipNumber / 16777216 = ICM.classA -- initial 'hash' seek
AND MIP.ipNumber BETWEEN ICM.from_ip and ICM.to_ip -- detail match after the hash match
WHERE MIP.countryCode IS NULL;
As a final, very small point - the index on countryCode could be filtered, since you're only ever interested in finding NULL country codes with it:
CREATE NONCLUSTERED INDEX
[IX my.ipAddresses countryCode IS NULL]
ON my.ipAddresses (countryCode)
INCLUDE (ipNumber)
WHERE countryCode IS NULL;
April 14, 2010 at 3:11 am
Just in case anyone finds it useful, here's the code I used to compact the records:
-- Compacted version of the full allocated-IPs table
CREATE TABLE dbo.IPtoCountryMap
(
from_ip BIGINT NOT NULL,
to_ip BIGINT NOT NULL,
country_code CHAR(2) NOT NULL,
classA AS
ISNULL(CONVERT(TINYINT, from_ip / 16777216), 0)
CONSTRAINT [PK dbo.IPtoCountryMap from_ip]
PRIMARY KEY CLUSTERED (from_ip ASC)
);
GO
WITH Ranges
AS (
-- Find each IP assignment record that starts a range
-- (no other range ends just before this one starts)
-- Limit the range search to a single class A range
SELECT IC.from_ip,
IC.to_ip,
IC.country_code,
root_id = ROW_NUMBER() OVER (ORDER BY IC.from_ip),
recursion = 0
FROM dbo.IpToCountry IC
WHERE NOT EXISTS
(
-- No immediately preceding record within
-- the same class A address
SELECT *
FROM dbo.IpToCountry IC2
WHERE IC2.to_ip + 1 = IC.from_ip
AND IC2.to_ip / 16777216 = IC.from_ip / 16777216
)
UNION ALL
-- Recursive part finds all the IP assignments
-- that follow on from the anchor record (no gap)
-- within the same class A range
SELECT IC.from_ip,
IC.to_ip,
IC.country_code,
R.root_id,
R.recursion + 1
FROM Ranges R
JOIN dbo.IpToCountry IC
ON IC.from_ip = R.to_ip + 1
AND IC.to_ip / 16777216 = R.from_ip / 16777216
),
Groups
AS (
-- Identify sequential assignments
-- using magic
SELECT R.root_id,
R.country_code,
R.from_ip,
R.to_ip,
group_id =
ROW_NUMBER()
OVER (
PARTITION BY R.root_id
ORDER BY R.country_code, R.recursion)
- R.recursion
FROM Ranges R
),
ClassA_Ranges
AS (
-- Compact sequential groups into one record
SELECT G.country_code,
from_ip = MIN(G.from_ip),
to_ip = MAX(G.to_ip)
FROM Groups G
GROUP BY
G.root_id,
G.country_code,
G.group_id
)
INSERT dbo.IPtoCountryMap
(
from_ip,
to_ip,
country_code
)
SELECT CAR.from_ip,
CAR.to_ip,
CAR.country_code
FROM ClassA_Ranges CAR
OPTION (MAXRECURSION 0);
GO
-- Index on the computed column
CREATE NONCLUSTERED INDEX [IX dbo.IPtoCountryMap classA (country_code, to_ip)]
ON dbo.IPtoCountryMap (classA)
INCLUDE (country_code, to_ip);
GO
The example 1,000 rows aren't enough to produce a hash join with the indexes shown, but you can force it.
April 14, 2010 at 5:13 am
Interesting concept.
Unfortunately, if I am not missing something, it will not really work correctly.
Consider the following consecutive rows in the IpToCountryMap table:
from_ip to_ip country_code classA
0x06000000 0x07394B1F US 6
0x0A000000 0x0AFFFFFF NL 10
If you try to look up IP number 0x07123456 with your code, the lookup will fail even though that value is within the US range above.
The initial hash seek will look for an entry with classA=7, which does not exist at all, so no match is found.
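A minimal sketch of the miss, using your lookup predicate against the spanning row above (the variable and the literal value are just for illustration):
-- The stored row (6.0.0.0 - 7.57.75.31) has classA = 6, but the address being
-- looked up hashes to classA = 7, so the equality seek finds no candidate row
-- and the BETWEEN residual never gets a chance to match.
DECLARE @ipNumber BIGINT = 0x07123456;

SELECT ICM.country_code
FROM dbo.IPtoCountryMap ICM
WHERE ICM.classA = @ipNumber / 16777216            -- seeks classA = 7
  AND @ipNumber BETWEEN ICM.from_ip AND ICM.to_ip; -- returns no rows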
Furthermore, I am not so sure that this method is any faster than the method I suggested using range seeks limited to a certain blocksize.
The problem with your method is that there are several classA values that have a lot of entries; for example, classA=212 has over 4500 different entries in the table.
The hash search will only help us find the correct slot; finding the correct entry then requires a linear search.
So, looking up IP addresses that have classA=212 will require a linear search of, on average, over 2000 rows. This will not be good for performance.
To see how bad the problem is I ran the following query on my dataset:
select classA, COUNT(*) as rows
from IPtoCountryMap
group by classA
having count(*)>100
order by 1
result:
classA  rows
24      107
62      2202
64      253
66      340
69      117
80      764
81      569
130     130
134     120
137     112
138     118
139     115
146     147
147     128
158     114
159     158
161     122
164     122
192     3671
193     1995
194     4996
195     6611
196     134
198     754
199     528
200     468
202     908
203     365
204     482
205     141
206     169
207     164
209     216
210     101
212     4512
213     2026
216     394
217     1711
To improve performance you would need to use more than 8 bits for your hash lookup so each slot contains fewer entries, but this will make the problem of entries that span more than one hash value even worse.
/SG
April 14, 2010 at 7:24 am
Stefan_G (4/14/2010)
Unfortunately, if I am not missing something, it will not really work correctly. Consider the following consecutive rows in the IpToCountryMap table:
Your US IP address range is 6.0.0.0 to 7.57.75.31 when decoded.
This is not valid (in the source) since it spans a class A address, and the end of the range contains a class D element that is not 255.
It would have to be (a minimum of) two ranges in the source file 6.0.0.0 to 6.255.255.255 and 7.0.0.0 to 7.57.75.255, for example.
That's why I go to such great lengths to avoid compacting ranges over a class A address - it would break the rules 😉
Further more, I am not so sure that this method is any faster than the method I suggested using range seeks limited to a certain blocksize.
You can see how a hash join might beat a loop join for larger inputs though? The 'hash' method doesn't produce a hash join until the row count gets big enough, but the loop join is pretty darn fast too - maybe comparable with yours, not sure since that isn't the case I wrote it for.
I'm quite happy to test both - but I'm not sure exactly how your idea works, or what block size I should choose - can you briefly explain?
The problem with your method is that there are several classA-values that have a lot of entries, for example classA=212 has over 4500 different entries in the table.
Class A = 212 has 1028 entries in the compacted table on my test rig - much smaller than the 43,911 in the table as a whole.
I feel confident that the class A hash will hold up well in testing.
The hash search will only help us find the correct slot, finding the correct entry then requires a linear search.
What makes you so sure that SQL Server uses a linear search when applying a residual predicate after a hash probe? 😀
I know the class-A level hash isn't perfect, but I believe it helps (I tried a class AB hash, but it's impossible to guarantee the source data has a row per AB, whereas it is guaranteed that no entry spans a class A boundary).
So, looking up ip-addresses that has classA=212 will require a linear search of on average 2000 rows. This will not be good for performance.
Hey I never claimed my idea was better than yours 😛
I said the question had been well answered - I'm just exploring an alternative for large inputs. I think I mentioned that 🙁
If you want to contribute positively here, you are more than welcome - and I'll play too.
If you're going to get all defensive and stuff, then you're not :w00t:
Paul
edit: to fix that quote tag
April 14, 2010 at 8:13 am
Paul White NZ (4/14/2010)
Hey I never claimed my idea was better than yours 😛 I said the question had been well answered - I'm just exploring an alternative for large inputs. I think I mentioned that 🙁
If you want to contribute positively here, you are more than welcome - and I'll play too.
If you're going to get all defensive and stuff, then you're not :w00t:
I apologize if you felt attacked by my previous post.
I am not trying to be defensive about anything, I am just discussing the relative merits of two different methods to solve a quite interesting problem.
I think exploring alternative methods is truly interesting and educational.
It is obvious that my raw data is different from yours. The row with 6.0.0.0 to 7.57.75.31 is not made up; it is actually present in my dataset.
I have no idea of the actual quality of my raw data though - it is just something I downloaded from a random site after googling for it.
There are definitely no hard feelings from me here.
/SG
April 14, 2010 at 8:25 am
Stefan_G (4/14/2010)
I am not trying to be defensive about anything, I am just discussing the relative merits of two different methods to solve a quite interesting problem.
Great! Because I agree - it is very interesting.
It is obvious that my raw data is different from yours. The row with 6.0.0.0 to 7.57.75.31 is not made up; it is actually present in my dataset. I have no idea of the actual quality of my raw data though - it is just something I downloaded from a random site after googling for it.
Hmmm - that is interesting.
Hey look, I really need to go now (it's 2:20am) but I am going to look at it again tomorrow.
I still can't get my head around a suitable block size for your earlier method (one that is guaranteed to work all the time) but it is late and I am past my peak for today 😀
If you get a minute, could you give it some thought?
Thanks!
Paul
April 14, 2010 at 10:58 am
Paul White NZ (4/14/2010)
I still can't get my head around a suitable block size for your earlier method (one that is guaranteed to work all the time) but it is late and I am past my peak for today 😀
Here is a script that tests the two different methods (hash based and range based)
-- Setup
-- Generate numbers between @start and @end
CREATE function [dbo].[GetRange](@start int, @end int) returns table as
return
select top (@end-@start+1) (row_number() over(order by (select null)))-1+@start as n
from sys.columns c1
cross join sys.columns c2
cross join sys.columns c3
GO
-- Create a table to hold the ipnumbers we want to lookup
IF EXISTS (SELECT * FROM sys.objects WHERE object_id = OBJECT_ID(N'[my].[ipAddresses]') AND type in (N'U'))
DROP TABLE [my].[ipAddresses]
GO
CREATE TABLE [my].[ipAddresses](
[ipNumber] [bigint] NOT NULL,
[countryCode] [char](2) NULL,
[c2] [char](2) NULL,
CONSTRAINT [PK_ipAddresses_1] PRIMARY KEY CLUSTERED
(
[ipNumber] ASC
)
) ON [PRIMARY]
-- Create a dataset with random IP addresses in the 217.0.0.0 area - this range is very common in Europe.
-- Run the following statement manually a few times until you have 10000 rows
-- If you get a PK violation just try again
insert into my.ipaddresses (ipnumber)
select top 1000 cast(217 as bigint)*0x1000000+(abs(CHECKSUM(newid()))%0x1000000)
from sys.all_columns c1, sys.all_columns c2
select COUNT(*) from my.ipAddresses
-- Add some more completely random addresses
insert into my.ipaddresses (ipnumber)
select top 10000 CHECKSUM(newid())
from sys.all_columns c1, sys.all_columns c2
-- Make sure there are no blocks in the lookup data that span a 64K boundary
-- Split any block that spans a 64K boundary
-- Change the constant 0x10000 if you want to try another blocksize
IF EXISTS (SELECT * FROM sys.objects WHERE object_id = OBJECT_ID(N'IPCountryMap2') AND type in (N'U'))
DROP TABLE IPCountryMap2
GO
select
case when from_ip > n*0x10000 then from_ip else n*0x10000 end as from_ip,
case when to_ip < (n+1)*0x10000-1 then to_ip else (n+1)*0x10000-1 end as to_ip,
country_code
into IPCountryMap2
from IPtoCountryMap cross apply dbo.GetRange(from_ip/0x10000, to_ip/0x10000)
CREATE CLUSTERED INDEX CL ON IPCountryMap2(from_ip)
-- Perform update of all rows using the hash-method
SET STATISTICS TIME ON
update my.ipAddresses
set c2=(
select country_code
from IPtoCountryMap
where
classA = ipNumber/0x1000000
and
from_ip <= ipNumber
and
to_ip >= ipNumber
)
GO
--SQL Server Execution Times:
-- CPU time = 4649 ms, elapsed time = 4696 ms.
-- Perform update of all rows using the range-method
SET STATISTICS TIME ON
update my.ipAddresses
set countryCode=(
select country_code
from IPCountryMap2
where
from_ip <= ipNumber
and
from_ip > ipNumber - 0x10000 -- The same blocksize as above
and
to_ip >= ipNumber
)
--SQL Server Execution Times:
-- CPU time = 250 ms, elapsed time = 263 ms
-- Verify that the two methods give the same result
select CAST(ipNumber as binary(4)), c2, countryCode
from my.ipAddresses
where isnull(c2,'')<>isnull(countryCode,'')
-- should return nothing, but with my dataset I get lots of rows here where the hash-lookup failed because of rows that span multiple
-- ClassA entries.
As you can see for this test the hash-method took 4649 ms and the range-method took 250 ms.
So for this test the range-method is much faster.
Note that I have deliberately chosen to work with a range of ip-numbers that cause problems for the hash-method.
The block where ClassA=217 contains 1711 rows in my dataset, which means that the number of linear scans needed is very large.
The dataset I used can be found here: http://www.canadiancontent.net/tech/download/IP-Country_mapping_Database.html
(I have no idea of the quality of this data)
On the other hand, I believe that ClassA=217 is a very common range for European IP numbers (my current address is in that range, for example), so I also think it is a fairly realistic test.
Some notes:
* It is very important to split up the ranges so there are no ranges in the table with a size larger than 64K. This is done by the code that creates IPCountryMap2
* The hash-method does not use an actual HASH join on my machine.
* If desired it is possible to use a different blocksize than 64K - just change the constant 0x10000 to something else in the script to try different sizes. Smaller blocks give a much larger lookup table but should increase performance slightly.
/SG
April 14, 2010 at 11:24 am
Paul White NZ (4/14/2010)
Your US IP address range is 6.0.0.0 to 7.57.75.31 when decoded. This is not valid (in the source) since it spans a class A address, and the end of the range contains a class D element that is not 255.
It would have to be (a minimum of) two ranges in the source file 6.0.0.0 to 6.255.255.255 and 7.0.0.0 to 7.57.75.255, for example.
That's why I go to such great lengths to avoid compacting ranges over a class A address - it would break the rules 😉
Just a few more comments before I shut up.
I think the concept of Class A-E IP addresses is actually obsolete. Nowadays they use something called Classless Inter-Domain Routing instead. See: http://en.wikipedia.org/wiki/Classless_Inter-Domain_Routing
See http://en.wikipedia.org/wiki/Classful_network for more info about the old system.
If I understand this correctly, this means that there is nothing strange about an IP range that spans several different Class A networks - or several Class B or Class C networks, for that matter.
What you call a ClassA address is really just the most significant 8 bits of the ip-number.
There is no special reason why you should only use 8 bits in the hash-method. You could just as easily use 12 bits or 16 bits - all you have to do is to split the rows in the lookup table just like I did in the range-method.
By increasing the number of bits you can get a smaller number of rows in each hash bucket and increase the lookup performance.
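As a rough sketch of that idea (assuming the ranges have already been split so that none crosses a 64K boundary, as in the IPCountryMap2 script above; the column and index names here are just illustrative), a 16-bit bucket could be exposed as a computed column and indexed, giving the equality predicate a finer grain:
-- Hypothetical 16-bit bucket on the already-split table
ALTER TABLE IPCountryMap2
    ADD classAB AS ISNULL(CONVERT(INT, from_ip / 0x10000), 0);

CREATE NONCLUSTERED INDEX IX_IPCountryMap2_classAB
    ON IPCountryMap2 (classAB)
    INCLUDE (from_ip, to_ip, country_code);

-- The join would then seek on the 16-bit bucket and check the range as a residual:
-- ON ICM.classAB = MIP.ipNumber / 0x10000
-- AND MIP.ipNumber BETWEEN ICM.from_ip AND ICM.to_ip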
I believe that if one uses the same blocksize the two methods will have almost identical performance.
/SG
April 14, 2010 at 5:29 pm
Stefan_G (4/14/2010)
Here is a script that tests the two different methods (hash based and range based)
Thanks - that clarifies the details of the 64K range idea nicely 🙂
Of course, I have some observations...(not criticisms)
The number of IP addresses is too small to benefit from a hash join - 20,000 IPs just isn't enough (compared to the size of the lookup table) to make the loop expensive enough. Putting half the sample IPs in the 'difficult' range is a bit unfair too.
The correlated subquery form of the update statement makes it impossible to use a hash join - a hashable query equivalent would look something like:
UPDATE M
SET c2 = CM.country_code
FROM my.ipAddresses M
JOIN IPtoCountryMap CM
ON classA = ipNumber / 0x1000000
AND M.ipNumber BETWEEN CM.from_ip AND CM.to_ip;
Though this does highlight a problem - the result of the hash join needs a distinct before the update, since SQL Server has no way of knowing that ranges never overlap.
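One way to handle that - a sketch only, reusing the names from the script above, with an arbitrary ROW_NUMBER tie-break since the ranges should not genuinely overlap:
WITH Matches AS
(
    SELECT M.c2,
           CM.country_code,
           rn = ROW_NUMBER() OVER (PARTITION BY M.ipNumber ORDER BY CM.from_ip)
    FROM my.ipAddresses M
    JOIN IPtoCountryMap CM
      ON CM.classA = M.ipNumber / 0x1000000
     AND M.ipNumber BETWEEN CM.from_ip AND CM.to_ip
)
UPDATE Matches
SET c2 = country_code
WHERE rn = 1;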
The results I got when running your test rig with this alternative construction were 2312ms for the hash, and 250ms for the loop-range.
Paul
April 14, 2010 at 5:41 pm
Stefan_G (4/14/2010)
I think the concept of Class A-E of ip-adresses is actually obsolete.
Agreed. I was using 'class A' as a quick way to refer to the first octet - nothing more.
If I understand this correctly this means that there is nothing strange about an ip-range that spans several different Class A networks
Sure. The data set I have just happens to be split that way.
There is no special reason why you should only use 8 bits in the hash-method. You could just as easily use 12 bits or 16 bits - all you have to do is to split the rows in the lookup table just like I did in the range-method.
Yep. The important point about the hash idea is that there needs to be at least one equality predicate - I'm not wedded to the 'high 8-bits' idea - it was simply convenient. The general point is that I think that pre-processing the country map ranges to allow an equality is likely to be optimal for large input sets.
I believe that if one uses the same blocksize the two methods will have almost identical performance.
This is the part that intrigues me: The loop-block method has only inequalities, forcing a loop join. My idea is to test the assertion that a hash will be more efficient for large numbers of IPs.
Paul
April 15, 2010 at 2:48 am
I just thought of another even more efficient way of solving this problem.
The problem with both your hash-method and my range-method is that they use a fast method to find a group of possibly matching values, then a linear search in the group to find the single value that really matches.
The only reason my method is faster in my tests is because my groups are smaller. The hash-method could be modified to give smaller groups which would probably make it slightly faster, but at the cost of generating a much larger lookup-table.
But, there is a simple way to make sure that only a single index-seek is required for each lookup.
The idea is to build a table that splits the full range of IP numbers into parts. Some parts will have a defined country while others will have XX to indicate that they are invalid or at least unknown.
So, I want to transform a table like this:
from_ip     to_ip       country_code
33996344    33996351    GB
50331648    83886079    US
94585424    94585439    SE
Into a table like this:
from_ip     to_ip       country_code
00000000    33996343    XX
33996344    33996351    GB
33996352    50331647    XX
50331648    83886079    US
83886080    94585423    XX
94585424    94585439    SE
With a table like that it is very fast to find the country code for an IP number with a single indexed lookup on from_ip. No linear searching is needed, and no splitting of the original table into smaller blocks is necessary.
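For a single address the lookup becomes one descending seek on from_ip (a minimal sketch against the IPCountryMap3 table built by the script below; @ip is just an example value):
DECLARE @ip BIGINT = 0x07123456;

-- The single matching row carries either a real country code
-- or 'XX' for the gap the address falls into.
SELECT TOP (1) country_code
FROM IPCountryMap3
WHERE from_ip <= @ip
ORDER BY from_ip DESC;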
Size of the lookup-table after applying the different transforms:
new_method block_split original
45163 58734 38270
So, the new method generates fewer extra rows than the block-split method.
This script demonstrates the new method:
--
-- build the expanded table
--
IF EXISTS (SELECT * FROM sys.objects WHERE object_id = OBJECT_ID(N'IPCountryMap3') AND type in (N'U'))
DROP TABLE IPCountryMap3
go
with t1 as (
select
from_ip,
to_ip,
country_code,
isnull((select min(m.from_ip) from iptocountrymap m where m.from_ip > m1.from_ip),0xffffffff) as next_ip
from iptocountrymap m1
)
select *
into IPCountryMap3
from (
-- all original rows
select from_ip, to_ip, country_code
from iptocountrymap
union all
-- fill gaps between original rows and after the last row
select to_ip+1, next_ip-1, 'XX'
from t1
where next_ip > to_ip+1
union all
-- a single row that covers the range from 0 to the first valid address
select 0, min(from_ip)-1, 'XX'
from iptocountrymap
) t
-- create an index used for lookup
create index ix1 on IPCountryMap3(from_ip) include(country_code)
--
-- Use the expanded table to perform the update
--
update my.ipAddresses
set countryCode=(select top 1 country_code from IPCountryMap3 icm where icm.from_ip <= ipNumber order by icm.from_ip desc)
-- SQL Server Execution Times:
-- CPU time = 249 ms, elapsed time = 269 ms.
--
-- Test the performance of the join without an update to make the difference more obvious
--
-- old block-range method
select MAX(country_code) from (
select ipNumber,(select country_code from IPCountryMap2 icm where from_ip <= ipNumber and from_ip > ipNumber - 0x10000 and to_ip >= ipNumber) as country_code
from my.ipAddresses a
) t
-- SQL Server Execution Times:
-- CPU time = 187 ms, elapsed time = 190 ms.
-- new single lookup method
select max(country_code) from (
select ipNumber,(select top 1 country_code from IPCountryMap3 icm where icm.from_ip <= a.ipNumber order by icm.from_ip desc) as country_code
from my.ipAddresses a
) t
-- SQL Server Execution Times:
-- CPU time = 93 ms, elapsed time = 140 ms.
As you can see, the new method is even more efficient than any of the old methods.
Putting half the sample IPs in the 'difficult' range is a bit unfair too.
Maybe, but there are many "difficult" ranges, and they are the ranges that are actively used today to define new addresses. I think that if you actually look at addresses used on the internet by normal end users, you will find that most IP numbers that are not from the US come from "difficult" ranges.
But of course, it depends on what you are doing. If the majority of your users are from the US it is possible that most numbers will come from "simple" ranges.
I actually think I am being nice by putting half the numbers in "easy" ranges. 🙂
Also, of course, I know that the hash-method would work perfectly in easy ranges. It is only in difficult ranges it has any problems. So I need to do it this way to prove my point. 😉
/SG
April 15, 2010 at 9:34 am
OK, I obviously cannot stop thinking about this problem.
I just thought of another really simple and very fast solution.
Just write the update statement like this:
update a
set CountryCode = case when a.ipNumber <= t.to_ip then t.country_code else 'XX' end
from my.ipAddresses a
cross apply (select top 1 * from ipcountry2 icm where icm.from_ip <= a.ipNumber order by icm.from_ip desc) t
This version works immediately - no need for any preprocessing of the ipcountry table.
This version is also just as fast as the fastest of all the other methods.
Amazing that a single problem can be solved in so many different ways.
Also amazing that the simplest solution is actually also the fastest.
I also find it interesting that the CROSS APPLY operator can be used in this way. In BOL they only talk about it for calling Table-Valued functions.
/SG
April 16, 2010 at 5:35 am
Stefan_G (4/15/2010)
I just thought of another even more efficient way of solving this problem.
You don't even need to go to the trouble of creating the invalid/unknown groups - just use an OUTER APPLY instead of CROSS APPLY.
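Something like the following, adapting the earlier CROSS APPLY update to run straight against the unprocessed lookup table (a sketch; I'm assuming the raw table is dbo.IpToCountry as in the compaction script earlier in the thread):
UPDATE A
SET countryCode =
        CASE WHEN A.ipNumber <= T.to_ip THEN T.country_code ELSE 'XX' END
FROM my.ipAddresses A
OUTER APPLY
(
    -- Highest range start at or below this address; if nothing qualifies,
    -- OUTER APPLY still returns the row and the CASE falls back to 'XX'.
    SELECT TOP (1) IC.to_ip, IC.country_code
    FROM dbo.IpToCountry IC
    WHERE IC.from_ip <= A.ipNumber
    ORDER BY IC.from_ip DESC
) T;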
Both APPLY solutions are pretty optimal for the original problem - no question about that. The only real issue with using APPLY here is that it is correlated, and so forces the join order and the use of nested loops to do the join. This becomes less than optimal for large sets of IP addresses, since we end up performing a look-up (no matter how efficient) per IP address.
The IP address set I have been playing around with contains one million randomly-generated addresses. Being random was the fairest way I could think of to test this.
Surprisingly, the original loop-join method turns out to be not so bad. Rather than the IP Address table driving the join, the optimiser flips things around so that the allocated ranges drive a loop join against the (now much larger) IP address table. It actually helps a fair bit to remove the 'from_ip > ipNumber - 0x10000' optimisation as soon as this flip-around occurs.
The hash-method could be modified to give smaller groups which would probably make it slightly faster, but at the cost of generating a much larger lookup-table.
This turns out to be true. I found that a group size of 16384 worked well, generating a look-up table with 250,203 rows (roughly double that of the original).
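A rough sketch of that pre-processing, following the same splitting approach as the IPCountryMap2 script but with a 16,384-address bucket and an explicit bucket column to join on with an equality (the table name IPCM matches the statistics output below; the rest is an assumption on my part):
-- Split every allocated range at 16K boundaries and record its bucket
SELECT
    bucket       = n,
    from_ip      = CASE WHEN from_ip > n * 0x4000 THEN from_ip ELSE n * 0x4000 END,
    to_ip        = CASE WHEN to_ip < (n + 1) * 0x4000 - 1 THEN to_ip ELSE (n + 1) * 0x4000 - 1 END,
    country_code
INTO IPCM
FROM IPtoCountryMap
CROSS APPLY dbo.GetRange(from_ip / 0x4000, to_ip / 0x4000);

CREATE CLUSTERED INDEX CL ON IPCM (bucket, from_ip);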
Testing with the million-row table, I tried MERGE and HASH joins, the fastest version of the loop-join method (flipped) and the APPLY. You'll have to take it on faith at this stage that I did everything I could think of with indexes and code optimisations to get the best out of all the methods.
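For concreteness, the joins being compared all have roughly this shape (a sketch only - whether each timed statement was an UPDATE or a bare SELECT isn't shown here, and the hint can be swapped between HASH, MERGE and LOOP):
UPDATE M
SET countryCode = CM.country_code
FROM my.ipAddresses M
INNER HASH JOIN IPCM CM                           -- or MERGE / LOOP
   ON CM.bucket = M.ipNumber / 0x4000             -- equality on the 16K bucket
  AND M.ipNumber BETWEEN CM.from_ip AND CM.to_ip; -- residual range check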
Results:
-- Merge
Table 'IPCM'. Scan count 1, logical reads 966
Table 'ipAddresses'. Scan count 1, logical reads 2238
Table 'Worktable'. Scan count 6540, logical reads 131981
CPU time = 1188 ms, elapsed time = 1370 ms.
-- Hash
Table 'IPCM'. Scan count 1, logical reads 874
Table 'ipAddresses'. Scan count 1, logical reads 2238
CPU time = 968 ms, elapsed time = 1136 ms.
-- Loop
Table 'IPCM'. Scan count 1, logical reads 874
Table 'ipAddresses'. Scan count 250203, logical reads 1051465
CPU time = 2140 ms, elapsed time = 2239 ms.
-- Apply
Table 'ipAddresses'. Scan count 1, logical reads 2238
Table 'IPCM'. Scan count 999752, logical reads 2999256
CPU time = 4000 ms, elapsed time = 4124 ms.
So, it's not like the CPU or elapsed time differences are hugely dramatic - but check out the logical reads!
I also find it interesting that the CROSS APPLY operator can be used in this way. In BOL they only talk about it for calling Table-Valued functions.
Heh. So I hope you read part one of my article on Monday: http://www.sqlservercentral.com/articles/APPLY/69953/
Anyway, this whole problem continues to fascinate me. I'm not 100% happy with any of the methods found so far - though the APPLY is clearly optimal for small sets, and the grouped mapping table is an excellent improvement for the hash/merge approach. It bugs me that there is no easy way to perform a proper interval-join in SQL Server.
April 16, 2010 at 6:10 am
Could you possibly post the actual SQL used in your tests?
You don't have to post a full repro script, but it would be interesting to see the actual queries that you were timing.
I am not sure if you were performing an update, or if you were just testing the join with some other kind of query.
Thanks
/SG