April 16, 2010 at 6:27 am
Stefan_G (4/16/2010)
Could you possibly post the actual SQL used in your tests? You don't have to post a full repro script, but it would be interesting to see the actual queries that you were timing. I am not sure if you were performing an update, or if you were just testing the join with some other kind of query.
The posted results came from a simple count - though I ran other tests with other aggregates like MAX - just something to make sure every row was touched. The update presents special problems, so the idea is to test just the mapping algorithm. Anyway:
-- Hash join
SELECT COUNT_BIG(*)
FROM dbo.IPCM IM
JOIN my.ipAddresses A
ON A.ipNumber BETWEEN IM.from_ip AND IM.to_ip
AND A.base = IM.base
OPTION (HASH JOIN, MAXDOP 1);
-- Loop join (flipped)
SELECT COUNT_BIG(*)
FROM dbo.IPCM IM
JOIN my.ipAddresses A
ON A.ipNumber BETWEEN IM.from_ip AND IM.to_ip
--AND from_ip > ipNumber - 16384
OPTION (LOOP JOIN, MAXDOP 1, FORCE ORDER);
-- Apply
SELECT COUNT_BIG(*)
FROM my.ipAddresses A
CROSS
APPLY (
SELECT TOP (1)
*
FROM dbo.IPCM IM
WHERE IM.from_ip <= A.ipNumber
ORDER BY
IM.from_ip DESC
) iTVF
OPTION (MAXDOP 1);
edit: updated code
April 16, 2010 at 10:18 am
Paul White NZ (4/16/2010)
The posted results came from a simple count - though I ran other tests with other aggregates like MAX - just something to make sure every row was touched. The update presents special problems, so the idea is to test just the mapping algorithm.
I see.
The problem is that the code you tested represents a very different problem from the original one. Just testing the mapping might of course still be interesting; we just have to remember that performing the join is only part of the problem.
So, if we are to continue this discussion, I think we need a stricter definition of the problem we are really trying to solve.
As I see it, the problem we are trying to solve can be described like this:
We have a table IpCountryMap that maps ip-address ranges to country codes. (Such a table can be downloaded from here.)
We have another table ipAddresses that contains ip-addresses and a country code column.
The problem is how to update all the country codes in the ipAddress table as quickly as possible. IpAddresses that do not map to a valid country code should be set to 'XX'. All rows in the table should be updated.
Preprocessing the IpCountryMap table is permitted. The time for preprocessing does not count. The number of rows in IpCountryMap after preprocessing should not exceed 500,000.
Any indexes can be added to the tables. The time for adding indexes does not count - see the indexing sketch after these rules.
To get a realistic dataset we could assume that the ipAddress table contains 1 million unique random numbers evenly distributed in the following ranges:
61.0.0.0 - 69.0.0.0
128.0.0.0 - 172.0.0.0
192.0.0.0 - 221.0.0.0
These ranges together represent more than 87% of all ip-numbers actually in use today, according to this report.
The solution should not assume that only the above ranges are used - it must work correctly for any ip-address.
Only T-SQL is allowed - no SQLCLR, for example.
When measuring performance, we run the script several times and then use the shortest elapsed time.
Parallel processing is allowed.
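For illustration, the kind of indexing I have in mind is roughly this - a sketch only, with table and column names assumed from the definitions above:
-- assumed shapes: IpCountryMap(from_ip, to_ip, country_code)
--                 my.ipAddresses(ipNumber, countryCode)
create unique clustered index CL_IpCountryMap on IpCountryMap (from_ip)
create unique clustered index CL_ipAddresses on my.ipAddresses (ipNumber)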
I think that something like this is close to an optimal solution for this problem:
-- build the ip -> country mapping into a temp table first
if object_id('tempdb..#t') is not null drop table #t

select a.ipNumber, icm.country_code
into #t
from my.ipAddresses a, ipcountry_merged icm
where icm.from_ip <= a.ipNumber
  and icm.to_ip >= a.ipNumber

create unique clustered index CL on #t (ipNumber)

-- then apply the mapping to the base table; unmatched addresses get 'XX'
update a
set a.countryCode = case when t.country_code is null then 'XX' else t.country_code end
from my.ipAddresses a
left join #t t on a.ipNumber = t.ipNumber
option (hash join)
ipcountry_merged is preprocessed by merging consecutive ranges that map to the same country code. The table has 43854 rows.
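In case anyone wants to reproduce that step, the merging can be done with something like this - a sketch only, assuming the raw table is called IpCountryMap, that from_ip/to_ip are integers, and that the source ranges do not overlap:
-- collapse adjacent ranges that map to the same country code
select s.from_ip,
       (select min(e.to_ip)
        from IpCountryMap e
        where e.country_code = s.country_code
          and e.to_ip >= s.to_ip
          -- e ends a merged block: no adjacent range with the same code follows it
          and not exists (select * from IpCountryMap n
                          where n.from_ip = e.to_ip + 1
                            and n.country_code = e.country_code)) as to_ip,
       s.country_code
into ipcountry_merged
from IpCountryMap s
-- s starts a merged block: no adjacent range with the same code precedes it
where not exists (select * from IpCountryMap p
                  where p.to_ip = s.from_ip - 1
                    and p.country_code = s.country_code)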
Do you have any better suggestions?
/SG
April 16, 2010 at 7:05 pm
Stefan_G (4/16/2010)
The problem is that the code you tested represents a very different problem from the original one. Just testing the mapping might of course still be interesting; we just have to remember that performing the join is only part of the problem.
Sure. More than one interesting problem was highlighted by the original question in this thread. My particular interest has been the generalised problem of interval matching in SQL Server. The same issues arise whenever a table contains a range of dates for example.
As I see it, the problem we are trying to solve can be described like this:
I don't disagree with any of those definitions (SQLCLR is unlikely to be of benefit given the large number of data operations required). Parallel plans are interesting (since they allow bitmap filters) but tend to complicate results because the usefulness of an in-row filter depends so much on how much data can be excluded in the scan.
My lack of focus on the update part of the problem comes down to two issues: (1) update plans have special restrictions and requirements which tend to dominate the overall cost for some plan options; and (2) in some cases it might be quicker to do a minimally-logged SELECT...INTO followed by a drop and rename, instead of the update.
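To illustrate the second point, the alternative I have in mind is roughly this shape - a sketch only, reusing the #t table from your script above and assuming a simple or bulk-logged recovery model for the minimal logging:
-- Build the fully-mapped copy with a minimally-logged SELECT...INTO
SELECT A.ipNumber,
       ISNULL(T.country_code, 'XX') AS countryCode
INTO   my.ipAddresses_new
FROM   my.ipAddresses A
LEFT   JOIN #t T
       ON T.ipNumber = A.ipNumber;

CREATE UNIQUE CLUSTERED INDEX CL ON my.ipAddresses_new (ipNumber);

-- Swap the new table in place of the old one
DROP TABLE my.ipAddresses;
EXECUTE sp_rename 'my.ipAddresses_new', 'ipAddresses';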
ipcountry_merged is preprocessed by merging consecutive ranges that map to the same country code. The table has 43854 rows.
I'll have a look at the specific test case later today, though my expectation is that a loop join driven from the mapping table should be fastest - it'll be interesting to see how that works out.
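In outline, the shape I expect to do well is something like this (just a sketch against your merged table, untested at this point):
-- Drive the nested loops from the small mapping table,
-- seeking the ipAddresses clustered index once per range
SELECT A.ipNumber,
       ICM.country_code
INTO   #mapped
FROM   ipcountry_merged ICM
JOIN   my.ipAddresses A
       ON A.ipNumber BETWEEN ICM.from_ip AND ICM.to_ip
OPTION (LOOP JOIN, FORCE ORDER, MAXDOP 1);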
Paul
April 17, 2010 at 5:25 am
Paul White NZ (4/16/2010)
More than one interesting problem was highlighted by the original question in this thread. My particular interest has been the generalised problem of interval matching in SQL Server. The same issues arise whenever a table contains a range of dates for example.
I agree. The generalized problem is very interesting. This thread has shown that there are many ways to attack the problem and it is not so easy to know in advance which method is the best.
One particular lesson I have learned from this thread is that UPDATE really constrains the methods SQL Server can use. A straight SELECT into a temporary table gives the optimizer much greater freedom to choose an optimal plan.
It is interesting that the original formulation from the OP performs very badly as an UPDATE, but is at least close to optimal when used as a SELECT. (The reason for this is of course that the optimizer can flip the query around and use a loop join driven by the IpCountryMap table instead of by IpAddresses.)
Now, if you are interested, here is some code to generate 1 million random ip-addresses with a realistic distribution:
create procedure dbo.GetRandom(@min bigint, @max bigint, @count int)
as
select top (@count) cast(cast(binary_checksum(newid()) as binary(4)) as bigint)%(@max-@min+1)+@min
from sys.columns c1, sys.columns c2, sys.columns c3
go
drop table my.ipAddresses
create table my.ipAddresses (ipNumber binary(4) not null, countryCode char(2) null)
-- create approx 1 million random addresses with realistic distribution according to ftp://ftp.isi.edu/isi-pubs/tr-598.pdf
insert into my.ipAddresses (ipNumber) exec GetRandom 0x3d000000, 0x45ffffff, 278000
insert into my.ipAddresses (ipNumber) exec GetRandom 0x80000000, 0xacffffff, 142000
insert into my.ipAddresses (ipNumber) exec GetRandom 0xc0000000, 0xddffffff, 450000
insert into my.ipAddresses (ipNumber) exec GetRandom 0x00000000, 0xffffffff, 131000
-- delete duplicates (every occurrence of a duplicated ipNumber is removed)
delete from my.ipAddresses
where ipNumber in (
select ipNumber
from my.ipAddresses
group by ipNumber
having count(*)>1
)
-- create unique index
create unique clustered index CL on my.ipAddresses(ipNumber)
-- show the resulting distribution
select substring(ipNumber,1,1), count(*)
from my.ipAddresses
group by substring(ipNumber,1,1)
order by 1
/SG
April 17, 2010 at 4:44 pm
As a sidebar, if you use Master.sys.All_Columns instead of the "local" copy of sys.columns, then you can get 16 million rows out of a single cross join with a bit of a guarantee (the table has at least 4k rows even on a brand-new full install).
--Jeff Moden
Change is inevitable... Change for the better is not.
April 18, 2010 at 3:12 am
Jeff Moden (4/17/2010)
As a sidebar, if you use Master.sys.All_Columns instead of the "local" copy of sys.columns, then you can get 16 million rows out of a single cross join with a bit of a guarantee (the table has at least 4k rows even on a brand-new full install).
Well, not that it matters much, but the reason I use sys.columns is that it is slightly more efficient than sys.all_columns.
sys.columns only accesses a single table, while sys.all_columns is a join.
select max(n) from (
select top 1000000 checksum(newid()) as n
from sys.columns c1, sys.columns c2, sys.columns c3
) t
select max(n) from (
select top 1000000 checksum(newid()) as n
from sys.all_columns c1, sys.all_columns c2
) t
Table 'syscolpars'. Scan count 3, logical reads 12, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
SQL Server Execution Times:
CPU time = 359 ms, elapsed time = 368 ms.
Table 'syscolrdb'. Scan count 1, logical reads 100, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'syscolpars'. Scan count 2, logical reads 8, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
SQL Server Execution Times:
CPU time = 374 ms, elapsed time = 369 ms.
As I said, it does not really matter much - it boils down to a matter of taste.
/SG
April 18, 2010 at 7:03 pm
Stefan_G (4/18/2010)
...the reason I use sys.columns is that it is slightly more efficient than sys.all_columns. sys.columns only accesses a single table, while sys.all_columns is a join.
With master as the context database, I couldn't get either method to consistently outperform the other - though I understand the general point you are making.
On the other hand, I can equally see Jeff's point: sys.all_columns in the master database is pretty much guaranteed to have sufficient rows - whereas sys.columns in the current context database might not.
I think it was Peso who had trimmed his master database enough to break Jeff's assertion - but as you say, it doesn't really matter. On occasion, I have been known to use master.sys.allocation_units, just for variety!
April 18, 2010 at 9:35 pm
Actually, for anything bigger than 11K rows, I just use a CROSS JOIN on my trusty ol' Tally table because I KNOW how many rows that has. 😀
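For anyone without one handy, the cross join I mean looks roughly like this - just a sketch, and it assumes a dbo.Tally table with a single INT column N holding at least the numbers 1 through 11,000:
-- 11K x 11K from the Tally table is up to ~121 million rows; TOP caps it at what we actually need
SELECT TOP (1000000)
       CHECKSUM(NEWID()) AS n
  FROM dbo.Tally t1
 CROSS JOIN dbo.Tally t2;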
--Jeff Moden
Change is inevitable... Change for the better is not.