April 16, 2010 at 6:27 am
Stefan_G (4/16/2010)
Could you possibly post the actual SQL used in your tests? You don't have to post a full repro script, but it would be interesting to see the actual queries that you were timing. I am not sure if you were performing an update, or if you were just testing the join with some other kind of query.
The posted results came from a simple count - though I ran other tests with other aggregates like MAX - just something to make sure every row was touched. The update presents special problems, so the idea is to test just the mapping algorithm. Anyway:
-- Hash join
SELECT COUNT_BIG(*)
FROM dbo.IPCM IM
JOIN my.ipAddresses A
ON A.ipNumber BETWEEN IM.from_ip AND IM.to_ip
AND A.base = IM.base
OPTION (HASH JOIN, MAXDOP 1);
-- Loop join (flipped)
SELECT COUNT_BIG(*)
FROM dbo.IPCM IM
JOIN my.ipAddresses A
ON A.ipNumber BETWEEN IM.from_ip AND IM.to_ip
--AND from_ip > ipNumber - 16384
OPTION (LOOP JOIN, MAXDOP 1, FORCE ORDER);
-- Apply
SELECT COUNT_BIG(*)
FROM my.ipAddresses A
CROSS
APPLY (
SELECT TOP (1)
*
FROM dbo.IPCM IM
WHERE IM.from_ip <= A.ipNumber
ORDER BY
IM.from_ip DESC
) iTVF
OPTION (MAXDOP 1);
edit: updated code
April 16, 2010 at 10:18 am
Paul White NZ (4/16/2010)
The posted results came from a simple count - though I ran other tests with other aggregates like MAX - just something to make sure every row was touched. The update presents special problems, so the idea is to test just the mapping algorithm.
I see.
The problem is that the code you tested represents a very different problem from the original one. Just testing the mapping might of course still be interesting; we just have to remember that performing the join is only part of the problem.
So, if we are to continue this discussion, I think we need a stricter definition of the problem we are really trying to solve.
As I see it, the problem we are trying to solve can be described like this:
We have a table IpCountryMap that maps ip-address ranges to country codes. (Such a table can be downloaded from here.)
We have another table ipAddresses that contains ip-addresses and a country code column.
The problem is how to update all the country codes in the ipAddress table as quickly as possible. IpAddresses that do not map to a valid country code should be set to 'XX'. All rows in the table should be updated.
Preprocessing the IpCountryMap table is permitted. The time for preprocessing does not count. The number of rows in IpCountryMap after preprocessing should not exceed 500,000.
Any indexes can be added to the tables. The time for adding indexes does not count - see the indexing sketch after these rules.
To get a realistic dataset we could assume that the ipAddress table contains 1 million unique random numbers evenly distributed in the following ranges:
61.0.0.0 - 69.0.0.0
128.0.0.0 - 172.0.0.0
192.0.0.0 - 221.0.0.0
These ranges together represent more than 87% of all ip-numbers actually in use today, according to this report.
The solution should not assume that only the above ranges are used - it must work correctly for any ip-address.
Only T-SQL is allowed - no SQLCLR, for example.
When measuring performance, we run the script several times and then use the shortest elapsed time.
Parallel processing is allowed.
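For illustration, the kind of indexing I have in mind is roughly this - a sketch only, with table and column names assumed from the definitions above:
-- assumed shapes: IpCountryMap(from_ip, to_ip, country_code)
--                 my.ipAddresses(ipNumber, countryCode)
create unique clustered index CL_IpCountryMap on IpCountryMap (from_ip)
create unique clustered index CL_ipAddresses on my.ipAddresses (ipNumber)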
I think that something like this is close to an optimal solution for this problem:
-- build the ip -> country mapping into a temp table first
if object_id('tempdb..#t') is not null drop table #t

select a.ipNumber, icm.country_code
into #t
from my.ipAddresses a, ipcountry_merged icm
where icm.from_ip <= a.ipNumber
  and icm.to_ip >= a.ipNumber

create unique clustered index CL on #t (ipNumber)

-- then apply the mapping to the base table; unmatched addresses get 'XX'
update a
set a.countryCode = case when t.country_code is null then 'XX' else t.country_code end
from my.ipAddresses a
left join #t t on a.ipNumber = t.ipNumber
option (hash join)
ipcountry_merged is preprocessed by merging consecutive ranges that map to the same country code. The table has 43854 rows.
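In case anyone wants to reproduce that step, the merging can be done with something like this - a sketch only, assuming the raw table is called IpCountryMap, that from_ip/to_ip are integers, and that the source ranges do not overlap:
-- collapse adjacent ranges that map to the same country code
select s.from_ip,
       (select min(e.to_ip)
        from IpCountryMap e
        where e.country_code = s.country_code
          and e.to_ip >= s.to_ip
          -- e ends a merged block: no adjacent range with the same code follows it
          and not exists (select * from IpCountryMap n
                          where n.from_ip = e.to_ip + 1
                            and n.country_code = e.country_code)) as to_ip,
       s.country_code
into ipcountry_merged
from IpCountryMap s
-- s starts a merged block: no adjacent range with the same code precedes it
where not exists (select * from IpCountryMap p
                  where p.to_ip = s.from_ip - 1
                    and p.country_code = s.country_code)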
Do you have any better suggestions?
/SG
April 16, 2010 at 7:05 pm
Stefan_G (4/16/2010)
The problem is that the code you tested represents a very different problem from the original one. Just testing the mapping might of course still be interesting; we just have to remember that performing the join is only part of the problem.
Sure. More than one interesting problem was highlighted by the original question in this thread. My particular interest has been the generalised problem of interval matching in SQL Server. The same issues arise whenever a table contains a range of dates for example.
As I see it, the problem we are trying to solve can be described like this:
I don't disagree with any of those definitions (SQLCLR is unlikely to be of benefit given the large number of data operations required). Parallel plans are interesting (since they allow bitmap filters) but tend to complicate results because the usefulness of an in-row filter depends so much on how much data can be excluded in the scan.
My lack of focus on the update part of the problem comes down to two issues: (1) update plans have special restrictions and requirements which tend to dominate the overall cost for some plan options; and (2) in some cases it might be quicker to do a minimally-logged SELECT...INTO followed by a drop and rename, instead of the update.
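To illustrate the second point, the alternative I have in mind is roughly this shape - a sketch only, reusing the #t table from your script above and assuming a simple or bulk-logged recovery model for the minimal logging:
-- Build the fully-mapped copy with a minimally-logged SELECT...INTO
SELECT A.ipNumber,
       ISNULL(T.country_code, 'XX') AS countryCode
INTO   my.ipAddresses_new
FROM   my.ipAddresses A
LEFT   JOIN #t T
       ON T.ipNumber = A.ipNumber;

CREATE UNIQUE CLUSTERED INDEX CL ON my.ipAddresses_new (ipNumber);

-- Swap the new table in place of the old one
DROP TABLE my.ipAddresses;
EXECUTE sp_rename 'my.ipAddresses_new', 'ipAddresses';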
ipcountry_merged is preprocessed by merging consecutive ranges that map to the same country code. The table has 43854 rows.
I'll have a look at the specific test case later today, though my expectation is that a loop join driven from the mapping table should be fastest - it'll be interesting to see how that works out.
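In outline, the shape I expect to do well is something like this (just a sketch against your merged table, untested at this point):
-- Drive the nested loops from the small mapping table,
-- seeking the ipAddresses clustered index once per range
SELECT A.ipNumber,
       ICM.country_code
INTO   #mapped
FROM   ipcountry_merged ICM
JOIN   my.ipAddresses A
       ON A.ipNumber BETWEEN ICM.from_ip AND ICM.to_ip
OPTION (LOOP JOIN, FORCE ORDER, MAXDOP 1);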
Paul
April 17, 2010 at 5:25 am
Paul White NZ (4/16/2010)
More than one interesting problem was highlighted by the original question in this thread. My particular interest has been the generalised problem of interval matching in SQL Server. The same issues arise whenever a table contains a range of dates for example.
I agree. The generalized problem is very interesting. This thread has shown that there are many ways to attack the problem and it is not so easy to know in advance which method is the best.
One particular lesson I have learned from this thread is that UPDATE really constrains the methods SQL Server can use. A straight SELECT into a temporary table gives the optimizer much greater freedom to choose an optimal plan.
It is interesting that the original formulation from the OP performs very badly as an UPDATE, but is at least close to optimal when used as a SELECT. (The reason for this is of course that the optimizer can flip the query around and use a loop join driven by the IpCountryMap table instead of by IpAddresses.)
Now, if you are interested, here is some code to generate 1 million random ip-addresses with a realistic distribution:
create procedure dbo.GetRandom(@min bigint, @max bigint, @count int)
as
select top (@count) cast(cast(binary_checksum(newid()) as binary(4)) as bigint)%(@max-@min+1)+@min
from sys.columns c1, sys.columns c2, sys.columns c3
go
drop table my.ipAddresses
create table my.ipAddresses (ipNumber binary(4) not null, countryCode char(2) null)
-- create approx 1 million random addresses with realistic distribution according to ftp://ftp.isi.edu/isi-pubs/tr-598.pdf
insert into my.ipAddresses (ipNumber) exec GetRandom 0x3d000000, 0x45ffffff, 278000
insert into my.ipAddresses (ipNumber) exec GetRandom 0x80000000, 0xacffffff, 142000
insert into my.ipAddresses (ipNumber) exec GetRandom 0xc0000000, 0xddffffff, 450000
insert into my.ipAddresses (ipNumber) exec GetRandom 0x00000000, 0xffffffff, 131000
-- delete duplicates (every occurrence of a duplicated ipNumber is removed)
delete from my.ipAddresses
where ipNumber in (
select ipNumber
from my.ipAddresses
group by ipNumber
having count(*)>1
)
-- create unique index
create unique clustered index CL on my.ipAddresses(ipNumber)
-- show the resulting distribution
select substring(ipNumber,1,1), count(*)
from my.ipAddresses
group by substring(ipNumber,1,1)
order by 1
/SG
April 17, 2010 at 4:44 pm
As a sidebar, if you use Master.sys.All_Columns instead of the "local" copy of sys.columns, then you can get 16 million rows out of a single cross join with a bit of a guarantee (the table has at least 4k rows even on a brand-new full install).
--Jeff Moden
Change is inevitable... Change for the better is not.
April 18, 2010 at 3:12 am
Jeff Moden (4/17/2010)
As a sidebar, if you use Master.sys.All_Columns instead of the "local" copy of sys.columns, then you can get 16 million rows out of a single cross join with a bit of a guarantee (the table has at least 4k rows even on a brand-new full install).
Well, not that it matters much, but the reason I use sys.columns is that it is slightly more efficient than sys.all_columns.
sys.columns only accesses a single table, while sys.all_columns is a join.
select max(n) from (
select top 1000000 checksum(newid()) as n
from sys.columns c1, sys.columns c2, sys.columns c3
) t
select max(n) from (
select top 1000000 checksum(newid()) as n
from sys.all_columns c1, sys.all_columns c2
) t
Table 'syscolpars'. Scan count 3, logical reads 12, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
SQL Server Execution Times:
CPU time = 359 ms, elapsed time = 368 ms.
Table 'syscolrdb'. Scan count 1, logical reads 100, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'syscolpars'. Scan count 2, logical reads 8, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
SQL Server Execution Times:
CPU time = 374 ms, elapsed time = 369 ms.
As I said, it does not really matter much - it boils down to a matter of taste.
/SG
April 18, 2010 at 7:03 pm
Stefan_G (4/18/2010)
...the reason I use sys.columns is that it is slightly more efficient than sys.all_columns. sys.columns only accesses a single table, while sys.all_columns is a join.
With master as the context database, I couldn't get either method to consistently outperform the other - though I understand the general point you are making.
On the other hand, I can equally see Jeff's point: sys.all_columns in the master database is pretty much guaranteed to have sufficient rows - whereas sys.columns in the current context database might not.
I think it was Peso who had trimmed his master database enough to break Jeff's assertion - but as you say, it doesn't really matter. On occasion, I have been known to use master.sys.allocation_units, just for variety!
April 18, 2010 at 9:35 pm
Actually, for anything bigger than 11K rows, I just use a CROSS JOIN on my trusty ol' Tally table because I KNOW how many rows that has. 😀
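For anyone without one handy, the cross join I mean looks roughly like this - just a sketch, and it assumes a dbo.Tally table with a single INT column N holding at least the numbers 1 through 11,000:
-- 11K x 11K from the Tally table is up to ~121 million rows; TOP caps it at what we actually need
SELECT TOP (1000000)
       CHECKSUM(NEWID()) AS n
  FROM dbo.Tally t1
 CROSS JOIN dbo.Tally t2;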
--Jeff Moden
Change is inevitable... Change for the better is not.