February 15, 2007 at 4:42 pm
Comments posted here are about the content posted at http://www.sqlservercentral.com/columnists/dpoole/2871.asp
April 2, 2007 at 12:01 am
You missed one obvious way to store IP addresses, which is both the smallest data size and yet still easy to interpret as an IP address -- binary(4).
127.0.0.1 becomes 0x7F000001
I wrote the following functions, and hereby declare them into the public domain:
create function dbo.ConvertIPAddressToBinary (@IPAddress varchar(15))
returns binary(4)
as
begin
declare @Octet1 binary(1), @Octet2 binary(1), @Octet3 binary(1), @Octet4 binary(1);
select @Octet1 = cast(cast(substring(@IPAddress, 1, charindex('.', @IPAddress) - 1) as tinyint) as binary(1));
select @IPAddress = substring(@IPAddress, charindex('.', @IPAddress) + 1, len(@IPAddress));
select @Octet2 = cast(cast(substring(@IPAddress, 1, charindex('.', @IPAddress) - 1) as tinyint) as binary(1));
select @IPAddress = substring(@IPAddress, charindex('.', @IPAddress) + 1, len(@IPAddress));
select @Octet3 = cast(cast(substring(@IPAddress, 1, charindex('.', @IPAddress) - 1) as tinyint) as binary(1));
select @IPAddress = substring(@IPAddress, charindex('.', @IPAddress) + 1, len(@IPAddress));
select @Octet4 = cast(cast(@IPAddress as tinyint) as binary(1));
return @Octet1 + @Octet2 + @Octet3 + @Octet4;
end
go
create function dbo.ConvertBinaryToIPAddress (@IPAddress binary(4))
returns varchar(15)
as
begin
declare @BigIP bigint;
declare @Octet1 tinyint, @Octet2 tinyint, @Octet3 tinyint, @Octet4 tinyint;
select @BigIP = cast(@IPAddress as bigint);
select @Octet4 = @BigIP % 256;
select @BigIP = @BigIP / 256;
select @Octet3 = @BigIP % 256;
select @BigIP = @BigIP / 256;
select @Octet2 = @BigIP % 256;
select @BigIP = @BigIP / 256;
select @Octet1 = @BigIP % 256;
return cast(@Octet1 as varchar) + '.' + cast(@Octet2 as varchar) + '.' + cast(@Octet3 as varchar) + '.' + cast(@Octet4 as varchar);
end
go
April 2, 2007 at 12:34 am
Just a side comment. Depending on the purpose and expected lifetime of the application, you might have to deal with IPv6 addresses (http://en.wikipedia.org/wiki/Ipv6). In that case, you'd have to consider larger data types. If you wanted to stick to built-in types, uniqueidentifier could be an option (not very human readable, but atomic)-or you could go for binary(16) (or maybe varbinary(16) thinking also of ipv4) as implied by earlier post.
April 2, 2007 at 3:16 am
Surely your choice of index for the large range could have been better? Yes the highly specific reversed order may be better for a particular lookup, but for a ranged lookup over all 4 levels surely an index in order would be better? After all, in this example you will need to go through ALL of the top 3 levels before you acquire any level of usable index? Seems pretty inefficient to me.
April 2, 2007 at 7:20 am
Interesting. I'd be curious to see the query plan on the bigint implementation, and a link to the UDT implementation if anyone has one (though I agree the overhead in storage is substantial).
April 2, 2007 at 8:14 am
Good work, David.
I agree with the comments about IPv6 though. Easily expanded to 6. Our need is to validate IP's and to count the number of distinct rows over time for licensing. Some of these techniques could be quite helpful speeding performance.
ATBCharles Kincaid
April 2, 2007 at 9:12 am
I have seen this approach before and it *is* quite fast but when you try to make the returned values into readable format is when you get hit *hard* by performance.
* Noel
April 2, 2007 at 10:10 am
I'm not sure I see the need to have the data be human readable. I guess it depends on the specifics of your use case.
My assumption is that the index lookup would be best with int or binary(4) rather than having the index over 4 columns. Each comparison would be very direct this way. Not sure if this small perf gain is worth much though, depends on the circumstances. I have about 300MM rows of web logs (recent data only, I keep a rolling window of data) in the DB at the moment, so it might add up.. but I very rarely need to analyze by IP. When I do, it is always after already filtering based on a time window.
I would think that a good solution might be to just use int / binary(4) for indexing your IP and have a non-persisted/indexed computed column that has your human readable value. You would want simple mapping functions with most of these solutions, so just use your mapping function in the computed column for human readable. The major downside clearly is if you routinely want to render a large dataset of IPs rather than just filter or aggregate data on them.
I just use varchar(15) at the moment because it makes it dead simple to BCP in the IIS logs. I care very little about disk space because this is data I exclude from my backup regimen.
It's always good to know how you actually use your data for choosing the best strategy.
April 2, 2007 at 10:14 am
Joe,
How do you feel about the naming of the integer types?
When I started programming integer meant 16 bit, long integer meant 32 bit. Not everything has shifted up a gear (no pun intended).
I notice that the recommendations for .NET development say that the developer should used the explicit types such as int32 and int64.
Calling 64 bit integers BIGINT strikes me as the equivalent of having giant squid, collosal squid and whatever name they'll think of for the next size squid.
Does this mean that when we move up to 128 bit computing we are going to see HUGEINT, ENORMOUSINT etc?
April 2, 2007 at 11:58 am
I think there has been a dangerous tendency to make things human readable, and I would argue that storage and retrieval should be optimized for the computer that is doing the job. Generating a report for humans would be useless with a million rows, and I have found that that the processing cost in rendering reports for human consumption has not been a bottleneck, the key is in applying general logic and business rules with explicit test cases, rather than the more common approach of applying eyeballs to long lists to figure out what was missed. since the libraries to manipulate IP addresses in Binary are readily available, I would stay away from char based approaches. I would also avoid the conversion to tinyint, because this is a transformation of the bytes and therefore should require and extra processing step for logical manipulation. If text based logs are the source for such data, why not turn them off, and write a direct Binary file for orders of magnitude faster ETL?
and following Mr Poole, With Intel's new 80 core processors will we have OCTDECAint?
Cheers,
John R. Hanson
April 2, 2007 at 12:35 pm
If you have humans running a lot of adhoc queries, I suppose not normalizing out your state abbreviations is worthwhile. If you have a heavy query load running over a large dataset where state is a common filter, you're probably better off normalizing it. It's all about what you want to do with your data
April 2, 2007 at 1:58 pm
I think that it is an exelente article and I find the solution of the index good and original, although I don't discard the varbin(4) or int idea. You will always have to make a function to calculate the value of @Low and @High in your query and those variables will not have values very readable
April 2, 2007 at 4:50 pm
I remember reading all the Microsoft documents for the roll out of SQL 7. Somewhere in that mess was a document that said that SQL server's reason for being was to support ad hoc queries. Everything else was in support of this main goal.
Sorry, one last "shot" for the day. Couldn't resist.
ATBCharles Kincaid
April 6, 2007 at 11:27 am
You can store a UDT using much less space if you define the serialization methodology. The IP UDT I created serializing to bytes has the following stats when tested using the articles methodology.
rows 1000000 reserved 21832 KB data 21744 KB index_size 88 KB unused 0 KB
April 6, 2007 at 8:00 pm
How does this work when you want to filter on a given IP or subnet? Does the UDT have to be deserialized for every row to compare, or does sql serialize your filter value and compare the serialized values? I recall from my work with the xml datatype that it was far better to use existential checks (.exist()) when possible to avoid serialization, but this may not directly apply to CLR UDT's (although I recall there were many commonalities between xmldt and clr udt's).
Viewing 15 posts - 1 through 15 (of 21 total)
You must be logged in to reply to this topic. Login to reply