Storing IPs in SQL Server

What is the best way to store an IP Address in SQL Server

Imagine that you are asked to design a database for a data analysis team to perform web site traffic analysis.

This brief will have many requirements and amongst them will be the need to store IP addresses.

Search engines such as Google have a fixed range of IP addresses so the team can easily separate out

traffic from true visitors Vs traffic from bots. Although the team have a good rudimentary knowledge of T-SQL the data must

be relatively simple to query. So with regard to IP addresses the requirement is as follows:

Store IP addresses efficiently
Allow retrieval of IP addresses in a machine readable format
Allow simple querying on a range or ranges of IP addresses

The data type question

If traffic to your web site is high then the choice of data types is going to be important.

You could keep the IP address as a VARCHAR(15) but given the nature of what an IP address actually is, 4 integers in the range 0 to 255, this seems a trifle wasteful.

My initial thought was to use techniques described in Lee Everest's article Introduction to bitmasking in SQL2005

but if you read the forum discussion of the article you will see an interesting suggestion by Joe Celko.

So what possiblities do we have? Let us consider the ip address 192.168.0.5

Method	Storage	Comment
VARCHAR(15)	Between 7 and 15 bytes	Stores the IP address in human readable but this is wasteful.
BIGINT	8 bytes	We can represent our IP address as 192168000005. It is stretching the definition of human readability somewhat but this depends on your audience.
INT	4 bytes	Our IP address is no longer human readable being represented as 1084751877.
Four separate TINYINT fields	4 bytes	Our address is now both efficient and human readable just as Joe Celko pointed out.

SQL 2005 CLR User Defined Types

SQL2005 provides us with one other possiblity. The .NET assembly user defined type.

I was fortunate to go on the Microsoft "Updating Your Database Development Skills to Microsoft SQL Server 2005" (Course 2734B) which included

an IP address UDT. As I am not sure of the copyright issues surrounding the code for the UDT I am not including the source code here but the UDT

provided the following functionality.

Accept an ip address in the form nnn.nnn.nnn.nnn
Return the individual bytes of an IP address
Return the string representation of the IP address
Return a varbinary representation of the IP address
Return a string with the PING command and the IP address. I removed this from the code as it was irrelevant.

There are similar functions available on the web.

Test methodology

I decided to create five tables and into each put 1 million rows. Each table would contain a single INT IDENTITY field (IPID)acting as the primary key, plus the chosen method of storing an IP address.

Table	Comment
IPAddressSource	IP Address stored as 4 separate TINY INT fields
IPAddressINT	IP Address stored as a single INT field
IPAddressBIGINT	IP Address stored as a single BIGINT field
IPAddressVARCHAR	IP Address stored as a single VARCHAR(15) field
IPAddressUDT	IP Address stored in a .NET CLR data type

After population I ran an sp_spaceused on each table using the stored procedures described in

Automated Monitoring Database Size Using sp_spaceused.

For those who caught the article first time around I have added the equivalent procedure for SQL2000 in the discussion forum

attached to the article.

Initial results

TableName	Rows	Reserved	Data	IndexSize	Unused
dbo.IPAddressBIGINT	1,000,000	20,872	20,784	88	0
dbo.IPAddressINT	1,000,000	16,904	16,808	72	16
dbo.IPAddressSource	1,000,000	16,904	16,808	72	24
dbo.IPAddressUDT	1,000,000	49,928	49696	192	40
dbo.IPAddressVARCHAR	1,000,000	30,088	29,960	120	8

Straight away we can see that the table using the CLR UDT is by far the worst in terms of storage size taking nearly 50MB

to store our million rows.

Running an sp_help reveals why.

Column_name	Type	Computed	Length	...etc
IPID	int	no	4
IPAddress	IPAddress	no	37

There is very little information available on the storage of data in CLR UDTs but what appears to happen is that

the instance of the datatype (including structure and data) is serialised and stored in a VARBINARY type format in the field.

This sort of begs the question as to what real-world use are CLR UDTs?

The motivation to try a CLR IPAddress type was that a single type could encapsulate the IP address. The type is atomic.

Storing the data as 4 TINYINT fields means that any one of the four parts of the IP address is meaningless in its own

right. It is not atomic.

In this case we can see that there is a heavy overhead in gaining atomicity.

The indexing question

We have established that for readability and efficient storage of data storing an IP address as 4 separate TINYINT fields is the most efficient. But what about searching on IP Addresses?

What I need to be able to do is to identify ranges of IP addresses that correspond to Web Bots. Given the sheer volume of records

I am going to need to discover the most efficient method of doing this.

The SQL Query engine decides for itself whether or not there is a benefit in using an index. If the index is not regarded as being

selective enough the engine will simply ignore the index.

To give an example I had a table containing details of cars where the primary key was make, model and year.

WHERE make='FORD' resulted in a TABLE SCAN

WHERE make='MASERATI' resulted in an INDEX SEEK

The only difference in the WHERE clause was the value being searched for.

Again if you look at the discussion for Lee Everest's article

to make the index as selective as possible Joe Celko suggests that an index be placed across the four TINYINT columns representing the IP Address but from the right-most

first, in effect recording the IP address backwards. As the digits in an IP address vary the most in the right hand portion and least in the left hand portion this

should make the index as selective as it can possibly be.

My first port of call was to create such an index on my dbo.IPAddressSource table.

The next step was to devise a query that would search for a range of IP addresses such as 72.100.5.25 to 74.50.25.1


/*
    Define parameters to test various ranges of IP addresses
*/
DECLARE
@LowIP1 TINYINT , @HighIP1 TINYINT ,
@LowIP2 TINYINT , @HighIP2 TINYINT ,
@LowIP3 TINYINT , @HighIP3 TINYINT ,
@LowIP4 TINYINT, @HighIP4 TINYINT 
SELECT 
@LowIP1 = 72, @HighIP1 = 74 ,
@LowIP2 = 100, @HighIP2 = 50 ,
@LowIP3 = 5, @HighIP3 = 25 ,
@LowIP4 = 25, @HighIP4 = 1 

SELECT * FROM dbo.IPAddressSource
WHERE
(
IP1= @LowIP1 
AND (
IP2>@LowIP2
OR (IP2=@LowIP2 AND ((IP3 = @LowIP3 AND IP4>=@LowIP4) OR IP3>@LowIP3))
)
)
OR (IP1>@LowIP1 AND IP1OR
(
IP1 = @HighIP1
AND (
IP2 OR (IP2=@HighIP2  AND ((IP3=@HighIP3 AND IP4OR IP3
As we can see the WHERE clause is a bit convoluted and this is just searching for one range of ip addresses.  We should also look at the execution plan.
The subtree cost is 4.56566

An alternative approach
We have established that storing an IP address as 4 TINYINT fields achieves two objectives

Storing the IP address in an efficient format
Storing the IP address in a format that is easily read by a human being

The problem areas are as follows

Searching for a range of addresses requires convoluted WHERE conditions
The query is expensive

As an alternative experiment I decided to create a new table called dbo.IPAddressCALC as follows

IF NOT EXISTS(SELECT 1 FROM INFORMATION_SCHEMA.TABLES WHERE TABLE_NAME='IPAddressCalc')
CREATE TABLE dbo.IPAddressCalc(
IPID int NOT NULL,
IP1 tinyint NOT NULL,
IP2 tinyint NOT NULL,
IP3 tinyint NOT NULL,
IP4 tinyint NOT NULL,
IPAddress AS 
CAST(
((CAST(IP1 AS BIGINT)* 16777216)
+( CAST(IP2 AS BIGINT)* 65536)
+( CAST(IP3 AS BIGINT)*256)
+IP4)
-2147483648 AS INT),
 CONSTRAINT PK_IPAddressCalc PRIMARY KEY CLUSTERED (IPID ASC)
) 
GO

Our IPAddress field is a calculated field using the bitmasking technique and contains an INT value.

To aid our search I placed an index on this column. To search this column now our query would become

SELECT * FROM dbo.IPAddressCalc
WHERE IPAddress BETWEEN @Low AND @High

We would probably create function to return the values for @Low and @High from supplied IP addresses.

Our WHERE clause is much simpler and it is now easy to write a query that can handle multiple ranges of ip addresses.

Our execution plan is also slightly different.

Simple WHERE clause execution plan

The subtree cost is 3.73884

Wrapping it all up

The final step is to example sp_spaceused results for our two tables. We are interested

in the amount of data and index space taken up by each item.

TableName	Rows	Reserved	Data	IndexSize	Unused
dbo.IPAddressCalc	1,000,000	30,864	16,808	13,952	104
dbo.IPAddressSource	1,000,000	27,856	16,808	10,960	88

Adding a calculated column has no affect on the data held within the table, however, over 1 million rows the index takes

up just under 3MB more space.

The final note to add is that our analysis tables would be populated using some form of bulk loading process of the previous

day's results. Placing an index on a calculated field of a highly dynamic table could potentially cause performance problems.

Addendum

To develop this solution I generated 1 million random IP addresses.

For those of us who have cursed the SQL Server RAND function my population query was as follows

IF NOT EXISTS(SELECT 1 FROM dbo.IPAddressSource)
BEGIN
SET ROWCOUNT 1000000
INSERT INTO dbo.IPAddressSource(IP1, IP2, IP3, IP4)
SELECT 
FLOOR(256*RAND(cast(cast(left(newid(),8) as varbinary  ) as int )) ),
FLOOR(256*RAND(cast(cast(left(newid(),8) as varbinary  ) as int )) ),
FLOOR(256*RAND(cast(cast(left(newid(),8) as varbinary  ) as int )) ),
FLOOR(256*RAND(cast(cast(left(newid(),8) as varbinary  ) as int )) )
FROM master.dbo.sysobjects AS O1 ,
master.dbo.sysobjects AS O2 
SET ROWCOUNT 0
END

The method works because NEWID() returns a number that is supposed to be unique in time and space and consists fo a consistant format

containing hexadecimal strings.

What I am doing is taking the first block of 8 characters, converting the resulting string to a VARBINARY value and this resulting value

is then converted into an integer value.

I don't actually care what the integer value is as I am simply feeding it into the RAND function as a seed.

What is the best way to store an IP Address in SQL Server

The data type question

SQL 2005 CLR User Defined Types

Test methodology

Initial results

The indexing question

An alternative approach

Wrapping it all up

Addendum

Rate

Share

Categories

Share

Rate

Storing IPs in SQL Server

What is the best way to store an IP Address in SQL Server

The data type question

SQL 2005 CLR User Defined Types

Test methodology

Initial results

The indexing question

An alternative approach

Wrapping it all up

Addendum

Rate

Share

Categories

Share

Rate

Related content

Discuss content posted by geramkumar

Tame Those Strings - Using SUBSTRING

Tame Those Strings! Part 2 - Using CHARINDEX

Tame Those Strings! Part 5 - Using STUFF

SP_MSForeachtable - Life Without Cursors.