April 25, 2005 at 11:21 am
My company has a scenario where we would like to change the data type
of an existing primary key from an integer to a char, but we are
concerned about the performance implications of doing so. The script
for the two tables that we need to modify is listed below. Table
FR_Sessions contains a column named TransmissionID which is currently
an integer. This table contains about 1 million rows of data. Table
FR_VTracking also contains TransmissionID as part of its
primary key, and it contains about 35 million rows of data. These two
tables are frequently joined on TransmissionID (FR_Sessions is the
parent). The TransmissionID column is used primarily for joins and is
not typically displayed.
We need to change the TransmissionID data type from int to
char(7), and I have a few questions:
1) Would this introduce significant performance degradation? I have
read that char keys/indexes are slower than int/numeric.
2) Are there collation options (or any other optimizations) that we
could use to minimize the performance hit of the char(7)? If so, which
ones?
I am a software architect by trade, not a database guru, so please go
easy on me if I overlooked something obvious.
Any suggestions or information would be greatly appreciated.
Thanks,
Tim
-------------------
CREATE TABLE [FR_Sessions] (
    [TransmissionID] [int] IDENTITY (1, 1) NOT NULL,
    [PTUID] [varchar] (10) COLLATE SQL_Latin1_General_CP1_CI_AS NOT NULL,
    [PortNum] [numeric](6, 0) NOT NULL CONSTRAINT [DF_FR_Sessions_PortNum] DEFAULT (0),
    [CloseStatus] [varchar] (20) COLLATE SQL_Latin1_General_CP1_CI_AS NULL,
    [RecvBytes] [int] NULL,
    [SendBytes] [int] NULL,
    [EndDT] [datetime] NULL CONSTRAINT [DF_FR_Sessions_EndDT] DEFAULT (getutcdate()),
    [LocalEndDT] [datetime] NULL,
    [TotalTime] [int] NULL,
    [OffenderID] [numeric](9, 0) NULL,
    [UploadStatus] [char] (1) COLLATE SQL_Latin1_General_CP1_CI_AS NOT NULL CONSTRAINT [DF_FR_Sessions_UploadStatus] DEFAULT ('N'),
    [SchedBatchID] [numeric](18, 0) NULL,
    [SWVersion] [varchar] (10) COLLATE SQL_Latin1_General_CP1_CI_AS NULL,
    [DLST] [bit] NULL,
    [TZO] [smallint] NULL,
    [Processed] [bit] NOT NULL CONSTRAINT [DF_FR_Sessions_Processed] DEFAULT (0),
    [CallerID] [varchar] (13) COLLATE SQL_Latin1_General_CP1_CI_AS NULL,
    [PeerIP] [varchar] (16) COLLATE SQL_Latin1_General_CP1_CI_AS NULL,
    [XtraInfo] [varchar] (1024) COLLATE SQL_Latin1_General_CP1_CI_AS NULL,
    [IdType] [char] (1) COLLATE SQL_Latin1_General_CP1_CI_AS NULL,
    CONSTRAINT [PK_FR_Sessions] PRIMARY KEY CLUSTERED
    (
        [TransmissionID]
    ) WITH FILLFACTOR = 90 ON [PRIMARY]
) ON [PRIMARY]
CREATE TABLE [FR_VTracking] (
    [TransmissionID] [int] NOT NULL,
    [FrameNum] [int] NOT NULL,
    [LatDegrees] [float] NOT NULL,
    [LonDegrees] [float] NOT NULL,
    [Altitude] [float] NOT NULL,
    [Velocity] [float] NOT NULL,
    [NumPositions] [smallint] NOT NULL,
    [NavMode] [smallint] NOT NULL,
    [Units] [smallint] NOT NULL,
    [GPSTrackingID] [numeric](18, 0) IDENTITY (1, 1) NOT NULL,
    [dtStamp] [datetime] NULL,
    CONSTRAINT [PK_FR_VTracking] PRIMARY KEY CLUSTERED
    (
        [TransmissionID],
        [FrameNum]
    ) WITH FILLFACTOR = 90 ON [PRIMARY]
) ON [PRIMARY]
April 25, 2005 at 12:36 pm
The issue of char vs. int is simple:
1. storage size
2. the low-level comparison that goes on
#1 is straightforward: an int is 4 bytes and a char(n) is n bytes. The effect here is the number of index pages required to reference everything, and that depends on what other columns you're including in the index... then do the math.
#2 is a little more abstract. I suppose that if you're using a binary sort order (case sensitive, a=a only, as opposed to a=A aka insensitive) they're probably almost the same speed, int ~= char(4). There's a tiny hit in how the two are compared (extra processor work).
Factor #1 is all I ever worry about, because your greatest performance hit is loading/swapping pages around. Second to that is jumping around on the pages, and finally comparing data.
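To put rough numbers on the page math, here's a quick sketch (this ignores per-row overhead, so treat these as upper bounds, and it assumes a single-column key):

-- ~8,096 bytes of usable space on an 8 KB page
SELECT 8096 / 4 AS int_keys_per_page,    -- ~2,024 entries for an int key
       8096 / 7 AS char7_keys_per_page   -- ~1,156 entries for a char(7) key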
Why not do a quick test and see what works best?
Most often for me, I let the business requirements drive things. If the keys aren't going to get too huge, and the key data is something the user is aware of and is meaningful to them (or assigned by them by rule), I let it be char. If it is a system-assigned thing, or a behind-the-scenes thing (surrogate keys maintained by the app), I make it INT.
No doubt there's an in-depth article somewhere on this site...
April 25, 2005 at 1:19 pm
I personally don't think that you are going to perceive a big hit because of that change. I would worry if you were to use those columns for range, ordered, or arithmetic searches (you may need to zero-pad values and fix some "between" queries, etc.), but if you are not using them that way I wouldn't be concerned.
The size difference in this case, as expressed above, is negligible as long as you keep the indexes.
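To illustrate the zero-padding point (a hypothetical example, not from the original post): as strings, '9' sorts after '10', so range predicates break unless every value is padded to the full width.

SELECT CASE WHEN '9' BETWEEN '1' AND '10'
            THEN 'in range' ELSE 'out of range' END AS unpadded,  -- 'out of range'
       CASE WHEN '09' BETWEEN '01' AND '10'
            THEN 'in range' ELSE 'out of range' END AS padded     -- 'in range'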
Just my $0.02
* Noel
April 26, 2005 at 7:58 am
If the queries you mention are run so often that this is an issue, then why not keep the original identity key, add the new varchar() column, and add an index for it (if needed)? That way the original queries you use are not changed, and if you have any queries that require the new column, it's indexed, so you are OK with that as well. The only problems you run into then are the size of the new index (which could be a factor), the time used when inserting into the table (as it has to update the new index as well), and that you perhaps have to enforce uniqueness for the new column. Something like the sketch below.
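A rough sketch of that approach (the column and index names here are hypothetical, and the column should be backfilled for the existing rows before the unique index is created):

ALTER TABLE FR_Sessions ADD TransmissionCode char(7) NULL
-- ...backfill TransmissionCode for the existing rows here...
CREATE UNIQUE NONCLUSTERED INDEX IX_FR_Sessions_TransmissionCode
    ON FR_Sessions (TransmissionCode)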
It kinda goes against 'good' database design, as you now have two keys (sort of) for the same table, but sometimes you have to go that route when modifying an existing system and you do not have the time/luxury of making a possibly global change.
Cheers
TNS
SQL noob
April 26, 2005 at 8:42 am
Okay, completely ignoring the logical aspects here.
john has already hit the nail right on the head here. Even if you use a CHAR(4), I suspect sorting and/or joining operations will be faster on an INT column. That's true even when using a binary collation.
So, from a pure performance point of view I would stick with the INT PK. However, not everything in an RDBMS is about performance, and so you might identify valid reasons to modify your structure. But this depends on your requirements and your specific environment. It's hard to judge this from afar.
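As a quick illustration of what a binary collation changes (a sketch; both collation names are standard SQL Server ones, the second being the one from the original DDL):

SELECT CASE WHEN 'a' = 'A' COLLATE Latin1_General_BIN
            THEN 'equal' ELSE 'not equal' END AS binary_compare,  -- 'not equal'
       CASE WHEN 'a' = 'A' COLLATE SQL_Latin1_General_CP1_CI_AS
            THEN 'equal' ELSE 'not equal' END AS ci_compare       -- 'equal'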
--
Frank Kalis
Microsoft SQL Server MVP
Webmaster: http://www.insidesql.org/blogs
My blog: http://www.insidesql.org/blogs/frankkalis/
April 27, 2005 at 11:46 am
Another persistent myth of the database world. Make no mistake about it, it IS a myth that character data types necessarily offer inferior performance to numeric data. Your search arguments are converted to binary values and then compared with the binary values stored in the database, so assuming the same length of data (int or char(4)) there will be little or no difference in most cases.
Long ago I put this to the test. It's really quite easy to do. Create a table with three columns: an integer, a char(4), and a varchar(4).
Now populate the table by declaring a loop and incrementing an integer value. Populate the integer column with the counter value, and populate the other columns by converting that value to a character string. (I think I put about 2 million rows in mine.) This makes sure that all the rows have the same selectivity (unique) and that, without indexes, every query forces a full table scan. Next, turn on STATISTICS IO and STATISTICS TIME; you can also use SHOWPLAN_ALL. Run some simple queries against each column looking for a single value. You will notice that the queries against the character data do take additional time to compile. I think this is due to SQL needing to sort out character sets and sort order, but once the query plan is compiled the actual execution time is identical (or nearly so) to the integer data type.
Now, to make it a more true-to-life test, add a nonclustered index to each column and run the queries again. You will see that there is no meaningful, or consistent, difference in either the compile time or the execution time between the character and integer data types.
The difference between a varchar(255) and an integer column, on the other hand, will probably be significant, because the number of reads goes up...
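A sketch of that test (the table and column names are mine, not from the original post; I widened the character columns to 7 so two million values fit, and SET NOCOUNT keeps the loop from flooding the client):

SET NOCOUNT ON

CREATE TABLE KeyTest (
    IntKey   int        NOT NULL,
    CharKey  char(7)    NOT NULL,
    VCharKey varchar(7) NOT NULL
)

DECLARE @i int
SET @i = 1
WHILE @i <= 2000000
BEGIN
    INSERT INTO KeyTest (IntKey, CharKey, VCharKey)
    VALUES (@i,
            RIGHT('0000000' + CAST(@i AS varchar(7)), 7),  -- zero-padded
            CAST(@i AS varchar(7)))
    SET @i = @i + 1
END

SET STATISTICS IO ON
SET STATISTICS TIME ON

-- Full scans first (no indexes yet):
SELECT * FROM KeyTest WHERE IntKey = 123456
SELECT * FROM KeyTest WHERE CharKey = '0123456'
SELECT * FROM KeyTest WHERE VCharKey = '123456'

-- Then the more true-to-life version with nonclustered indexes:
CREATE NONCLUSTERED INDEX IX_KeyTest_Int   ON KeyTest (IntKey)
CREATE NONCLUSTERED INDEX IX_KeyTest_Char  ON KeyTest (CharKey)
CREATE NONCLUSTERED INDEX IX_KeyTest_VChar ON KeyTest (VCharKey)
-- ...and rerun the three SELECTs above.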
/*****************
If most people are not willing to see the difficulty, this is mainly because, consciously or unconsciously, they assume that it will be they who will settle these questions for the others, and because they are convinced of their own capacity to do this. -Friedrich August von Hayek
*****************/
May 2, 2005 at 1:40 am
And my 2 cents:
Don't forget: a char(n) is represented internally as an array of n bytes.
So to make a comparison, SQL Server has to loop and compare the bytes one after the other until a mismatch is found (or the end is reached).
Therefore, even if the algorithm is highly optimised, it is up to n times slower than a single integer comparison.
If you change your data type from int to char(7), it is even worse, because you are nearly doubling the size of the key (and also the database size; you can store fewer records within a page, so you need more disk accesses to read your data).
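The size difference is easy to see with DATALENGTH (a quick sketch):

SELECT DATALENGTH(CAST(42 AS int))       AS int_bytes,   -- 4
       DATALENGTH(CAST('42' AS char(7))) AS char7_bytes  -- 7 (blank-padded to full width)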
Bye
Gabor