October 1, 2008 at 1:39 am
I come from a relative old school in design that views the datatype should really fit the data it pertains to.
To that end, for Character data, if the column is of a fixed size (say National Insurance Number) then it is right to define that column as Char(9).
If the data is variable in nature but small, like say postcode, then the benefits obtained (for space) in terms of using a Varchar are limited. So typically if the variable length data is say <15 characters, then there is possibly a good case for using Char(n).
The water gets muddied when you consider the larger items, say Address, Name, EMail_Address etc.
It is on this aspect that I would like some advice from the gurus out there.
Virtually all the SQL Server apps that are running in my company are vendor supplied products. In many cases the above concept of 'right-sizing' columns tends to be ignored in favour of character fields being defined as Varchar(255), Varchar(nnnn) or more recently Varchar(max), regardless of the underlying data.
So the question is:-
Should I challenge these practices, and if so, what sort of best practice considerations should be cited?
Thanks
October 1, 2008 at 3:08 am
Sounds like the vendor has a modelling tool with a default size of varchar(255). Having the right size can save on storage but also helps prevent 'incorrect' data being entered. For example if a short description field is limited to 100 characters then it will only ever have a short description entered into it; if it is a varchar(max) then people will start entering huge amounts of text that nobody will read.
October 1, 2008 at 3:36 am
If only it were but one vendor!!!
Had one example where the column length was being changed because the size of the field on the app screen was larger than the field on the database. Point being, it was going from a VARCHAR(255) to VARCHAR(4000) and all they really needed was the screen changed to limit it at worse to 255!!!
Sometimes you wonder at their logic!!
October 1, 2008 at 6:21 am
I'm beginning to see the same thing. As more and more shops start to use generated databases through products like nHibernate, the idea of laying out the data structure, of correctly sizing columns, is going away. Personally, yes, you should absolutely attempt to enforce standards. Unfortunately, you're only going to be able to do that for in-house developed products. Anything coming from vendor is going to be in whatever messed up mechanism they chose and you'll just have to cope.
"The credit belongs to the man who is actually in the arena, whose face is marred by dust and sweat and blood"
- Theodore Roosevelt
Author of:
SQL Server Execution Plans
SQL Server Query Performance Tuning
October 1, 2008 at 7:43 am
Here's how I do it. If I know the column is a fixed size I used char otherwise I use varchar. I've heard the argument that smaller columns should use char, but I still prefer varchar in those situations.
Jack Corbett
Consultant - Straight Path Solutions
Check out these links on how to get faster and more accurate answers:
Forum Etiquette: How to post data/code on a forum to get the best help
Need an Answer? Actually, No ... You Need a Question
October 6, 2008 at 10:33 am
If it wasn't appropriate to use a well matched datatype why the errrrmmmm heck would Db vendors put all that time and effort into providing them in the first place?
Jamming random (everything is a varchar(255)) or plain incorrect (char datatypes to store dates) into databases rather than actually spend 10 seconds considering the appropriate way to define and store the data - or 10 minutes actually doing some basic investigation and analysis - is basically nothing more than a lazy way for the hard of thinking to out themselves. Apart from anything else - using VARCHAR(MAX), apart for all the other obvious reasons for doing so, for every field just leaves additional potential vulnerabilities for SQL injection attacks. And don't let anyone sufficiently lacking in attention to detail to use a binary construct to store short text data tell you they'll sort it out at the front end .....
October 7, 2008 at 9:12 am
Exact space in varchar ?
OK, let's say you saw the obvious patterrn for telephone numbers in North America:
NPA.CON.NNNN 10 digits, right ?
NPA numbering plan agreement (ASA area code)
CON Central Office number (your phone's father or home)
NNNNN the rest of the phone number.
Well now there is talk of adding a digit because of the exploding demand for new phone lines (the three kids each wanting their own private line at home, faxes, etc.).
So if you had adopted the 10-digit space-saving configuration, now you have to go back and change it.
Same thing for ZIP code 5-digit right ? until they added an extra 4...
Remember when someone thought 640 kilobytes would be more than sufficient for any foreseeable need ? (I wished for an Operating System that would allow me to use my gigantic 2 MegaBytes RAM-- until I found out the new O.S. hogged all the memory for itself :pinch:).
Also, when I started out, I was miffed that Fortran did not have the boolean type so I could use only one memory location to store a bit. Until a teacher made me find out how much overhead it was to pack 8 such bits into a byte. The extra code required to handle this took considerably more space than was saved by packing the boolean in the same bytes.
So do allow yourself some margin for the space you want to use. Of course, avoid using char type if you do not have a fixed forever length, as trailing spaces are added and you have to deal with those.
October 7, 2008 at 4:50 pm
J (10/7/2008)
So if you had adopted the 10-digit space-saving configuration, now you have to go back and change it.
If you adopted a 12-digit char column instead, in a million row table you are now wasting 2MB of space, plus the overhead of reading that space while querying. If you store it in a varchar(12) column, you have the overhead that SQL Server introduces for all varchar columns. And if you have a varchar(12) to store a max ten digit number, Programmer A will store it as 9165551212, Programmer B will store it as 916-555-1212, and Programmer C will store it as 916.555.1212. Yes, formatting needs to be done on the client, but not everyone gets the word.
Same thing for ZIP code 5-digit right ? until they added an extra 4...
That's not a real good example, for a few reasons. The Plus 4 doesn't need to be human readable, and is only used when snail-mailing. Also, since the Plus 4 can change over time, many systems only lookup the plus 4 when preparing a mailing, like statements, and it's not stored locally. The Plus 4 was a separate, often optional, field and could be stored elsewhere in the table.
Also, when I started out, I was miffed that Fortran did not have the boolean type so I could use only one memory location to store a bit. Until a teacher made me find out how much overhead it was to pack 8 such bits into a byte. The extra code required to handle this took considerably more space than was saved by packing the boolean in the same bytes.
That depends, right? If you have four bit fields, they are stored in one byte (in SS2000). Or you could store four smallint fields for eight bytes. Again in your million row table, that is an extra 7MB. You would have to weigh the time cost of AND'ing and OR'ing against the cost of the extra space used. Yeah, 7MB is not a lot, but what is? 70MB? 700MB? It's only too much wasted space if you don't have it to spare, and typically you don't know you can't spare it until it's too late (the disk is full, or you can't get a bigger drive for your server).
So do allow yourself some margin for the space you want to use.
But where do you draw the line? In the early 1980s I worked at a software house that sold a package for medical billing and insurance claims. We stored patient gender as a one character field M/F. Until (I kid you not) a San Francisco urologist requested a third value for people who were undergoing sex-change operations. (not that there's anything wrong with that :unsure: ) It didn't change the size of the field, but it did mess up our business rules engine for validating claims. How do you plan for that?
October 7, 2008 at 6:20 pm
Dave (10/7/2008)
J (10/7/2008)
But where do you draw the line? In the early 1980s I worked at a software house that sold a package for medical billing and insurance claims. We stored patient gender as a one character field M/F. Until (I kid you not) a San Francisco urologist requested a third value for people who were undergoing sex-change operations. (not that there's anything wrong with that :unsure: ) It didn't change the size of the field, but it did mess up our business rules engine for validating claims. How do you plan for that?
If you use the ISO standard there is 0-Unknown, 1-Male, 2-Female, and 9-Other. Plus they left 6 for future use.
Jack Corbett
Consultant - Straight Path Solutions
Check out these links on how to get faster and more accurate answers:
Forum Etiquette: How to post data/code on a forum to get the best help
Need an Answer? Actually, No ... You Need a Question
October 8, 2008 at 2:46 am
J: Well now there is talk of adding a digit
In the UK we don't have a standard phone number length nor pattern (eg 023 1234 1234, 0117 123 1234, and 012345 1234 - it's all on Wikipedia if you like that sort of thing), so varchar is the place to be.
Additionally, twice in the last 15 years we have added digits and/or changed area dialling codes. That was, er, interesting for the database that used the phone number as the Primary Key for records of businesses - and yes, we did change all those Primary Keys and yes, we did manage to maintain referential integrity. It was not a modern SQL database, but a mainframe-derived DBMS.
Still, someone had thought phone number was a good PK. I can see it looking like a candidate key during initial analysis (natural key, superficially no two the same), but how it made it to primary key (subject to change, not necessarily unique...) I have no idea.
October 8, 2008 at 4:07 am
I think you are all getting off the original point.
My query was with regard to the appropriate-ness of using VARCHAR(MAX) when the vendor might have had a bit more thought to use a relevant sized field for the data being stored. Say VARCHAR(100) for EMAIL_ADDRESS, as opposed to VARCHAR(MAX).
My view is that if at the time of developing an application the size of a field e.g. National Insurance Number is 9 characters, then as this is fixed at 9 characters then using CHAR(9) is more prudent than say VARCHAR(MAX). If legislation or other requirements make National Insurance Numbers a variable value up to say 12 characters long, then at that time the system would need suitable amendments to cater for this basic requirement change. Probably in the form of CHAR(9) changing to VARCHAR(12).
If you build a system to cater for any future possibility, then it seems to me, every column would be defined as BIGINT, VARCHAR(MAX), DATETIME etc. and lets ignore the fundamental aspects of design? Particularly when you bring in R.I., indexing and other factors to make it run to an acceptable performance level.
Consider the future (i.e. If you hold a field that is a integer, what might need to happen to involve the size needing to consider BIGINT rather than INTEGER.), but don't be a slave to it!
BTW, thanks to all for the replies. I do appreciate the feedback and varied views from the floor.
October 8, 2008 at 9:36 am
Dave,
Thanks for the comments. I knew about the snail mail purpose for the extra 4 digits on a zip code but that they are subject to change was new info for me. But you still need them for mass mailings (my snail mailbox gets full so quickly).
Now, I suggest that if a database deals with millions of records, if I had to worry about 2 or 7 megabytes, I would seriously look at upgrading the capacity. I have rarely seen a system that performed well when its drive were filled 99 %. In fact never.
One of my former employer did convert its employee database to another system. They allocated just enough space for the address, resulting in the apartment number being truncated of its two rightmost digits. And the building supervisor not being very bright, of course he could not take the initiative to look up my name and place my tax receipts in my mailbox. Bloody inconvenience getting duplicates.
Ewan,
Interesting comment about the variable length of telephone numbers.
Regards all.
October 8, 2008 at 11:51 am
Jack, we know that now, but it was 1981, and we couldn't even spell ISO then.
October 8, 2008 at 12:08 pm
J (10/8/2008)
Now, I suggest that if a database deals with millions of records, if I had to worry about 2 or 7 megabytes, I would seriously look at upgrading the capacity. I have rarely seen a system that performed well when its drive were filled 99 %. In fact never.
Right, and we're in that position on one of our servers today. We have a big ole sloppy data warehouse DB that loads in five separate jobs, and one poor guy who babysits it every weekend because we get to 93% capacity or higher. Our corporate parent insists on us buying into their SAN, with a 3-year cost of about US$125K for 900GB. If we swap out a few of our drives and max the system, it will get us to 800GB and cost around US$15K, but the data center guys are dragging their feet. A better designed, tighter database would have saved us a lot of stomache acid and $$.
One of my former employer did convert its employee database to another system. They allocated just enough space for the address, resulting in the apartment number being truncated of its two rightmost digits.
That there's no excuse for. The format of the legacy data should have been in the spec. Plus, if you're in a country where they have addresses like "The Mews, 47 Hampshire Heath Turnaround, Puddleby-by-the-marsh" common sense should have told them how to work that.
Best, Dave
October 9, 2008 at 7:40 am
Interesting discussion. Also interesting that no one mentioned the potential inconvenience and/or performance hit of having to use RTRIM on a char field to avoid getting extra spaces in output that's copied and pasted into Excel from a results window. The larger the table the greater the performance hit associated with RTRIM. Also consider that having to check the rightmost characters for certain values in a char field where the data is NOT a fixed length means RTRIM will show up often in WHERE clauses, and that's worse than it being in a SELECT clause, if I understand things I've seen on this forum correctly.
Steve
(aka smunson)
:):):)
Steve (aka sgmunson) 🙂 🙂 🙂
Rent Servers for Income (picks and shovels strategy)
Viewing 15 posts - 1 through 15 (of 21 total)
You must be logged in to reply to this topic. Login to reply