August 7, 2015 at 1:56 pm
Guys - I've searched the forums and found some posts regarding the performance impact of changing text column sizes.
However, many of them were regarding older releases of SQL Server.
Does the architecture in the 2012 and 2014 releases make this less of an issue?
In other words, why bother going from Varchar(1000) to Varchar(50)?
I'm just thinking maybe columns are compressed automatically now.
August 9, 2015 at 12:31 pm
That's a great question and is the subject of many a battle between front-end developers that are supposedly avoiding the evils of "pre-optimization" and DBAs. As with all else in SQL Server, the answer is "It Depends" but I usually recommend taking the time to right-size variable length columns.
To start with and ignoring a conversation of BLOB datatypes (that's a whole 'nuther subject), just as long as all the data for a row fits within the limits of a row on a page (accepted maximum is 8,060 bytes), there's THEORETICALLY no penalty for making every VARCHAR a VARCHAR(8000) (or whatever size may strike you). I say THEORETICALLY because I've not actually done any performance testing for such a thing because it never struck me to try it. I've always "right sized" such columns. You can just bet that I WILL be doing some testing in the next week or so. 😉
There are some hidden drawbacks to making all VARCHARs so big (even just 1000 bytes), though.
First, an index key can only be a max of 900 bytes. Now, SQL Server will give you a warning about that if you create an index on a VARCHAR column or combination of columns that could exceed 900 bytes, but it will still allow you to create the index. Further, the index will work just fine provided that the data itself never exceeds the 900-byte limit. But, what happens when it does? BOOM! The index goes bad and you're stuck with an emergency that has a "dead-ass panic" qualification.
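Here's a quick sketch of the warning I'm talking about (the table and index names are made up just for the demo):
use tempdb;
go
-- Hypothetical demo table; a single varchar(1000) column can exceed the key limit.
create table dbo.WideKeyDemo (SomeCode varchar(1000) not null);
go
-- This succeeds, but raises a warning that the maximum key length (900 bytes
-- on the versions discussed here) could be exceeded by long values.
create index IX_WideKeyDemo_SomeCode on dbo.WideKeyDemo (SomeCode);
go
-- cleanup
drop table dbo.WideKeyDemo;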
To go along with that, let's talk about implicit data checking. If a column has other dependencies where it shouldn't be allowed to be any longer than 9 characters (a Zip Code column, for example), then a limit on the column length is the final line of defense in the area of error checking. Could you do such error checking from the front-end? Absolutely, and you probably should. But the front-end isn't the only thing that might make entries into the table. Having a length limit (and a couple of other checks built in as a constraint) on the column can catch mistakes that someone either forgot to handle in the front-end code (and, YES, that happens all the bloody time) or that came from some other source, including manual corrections. A length limit also alerts you when you have a bad data file where, for example, someone right-padded delimited columns with spaces into fixed fields so that the delimiters line up for human readability. That can also happen if someone sent you data from a CHAR datatype. The trailing spaces don't matter for joins but they're still stored.
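For example, something like this gives you that last line of defense (the table, column, and constraint names are just for illustration):
-- Illustrative only; junk longer than 9 characters fails at the table
-- instead of being silently stored.
create table dbo.CustomerAddress
(
    CustomerID int        not null,
    ZipCode    varchar(9) not null
        constraint CK_CustomerAddress_ZipCode_Digits
            check (ZipCode not like '%[^0-9]%')   -- digits only
        constraint CK_CustomerAddress_ZipCode_Length
            check (len(ZipCode) in (5, 9))        -- 5-digit ZIP or ZIP+4 without the dash
);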
What about something like a Name or Address column? Why limit those in the design of the column? The answer is because you just don't know how the data will be used downstream. For example, we generate letters for a lot of our customers. The letters are typically selected by CustomerID and printed. What if someone made a mistake either in the front-end or the back-end and included a small paragraph for the Name or Address column instead of what was supposed to be there? It does happen especially if humans are entering data. The letter would get printed with that junk. Even if we caught it before being mailed, it's still a problem that would have had a much better chance of being detected if the columns were "right sized".
Another possible problem is data going "out of row". Again, it's normally caused by bad data but, let's say you have a VARCHAR column that has no index on it and someone enters a wad of data from a file and they've made a column-offset mistake and loaded long data into the wrong column. Let's also say that the expected data being loaded just fits into a row without going over the 8,060-byte limit. Having the max expected column length would have caught that problem, but now you suddenly have two problems because the data for that row has gone "out of row" and you might have to rebuild the clustered index to recover after you fix the data.
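If you suspect that's already happened, something like this against the allocation metadata will show it (just a sketch; substitute your own table name for the made-up dbo.YourTable):
-- Sketch: any used pages returned here mean some rows have already
-- pushed variable-length data off-row.
select o.name        as table_name,
       au.type_desc,
       au.used_pages
from sys.allocation_units au
join sys.partitions p on p.hobt_id   = au.container_id
join sys.objects    o on o.object_id = p.object_id
where o.object_id = object_id('dbo.YourTable')
  and au.type_desc = 'ROW_OVERFLOW_DATA';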
Is changing the datatype (size, in this case) a pain in the butt? You bet it is, and you should definitely take the time to figure out a size large enough for the task (an Address column, for example) to avoid the eventuality of having to make it larger to accommodate good data, but it should also be small enough to at least give you a chance of detecting a problem based on the size of the data being inserted, regardless of source.
Shifting gears a bit and taking things to the extreme, one of the problems that I've seen because of the way that some front-end designer code will build a table is that all character-based "fields" are designed as NVARCHAR(256). Yeah... that's just <sarcasm>"perfect"</sarcasm> for something that's supposed to be all numeric digits like a Zip Code or a Telephone Number. Not only does that NOT have the benefit of data checking by length, it doubles the required storage because of Unicode behind the scenes, which doubles the requirements for IO, memory usage, backup time/space, restore time, etc, etc. It's especially <sarcasm>"perfect"</sarcasm> for things like "Y/N" columns because, instead of taking just one byte, it now takes 4... 2 for the content and 2 for the length/offset information stored in the table because it's variable length.
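A quick sanity check of just the data bytes (DATALENGTH doesn't show the extra ~2-byte variable-length offset, which comes on top of these numbers):
-- Bytes of the data itself for a single 'Y' value.
select datalength(cast('Y' as char(1)))         as char1_bytes,       -- 1
       datalength(cast(N'Y' as nvarchar(256)))  as nvarchar256_bytes; -- 2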
So, to summarize... does changing a VARCHAR(1000) to a more appropriate VARCHAR(50) based on the expected content constitute a PITA? Yes it does and always will, especially if there's already data in the table. Does it make a difference? Probably not in performance IF everything goes right (although I'm still going to test that). Does it constitute an evil "pre-optimization"? "It Depends", but generally not. It normally constitutes "good programming practices" for all the reasons previously stated. Personally, it's also an indication of the skill and attitude of the developer. When I see such things in tables and I can identify the creator of the table, it's usually an indication that there will be major performance issues in that person's code and other nuances in table design as well and, unfortunately, that indication usually turns out to be the reality.
--Jeff Moden
Change is inevitable... Change for the better is not.
August 10, 2015 at 2:44 am
Jeff Moden (8/9/2015)
But, what happens when it does? BOOM! The index goes bad and you're stuck with an emergency that has a "dead-ass panic" qualification.
The insert of a row with a key > 900 bytes will fail, the index won't be affected.
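A quick way to see that for yourself (the table and index names are made up for the demo):
-- Hypothetical repro: the index creation only warns; it's the oversized
-- insert that fails, and the index itself stays intact.
use tempdb;
go
create table dbo.KeyLimitDemo (SomeCode varchar(1000) not null);
create index IX_KeyLimitDemo_SomeCode on dbo.KeyLimitDemo (SomeCode);  -- warning only
go
insert into dbo.KeyLimitDemo (SomeCode) values (replicate('x', 901));  -- fails: key would exceed 900 bytes
insert into dbo.KeyLimitDemo (SomeCode) values (replicate('x', 900));  -- succeeds
go
drop table dbo.KeyLimitDemo;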
I'll design tables with appropriate varchar/nvarchar/numeric sizes. When I sit with my devs, I insist they do the same. If I come across a production server with overly large column sizes, I'll probably just leave them, as the time required to make the change vs. the gain from doing it is just too small in many cases.
Gail Shaw
Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability
August 10, 2015 at 3:42 am
Creating big varying columns can also influence the amount of memory that the server will use for a query. Each time the server runs a query, it needs to estimate how much memory the query will need. The amount of memory is determined by the operations the query will perform (sorts and hash joins, for example, need lots of memory), the number of records each stage of the query will work with, and the size of those records. Since there are no statistics about the actual size of the data in varying-length columns, SQL Server makes an assumption that is not always realistic (it generally assumes such columns are filled to about half their declared maximum size). The code below creates 2 tables that hold the exact same data, but the first table uses a varchar(10) column and the second uses a varchar(1000) column. If you run the code and insert 5,000 records, you can check the query plan for each select statement, and in the plan's properties you'll be able to see that each query is "asking" for a different amount of memory.
use tempdb
go
create table SmallVarchar (vc varchar(10) default ('aaaaaa'))
go
create table BigVarchar (vc varchar(1000) default ('aaaaaa'))
go
--Insert 5,000 records into both tables. Notice that they hold the exact same data
insert into SmallVarchar default values
insert into BigVarchar default values
go 5000
--Check the query plan's properties and see the difference in MemoryGrantInfo
select * from SmallVarchar order by vc
select * from BigVarchar order by vc
--cleanup
drop table SmallVarchar
drop table BigVarchar
Adi
--------------------------------------------------------------
To know how to ask questions and increase the chances of getting answers:
http://www.sqlservercentral.com/articles/Best+Practices/61537/
For better answers on performance questions, click on the following...
http://www.sqlservercentral.com/articles/SQLServerCentral/66909/
August 10, 2015 at 6:33 am
Thanks guys (and especially Jeff) for the great responses.
There's a data warehouse manager here who made practically everything varchar(1000).
There are no indexes on any of the tables yet, so when we add some, those warning messages should start to appear.
I'm going to write a stored procedure that will find the max DATALENGTH of each of the varchar columns and trim them to 120% of the max length.
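Something along these lines is what I have in mind (just a rough sketch that generates the probe queries; it skips varchar(max)):
-- Generates one query per varchar column, comparing the declared size
-- to the longest value actually stored.
select 'select ''' + s.name + '.' + t.name + '.' + c.name + ''' as column_name, '
     + cast(c.max_length as varchar(10)) + ' as declared_bytes, '
     + 'max(datalength(' + quotename(c.name) + ')) as max_used_bytes '
     + 'from ' + quotename(s.name) + '.' + quotename(t.name) + ';'
from sys.columns c
join sys.tables  t  on t.object_id     = c.object_id
join sys.schemas s  on s.schema_id     = t.schema_id
join sys.types   ty on ty.user_type_id = c.user_type_id
where ty.name = 'varchar'
  and c.max_length > 0;   -- varchar(max) reports -1 and is excluded
-- Run the generated statements, then size each column to the observed max plus some headroom.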