Wrote this email recently to a crew of developers who were shooting themselves in the foot with a database rich in varchar(max) data types.
Hey folks-
TL;DR: Stop using varchar(max). We're not storing books.
We need to review and avoid the varchar(max) data type in our tables. Here's a short treatise as to why.
In SQL Server, varchar(max) is intended to replace the old text data type, which differed from varchar(n) in that it was designed to store massive amounts of data: massive meaning more than 8,000 characters, all the way up to 2 GB in a single value. That's what the varchar(max), varbinary(max), and nvarchar(max) data types are optimized for: HUGE blocks of data in a single cell. We should only use them if we actually intend to store massive text and use the full-text indexing engine (a completely separate and specific topic for text blocks).
This is an oversimplification, but varchar(max) is designed to store data differently, specifically for large text blocks. It appears to behave the same as a varchar(n) field, and that's deceptive when we're only putting 100-200 characters in each row.
The big drawbacks biting us right now with varchar(max) have to do with indexing, and they apply regardless of how much data is actually in the field. A varchar(max) column can't be the key of a nonclustered index, even if it never stores more than 8,000 characters, and indexes that include it can't have ONLINE maintenance performed on them. As a result, it's generally a giant pain for indexing, a pain you only want to put up with if you absolutely have to.
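If you want to see it for yourself, here's a minimal repro (the table and column names are made up for illustration):

    -- A throwaway table with one (max) column and one sized column
    CREATE TABLE dbo.DemoNotes
    (
        NoteId    int IDENTITY(1,1) PRIMARY KEY,
        NoteMax   varchar(max) NOT NULL,
        NoteSized varchar(200) NOT NULL
    );

    -- Fails: SQL Server rejects a varchar(max) column as an index key
    CREATE NONCLUSTERED INDEX IX_DemoNotes_NoteMax ON dbo.DemoNotes (NoteMax);

    -- Works: the sized column indexes just fine
    CREATE NONCLUSTERED INDEX IX_DemoNotes_NoteSized ON dbo.DemoNotes (NoteSized);

Same data either way, but only one of those columns can ever be a useful index key.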
Furthermore, we're doing ourselves a disservice on performance, straight up. Unless you're storing books, (max) hurts performance. Check out this blog post: http://rusanu.com/2010/03/22/performance-comparison-of-varcharmax-vs-varcharn/
In short, varchar(max) is burdensome overkill for our datasets.
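If you're curious how many of these columns we're carrying, a query along these lines will list every (max) column in the database (just a sketch against the system catalog; filter it however you like):

    -- max_length = -1 is how SQL Server marks (max) columns
    SELECT t.name  AS table_name,
           c.name  AS column_name,
           ty.name AS type_name
    FROM   sys.columns c
           JOIN sys.tables t ON t.object_id = c.object_id
           JOIN sys.types  ty ON ty.user_type_id = c.user_type_id
    WHERE  ty.name IN ('varchar', 'nvarchar', 'varbinary')
           AND c.max_length = -1;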
So here's the solution: change varchar(max) to varchar(n), where n is a generous but appropriate number for that column's data. If the Excel import wizard creates varchar(max) columns for us, change them to varchar(8000), which is the highest number you can assign to a varchar field. Or better yet, once the data is in SQL, use this simple query to find the longest value actually stored in a column, then pad that number. For example:
select MAX(LEN([yourcolumn])) from yourtable
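Once you know the real maximum, the column change itself is a one-liner. A sketch with the same made-up names as above, assuming the query came back well under 200:

    -- Resize the column to something sane, with room to spare
    ALTER TABLE dbo.DemoNotes
        ALTER COLUMN NoteMax varchar(200) NOT NULL;

(Spell out the column's existing NULL/NOT NULL setting; ALTER COLUMN doesn't preserve it automatically.)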
The problem is, our SSIS packages are all very picky about data types and will break if we just change them. So, after making these table changes, you'll need to open your SSIS package, open the data flow destination or other affected object, hit OK to apply the new metadata, then save and deploy it again. No actual changes necessary.
This all came up because we have data quality issues with the fields Foo and Bar. Both of those columns are varchar(max). I'm dumping the varchar(max) data into temp tables with varchar(200) columns to get the queries to return in a reasonable amount of time.
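For reference, that workaround looks roughly like this (the source table name and the 200-character cut-off are just what happened to fit here):

    -- Copy the (max) columns into a temp table at a sane size, then query that instead
    SELECT CAST(Foo AS varchar(200)) AS Foo,
           CAST(Bar AS varchar(200)) AS Bar
    INTO   #trimmed
    FROM   dbo.SourceTable;

    SELECT Foo, Bar FROM #trimmed;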
Let me know if you have any questions!
William
I like to use the word treatise to prepare my audience for verbosity.