August 2, 2013 at 3:56 pm
Hi,
I have a concatenated field that is usually about 80 characters but sometimes has been up to 600 characters. It could conceivably be longer than that. How should I define that field in my table? Nvarchar(max)? Nvarchar(1000)? If I define it as 1000 will that make performance better or worse than max? And if I then try to insert into it with something more than 1000 characters then what will happen? The update attempt would simply error?
I have another table with the same situation except the field is usually about 10 characters and I've observed it as long as 9,033 characters. Is nvarchar(max) basically the only way to address that one?
Thanks!
Tai
August 2, 2013 at 4:59 pm
Hi,
There's a very small performance difference between max and non-max varchar types. http://rusanu.com/2010/03/22/performance-comparison-of-varcharmax-vs-varcharn/
If you specify a 1000 limit, you'll get errors about data being truncated if you try and exceed that limit
Varchar(max) restricts you in some other ways though, like what indexes can be defined, you can't do online index rebuilds, etc...
August 2, 2013 at 6:01 pm
Thanks. Given that info I will stick with max as I have no need to index by these columns.
Tai
August 2, 2013 at 9:35 pm
taigovinda (8/2/2013)
Thanks. Given that info I will stick with max as I have no need to index by these columns.Tai
Before you jump to that conclusion, please consider the following:
adb2303 (8/2/2013)
Hi,There's a very small performance difference between max and non-max varchar types. http://rusanu.com/2010/03/22/performance-comparison-of-varcharmax-vs-varcharn/
If you specify a 1000 limit, you'll get errors about data being truncated if you try and exceed that limit
Varchar(max) restricts you in some other ways though, like what indexes can be defined, you can't do online index rebuilds, etc...
I guess it depends on what you consider to be a "small performance difference". On only a million rows, the difference between 4.9 and 9.2 seconds might seem trivial. On the kinds of tables people work with nowadays, it's not trivial at all. The truth is that the difference there is that VARCHAR(8000) measures out to be 87% faster. That's almost twice as fast and that's not trivial in my book at all.
Shifting gears a bit, you really have to be careful how you measure things. The real truth is that both VARCHAR(8000) and VARCHAR(MAX) both run lightning quick when you use them as they were meant to be used instead of in a WHILE loop. The problem there is that it also shows that VARCHAR(MAX) is comparatively 9 times slower (for this test, anyway. Can be more or less of a difference, depending on what you're doing, but will usually be slower) than VARCHAR(8000) on the first test even though both are much faster. Don't take my word for it, though. Here's the test jig that proves it. Notice... 1 million iterations with no explicit loop.
SET NOCOUNT ON;
--===== Test Variable Assignment 1,000,000 times using VARCHAR(8000)
DECLARE @SomeString VARCHAR(8000),
@StartTime DATETIME
;
SELECT @startTime = GETDATE()
;
SELECT TOP 1000000
@SomeString = 'ABC'
FROM master.sys.all_columns ac1,
master.sys.all_columns ac2
;
SELECT Duration = DATEDIFF(ms,@StartTime,GETDATE())
;
GO
--===== Test Variable Assignment 1,000,000 times using VARCHAR(MAX)
DECLARE @SomeString VARCHAR(MAX),
@StartTime DATETIME
;
SELECT @startTime = GETDATE()
;
SELECT TOP 1000000
@SomeString = 'ABC'
FROM master.sys.all_columns ac1,
master.sys.all_columns ac2
;
SELECT Duration = DATEDIFF(ms,@StartTime,GETDATE())
;
GO
Here are the results from my humble laptop... the VARCHAR(8000) runs in a tenth of the time that VARCHAR(MAX) does (so, 9 times faster, really).
Duration
-----------
60
Duration
-----------
623
In the larger scheme of things, even the difference of only 56.33 seconds on 100 million rows might not be enough to sway you towards trying to avoid a BLOB datatype like VARCHAR(MAX) if you can. But, consider what else a BLOB datatype does that very few people are actually aware of. Here's a quote from the "ALTER INDEX" section of Books Online.
Indexes, including indexes on global temp tables, can be rebuilt online with the following exceptions:
- Disabled indexes
- XML indexes
- Indexes on local temp tables
- Partitioned indexes
- Clustered indexes if the underlying table contains LOB data types
- Nonclustered indexes that are defined with LOB data type columns
That's right. You can't avoid the fact that the Clustered Index inherently contains ALL columns, including BLOB columns, at the leaf level of the index because the leaf level of all Clustered Indexes is actually the table itself.
My bottom line recommendation is try to avoid BLOB datatypes whenever possible. "Right Sizing" is still the right thing to do.
P.S. For the purists in the crowd, yes, I realize that a LOB isn't quite the same as a BLOB. I'm just using the term to cover all data-types that can exceed the size of a page, binary or not.
--Jeff Moden
Change is inevitable... Change for the better is not.
August 2, 2013 at 9:55 pm
Thanks Jeff. So I suppose I should use nvarchar(8000) for the column that has never topped 1000 characters, but still am stuck using nvarchar(max) for the column that has topped 9000 characters already. Does the n in front of varchar matter? I think I read that varchar was slightly faster but nvarchar was more reliable or something. By the way my table accumulates about 1M rows per year...
Can you please explain how the 1,000,000 iterations works in your sample code? (Right now I'm very glad that there is a 'newbie' section on this forum and I am in it!) Yesterday I was for the first time using stuff for xml path to concatenate and I tried to use a cte, then do the stuff for xml path on the cte... It ran for well over an hour and did not complete. Then instead of the cte I used a physical table and the query ran in 3 seconds! I suspect I was re-running the cte for each row or something and it is related to my failure to understand when I am forcing the machine to work by iteration instead of by sets...
Thanks!!
Tai
August 2, 2013 at 10:38 pm
taigovinda (8/2/2013)
Thanks Jeff. So I suppose I should use nvarchar(8000) for the column that has never topped 1000 characters, but still am stuck using nvarchar(max) for the column that has topped 9000 characters already. Does the n in front of varchar matter? I think I read that varchar was slightly faster but nvarchar was more reliable or something. By the way my table accumulates about 1M rows per year...Can you please explain how the 1,000,000 iterations works in your sample code? (Right now I'm very glad that there is a 'newbie' section on this forum and I am in it!) Yesterday I was for the first time using stuff for xml path to concatenate and I tried to use a cte, then do the stuff for xml path on the cte... It ran for well over an hour and did not complete. Then instead of the cte I used a physical table and the query ran in 3 seconds! I suspect I was re-running the cte for each row or something and it is related to my failure to understand when I am forcing the machine to work by iteration instead of by sets...
Thanks!!
Tai
NP, Tai.
First, The "N" on NVARCHAR makes the column a Unicode column. Typically, that requires 2 bytes per character (1 for the character, 1 for the character set code or "country"). That means that NVARCHAR takes twice the amount of space for nothing if the language being used doesn't require Unicode to display the characters of the language.
That also means that there's no such thing as NVARCHAR(8000). The max limit of NVARCHAR (without going to the MAX length notation) is NVARCHAR(4000).
My recommendation is to look at "datatypes" in Books Online (press the {f1} key in SSMS to get there) and see what they all actually mean.
Shifting gears to the million row test...
If you have a some table with 4,000 rows in it, what does the following give you?
SELECT * FROM dbo.SomeTable
That's right... it gives you all of the columns of the table and makes your DBA scream at you for not limiting your query but that's not what I'm looking for. What else does it give you?
Because there's no TOP or WHERE clause, it will give you all 4,000 rows in the table.
What does the following give you?
SELECT 1 FROM dbo.SomeTable?
That's right. It will give you a "1" for every row there is in the table. In this case, the table has 4,000 rows in it so it will give you 4,000 "1"s. Notice that it did that without using anything from the table itself.
With the idea that dbo.SomeTable has at least 4,000 rows in it, what does the following give you? (don't actually run this one... it could take quite a while)
SELECT 1 FROM dbo.SomeTable t1, dbo.SomeTable t2
Another way of writing that same thing is ...
SELECT 1 FROM dbo.SomeTable t1 CROSS JOIN dbo.SomeTable t2
If you know what a CROSS JOIN is (Cartesian Product, returns the entire content of 1 table for each and every row of the other table) and given that SomeTable has at least 4,000 rows in it (we joined it to itself), it will return at least 16 MILLION (4,000 * 4,000) 1's.
If we want to limit the number of 1's returned, we can use TOP like this...
SELECT TOP 1000000 1 dbo.SomeTable t1, dbo.SomeTable t2
We could also change the 1 to something like 'ABC' and return a million 'ABC's...
SELECT TOP 1000000 'ABC' dbo.SomeTable t1, dbo.SomeTable t2
There's nothing that says we can't assign each of those same 'ABC's to a given variable (overwriting the content of the variable for each row previously produced) so we end up with...
DECLARE @SomeString VARCHAR(8000);
SELECT TOP 1000000 @SomeString = 'ABC' dbo.SomeTable t1, dbo.SomeTable t2;
... which is what I posted. It's a test that does 1 million variable assignments without using an explicit loop.
It does, however, use the implicit loop built into every multirow SELECT. R. Barry Young coined the phrase "Pseudo Cursor" to describe this action and I've called it that ever since.
Hopefully, that answers your questions. If it doesn't, c'mon back with more questions.
Since we're talking about using the content of tables to replace what a loop does, I recommend you study the following article. It WILL change the way you think about how to do things that loop in the future.
http://www.sqlservercentral.com/articles/T-SQL/62867/
If you'd like to see the method in action for generating more test data than you can shake a stick at, please see the following two articles...
http://www.sqlservercentral.com/articles/Data+Generation/87901//
http://www.sqlservercentral.com/articles/Test+Data/88964/
--Jeff Moden
Change is inevitable... Change for the better is not.
August 2, 2013 at 10:45 pm
Forgot about the original problem. Since you've already determined that some data exceeds both NVARCHAR(4000) and VARCHAR(8000), then one of the MAX datatypes is appropriate although I'd try to find out from the folks sending me the data if string > 8000 bytes were actually necessary.
--Jeff Moden
Change is inevitable... Change for the better is not.
August 3, 2013 at 8:16 am
Many thanks Jeff! This is just the kind of thing I need to get me started actually understanding instead of being a perpetual "stumble-my-way-through-it"-er... At this point I am not comfortable enough with SQL to even write a loop without looking at an example, so you could say I don't have bad habits yet! I read the first tally table article and it kinda blew my mind
As for my original problem... first thing I'll do is change up my tables so that they use varchar instead of nvarchar! The data is related to invoices paid by the company I work in and the 9,000-character entry was one where the person who paid it split it between about 100 different cost centers. Normally this field shows 1 or maybe two cost centers and I want to make it available so that folks become aware of the total amount we are paying to a given vendor whereas before they were likely only aware of their share of the payment. I may reconsider and decide that it's smarter to just show "your share," a grand total amount and a "# of ways split" field on our internal website, then let them click through to see individual rows if they want the list of who else paid... That way each field is normally under 100 characters and they can click through to a number of rows.
Tai
August 3, 2013 at 9:34 am
taigovinda (8/3/2013)
Many thanks Jeff! This is just the kind of thing I need to get me started actually understanding instead of being a perpetual "stumble-my-way-through-it"-er... At this point I am not comfortable enough with SQL to even write a loop without looking at an example, so you could say I don't have bad habits yet!
Heh... I was hoping to help prevent some of the bad habits that newbies develop. One of the first things they do in every programming class is teach people how to count from, say, 1 to 100 to demonstrate how to loop. I wish they'd do that in SQL Server classes but using Pseudo-Cursors instead to get people thinking about (and this is the key to high performance SQL) what they want to do to a column rather than what they want to do to a row.
I read the first tally table article and it kinda blew my mind
It IS a different way of thinking. Once you "get" something like that, you'll find 100's of ways to apply the same principles even if it's not for splitting, creating dates, etc, etc. It's a different paradigm for most folks especially if the have some procedural coding experience.
As for my original problem... first thing I'll do is change up my tables so that they use varchar instead of nvarchar! The data is related to invoices paid by the company I work in and the 9,000-character entry was one where the person who paid it split it between about 100 different cost centers. Normally this field shows 1 or maybe two cost centers and I want to make it available so that folks become aware of the total amount we are paying to a given vendor whereas before they were likely only aware of their share of the payment. I may reconsider and decide that it's smarter to just show "your share," a grand total amount and a "# of ways split" field on our internal website, then let them click through to see individual rows if they want the list of who else paid... That way each field is normally under 100 characters and they can click through to a number of rows.
How'd they do the 100 cost center thing? Did they create a comma separated list or ??? I ask because if this column has cost centers separated by delimiters (even a space can be a delimiter), we might be able to normalize the data for you so that you can do "normal" queries to aggregate data by cost center, etc.
--Jeff Moden
Change is inevitable... Change for the better is not.
August 3, 2013 at 9:54 am
I extract the data from various places and combine it. Once I'm done, for each invoice I have a row for each cost center (or order or network) that it was paid against. In most cases it is less than three lines - so I actually created the comma-separated field (using stuff for xml path()) in order to display on a report which receivers (cost center, order, network) got what piece of the invoice in one little box. I didn't realize until I put all the data together that this csv field occasionally runs much longer than what I would want to put in that box.
So, I already have the normalized data and all I need to do is decide to change my approach and instead of giving the user who paid 1k two boxes - "1k" and "~list of 100 cost centers that paid 1k each~;" I will give them "1k" and "100k total with 100 receivers" and they can click on "100k total with 100 receivers" if they want to see who paid what and it will give them the 100 rows... Then I can drop my csv column from the table altogether. It had seemed like a good idea because I wanted to see who was splitting invoices. Normally it would be something like lawn mowing for 1k/month but they would not know that they were one of three people each paying 1k so that is the concept...
I did read your article that you linked! But in this case I do not need to split the text I just need to refrain from combining it :-p
Thanks.
Tai
August 3, 2013 at 11:36 am
For a "newbie", you sure do seem to have your act together. That's an excellent idea for handling the larger cost center distributions. Well done.
--Jeff Moden
Change is inevitable... Change for the better is not.
August 3, 2013 at 5:56 pm
Hi Jeff,
Thanks for the words of encouragement. I've been getting the SSC newsletter for awhile now and quietly helping myself to some of the articles therein. So, well done!
I work in finance and we were just recently given free reign over our own server... The main thing I have going for me is that - unlike most of my peers - I still felt like a newbie even after I learned how to "select a.*, b.* from a inner join b on a.myKey = b.myKey" - and certainly will for a long time! I've been setting up all my text fields as nvarchar in each table that I build, simply because that was how all the existing ones had been done. I just tested varchar with this most recent table and sure enough the size is cut in half with no drawback that I can see. Now I can change all the old ones too!
When I browse the SSC newsletters I tend to be looking not to learn how to do something specific but just to get a feel for the realm of possibility - and then I read the articles if I could conceive of doing something like that myself... it is fantastic to be able to come here and get guidance from the experts as well.
Thanks!!
Tai
August 3, 2013 at 8:01 pm
There might actually be a drawback to switching NVARCHAR to VARCHAR depending on which ORM you're using. For example, Linq2SQL would usually send string parameters with the "N" prefix which makes them NVARCHAR. If you have a VARCHAR column that's being compared to, it will convert the entire column to NVARCHAR in memory before it tries to do the compare. That also means that the best you can get out of such an "ad hoc" query is an index scan.
Check with the front-end folks before you change and NVARCHAR columns to VARCHAR.
--Jeff Moden
Change is inevitable... Change for the better is not.
August 4, 2013 at 10:15 am
Jeff Moden (8/2/2013)[hrI guess it depends on what you consider to be a "small performance difference". On only a million rows, the difference between 4.9 and 9.2 seconds might seem trivial. On the kinds of tables people work with nowadays, it's not trivial at all. The truth is that the difference there is that VARCHAR(8000) measures out to be 87% faster. That's almost twice as fast and that's not trivial in my book at all.
Shifting gears a bit, you really have to be careful how you measure things. The real truth is that both VARCHAR(8000) and VARCHAR(MAX) both run lightning quick when you use them as they were meant to be used instead of in a WHILE loop. The problem there is that it also shows that VARCHAR(MAX) is comparatively 9 times slower (for this test, anyway. Can be more or less of a difference, depending on what you're doing, but will usually be slower) than VARCHAR(8000) on the first test even though both are much faster. Don't take my word for it, though. Here's the test jig that proves it. Notice... 1 million iterations with no explicit loop.
SET NOCOUNT ON;
--===== Test Variable Assignment 1,000,000 times using VARCHAR(8000)
DECLARE @SomeString VARCHAR(8000),
@StartTime DATETIME
;
SELECT @startTime = GETDATE()
;
SELECT TOP 1000000
@SomeString = 'ABC'
FROM master.sys.all_columns ac1,
master.sys.all_columns ac2
;
SELECT Duration = DATEDIFF(ms,@StartTime,GETDATE())
;
GO
--===== Test Variable Assignment 1,000,000 times using VARCHAR(MAX)
DECLARE @SomeString VARCHAR(MAX),
@StartTime DATETIME
;
SELECT @startTime = GETDATE()
;
SELECT TOP 1000000
@SomeString = 'ABC'
FROM master.sys.all_columns ac1,
master.sys.all_columns ac2
;
SELECT Duration = DATEDIFF(ms,@StartTime,GETDATE())
;
GO
Here are the results from my humble laptop... the VARCHAR(8000) runs in a tenth of the time that VARCHAR(MAX) does (so, 9 times faster, really).
Duration
-----------
60
Duration
-----------
623
In the larger scheme of things, even the difference of only 56.33 seconds on 100 million rows might not be enough to sway you towards trying to avoid a BLOB datatype like VARCHAR(MAX) if you can. But, consider what else a BLOB datatype does that very few people are actually aware of.
Two things about that:
First: (this is a trivial comment) on my laptop (Windows 8, SQL 2012) that shows a speed ratio of about 6 to 1 instead of your 10 to 1. Obviously it's quite platform dependent.
Second: (this is non-trivial) Assigning string values to a variable is not a particularly good bet as a performance metric for SQL. so I changed the code to stick a million rows in a table instead. When I did that, I found that varchar(max) was taking less time than varchar(8000), not more. Of course this probably depends on the length of the strings inserted, so I tried increasing the string size from 3 to 10: the result was that the speed advantage of varchar(max) over varchar(8000) was more pronounced than it was using 3 characters.
To really get to the bottom of this, I would have to use strings with random values and lengths, include some deletions as well as inserts, and include some updates too. I suspect the results would change if compression were used, so I would have to look at that too. But the tests with just insert instead of variable assignment seem sufficient to show that varchar(max) is not guaranteed to be slower than varchar(8000) in the real world.
Tom
August 4, 2013 at 9:08 pm
After Jeff got me thinking, I ended up not needing very long strings after all so went with much shorter varchar's.
One thing I observed that was interesting...
When using SSIS to import from Excel to a temp table that is varchar, I got errors on one particular file. That file had some unicode characters that the French-speaking folks must have put in. So I changed one field in the temp table to be nvarchar (and changed the data type in my SISS data conversion for that one field). I don't do anything to the offending field before porting it to a permanent table, and yet I did not need to use nvarchar on the permanent table. So SSIS / SQL Server somehow is smart enough to deal with whatever characters are in my data when going from Excel to an nvarchar table and then to a varchar table, but not directly from Excel to a varchar table. Not sure why this is but it was interesting to see.
Tai
Viewing 15 posts - 1 through 15 (of 18 total)
You must be logged in to reply to this topic. Login to reply
This website stores cookies on your computer.
These cookies are used to improve your website experience and provide more personalized services to you, both on this website and through other media.
To find out more about the cookies we use, see our Privacy Policy