UNICODE usage

Question

Post reply

UNICODE usage

Greg Larsen

SSC-Insane

Points: 20961
More actions
December 10, 2002 at 7:08 am

#58239

My co-workers and I are having a discussion regarding the need to use UNICODE data (nchar,nvarchar,ntext). Since UNICODE data types store 2-bytes, instead of 1, seems like these should only be used when the characters that might need to be stored can't be represented by the 256 1-byte characters associated with the server code page setting. If they are used for columns that store characters that are within the 256 1-byte character set, then you are just wasting storage space.
So I guess my question is, if you have a column whose data will not store any special characters, and the 256 character 1-byte character set has the all the characters that might possibly be stored in this column then you should define the column as either char,varchar, or text, and never nchar, nvarchar, or ntext, if your are trying to conserver disk space, I/O band width, and performance?
Gregory Larsen, DBA
If you looking for SQL Server Examples check out my website at http://www.geocities.com/sqlserverexamples
Gregory A. Larsen, MVP

Viewing 12 posts - 1 through 11 (of 11 total)

You must be logged in to reply to this topic. Login to reply

Steven. SSCrazy Eights Points: 9806 More actions · Answer 1

Remember the 8K page limit per row.

Hence if you use unicode you half the amount of data you can hold in a row.

Steven

Greg Larsen SSC-Insane Points: 20961 More actions · Answer 2

My point exactly. Unicode will require more pages, therefore more I/O bandwidth, and hence slower performance if used, when not needed.

Gregory Larsen, DBA

If you looking for SQL Server Examples check out my website at http://www.geocities.com/sqlserverexamples

Gregory A. Larsen, MVP

Antares686 SSC Guru Points: 125444 More actions · Answer 3

If you will not be using any of the characters outside the defined 256 use a varchar, char or other non-unicode datatype. If they will then use a unicode type (nvarchar, nchar). The thing with unicode to keep in mind is all items are stored in the unicode size 16-bit per byte, even a character from the first 256 items. Non-unicode stores 8-bit per byte. The number of bits per bte is different between unicode and non, this is the reason for the difference. Remember a byte by definition is not 8-bit but the space required to store 1 character.

Greg Larsen SSC-Insane Points: 20961 More actions · Answer 4

Thanks for the explaination Antares686. I'm not sure I agree with the last sentence, but I think I get your point.

Guess I'm from the old school where a byte was 8 bits, or two hex characters. I still have my IBM Green card (some where also in yellow) so I can determine what Dec. and Hex values represents a Instruction or BCDIC, EBCDIC and Ascii character.

Gregory Larsen, DBA

If you looking for SQL Server Examples check out my website at http://www.geocities.com/sqlserverexamples

Gregory A. Larsen, MVP

don1941 SSCarpal Tunnel Points: 4183 More actions · Answer 5

quote:
Remember a byte by definition is not 8-bit but the space required to store 1 character.

A byte is always a byte. But sometimes a character takes more than one byte.

8-bits are always an OCTET.

Antares686 SSC Guru Points: 125444 More actions · Answer 6

Please read.

http://dictionary.reference.com/search?q=byte

Ok just to add a little more. I found what I was looking for. Unicode is based on the 8bit byte, and is a MBCS (MultiByte Character Set). So we are speaking 8bit bytes here. Funny thing was I got this information years ago in a SQL class and had come across it before. The teacher (an MCT) specifically brought it up when talking about unicode. I didn't think anymore about it until now. I found Unicodes details on MSDN.

Edited by - antares686 on 12/10/2002 10:57:03 AM

Greg Larsen SSC-Insane Points: 20961 More actions · Answer 7

Very interesting definition, of byte not exactly what I thought, all based on machine you working on. I'm sure someday a byte as we know it today will change.......hopefully not in my lifetime.

Gregory Larsen, DBA

If you looking for SQL Server Examples check out my website at http://www.geocities.com/sqlserverexamples

Gregory A. Larsen, MVP

don1941 SSCarpal Tunnel Points: 4183 More actions · Answer 8

quote:
http://dictionary.reference.com/search?q=byte

So like I said: a byte's a byte.

In C/C++ and other HLL, there is a distinction between a BYTE and a CHARACTER. Some characters sets (notably UNICODE) require multiple bytes to represent a single character.

Whether part of a UNICODE column or a CHAR column, a byte is still 8 bits.

The LEN() function returns the number of CHARACTERS in a field, taking into account the data type.

You can get a column's length in bytes using COL_LENGTH(). I thought there was a BLEN or LENB function that also returned the actual number of bytes.

In the networking world, there was lots of furor over what a byte is so they coined OCTET to mean 8 bits. But to the rest of the world, a byte is a byte.

Signing off the philosophical discussion now.

AFPeterson SSCommitted Points: 1580 More actions · Answer 9

And now back to Unicode.

SQL Server stores unicode in what is known as UTF-16/UCS-2 format, which is effectively a two-byte format. Most windows apps support this format. DTS only reads UTF-16/UCS-2 format.

But much of the UNIX,Oracle world supports unicode in what is known as UTF-8 format. This is a mixed one-byte to three-byte format to store each character. With UTF-8 format, often most of the characters are in standard ASCII format, while only the extended characters are in the multi-byte format.

So if you plan on using Unicode, also focus on what clients you plan on supporting.

Andrew

What's the business problem you're trying to solve?

sbrown314 Grasshopper Points: 13 More actions · Answer 10

More on UTF-8 (Unicode Transformation Format)...

This is an encoding scheme to reduce the bandwidth required to transport unicode from end-point to end-point. At the actual end-points, UCS-2 is normally reverted to (ie. standard unicode). So one would normally store information (especially of the web flavour accessed internationally) in UCS-2 and push it out over the web in UTF-8. If you are using MSXML2 to transport your information, you will be using UTF-8 probably without even knowing it.

However, for the sake of saving on disk space, it is possible to store UTF-8 formatted information in a char/varchar/text field. But bear in mind that when you do so, full-text indexing will not work 100%, because it will interpret 2-byte and 3-byte characters as sequences of 2 and 3 characters.

AFPeterson SSCommitted Points: 1580 More actions · Answer 11

While we are on the topic of unicode, and

UTF-16/UCS-2 to UTF-8 conversion. Does anyone know where I can get some sample VB code to do the conversion. I have not had time to work it out myself.

Thanks, Andrew

What's the business problem you're trying to solve?

UNICODE usage

Cookies on SQLServerCentral