June 11, 2009 at 1:02 am
"Elementary, my dear Watson" .
Good morning!
June 11, 2009 at 6:24 am
I agree that the answer appears simple, and SQL Server uses UCS-2 encoding, but if you take the question at face value, the answer looks like it could be up to 4 bytes for other Unicode encoding schemes.
June 11, 2009 at 6:25 am
Two bytes per character may be how Unicode is implemented in SQL Server - but code elements of one through four bytes are used in various definitions of Unicode. And some characters can be formed by combining code elements - so "characters" can be even longer.
June 11, 2009 at 9:10 am
AndyK (6/11/2009)
Two bytes per character may be how Unicode is implemented in SQL Server - but code elements of one through four bytes are used in various definitions of Unicode. And some characters can be formed by combining code elements - so "characters" can be even longer.
Pardon my ignorance, but what exactly is a "code element"?
June 11, 2009 at 9:39 am
James Rochez (6/11/2009)
AndyK (6/11/2009)
Two bytes per character may be how Unicode is implemented in SQL Server - but code elements of one through four bytes are used in various definitions of Unicode. And some characters can be formed by combining code elements - so "characters" can be even longer.
Pardon my ignorance, but what exactly is a "code element"?
Basically, a code element is a character, but the more precise term is used because "character" is ambiguous.
(I'm another one that chose 4).
June 11, 2009 at 10:10 am
Take e.g. the French e-acute "character". This consists of two "code elements": a base character (e) and a modifier (a non-spacing acute sign). There are separate Unicode characters for each of these, each of which can be encoded in different but equivalent ways, depending on your pre-agreed coding scheme. (To confuse things, there is also a single character representing "e-with an acute-sign", but that's another story).
The definition of "character" is actually a bit vague, when you examine it. In particular, it doesn't always exactly correspond to a "glyph", ie. the printed sign on the paper or screen.
The Unicode website (www.unicode.org) has the full low-down, including a gentle but rigorous intro.
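If you want to see this from SQL Server itself, here's a minimal sketch runnable in SSMS (the column aliases are just labels I made up): the precomposed é is a single 2-byte code element, while the base letter plus the combining acute is two code elements, i.e. four bytes.

    -- 233 = U+00E9 (precomposed é); 101 = U+0065 (e); 769 = U+0301 (combining acute)
    SELECT
        NCHAR(233)              AS Precomposed,   -- one code element
        NCHAR(101) + NCHAR(769) AS Decomposed,    -- base letter + combining acute
        DATALENGTH(NCHAR(233))              AS PrecomposedBytes,  -- 2
        DATALENGTH(NCHAR(101) + NCHAR(769)) AS DecomposedBytes;   -- 4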
June 11, 2009 at 10:28 am
dgabele (6/11/2009)
http://en.wikipedia.org/wiki/Unicode
I agree:
The question was how many bytes are required to store each character. Although the UTF-16 scheme allows a four-byte Unicode character to be stored in two successive storage positions of two bytes each, you still need four bytes to correctly identify the Unicode character. Thus, ipso facto, QED, and other Latin phraseology - you need four bytes.
Refer to essay -
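To make that concrete in SQL Server terms, here's a sketch assuming a default (non-supplementary-character, UCS-2) collation, which is all that existed at the time; the variable name is just illustrative. A supplementary character such as U+1D4AB has to be built and stored as a surrogate pair, i.e. two 2-byte code units:

    -- 55349/56491 = the surrogate pair D835/DCAB for U+1D4AB (mathematical script capital P)
    DECLARE @script_p nvarchar(2);
    SET @script_p = NCHAR(55349) + NCHAR(56491);

    SELECT
        DATALENGTH(@script_p) AS StorageBytes,  -- 4: two 2-byte code units
        LEN(@script_p)        AS CodeUnits;     -- 2: UCS-2 counts the two surrogates separately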
June 11, 2009 at 10:51 am
My $0.02:
Since 1996, Unicode has not been a 16-bit encoding (see Unicode.org).
There are three encoding schemes for Unicode: UTF-8, UTF-16, and UTF-32, but all of them can use up to four bytes per "character".
Mark makes an excellent point about the ambiguity in the term "character". I've never thought about it because I guess I've always automatically assumed that a "character" was tied to the base storage unit (an ASCII character is 1 byte, a Shift-JIS character is 2 bytes).
-Darren
June 11, 2009 at 11:14 am
In general, the number of bytes required to store a Unicode character DEPENDS on the encoding.
Since we know SQL Server uses UCS-2 encoding, the most correct answer is 2 bytes.
With UTF-8 encoding (not recognized by SQL Server), a character can require from 1 to 4 bytes (1 to 3 for characters in the Basic Multilingual Plane).
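As a quick sanity check of the "depends on the encoding" point (just a sketch; the column aliases are labels I picked): the same letter needs one byte in a varchar's code page and two bytes in nvarchar's UCS-2.

    -- Same character, two encodings, two storage sizes
    SELECT
        DATALENGTH('A')  AS VarcharBytes,   -- 1 byte (code page encoding)
        DATALENGTH(N'A') AS NvarcharBytes;  -- 2 bytes (UCS-2)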
June 11, 2009 at 12:10 pm
But the question was on Unicode, not UCS-2! SQL Server does not actually support Unicode; it supports UCS-2, which is a limited subset of the Unicode standard. If the author of the question wanted a SQL Server-specific answer, he should have worded the question differently.
-- Mark D Powell --
June 11, 2009 at 1:42 pm
Mark D Powell (6/11/2009)
But the question was on Unicode, not UCS-2! SQL Server does not actually support Unicode; it supports UCS-2, which is a limited subset of the Unicode standard. If the author of the question wanted a SQL Server-specific answer, he should have worded the question differently.
-- Mark D Powell --
Yeah, but given that the site is SQLServerCentral, we should have taken that into consideration... although you could make the argument that it can/does store Unicode in binary format... it just doesn't make it easy.
June 11, 2009 at 1:52 pm
That's a fair point and I'll reword the question and answer and award points back.
Most people working with SQL Server are laypeople, and to them Unicode is UCS-2: 2 bytes per character. It's a simplistic definition that's come about from somewhat lax documentation for SQL Server, and it doesn't correctly answer this. I didn't even know there were 4-byte Unicode characters until I did a bunch of work with iFTS last year.
June 12, 2009 at 1:44 am
Even worse is probably that Microsoft can't even seem to agree on what constitutes "Unicode". We ran into that at my workplace just recently: from what I was able to deduce, SSMS 2008 saves as UTF-16 LE with BOM when saving as unspecified "Unicode", but Visual Studio 2008 uses UTF-8 with BOM and calls that Unicode. This led to some serious difficulties when trying to merge source code changes between the main branch and a development feature branch, because files ended up with different encodings (which Visual Studio 2008 does not support merging of, even though it supports both character encodings).
It would be better, then, to be explicit about the encoding that is being used. I think it's a fairly safe bet that most people who use VS or SSMS could wrap their minds around the fact that UCS-2 (or even UTF-16) and UTF-8 are different, but it's much less obvious that "Unicode" and "Unicode" can mean different things, even between two tools from the same company.
June 12, 2009 at 6:26 am
I read the answer, but I thought a Unicode character could be between one and four bytes (a byte being 8 bits).