June 29, 2011 at 1:36 pm
I need to test some functionality for UNICODE. I would like to generate all possible characters for UNICODE. Is there any way in SQL Server to generate this data.
Thanks in Advance
June 29, 2011 at 1:44 pm
June 29, 2011 at 1:54 pm
ok i thought Unicode is just chars 1 to 255, but as i started pushing the limits of my example below, I get all sorts of unique chinese characters.
is this something like you wanted?
--35891 distinct chars on my server
SELECT distinct Nchar(Tally.N) as MyNChar
FROM
( SELECT top 100000 row_number() over(order by a.name) AS N from sys.columns a CROSS JOIN sys.columns b)Tally
Lowell
June 29, 2011 at 2:12 pm
Murthy-152277 (6/29/2011)
I need to test some functionality for UNICODE. I would like to generate all possible characters for UNICODE. Is there any way in SQL Server to generate this data.Thanks in Advance
"Unicode" is ambiguous. Which character encoding are you looking for?
There are no special teachers of virtue, because virtue is taught by the whole community.
--Plato
June 30, 2011 at 5:48 am
I hope this explains your question in depth.
First, SQL Server only supports UTF16 with the NCHAR, NVCHAR and NTEXT datatypes. That means, a character is represented by 2 bytes or 16 bits which gives you 65536 unique combinations.
Second, SQL Server does support collation sequences which are used for cultural specifiic data.
The default collation is installed on my server, SQL_Latin1_General_CP1_CI_AS.
I converted the complicated tally table example into a while loop and stored the characters into a table.
There are 44306 mappings under the default collation in which 35889 are distinct.
Check out the links below to find out more information on unicode and collation sequences.
The newest standard that is out is UTF32 which supports 2 raised to the 32 power or 4.2 Billion combinations.
-- Wikipedia - Unicode defined
-- http://en.wikipedia.org/wiki/Unicode#Scripts_covered
-- Collation & Unicode Support
-- http://technet.microsoft.com/en-us/library/ms143503.aspx
-- Collations described
-- http://msdn.microsoft.com/en-us/library/aa174903(v=sql.80).aspx
-- NChar Defined in books online
-- MYNUM Is a positive whole number from 0 through 65535. If a value outside this range is specified, NULL is returned.
-- Collation sequence for temp db
SELECT collation_name FROM sys.databases WHERE name = 'tempdb'
GO
-- CREATE TABLE TO HOLD ASC NUMBER & UNICODE CHAR (UTF-16)
CREATE TABLE #TMP1(MYNUM INT, MYCHAR NCHAR(1))
GO
-- DECLARE LOOP VARIABLE
DECLARE @v-2 AS INT
SELECT @v-2 = 0;
-- CREATE DATA
WHILE (@V <= 65535)
BEGIN
INSERT INTO #TMP1 VALUES (@V, NCHAR(@V));
PRINT STR(@V, 6, 0) + ' = ' + NCHAR(@V);
END
GO
-- TOTAL NUMBER OF CHARACTERS
SELECT COUNT(*) AS TOTAL_MAPPINGS FROM #TMP1 WHERE MYCHAR IS NOT NULL AND MYCHAR <> '';
GO
-- TOTAL DISTINCT CHARACTERS
SELECT COUNT (DISTINCT MYCHAR) AS TOTAL_DISTINCT
FROM #TMP1 WHERE MYCHAR IS NOT NULL AND MYCHAR <> '';
-- REMOVE TABLE
DROP TABLE #TMP1
GO
John Miner
Crafty DBA
www.craftydba.com
June 30, 2011 at 6:11 am
j.miner (6/30/2011)
I converted the complicated tally table example into a while loop and stored the characters into a table.
What makes you think the Tally table example is so "complicated" compared to the While Loop you wrote???
--Jeff Moden
Change is inevitable... Change for the better is not.
June 30, 2011 at 6:23 am
Hi Jeff,
I think you presented Tally Tables at the SQL Server Users Group in Southern New England last summer.
I guess I am old school and need to think more in terms of sets. I always like to use KISS when designing solutions. Therefore, I give kodos to the Tally example for that.
However, my explaination shows that there is only 65536 possibilities which is key to UTF16.
Also, I do not like the fact that you are crossing two tables that to create a tally table. The data in the two tables has really nothing to do with the resulting in memory table having a 100K rows. Correct?
Best Regards
John
John Miner
Crafty DBA
www.craftydba.com
June 30, 2011 at 6:31 am
one of Jeffs other fine examples is creating a CTE Tally table that doesn't touch any existing tables, but creates it all on the fly;
there's not really much of a difference, for me, a few more lines of code is all, but functionally equivilent.
WITH E01(N) AS (SELECT 1 UNION ALL SELECT 1 UNION ALL
SELECT 1 UNION ALL SELECT 1 UNION ALL
SELECT 1 UNION ALL SELECT 1 UNION ALL
SELECT 1 UNION ALL SELECT 1 UNION ALL
SELECT 1 UNION ALL SELECT 1), -- 10 or 10E01 rows
E02(N) AS (SELECT 1 FROM E01 a, E01 b), -- 100 or 10E02 rows
E04(N) AS (SELECT 1 FROM E02 a, E02 b), -- 10,000 or 10E04 rows
E08(N) AS (SELECT 1 FROM E04 a, E04 b), --100,000,000 or 10E08 rows
E16(N) AS (SELECT 1 FROM E08 a, E08 b), --10E16 or more rows than you'll EVER need
Tally(N) AS (SELECT ROW_NUMBER() OVER (ORDER BY N) FROM E16)
--35891 distinct chars on my server
--35891 distinct chars on my server
SELECT distinct Nchar(X.N) as MyNChar
FROM
( SELECT top 100000 N from Tally) X
Lowell
June 30, 2011 at 6:43 am
also, using sys.columns is just an easy way to find a lot of items;
if you doi this:
--SQL 2005= 419
--SQL 2008= 483
select count(*) from model.sys.columns
that shows me that on my SQL2005 databases it returns 419, on my SQL 2008 it returns 483 rows.,
so cross joining that table agaisnt itself, at a minimum, in any database, returns at least 175561 rows, so if i can use that known quanity to generate some quick data, it's good for me.
if i need more, i can cross join it agaisnt itself a third time or more, till i have the reserve of rows i need.
Lowell
June 30, 2011 at 8:03 am
Lowell (6/30/2011)
also, using sys.columns is just an easy way to find a lot of items;if you doi this:
--SQL 2005= 419
--SQL 2008= 483
select count(*) from model.sys.columns
that shows me that on my SQL2005 databases it returns 419, on my SQL 2008 it returns 483 rows.,
so cross joining that table agaisnt itself, at a minimum, in any database, returns at least 175561 rows, so if i can use that known quanity to generate some quick data, it's good for me.
if i need more, i can cross join it agaisnt itself a third time or more, till i have the reserve of rows i need.
Use sys.all_columns instead. If you have the correct privs (and most do), it will have over 4,000 rows in it even on a brand new install.
On SQL Server 2000 or less, use master.dbo.syscolumns
--Jeff Moden
Change is inevitable... Change for the better is not.
June 30, 2011 at 8:12 am
j.miner (6/30/2011)
Hi Jeff,I think you presented Tally Tables at the SQL Server Users Group in Southern New England last summer.
Nope... You guys already had a speaker. I was just an attendee and we talked about it somewhere along the line.
However, my explaination shows that there is only 65536 possibilities which is key to UTF16.
You can easily do the same thing with the Tally Table solution. The Tally Table solution will also allow you to avoid the need for the creation of a Temp Table. Give it a try.
Also, I do not like the fact that you are crossing two tables that to create a tally table. The data in the two tables has really nothing to do with the resulting in memory table having a 100K rows. Correct?
Yes, but so what? 🙂 The CROSS JOIN forms a very, very high speed "pseudo cursor" that will blow the old-school While Loop method out of the water for performance. You're allowing personal preference to hold you back for writing high performance code. 😉
--Jeff Moden
Change is inevitable... Change for the better is not.
June 30, 2011 at 8:19 am
🙂
Hi Jeff & Lowell,
All and all, it is cool stuff and I have squirelled the E16 example away.
I will have to remember to use it in the real world.
-John
John Miner
Crafty DBA
www.craftydba.com
June 30, 2011 at 10:26 am
j.miner (6/30/2011)
I hope this explains your question in depth.First, SQL Server only supports UTF16 with the NCHAR, NVCHAR and NTEXT datatypes. That means, a character is represented by 2 bytes or 16 bits which gives you 65536 unique combinations.
Whoa, just so we're all on the same page, 'N'ational data types in SQL Server are stored using the UCS-2 encoding, which is related to, but is not to be used interchangeably with UTF-16. UCS-2 is a subset of UTF-16. Their functional similarities end after the first Unicode plane (Basic Multilingual Plane, or BMP, or plane 0) which is where UTF-16 starts representing characters using multiple pairs of bytes. UCS-2 is only represent characters as a single pair of bytes.
That said, you can store any code point you want in a column declared with a National data type, however that does not mean that your collation or the data consumer will know what to do with it if the code point does not appear on the code page their applying to the data. Even SSMS is not setup to display everything in the BMP (see below).
Second, SQL Server does support collation sequences which are used for cultural specifiic data.
The default collation is installed on my server, SQL_Latin1_General_CP1_CI_AS.
Collation is a whole other issue that does not affect data storage of Unicode data in the least (see below for a proof).
The newest standard that is out is UTF32 which supports 2 raised to the 32 power or 4.2 Billion combinations.
You can store UTF-32 data in a National data type field, however SQL Server will not be able to do much with it in terms of applying a Collation since it evaluates each pair of bytes discretely and so cannot make sense of an encoding that represents each character as 2 pairs of bytes. This means things like the LEN() function tend to start lying.
Piggybacking on Lowell's example here is how to see every code point (0-65535) usable in a National data type column:
IF OBJECT_ID(N'tempdb..#tmp') > 0
DROP TABLE #tmp ;
GO
CREATE TABLE #tmp
(
my_code_point INT,
my_char NCHAR(1)
) ;
GO
WITH E01(N) AS (SELECT 1 UNION ALL SELECT 1 UNION ALL
SELECT 1 UNION ALL SELECT 1 UNION ALL
SELECT 1 UNION ALL SELECT 1 UNION ALL
SELECT 1 UNION ALL SELECT 1 UNION ALL
SELECT 1 UNION ALL SELECT 1), -- 10 or 10E01 rows
E02(N) AS (SELECT 1 FROM E01 a, E01 b), -- 100 or 10E02 rows
E04(N) AS (SELECT 1 FROM E02 a, E02 b), -- 10,000 or 10E04 rows
E08(N) AS (SELECT 1 FROM E04 a, E04 b), --100,000,000 or 10E08 rows
E16(N) AS (SELECT 1 FROM E08 a, E08 b), --10E16 or more rows than you'll EVER need
Tally(N) AS (SELECT 0 UNION ALL SELECT ROW_NUMBER() OVER (ORDER BY N) FROM E16)
INSERT INTO #tmp
(my_code_point, my_char)
SELECT N, Nchar(X.N) as MyNChar
FROM
( SELECT top 65536 N from Tally) X
GO
/* notice that nothing was escaped with a ? mark.There are two legit code
points in the BMP for question marks: the ASCII version 63 (the one we get
on our keyboards) and there is also one on the BMP at code
point 65311 (0xFF1F) */
SELECT *
FROM #tmp
WHERE my_char = N'?' ;
GO
-- look for other problems where SQL Server changed one of the values for us
SELECT UNICODE(my_char) AS unicode_function_output,
CONVERT(VARBINARY(2), my_char),
my_code_point,
my_char
FROM #tmp
WHERE UNICODE(my_char) != my_code_point ;
-- We're OK, returns 0 rows
-- BTW the same cannot be said of VARCHAR columns, see below for another code sample
GO
That's not to say that all code points will represent a character, because they won't, not even for the code page SSMS uses (did you notice the Unicode question mark from the code example above looked weird in SSMS?), but there it is. What does this mean? It means you can store ANY Unicode encoding in SQL Server as long as you do not expect SQL Server to know what you're talking about all the time. This is useful in some cases, but things like LEN() can be wrong.
VARCHAR escaping demo (this is where collations come into play in terms of munging your data):
IF OBJECT_ID(N'tempdb..#tmp') > 0
DROP TABLE #tmp ;
GO
CREATE TABLE #tmp
(
my_code_point INT,
my_char CHAR(1)
) ;
GO
WITH E01(N) AS (SELECT 1 UNION ALL SELECT 1 UNION ALL
SELECT 1 UNION ALL SELECT 1 UNION ALL
SELECT 1 UNION ALL SELECT 1 UNION ALL
SELECT 1 UNION ALL SELECT 1 UNION ALL
SELECT 1 UNION ALL SELECT 1), -- 10 or 10E01 rows
E02(N) AS (SELECT 1 FROM E01 a, E01 b), -- 100 or 10E02 rows
E04(N) AS (SELECT 1 FROM E02 a, E02 b), -- 10,000 or 10E04 rows
E08(N) AS (SELECT 1 FROM E04 a, E04 b), --100,000,000 or 10E08 rows
E16(N) AS (SELECT 1 FROM E08 a, E08 b), --10E16 or more rows than you'll EVER need
Tally(N) AS (SELECT 0 UNION ALL SELECT ROW_NUMBER() OVER (ORDER BY N) FROM E16)
INSERT INTO #tmp
(my_code_point, my_char)
SELECT N, NCHAR(X.N) as MyNChar
FROM
( SELECT N from Tally WHERE N < 256) X
GO
-- notice that things were escaped with a ? mark
SELECT *
FROM #tmp
WHERE my_char = '?' ;
GO
-- look for other problems where SQL Server changed one of our values for us
SELECT ASCII(my_char) AS ascii_function_output,
my_code_point,
my_char
FROM #tmp
WHERE ASCII(my_char) != my_code_point ;
-- Munged data!
GO
EDIT: fix comments in VARCHAR code example
There are no special teachers of virtue, because virtue is taught by the whole community.
--Plato
June 30, 2011 at 3:17 pm
j.miner (6/30/2011)
🙂Hi Jeff & Lowell,
All and all, it is cool stuff and I have squirelled the E16 example away.
I will have to remember to use it in the real world.
-John
Start using such methods in the not-so-real-world and your real-world will become much easier. It's very much like practicing the piano... unless your goal is to become a comedian, you never practice hitting the wrong notes.
--Jeff Moden
Change is inevitable... Change for the better is not.
Viewing 14 posts - 1 through 13 (of 13 total)
You must be logged in to reply to this topic. Login to reply