Generate all possible characters for UNICODE using SQL Script

Question

Post reply

Generate all possible characters for UNICODE using SQL Script

Murthy-152277

Right there with Babe

Points: 720
More actions
June 29, 2011 at 1:36 pm

#240063

I need to test some functionality for UNICODE. I would like to generate all possible characters for UNICODE. Is there any way in SQL Server to generate this data.
Thanks in Advance

Viewing 14 posts - 1 through 13 (of 13 total)

You must be logged in to reply to this topic. Login to reply

Jayanth_Kurup SSC-Insane Points: 22967 More actions · Answer 1

Lookup NCHAR and unicode

select Nchar(50)

select unicode(2)

Jayanth Kurup[/url]

Lowell SSC Guru Points: 323518 More actions · Answer 2

ok i thought Unicode is just chars 1 to 255, but as i started pushing the limits of my example below, I get all sorts of unique chinese characters.

is this something like you wanted?

--35891 distinct chars on my server

SELECT distinct Nchar(Tally.N) as MyNChar

FROM

( SELECT top 100000 row_number() over(order by a.name) AS N from sys.columns a CROSS JOIN sys.columns b)Tally

Lowell

--help us help you! If you post a question, make sure you include a CREATE TABLE... statement and INSERT INTO... statement into that table to give the volunteers here representative data. with your description of the problem, we can provide a tested, verifiable solution to your question! asking the question the right way gets you a tested answer the fastest way possible!

Orlando Colamatteo SSC Guru Points: 182279 More actions · Answer 3

Murthy-152277 (6/29/2011)
I need to test some functionality for UNICODE. I would like to generate all possible characters for UNICODE. Is there any way in SQL Server to generate this data.
Thanks in Advance

"Unicode" is ambiguous. Which character encoding are you looking for?

There are no special teachers of virtue, because virtue is taught by the whole community.
--Plato

j.miner SSCrazy Points: 2264 More actions · Answer 4

I hope this explains your question in depth.

First, SQL Server only supports UTF16 with the NCHAR, NVCHAR and NTEXT datatypes. That means, a character is represented by 2 bytes or 16 bits which gives you 65536 unique combinations.

Second, SQL Server does support collation sequences which are used for cultural specifiic data.

The default collation is installed on my server, SQL_Latin1_General_CP1_CI_AS.

I converted the complicated tally table example into a while loop and stored the characters into a table.

There are 44306 mappings under the default collation in which 35889 are distinct.

Check out the links below to find out more information on unicode and collation sequences.

The newest standard that is out is UTF32 which supports 2 raised to the 32 power or 4.2 Billion combinations.

-- Wikipedia - Unicode defined

-- http://en.wikipedia.org/wiki/Unicode#Scripts_covered

-- Collation & Unicode Support

-- http://technet.microsoft.com/en-us/library/ms143503.aspx

-- Collations described

-- http://msdn.microsoft.com/en-us/library/aa174903(v=sql.80).aspx

-- NChar Defined in books online

-- MYNUM Is a positive whole number from 0 through 65535. If a value outside this range is specified, NULL is returned.

-- Collation sequence for temp db

SELECT collation_name FROM sys.databases WHERE name = 'tempdb'

GO

-- CREATE TABLE TO HOLD ASC NUMBER & UNICODE CHAR (UTF-16)

CREATE TABLE #TMP1(MYNUM INT, MYCHAR NCHAR(1))

GO

-- DECLARE LOOP VARIABLE

DECLARE @v-2 AS INT

SELECT @v-2 = 0;

-- CREATE DATA

WHILE (@V <= 65535)

BEGIN

INSERT INTO #TMP1 VALUES (@V, NCHAR(@V));

PRINT STR(@V, 6, 0) + ' = ' + NCHAR(@V);

SELECT @v-2 = @v-2 + 1;

END

GO

-- TOTAL NUMBER OF CHARACTERS

SELECT COUNT(*) AS TOTAL_MAPPINGS FROM #TMP1 WHERE MYCHAR IS NOT NULL AND MYCHAR <> '';

GO

-- TOTAL DISTINCT CHARACTERS

SELECT COUNT (DISTINCT MYCHAR) AS TOTAL_DISTINCT

FROM #TMP1 WHERE MYCHAR IS NOT NULL AND MYCHAR <> '';

-- REMOVE TABLE

DROP TABLE #TMP1

GO

John Miner
Crafty DBA
www.craftydba.com

Jeff Moden SSC Guru Points: 1004414 More actions · Answer 5

j.miner (6/30/2011)
I converted the complicated tally table example into a while loop and stored the characters into a table.

What makes you think the Tally table example is so "complicated" compared to the While Loop you wrote???

--Jeff Moden

RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

Change is inevitable... Change for the better is not.

Helpful Links:
How to post code problems
How to Post Performance Problems
Create a Tally Function (fnTally)

j.miner SSCrazy Points: 2264 More actions · Answer 6

Hi Jeff,

I think you presented Tally Tables at the SQL Server Users Group in Southern New England last summer.

I guess I am old school and need to think more in terms of sets. I always like to use KISS when designing solutions. Therefore, I give kodos to the Tally example for that.

However, my explaination shows that there is only 65536 possibilities which is key to UTF16.

Also, I do not like the fact that you are crossing two tables that to create a tally table. The data in the two tables has really nothing to do with the resulting in memory table having a 100K rows. Correct?

Best Regards

John

John Miner
Crafty DBA
www.craftydba.com

Lowell SSC Guru Points: 323518 More actions · Answer 7

one of Jeffs other fine examples is creating a CTE Tally table that doesn't touch any existing tables, but creates it all on the fly;

there's not really much of a difference, for me, a few more lines of code is all, but functionally equivilent.

WITH E01(N) AS (SELECT 1 UNION ALL SELECT 1 UNION ALL

SELECT 1 UNION ALL SELECT 1 UNION ALL

SELECT 1 UNION ALL SELECT 1), -- 10 or 10E01 rows

E02(N) AS (SELECT 1 FROM E01 a, E01 b), -- 100 or 10E02 rows

E04(N) AS (SELECT 1 FROM E02 a, E02 b), -- 10,000 or 10E04 rows

E08(N) AS (SELECT 1 FROM E04 a, E04 b), --100,000,000 or 10E08 rows

E16(N) AS (SELECT 1 FROM E08 a, E08 b), --10E16 or more rows than you'll EVER need

Tally(N) AS (SELECT ROW_NUMBER() OVER (ORDER BY N) FROM E16)

--35891 distinct chars on my server

SELECT distinct Nchar(X.N) as MyNChar

FROM

( SELECT top 100000 N from Tally) X

Lowell

--help us help you! If you post a question, make sure you include a CREATE TABLE... statement and INSERT INTO... statement into that table to give the volunteers here representative data. with your description of the problem, we can provide a tested, verifiable solution to your question! asking the question the right way gets you a tested answer the fastest way possible!

Lowell SSC Guru Points: 323518 More actions · Answer 8

also, using sys.columns is just an easy way to find a lot of items;

if you doi this:

--SQL 2005= 419

--SQL 2008= 483

select count(*) from model.sys.columns

that shows me that on my SQL2005 databases it returns 419, on my SQL 2008 it returns 483 rows.,

so cross joining that table agaisnt itself, at a minimum, in any database, returns at least 175561 rows, so if i can use that known quanity to generate some quick data, it's good for me.

if i need more, i can cross join it agaisnt itself a third time or more, till i have the reserve of rows i need.

Lowell

--help us help you! If you post a question, make sure you include a CREATE TABLE... statement and INSERT INTO... statement into that table to give the volunteers here representative data. with your description of the problem, we can provide a tested, verifiable solution to your question! asking the question the right way gets you a tested answer the fastest way possible!

Jeff Moden SSC Guru Points: 1004414 More actions · Answer 9

Lowell (6/30/2011)
also, using sys.columns is just an easy way to find a lot of items;
if you doi this:
--SQL 2005= 419
--SQL 2008= 483
select count(*) from model.sys.columns
that shows me that on my SQL2005 databases it returns 419, on my SQL 2008 it returns 483 rows.,
so cross joining that table agaisnt itself, at a minimum, in any database, returns at least 175561 rows, so if i can use that known quanity to generate some quick data, it's good for me.
if i need more, i can cross join it agaisnt itself a third time or more, till i have the reserve of rows i need.

Use sys.all_columns instead. If you have the correct privs (and most do), it will have over 4,000 rows in it even on a brand new install.

On SQL Server 2000 or less, use master.dbo.syscolumns

--Jeff Moden

RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

Change is inevitable... Change for the better is not.

Helpful Links:
How to post code problems
How to Post Performance Problems
Create a Tally Function (fnTally)

Jeff Moden SSC Guru Points: 1004414 More actions · Answer 10

j.miner (6/30/2011)
Hi Jeff,
I think you presented Tally Tables at the SQL Server Users Group in Southern New England last summer.

Nope... You guys already had a speaker. I was just an attendee and we talked about it somewhere along the line.

However, my explaination shows that there is only 65536 possibilities which is key to UTF16.

You can easily do the same thing with the Tally Table solution. The Tally Table solution will also allow you to avoid the need for the creation of a Temp Table. Give it a try.

Also, I do not like the fact that you are crossing two tables that to create a tally table. The data in the two tables has really nothing to do with the resulting in memory table having a 100K rows. Correct?

Yes, but so what? 🙂 The CROSS JOIN forms a very, very high speed "pseudo cursor" that will blow the old-school While Loop method out of the water for performance. You're allowing personal preference to hold you back for writing high performance code. 😉

--Jeff Moden

RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

Change is inevitable... Change for the better is not.

Helpful Links:
How to post code problems
How to Post Performance Problems
Create a Tally Function (fnTally)

j.miner SSCrazy Points: 2264 More actions · Answer 11

🙂

Hi Jeff & Lowell,

All and all, it is cool stuff and I have squirelled the E16 example away.

I will have to remember to use it in the real world.

-John

John Miner
Crafty DBA
www.craftydba.com

Orlando Colamatteo SSC Guru Points: 182279 More actions · Answer 12

j.miner (6/30/2011)
I hope this explains your question in depth.
First, SQL Server only supports UTF16 with the NCHAR, NVCHAR and NTEXT datatypes. That means, a character is represented by 2 bytes or 16 bits which gives you 65536 unique combinations.

Whoa, just so we're all on the same page, 'N'ational data types in SQL Server are stored using the UCS-2 encoding, which is related to, but is not to be used interchangeably with UTF-16. UCS-2 is a subset of UTF-16. Their functional similarities end after the first Unicode plane (Basic Multilingual Plane, or BMP, or plane 0) which is where UTF-16 starts representing characters using multiple pairs of bytes. UCS-2 is only represent characters as a single pair of bytes.

That said, you can store any code point you want in a column declared with a National data type, however that does not mean that your collation or the data consumer will know what to do with it if the code point does not appear on the code page their applying to the data. Even SSMS is not setup to display everything in the BMP (see below).

Second, SQL Server does support collation sequences which are used for cultural specifiic data.
The default collation is installed on my server, SQL_Latin1_General_CP1_CI_AS.

Collation is a whole other issue that does not affect data storage of Unicode data in the least (see below for a proof).

The newest standard that is out is UTF32 which supports 2 raised to the 32 power or 4.2 Billion combinations.

You can store UTF-32 data in a National data type field, however SQL Server will not be able to do much with it in terms of applying a Collation since it evaluates each pair of bytes discretely and so cannot make sense of an encoding that represents each character as 2 pairs of bytes. This means things like the LEN() function tend to start lying.

Piggybacking on Lowell's example here is how to see every code point (0-65535) usable in a National data type column:

IF OBJECT_ID(N'tempdb..#tmp') > 0

DROP TABLE #tmp ;

GO

CREATE TABLE #tmp

(

my_code_point INT,

my_char NCHAR(1)

) ;

GO

WITH E01(N) AS (SELECT 1 UNION ALL SELECT 1 UNION ALL

SELECT 1 UNION ALL SELECT 1 UNION ALL

SELECT 1 UNION ALL SELECT 1), -- 10 or 10E01 rows

E02(N) AS (SELECT 1 FROM E01 a, E01 b), -- 100 or 10E02 rows

E04(N) AS (SELECT 1 FROM E02 a, E02 b), -- 10,000 or 10E04 rows

E08(N) AS (SELECT 1 FROM E04 a, E04 b), --100,000,000 or 10E08 rows

E16(N) AS (SELECT 1 FROM E08 a, E08 b), --10E16 or more rows than you'll EVER need

Tally(N) AS (SELECT 0 UNION ALL SELECT ROW_NUMBER() OVER (ORDER BY N) FROM E16)

INSERT INTO #tmp

(my_code_point, my_char)

SELECT N, Nchar(X.N) as MyNChar

FROM

( SELECT top 65536 N from Tally) X

GO

/* notice that nothing was escaped with a ? mark.There are two legit code

points in the BMP for question marks: the ASCII version 63 (the one we get

on our keyboards) and there is also one on the BMP at code

point 65311 (0xFF1F) */

SELECT *

FROM #tmp

WHERE my_char = N'?' ;

GO

-- look for other problems where SQL Server changed one of the values for us

SELECT UNICODE(my_char) AS unicode_function_output,

CONVERT(VARBINARY(2), my_char),

my_code_point,

my_char

FROM #tmp

WHERE UNICODE(my_char) != my_code_point ;

-- We're OK, returns 0 rows

-- BTW the same cannot be said of VARCHAR columns, see below for another code sample

GO

That's not to say that all code points will represent a character, because they won't, not even for the code page SSMS uses (did you notice the Unicode question mark from the code example above looked weird in SSMS?), but there it is. What does this mean? It means you can store ANY Unicode encoding in SQL Server as long as you do not expect SQL Server to know what you're talking about all the time. This is useful in some cases, but things like LEN() can be wrong.

VARCHAR escaping demo (this is where collations come into play in terms of munging your data):

IF OBJECT_ID(N'tempdb..#tmp') > 0

DROP TABLE #tmp ;

GO

CREATE TABLE #tmp

(

my_code_point INT,

my_char CHAR(1)

) ;

GO

WITH E01(N) AS (SELECT 1 UNION ALL SELECT 1 UNION ALL

SELECT 1 UNION ALL SELECT 1 UNION ALL

SELECT 1 UNION ALL SELECT 1), -- 10 or 10E01 rows

E02(N) AS (SELECT 1 FROM E01 a, E01 b), -- 100 or 10E02 rows

E04(N) AS (SELECT 1 FROM E02 a, E02 b), -- 10,000 or 10E04 rows

E08(N) AS (SELECT 1 FROM E04 a, E04 b), --100,000,000 or 10E08 rows

E16(N) AS (SELECT 1 FROM E08 a, E08 b), --10E16 or more rows than you'll EVER need

Tally(N) AS (SELECT 0 UNION ALL SELECT ROW_NUMBER() OVER (ORDER BY N) FROM E16)

INSERT INTO #tmp

(my_code_point, my_char)

SELECT N, NCHAR(X.N) as MyNChar

FROM

( SELECT N from Tally WHERE N < 256) X

GO

-- notice that things were escaped with a ? mark

SELECT *

FROM #tmp

WHERE my_char = '?' ;

GO

-- look for other problems where SQL Server changed one of our values for us

SELECT ASCII(my_char) AS ascii_function_output,

my_code_point,

my_char

FROM #tmp

WHERE ASCII(my_char) != my_code_point ;

-- Munged data!

GO

EDIT: fix comments in VARCHAR code example

There are no special teachers of virtue, because virtue is taught by the whole community.
--Plato

Jeff Moden SSC Guru Points: 1004414 More actions · Answer 13

j.miner (6/30/2011)
🙂
Hi Jeff & Lowell,
All and all, it is cool stuff and I have squirelled the E16 example away.
I will have to remember to use it in the real world.
-John

Start using such methods in the not-so-real-world and your real-world will become much easier. It's very much like practicing the piano... unless your goal is to become a comedian, you never practice hitting the wrong notes.

--Jeff Moden

RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

Change is inevitable... Change for the better is not.

Helpful Links:
How to post code problems
How to Post Performance Problems
Create a Tally Function (fnTally)