Generate all possible characters for UNICODE using SQL Script

  • I need to test some functionality for UNICODE. I would like to generate all possible characters for UNICODE. Is there any way in SQL Server to generate this data.

    Thanks in Advance

  • Lookup NCHAR and unicode

    select Nchar(50)

    select unicode(2)

    Jayanth Kurup[/url]

  • ok i thought Unicode is just chars 1 to 255, but as i started pushing the limits of my example below, I get all sorts of unique chinese characters.

    is this something like you wanted?

    --35891 distinct chars on my server

    SELECT distinct Nchar(Tally.N) as MyNChar

    FROM

    ( SELECT top 100000 row_number() over(order by a.name) AS N from sys.columns a CROSS JOIN sys.columns b)Tally

    Lowell


    --help us help you! If you post a question, make sure you include a CREATE TABLE... statement and INSERT INTO... statement into that table to give the volunteers here representative data. with your description of the problem, we can provide a tested, verifiable solution to your question! asking the question the right way gets you a tested answer the fastest way possible!

  • Murthy-152277 (6/29/2011)


    I need to test some functionality for UNICODE. I would like to generate all possible characters for UNICODE. Is there any way in SQL Server to generate this data.

    Thanks in Advance

    "Unicode" is ambiguous. Which character encoding are you looking for?

    There are no special teachers of virtue, because virtue is taught by the whole community.
    --Plato

  • I hope this explains your question in depth.

    First, SQL Server only supports UTF16 with the NCHAR, NVCHAR and NTEXT datatypes. That means, a character is represented by 2 bytes or 16 bits which gives you 65536 unique combinations.

    Second, SQL Server does support collation sequences which are used for cultural specifiic data.

    The default collation is installed on my server, SQL_Latin1_General_CP1_CI_AS.

    I converted the complicated tally table example into a while loop and stored the characters into a table.

    There are 44306 mappings under the default collation in which 35889 are distinct.

    Check out the links below to find out more information on unicode and collation sequences.

    The newest standard that is out is UTF32 which supports 2 raised to the 32 power or 4.2 Billion combinations.

    -- Wikipedia - Unicode defined

    -- http://en.wikipedia.org/wiki/Unicode#Scripts_covered

    -- Collation & Unicode Support

    -- http://technet.microsoft.com/en-us/library/ms143503.aspx

    -- Collations described

    -- http://msdn.microsoft.com/en-us/library/aa174903(v=sql.80).aspx

    -- NChar Defined in books online

    -- MYNUM Is a positive whole number from 0 through 65535. If a value outside this range is specified, NULL is returned.

    -- Collation sequence for temp db

    SELECT collation_name FROM sys.databases WHERE name = 'tempdb'

    GO

    -- CREATE TABLE TO HOLD ASC NUMBER & UNICODE CHAR (UTF-16)

    CREATE TABLE #TMP1(MYNUM INT, MYCHAR NCHAR(1))

    GO

    -- DECLARE LOOP VARIABLE

    DECLARE @v-2 AS INT

    SELECT @v-2 = 0;

    -- CREATE DATA

    WHILE (@V <= 65535)

    BEGIN

    INSERT INTO #TMP1 VALUES (@V, NCHAR(@V));

    PRINT STR(@V, 6, 0) + ' = ' + NCHAR(@V);

    SELECT @v-2 = @v-2 + 1;

    END

    GO

    -- TOTAL NUMBER OF CHARACTERS

    SELECT COUNT(*) AS TOTAL_MAPPINGS FROM #TMP1 WHERE MYCHAR IS NOT NULL AND MYCHAR <> '';

    GO

    -- TOTAL DISTINCT CHARACTERS

    SELECT COUNT (DISTINCT MYCHAR) AS TOTAL_DISTINCT

    FROM #TMP1 WHERE MYCHAR IS NOT NULL AND MYCHAR <> '';

    -- REMOVE TABLE

    DROP TABLE #TMP1

    GO

    John Miner
    Crafty DBA
    www.craftydba.com

  • j.miner (6/30/2011)


    I converted the complicated tally table example into a while loop and stored the characters into a table.

    What makes you think the Tally table example is so "complicated" compared to the While Loop you wrote???

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Hi Jeff,

    I think you presented Tally Tables at the SQL Server Users Group in Southern New England last summer.

    I guess I am old school and need to think more in terms of sets. I always like to use KISS when designing solutions. Therefore, I give kodos to the Tally example for that.

    However, my explaination shows that there is only 65536 possibilities which is key to UTF16.

    Also, I do not like the fact that you are crossing two tables that to create a tally table. The data in the two tables has really nothing to do with the resulting in memory table having a 100K rows. Correct?

    Best Regards

    John

    John Miner
    Crafty DBA
    www.craftydba.com

  • one of Jeffs other fine examples is creating a CTE Tally table that doesn't touch any existing tables, but creates it all on the fly;

    there's not really much of a difference, for me, a few more lines of code is all, but functionally equivilent.

    WITH E01(N) AS (SELECT 1 UNION ALL SELECT 1 UNION ALL

    SELECT 1 UNION ALL SELECT 1 UNION ALL

    SELECT 1 UNION ALL SELECT 1 UNION ALL

    SELECT 1 UNION ALL SELECT 1 UNION ALL

    SELECT 1 UNION ALL SELECT 1), -- 10 or 10E01 rows

    E02(N) AS (SELECT 1 FROM E01 a, E01 b), -- 100 or 10E02 rows

    E04(N) AS (SELECT 1 FROM E02 a, E02 b), -- 10,000 or 10E04 rows

    E08(N) AS (SELECT 1 FROM E04 a, E04 b), --100,000,000 or 10E08 rows

    E16(N) AS (SELECT 1 FROM E08 a, E08 b), --10E16 or more rows than you'll EVER need

    Tally(N) AS (SELECT ROW_NUMBER() OVER (ORDER BY N) FROM E16)

    --35891 distinct chars on my server

    --35891 distinct chars on my server

    SELECT distinct Nchar(X.N) as MyNChar

    FROM

    ( SELECT top 100000 N from Tally) X

    Lowell


    --help us help you! If you post a question, make sure you include a CREATE TABLE... statement and INSERT INTO... statement into that table to give the volunteers here representative data. with your description of the problem, we can provide a tested, verifiable solution to your question! asking the question the right way gets you a tested answer the fastest way possible!

  • also, using sys.columns is just an easy way to find a lot of items;

    if you doi this:

    --SQL 2005= 419

    --SQL 2008= 483

    select count(*) from model.sys.columns

    that shows me that on my SQL2005 databases it returns 419, on my SQL 2008 it returns 483 rows.,

    so cross joining that table agaisnt itself, at a minimum, in any database, returns at least 175561 rows, so if i can use that known quanity to generate some quick data, it's good for me.

    if i need more, i can cross join it agaisnt itself a third time or more, till i have the reserve of rows i need.

    Lowell


    --help us help you! If you post a question, make sure you include a CREATE TABLE... statement and INSERT INTO... statement into that table to give the volunteers here representative data. with your description of the problem, we can provide a tested, verifiable solution to your question! asking the question the right way gets you a tested answer the fastest way possible!

  • Lowell (6/30/2011)


    also, using sys.columns is just an easy way to find a lot of items;

    if you doi this:

    --SQL 2005= 419

    --SQL 2008= 483

    select count(*) from model.sys.columns

    that shows me that on my SQL2005 databases it returns 419, on my SQL 2008 it returns 483 rows.,

    so cross joining that table agaisnt itself, at a minimum, in any database, returns at least 175561 rows, so if i can use that known quanity to generate some quick data, it's good for me.

    if i need more, i can cross join it agaisnt itself a third time or more, till i have the reserve of rows i need.

    Use sys.all_columns instead. If you have the correct privs (and most do), it will have over 4,000 rows in it even on a brand new install.

    On SQL Server 2000 or less, use master.dbo.syscolumns

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • j.miner (6/30/2011)


    Hi Jeff,

    I think you presented Tally Tables at the SQL Server Users Group in Southern New England last summer.

    Nope... You guys already had a speaker. I was just an attendee and we talked about it somewhere along the line.

    However, my explaination shows that there is only 65536 possibilities which is key to UTF16.

    You can easily do the same thing with the Tally Table solution. The Tally Table solution will also allow you to avoid the need for the creation of a Temp Table. Give it a try.

    Also, I do not like the fact that you are crossing two tables that to create a tally table. The data in the two tables has really nothing to do with the resulting in memory table having a 100K rows. Correct?

    Yes, but so what? 🙂 The CROSS JOIN forms a very, very high speed "pseudo cursor" that will blow the old-school While Loop method out of the water for performance. You're allowing personal preference to hold you back for writing high performance code. 😉

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • 🙂

    Hi Jeff & Lowell,

    All and all, it is cool stuff and I have squirelled the E16 example away.

    I will have to remember to use it in the real world.

    -John

    John Miner
    Crafty DBA
    www.craftydba.com

  • j.miner (6/30/2011)


    I hope this explains your question in depth.

    First, SQL Server only supports UTF16 with the NCHAR, NVCHAR and NTEXT datatypes. That means, a character is represented by 2 bytes or 16 bits which gives you 65536 unique combinations.

    Whoa, just so we're all on the same page, 'N'ational data types in SQL Server are stored using the UCS-2 encoding, which is related to, but is not to be used interchangeably with UTF-16. UCS-2 is a subset of UTF-16. Their functional similarities end after the first Unicode plane (Basic Multilingual Plane, or BMP, or plane 0) which is where UTF-16 starts representing characters using multiple pairs of bytes. UCS-2 is only represent characters as a single pair of bytes.

    That said, you can store any code point you want in a column declared with a National data type, however that does not mean that your collation or the data consumer will know what to do with it if the code point does not appear on the code page their applying to the data. Even SSMS is not setup to display everything in the BMP (see below).

    Second, SQL Server does support collation sequences which are used for cultural specifiic data.

    The default collation is installed on my server, SQL_Latin1_General_CP1_CI_AS.

    Collation is a whole other issue that does not affect data storage of Unicode data in the least (see below for a proof).

    The newest standard that is out is UTF32 which supports 2 raised to the 32 power or 4.2 Billion combinations.

    You can store UTF-32 data in a National data type field, however SQL Server will not be able to do much with it in terms of applying a Collation since it evaluates each pair of bytes discretely and so cannot make sense of an encoding that represents each character as 2 pairs of bytes. This means things like the LEN() function tend to start lying.

    Piggybacking on Lowell's example here is how to see every code point (0-65535) usable in a National data type column:

    IF OBJECT_ID(N'tempdb..#tmp') > 0

    DROP TABLE #tmp ;

    GO

    CREATE TABLE #tmp

    (

    my_code_point INT,

    my_char NCHAR(1)

    ) ;

    GO

    WITH E01(N) AS (SELECT 1 UNION ALL SELECT 1 UNION ALL

    SELECT 1 UNION ALL SELECT 1 UNION ALL

    SELECT 1 UNION ALL SELECT 1 UNION ALL

    SELECT 1 UNION ALL SELECT 1 UNION ALL

    SELECT 1 UNION ALL SELECT 1), -- 10 or 10E01 rows

    E02(N) AS (SELECT 1 FROM E01 a, E01 b), -- 100 or 10E02 rows

    E04(N) AS (SELECT 1 FROM E02 a, E02 b), -- 10,000 or 10E04 rows

    E08(N) AS (SELECT 1 FROM E04 a, E04 b), --100,000,000 or 10E08 rows

    E16(N) AS (SELECT 1 FROM E08 a, E08 b), --10E16 or more rows than you'll EVER need

    Tally(N) AS (SELECT 0 UNION ALL SELECT ROW_NUMBER() OVER (ORDER BY N) FROM E16)

    INSERT INTO #tmp

    (my_code_point, my_char)

    SELECT N, Nchar(X.N) as MyNChar

    FROM

    ( SELECT top 65536 N from Tally) X

    GO

    /* notice that nothing was escaped with a ? mark.There are two legit code

    points in the BMP for question marks: the ASCII version 63 (the one we get

    on our keyboards) and there is also one on the BMP at code

    point 65311 (0xFF1F) */

    SELECT *

    FROM #tmp

    WHERE my_char = N'?' ;

    GO

    -- look for other problems where SQL Server changed one of the values for us

    SELECT UNICODE(my_char) AS unicode_function_output,

    CONVERT(VARBINARY(2), my_char),

    my_code_point,

    my_char

    FROM #tmp

    WHERE UNICODE(my_char) != my_code_point ;

    -- We're OK, returns 0 rows

    -- BTW the same cannot be said of VARCHAR columns, see below for another code sample

    GO

    That's not to say that all code points will represent a character, because they won't, not even for the code page SSMS uses (did you notice the Unicode question mark from the code example above looked weird in SSMS?), but there it is. What does this mean? It means you can store ANY Unicode encoding in SQL Server as long as you do not expect SQL Server to know what you're talking about all the time. This is useful in some cases, but things like LEN() can be wrong.


    VARCHAR escaping demo (this is where collations come into play in terms of munging your data):

    IF OBJECT_ID(N'tempdb..#tmp') > 0

    DROP TABLE #tmp ;

    GO

    CREATE TABLE #tmp

    (

    my_code_point INT,

    my_char CHAR(1)

    ) ;

    GO

    WITH E01(N) AS (SELECT 1 UNION ALL SELECT 1 UNION ALL

    SELECT 1 UNION ALL SELECT 1 UNION ALL

    SELECT 1 UNION ALL SELECT 1 UNION ALL

    SELECT 1 UNION ALL SELECT 1 UNION ALL

    SELECT 1 UNION ALL SELECT 1), -- 10 or 10E01 rows

    E02(N) AS (SELECT 1 FROM E01 a, E01 b), -- 100 or 10E02 rows

    E04(N) AS (SELECT 1 FROM E02 a, E02 b), -- 10,000 or 10E04 rows

    E08(N) AS (SELECT 1 FROM E04 a, E04 b), --100,000,000 or 10E08 rows

    E16(N) AS (SELECT 1 FROM E08 a, E08 b), --10E16 or more rows than you'll EVER need

    Tally(N) AS (SELECT 0 UNION ALL SELECT ROW_NUMBER() OVER (ORDER BY N) FROM E16)

    INSERT INTO #tmp

    (my_code_point, my_char)

    SELECT N, NCHAR(X.N) as MyNChar

    FROM

    ( SELECT N from Tally WHERE N < 256) X

    GO

    -- notice that things were escaped with a ? mark

    SELECT *

    FROM #tmp

    WHERE my_char = '?' ;

    GO

    -- look for other problems where SQL Server changed one of our values for us

    SELECT ASCII(my_char) AS ascii_function_output,

    my_code_point,

    my_char

    FROM #tmp

    WHERE ASCII(my_char) != my_code_point ;

    -- Munged data!

    GO

    EDIT: fix comments in VARCHAR code example

    There are no special teachers of virtue, because virtue is taught by the whole community.
    --Plato

  • j.miner (6/30/2011)


    🙂

    Hi Jeff & Lowell,

    All and all, it is cool stuff and I have squirelled the E16 example away.

    I will have to remember to use it in the real world.

    -John

    Start using such methods in the not-so-real-world and your real-world will become much easier. It's very much like practicing the piano... unless your goal is to become a comedian, you never practice hitting the wrong notes.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

Viewing 14 posts - 1 through 13 (of 13 total)

You must be logged in to reply to this topic. Login to reply