Insert Smart Quote Unicode

  • I thought char and varchar columns would only allow the 256 ASCII values and would not allow Unicode. SQL Server 2000 lets me insert Unicode into these columns without any problems.

    For example, when I insert the two sets of MS Word smart quotes, separated by a space, into a varchar(5) column, the smart quotes get inserted just fine. When I run the Unicode() function on that column, I get the Unicode numbers back.

    Smart quotes: “” ‘’

    Unicode values:  (8220)(8221)(32)(8216)(8217)

    Where am I wrong in my understanding of char and varchar columns?

    Thank you.

  • There are ASCII versions of them that apparently the UNICODE() function translates into the codes you listed.

    Ascii values = (147)(148)(32)(145)(146)

    It is ultimately a matter of character mapping from UNICODE to ASCII and then back to UNICODE again.

  • OK, so the ASCII and UNICODE functions don't reveal what is actually stored in the column.

     

    What I am doing is inserting these values into the database with ColdFusion, then immediately querying what is in the database. If I insert an ASCII 147 character, I get back an ASCII 147 character. If I insert a 8220 character (which I assume is Unicode), I get back a 8220 character, even though this is a varchar field.

  • I've never worked with ColdFusion so I can't do more than theorize here. But first of all, I have assumed that you are using varchar versus nvarchar. nvarchar and nchar are the Unicode versions of the fields. Also, since varchar is ultimately just a specialized byte array it is theoretically possible that ColdFusion is pasing Unicode into the varchar byte array and then when it returns it out recombines the byte array into a Unicode string. Problems will arrise when another application retuns the byte array expecting ASCII as it should have been. There are a couple ways to verify that this is being done. One is to do a LEN on the varchar field after inserting the Unicode string. If the LEN is twice as long as it should be then SQL Server is storing both halves of the Unicode characters as seperate characters. Another way to verify it is to simply SELECT it out from Query Analyzer. If you want to store Unicode strings then you should use nvarchar to prevent a lot of headaches later on.

  • I was using a varchar(5) and inserting five characters into the field.

    I am now thinking that either ColdFusion, SQL Server, or the JDBC/ODBC driver must be doing some conversion on the fly. When I insert a smart quote using ColdFusion MX, which uses Unicode, then retrieve that value, I get back the code 8220. When I retrieve that value with ColdFusion 5, which does not use Unicode, it reports back that the ASCII code of that same character is 147.

    So it begs the question, what is actually stored in that field and what is doing the conversion?

    I would guess that the answer to the first question has something to do with the SQL_Latin1_... collation name. I would guess that the answer to question two is SQL Server. I say that because in SQL Server, if I select ASCII(column) I get 147 and if I select Unicode(column) I get 8220. It seems like one of those functions should have throw an error.

    I changed the column type to an nvarchar() and it did not change the results much. When I changed the column to nvarchar(10), the datalength of the five characters was 10. I changed the column back to a varchar(10) and the datalength became 5. I did not lose any of the five characters in the conversion process.

    So, if I do not lose any information, and the data takes up half the size, I think I will stick with a varchar.

    The point of asking this question was to determine if I need to strip away any characters that a user can enter on an HTML form before inserting the characters into the database. I guess the answer is no based on my test results of inserting unicode into a varchar field.

  • One thing to keep in mind. I use cf also. Often the quotes/apostrophes look fine in the database and show on websites properly.

    But then we use our data to populate pdfs. This is where you may get weird characters showing up instead of the quotes. For example a TM or @ may show up instead of the apostrophe...

    j

Viewing 6 posts - 1 through 5 (of 5 total)

You must be logged in to reply to this topic. Login to reply