How SQL server 2005 store utf-8 data type?

Question

How SQL server 2005 store utf-8 data type?

Chan Wai Yin

SSCommitted

Points: 1883
More actions
December 21, 2009 at 10:24 pm

#214921

Hi,
We want to store native fonts for Chinese Macau Fonts and we use Chinese_PRC_90_CI_AS database collation according to MSDN. However as far as I know SQL Server 2005 only support UCS-2 but not UTF-8 scheme directly, therefore is it necessary to convert to varbinary to store such utf-8 data type instead of using nvarchar(max)? If using varbinary as utf-8 data type can we use exact search or like '%<text>% search, work similar as nvarchar(max)? Thanks a lot.
Best regards,
Wallace

Viewing 8 posts - 1 through 7 (of 7 total)

You must be logged in to reply to this topic. Login to reply

Gift Peddie SSC Guru Points: 73570 More actions · Answer 1

Chan Wai Yin (12/21/2009)
Hi,
We want to store native fonts for Chinese Macau Fonts and we use Chinese_PRC_90_CI_AS database collation according to MSDN. However as far as I know SQL Server 2005 only support UCS-2 but not UTF-8 scheme directly, therefore is it necessary to convert to varbinary to store such utf-8 data type instead of using nvarchar(max)? If using varbinary as utf-8 data type can we use exact search or like '%<text>% search, work similar as nvarchar(max)? Thanks a lot.
Best regards,
Wallace

If you are in SQL Server 2005 Sp2 and all you need is the collation in the column and code because there are at least four versions of Chinese defined and you could use UT8 in .NET or SSIS flat file data source.

Kind regards,
Gift Peddie

RBarryYoung SSC Guru Points: 143329 More actions · Answer 2

Chan Wai Yin (12/21/2009)
Hi,
We want to store native fonts for Chinese Macau Fonts and we use Chinese_PRC_90_CI_AS database collation according to MSDN. However as far as I know SQL Server 2005 only support UCS-2 but not UTF-8 scheme directly, therefore is it necessary to convert to varbinary to store such utf-8 data type instead of using nvarchar(max)? If using varbinary as utf-8 data type can we use exact search or like '%<text>% search, work similar as nvarchar(max)? Thanks a lot.
Best regards,
Wallace

Yes, internally SQL Server does not support UTF-8, so unless you can convert it externally or while importing it with one of the methods that Gift suggests, storing it in VARBINARY is you best choice. And sorry, only exact searches (also converted from UTF-8 to VARBINARY) will work.

[font="Times New Roman"]-- RBarryYoung[/font], [font="Times New Roman"] (302)375-0451[/font] blog: MovingSQL.com, Twitter: @RBarryYoung [font="Arial Black"]
Proactive Performance Solutions, Inc. [/font][font="Verdana"] "Performance is our middle name."[/font]

Mauve SSChampion Points: 11316 More actions · Answer 3

See post 832452 which is a similar problem. It has links to the solutions presented by Microsoft in the BOL. http://www.sqlservercentral.com/Forums/FindPost832452.aspx

Note that you can successfully Full-Text index data when it is stored in a VARBINARY(MAX) column. Provided that there is a matching column that identifies the type of the file (HTML, XML, etc.) so that it launches the proper iFilter. So it should be able to handle UTF-8 encoded data. However, I haven't tried plain text in UTF-8. We only store and Full-Text index Word (.doc), PowerPoint (.ppt) and PDF (.pdf) documents that are stored in a VARBINARY(MAX) column.

[font="Arial Narrow"](PHB) I think we should build an SQL database. (Dilbert) What color do you want that database? (PHB) I think mauve has the most RAM.[/font]

RBarryYoung SSC Guru Points: 143329 More actions · Answer 4

Note you can handle UTF-8 in SQL Server, only if you have some facility (iFilter, etc.) if it is written explicitly *for* UTF-8. The normal old VARBINARY() parsing, matching, etc. tricks don't work because UTF-8 has a lot of partial byte encodings.

[font="Times New Roman"]-- RBarryYoung[/font], [font="Times New Roman"] (302)375-0451[/font] blog: MovingSQL.com, Twitter: @RBarryYoung [font="Arial Black"]
Proactive Performance Solutions, Inc. [/font][font="Verdana"] "Performance is our middle name."[/font]

Gift Peddie SSC Guru Points: 73570 More actions · Answer 5

RBarryYoung (12/23/2009)
Note you can handle UTF-8 in SQL Server, only if you have some facility (iFilter, etc.) if it is written explicitly *for* UTF-8. The normal old VARBINARY() parsing, matching, etc. tricks don't work because UTF-8 has a lot of partial byte encodings.

If the person is using .NET storing all versions of Chinese is a none issue because VS2005-8 can UTF 8 encode by default and the code can be saved as Chinese. In SSIS import uses the .NET 3.5 System.Text.Encoding definition so character conversion issues almost don't exist even in Arabic if the developer takes care of these issues. I talked to a developer of Chinese decent at the Asp.net forums a while back and that was in SQL Server 2000.

http://msdn.microsoft.com/en-us/library/system.text.encoding.aspx

Kind regards,
Gift Peddie

RBarryYoung SSC Guru Points: 143329 More actions · Answer 6

Gift Peddie (12/23/2009)
RBarryYoung (12/23/2009)
Note you can handle UTF-8 in SQL Server, only if you have some facility (iFilter, etc.) if it is written explicitly *for* UTF-8. The normal old VARBINARY() parsing, matching, etc. tricks don't work because UTF-8 has a lot of partial byte encodings.
If the person is using .NET storing all versions of Chinese is a none issue because VS2005-8 can UTF 8 encode by default and the code can be saved as Chinese. In SSIS import uses the .NET 3.5 System.Text.Encoding definition so character conversion issues almost don't exist even in Arabic if the developer takes care of these issues. I talked to a developer of Chinese decent at the Asp.net forums a while back and that was in SQL Server 2000.
http://msdn.microsoft.com/en-us/library/system.text.encoding.aspx

I don't think that you are following the nuance of this conversation very closely Gift. And I am certain that you did not understand my post above, because your response to it misses my point entirely. My point was about using the facilities internal to SQL Server, it's already been established that external facilities can get UTF-8 into and out of SQL Server. As I said:

The normal old VARBINARY() parsing, matching, etc. tricks don't work because UTF-8 has a lot of partial byte encodings.

Which should be pretty clear. However, if you think that this statement is incorrect then please point out what exactly is wrong with it.

[font="Times New Roman"]-- RBarryYoung[/font], [font="Times New Roman"] (302)375-0451[/font] blog: MovingSQL.com, Twitter: @RBarryYoung [font="Arial Black"]
Proactive Performance Solutions, Inc. [/font][font="Verdana"] "Performance is our middle name."[/font]

Gift Peddie SSC Guru Points: 73570 More actions · Answer 7

RBarryYoung (12/24/2009)
Gift Peddie (12/23/2009)
RBarryYoung (12/23/2009)
Note you can handle UTF-8 in SQL Server, only if you have some facility (iFilter, etc.) if it is written explicitly *for* UTF-8. The normal old VARBINARY() parsing, matching, etc. tricks don't work because UTF-8 has a lot of partial byte encodings.
If the person is using .NET storing all versions of Chinese is a none issue because VS2005-8 can UTF 8 encode by default and the code can be saved as Chinese. In SSIS import uses the .NET 3.5 System.Text.Encoding definition so character conversion issues almost don't exist even in Arabic if the developer takes care of these issues. I talked to a developer of Chinese decent at the Asp.net forums a while back and that was in SQL Server 2000.
http://msdn.microsoft.com/en-us/library/system.text.encoding.aspx
I don't think that you are following the nuance of this conversation very closely Gift. And I am certain that you did not understand my post above, because your response to it misses my point entirely. My point was about using the facilities internal to SQL Server, it's already been established that external facilities can get UTF-8 into and out of SQL Server. As I said:
The normal old VARBINARY() parsing, matching, etc. tricks don't work because UTF-8 has a lot of partial byte encodings.
Which should be pretty clear. However, if you think that this statement is incorrect then please point out what exactly is wrong with it.

I did not see T-SQL in your post and System.Text is not excluded in the CLR, the post is in SQL Server 2005.

Kind regards,
Gift Peddie