How do I clean up data for UTF-8

Question

Post reply

How do I clean up data for UTF-8

Khades

SSCrazy

Points: 2067
More actions
October 16, 2013 at 5:14 pm

#278176

I have a stored procedure which selects 1 column, 1 row, NVARCHAR(MAX) which we use to make an XML file with UTF-8 encoding.
This field is put together through multiple other fields in the DB. Those fields come from our website.
The problem is, we don't have validation everywhere on the website to not allow certain special characters. And some of these characters make the file unreadable in XML format due to the encoding.
I was wondering:
a) How can I identify which characters would cause this? I've searched lots of articles on UTF-8 and I'm still confused about it.
b) What's the best (performance) way to clean out these characters from the data. Should we do it in the mentioned stored procedure? Should we do it in the SSIS package that uses the stored procedure to create the XML file?
Thank you in advance for your responses!

Viewing 11 posts - 1 through 10 (of 10 total)

You must be logged in to reply to this topic. Login to reply

Koen Verbeeck SSC Guru Points: 259115 More actions · Answer 1

There are certain characters XML cannot handle, because they are used by XML itself.

For example: < > & " '

List of XML and HTML character entity references

Need an answer? No, you need a question
My blog at https://sqlkover.com.
MCSE Business Intelligence - Microsoft Data Platform MVP

Khades SSCrazy Points: 2067 More actions · Answer 2

Aren't there other characters that UTF-8 encoding won't allow though? Such as the copyright symbol?

Koen Verbeeck SSC Guru Points: 259115 More actions · Answer 3

Unicode can handle a lot of characters. I don't know at the top of my head which characters are not supported.

If you know which characters you don't want, you can remove them using the REPLACE function in the stored procedure, or in the derived column transformation in SSIS.

Need an answer? No, you need a question
My blog at https://sqlkover.com.
MCSE Business Intelligence - Microsoft Data Platform MVP

Jason Wolfkill SSCrazy Eights Points: 9832 More actions · Answer 4

Koen Verbeeck (10/17/2013)
There are certain characters XML cannot handle, because they are used by XML itself.
For example: < > & " '
List of XML and HTML character entity references

It's not that XML can't handle these characters, only that they must be represented in character data and attribute values as entity references - e.g., "&" (ampersand) will be represented as "&". All XML parsers are required to recognize the five predefined entities: ", ', <, >, and &, but other entities can be specified in a DTD if needed. Then, the original characters can be replaced with the entity references when the XML is created. An XML parser reading the XML will render the entity references as the original characters.

Jason Wolfkill

Koen Verbeeck SSC Guru Points: 259115 More actions · Answer 5

wolfkillj (10/23/2013)
Koen Verbeeck (10/17/2013)
There are certain characters XML cannot handle, because they are used by XML itself.
For example: < > & " '
List of XML and HTML character entity references
It's not that XML can't handle these characters, only that they must be represented in character data and attribute values as entity references - e.g., "&" (ampersand) will be represented as "&". All XML parsers are required to recognize the five predefined entities: ", ', <, >, and &, but other entities can be specified in a DTD if needed. Then, the original characters can be replaced with the entity references when the XML is created. An XML parser reading the XML will render the entity references as the original characters.

I know. "Can't handle" was maybe a poor choice of words. I meant: it returns invalid XML if you don't use the entities.

I wrote an article today, which I have to turn in as HTML and it contained some XML code, so I had a lot of fun with these characters. 🙂

Need an answer? No, you need a question
My blog at https://sqlkover.com.
MCSE Business Intelligence - Microsoft Data Platform MVP

Khades SSCrazy Points: 2067 More actions · Answer 6

Koen Verbeeck (10/22/2013)
Unicode can handle a lot of characters. I don't know at the top of my head which characters are not supported.
If you know which characters you don't want, you can remove them using the REPLACE function in the stored procedure, or in the derived column transformation in SSIS.

I don't know what these characters are, that's part of my problem :(.

I figured maybe there was an ASCII range of values you could only use with UTF-8 encoding. But haven't been able to find anything that let's me know which ranges to exclude.

Jason Wolfkill SSCrazy Eights Points: 9832 More actions · Answer 7

Khades (10/24/2013)
Koen Verbeeck (10/22/2013)
Unicode can handle a lot of characters. I don't know at the top of my head which characters are not supported.
If you know which characters you don't want, you can remove them using the REPLACE function in the stored procedure, or in the derived column transformation in SSIS.
I don't know what these characters are, that's part of my problem :(.
I figured maybe there was an ASCII range of values you could only use with UTF-8 encoding. But haven't been able to find anything that let's me know which ranges to exclude.

It's not likely that you're encountering a character that can't be represented in UTF-8 encoding - it covers all 1,000,000+ code points of the Unicode set. The 128 ASCII code points correspond directly to the first 128 code points of UTF-8, so they're completely compatible. And SQL Server automatically entitizes the five characters that must be represented by entities in XML (?, ', ", <, and >).

I think we need some more information to help you with your problem. How exactly are you assembling the XML? Select statements with FOR XML? What do you mean that the resulting file is "unreadable" - that it can't be parsed by an XML parser or that you can't open it in a text editor or something else?

I wonder whether any of the "control characters" (the first 32 characters in both ASCII and Unicode) may have slipped into your data. They cannot be displayed but SQL Server will happily store them in columns with string data types. It's conceivable that the tool that can't read the XML you're producing is encountering one of those characters and choking on it.

Do you know how to use the ASCII() and UNICODE() functions to return the integer value of a character in each particular code space? That's how you can tell if you have control characters lurking in your data - if those functions return 0 - 31 for any character in your data, you've got a control character. If you need help with that, we can get you started with some code snippets.

Jason Wolfkill

Khades SSCrazy Points: 2067 More actions · Answer 8

wolfkillj (10/24/2013)
Khades (10/24/2013)
Koen Verbeeck (10/22/2013)
Unicode can handle a lot of characters. I don't know at the top of my head which characters are not supported.
If you know which characters you don't want, you can remove them using the REPLACE function in the stored procedure, or in the derived column transformation in SSIS.
I don't know what these characters are, that's part of my problem :(.
I figured maybe there was an ASCII range of values you could only use with UTF-8 encoding. But haven't been able to find anything that let's me know which ranges to exclude.
It's not likely that you're encountering a character that can't be represented in UTF-8 encoding - it covers all 1,000,000+ code points of the Unicode set. The 128 ASCII code points correspond directly to the first 128 code points of UTF-8, so they're completely compatible. And SQL Server automatically entitizes the five characters that must be represented by entities in XML (?, ', ", <, and >).
I think we need some more information to help you with your problem. How exactly are you assembling the XML? Select statements with FOR XML? What do you mean that the resulting file is "unreadable" - that it can't be parsed by an XML parser or that you can't open it in a text editor or something else?
I wonder whether any of the "control characters" (the first 32 characters in both ASCII and Unicode) may have slipped into your data. They cannot be displayed but SQL Server will happily store them in columns with string data types. It's conceivable that the tool that can't read the XML you're producing is encountering one of those characters and choking on it.
Do you know how to use the ASCII() and UNICODE() functions to return the integer value of a character in each particular code space? That's how you can tell if you have control characters lurking in your data - if those functions return 0 - 31 for any character in your data, you've got a control character. If you need help with that, we can get you started with some code snippets.

Thank you, this helps a lot. I can at least know the replace those 32 characters.

To answer your questions, you're right, by "unreadable" I mean that I cannot parse it with an XML parser. I use XML Notepad. But I can still open it with a text editor such as Notepad just fine.

I am indeed assembling my XML with SELECT statements with FOR XML. The information in these XML files are actually put together from multiple tables in our DB, we then send the XML file to the client. This has already been live for a few months. This has been our only hiccup. We send thousands of files every day, and only 1 file had this problem.

Jason Wolfkill SSCrazy Eights Points: 9832 More actions · Answer 9

Khades (10/24/2013)
Thank you, this helps a lot. I can at least know the replace those 32 characters.
To answer your questions, you're right, by "unreadable" I mean that I cannot parse it with an XML parser. I use XML Notepad. But I can still open it with a text editor such as Notepad just fine.
I am indeed assembling my XML with SELECT statements with FOR XML. The information in these XML files are actually put together from multiple tables in our DB, we then send the XML file to the client. This has already been live for a few months. This has been our only hiccup. We send thousands of files every day, and only 1 file had this problem.

Do you have nested FOR XML queries, and if so, are you using the TYPE directive with the FOR XML clause on each one? If not, an outer query will treat the result of the inner query as a string value rather than an XML-typed value, which results in an invalid XML element. Also, if you are using the TYPE directive, SQL Server will throw an error if it encounters an invalid character (most control characters are invalid in XML 1.0 and are valid only in certain contexts in XML 1.1). Compare the results of these queries:

;WITH cteTestValues AS (

SELECT 'X' as col1

UNION ALL

SELECT '8'

UNION ALL

SELECT NCHAR(5) -- a control character

)

SELECT (SELECT col1

FROM cteTestValues

FOR XML PATH('row')) AS [rows]

,'XYZ' AS someElement

FOR XML PATH('root'), TYPE

-- Query 2:

;WITH cteTestValues AS (

SELECT 'X' as col1

UNION ALL

SELECT '8'

UNION ALL

SELECT NCHAR(5) -- a control character

)

SELECT (SELECT col1

FROM cteTestValues

FOR XML PATH('row'), TYPE) AS [rows]

,'XYZ' AS someElement

FOR XML PATH('root'), TYPE

The only difference between the two is the presence of the TYPE directive on the inner query. Query 1 succeeds but generates some unexpectedly funky XML which includes the entitized reference for the control character. Query 2 throws an error because SQL Server cannot include the control character in a valid XML instance.

Is your query generating XML that looks like the result of Query 1, perhaps?

Jason Wolfkill

Khades SSCrazy Points: 2067 More actions · Answer 10

I'm Doing something like this:

CREATE TABLE #Child_table(

id int,

parent_id int,

Attribute1 VARCHAR(10),

Attribute2 VARCHAR(10),

column1 VARCHAR(10),

column2 VARCHAR(10),

column3 VARCHAR(10),

column4 VARCHAR(10),

column5 VARCHAR(10))

CREATE TABLE #Parent_table(

id int,

Attribute1 VARCHAR(10),

column1 VARCHAR(10),

column2 VARCHAR(10),

column3 VARCHAR(10),

column4 VARCHAR(10),

column5 VARCHAR(10),

column6 VARCHAR(10),

column7 VARCHAR(10))

DECLARE @X XML, @Y XML(dbo.SomeXSD)

SET @X = (

SELECT

p.Attribute1 AS '@Attribute1'

, CAST ( (SELECT

c.Attribute1 AS '@Attribute1'

,c.Attribute2 AS '@Attribute2'

,c.column1

,c.column2

,c.column3

,c.column4

,c.column5

FROM #Child_Table c WITH (NOLOCK) WHERE c.Parent_id = p.id

FOR XML PATH ('Child')

) AS XML) AS 'Childs'

,p.column1

,p.column2

,p.column3

,p.column4

,p.column5

,p.column6

,p.column7

FROM #Parent_table p WITH (NOLOCK)

FOR XML PATH ('Parent')

)

SET @X = CAST ( (N'<Parents xmlns="http://www.somesite.com/somenamespace/v1.0">'

+ CAST( @X AS NVARCHAR(MAX)) + N'</Parents>') AS XML)

SET @Y = @X

SELECT XMLResult = '<?xml version="1.0" encoding="UTF-8"?>' + CAST (@Y AS VARCHAR(MAX))

After further research, I've found that the actual character that caused all these problems was Char(189) which is ½