August 22, 2012 at 11:11 am
I'm storing information in an XML file that is UTF-8.
I have a stored procedure I use to query some tables and structure the data in XML format, which I then save into an XML file with UTF-8.
I noticed that there are some characters that make it into this XML file, that violate the UTF-8 encoding. How do I avoid that?
August 23, 2012 at 2:23 pm
Looks like this might be in the wrong section, how do I move it to the XML secion?
August 23, 2012 at 2:42 pm
ok, i just googled "Non UTF-8 Characters", and the only thing i can find is when something got malformed; otherwise, It looks like UTF covers teh whole range of possible characters.
http://stackoverflow.com/questions/1379416/insert-utf8-data-into-a-ms-sql-server-2008
http://magp.ie/2011/01/06/remove-non-utf8-characters-from-string-with-php/
the two links above are describing fixes for malformed strings;
can you be more specific about what it is you want to remove?
do you really mean high ascii characters, ie >127 like some of these?
('ÀAlbèert ËEîinstêeiìn ÌInstìitúutëe - MPG')
do you mean escaping special characters that xml requires to be htmlized? like > to &l t ;?
Lowell
August 24, 2012 at 9:57 am
I didn't write this code and can't remember where I got it so apologies to the original author. In my case I only cared about non-ASCII characters outside the range 32-127 because the SQL XML parser handles things like '<' and '>' etc. If you need to convert other characters or tags then the case statement will need a little tweaking. Undesirable characters are replaced by '?' and then optionally all question marks are replaced with blanks.
CREATE FUNCTION dbo.svfCleanXML
(
@OriginalText VARCHAR(8000)
)
RETURNS VARCHAR(8000)
BEGIN
/* This function requires a Tally or Numbers table */
/* http://www.sqlservercentral.com/articles/T-SQL/62867/ */
DECLARE @CleanedText VARCHAR(8000)
SELECT
@CleanedText = ISNULL(@CleanedText,'')
+ CASE
WHEN ASCII(SUBSTRING(@OriginalText,Tally.N,1)) BETWEEN 32 AND 127
THEN SUBSTRING(@OriginalText,Tally.N,1)
WHEN ASCII(SUBSTRING(@OriginalText,Tally.N,1)) <= 32
THEN CONVERT(VARCHAR,ASCII(SUBSTRING(@OriginalText,Tally.N,1)))
WHEN ASCII(SUBSTRING(@OriginalText,Tally.N,1)) >= 128
THEN CONVERT(VARCHAR,ASCII(SUBSTRING(@OriginalText,Tally.N,1)))
END
FROM
dbo.Tally
WHERE
Tally.N <= LEN(@OriginalText)
SET @CleanedText = REPLACE(@CleanedText,'?','') --optional
RETURN @CleanedText
/*
SELECT dbo.svfCleanXML('[xml_string_here]') AS cleanString
SELECT CONVERT(XML,dbo.svfCleanXML('[xml_string_here]')) AS cleanXML
*/
END
August 24, 2012 at 11:01 am
and here's a similar one i wrote long ago which converts things like © to the html entity ©
I think it was specifically due to XML:
CREATE FUNCTION HTMLEncode(@OriginalText VARCHAR(4000))
RETURNS VARCHAR(8000)
BEGIN
DECLARE @CleanedText VARCHAR(8000);
WITH E01(N) AS (SELECT 1 UNION ALL SELECT 1 UNION ALL
SELECT 1 UNION ALL SELECT 1 UNION ALL
SELECT 1 UNION ALL SELECT 1 UNION ALL
SELECT 1 UNION ALL SELECT 1 UNION ALL
SELECT 1 UNION ALL SELECT 1), -- 10 or 10E01 rows
E02(N) AS (SELECT 1 FROM E01 a, E01 b), -- 100 or 10E02 rows
E04(N) AS (SELECT 1 FROM E02 a, E02 b), -- 10,000 or 10E04 rows
-- E08(N) AS (SELECT 1 FROM E04 a, E04 b), --100,000,000 or 10E08 rows
-- E16(N) AS (SELECT 1 FROM E08 a, E08 b), --10E16 or more rows than you'll EVER need
Tally(N) AS (SELECT ROW_NUMBER() OVER (ORDER BY N) FROM E04)
SELECT @CleanedText = ISNULL(@CleanedText,'') +
CASE
WHEN(ASCII(SUBSTRING(@OriginalText,Tally.N,1)) =34) THEN '"' --" " " quotation mark
WHEN(ASCII(SUBSTRING(@OriginalText,Tally.N,1)) =39) THEN ''' --' ' ' apostrophe
WHEN(ASCII(SUBSTRING(@OriginalText,Tally.N,1)) =38) THEN '&' --& & & ampersand
WHEN(ASCII(SUBSTRING(@OriginalText,Tally.N,1)) =60) THEN '<' --< < < less-than
WHEN(ASCII(SUBSTRING(@OriginalText,Tally.N,1)) =62) THEN '>' --> > > greater-than
WHEN(ASCII(SUBSTRING(@OriginalText,Tally.N,1)) =160) THEN ' ' -- non-breaking space
WHEN(ASCII(SUBSTRING(@OriginalText,Tally.N,1)) =161) THEN '¡' --¡ ¡ ¡ inverted exclamation mark
WHEN(ASCII(SUBSTRING(@OriginalText,Tally.N,1)) =162) THEN '¢' --¢ ¢ ¢ cent
WHEN(ASCII(SUBSTRING(@OriginalText,Tally.N,1)) =163) THEN '£' --£ £ £ pound
WHEN(ASCII(SUBSTRING(@OriginalText,Tally.N,1)) =164) THEN '¤' --¤ ¤ ¤ currency
WHEN(ASCII(SUBSTRING(@OriginalText,Tally.N,1)) =165) THEN '¥' --¥ ¥ ¥ yen
WHEN(ASCII(SUBSTRING(@OriginalText,Tally.N,1)) =166) THEN '¦' --¦ ¦ ¦ broken vertical bar
WHEN(ASCII(SUBSTRING(@OriginalText,Tally.N,1)) =167) THEN '§' --§ § § section
WHEN(ASCII(SUBSTRING(@OriginalText,Tally.N,1)) =168) THEN '¨' --¨ ¨ ¨ spacing diaeresis
WHEN(ASCII(SUBSTRING(@OriginalText,Tally.N,1)) =169) THEN '©' --© © © copyright
WHEN(ASCII(SUBSTRING(@OriginalText,Tally.N,1)) =170) THEN 'ª' --ª ª ª feminine ordinal indicator
WHEN(ASCII(SUBSTRING(@OriginalText,Tally.N,1)) =171) THEN '«' --« « « angle quotation mark (left)
WHEN(ASCII(SUBSTRING(@OriginalText,Tally.N,1)) =172) THEN '¬' --¬ ¬ ¬ negation
WHEN(ASCII(SUBSTRING(@OriginalText,Tally.N,1)) =173) THEN '' -- soft hyphen
WHEN(ASCII(SUBSTRING(@OriginalText,Tally.N,1)) =174) THEN '®' --® ® ® registered trademark
WHEN(ASCII(SUBSTRING(@OriginalText,Tally.N,1)) =175) THEN '¯' --¯ ¯ ¯ spacing macron
WHEN(ASCII(SUBSTRING(@OriginalText,Tally.N,1)) =176) THEN '°' --° ° ° degree
WHEN(ASCII(SUBSTRING(@OriginalText,Tally.N,1)) =177) THEN '±' --± ± ± plus-or-minus
WHEN(ASCII(SUBSTRING(@OriginalText,Tally.N,1)) =178) THEN '²' --² ² ² superscript 2
WHEN(ASCII(SUBSTRING(@OriginalText,Tally.N,1)) =179) THEN '³' --³ ³ ³ superscript 3
WHEN(ASCII(SUBSTRING(@OriginalText,Tally.N,1)) =180) THEN '´' --´ ´ ´ spacing acute
WHEN(ASCII(SUBSTRING(@OriginalText,Tally.N,1)) =181) THEN 'µ' --µ µ µ micro
WHEN(ASCII(SUBSTRING(@OriginalText,Tally.N,1)) =182) THEN '¶' --¶ ¶ ¶ paragraph
WHEN(ASCII(SUBSTRING(@OriginalText,Tally.N,1)) =183) THEN '·' --· · · middle dot
WHEN(ASCII(SUBSTRING(@OriginalText,Tally.N,1)) =184) THEN '¸' --¸ ¸ ¸ spacing cedilla
WHEN(ASCII(SUBSTRING(@OriginalText,Tally.N,1)) =185) THEN '¹' --¹ ¹ ¹ superscript 1
WHEN(ASCII(SUBSTRING(@OriginalText,Tally.N,1)) =186) THEN 'º' --º º º masculine ordinal indicator
WHEN(ASCII(SUBSTRING(@OriginalText,Tally.N,1)) =187) THEN '»' --» » » angle quotation mark (right)
WHEN(ASCII(SUBSTRING(@OriginalText,Tally.N,1)) =188) THEN '¼' --¼ ¼ ¼ fraction 1/4
WHEN(ASCII(SUBSTRING(@OriginalText,Tally.N,1)) =189) THEN '½' --½ ½ ½ fraction 1/2
WHEN(ASCII(SUBSTRING(@OriginalText,Tally.N,1)) =190) THEN '¾' --¾ ¾ ¾ fraction 3/4
WHEN(ASCII(SUBSTRING(@OriginalText,Tally.N,1)) =191) THEN '¿' --¿ ¿ ¿ inverted question mark
WHEN(ASCII(SUBSTRING(@OriginalText,Tally.N,1)) =215) THEN '×' --× × × multiplication
WHEN(ASCII(SUBSTRING(@OriginalText,Tally.N,1)) =247) THEN '÷' --÷ ÷ ÷ division
WHEN(ASCII(SUBSTRING(@OriginalText,Tally.N,1)) =192) THEN 'À' --À À À capital a, grave accent
WHEN(ASCII(SUBSTRING(@OriginalText,Tally.N,1)) =193) THEN 'Á' --Á Á Á capital a, acute accent
WHEN(ASCII(SUBSTRING(@OriginalText,Tally.N,1)) =194) THEN 'Â' --Â Â Â capital a, circumflex accent
WHEN(ASCII(SUBSTRING(@OriginalText,Tally.N,1)) =195) THEN 'Ã' --Ã Ã Ã capital a, tilde
WHEN(ASCII(SUBSTRING(@OriginalText,Tally.N,1)) =196) THEN 'Ä' --Ä Ä Ä capital a, umlaut mark
WHEN(ASCII(SUBSTRING(@OriginalText,Tally.N,1)) =197) THEN 'Å' --Å Å Å capital a, ring
WHEN(ASCII(SUBSTRING(@OriginalText,Tally.N,1)) =198) THEN 'Æ' --Æ Æ Æ capital ae
WHEN(ASCII(SUBSTRING(@OriginalText,Tally.N,1)) =199) THEN 'Ç' --Ç Ç Ç capital c, cedilla
WHEN(ASCII(SUBSTRING(@OriginalText,Tally.N,1)) =200) THEN 'È' --È È È capital e, grave accent
WHEN(ASCII(SUBSTRING(@OriginalText,Tally.N,1)) =201) THEN 'É' --É É É capital e, acute accent
WHEN(ASCII(SUBSTRING(@OriginalText,Tally.N,1)) =202) THEN 'Ê' --Ê Ê Ê capital e, circumflex accent
WHEN(ASCII(SUBSTRING(@OriginalText,Tally.N,1)) =203) THEN 'Ë' --Ë Ë Ë capital e, umlaut mark
WHEN(ASCII(SUBSTRING(@OriginalText,Tally.N,1)) =204) THEN 'Ì' --Ì Ì Ì capital i, grave accent
WHEN(ASCII(SUBSTRING(@OriginalText,Tally.N,1)) =205) THEN 'Í' --Í Í Í capital i, acute accent
WHEN(ASCII(SUBSTRING(@OriginalText,Tally.N,1)) =206) THEN 'Î' --Î Î Î capital i, circumflex accent
WHEN(ASCII(SUBSTRING(@OriginalText,Tally.N,1)) =207) THEN 'Ï' --Ï Ï Ï capital i, umlaut mark
WHEN(ASCII(SUBSTRING(@OriginalText,Tally.N,1)) =208) THEN 'Ð' --Ð Ð Ð capital eth, Icelandic
WHEN(ASCII(SUBSTRING(@OriginalText,Tally.N,1)) =209) THEN 'Ñ' --Ñ Ñ Ñ capital n, tilde
WHEN(ASCII(SUBSTRING(@OriginalText,Tally.N,1)) =210) THEN 'Ò' --Ò Ò Ò capital o, grave accent
WHEN(ASCII(SUBSTRING(@OriginalText,Tally.N,1)) =211) THEN 'Ó' --Ó Ó Ó capital o, acute accent
WHEN(ASCII(SUBSTRING(@OriginalText,Tally.N,1)) =212) THEN 'Ô' --Ô Ô Ô capital o, circumflex accent
WHEN(ASCII(SUBSTRING(@OriginalText,Tally.N,1)) =213) THEN 'Õ' --Õ Õ Õ capital o, tilde
WHEN(ASCII(SUBSTRING(@OriginalText,Tally.N,1)) =214) THEN 'Ö' --Ö Ö Ö capital o, umlaut mark
WHEN(ASCII(SUBSTRING(@OriginalText,Tally.N,1)) =216) THEN 'Ø' --Ø Ø Ø capital o, slash
WHEN(ASCII(SUBSTRING(@OriginalText,Tally.N,1)) =217) THEN 'Ù' --Ù Ù Ù capital u, grave accent
WHEN(ASCII(SUBSTRING(@OriginalText,Tally.N,1)) =218) THEN 'Ú' --Ú Ú Ú capital u, acute accent
WHEN(ASCII(SUBSTRING(@OriginalText,Tally.N,1)) =219) THEN 'Û' --Û Û Û capital u, circumflex accent
WHEN(ASCII(SUBSTRING(@OriginalText,Tally.N,1)) =220) THEN 'Ü' --Ü Ü Ü capital u, umlaut mark
WHEN(ASCII(SUBSTRING(@OriginalText,Tally.N,1)) =221) THEN 'Ý' --Ý Ý Ý capital y, acute accent
WHEN(ASCII(SUBSTRING(@OriginalText,Tally.N,1)) =222) THEN 'Þ' --Þ Þ Þ capital THORN, Icelandic
WHEN(ASCII(SUBSTRING(@OriginalText,Tally.N,1)) =223) THEN 'ß' --ß ß ß small sharp s, German
WHEN(ASCII(SUBSTRING(@OriginalText,Tally.N,1)) =224) THEN 'à' --à à à small a, grave accent
WHEN(ASCII(SUBSTRING(@OriginalText,Tally.N,1)) =225) THEN 'á' --á á á small a, acute accent
WHEN(ASCII(SUBSTRING(@OriginalText,Tally.N,1)) =226) THEN 'â' --â â â small a, circumflex accent
WHEN(ASCII(SUBSTRING(@OriginalText,Tally.N,1)) =227) THEN 'ã' --ã ã ã small a, tilde
WHEN(ASCII(SUBSTRING(@OriginalText,Tally.N,1)) =228) THEN 'ä' --ä ä ä small a, umlaut mark
WHEN(ASCII(SUBSTRING(@OriginalText,Tally.N,1)) =229) THEN 'å' --å å å small a, ring
WHEN(ASCII(SUBSTRING(@OriginalText,Tally.N,1)) =230) THEN 'æ' --æ æ æ small ae
WHEN(ASCII(SUBSTRING(@OriginalText,Tally.N,1)) =231) THEN 'ç' --ç ç ç small c, cedilla
WHEN(ASCII(SUBSTRING(@OriginalText,Tally.N,1)) =232) THEN 'è' --è è è small e, grave accent
WHEN(ASCII(SUBSTRING(@OriginalText,Tally.N,1)) =233) THEN 'é' --é é é small e, acute accent
WHEN(ASCII(SUBSTRING(@OriginalText,Tally.N,1)) =234) THEN 'ê' --ê ê ê small e, circumflex accent
WHEN(ASCII(SUBSTRING(@OriginalText,Tally.N,1)) =235) THEN 'ë' --ë ë ë small e, umlaut mark
WHEN(ASCII(SUBSTRING(@OriginalText,Tally.N,1)) =236) THEN 'ì' --ì ì ì small i, grave accent
WHEN(ASCII(SUBSTRING(@OriginalText,Tally.N,1)) =237) THEN 'í' --í í í small i, acute accent
WHEN(ASCII(SUBSTRING(@OriginalText,Tally.N,1)) =238) THEN 'î' --î î î small i, circumflex accent
WHEN(ASCII(SUBSTRING(@OriginalText,Tally.N,1)) =239) THEN 'ï' --ï ï ï small i, umlaut mark
WHEN(ASCII(SUBSTRING(@OriginalText,Tally.N,1)) =240) THEN 'ð' --ð ð ð small eth, Icelandic
WHEN(ASCII(SUBSTRING(@OriginalText,Tally.N,1)) =241) THEN 'ñ' --ñ ñ ñ small n, tilde
WHEN(ASCII(SUBSTRING(@OriginalText,Tally.N,1)) =242) THEN 'ò' --ò ò ò small o, grave accent
WHEN(ASCII(SUBSTRING(@OriginalText,Tally.N,1)) =243) THEN 'ó' --ó ó ó small o, acute accent
WHEN(ASCII(SUBSTRING(@OriginalText,Tally.N,1)) =244) THEN 'ô' --ô ô ô small o, circumflex accent
WHEN(ASCII(SUBSTRING(@OriginalText,Tally.N,1)) =245) THEN 'õ' --õ õ õ small o, tilde
WHEN(ASCII(SUBSTRING(@OriginalText,Tally.N,1)) =246) THEN 'ö' --ö ö ö small o, umlaut mark
WHEN(ASCII(SUBSTRING(@OriginalText,Tally.N,1)) =248) THEN 'ø' --ø ø ø small o, slash
WHEN(ASCII(SUBSTRING(@OriginalText,Tally.N,1)) =249) THEN 'ù' --ù ù ù small u, grave accent
WHEN(ASCII(SUBSTRING(@OriginalText,Tally.N,1)) =250) THEN 'ú' --ú ú ú small u, acute accent
WHEN(ASCII(SUBSTRING(@OriginalText,Tally.N,1)) =251) THEN 'û' --û û û small u, circumflex accent
WHEN(ASCII(SUBSTRING(@OriginalText,Tally.N,1)) =252) THEN 'ü' --ü ü ü small u, umlaut mark
WHEN(ASCII(SUBSTRING(@OriginalText,Tally.N,1)) =253) THEN 'ý' --ý ý ý small y, acute accent
WHEN(ASCII(SUBSTRING(@OriginalText,Tally.N,1)) =254) THEN 'þ' --þ þ þ small thorn, Icelandic
WHEN(ASCII(SUBSTRING(@OriginalText,Tally.N,1)) =255) THEN 'ÿ' --ÿ ÿ ÿ small y, umlaut mark
ELSE SUBSTRING(@OriginalText,Tally.N,1)
END
FROM Tally
WHERE Tally.N <= LEN(@OriginalText)
RETURN @CleanedText
END
Lowell
August 24, 2012 at 11:04 am
well, the forum inteprets the escaped values as something it should render...whoops.
link to the function as raw text here:
Lowell
August 31, 2012 at 3:44 pm
The character that was giving me problems is â (0xE2)
August 31, 2012 at 6:39 pm
Khades (8/31/2012)
The character that was giving me problems is â (0xE2)
Here's a revised version of the function I posted above which in addition to checking for normal ASCII characters will remove any other extended characters either in their native form (â) or encoded form (&#x). There are some examples to run at the bottom of the code.
CREATE FUNCTION [dbo].[svfHTMLEncodeXML]
(
@OriginalText VARCHAR(8000)
)
RETURNS VARCHAR(8000)
BEGIN
DECLARE
@CleanedText VARCHAR(8000)
,@pos INT
,@chunk VARCHAR(10)
SELECT @CleanedText = COALESCE(@CleanedText + '','')
+ CAST(CleanText.CharVal AS VARCHAR(8000))
FROM
(
SELECT
Result.CharVal
FROM
(
SELECT
N AS RowNum
,(CASE
WHEN ASCII(SUBSTRING(@OriginalText,t.N,1)) BETWEEN 32 AND 127
THEN CAST(SUBSTRING(@OriginalText,t.N,1) AS VARCHAR(50))
WHEN ASCII(SUBSTRING(@OriginalText,t.N,1)) < 32
THEN NULL
WHEN ASCII(SUBSTRING(@OriginalText,t.N,1)) >= 127
THEN NULL
END)
AS CharVal
FROM
dbo.Tally AS t
WHERE
t.N <= LEN(@OriginalText)
) AS Result
WHERE
Result.CharVal IS NOT NULL
) CleanText
SET @pos = PATINDEX('%[&#x]%',@CleanedText)
WHILE @pos <> 0
BEGIN
SET @chunk = SUBSTRING(@CleanedText,@pos,6)
SET @CleanedText = REPLACE(@CleanedText,@chunk,'')
SET @pos = PATINDEX('%[&#x]%',@CleanedText)
SET @chunk = NULL
END
SET @pos = PATINDEX('%[0x]%',@CleanedText)
WHILE @pos <> 0
BEGIN
SET @chunk = SUBSTRING(@CleanedText,@pos,6)
SET @CleanedText = REPLACE(@CleanedText,@chunk,'')
SET @pos = PATINDEX('%[0x]%',@CleanedText)
SET @chunk = NULL
END
/* This optional section removes tabs and excess spaces */
SET @CleanedText = REPLACE(@CleanedText,CHAR(9),' ')-- convert tabs to spaces
SET @CleanedText =
REPLACE(
REPLACE(
REPLACE(
@CleanedText
,' ',' '+CHAR(7))
,CHAR(7)+' ','')
,CHAR(7),'')
SET @CleanedText = REPLACE(@CleanedText,'> <','><')
RETURN @CleanedText
/*
SELECT dbo.svfHTMLEncodeXML('O&xxxx;Brien') -- sub '#x0D' for 'xxxx' to test
SELECT dbo.svfHTMLEncodeXML('OÇBrien')
SELECT dbo.svfHTMLEncodeXML('O¢Brien')
SELECT dbo.svfHTMLEncodeXML('O&xxxx;Brien') -- sub '#x0F' for 'xxxx' to test
SELECT dbo.svfHTMLEncodeXML('OâBrien')
*/
/* --this is text with an invisible line break character in it
SELECT dbo.svfHTMLEncodeXML('O
Brien')
*/
/* -- notice the â (0xE2) in the 2nd position which the function will remove
SELECT CONVERT(XML,dbo.svfHTMLEncodeXML(
'<âRoot type="demographics">
<PersonalData>
<PersonName type="primary">
<GivenName>John</GivenName>
<MiddleName>A</MiddleName>
<FamilyName>Doe</FamilyName>
</PersonName>
</PersonalData>
</Root>'))
*/
END
September 4, 2012 at 3:10 pm
Thank you once again. I appreciate your help.
What does the dbo.Tally table hold? I notice we keep querying the N column.
September 4, 2012 at 3:16 pm
Khades (9/4/2012)
Thank you once again. I appreciate your help.What does the dbo.Tally table hold? I notice we keep querying the N column.
A "Tally" table (sometimes called a "Numbers" table) is just a table with a large column of sequentially ordered numbers which can be joined to other tables to avoid using cursors or while loops.
This article will giet you started on the subject: http://www.sqlservercentral.com/articles/T-SQL/62867/
BTW, in my last post above I did include some WHILE loops which could almost certainly be replaced by joining to a tally table. But it was late when I posted that and I wanted to go home. 😛
Besides, in my particular case where I'm using a function very similar to this, the number of loops are likely to be minimal as these stray characters only occur infrequently and I'm using the function against a single block of code at a time and not in a SELECT statement. If using this function in a SELECT statement it would be worth the effort to try and replace them with some sort of non-looping construct.
Viewing 10 posts - 1 through 9 (of 9 total)
You must be logged in to reply to this topic. Login to reply