July 6, 2017 at 5:52 am
We've got a regular data problem where special (accented, etc.) characters make it into our data set. Our base systems can handle the characters, but not the other department systems to which our feeds go. So we have to a data scrub.
I thought I was being clever. I coded an ASCII table for SQL Server with which I could replace certain characters (like tabs with blanks or smart quotes with regular quotes or accented characters with non-accented characters) when I realized the ASCII table I'd created in SQL wasn't using the same numbers as the ASCII table I found on the internet.
I used Jeff's Tally Table to do the ASCII table based on what I got from the above ASCII table link. The code I used is below:
SELECT N-1 AS AsciiCode,CHAR(N-1) AS AsciiChar,
CASE WHEN N = 145 THEN 'E'
WHEN N IN (131,317) THEN 'e'
WHEN N IN (133,134,161,199) THEN 'a'
WHEN N IN (143,200) THEN 'A'
WHEN N = 140 THEN 'i'
WHEN N IN (130,164) THEN 'u'
WHEN N IN (155,234) THEN 'U'
WHEN N = 165 THEN 'n'
WHEN N = 166 THEN 'N'
WHEN N IN (149,163,229) THEN 'o'
WHEN N IN (154,225,230) THEN 'O'
ELSE '' END AS ReplacementCharacter
FROM dbo.Tally
WHERE N NOT IN (33,45,47)
AND N NOT BETWEEN 49 AND 58
AND N NOT BETWEEN 66 AND 91
AND N NOT BETWEEN 98 AND 123
AND N <= 256;
I based every N number off of the ASCII information I found on the internet (second link) + 1 (to account for the fact that the Tally Table doesn't have 0 and the SELECT is subtracting 1 to get the proper ASCII code. But when I look at the result set for this SELECT, my replacement characters don't line up at all with the special characters nor do the special characters line up with the code numbers in the second link.
Does anyone know what's going on here? Does SQL Server use different ASCII numbers to represent special characters than what I found or did I find something old & out of date?
I'm very confused... yes, I can fix the issue easily, but I'd like to understand the core problem before I do the fix.
July 6, 2017 at 1:09 pm
Have you taken collation into account?
Steve (aka sgmunson) 🙂 🙂 🙂
Rent Servers for Income (picks and shovels strategy)
July 6, 2017 at 1:27 pm
it has to do with your server/session collation.
I'm not even sure which collation the link you supplied is.
In order to determine the values for your current collation the following should do the trick.
select charnum, char(t.charnum)
from (
select top 256 row_number() over (order by o.name) -1 as charnum
from msdb.sys.objects o
) t
the following may also be a option to you depending on your own country.
I normally use this within a function in conjunction with a loop through the string using patindex to remove other unwanted characters.
first step would be the collate below, second step loop with patindex - less interactions like this - can supply code if you are interested.
if object_id('tempdb..#special') is not null
drop table #special;
create table #special
(special varchar(100)
)
insert into #special select 'áÁâÂàÀâÂéÉêÊèÈíÍìÌòÒóÓôÔõÕúÚùÙûÛçÇ€'
select t.special
, t.special collate SQL_Latin1_General_CP1251_CS_AS as SQL_Latin1_General_CP1251_CS_AS
, t.special collate SQL_Latin1_General_CP1253_CI_AI as SQL_Latin1_General_CP1253_CI_AI
, t.special collate SQL_Latin1_General_CP1253_CS_AS as SQL_Latin1_General_CP1253_CS_AS
from #special t
if object_id('tempdb..#special') is not null
drop table #special;
select *
from (
SELECT name, description
FROM ::fn_helpcollations()
) t
where t.name like 'SQL_Latin1_General%125[1,3]%'
or t.name like '%437%'
order by name
July 6, 2017 at 1:29 pm
There's more than 1 version of the extended ASCII characters.
For reference like this (to compare to the output of CHAR), use the ISO 8859 table shown at http://www.ascii-code.com/
Cheers!
July 7, 2017 at 5:15 am
Okay. I did not know ASCII could change depending on collation. Thank you all for your quick responses. That explains a lot.
I just assumed the ASCII table was the same for everything, silly me.
Viewing 5 posts - 1 through 4 (of 4 total)
You must be logged in to reply to this topic. Login to reply