August 31, 2007 at 12:20 am
I'm downloading text in fixed width, but some of it contains special characters which corrupts the alignment of the data. Is there a simple way to identify and replace any sort of unicode characters. I can identify some but not sure how to tackle things like
‘Special Digital Data Service Obligation’
This is just an example. I need to be able to identify anything of this nature.
September 4, 2007 at 2:42 pm
Is unicode characters the same as high ascii characters?
this is the only way i could figure out how to do it so far, was with a function that loops thru characters 127 thru 255:
create function StripHighAscii(@originaltext varchar(8000))
returns varchar(8000)
begin
declare @i int
set @i=127
while @i < 255
begin
SET @originaltext = replace(@originaltext,CHAR(@i),'')
SET @i=@i +1
end
return @originaltext
end
select dbo.StripHighAscii('‘Special Digital Data Service Obligation’')
results: Special Digital Data Service Obligation
Performance might suck, but If you were testing to see if anything existed, you might do :
select * from sometable where unifield <> dbo.StripHighAscii(unifield)
Lowell
September 4, 2007 at 6:10 pm
Thank you Lowell,
I was looking for UNICODE(SUBSTRING(@Col,@col_pos,1))>=160, but I see now, it's not quite correct as some of these high ascii characters have unicode values less than 160. eg. 127 and 129, although so far these haven't shown up in the data.
Performance can't be helped as I have to replace each one with something meaningful eg. GPS position 14º 54’ 0” has to be remain meaningful, but at least I have a limited subset of ascii values to work with now.
Mary
September 4, 2007 at 9:17 pm
Mary,
Lowell's code is great but, as he said, there's a bit of a performance problem... takes about 4 and a half minutes to return 100,000 rows. The following function is about 22 times faster on a 100,000 row test... takes only 12 seconds...
But, first, let's create a bit of test data... say, 100,000 rows... this is NOT part of the solution! We just need some test data...
--===== Create and populate a 100,000 row test table. -- Column RowNum has a range of 1 to 1,000,000 unique numbers -- Column "SomeString" has data with characters above the CHAR(128) range
-- Takes about 2 seconds to execute. SELECT TOP 100000 RowNum = IDENTITY(INT,1,1), SomeString = CAST('‘Special Digital Data Service Obligation’' AS VARCHAR(80)) INTO dbo.JBMTest FROM Master.dbo.SysColumns sc1, Master.dbo.SysColumns sc2 --Lack of join criteria makes this a CROSS-JOIN
--===== A table is not properly formed unless a Primary Key has been assigned ALTER TABLE dbo.JBMTest ADD PRIMARY KEY CLUSTERED (RowNum)
--===== Put some other data of concern in the table that must remain "meaningful"... UPDATE dbo.JBMTest SET SomeString = 'GPS position 14º 54’ 0”' WHERE RowNum %2 = 0
Now, before we get to the solution, we need to make a well indexed table of sequential numbers. These "Tally" or "Numbers" tables are very powerful and can help do things in SQL Server 2000 as if we were using ROWNUM from SQL Server 2005. You should make a permanent Tally table as follows... yes, this is part of the solution for this and many other "impossible" tasks...
--===== Create and populate the Tally table on the fly
SELECT TOP 11000 --equates to more than 30 years of dates
IDENTITY(INT,1,1) AS N
INTO dbo.Tally
FROM Master.dbo.SysColumns sc1,
Master.dbo.SysColumns sc2
--===== Add a Primary Key to maximize performance
ALTER TABLE dbo.Tally
ADD CONSTRAINT PK_Tally_N
PRIMARY KEY CLUSTERED (N) WITH FILLFACTOR = 100
--===== Allow the general public to use it
GRANT SELECT ON dbo.Tally TO PUBLIC
And, now for the function to do the cleaning you want and still maintain some of the special characters to keep certain types of data "meaningful" as you requested...
CREATE FUNCTION dbo.jbmStripHighAscii(@OriginalText VARCHAR(8000)) RETURNS VARCHAR(8000) BEGIN DECLARE @CleanedText VARCHAR(8000) SELECT @CleanedText = ISNULL(@CleanedText,'') + SUBSTRING(@OriginalText,t.N,1) FROM dbo.Tally t WHERE t.N <= LEN(@OriginalText) AND ( ASCII(SUBSTRING(@OriginalText,t.N,1))<128 OR SUBSTRING(@OriginalText,t.N,1) IN ('º','’','”') )
RETURN @CleanedText END
And, now, let's test the function... make sure the "Grid" mode is turned on in Query Analyzer and then run this...
SELECT dbo.jbmStripHighAscii(SomeString) FROM dbo.JBMTest
Please let me know if you have any additional questions...
--Jeff Moden
Change is inevitable... Change for the better is not.
September 5, 2007 at 7:01 am
elegant solution as always from Jeff;
I'm decent with SQL, but I still haven't gotten my arms around the Tally table concept for processing;
someday I'll grow up and be like Jeff
Lowell
September 5, 2007 at 7:50 am
Heh... thanks Lowell
--Jeff Moden
Change is inevitable... Change for the better is not.
Viewing 6 posts - 1 through 5 (of 5 total)
You must be logged in to reply to this topic. Login to reply