Removing unicode data from fixed width downloaded text

  • I'm downloading text in fixed width, but some of it contains special characters which corrupts the alignment of the data.  Is there a simple way to identify and replace any sort of unicode characters.  I can identify some but not sure how to tackle things like

    ‘Special Digital Data Service Obligation’  

    This is just an example.  I need to be able to identify anything of this nature.

     

  • Is unicode characters the same as high ascii characters?

    this is the only way i could figure out how to do it so far, was with a function that loops thru characters 127 thru 255:

    create function StripHighAscii(@originaltext varchar(8000))

    returns varchar(8000)

    begin

    declare @i int

    set @i=127

    while @i < 255

     begin

              SET @originaltext = replace(@originaltext,CHAR(@i),'')

     SET @i=@i +1

     end

    return @originaltext

    end

    select dbo.StripHighAscii('‘Special Digital Data Service Obligation’')

    results: Special Digital Data Service Obligation

    Performance might suck, but If you were testing to see if anything existed, you might do :

    select * from sometable where unifield <> dbo.StripHighAscii(unifield)

    Lowell


    --help us help you! If you post a question, make sure you include a CREATE TABLE... statement and INSERT INTO... statement into that table to give the volunteers here representative data. with your description of the problem, we can provide a tested, verifiable solution to your question! asking the question the right way gets you a tested answer the fastest way possible!

  • Thank you Lowell,

    I was looking for UNICODE(SUBSTRING(@Col,@col_pos,1))>=160, but I see now, it's not quite correct as some of these high ascii characters have unicode values less than 160.  eg. 127 and 129, although so far these haven't shown up in the data.

    Performance can't be helped as I have to replace each one with something meaningful  eg. GPS position 14º 54’ 0”  has to be remain meaningful, but at least I have a limited subset of ascii values to work with now. 

    Mary

  • Mary,

    Lowell's code is great but, as he said, there's a bit of a performance problem... takes about 4 and a half minutes to return 100,000 rows.  The following function is about 22 times faster on a 100,000 row test... takes only 12 seconds...

    But, first, let's create a bit of test data... say, 100,000 rows... this is NOT part of the solution!  We just need some test data...

    --===== Create and populate a 100,000 row test table.
         -- Column RowNum has a range of 1 to 1,000,000 unique numbers
         -- Column "SomeString" has data with characters above the CHAR(128) range
         -- Takes about 2 seconds to execute.
     SELECT TOP 100000
            RowNum     = IDENTITY(INT,1,1),
            SomeString = CAST('‘Special Digital Data Service Obligation’' AS VARCHAR(80))
       INTO dbo.JBMTest
       FROM Master.dbo.SysColumns sc1,
            Master.dbo.SysColumns sc2 --Lack of join criteria makes this a CROSS-JOIN
    --===== A table is not properly formed unless a Primary Key has been assigned
      ALTER TABLE dbo.JBMTest
            ADD PRIMARY KEY CLUSTERED (RowNum)
    --===== Put some other data of concern in the table that must remain "meaningful"...
     UPDATE dbo.JBMTest
        SET SomeString = 'GPS position 14º 54’ 0”'
      WHERE RowNum %2 = 0

    Now, before we get to the solution, we need to make a well indexed table of sequential numbers.  These "Tally" or "Numbers" tables are very powerful and can help do things in SQL Server 2000 as if we were using ROWNUM from SQL Server 2005.  You should make a permanent Tally table as follows... yes, this is part of the solution for this and many other "impossible" tasks...

    --===== Create and populate the Tally table on the fly

     SELECT TOP 11000 --equates to more than 30 years of dates

            IDENTITY(INT,1,1) AS N

       INTO dbo.Tally

       FROM Master.dbo.SysColumns sc1,

            Master.dbo.SysColumns sc2

    --===== Add a Primary Key to maximize performance

      ALTER TABLE dbo.Tally

        ADD CONSTRAINT PK_Tally_N

            PRIMARY KEY CLUSTERED (N) WITH FILLFACTOR = 100

    --===== Allow the general public to use it

      GRANT SELECT ON dbo.Tally TO PUBLIC

    And, now for the function to do the cleaning you want and still maintain some of the special characters to keep certain types of data "meaningful" as you requested...

     CREATE FUNCTION dbo.jbmStripHighAscii(@OriginalText VARCHAR(8000))
    RETURNS VARCHAR(8000)
      BEGIN
            DECLARE @CleanedText VARCHAR(8000)
            
             SELECT @CleanedText = ISNULL(@CleanedText,'') +  SUBSTRING(@OriginalText,t.N,1)
               FROM dbo.Tally t
              WHERE t.N <= LEN(@OriginalText)
                AND (
                        ASCII(SUBSTRING(@OriginalText,t.N,1))<128
                     OR 
                        SUBSTRING(@OriginalText,t.N,1) IN ('º','’','”')
                    )
    RETURN @CleanedText
        END

    And, now, let's test the function... make sure the "Grid" mode is turned on in Query Analyzer and then run this...

     SELECT dbo.jbmStripHighAscii(SomeString)
       FROM dbo.JBMTest

    Please let me know if you have any additional questions...

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • elegant solution as always from Jeff;

    I'm decent with SQL, but I still haven't gotten my arms around the Tally table concept for processing;

    someday I'll grow up and be like Jeff

    Lowell


    --help us help you! If you post a question, make sure you include a CREATE TABLE... statement and INSERT INTO... statement into that table to give the volunteers here representative data. with your description of the problem, we can provide a tested, verifiable solution to your question! asking the question the right way gets you a tested answer the fastest way possible!

  • Heh... thanks Lowell

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

Viewing 6 posts - 1 through 5 (of 5 total)

You must be logged in to reply to this topic. Login to reply