Identifying foreign characters in rows of data

  • Guys,

    I have a table with a single VARCHAR column.  The collation for this column is SQL_Latin1_General_CP1_CI_AS.  Some of the values contain Greek or other European-language characters (e.g. µ).

    Does anyone know how I can identify all the rows in a table which contain non-English characters?

    I was thinking to use something like the following:

    SELECT *
    FROM TestTable
    WHERE NAME NOT LIKE '%[A-Z]%'

    ... but that does not seem to work

    Any ideas?

    Thanks a lot

  • Try using the ASCII function in your filter.
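    For illustration, a minimal sketch of what ASCII returns (assuming the CP1252 code page implied by the column's collation, where µ is 181):

    -- ASCII() returns the code of the FIRST character of its argument only,
    -- so to scan a whole string it has to be applied character by character.
    SELECT ASCII('abcde');   -- 97  ('a')
    SELECT ASCII('µ');       -- 181 on code page 1252, i.e. above the 7-bit range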


    Ronald San Juan | SQL DBA
    ID 710124 ~ Code "Northwind"

  • Thanks Ronald.  This is actually a good starting point for me.

    One option is to write a function which will just look through the string in a row, char by char, and return 0 or 1, based on whether a char outside the acceptable range(s) is found.  It can also short-circuit (i.e. stop as soon as such a char is found).

    I can then apply this function to each row, as I execute my select. 

    I was looking for some SQL Server built-in functionality to do this.

    But ... it is pretty trivial, so I'll just program it myself.
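    A minimal sketch of what such a function might look like (the name and the 32-126 "acceptable" range are illustrative assumptions, not a definitive implementation):

    CREATE FUNCTION dbo.HasForeignChar (@s varchar(8000))
    RETURNS bit
    AS
    BEGIN
        DECLARE @i int;
        SET @i = 1;
        WHILE @i <= LEN(@s)
        BEGIN
            -- outside the printable 7-bit range: flag it and stop (short circuit)
            IF ASCII(SUBSTRING(@s, @i, 1)) NOT BETWEEN 32 AND 126
                RETURN 1;
            SET @i = @i + 1;
        END;
        RETURN 0;
    END;

    It could then be applied per row, e.g. SELECT * FROM TestTable WHERE dbo.HasForeignChar(NAME) = 1.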

     

    Thanks a lot

  • I see. I'm not sure if COLLATE can do the trick. I'm unable to test as I do not have the data for the different languages.


    Ronald San Juan | SQL DBA
    ID 710124 ~ Code "Northwind"

    Maybe you could make a copy of the table - except this time define the column with the default collation - and then try to copy the values into it, to see if that filters out all the incompatible values ?!?!
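    A rough sketch of that suggestion (the column size and target collation are assumptions; note that inserting into a different code page maps characters to their closest equivalents rather than rejecting rows, so the copy would still need comparing back to the source):

    CREATE TABLE TestTableCopy
    (
        NAME varchar(100) COLLATE DATABASE_DEFAULT NOT NULL
    );

    INSERT TestTableCopy (NAME)
    SELECT NAME
    FROM TestTable;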

    **ASCII stupid question, get a stupid ANSI !!!**

  • These are all good ideas.

    However, I don't want to filter out the values - I want to identify them.  I guess, once filtered out, I could compare the 2 tables ... but I am looking for a more elegant solution - if such exists, of course ...

    As for foreign chars, you can just insert a 'µ' intermixed with English chars, as strings in some rows, and see if you can identify those rows.

     

    Thanks a lot

  • Depends on what 'foreign' means

    You could try PATINDEX

    PATINDEX('%[^!"#$%&''()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ\^_`abcdefghijklmnopqrstuvwxyz{|}~]%',somecolumn)

    which will give you the position of the first char above 127 (it will also flag control chars such as CR/LF and the brackets [ ], since they are not in the set), but it should get you started.  You can test for [ and ] separately, but you will need to use a char not present in the data as an ESCAPE.
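    For example, a hedged sketch of the ESCAPE idea (assuming '!' does not occur in the data, and reusing the TestTable/NAME names from the first post):

    SELECT *
    FROM TestTable
    WHERE NAME LIKE '%![%' ESCAPE '!'   -- rows containing a literal [
       OR NAME LIKE '%]%';              -- ] is literal outside a bracket set anyway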

    Far away is close at hand in the images of elsewhere.
    Anon.

  • It does seem like LIKE and a wildcard could solve it...?

    This looks like it's working anyway..

    create table #x ( a varchar(20) not null )

    insert #x
    select 'abcde' union all
    select 'abcµde'

    -- finds only where 'legal' a-z chars exist
    select * from #x where a not like '%[^a-z]%'

    -- finds only where 'illegal' chars exist
    select * from #x where a like '%[^a-z]%'

    /Kenneth

  • Kenneth,

    Your solution seems very good.

    However, it cannot solve my problem completely, as the text column I am working with contains a lot of phrases (i.e. not just single words).

    Otherwise, it would be perfect (assuming, of course, that other conditions are added, for example:

    AND X NOT LIKE '%.%'
    AND X NOT LIKE '%,%'
    AND X NOT LIKE '%;%'
    AND X NOT LIKE '%-%'
    AND X NOT LIKE '%)%'
    AND X NOT LIKE '%(%'
    )

     

    As for PATINDEX ... I am about to test it out now

     

    Thanks a lot guys

     

    The wildcard works just the same with LIKE as with PATINDEX (you can see that in the example above).

    For ranges, you just specify the allowed/disallowed chars you want to filter on, one after another - as in the sketch below.
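    For instance, a hedged sketch that also allows digits, space and some punctuation (the exact list is an assumption; other literal chars can be appended to the set as needed):

    -- reusing the #x table from the earlier example:
    -- rows containing anything outside the allowed set
    select * from #x where a like '%[^a-zA-Z0-9 .,;()]%'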

    /Kenneth

    Another way to identify chars above ASCII 127 is to use the now-famous Numbers table

    SELECT t.rowid, n.Number
    FROM t
    INNER JOIN [Numbers] n
      ON n.Number <= LEN([column])
    WHERE ASCII(SUBSTRING([column],n.Number,1)) > 127

    Not necessarily very efficient.

    Also beware that currency symbols have ASCII codes > 127 too, e.g. £ and €,

    but, as I said before, it depends on what is meant by 'foreign'.
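    A hedged variant of the same idea, listing each offending row only once rather than every offending position (same placeholder names as above):

    SELECT t.rowid
    FROM t
    WHERE EXISTS
        (SELECT *
         FROM [Numbers] n
         WHERE n.Number <= LEN(t.[column])
           AND ASCII(SUBSTRING(t.[column], n.Number, 1)) > 127)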

    Far away is close at hand in the images of elsewhere.
    Anon.

  • To make it universal and manageable:

    SELECT t.rowid, n.Number
    FROM t
    INNER JOIN [Numbers] n
      ON n.Number <= LEN(t.[column])
    LEFT JOIN AllowedSymbols A
      ON SUBSTRING(t.[column], n.Number, 1) = A.Symbol
    WHERE A.ASCIICode IS NULL   -- position holds a char not in the allowed list

    The AllowedSymbols table contains all symbols that should not be rejected from the string.

    The Symbol value in this table = NCHAR(ASCIICode).
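    A minimal sketch of one possible definition, populated from the Numbers table (the column types and the 32-126 range are assumptions):

    CREATE TABLE dbo.AllowedSymbols
    (
        ASCIICode int PRIMARY KEY,
        Symbol    nchar(1) NOT NULL
    );

    INSERT dbo.AllowedSymbols (ASCIICode, Symbol)
    SELECT Number, NCHAR(Number)
    FROM dbo.Numbers
    WHERE Number BETWEEN 32 AND 126;   -- printable ASCII only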

    _____________
    Code for TallyGenerator

  • Guys,

    What is the "Numbers" table?

    Now to explain what I mean by foreign characters: it all began when I tried to convert a database into an XML file.  I used the FOR XML syntax of SQL Server, which easily generated the XML file from my database.  I then added the root node and tried to view the XML file in Internet Explorer.  Internet Explorer generated errors because my data contained characters which could not be interpreted.

    I learned that I need to replace those chars with their ASCII equivalents, which should solve the problem.  But in order to replace them, I had to first identify what they are and where they are (as I have many occurrences).  Hence this posting.

    Would anyone know if there is another way of replacing all XML-unfriendly chars with their ASCII codes automatically - possibly during the XML generation from SQL Server?

    If not, then the codes provided in earlier postings could hopefully do the trick.

    Thanks a lot

    Quote: What is the "Numbers" table?

    It is a generic table of numbers, like

    CREATE TABLE dbo.Numbers (Number int PRIMARY KEY CLUSTERED)

    populated with sufficient numbers for the tasks it is used for - e.g. for VARCHAR processing, numbers 1 to 8000.
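    A hedged sketch of one way to populate it (assuming SQL Server 2005+; any method that yields 1 to 8000 works):

    -- cross join a system table with itself to get plenty of rows,
    -- then number them and keep 1..8000
    INSERT dbo.Numbers (Number)
    SELECT n
    FROM (SELECT ROW_NUMBER() OVER (ORDER BY a.number) AS n
          FROM master.dbo.spt_values a
          CROSS JOIN master.dbo.spt_values b) x
    WHERE n <= 8000;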

    Far away is close at hand in the images of elsewhere.
    Anon.

  • David: Thanks for the clarification.

    Guys,

    It would be great to still find out whether there is a way to force conversion of XML-unfriendly characters during XML generation from a SQL table.

    However, in the meantime, with all the help provided here, I was able to solve my problem (i.e. identify rows with bad data) as such:

    ==================================================================================

    SELECT COLUMN_NAME
    FROM MyTable
    WHERE PATINDEX('%[^!"#$%&''()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ\^_`abcdefghijklmnopqrstuvwxyz{|}~]%', REPLACE(COLUMN_NAME,' ','')) > 0

    ==================================================================================

    Note that I used the REPLACE function to strip spaces, so that multi-word text is treated as a single string during processing.  Otherwise the spaces (which are not in the allowed set) created problems.

     

    Thanks a lot
