Identifying foreign characters in rows of data

  • Guys,

    I have a table with a single VARCHAR column.  The collation for this column is SQL_Latin1_General_CP1_CI_AS.  Some of the values contain Greek or other European-language characters (e.g. µ).

    Does anyone know how I can identify all the rows in a table which contain non-English characters?

    I was thinking to use something like the following:

    SELECT *
    FROM TestTable
    WHERE NAME NOT LIKE '%[A-Z]%'

    ... but that does not seem to work

    Any ideas?

    Thanks a lot

  • Try using the ASCII function in your filter.
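    For illustration, a minimal sketch of what ASCII returns (assuming the CP1252 code page implied by the column's collation, where µ is 181):

    -- ASCII() returns the code of the FIRST character of its argument only,
    -- so to scan a whole string it has to be applied character by character.
    SELECT ASCII('abcde');   -- 97  ('a')
    SELECT ASCII('µ');       -- 181 on code page 1252, i.e. above the 7-bit range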


    Ronald San Juan | SQL DBA
    ID 710124 ~ Code "Northwind"

  • Thanks Ronald.  This is actually a good starting point for me.

    One option is to write a function which will just look through the string in a row, char by char, and return 0 or 1, based on whether a char outside the acceptable range(s) is found.  It can also short-circuit (i.e. stop as soon as such a char is found).

    I can then apply this function to each row, as I execute my select. 

    I was looking for some SQL Server built-in functionality to do this.

    But ... it is pretty trivial, so I'll just program it myself.
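    A minimal sketch of what such a function might look like (the name and the 32-126 "acceptable" range are illustrative assumptions, not a definitive implementation):

    CREATE FUNCTION dbo.HasForeignChar (@s varchar(8000))
    RETURNS bit
    AS
    BEGIN
        DECLARE @i int;
        SET @i = 1;
        WHILE @i <= LEN(@s)
        BEGIN
            -- outside the printable 7-bit range: flag it and stop (short circuit)
            IF ASCII(SUBSTRING(@s, @i, 1)) NOT BETWEEN 32 AND 126
                RETURN 1;
            SET @i = @i + 1;
        END;
        RETURN 0;
    END;

    It could then be applied per row, e.g. SELECT * FROM TestTable WHERE dbo.HasForeignChar(NAME) = 1.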

     

    Thanks a lot

  • I see. I'm not sure if COLLATE can do the trick. I'm unable to test as I do not have the data for the different languages.


    Ronald San Juan | SQL DBA
    ID 710124 ~ Code "Northwind"

    Maybe you could make a copy of the table - except this time define the column with the default collation - and then try to copy the values into it, to see if that filters out all the incompatible values ?!?!
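    A rough sketch of that suggestion (the column size and target collation are assumptions; note that inserting into a different code page maps characters to their closest equivalents rather than rejecting rows, so the copy would still need comparing back to the source):

    CREATE TABLE TestTableCopy
    (
        NAME varchar(100) COLLATE DATABASE_DEFAULT NOT NULL
    );

    INSERT TestTableCopy (NAME)
    SELECT NAME
    FROM TestTable;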

    **ASCII stupid question, get a stupid ANSI !!!**

  • These are all good ideas.

    However, I don't want to filter out the values - I want to identify them.  I guess, once filtered out, I could compare the 2 tables ... but I am looking for a more elegant solution - if such exists, of course ...

    As for foreign chars, you can just insert a 'µ' intermixed with English chars, as strings in some rows, and see if you can identify those rows.

     

    Thanks a lot

  • Depends on what 'foreign' means

    You could try PATINDEX

    PATINDEX('%[^!"#$%&''()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ\^_`abcdefghijklmnopqrstuvwxyz{|}~]%',somecolumn)

    which will give you the position of the first char above 127 (it will also flag control chars such as CR/LF and the brackets [ ], since they are not in the set), but it should get you started.  You can test for [ and ] separately, but you will need to use a char not present in the data as an ESCAPE.
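    For example, a hedged sketch of the ESCAPE idea (assuming '!' does not occur in the data, and reusing the TestTable/NAME names from the first post):

    SELECT *
    FROM TestTable
    WHERE NAME LIKE '%![%' ESCAPE '!'   -- rows containing a literal [
       OR NAME LIKE '%]%';              -- ] is literal outside a bracket set anyway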

    Far away is close at hand in the images of elsewhere.
    Anon.

  • It does seem like LIKE and a wildcard could solve it...?

    This looks like it's working anyway..

    create table #x ( a varchar(20) not null )

    insert #x
    select 'abcde' union all
    select 'abcµde'

    -- finds only where 'legal' a-z chars exist
    select * from #x where a not like '%[^a-z]%'

    -- finds only where 'illegal' chars exist
    select * from #x where a like '%[^a-z]%'

    /Kenneth

  • Kenneth,

    Your solution seems very good.

    However, it cannot solve my problem completely, as the text column I am working with contains a lot of phrases (i.e. not just single words).

    Otherwise, it would be perfect (assuming, of course, that other conditions are added, for example:

    AND X NOT LIKE '%.%'
    AND X NOT LIKE '%,%'
    AND X NOT LIKE '%;%'
    AND X NOT LIKE '%-%'
    AND X NOT LIKE '%)%'
    AND X NOT LIKE '%(%'
    )

     

    As for PATINDEX ... I am about to test it out now

     

    Thanks a lot guys

     

    The wildcard works just the same with LIKE as with PATINDEX (you can see that in the example above).

    For ranges, you just specify the allowed/disallowed chars you want to filter on, one after another - as in the sketch below.
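    For instance, a hedged sketch that also allows digits, space and some punctuation (the exact list is an assumption; other literal chars can be appended to the set as needed):

    -- reusing the #x table from the earlier example:
    -- rows containing anything outside the allowed set
    select * from #x where a like '%[^a-zA-Z0-9 .,;()]%'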

    /Kenneth

    Another way to identify chars above ASCII 127 is to use the now-famous Numbers table

    SELECT t.rowid, n.Number
    FROM t
    INNER JOIN [Numbers] n
      ON n.Number <= LEN([column])
    WHERE ASCII(SUBSTRING([column],n.Number,1)) > 127

    Not necessarily very efficient.

    Also beware that currency symbols have ASCII codes > 127 too, e.g. £ and €,

    but, as I said before, it depends on what is meant by 'foreign'.
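    A hedged variant of the same idea, listing each offending row only once rather than every offending position (same placeholder names as above):

    SELECT t.rowid
    FROM t
    WHERE EXISTS
        (SELECT *
         FROM [Numbers] n
         WHERE n.Number <= LEN(t.[column])
           AND ASCII(SUBSTRING(t.[column], n.Number, 1)) > 127)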

    Far away is close at hand in the images of elsewhere.
    Anon.

  • To make it universal and manageable:

    SELECT t.rowid, n.Number
    FROM t
    INNER JOIN [Numbers] n
      ON n.Number <= LEN(t.[column])
    LEFT JOIN AllowedSymbols A
      ON SUBSTRING(t.[column], n.Number, 1) = A.Symbol
    WHERE A.ASCIICode IS NULL   -- position holds a char not in the allowed list

    The AllowedSymbols table contains all symbols that should not be rejected from the string.

    The Symbol value in this table = NCHAR(ASCIICode).
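    A minimal sketch of one possible definition, populated from the Numbers table (the column types and the 32-126 range are assumptions):

    CREATE TABLE dbo.AllowedSymbols
    (
        ASCIICode int PRIMARY KEY,
        Symbol    nchar(1) NOT NULL
    );

    INSERT dbo.AllowedSymbols (ASCIICode, Symbol)
    SELECT Number, NCHAR(Number)
    FROM dbo.Numbers
    WHERE Number BETWEEN 32 AND 126;   -- printable ASCII only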

    _____________
    Code for TallyGenerator

  • Guys,

    What is the "Numbers" table?

    Now to explain what I mean by foreign characters: it all began when I tried to convert a database into an XML file.  I used the FOR XML syntax of SQL Server, which easily generated the XML file from my database.  I then added the root node and tried to view the XML file in Internet Explorer.  Internet Explorer generated errors because my data contained characters which could not be interpreted.

    I learned that I need to replace those chars with their ASCII equivalents, which should solve the problem.  But in order to replace them, I had to first identify what they are and where they are (as I have many occurrences).  Hence this posting.

    Would anyone know if there is another way of replacing all XML-unfriendly chars with their ASCII codes automatically - possibly during the XML generation from SQL Server?

    If not, then the codes provided in earlier postings could hopefully do the trick.

    Thanks a lot

    Quote: What is the "Numbers" table?

    It is a generic table of numbers, like

    CREATE TABLE dbo.Numbers (Number int PRIMARY KEY CLUSTERED)

    populated with sufficient numbers for the tasks it is used for - e.g. for VARCHAR processing, numbers 1 to 8000.
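    A hedged sketch of one way to populate it (assuming SQL Server 2005+; any method that yields 1 to 8000 works):

    -- cross join a system table with itself to get plenty of rows,
    -- then number them and keep 1..8000
    INSERT dbo.Numbers (Number)
    SELECT n
    FROM (SELECT ROW_NUMBER() OVER (ORDER BY a.number) AS n
          FROM master.dbo.spt_values a
          CROSS JOIN master.dbo.spt_values b) x
    WHERE n <= 8000;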

    Far away is close at hand in the images of elsewhere.
    Anon.

  • David: Thanks for the clarification.

    Guys,

    It would be great to still find out whether there is a way to force conversion of XML-unfriendly characters during XML generation from a SQL table.

    However, in the meantime, with all the help provided here, I was able to solve my problem (i.e. identify rows with bad data) as such:

    ==================================================================================

    SELECT COLUMN_NAME
    FROM MyTable
    WHERE PATINDEX('%[^!"#$%&''()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ\^_`abcdefghijklmnopqrstuvwxyz{|}~]%', REPLACE(COLUMN_NAME,' ','')) > 0

    ==================================================================================

    Note that I used the REPLACE function to strip spaces, so that multi-word text is treated as a single string during processing.  Otherwise the spaces (which are not in the allowed set) created problems.

     

    Thanks a lot
