junk characters in a field

  • I have a field called "TranNo" which is a 12-digit number, but it seems like some of the records have junk characters such as *, /, empty spaces, etc., though I am not sure what other junk characters are there. How can I find all the junk records and clean them?

  • First we need to know what your system considers a junk character. With that, we have mechanisms to weed out the rows that contain one or more junk characters.

  • Any character other than alphanumeric is considered junk.

  • Here's one way: a scalar function that strips out the non-compliant characters:

    I've got the same functionality as a CLR, which uses Regular Expressions, but I've never tested which performs better.

    SELECT dbo.StripNonAlphaNumeric(someColumn) from someTable

    and my old function:

    CREATE FUNCTION StripNonAlphaNumeric (@OriginalText VARCHAR(8000))
    RETURNS VARCHAR(8000)
    AS
    BEGIN
        DECLARE @CleanedText VARCHAR(8000);

        WITH tally (N) AS
        (
            SELECT TOP 10000 ROW_NUMBER() OVER (ORDER BY sc1.id)
            FROM Master.dbo.SysColumns sc1
            CROSS JOIN Master.dbo.SysColumns sc2
        )
        SELECT @CleanedText = ISNULL(@CleanedText, '') +
            CASE
                --ASCII numbers are 48 (for '0') thru 57 (for '9')
                WHEN ASCII(SUBSTRING(@OriginalText, Tally.N, 1)) BETWEEN 48 AND 57
                    THEN SUBSTRING(@OriginalText, Tally.N, 1)
                --ASCII upper case letters A-Z are 65 thru 90
                WHEN ASCII(SUBSTRING(@OriginalText, Tally.N, 1)) BETWEEN 65 AND 90
                    THEN SUBSTRING(@OriginalText, Tally.N, 1)
                --ASCII lower case letters a-z are 97 thru 122
                WHEN ASCII(SUBSTRING(@OriginalText, Tally.N, 1)) BETWEEN 97 AND 122
                    THEN SUBSTRING(@OriginalText, Tally.N, 1)
                ELSE ''
            END
        FROM tally
        WHERE Tally.N <= LEN(@OriginalText);

        RETURN @CleanedText;
    END
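
    To tie this back to the original question of cleaning the data, a minimal sketch of applying the function in place (dbo.Transactions is a placeholder table name here; TranNo is the column from the question):

    -- Sketch only: update just the rows that actually contain a non-alphanumeric
    -- character, so the scalar function isn't called for rows that are already clean.
    UPDATE dbo.Transactions            -- placeholder table name
    SET    TranNo = dbo.StripNonAlphaNumeric(TranNo)
    WHERE  TranNo LIKE '%[^A-Za-z0-9]%';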

    Lowell


    --help us help you! If you post a question, make sure you include a CREATE TABLE... statement and INSERT INTO... statement into that table to give the volunteers here representative data. with your description of the problem, we can provide a tested, verifiable solution to your question! asking the question the right way gets you a tested answer the fastest way possible!

  • SELECT *

    FROM YourTable

    WHERE YourField LIKE '%[^A-Z0-9]%'
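
    How the [A-Z] range behaves depends on the column's collation, so if you want to be explicit about which characters count as clean, a variation (a sketch, not tested against your data) is to list both cases and optionally force a binary collation:

    -- Sketch: list both cases explicitly; the optional COLLATE clause makes the ranges
    -- evaluate strictly by character code rather than by dictionary sort order.
    SELECT *
    FROM   YourTable
    WHERE  YourField COLLATE Latin1_General_BIN LIKE '%[^A-Za-z0-9]%'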

    Drew

    J. Drew Allen
    Business Intelligence Analyst
    Philadelphia, PA

  • drew.allen (4/11/2012)


    SELECT *

    FROM YourTable

    WHERE YourField LIKE '%[^A-Z0-9]%'

    Drew

    I missed that the question was how to clean them as well as how to find them. Thanks!

    Lowell


    --help us help you! If you post a question, make sure you include a CREATE TABLE... statement and INSERT INTO... statement into that table to give the volunteers here representative data. with your description of the problem, we can provide a tested, verifiable solution to your question! asking the question the right way gets you a tested answer the fastest way possible!

  • You can always use PATINDEX to do the same.

    SELECT * FROM YourTable
    WHERE PATINDEX('%[^a-z,0-9]%', YourColumn) > 0

    ***The first step is always the hardest *******

  • Since PATINDEX presumably includes a LIKE, it's bound to be less efficient. This is borne out by a comparison of the stats.

    LIKE

    (8 row(s) affected)

    Table 'XXXXXXXX'. Scan count 1, logical reads 585, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

    SQL Server Execution Times:

    CPU time = 16 ms, elapsed time = 36 ms.

    PATINDEX

    (8 row(s) affected)

    Table 'XXXXXXXX'. Scan count 1, logical reads 2781, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

    SQL Server Execution Times:

    CPU time = 31 ms, elapsed time = 43 ms.
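
    Output in this format is what SET STATISTICS IO and SET STATISTICS TIME produce, so if you want to run the same comparison against your own data (table and column names below are placeholders):

    -- Sketch: turn on IO/time statistics, run both predicates, then turn them off.
    SET STATISTICS IO ON;
    SET STATISTICS TIME ON;

    SELECT * FROM YourTable WHERE YourField LIKE '%[^A-Z0-9]%';
    SELECT * FROM YourTable WHERE PATINDEX('%[^A-Z0-9]%', YourField) > 0;

    SET STATISTICS IO OFF;
    SET STATISTICS TIME OFF;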

    Drew

    J. Drew Allen
    Business Intelligence Analyst
    Philadelphia, PA

  • Dynamic SQL is my friend for this. 🙂

    CREATE TABLE #t (strings VARCHAR(25));
    CREATE TABLE #b (bad CHAR(1));

    DECLARE @SQL NVARCHAR(MAX), @n INT, @i INT;

    INSERT INTO #t (strings)
    SELECT 'AABA#24$-a.a/a2@'
    UNION ALL SELECT 'AAcc/24)2('
    UNION ALL SELECT 'ABC123'
    UNION ALL SELECT '123a-a.a/a56'
    UNION ALL SELECT '&a*a)a1@#a$2$a%a^3&a*a(4'
    UNION ALL SELECT '&*)1@#$2$%^3&*(4'

    -- Identify the bad characters
    ;WITH BadStrings AS (
        SELECT strings
        FROM #t
        WHERE strings LIKE '%[^A-Za-z0-9]%'
    )
    ,Parse AS (
        SELECT SUBSTRING(strings + '*', PATINDEX('%[^A-Za-z0-9]%', strings + '*'), 1) AS bad
              ,SUBSTRING(strings + '*', 1 + PATINDEX('%[^A-Za-z0-9]%', strings + '*'), LEN(strings + '*')) AS rest
        FROM BadStrings
        UNION ALL
        SELECT SUBSTRING(rest, PATINDEX('%[^A-Za-z0-9]%', rest), 1)
              ,SUBSTRING(rest, 1 + PATINDEX('%[^A-Za-z0-9]%', rest), LEN(rest))
        FROM Parse
        WHERE LEN(rest) > 0
          AND SUBSTRING(rest, 1 + PATINDEX('%[^A-Za-z0-9]%', rest), LEN(rest)) <> ''
    )
    INSERT INTO #b (bad)
    SELECT DISTINCT bad
    FROM Parse b

    -- Now clean them up
    SELECT @SQL = 'UPDATE #t SET strings = ', @i = 1, @n = (SELECT COUNT(*) FROM #b WHERE bad <> '&')

    WHILE @i <= @n + 1
        SELECT @SQL = @SQL + 'REPLACE(', @i = @i + 1

    SELECT @SQL = @SQL + 'strings, '''
        + (SELECT bad + ''', ''''), ''' FROM #b WHERE bad <> '&' FOR XML PATH(''))

    SELECT @SQL = SUBSTRING(@SQL, 1, LEN(@SQL) - 3) + ',''&'','''')' + ' WHERE strings LIKE ''%[^A-Za-z0-9]%'''

    SELECT @SQL

    EXEC (@SQL)

    SELECT * FROM #t

    DROP TABLE #t
    DROP TABLE #b

    1. Identify bad characters

    2. Clean them up

    No guarantees on performance.

    Edit: Corrected for & in the replacement string (FOR XML entitizes it, so it's handled separately) and improved the performance of the recursive query slightly.
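
    For anyone following along, the dynamic SQL that ends up being executed is shaped roughly like this (an illustration only; the actual characters depend on what lands in #b):

    -- Illustration: one nested REPLACE per bad character found, with '&' handled
    -- as the outermost replacement, applied only to rows that need cleaning.
    UPDATE #t
    SET strings = REPLACE(REPLACE(REPLACE(strings, '#', ''), '/', ''), '&', '')
    WHERE strings LIKE '%[^A-Za-z0-9]%'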

    Not a DBA, just messin' around!


    My mantra: No loops! No CURSORs! No RBAR! Hoo-uh!

    My thought question: Have you ever been told that your query runs too fast?

    My advice:
    INDEXing a poor-performing query is like putting sugar on cat food. Yeah, it probably tastes better but are you sure you want to eat it?
    The path of least resistance can be a slippery slope. Take care that fixing your fixes of fixes doesn't snowball and end up costing you more than fixing the root cause would have in the first place.

    Need to UNPIVOT? Why not CROSS APPLY VALUES instead?
    Since random numbers are too important to be left to chance, let's generate some!
    Learn to understand recursive CTEs by example.

  • If you need the best possible performance, write a CLR function (defining a static Regex with the RegexOptions.Compiled option).

    Ping me if you need help with it.

    _____________________________________________
    "The only true wisdom is in knowing you know nothing"
    "O skol'ko nam otkrytiy chudnyh prevnosit microsofta duh!":-D
    (So many miracle inventions provided by MS to us...)

    How to post your question to get the best and quick help

  • SGT_squeequal (4/11/2012)


    You can always use PATINDEX to do the same.

    SELECT * FROM YourTable
    WHERE PATINDEX('%[^a-z,0-9]%', YourColumn) > 0

    You don't need the comma in the pattern. In fact, with the comma inside the brackets it becomes one of the allowed characters, so a stray comma in the data would not be flagged as junk.

    ____________________________________________________

    Deja View - The strange feeling that somewhere, sometime you've optimised this query before

    How to get the best help on a forum

    http://www.sqlservercentral.com/articles/Best+Practices/61537

  • If you want a more efficient version that doesn't require as much pre-processing, any CLR, etc., try something like this:

    CREATE TABLE #T (String VARCHAR(100)) ;

    INSERT INTO #T -- Sample Data
            (String)
    VALUES  ('123a-./56'),
            ('&*)1@#$2$%^3&*(4') ;

    SELECT  *,
            (
             SELECT Sub + ''
             FROM   (SELECT SUBSTRING(String, Number, 1) AS Sub,
                            Number
                     FROM   dbo.Numbers -- table of positive integers
                     WHERE  Number BETWEEN 1 AND LEN(String)) AS Parser
             WHERE  Sub LIKE '[0-9a-zA-Z]'
             ORDER BY Number
            FOR XML PATH(''),
                TYPE).value('.[1]', 'varchar(100)')
    FROM    #T ;

    The Numbers table is simply a table of positive integers. They're very easy to create, and very, very useful. The inner part of the sub-query, "Parser", uses the Numbers table to split the value into single-character rows. Then it selects from that, using LIKE to grab only those values you want. I included both lower- and upper-case letters, in case your system is case-sensitive. The Number column is then used again to ensure the characters keep the order of the original string, and FOR XML PATH('') concatenates them back together. That's a common trick that's been in use since SQL Server 2005 introduced the FOR XML PATH('') option. The XQuery "value" function turns the result back into varchar from XML.

    You'll find that a function like that, built as an inline sub-query, is fast, efficient, and easy to document. If you need to expand the list of allowed characters, you just modify the Like statement, so it's easy to maintain too.
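
    If you don't already have one, a quick way to build a Numbers table like the dbo.Numbers referenced above is a sketch along these lines (size the TOP to at least the longest string you need to clean):

    -- Sketch: generate 10,000 sequential integers by cross-joining a system catalog
    -- view with itself, then index the result for fast range filtering.
    SELECT TOP (10000)
           Number = ROW_NUMBER() OVER (ORDER BY (SELECT NULL))
    INTO   dbo.Numbers
    FROM   sys.all_columns AS ac1
    CROSS JOIN sys.all_columns AS ac2;

    CREATE UNIQUE CLUSTERED INDEX IX_Numbers_Number ON dbo.Numbers (Number);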

    - Gus "GSquared", RSVP, OODA, MAP, NMVP, FAQ, SAT, SQL, DNA, RNA, UOI, IOU, AM, PM, AD, BC, BCE, USA, UN, CF, ROFL, LOL, ETC
    Property of The Thread

    "Nobody knows the age of the human race, but everyone agrees it's old enough to know better." - Anon

  • If you want a more efficient version that doesn't require as much pre-processing, any CLR, etc., ...

    I'm sure CLR will beat the use of a Numbers table on large datasets where most of the values do need replacement (we had a question here, not long ago, about getting rid of GUIDs). Just consider that 1,000,000 rows containing 12-digit values will grow into a 12,000,000-row set...

    A properly written CLR function will be efficient and elegant:

    SELECT dbo.CleanupNumeric(Value)

    FROM [Table]

    Here is the C# implementation:

    using System;
    using System.Data;
    using System.Data.SqlClient;
    using System.Data.SqlTypes;
    using Microsoft.SqlServer.Server;
    using System.Text.RegularExpressions;

    public partial class UserDefinedFunctions
    {
        // Match any character that is NOT alphanumeric (note the ^ inside the brackets).
        static readonly Regex _regex = new Regex(@"[^0-9a-zA-Z]", RegexOptions.Compiled);

        [Microsoft.SqlServer.Server.SqlFunction]
        public static SqlString CleanupNumeric(SqlString val)
        {
            if (val.IsNull) return SqlString.Null;

            // Strip everything the pattern matches, leaving only digits and letters.
            return (SqlString)_regex.Replace(val.ToString(), String.Empty);
        }
    }
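
    For completeness, deploying it looks roughly like this (a sketch; the assembly name and DLL path are placeholders, and CLR integration has to be enabled on the instance):

    -- Sketch: enable CLR, register the compiled assembly, and bind the T-SQL function
    -- to the static method above. Names and the path are placeholders.
    EXEC sp_configure 'clr enabled', 1;
    RECONFIGURE;

    CREATE ASSEMBLY CleanupFunctions
    FROM 'C:\Deploy\CleanupFunctions.dll'   -- placeholder path to the compiled DLL
    WITH PERMISSION_SET = SAFE;

    CREATE FUNCTION dbo.CleanupNumeric (@val NVARCHAR(4000))
    RETURNS NVARCHAR(4000)
    AS EXTERNAL NAME CleanupFunctions.UserDefinedFunctions.CleanupNumeric;

    Usage then matches the SELECT dbo.CleanupNumeric(Value) shown above.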

    _____________________________________________
    "The only true wisdom is in knowing you know nothing"
    "O skol'ko nam otkrytiy chudnyh prevnosit microsofta duh!":-D
    (So many miracle inventions provided by MS to us...)

    How to post your question to get the best and quick help

  • GSquared,

    Thanks for pointing out an issue with the ampersand REPLACE I was doing. I corrected my post accordingly.

    Now, as to performance, I'm curious why you think the version you posted performs better. Have you tried it vs. an alternative?

    I'd like to see someone prove it one way or another but I'm not the man with the 1,000,000 row test harness. Of course, they'd need to modify your version to UPDATE instead of just SELECT (not hard, I'm just saying).

    As they say, the proof is in the pudding.

    I will give you credit for elegance compared to my brute force though.


    My mantra: No loops! No CURSORs! No RBAR! Hoo-uh!

    My thought question: Have you ever been told that your query runs too fast?

    My advice:
    INDEXing a poor-performing query is like putting sugar on cat food. Yeah, it probably tastes better but are you sure you want to eat it?
    The path of least resistance can be a slippery slope. Take care that fixing your fixes of fixes doesn't snowball and end up costing you more than fixing the root cause would have in the first place.

    Need to UNPIVOT? Why not CROSS APPLY VALUES instead?
    Since random numbers are too important to be left to chance, let's generate some!
    Learn to understand recursive CTEs by example.

  • dwain.c (4/12/2012)


    GSquared,

    Thanks for pointing out an issue with the ampersand REPLACE I was doing. I corrected my post accordingly.

    Now, as to performance, I'm curious why you think the version you posted performs better. Have you tried it vs. an alternative?

    I'd like to see someone prove it one way or another but I'm not the man with the 1,000,000 row test harness. Of course, they'd need to modify your version to UPDATE instead of just SELECT (not hard, I'm just saying).

    As they say, the proof is in the pudding.

    I will give you credit for elegance compared to my brute force though.

    I use a version of what I posted to clean up phone numbers submitted through web forms and a variety of import sources, with varying data hygiene. There are different formatting rules, and so on, for different ETL targets for the data. I recently processed 5.5 million records in a couple of minutes, on medium-level hardware with a moderate load on it from other processes.

    I've tried a fair number of alternatives, and it's been the fastest and most reliable thus far. It's entirely possible that someone with better C#/VB.NET skills than mine can write a CLR function that's faster, but the options I found online were all much, much slower, or buggy (one of them scrambled character sequences in some cases).

    But don't take my word for it. Speed-test and reliability-test for yourself. If you trust me (or anyone else) on that kind of thing, then you don't know, you just believe, and knowing is better.

    - Gus "GSquared", RSVP, OODA, MAP, NMVP, FAQ, SAT, SQL, DNA, RNA, UOI, IOU, AM, PM, AD, BC, BCE, USA, UN, CF, ROFL, LOL, ETC
    Property of The Thread

    "Nobody knows the age of the human race, but everyone agrees it's old enough to know better." - Anon
