April 11, 2012 at 12:42 pm
I have a field called " TranNo" which is a 12 digit number but it seems like some of the reords have junk characters like * , / , empty space etc.. though i am not sure what other junk characters are there. How can find all junk recrods and clean them?
April 11, 2012 at 12:53 pm
First we should know what according to your system is considered as a junk character. With that, we have mechanisms to weed the rows that contain one or more junk character.
April 11, 2012 at 1:09 pm
any character other than alph-numeric is considered as junk.
April 11, 2012 at 1:15 pm
here's one way: a scalar function that strips out the non-compliant characters:
I've got the same functionality as a CLR, which uses Regular Expressions, but I've never tested which performs better.
SELECT dbo.StripNonAlphaNumeric(someColumn) from someTable
and my old function:
CREATE FUNCTION StripNonAlphaNumeric(@OriginalText VARCHAR(8000))
RETURNS VARCHAR(8000)
BEGIN
DECLARE @CleanedText VARCHAR(8000)
;WITH tally (N) as
(SELECT TOP 10000 row_number() OVER (ORDER BY sc1.id)
FROM Master.dbo.SysColumns sc1
CROSS JOIN Master.dbo.SysColumns sc2)
SELECT @CleanedText = ISNULL(@CleanedText,'') +
CASE
--ascii numbers are 48(for '0') thru 57 (for '9')
WHEN ASCII(SUBSTRING(@OriginalText,Tally.N,1)) BETWEEN 48 AND 57
THEN SUBSTRING(@OriginalText,Tally.N,1)
--ascii upper case letters A-Z is 65 thru 90
WHEN ASCII(SUBSTRING(@OriginalText,Tally.N,1)) BETWEEN 65 AND 90
THEN SUBSTRING(@OriginalText,Tally.N,1)
--ascii lower case letters a-z is 97 thru 122
WHEN ASCII(SUBSTRING(@OriginalText,Tally.N,1)) BETWEEN 97 AND 122
THEN SUBSTRING(@OriginalText,Tally.N,1)
ELSE '' END
FROM tally WHERE Tally.N <= LEN(@OriginalText)
RETURN @CleanedText
END
Lowell
April 11, 2012 at 1:16 pm
SELECT *
FROM YourTable
WHERE YourField LIKE '%[^A-Z0-9]%'
Drew
J. Drew Allen
Business Intelligence Analyst
Philadelphia, PA
April 11, 2012 at 1:23 pm
drew.allen (4/11/2012)
SELECT *
FROM YourTable
WHERE YourField LIKE '%[^A-Z0-9]%'
Drew
i missed that the question was how to clean them as well as how to find them, thanks!
Lowell
April 11, 2012 at 1:34 pm
you can always use patindex to do the same.
select * from your table
where PATINDEX('%[^a-z,0-9]%',yourcollumm)>0
***The first step is always the hardest *******
April 11, 2012 at 2:06 pm
Since the PATINDEX presumably includes a LIKE, it's bound to be less efficient. This is borne out by a comparison of the stats.
LIKE
(8 row(s) affected)
Table 'XXXXXXXX'. Scan count 1, logical reads 585, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
SQL Server Execution Times:
CPU time = 16 ms, elapsed time = 36 ms.
PATINDEX
(8 row(s) affected)
Table 'XXXXXXXX'. Scan count 1, logical reads 2781, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
SQL Server Execution Times:
CPU time = 31 ms, elapsed time = 43 ms.
Drew
J. Drew Allen
Business Intelligence Analyst
Philadelphia, PA
April 12, 2012 at 12:18 am
Dynamic SQL is my friend for this. 🙂
CREATE TABLE #t (strings VARCHAR(25));
CREATE TABLE #b (bad CHAR(1));
DECLARE @SQL NVARCHAR(MAX), @n INT, @i INT;
INSERT INTO #t (strings)
SELECT 'AABA#24$-a.a/a2@'
UNION ALL SELECT 'AAcc/24)2('
UNION ALL SELECT 'ABC123'
UNION ALL SELECT '123a-a.a/a56'
UNION ALL SELECT '&a*a)a1@#a$2$a%a^3&a*a(4'
UNION ALL SELECT '&*)1@#$2$%^3&*(4'
-- Identify the bad characters
;WITH BadStrings AS (
SELECT strings
FROM #t
WHERE strings LIKE '%[^A-Za-z0-9]%'
)
,Parse AS (
SELECT SUBSTRING(strings + '*', PATINDEX('%[^A-Za-z0-9]%',strings + '*'), 1) as bad
,SUBSTRING(strings + '*', 1+PATINDEX('%[^A-Za-z0-9]%',strings + '*'), LEN(strings + '*')) as rest
FROM BadStrings
UNION ALL
SELECT SUBSTRING(rest, PATINDEX('%[^A-Za-z0-9]%',rest), 1)
,SUBSTRING(rest, 1+PATINDEX('%[^A-Za-z0-9]%',rest), LEN(rest))
FROM Parse WHERE LEN(rest) > 0 and
SUBSTRING(rest, 1+PATINDEX('%[^A-Za-z0-9]%',rest), LEN(rest)) <> ''
)
INSERT INTO #b (bad)
SELECT DISTINCT bad
FROM Parse b
-- Now clean them up
SELECT @SQL = 'UPDATE #t SET strings = ', @i = 1, @n = (SELECT COUNT(*) FROM #b WHERE bad <> '&')
WHILE @i <= @n + 1
SELECT @SQL = @SQL + 'REPLACE(', @i = @i + 1
SELECT @SQL = @SQL + 'strings, '''
+(SELECT bad + ''', ''''), '''FROM #bWHERE bad <> '&' FOR XML PATH(''))
SELECT @SQL = SUBSTRING(@SQL, 1, LEN(@SQL)-3) + ',''&'','''')' + 'WHERE strings LIKE ''%[^A-Za-z0-9]%'''
SELECT @SQL
EXEC (@SQL)
SELECT * FROM #t
DROP TABLE #t
DROP TABLE #b
1. Identify bad characters
2. Clean them up
No guarantees on performance.
Edit: Corrected for & in the replacement string (XML doesn't allow) and improved performance of the recursive query slightly.
Not a DBA, just messin' around!
My thought question: Have you ever been told that your query runs too fast?
My advice:
INDEXing a poor-performing query is like putting sugar on cat food. Yeah, it probably tastes better but are you sure you want to eat it?
The path of least resistance can be a slippery slope. Take care that fixing your fixes of fixes doesn't snowball and end up costing you more than fixing the root cause would have in the first place.
Need to UNPIVOT? Why not CROSS APPLY VALUES instead?[/url]
Since random numbers are too important to be left to chance, let's generate some![/url]
Learn to understand recursive CTEs by example.[/url]
[url url=http://www.sqlservercentral.com/articles/St
April 12, 2012 at 6:36 am
If you need the best possible performance - write CLR function (with defining static Regex with RegexOptions.Compiled option)
Ping if you need help with it.
April 12, 2012 at 6:46 am
SGT_squeequal (4/11/2012)
you can always use patindex to do the same.
select * from your table
where PATINDEX('%[^a-z,0-9]%',yourcollumm)>0
You don't need the comma in the regular expression
____________________________________________________
Deja View - The strange feeling that somewhere, sometime you've optimised this query before
How to get the best help on a forum
http://www.sqlservercentral.com/articles/Best+Practices/61537April 12, 2012 at 6:56 am
If you want a more efficient version, that doesn't require as much pre-processing, any CLR, etc., try something like this:
CREATE TABLE #T (String VARCHAR(100)) ;
INSERT INTO #T -- Sample Data
(String)
VALUES ('123a-./56'),
('&*)1@#$2$%^3&*(4') ;
SELECT *,
(
SELECT Sub + ''
FROM (SELECT SUBSTRING(String, Number, 1) AS Sub,
Number
FROM dbo.Numbers -- table of positive integers
WHERE Number BETWEEN 1 AND LEN(String)) AS Parser
WHERE Sub LIKE '[0-9a-zA-Z]'
ORDER BY Number
FOR XML PATH(''),
TYPE).value('.[1]', 'varchar(100)')
FROM #T ;
The Numbers table is simply a table of positive integers. They're very easy to create, and very, very useful. The inner part of the sub-query, "Parser", uses the Numbers table to split the value into single-character rows. Then it selects from that, using Like to grab only those values you want. I included both lower- and upper-case letters, in case your system is case-sensitive. Then use the Number column again, to enforce that it retains the order of the original string, and For XML Path('') concatenates it back together again. That's a common trick that's been in use since SQL 2005 came out with the For XML clause. The XQuery "value" function turns it back into varchar, from XML.
You'll find that a function like that, built as an inline sub-query, is fast, efficient, and easy to document. If you need to expand the list of allowed characters, you just modify the Like statement, so it's easy to maintain too.
- Gus "GSquared", RSVP, OODA, MAP, NMVP, FAQ, SAT, SQL, DNA, RNA, UOI, IOU, AM, PM, AD, BC, BCE, USA, UN, CF, ROFL, LOL, ETC
Property of The Thread
"Nobody knows the age of the human race, but everyone agrees it's old enough to know better." - Anon
April 12, 2012 at 7:49 am
If you want a more efficient version, that doesn't require as much pre-processing, any CLR, etc., ...
I'm sure CLR will beat the use of Number table on large datasets where most of values do need replacement (we had a question here, not long ago, about getting rid off GUIDs). Just think about that 1,000,000 rows containing 12 digit values will grow to 12,000,000 rowset ....
Properly written CLR will be efficient and elegant:
SELECT dbo.CleanupNumeric(Value)
FROM [Table]
Here is the c# implementation:
using System;
using System.Data;
using System.Data.SqlClient;
using System.Data.SqlTypes;
using Microsoft.SqlServer.Server;
using System.Text.RegularExpressions;
public partial class UserDefinedFunctions
{
static readonly Regex _regex = new Regex(@"^[0-9a-zA-Z]", RegexOptions.Compiled);
[Microsoft.SqlServer.Server.SqlFunction]
public static SqlString CleanupNumeric(SqlString val)
{
if (val.IsNull) return SqlString.Null;
return (SqlString)_regex.Replace(val.ToString(), String.Empty);
}
};
April 12, 2012 at 8:42 am
GSquared,
Thanks for pointing out an issue with the ampersand REPLACE I was doing. I corrected my post accordingly.
Now as to performance, I'm curious why you think the version you posted is better performing? Have you tried it vs. an alternative?
I'd like to see someone prove it one way or another but I'm not the man with the 1,000,000 row test harness. Of course, they'd need to modify your version to UPDATE instead of just SELECT (not hard, I'm just saying).
As they say, the proof is in the pudding.
I will give you credit for elegance compared to my brute force though.
My thought question: Have you ever been told that your query runs too fast?
My advice:
INDEXing a poor-performing query is like putting sugar on cat food. Yeah, it probably tastes better but are you sure you want to eat it?
The path of least resistance can be a slippery slope. Take care that fixing your fixes of fixes doesn't snowball and end up costing you more than fixing the root cause would have in the first place.
Need to UNPIVOT? Why not CROSS APPLY VALUES instead?[/url]
Since random numbers are too important to be left to chance, let's generate some![/url]
Learn to understand recursive CTEs by example.[/url]
[url url=http://www.sqlservercentral.com/articles/St
April 12, 2012 at 9:30 am
dwain.c (4/12/2012)
GSquared,Thanks for pointing out an issue with the ampersand REPLACE I was doing. I corrected my post accordingly.
Now as to performance, I'm curious why you think the version you posted is better performing? Have you tried it vs. an alternative?
I'd like to see someone prove it one way or another but I'm not the man with the 1,000,000 row test harness. Of course, they'd need to modify your version to UPDATE instead of just SELECT (not hard, I'm just saying).
As they say, the proof is in the pudding.
I will give you credit for elegance compared to my brute force though.
I use a version of what I posted to clean up phone numbers submitted through web forms and a variety of import sources, with variable data hygiene. There are different formatting rules, et al, for different ETL targets for the data. I recently processed 5.5-million records in a couple of minutes, on medium-level hardware with a moderate load on it from other processes.
I've tried a fair number of alternatives, and it's been the fastest and most reliable thus far. It's entirely possible someone with better C#/VB.NET skill than mine can make a CLR function that's faster, but the options I found online for that were all much, much slower, or buggy (in one case, it scrambled character-sequences in some cases).
But don't take my word for it. Speed-test and reliability-test for yourself. If you trust me (or anyone else) on that kind of thing, then you don't know, you just believe, and knowing is better.
- Gus "GSquared", RSVP, OODA, MAP, NMVP, FAQ, SAT, SQL, DNA, RNA, UOI, IOU, AM, PM, AD, BC, BCE, USA, UN, CF, ROFL, LOL, ETC
Property of The Thread
"Nobody knows the age of the human race, but everyone agrees it's old enough to know better." - Anon
Viewing 15 posts - 1 through 15 (of 29 total)
You must be logged in to reply to this topic. Login to reply