junk characters in a field

  • I have a field called "TranNo" which is a 12-digit number, but it seems like some of the records have junk characters such as *, /, empty spaces, etc., though I am not sure what other junk characters are there. How can I find all the junk records and clean them?

  • First we need to know what your system considers a junk character. With that, we have mechanisms to weed out the rows that contain one or more junk characters.

  • Any character other than alphanumeric is considered junk.

  • Here's one way: a scalar function that strips out the non-compliant characters:

    I've got the same functionality as a CLR, which uses Regular Expressions, but I've never tested which performs better.

    SELECT dbo.StripNonAlphaNumeric(someColumn) from someTable

    and my old function:

    CREATE FUNCTION StripNonAlphaNumeric (@OriginalText VARCHAR(8000))
    RETURNS VARCHAR(8000)
    AS
    BEGIN
        DECLARE @CleanedText VARCHAR(8000);

        WITH tally (N) AS
        (
            SELECT TOP 10000 ROW_NUMBER() OVER (ORDER BY sc1.id)
            FROM Master.dbo.SysColumns sc1
            CROSS JOIN Master.dbo.SysColumns sc2
        )
        SELECT @CleanedText = ISNULL(@CleanedText, '') +
            CASE
                --ASCII numbers are 48 (for '0') thru 57 (for '9')
                WHEN ASCII(SUBSTRING(@OriginalText, Tally.N, 1)) BETWEEN 48 AND 57
                    THEN SUBSTRING(@OriginalText, Tally.N, 1)
                --ASCII upper case letters A-Z are 65 thru 90
                WHEN ASCII(SUBSTRING(@OriginalText, Tally.N, 1)) BETWEEN 65 AND 90
                    THEN SUBSTRING(@OriginalText, Tally.N, 1)
                --ASCII lower case letters a-z are 97 thru 122
                WHEN ASCII(SUBSTRING(@OriginalText, Tally.N, 1)) BETWEEN 97 AND 122
                    THEN SUBSTRING(@OriginalText, Tally.N, 1)
                ELSE ''
            END
        FROM tally
        WHERE Tally.N <= LEN(@OriginalText);

        RETURN @CleanedText;
    END
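
    To tie this back to the original question of cleaning the data, a minimal sketch of applying the function in place (dbo.Transactions is a placeholder table name here; TranNo is the column from the question):

    -- Sketch only: update just the rows that actually contain a non-alphanumeric
    -- character, so the scalar function isn't called for rows that are already clean.
    UPDATE dbo.Transactions            -- placeholder table name
    SET    TranNo = dbo.StripNonAlphaNumeric(TranNo)
    WHERE  TranNo LIKE '%[^A-Za-z0-9]%';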

    Lowell


    --help us help you! If you post a question, make sure you include a CREATE TABLE... statement and INSERT INTO... statement into that table to give the volunteers here representative data. with your description of the problem, we can provide a tested, verifiable solution to your question! asking the question the right way gets you a tested answer the fastest way possible!

  • SELECT *

    FROM YourTable

    WHERE YourField LIKE '%[^A-Z0-9]%'
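
    How the [A-Z] range behaves depends on the column's collation, so if you want to be explicit about which characters count as clean, a variation (a sketch, not tested against your data) is to list both cases and optionally force a binary collation:

    -- Sketch: list both cases explicitly; the optional COLLATE clause makes the ranges
    -- evaluate strictly by character code rather than by dictionary sort order.
    SELECT *
    FROM   YourTable
    WHERE  YourField COLLATE Latin1_General_BIN LIKE '%[^A-Za-z0-9]%'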

    Drew

    J. Drew Allen
    Business Intelligence Analyst
    Philadelphia, PA

  • drew.allen (4/11/2012)


    SELECT *

    FROM YourTable

    WHERE YourField LIKE '%[^A-Z0-9]%'

    Drew

    I missed that the question was how to clean them as well as how to find them. Thanks!

    Lowell


    --help us help you! If you post a question, make sure you include a CREATE TABLE... statement and INSERT INTO... statement into that table to give the volunteers here representative data. with your description of the problem, we can provide a tested, verifiable solution to your question! asking the question the right way gets you a tested answer the fastest way possible!

  • You can always use PATINDEX to do the same.

    SELECT * FROM YourTable
    WHERE PATINDEX('%[^a-z,0-9]%', YourColumn) > 0

    ***The first step is always the hardest *******

  • Since PATINDEX presumably includes a LIKE, it's bound to be less efficient. This is borne out by a comparison of the stats.

    LIKE

    (8 row(s) affected)

    Table 'XXXXXXXX'. Scan count 1, logical reads 585, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

    SQL Server Execution Times:

    CPU time = 16 ms, elapsed time = 36 ms.

    PATINDEX

    (8 row(s) affected)

    Table 'XXXXXXXX'. Scan count 1, logical reads 2781, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

    SQL Server Execution Times:

    CPU time = 31 ms, elapsed time = 43 ms.
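
    Output in this format is what SET STATISTICS IO and SET STATISTICS TIME produce, so if you want to run the same comparison against your own data (table and column names below are placeholders):

    -- Sketch: turn on IO/time statistics, run both predicates, then turn them off.
    SET STATISTICS IO ON;
    SET STATISTICS TIME ON;

    SELECT * FROM YourTable WHERE YourField LIKE '%[^A-Z0-9]%';
    SELECT * FROM YourTable WHERE PATINDEX('%[^A-Z0-9]%', YourField) > 0;

    SET STATISTICS IO OFF;
    SET STATISTICS TIME OFF;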

    Drew

    J. Drew Allen
    Business Intelligence Analyst
    Philadelphia, PA

  • Dynamic SQL is my friend for this. 🙂

    CREATE TABLE #t (strings VARCHAR(25));
    CREATE TABLE #b (bad CHAR(1));

    DECLARE @SQL NVARCHAR(MAX), @n INT, @i INT;

    INSERT INTO #t (strings)
    SELECT 'AABA#24$-a.a/a2@'
    UNION ALL SELECT 'AAcc/24)2('
    UNION ALL SELECT 'ABC123'
    UNION ALL SELECT '123a-a.a/a56'
    UNION ALL SELECT '&a*a)a1@#a$2$a%a^3&a*a(4'
    UNION ALL SELECT '&*)1@#$2$%^3&*(4'

    -- Identify the bad characters
    ;WITH BadStrings AS (
        SELECT strings
        FROM #t
        WHERE strings LIKE '%[^A-Za-z0-9]%'
    )
    ,Parse AS (
        SELECT SUBSTRING(strings + '*', PATINDEX('%[^A-Za-z0-9]%', strings + '*'), 1) AS bad
              ,SUBSTRING(strings + '*', 1 + PATINDEX('%[^A-Za-z0-9]%', strings + '*'), LEN(strings + '*')) AS rest
        FROM BadStrings
        UNION ALL
        SELECT SUBSTRING(rest, PATINDEX('%[^A-Za-z0-9]%', rest), 1)
              ,SUBSTRING(rest, 1 + PATINDEX('%[^A-Za-z0-9]%', rest), LEN(rest))
        FROM Parse
        WHERE LEN(rest) > 0
          AND SUBSTRING(rest, 1 + PATINDEX('%[^A-Za-z0-9]%', rest), LEN(rest)) <> ''
    )
    INSERT INTO #b (bad)
    SELECT DISTINCT bad
    FROM Parse b

    -- Now clean them up
    SELECT @SQL = 'UPDATE #t SET strings = ', @i = 1, @n = (SELECT COUNT(*) FROM #b WHERE bad <> '&')

    WHILE @i <= @n + 1
        SELECT @SQL = @SQL + 'REPLACE(', @i = @i + 1

    SELECT @SQL = @SQL + 'strings, '''
        + (SELECT bad + ''', ''''), ''' FROM #b WHERE bad <> '&' FOR XML PATH(''))

    SELECT @SQL = SUBSTRING(@SQL, 1, LEN(@SQL) - 3) + ',''&'','''')' + ' WHERE strings LIKE ''%[^A-Za-z0-9]%'''

    SELECT @SQL

    EXEC (@SQL)

    SELECT * FROM #t

    DROP TABLE #t
    DROP TABLE #b

    1. Identify bad characters

    2. Clean them up

    No guarantees on performance.

    Edit: Corrected for & in the replacement string (FOR XML entitizes it, so it's handled separately) and improved the performance of the recursive query slightly.
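
    For anyone following along, the dynamic SQL that ends up being executed is shaped roughly like this (an illustration only; the actual characters depend on what lands in #b):

    -- Illustration: one nested REPLACE per bad character found, with '&' handled
    -- as the outermost replacement, applied only to rows that need cleaning.
    UPDATE #t
    SET strings = REPLACE(REPLACE(REPLACE(strings, '#', ''), '/', ''), '&', '')
    WHERE strings LIKE '%[^A-Za-z0-9]%'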

    Not a DBA, just messin' around!


    My mantra: No loops! No CURSORs! No RBAR! Hoo-uh!

    My thought question: Have you ever been told that your query runs too fast?

    My advice:
    INDEXing a poor-performing query is like putting sugar on cat food. Yeah, it probably tastes better but are you sure you want to eat it?
    The path of least resistance can be a slippery slope. Take care that fixing your fixes of fixes doesn't snowball and end up costing you more than fixing the root cause would have in the first place.

    Need to UNPIVOT? Why not CROSS APPLY VALUES instead?
    Since random numbers are too important to be left to chance, let's generate some!
    Learn to understand recursive CTEs by example.

  • If you need the best possible performance, write a CLR function (defining a static Regex with the RegexOptions.Compiled option).

    Ping me if you need help with it.

    _____________________________________________
    "The only true wisdom is in knowing you know nothing"
    "O skol'ko nam otkrytiy chudnyh prevnosit microsofta duh!":-D
    (So many miracle inventions provided by MS to us...)

    How to post your question to get the best and quick help

  • SGT_squeequal (4/11/2012)


    You can always use PATINDEX to do the same.

    SELECT * FROM YourTable
    WHERE PATINDEX('%[^a-z,0-9]%', YourColumn) > 0

    You don't need the comma in the pattern. In fact, with the comma inside the brackets it becomes one of the allowed characters, so a stray comma in the data would not be flagged as junk.

    ____________________________________________________

    Deja View - The strange feeling that somewhere, sometime you've optimised this query before

    How to get the best help on a forum

    http://www.sqlservercentral.com/articles/Best+Practices/61537

  • If you want a more efficient version that doesn't require as much pre-processing, any CLR, etc., try something like this:

    CREATE TABLE #T (String VARCHAR(100)) ;

    INSERT INTO #T -- Sample Data
            (String)
    VALUES  ('123a-./56'),
            ('&*)1@#$2$%^3&*(4') ;

    SELECT  *,
            (
             SELECT Sub + ''
             FROM   (SELECT SUBSTRING(String, Number, 1) AS Sub,
                            Number
                     FROM   dbo.Numbers -- table of positive integers
                     WHERE  Number BETWEEN 1 AND LEN(String)) AS Parser
             WHERE  Sub LIKE '[0-9a-zA-Z]'
             ORDER BY Number
            FOR XML PATH(''),
                TYPE).value('.[1]', 'varchar(100)')
    FROM    #T ;

    The Numbers table is simply a table of positive integers. They're very easy to create, and very, very useful. The inner part of the sub-query, "Parser", uses the Numbers table to split the value into single-character rows. Then it selects from that, using LIKE to grab only those values you want. I included both lower- and upper-case letters, in case your system is case-sensitive. The Number column is then used again to ensure the characters keep the order of the original string, and FOR XML PATH('') concatenates them back together. That's a common trick that's been in use since SQL Server 2005 introduced the FOR XML PATH('') option. The XQuery "value" function turns the result back into varchar from XML.

    You'll find that a function like that, built as an inline sub-query, is fast, efficient, and easy to document. If you need to expand the list of allowed characters, you just modify the Like statement, so it's easy to maintain too.
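
    If you don't already have one, a quick way to build a Numbers table like the dbo.Numbers referenced above is a sketch along these lines (size the TOP to at least the longest string you need to clean):

    -- Sketch: generate 10,000 sequential integers by cross-joining a system catalog
    -- view with itself, then index the result for fast range filtering.
    SELECT TOP (10000)
           Number = ROW_NUMBER() OVER (ORDER BY (SELECT NULL))
    INTO   dbo.Numbers
    FROM   sys.all_columns AS ac1
    CROSS JOIN sys.all_columns AS ac2;

    CREATE UNIQUE CLUSTERED INDEX IX_Numbers_Number ON dbo.Numbers (Number);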

    - Gus "GSquared", RSVP, OODA, MAP, NMVP, FAQ, SAT, SQL, DNA, RNA, UOI, IOU, AM, PM, AD, BC, BCE, USA, UN, CF, ROFL, LOL, ETC
    Property of The Thread

    "Nobody knows the age of the human race, but everyone agrees it's old enough to know better." - Anon

  • If you want a more efficient version that doesn't require as much pre-processing, any CLR, etc., ...

    I'm sure CLR will beat the use of a Numbers table on large datasets where most of the values do need replacement (we had a question here, not long ago, about getting rid of GUIDs). Just consider that 1,000,000 rows containing 12-digit values will grow into a 12,000,000-row set...

    A properly written CLR function will be efficient and elegant:

    SELECT dbo.CleanupNumeric(Value)

    FROM [Table]

    Here is the C# implementation:

    using System;
    using System.Data;
    using System.Data.SqlClient;
    using System.Data.SqlTypes;
    using Microsoft.SqlServer.Server;
    using System.Text.RegularExpressions;

    public partial class UserDefinedFunctions
    {
        // Match any character that is NOT alphanumeric (note the ^ inside the brackets).
        static readonly Regex _regex = new Regex(@"[^0-9a-zA-Z]", RegexOptions.Compiled);

        [Microsoft.SqlServer.Server.SqlFunction]
        public static SqlString CleanupNumeric(SqlString val)
        {
            if (val.IsNull) return SqlString.Null;

            // Strip everything the pattern matches, leaving only digits and letters.
            return (SqlString)_regex.Replace(val.ToString(), String.Empty);
        }
    }
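
    For completeness, deploying it looks roughly like this (a sketch; the assembly name and DLL path are placeholders, and CLR integration has to be enabled on the instance):

    -- Sketch: enable CLR, register the compiled assembly, and bind the T-SQL function
    -- to the static method above. Names and the path are placeholders.
    EXEC sp_configure 'clr enabled', 1;
    RECONFIGURE;

    CREATE ASSEMBLY CleanupFunctions
    FROM 'C:\Deploy\CleanupFunctions.dll'   -- placeholder path to the compiled DLL
    WITH PERMISSION_SET = SAFE;

    CREATE FUNCTION dbo.CleanupNumeric (@val NVARCHAR(4000))
    RETURNS NVARCHAR(4000)
    AS EXTERNAL NAME CleanupFunctions.UserDefinedFunctions.CleanupNumeric;

    Usage then matches the SELECT dbo.CleanupNumeric(Value) shown above.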

    _____________________________________________
    "The only true wisdom is in knowing you know nothing"
    "O skol'ko nam otkrytiy chudnyh prevnosit microsofta duh!":-D
    (So many miracle inventions provided by MS to us...)

    How to post your question to get the best and quick help

  • GSquared,

    Thanks for pointing out an issue with the ampersand REPLACE I was doing. I corrected my post accordingly.

    Now, as to performance, I'm curious why you think the version you posted performs better. Have you tried it vs. an alternative?

    I'd like to see someone prove it one way or another but I'm not the man with the 1,000,000 row test harness. Of course, they'd need to modify your version to UPDATE instead of just SELECT (not hard, I'm just saying).

    As they say, the proof is in the pudding.

    I will give you credit for elegance compared to my brute force though.


    My mantra: No loops! No CURSORs! No RBAR! Hoo-uh!

    My thought question: Have you ever been told that your query runs too fast?

    My advice:
    INDEXing a poor-performing query is like putting sugar on cat food. Yeah, it probably tastes better but are you sure you want to eat it?
    The path of least resistance can be a slippery slope. Take care that fixing your fixes of fixes doesn't snowball and end up costing you more than fixing the root cause would have in the first place.

    Need to UNPIVOT? Why not CROSS APPLY VALUES instead?
    Since random numbers are too important to be left to chance, let's generate some!
    Learn to understand recursive CTEs by example.

  • dwain.c (4/12/2012)


    GSquared,

    Thanks for pointing out an issue with the ampersand REPLACE I was doing. I corrected my post accordingly.

    Now, as to performance, I'm curious why you think the version you posted performs better. Have you tried it vs. an alternative?

    I'd like to see someone prove it one way or another but I'm not the man with the 1,000,000 row test harness. Of course, they'd need to modify your version to UPDATE instead of just SELECT (not hard, I'm just saying).

    As they say, the proof is in the pudding.

    I will give you credit for elegance compared to my brute force though.

    I use a version of what I posted to clean up phone numbers submitted through web forms and a variety of import sources, with varying data hygiene. There are different formatting rules, and so on, for different ETL targets for the data. I recently processed 5.5 million records in a couple of minutes, on medium-level hardware with a moderate load on it from other processes.

    I've tried a fair number of alternatives, and it's been the fastest and most reliable thus far. It's entirely possible that someone with better C#/VB.NET skills than mine can write a CLR function that's faster, but the options I found online were all much, much slower, or buggy (one of them scrambled character sequences in some cases).

    But don't take my word for it. Speed-test and reliability-test for yourself. If you trust me (or anyone else) on that kind of thing, then you don't know, you just believe, and knowing is better.

    - Gus "GSquared", RSVP, OODA, MAP, NMVP, FAQ, SAT, SQL, DNA, RNA, UOI, IOU, AM, PM, AD, BC, BCE, USA, UN, CF, ROFL, LOL, ETC
    Property of The Thread

    "Nobody knows the age of the human race, but everyone agrees it's old enough to know better." - Anon
