removing random characters

  • Hi, I'm a bit of a noobie and this is my first post so be gentle.

    I've got some data that I've scanned in using OCR and imported into a table, the problem is they are cheques and the Values they have imported are a bit messey. It's imported the amounts but around the amounts I have *'s plus a lot of other randomly generated characters that the OCR system thinks it has seen (see sample below).

    Is there a way of only pulling out numbers and certain characters (e.g. '.' and '-')

    sample date

    text1

    **44-14**

    **44-87**

    **4284**

    ,**59-70**·

    **442-05**

    **53-55**

    **267-59**'

    **48-73**

    **344-97**'

    ·**4503**

    **222-98** '

    ', **56-34**··,

    **144-94**

    **55-51**

    ,**52-53**

    **97-40**

    **75-55**

    k*44-03**

    **55-95**

    **64-04**

    Thanks

    Matt

  • here you go Matt: this is using a function which strips the string down and removes unwanted chars;

    take a look at the logic, and see if you understand how it works.

    --sample results:

    44-14

    44-87

    4284

    59-70

    442-05

    the code:

    ALTER FUNCTION StripNonNumeric(@OriginalText VARCHAR(8000))

    RETURNS VARCHAR(8000)

    BEGIN

    DECLARE @CleanedText VARCHAR(8000)

    ;WITH tally (N) as

    (SELECT TOP 10000 row_number() OVER (ORDER BY sc1.id)

    FROM Master.dbo.SysColumns sc1

    CROSS JOIN Master.dbo.SysColumns sc2)

    SELECT @CleanedText = ISNULL(@CleanedText,'') +

    CASE

    --ascii for dash and period is 45 and 46 --SELECT ASCII('-'),ASCII('.')

    WHEN ASCII(SUBSTRING(@OriginalText,Tally.N,1)) BETWEEN 45 AND 46

    THEN SUBSTRING(@OriginalText,Tally.N,1)

    --ascii numbers are 48(for '0') thru 57 (for '9')

    WHEN ASCII(SUBSTRING(@OriginalText,Tally.N,1)) BETWEEN 48 AND 57

    THEN SUBSTRING(@OriginalText,Tally.N,1)

    ELSE '' END

    FROM tally WHERE Tally.N <= LEN(@OriginalText)

    RETURN @CleanedText

    END

    GO

    WITH MySampleData

    AS (SELECT '**44-14**' As StringToClean UNION ALL

    SELECT '**44-87**' UNION ALL

    SELECT '**4284**' UNION ALL

    SELECT ',**59-70**·' UNION ALL

    SELECT '**442-05** ' UNION ALL

    SELECT '**53-55**' UNION ALL

    SELECT '**267-59**''' UNION ALL

    SELECT '**48-73**' UNION ALL

    SELECT '**344-97**''' UNION ALL

    SELECT '·**4503**' UNION ALL

    SELECT '**222-98** ''' UNION ALL

    SELECT ''', **56-34**··,' UNION ALL

    SELECT '**144-94**' UNION ALL

    SELECT '**55-51**' UNION ALL

    SELECT ',**52-53**' UNION ALL

    SELECT '**97-40**' UNION ALL

    SELECT '**75-55**' UNION ALL

    SELECT 'k*44-03**' UNION ALL

    SELECT '**55-95**' UNION ALL

    SELECT '**64-04**'

    --the business end of the sql:

    SELECT dbo.StripNonNumeric(StringToClean) FROM MySampleData

    Lowell


    --help us help you! If you post a question, make sure you include a CREATE TABLE... statement and INSERT INTO... statement into that table to give the volunteers here representative data. with your description of the problem, we can provide a tested, verifiable solution to your question! asking the question the right way gets you a tested answer the fastest way possible!

  • There's a very simple way: Use the REPLACE function.

    For example

    SELECT REPLACE(REPLACE(SampleColumn, '*', ''), '-', '')

    FROM SampleTable

    [/Code]

    Just nest the REPLACE for each character you want to remove and replace it with an empty string. Look up the REPLACE function to get a better idea of its functionality.

    --------------------------------------
    When you encounter a problem, if the solution isn't readily evident go back to the start and check your assumptions.
    --------------------------------------
    It’s unpleasantly like being drunk.
    What’s so unpleasant about being drunk?
    You ask a glass of water. -- Douglas Adams

  • Lowell - thank you very much that is perfect!

    Stefan - thanks for the post but I used the replace function and the characters where too inconsistent and on the odd occasion they appeared in between the numerics which I didn't want.

    Thanks again Lowell

    Matt

  • matt.parry (6/9/2010)


    Lowell - thank you very much that is perfect!

    Stefan - thanks for the post but I used the replace function and the characters where too inconsistent and on the odd occasion they appeared in between the numerics which I didn't want.

    Thanks again Lowell

    Matt

    I can certainly understand not wanting to create a REPLACE for every non-numeric character. It really works best if you have a limited number of characters you want to exclude. From your example it seemed to be arond 5 which is about the maximum I'd use this approach for. The other approach is best if you want to specify which characters to keep instead of which to discard. However, I'm not sure I know what you mean by

    on the odd occasion they appeared in between the numerics which I didn't want.

    You want to keep the non-numeric characters if they appear in between numbers?

    --------------------------------------
    When you encounter a problem, if the solution isn't readily evident go back to the start and check your assumptions.
    --------------------------------------
    It’s unpleasantly like being drunk.
    What’s so unpleasant about being drunk?
    You ask a glass of water. -- Douglas Adams

  • Here's a set-based method that can be adapted into a table-valued function (Lowell's solution utilizes a scalar function). A TVF can be utilized with CROSS APPLY to get pretty dramatic results.

    declare @test-2 table (RowID INT IDENTITY, Col1 varchar(50))

    insert into @test-2

    SELECT '**44-14**' UNION ALL

    SELECT '**44-87**' UNION ALL

    SELECT '**4284**' UNION ALL

    SELECT ',**59-70**·' UNION ALL

    SELECT '**442-05** ' UNION ALL

    SELECT '**53-55**' UNION ALL

    SELECT '**267-59**''' UNION ALL

    SELECT '**48-73**' UNION ALL

    SELECT '**344-97**''' UNION ALL

    SELECT '·**4503**' UNION ALL

    SELECT '**222-98** ''' UNION ALL

    SELECT ''', **56-34**··,' UNION ALL

    SELECT '**144-94**' UNION ALL

    SELECT '**55-51**' UNION ALL

    SELECT ',**52-53**' UNION ALL

    SELECT '**97-40**' UNION ALL

    SELECT '**75-55**' UNION ALL

    SELECT 'k*44-03**' UNION ALL

    SELECT '**55-95**' UNION ALL

    SELECT '**64-04**'

    -- See Jeff Moden's article The "Numbers" or "Tally" Table: What it is and how it replaces a loop.

    -- at http://www.sqlservercentral.com/articles/T-SQL/62867/.

    ;WITH

    Tally (N) AS (SELECT TOP 100 ROW_NUMBER() OVER (ORDER BY (SELECT 0)) FROM master.sys.columns sc1),

    CTE AS (

    -- get the values that are numbers, or match other desired characters.

    -- this breaks each character down into a separate row

    SELECT t.RowID,

    t.Col1,

    Tally.N,

    Value = SUBSTRING(Col1,Tally.N,1)

    FROM @test-2 t

    CROSS JOIN Tally

    WHERE Tally.N <= LEN(t.Col1)

    AND (ASCII(SUBSTRING(t.Col1,Tally.N,1)) between 48 and 57

    OR SUBSTRING(t.Col1,Tally.N,1) IN ('.','-'))

    )

    -- put everything left back together

    SELECT DISTINCT

    RowID,

    Original_value = Col1,

    Value = REPLACE((SELECT ',' + Value FROM CTE WHERE RowID = c.RowID ORDER BY N FOR XML PATH('')),',','')

    FROM CTE c

    ORDER BY RowID

    Wayne
    Microsoft Certified Master: SQL Server 2008
    Author - SQL Server T-SQL Recipes


    If you can't explain to another person how the code that you're copying from the internet works, then DON'T USE IT on a production system! After all, you will be the one supporting it!
    Links:
    For better assistance in answering your questions
    Performance Problems
    Common date/time routines
    Understanding and Using APPLY Part 1 & Part 2

  • yes I wanted the keep the characters in between the numbers because these are monetary values so the characters in between the numbers are important (pounds and pence)

    Due to the quality of the text inserted via our scanning system the character '.' and '-' appeared both in between the numbers and outside the numbers so removing them could cause £3.45 to look like £345 - plus some of the character it returned I didn't even know existed :ermm:

  • Got it. I thought the bad characters were appearing both inbetween and outside. Sorry about that.

    --------------------------------------
    When you encounter a problem, if the solution isn't readily evident go back to the start and check your assumptions.
    --------------------------------------
    It’s unpleasantly like being drunk.
    What’s so unpleasant about being drunk?
    You ask a glass of water. -- Douglas Adams

  • no probs, thanks for the suggestion anyway

Viewing 9 posts - 1 through 8 (of 8 total)

You must be logged in to reply to this topic. Login to reply