June 9, 2010 at 10:50 am
Hi, I'm a bit of a noobie and this is my first post so be gentle.
I've got some data that I've scanned in using OCR and imported into a table, the problem is they are cheques and the Values they have imported are a bit messey. It's imported the amounts but around the amounts I have *'s plus a lot of other randomly generated characters that the OCR system thinks it has seen (see sample below).
Is there a way of only pulling out numbers and certain characters (e.g. '.' and '-')
sample date
text1
**44-14**
**44-87**
**4284**
,**59-70**·
**442-05**
**53-55**
**267-59**'
**48-73**
**344-97**'
·**4503**
**222-98** '
', **56-34**··,
**144-94**
**55-51**
,**52-53**
**97-40**
**75-55**
k*44-03**
**55-95**
**64-04**
Thanks
Matt
June 9, 2010 at 11:07 am
here you go Matt: this is using a function which strips the string down and removes unwanted chars;
take a look at the logic, and see if you understand how it works.
--sample results:
44-14
44-87
4284
59-70
442-05
the code:
ALTER FUNCTION StripNonNumeric(@OriginalText VARCHAR(8000))
RETURNS VARCHAR(8000)
BEGIN
DECLARE @CleanedText VARCHAR(8000)
;WITH tally (N) as
(SELECT TOP 10000 row_number() OVER (ORDER BY sc1.id)
FROM Master.dbo.SysColumns sc1
CROSS JOIN Master.dbo.SysColumns sc2)
SELECT @CleanedText = ISNULL(@CleanedText,'') +
CASE
--ascii for dash and period is 45 and 46 --SELECT ASCII('-'),ASCII('.')
WHEN ASCII(SUBSTRING(@OriginalText,Tally.N,1)) BETWEEN 45 AND 46
THEN SUBSTRING(@OriginalText,Tally.N,1)
--ascii numbers are 48(for '0') thru 57 (for '9')
WHEN ASCII(SUBSTRING(@OriginalText,Tally.N,1)) BETWEEN 48 AND 57
THEN SUBSTRING(@OriginalText,Tally.N,1)
ELSE '' END
FROM tally WHERE Tally.N <= LEN(@OriginalText)
RETURN @CleanedText
END
GO
WITH MySampleData
AS (SELECT '**44-14**' As StringToClean UNION ALL
SELECT '**44-87**' UNION ALL
SELECT '**4284**' UNION ALL
SELECT ',**59-70**·' UNION ALL
SELECT '**442-05** ' UNION ALL
SELECT '**53-55**' UNION ALL
SELECT '**267-59**''' UNION ALL
SELECT '**48-73**' UNION ALL
SELECT '**344-97**''' UNION ALL
SELECT '·**4503**' UNION ALL
SELECT '**222-98** ''' UNION ALL
SELECT ''', **56-34**··,' UNION ALL
SELECT '**144-94**' UNION ALL
SELECT '**55-51**' UNION ALL
SELECT ',**52-53**' UNION ALL
SELECT '**97-40**' UNION ALL
SELECT '**75-55**' UNION ALL
SELECT 'k*44-03**' UNION ALL
SELECT '**55-95**' UNION ALL
SELECT '**64-04**'
--the business end of the sql:
SELECT dbo.StripNonNumeric(StringToClean) FROM MySampleData
Lowell
June 9, 2010 at 11:15 am
There's a very simple way: Use the REPLACE function.
For example
SELECT REPLACE(REPLACE(SampleColumn, '*', ''), '-', '')
FROM SampleTable
[/Code]
Just nest the REPLACE for each character you want to remove and replace it with an empty string. Look up the REPLACE function to get a better idea of its functionality.
--------------------------------------
When you encounter a problem, if the solution isn't readily evident go back to the start and check your assumptions.
--------------------------------------
It’s unpleasantly like being drunk.
What’s so unpleasant about being drunk?
You ask a glass of water. -- Douglas Adams
June 9, 2010 at 11:29 am
Lowell - thank you very much that is perfect!
Stefan - thanks for the post but I used the replace function and the characters where too inconsistent and on the odd occasion they appeared in between the numerics which I didn't want.
Thanks again Lowell
Matt
June 9, 2010 at 11:42 am
matt.parry (6/9/2010)
Lowell - thank you very much that is perfect!Stefan - thanks for the post but I used the replace function and the characters where too inconsistent and on the odd occasion they appeared in between the numerics which I didn't want.
Thanks again Lowell
Matt
I can certainly understand not wanting to create a REPLACE for every non-numeric character. It really works best if you have a limited number of characters you want to exclude. From your example it seemed to be arond 5 which is about the maximum I'd use this approach for. The other approach is best if you want to specify which characters to keep instead of which to discard. However, I'm not sure I know what you mean by
on the odd occasion they appeared in between the numerics which I didn't want.
You want to keep the non-numeric characters if they appear in between numbers?
--------------------------------------
When you encounter a problem, if the solution isn't readily evident go back to the start and check your assumptions.
--------------------------------------
It’s unpleasantly like being drunk.
What’s so unpleasant about being drunk?
You ask a glass of water. -- Douglas Adams
June 9, 2010 at 11:51 am
Here's a set-based method that can be adapted into a table-valued function (Lowell's solution utilizes a scalar function). A TVF can be utilized with CROSS APPLY to get pretty dramatic results.
declare @test-2 table (RowID INT IDENTITY, Col1 varchar(50))
insert into @test-2
SELECT '**44-14**' UNION ALL
SELECT '**44-87**' UNION ALL
SELECT '**4284**' UNION ALL
SELECT ',**59-70**·' UNION ALL
SELECT '**442-05** ' UNION ALL
SELECT '**53-55**' UNION ALL
SELECT '**267-59**''' UNION ALL
SELECT '**48-73**' UNION ALL
SELECT '**344-97**''' UNION ALL
SELECT '·**4503**' UNION ALL
SELECT '**222-98** ''' UNION ALL
SELECT ''', **56-34**··,' UNION ALL
SELECT '**144-94**' UNION ALL
SELECT '**55-51**' UNION ALL
SELECT ',**52-53**' UNION ALL
SELECT '**97-40**' UNION ALL
SELECT '**75-55**' UNION ALL
SELECT 'k*44-03**' UNION ALL
SELECT '**55-95**' UNION ALL
SELECT '**64-04**'
-- See Jeff Moden's article The "Numbers" or "Tally" Table: What it is and how it replaces a loop.
-- at http://www.sqlservercentral.com/articles/T-SQL/62867/.
;WITH
Tally (N) AS (SELECT TOP 100 ROW_NUMBER() OVER (ORDER BY (SELECT 0)) FROM master.sys.columns sc1),
CTE AS (
-- get the values that are numbers, or match other desired characters.
-- this breaks each character down into a separate row
SELECT t.RowID,
t.Col1,
Tally.N,
Value = SUBSTRING(Col1,Tally.N,1)
FROM @test-2 t
CROSS JOIN Tally
WHERE Tally.N <= LEN(t.Col1)
AND (ASCII(SUBSTRING(t.Col1,Tally.N,1)) between 48 and 57
OR SUBSTRING(t.Col1,Tally.N,1) IN ('.','-'))
)
-- put everything left back together
SELECT DISTINCT
RowID,
Original_value = Col1,
Value = REPLACE((SELECT ',' + Value FROM CTE WHERE RowID = c.RowID ORDER BY N FOR XML PATH('')),',','')
FROM CTE c
ORDER BY RowID
Wayne
Microsoft Certified Master: SQL Server 2008
Author - SQL Server T-SQL Recipes
June 9, 2010 at 11:54 am
yes I wanted the keep the characters in between the numbers because these are monetary values so the characters in between the numbers are important (pounds and pence)
Due to the quality of the text inserted via our scanning system the character '.' and '-' appeared both in between the numbers and outside the numbers so removing them could cause £3.45 to look like £345 - plus some of the character it returned I didn't even know existed :ermm:
June 9, 2010 at 12:18 pm
Got it. I thought the bad characters were appearing both inbetween and outside. Sorry about that.
--------------------------------------
When you encounter a problem, if the solution isn't readily evident go back to the start and check your assumptions.
--------------------------------------
It’s unpleasantly like being drunk.
What’s so unpleasant about being drunk?
You ask a glass of water. -- Douglas Adams
June 10, 2010 at 2:26 am
no probs, thanks for the suggestion anyway
Viewing 9 posts - 1 through 8 (of 8 total)
You must be logged in to reply to this topic. Login to reply