November 5, 2008 at 2:29 pm
Hey all,
Have a tough one ... Long story short, we have an OCR application that scans documents and dumps the records into a database. Well, they're just now noticing that the dates it has been inserting are ... wrong. Some are right, but most are not. There is no common pattern between them (see attached for all unique entries) and most are very unique and can't even be translated back to a real date.
Just looking for ideas/suggestions at this point on some good ways to try and attack this data for a one time cleanup as well as an ongoing maintenance routine to at least identify bad matches and simply send out notification for human intervention.
Thanks all
November 5, 2008 at 3:07 pm
ISDATE()
November 5, 2008 at 4:03 pm
cliffb (11/5/2008)
ISDATE()
Yeah but I'd at least like to be able to somewhat be able to translate those that are close ... those with misspellings, or bad characters. I was thinking maybe a fuzzy lookup against a dimdate table?
November 5, 2008 at 7:39 pm
Probably not what you want to hear, but I'm thinking that the OCR software, well... sucks and you may have to find some better software.
--Jeff Moden
Change is inevitable... Change for the better is not.
November 6, 2008 at 12:12 am
You'll have to write a cleanser.
I would load a staging table that has the same coumns as your target table plus
CleansedDate field
IsValidDate field
Update the IsValidDate field with the results of IsDate
Run the bad records through a cleansing process.
That's the best I can come up with right now. How important is the date? If it's not critical, don't spend a lot of time on it. Either way,it's going to take time to look through it. If this is something that is going to be a repetitive task, then you are going to be running in circles as you find more and more cases where the character is off by one (IE the dates with + on the end).
I think Jeff is right though, look into some other software if possible. If not, try the above technique.
November 6, 2008 at 5:57 am
Cliff, take a look at the text file attached in the first post of this thread... not much to "pattern match" on... certainly, not enough to build a whole date in many cases.
--Jeff Moden
Change is inevitable... Change for the better is not.
November 6, 2008 at 6:50 am
Is this reliable?
SELECT dates FROM
(
SELECT '$.00 AIIMIE 01 2002' as dates union all
SELECT '$1,764 21' union all
SELECT '$10.00 T..MAS ONM' union all
SELECT '$15.00 T..MAS ONM' union all
SELECT '$15.00 TILLV R 2002' union all
SELECT '$15.00 TIME 01 2002' union all
SELECT '$15.00 TIME DR 2002' union all
SELECT '$154A4' union all
SELECT '$24-00' union all
SELECT '$282.00 NT 1, U ONNO' union all
SELECT '$7.89 AIIMIE 01 2002' union all
SELECT '$799A4' union all
SELECT '%EE' union all
SELECT '(CHMARY 22, 2006' union all
SELECT '(KINHER 21 2(H)H' union all
SELECT '(KLUBER 25 2U2' union all
SELECT '(MOBTR 21 2004' union all
SELECT ')EC 1U, 2UU/' union all
SELECT ')EC 31, 2007' union all
SELECT ',SEP 02,2006' union all
SELECT '.1/14/2008' union all
SELECT '.DEC 17, 2007' union all
SELECT '.MAR 25, 2008' union all
SELECT '.NOVEMBER 23, 2004' union all
SELECT '0/16/03' union all
SELECT '0/21/08' union all
SELECT 'APR 27, 0200' union all
SELECT 'APR 27, 1987' union all
SELECT 'APR 27, 2000' union all
SELECT 'APR 27, 2001' union all
SELECT 'APR 27, 2002' union all
SELECT 'APR 27, 2003' union all
SELECT 'APR 27, 2004' union all
SELECT 'APR 27, 2005' union all
SELECT 'APR 27, 2006' union all
SELECT 'APR 27, 2007' union all
SELECT 'APR 27, 2008' union all
SELECT 'APR 27,2007' union all
SELECT 'APR 28, 0200' union all
SELECT 'APR 28, 0201' union all
SELECT 'APR 28, 2000' union all
SELECT 'APR 28, 2001' union all
SELECT 'APR 28, 2003' union all
SELECT 'APR 28, 2004' union all
SELECT 'APR 28, 2005' union all
SELECT 'APR 28, 2006' union all
SELECT 'APR 28, 2007' union all
SELECT 'APR 28, 2008' union all
SELECT 'APR 28,2007' union all
SELECT 'FEB 14, 2005' union all
SELECT 'FEB 14, 2006' union all
SELECT 'FEB 14, 2007' union all
SELECT 'FEB 14, 2008' union all
SELECT 'JAN 13, 2006' union all
SELECT 'JAN 13, 2007' union all
SELECT 'JAN 13, 2008' union all
SELECT 'JAN 14, 2008' union all
SELECT 'JAN 14, 0200' union all
SELECT 'JAN 14, 1900' union all
SELECT 'JAN 14, 1963' union all
SELECT 'JAN 15, 2008' union all
SELECT 'JAN 16, 0200' union all
SELECT 'JAN 16, 1900' union all
SELECT 'JAN 16, 2001' union all
SELECT 'JAN 16, 2002' union all
SELECT 'JAN 16, 2003' union all
SELECT 'JAN 16, 2004' union all
SELECT 'JAN 16, 2006' union all
SELECT 'JAN 18, 2003' union all
SELECT 'JAN 18, 2004' union all
SELECT 'JAN 18, 2005' union all
SELECT 'JAN 18, 2006' union all
SELECT 'JAN 18, 2007' union all
SELECT 'JAN 18, 2008' union all
SELECT 'JAN 19, 0200' union all
SELECT 'JAN 19, 1900' union all
SELECT 'JUL 01, 0212' union all
SELECT 'JUL 01, 0312' union all
SELECT 'JUL 01, 0407' union all
SELECT 'JUL 01, 0427' union all
SELECT 'JUL 01, 0429' union all
SELECT 'ULY 15, 2003' union all
SELECT 'ULY 17, 2001' union all
SELECT 'UNCUI,JUUT I' union all
SELECT 'UOY. NO, AVOU' union all
SELECT 'UOY. VU, AVOU' union all
SELECT 'UUN U,NE .IM E' union all
SELECT 'VEBRUEY DI 2004' union all
SELECT 'VEMBER 12, 2004' union all
SELECT 'WEY-YE $25-00'
) as t
WHERE ISDATE(dates)=1 and LEN(dates)>=8
Failing to plan is Planning to fail
November 6, 2008 at 7:37 am
Jeff Moden (11/6/2008)
Cliff, take a look at the text file attached in the first post of this thread... not much to "pattern match" on... certainly, not enough to build a whole date in many cases.
Yep, I agree. I also know how stubborn Adam is so wanted to give him an option even if it was a crazy way to do it.
November 6, 2008 at 8:11 am
Madhivanan (11/6/2008)
Is this reliable?
SELECT dates FROM
(
SELECT '$.00 AIIMIE 01 2002' as dates union all
SELECT '$1,764 21' union all
SELECT '$10.00 T..MAS ONM' union all
SELECT '$15.00 T..MAS ONM' union all
SELECT '$15.00 TILLV R 2002' union all
SELECT '$15.00 TIME 01 2002' union all
SELECT '$15.00 TIME DR 2002' union all
SELECT '$154A4' union all
SELECT '$24-00' union all
SELECT '$282.00 NT 1, U ONNO' union all
SELECT '$7.89 AIIMIE 01 2002' union all
SELECT '$799A4' union all
SELECT '%EE' union all
SELECT '(CHMARY 22, 2006' union all
SELECT '(KINHER 21 2(H)H' union all
SELECT '(KLUBER 25 2U2' union all
SELECT '(MOBTR 21 2004' union all
SELECT ')EC 1U, 2UU/' union all
SELECT ')EC 31, 2007' union all
SELECT ',SEP 02,2006' union all
SELECT '.1/14/2008' union all
SELECT '.DEC 17, 2007' union all
SELECT '.MAR 25, 2008' union all
SELECT '.NOVEMBER 23, 2004' union all
SELECT '0/16/03' union all
SELECT '0/21/08' union all
SELECT 'APR 27, 0200' union all
SELECT 'APR 27, 1987' union all
SELECT 'APR 27, 2000' union all
SELECT 'APR 27, 2001' union all
SELECT 'APR 27, 2002' union all
SELECT 'APR 27, 2003' union all
SELECT 'APR 27, 2004' union all
SELECT 'APR 27, 2005' union all
SELECT 'APR 27, 2006' union all
SELECT 'APR 27, 2007' union all
SELECT 'APR 27, 2008' union all
SELECT 'APR 27,2007' union all
SELECT 'APR 28, 0200' union all
SELECT 'APR 28, 0201' union all
SELECT 'APR 28, 2000' union all
SELECT 'APR 28, 2001' union all
SELECT 'APR 28, 2003' union all
SELECT 'APR 28, 2004' union all
SELECT 'APR 28, 2005' union all
SELECT 'APR 28, 2006' union all
SELECT 'APR 28, 2007' union all
SELECT 'APR 28, 2008' union all
SELECT 'APR 28,2007' union all
SELECT 'FEB 14, 2005' union all
SELECT 'FEB 14, 2006' union all
SELECT 'FEB 14, 2007' union all
SELECT 'FEB 14, 2008' union all
SELECT 'JAN 13, 2006' union all
SELECT 'JAN 13, 2007' union all
SELECT 'JAN 13, 2008' union all
SELECT 'JAN 14, 2008' union all
SELECT 'JAN 14, 0200' union all
SELECT 'JAN 14, 1900' union all
SELECT 'JAN 14, 1963' union all
SELECT 'JAN 15, 2008' union all
SELECT 'JAN 16, 0200' union all
SELECT 'JAN 16, 1900' union all
SELECT 'JAN 16, 2001' union all
SELECT 'JAN 16, 2002' union all
SELECT 'JAN 16, 2003' union all
SELECT 'JAN 16, 2004' union all
SELECT 'JAN 16, 2006' union all
SELECT 'JAN 18, 2003' union all
SELECT 'JAN 18, 2004' union all
SELECT 'JAN 18, 2005' union all
SELECT 'JAN 18, 2006' union all
SELECT 'JAN 18, 2007' union all
SELECT 'JAN 18, 2008' union all
SELECT 'JAN 19, 0200' union all
SELECT 'JAN 19, 1900' union all
SELECT 'JUL 01, 0212' union all
SELECT 'JUL 01, 0312' union all
SELECT 'JUL 01, 0407' union all
SELECT 'JUL 01, 0427' union all
SELECT 'JUL 01, 0429' union all
SELECT 'ULY 15, 2003' union all
SELECT 'ULY 17, 2001' union all
SELECT 'UNCUI,JUUT I' union all
SELECT 'UOY. NO, AVOU' union all
SELECT 'UOY. VU, AVOU' union all
SELECT 'UUN U,NE .IM E' union all
SELECT 'VEBRUEY DI 2004' union all
SELECT 'VEMBER 12, 2004' union all
SELECT 'WEY-YE $25-00'
) as t
WHERE ISDATE(dates)=1 and LEN(dates)>=8
You don't need the length check, just using the Isdate function should be fine.
November 7, 2008 at 12:42 am
http://sqlblogcasts.com/blogs/madhivanan/archive/2007/09/24/handle-isdate-with-care.aspx
Failing to plan is Planning to fail
November 7, 2008 at 6:09 am
Interesting. thanks for that link.
November 7, 2008 at 6:20 pm
cliffb (11/6/2008)
Jeff Moden (11/6/2008)
Cliff, take a look at the text file attached in the first post of this thread... not much to "pattern match" on... certainly, not enough to build a whole date in many cases.Yep, I agree. I also know how stubborn Adam is so wanted to give him an option even if it was a crazy way to do it.
LOL! You're right about that... "tenacious" would be his middle name.
--Jeff Moden
Change is inevitable... Change for the better is not.
Viewing 12 posts - 1 through 11 (of 11 total)
You must be logged in to reply to this topic. Login to reply