Pattern matching on bad date stamps

  • Hey all,

    Have a tough one ... Long story short, we have an OCR application that scans documents and dumps the records into a database. Well, they're just now noticing that the dates it has been inserting are ... wrong. Some are right, but most are not. There is no common pattern between them (see attached for all unique entries) and most are very unique and can't even be translated back to a real date.

    Just looking for ideas/suggestions at this point on some good ways to try and attack this data for a one time cleanup as well as an ongoing maintenance routine to at least identify bad matches and simply send out notification for human intervention.

    Thanks all

  • ISDATE()

  • cliffb (11/5/2008)


    ISDATE()

    Yeah but I'd at least like to be able to somewhat be able to translate those that are close ... those with misspellings, or bad characters. I was thinking maybe a fuzzy lookup against a dimdate table?

  • Probably not what you want to hear, but I'm thinking that the OCR software, well... sucks and you may have to find some better software.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • You'll have to write a cleanser.

    I would load a staging table that has the same coumns as your target table plus

    CleansedDate field

    IsValidDate field

    Update the IsValidDate field with the results of IsDate

    Run the bad records through a cleansing process.

    That's the best I can come up with right now. How important is the date? If it's not critical, don't spend a lot of time on it. Either way,it's going to take time to look through it. If this is something that is going to be a repetitive task, then you are going to be running in circles as you find more and more cases where the character is off by one (IE the dates with + on the end).

    I think Jeff is right though, look into some other software if possible. If not, try the above technique.

  • Cliff, take a look at the text file attached in the first post of this thread... not much to "pattern match" on... certainly, not enough to build a whole date in many cases.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Is this reliable?

    SELECT dates FROM

    (

    SELECT '$.00 AIIMIE 01 2002' as dates union all

    SELECT '$1,764 21' union all

    SELECT '$10.00 T..MAS ONM' union all

    SELECT '$15.00 T..MAS ONM' union all

    SELECT '$15.00 TILLV R 2002' union all

    SELECT '$15.00 TIME 01 2002' union all

    SELECT '$15.00 TIME DR 2002' union all

    SELECT '$154A4' union all

    SELECT '$24-00' union all

    SELECT '$282.00 NT 1, U ONNO' union all

    SELECT '$7.89 AIIMIE 01 2002' union all

    SELECT '$799A4' union all

    SELECT '%EE' union all

    SELECT '(CHMARY 22, 2006' union all

    SELECT '(KINHER 21 2(H)H' union all

    SELECT '(KLUBER 25 2U2' union all

    SELECT '(MOBTR 21 2004' union all

    SELECT ')EC 1U, 2UU/' union all

    SELECT ')EC 31, 2007' union all

    SELECT ',SEP 02,2006' union all

    SELECT '.1/14/2008' union all

    SELECT '.DEC 17, 2007' union all

    SELECT '.MAR 25, 2008' union all

    SELECT '.NOVEMBER 23, 2004' union all

    SELECT '0/16/03' union all

    SELECT '0/21/08' union all

    SELECT 'APR 27, 0200' union all

    SELECT 'APR 27, 1987' union all

    SELECT 'APR 27, 2000' union all

    SELECT 'APR 27, 2001' union all

    SELECT 'APR 27, 2002' union all

    SELECT 'APR 27, 2003' union all

    SELECT 'APR 27, 2004' union all

    SELECT 'APR 27, 2005' union all

    SELECT 'APR 27, 2006' union all

    SELECT 'APR 27, 2007' union all

    SELECT 'APR 27, 2008' union all

    SELECT 'APR 27,2007' union all

    SELECT 'APR 28, 0200' union all

    SELECT 'APR 28, 0201' union all

    SELECT 'APR 28, 2000' union all

    SELECT 'APR 28, 2001' union all

    SELECT 'APR 28, 2003' union all

    SELECT 'APR 28, 2004' union all

    SELECT 'APR 28, 2005' union all

    SELECT 'APR 28, 2006' union all

    SELECT 'APR 28, 2007' union all

    SELECT 'APR 28, 2008' union all

    SELECT 'APR 28,2007' union all

    SELECT 'FEB 14, 2005' union all

    SELECT 'FEB 14, 2006' union all

    SELECT 'FEB 14, 2007' union all

    SELECT 'FEB 14, 2008' union all

    SELECT 'JAN 13, 2006' union all

    SELECT 'JAN 13, 2007' union all

    SELECT 'JAN 13, 2008' union all

    SELECT 'JAN 14, 2008' union all

    SELECT 'JAN 14, 0200' union all

    SELECT 'JAN 14, 1900' union all

    SELECT 'JAN 14, 1963' union all

    SELECT 'JAN 15, 2008' union all

    SELECT 'JAN 16, 0200' union all

    SELECT 'JAN 16, 1900' union all

    SELECT 'JAN 16, 2001' union all

    SELECT 'JAN 16, 2002' union all

    SELECT 'JAN 16, 2003' union all

    SELECT 'JAN 16, 2004' union all

    SELECT 'JAN 16, 2006' union all

    SELECT 'JAN 18, 2003' union all

    SELECT 'JAN 18, 2004' union all

    SELECT 'JAN 18, 2005' union all

    SELECT 'JAN 18, 2006' union all

    SELECT 'JAN 18, 2007' union all

    SELECT 'JAN 18, 2008' union all

    SELECT 'JAN 19, 0200' union all

    SELECT 'JAN 19, 1900' union all

    SELECT 'JUL 01, 0212' union all

    SELECT 'JUL 01, 0312' union all

    SELECT 'JUL 01, 0407' union all

    SELECT 'JUL 01, 0427' union all

    SELECT 'JUL 01, 0429' union all

    SELECT 'ULY 15, 2003' union all

    SELECT 'ULY 17, 2001' union all

    SELECT 'UNCUI,JUUT I' union all

    SELECT 'UOY. NO, AVOU' union all

    SELECT 'UOY. VU, AVOU' union all

    SELECT 'UUN U,NE .IM E' union all

    SELECT 'VEBRUEY DI 2004' union all

    SELECT 'VEMBER 12, 2004' union all

    SELECT 'WEY-YE $25-00'

    ) as t

    WHERE ISDATE(dates)=1 and LEN(dates)>=8


    Madhivanan

    Failing to plan is Planning to fail

  • Jeff Moden (11/6/2008)


    Cliff, take a look at the text file attached in the first post of this thread... not much to "pattern match" on... certainly, not enough to build a whole date in many cases.

    Yep, I agree. I also know how stubborn Adam is so wanted to give him an option even if it was a crazy way to do it.

  • Madhivanan (11/6/2008)


    Is this reliable?

    SELECT dates FROM

    (

    SELECT '$.00 AIIMIE 01 2002' as dates union all

    SELECT '$1,764 21' union all

    SELECT '$10.00 T..MAS ONM' union all

    SELECT '$15.00 T..MAS ONM' union all

    SELECT '$15.00 TILLV R 2002' union all

    SELECT '$15.00 TIME 01 2002' union all

    SELECT '$15.00 TIME DR 2002' union all

    SELECT '$154A4' union all

    SELECT '$24-00' union all

    SELECT '$282.00 NT 1, U ONNO' union all

    SELECT '$7.89 AIIMIE 01 2002' union all

    SELECT '$799A4' union all

    SELECT '%EE' union all

    SELECT '(CHMARY 22, 2006' union all

    SELECT '(KINHER 21 2(H)H' union all

    SELECT '(KLUBER 25 2U2' union all

    SELECT '(MOBTR 21 2004' union all

    SELECT ')EC 1U, 2UU/' union all

    SELECT ')EC 31, 2007' union all

    SELECT ',SEP 02,2006' union all

    SELECT '.1/14/2008' union all

    SELECT '.DEC 17, 2007' union all

    SELECT '.MAR 25, 2008' union all

    SELECT '.NOVEMBER 23, 2004' union all

    SELECT '0/16/03' union all

    SELECT '0/21/08' union all

    SELECT 'APR 27, 0200' union all

    SELECT 'APR 27, 1987' union all

    SELECT 'APR 27, 2000' union all

    SELECT 'APR 27, 2001' union all

    SELECT 'APR 27, 2002' union all

    SELECT 'APR 27, 2003' union all

    SELECT 'APR 27, 2004' union all

    SELECT 'APR 27, 2005' union all

    SELECT 'APR 27, 2006' union all

    SELECT 'APR 27, 2007' union all

    SELECT 'APR 27, 2008' union all

    SELECT 'APR 27,2007' union all

    SELECT 'APR 28, 0200' union all

    SELECT 'APR 28, 0201' union all

    SELECT 'APR 28, 2000' union all

    SELECT 'APR 28, 2001' union all

    SELECT 'APR 28, 2003' union all

    SELECT 'APR 28, 2004' union all

    SELECT 'APR 28, 2005' union all

    SELECT 'APR 28, 2006' union all

    SELECT 'APR 28, 2007' union all

    SELECT 'APR 28, 2008' union all

    SELECT 'APR 28,2007' union all

    SELECT 'FEB 14, 2005' union all

    SELECT 'FEB 14, 2006' union all

    SELECT 'FEB 14, 2007' union all

    SELECT 'FEB 14, 2008' union all

    SELECT 'JAN 13, 2006' union all

    SELECT 'JAN 13, 2007' union all

    SELECT 'JAN 13, 2008' union all

    SELECT 'JAN 14, 2008' union all

    SELECT 'JAN 14, 0200' union all

    SELECT 'JAN 14, 1900' union all

    SELECT 'JAN 14, 1963' union all

    SELECT 'JAN 15, 2008' union all

    SELECT 'JAN 16, 0200' union all

    SELECT 'JAN 16, 1900' union all

    SELECT 'JAN 16, 2001' union all

    SELECT 'JAN 16, 2002' union all

    SELECT 'JAN 16, 2003' union all

    SELECT 'JAN 16, 2004' union all

    SELECT 'JAN 16, 2006' union all

    SELECT 'JAN 18, 2003' union all

    SELECT 'JAN 18, 2004' union all

    SELECT 'JAN 18, 2005' union all

    SELECT 'JAN 18, 2006' union all

    SELECT 'JAN 18, 2007' union all

    SELECT 'JAN 18, 2008' union all

    SELECT 'JAN 19, 0200' union all

    SELECT 'JAN 19, 1900' union all

    SELECT 'JUL 01, 0212' union all

    SELECT 'JUL 01, 0312' union all

    SELECT 'JUL 01, 0407' union all

    SELECT 'JUL 01, 0427' union all

    SELECT 'JUL 01, 0429' union all

    SELECT 'ULY 15, 2003' union all

    SELECT 'ULY 17, 2001' union all

    SELECT 'UNCUI,JUUT I' union all

    SELECT 'UOY. NO, AVOU' union all

    SELECT 'UOY. VU, AVOU' union all

    SELECT 'UUN U,NE .IM E' union all

    SELECT 'VEBRUEY DI 2004' union all

    SELECT 'VEMBER 12, 2004' union all

    SELECT 'WEY-YE $25-00'

    ) as t

    WHERE ISDATE(dates)=1 and LEN(dates)>=8

    You don't need the length check, just using the Isdate function should be fine.

  • http://sqlblogcasts.com/blogs/madhivanan/archive/2007/09/24/handle-isdate-with-care.aspx


    Madhivanan

    Failing to plan is Planning to fail

  • Interesting. thanks for that link.

  • cliffb (11/6/2008)


    Jeff Moden (11/6/2008)


    Cliff, take a look at the text file attached in the first post of this thread... not much to "pattern match" on... certainly, not enough to build a whole date in many cases.

    Yep, I agree. I also know how stubborn Adam is so wanted to give him an option even if it was a crazy way to do it.

    LOL! You're right about that... "tenacious" would be his middle name.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

Viewing 12 posts - 1 through 11 (of 11 total)

You must be logged in to reply to this topic. Login to reply