Using Regular Expressions to clean text in a column using “Master Data Services” ???

  • Hi,

    I have two columns, the first is full of data that requires some sort of text matching clean up. There are words that are misspelled and there are sentences that require me to extract particular words. There are also cases that need to be defaulted to 'UNKNOWN'.

    The desired result is in column 2. Take not of the final few cases where it may be tricky since there are two key words that feature in one column.

    DROP TABLE #Driver

    CREATE TABLE #Driver (Driver varchar(max) ,DriverFix varchar(max))

    INSERT INTO #Driver VALUES ('INSURED','INSURED')

    INSERT INTO #Driver VALUES ('INSURED','INSURED')

    INSERT INTO #Driver VALUES ('INSSURED','INSURED')

    INSERT INTO #Driver VALUES ('10138','UNKNOWN')

    INSERT INTO #Driver VALUES ('PARKED','PARKED')

    INSERT INTO #Driver VALUES ('REGULAR DRIVER','REGULAR DRIVER')

    INSERT INTO #Driver VALUES ('INSURED','INSURED')

    INSERT INTO #Driver VALUES ('SAM','UNKNOWN')

    INSERT INTO #Driver VALUES ('NULL','UNKNOWN')

    INSERT INTO #Driver VALUES ('MOTHER - 5703230003083 (WILMA VERMAAK)','MOTHER')

    INSERT INTO #Driver VALUES ('MR LUCAS KRUGER-ID NUMBER ','UNKNOWN')

    INSERT INTO #Driver VALUES ('INSURED','INSURED')

    INSERT INTO #Driver VALUES ('MPHO','UNKNOWN')

    INSERT INTO #Driver VALUES ('INSURED','INSURED')

    INSERT INTO #Driver VALUES ('INSURED','INSURED')

    INSERT INTO #Driver VALUES ('MR','UNKNOWN')

    INSERT INTO #Driver VALUES ('INSURED','INSURED')

    INSERT INTO #Driver VALUES ('REGULAR DRIVER - PETER','REGULAR DRIVER')

    INSERT INTO #Driver VALUES ('INSURED','INSURED')

    INSERT INTO #Driver VALUES ('PARKD','PARKED')

    INSERT INTO #Driver VALUES ('SPOUSE','SPOUSE')

    INSERT INTO #Driver VALUES ('PARENT','PARENT')

    INSERT INTO #Driver VALUES ('INSURED','INSURED')

    INSERT INTO #Driver VALUES ('REGULAR DRIVVER','REGULAR DRIVER')

    INSERT INTO #Driver VALUES ('INSURED - MR JOHN','INSURED')

    INSERT INTO #Driver VALUES ('INNSURRED','INSURED')

    INSERT INTO #Driver VALUES ('INSURED','INSURED')

    INSERT INTO #Driver VALUES ('SPOUSE','SPOUSE')

    INSERT INTO #Driver VALUES ('INSURED','INSURED')

    INSERT INTO #Driver VALUES ('FATTHER','FATHER')

    INSERT INTO #Driver VALUES ('THE INSURED PETER','INSURED')

    INSERT INTO #Driver VALUES ('THE INSURED JOHN','INSURED')

    INSERT INTO #Driver VALUES ('THE INSURED HARRY','INSURED')

    INSERT INTO #Driver VALUES ('THE INSURED MICHAEL ID NO 87689346846','INSURED')

    INSERT INTO #Driver VALUES ('THE INSURED WAS SARAH','INSURED')

    INSERT INTO #Driver VALUES ('THE INSURED 636913491','INSURED')

    INSERT INTO #Driver VALUES ('THE CAR WAS PARKED','PARKED')

    INSERT INTO #Driver VALUES ('PARKED AT THE STORE','PARKED')

    INSERT INTO #Driver VALUES ('MR. JOHN WAS PARKED','PARKED')

    INSERT INTO #Driver VALUES ('THE DAUGHTER WAS THE DRIVER','DAUGHTER')

    INSERT INTO #Driver VALUES ('SON HAD THE CAR 98787977','SON')

    INSERT INTO #Driver VALUES ('87894747','UNKNOWN')

    INSERT INTO #Driver VALUES ('THE INSURED DAUGHTER','DAUGHTER')

    INSERT INTO #Driver VALUES ('THE INSURED SON','SON')

    INSERT INTO #Driver VALUES ('WIFE ON THE INSURED','WIFE')

    Select * from #Driver

    I was told by the DBA guys that apparently SQL Server 2008R2 can support regular expressions from within T-SQL, but it requires the installation of “Master Data Services” – which ships free with SQL 2008. This has now been installed and configured.

    I was then then given this sample code to demonstrate how to use the regular expression:

    -- Test is a regular expression is valid...

    SELECT 'Is Valid:' , dbo.RegexIsValid('\d{3}');

    -- Create a "mask" argument to control how regular expressions are executed

    DECLARE @mask int = dbo.RegexMask(

    1, --cultureInvariant

    0, --explicitCapture

    1, --ignoreCase

    0, --ignorePatternWhiteSpace

    0, --multiline

    0, --rightToLeft

    0 --singleLine

    );

    -- Search a string for a regular expression

    SELECT 'Match 3 digits in 123-456-789 : ', dbo.RegexIsMatch('123-456-789', '(\d{3})', @mask);

    -- Replace any 3 digits

    SELECT 'Replace 3 digits in 123-456-789 with XXX : ', dbo.RegexReplace('123-456-789', '(\d{3})', 'XXX', @mask);

    SELECT 'Extract 1st group: ', dbo.RegexExtract('123-456-789', '(?<group1>\d{3})', 'group1', @mask);

    Problem is I have no idea what's going on and my knowledge of SQL Server is very limited.

    I'm looking for a solution so if there is any other method that doesn't use “Master Data Services” that's fine with me 🙂

    Thanks a million!!!!

  • You do not have to use regular expressions.

    You can:

    1) separate sentence into words (by space for example),

    2) match words with your custom dictionary of words with factors

    3) return matched word with highest factor value

    You can use the same dictionary for spelling corrections.

    Maybe there also can be used a fulltext search functionality of sql server.

    Regards,

    Vedran

    _____________________________________________________
    Microsoft Certified Master: SQL Server 2008
    XDetails Addin - for SQL Developers
    blog.sqlxdetails.com - Transaction log myths
  • Ended up using SSIS and the Fuzzy Lookup algorithm.

Viewing 3 posts - 1 through 2 (of 2 total)

You must be logged in to reply to this topic. Login to reply