Replace non numeric characters in string

  • Sergei Zarembo (3/20/2008)


    If you go via a CLR regex function to render "235 dogs, 5 cats" to "2355" then [considering u r doing millions of recs] this a bit more advanced replace would probably give you better performance:

    regex.replace("235 dogs, 5 cats", @"\D+","")

    --returns: "2355"

    it allows the .NET regex engine to roll from Nth to (N+1)th occurrence of a digit [\d] without stopping for a replace, capture all the \D (non-digits) and replace them at once with an empty str.

    Give the man a cigar! Very good thought, and borne out in testing. It shaves off another 10 seconds.

    Same function, just better replacement criteria. Very nice.

    ----------------------------------------------------------------------------------
    Your lack of planning does not constitute an emergency on my part...unless you're my manager...or a director and above...or a really loud-spoken end-user..All right - what was my emergency again?

  • ;)...for your regex needs, consider the best regex site:

    regexadvice.com

    We've got a regex construction forum there, including .NET, C# and VB versions.

    everyone is welcome to stop by.

  • Sorry guys, I lost track of this one... what ended up winning the race on the 4 million row test? RegEx or one of the functions?

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • The regex won if the pattern had text in between separate numbers, and you wanted that simply removed, the last function I proposed won if not.

    The example I used was "234 dogs, 5 cats". The regex would get you "2345" as the number, while my last function would get "234 dogs, 5". Since neither of those results makes any sense in any real-world situation, I don't think that part matters.

    On the other hand, complex numbers like "1.234e-10", I'm not sure what the regex would do, but my function would leave that intact. Mine would destroy "X^5", leaving only the "5". Again, not sure what the regex would do. Haven't tried it.

    As always, it depends on what data you're trying to clean up. But in cases where it will work, the last version of the string function was definitely much faster.

    - Gus "GSquared", RSVP, OODA, MAP, NMVP, FAQ, SAT, SQL, DNA, RNA, UOI, IOU, AM, PM, AD, BC, BCE, USA, UN, CF, ROFL, LOL, ETC
    Property of The Thread

    "Nobody knows the age of the human race, but everyone agrees it's old enough to know better." - Anon

  • GSquared (4/21/2008)


    The regex won if the pattern had text in between separate numbers, and you wanted that simply removed, the last function I proposed won if not.

    The example I used was "234 dogs, 5 cats". The regex would get you "2345" as the number, while my last function would get "234 dogs, 5". Since neither of those results makes any sense in any real-world situation, I don't think that part matters.

    On the other hand, complex numbers like "1.234e-10", I'm not sure what the regex would do, but my function would leave that intact. Mine would destroy "X^5", leaving only the "5". Again, not sure what the regex would do. Haven't tried it.

    As always, it depends on what data you're trying to clean up. But in cases where it will work, the last version of the string function was definitely much faster.

    Correct - if you don't have advanced knowledge of any pattern - then regex takes over handily. Advanced knowledge that there is only one contiguous numeric range helps tilt that advantage to a T-SQL solution.

    As to scientific notation, or formula detection - not sure we took that one on. The initial issue simply involved pulling out nothing but the digits (no decimal places, signs, etc...).

    ----------------------------------------------------------------------------------
    Your lack of planning does not constitute an emergency on my part...unless you're my manager...or a director and above...or a really loud-spoken end-user..All right - what was my emergency again?

  • Yeah, telling a regex to keep an "e" or "E" if it's after a number and before another number or a "+" or "-" which is followed by another number, but to remove it otherwise, is a bit of a trick. I'm sure it can be done (I'm NOT a regex designer), but it wasn't part of this request. I segued from "just keep the digits" to more complex situations.

    For the purpose of the original post, I think the regex function will do what's needed best, since it will deal with "(123) 456-7890" correctly. Though, for phone numbers, extension numbers will break the regex unless you make sure to truncate after the right number of digits. You'd have to also make sure it could deal with a variable for that, so that it handles "1 (234) 567-8901" if the preceding "1" is included in the field, or "(234) 567-8901" if it's not. Otherwise, you'll corrupt data like "1-234-567-8901 ext 2345", and end up with an undialable number, "1234567890" (left 10), or "123456789012345".

    Either way, use the regex to turn it into just numbers, or expand the regex to deal with phone number formats, or after you turn it into just numbers, check for known phone number issues like leading "1" and more than 10 digits. (At least, that will work for US phones. Will be different in other countries.)

    That part could easily be set-based.

    update table set phone = right(phone, len(phone) - 1) where phone like '1%'

    update table set extension= right(phone, len(phone) - 10) where len(phone) > 10

    update table set phone = left(phone, 10)

    Something like that, run after the regex, would get the desired end result of phone numbers in a standardized format, with extension stored in a separate column.

    - Gus "GSquared", RSVP, OODA, MAP, NMVP, FAQ, SAT, SQL, DNA, RNA, UOI, IOU, AM, PM, AD, BC, BCE, USA, UN, CF, ROFL, LOL, ETC
    Property of The Thread

    "Nobody knows the age of the human race, but everyone agrees it's old enough to know better." - Anon

  • So, in the million and 4 million row tests, what does your table and data look like? Are you just making a table with a Primary Key and a VARCHAR column to contain "235 dogs, 5 cats"?

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Never mind... I got it from one of the posts above... just to make sure, you're using this from one of Gus' posts?

    create table NumberClean (

    Number bigint,

    Clean varchar(100))

    go

    set nocount on

    set statistics time OFF

    insert into dbo.numberclean(number, clean)

    select 100, '100 ' union all

    select 5, 'number 5' union all

    select 15, '643' union all

    select 1234567890123456, '1234567890123456 pounds'

    go 1000000

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Jeff Moden (4/22/2008)


    Never mind... I got it from one of the posts above... just to make sure, you're using this from one of Gus' posts?

    create table NumberClean (

    Number bigint,

    Clean varchar(100))

    go

    set nocount on

    set statistics time OFF

    insert into dbo.numberclean(number, clean)

    select 100, '100 ' union all

    select 5, 'number 5' union all

    select 15, '643' union all

    select 1234567890123456, '1234567890123456 pounds'

    go 1000000

    I can' t find what I used to build my table (started with the table def above, but rebuilt it once of twice) - right now, looks like 2 newID()'s cast as varchar's and smacked together....

    ----------------------------------------------------------------------------------
    Your lack of planning does not constitute an emergency on my part...unless you're my manager...or a director and above...or a really loud-spoken end-user..All right - what was my emergency again?

  • Thanks... I'll use Gus', then. I'll also throw in the dog/cat thing.

    Just in some preliminary testing on 4Meg rows, I've found that the SET STATISTICS TIME ON command really makes things go haywire. One test I did without it took 3:24... ran the exact same code snippet with it on and stopped the query after 13 minutes. Guess I'm going to have to upgrade from sp2 to the latest cumulitive update...

    Also, the following code makes the same test table as what Gus built, but a wee bit faster...

    [font="Courier New"]--=======================================================================================

    --      Create the 4 million row test table

    --      20 seconds to create, 31 seconds to create CLUSTERED PK

    --=======================================================================================

    --DROP TABLE dbo.NumberClean

    --===== Create and populate the table on the fly

    ;WITH

    cteDataBlock AS

    (

     SELECT CAST(100 AS BIGINT) AS Number, CAST('100 ' AS VARCHAR(100)) AS Clean UNION ALL

     SELECT 5, 'Number 5' UNION ALL

     SELECT 15, '643' UNION ALL

     SELECT 1234567890123456, '1234567890123456 Pounds'

    )

     SELECT TOP 4000000 IDENTITY (INT,1,1) AS ID,db.Number,db.Clean

       INTO dbo.NumberClean

       FROM cteDataBlock db,

            Master.sys.All_Objects ao1,

            Master.sys.All_Objects ao2

    --===== Add the clustered PK

      ALTER TABLE dbo.NumberClean

        ADD CONSTRAINT PK_NumberClean_ID

            PRIMARY KEY CLUSTERED (ID)[/font]

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • GSquared (4/21/2008)


    Yeah, telling a regex to keep an "e" or "E" if it's after a number and before another number or a "+" or "-" which is followed by another number, but to remove it otherwise, is a bit of a trick. I'm sure it can be done (I'm NOT a regex designer), but it wasn't part of this request. I segued from "just keep the digits" to more complex situations.

    the logic can be implemented by the following regex match pattern:

    (?<!\d)[eE](?![+-]?\d)

    when run vs this input text:

    <

    1.25e-10; 123break5

    Yeah, telling a regex to keep

    an "e" or "E" if it's after a number and before another number or a "+" or "-" which is

    followed by another number, but to remove it otherwise, is a bit of a tric

    >

    ..with subsequent replacement of the matches by an empty string will give this result

    <

    1.25e-10; 123brak5

    Yah, tlling a rgx to kp

    an "" or "" if it's aftr a numbr and bfor anothr numbr or a "+" or "-" which is

    followd by anothr numbr, but to rmov it othrwis, is a bit of a trick.

    >

    ..keeping the mantissa intact. Construct (?<!\d) , called a look-behind, is telling the regex engine NOT to match on the next char IF it's preceded by a digit. Logically {AND} combined with similarly-functioning look-ahead (?![+-]?\d) {DON"T match IF a signed or unsigned digit follows} allows to avoid matching on *E* in hopefully most of the relevant cases. Depends on the input of course.

  • Now, can that be added into Matt's regex easily, to deal with removing all other non-numeric symbols, but leaving complex numbers intact?

    - Gus "GSquared", RSVP, OODA, MAP, NMVP, FAQ, SAT, SQL, DNA, RNA, UOI, IOU, AM, PM, AD, BC, BCE, USA, UN, CF, ROFL, LOL, ETC
    Property of The Thread

    "Nobody knows the age of the human race, but everyone agrees it's old enough to know better." - Anon

  • GSquared (4/24/2008)


    Now, can that be added into Matt's regex easily, to deal with removing all other non-numeric symbols, but leaving complex numbers intact?

    GSquared,

    please tell me which Matt's regex you have in mind? I scrolled down the posts but could not find it.

    Thanks,

    Sergei Z

  • I don't think he actually posted the code for this one.

    - Gus "GSquared", RSVP, OODA, MAP, NMVP, FAQ, SAT, SQL, DNA, RNA, UOI, IOU, AM, PM, AD, BC, BCE, USA, UN, CF, ROFL, LOL, ETC
    Property of The Thread

    "Nobody knows the age of the human race, but everyone agrees it's old enough to know better." - Anon

  • Sergei Zarembo (4/24/2008)


    GSquared (4/24/2008)


    Now, can that be added into Matt's regex easily, to deal with removing all other non-numeric symbols, but leaving complex numbers intact?

    GSquared,

    please tell me which Matt's regex you have in mind? I scrolled down the posts but could not find it.

    Thanks,

    Sergei Z

    I'm thinking GSquared is actually referring to the Regex expression you came up with

    dbo.regexreplace('235 dogs, 5 cats', '(\D)+','')

    the function is simply a wrapper to get to the CLR Regex functionality.

    ----------------------------------------------------------------------------------
    Your lack of planning does not constitute an emergency on my part...unless you're my manager...or a director and above...or a really loud-spoken end-user..All right - what was my emergency again?

Viewing 15 posts - 31 through 45 (of 81 total)

You must be logged in to reply to this topic. Login to reply