how to remove numbers from strings?

  • Matt Miller (10/19/2007)


    I figured that was the general direction you were heading in. A lot of that's improved with the newer updates to the framework. Certainly after them and SQL2005 SP2 - it seems to not happen a lot.

    And it never happens in SQL2000.

    Now tell me about advantages of the new version. 😉

    Lowell,

    I think Jeff repeated it twice, to avoid any misunderstanding.

    😎

    _____________
    Code for TallyGenerator

  • Now - the fun begins:)... Did I mention I love the cage-match format for testing? hehe.....

    All right - you want your 2 seconds back....I got them. I'll buy them for a dollar, and hand you back change for a twenty:P

    Some optimization tips only come back in the middle of the night, and like I mentioned in another post, I sat bolt upright @ 3am or so last night, in the "middle" of coding in my mind. I remembered a trick I was told might help optimize SQL Server's handling of CLR functions, which was to make sure to set the IsDeterministic and IsPrecise flags to True so that SQL would treat them as such.

    So - I replicated Jeff's scenario, added the non-flagged and the flagged versions of the code, and....drum roll please....

    Count of victims: 1,030,932

    Tally table: 00:00:52:510

    Not Flagged: 00:00:57:003

    Flagged: 00:00:52:853

    So - it looks like it's statistically a tie... WHEN you flag the functions correctly.

    Did I mention that RegexReplace has backreference capabilities? (So you could, say, flip the digit groups in a phone number lickety split) It also comes with a sleek leather carrying case? :w00t:
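
    Quick sketch of what I mean by the backreferences (assuming the function is deployed as dbo.RegexReplace, as in the example a couple of posts down - the phone number is made up):

    declare @phone varchar(20)
    select @phone = '555-123-4567'

    -- $1/$2/$3 in the replacement string refer back to the captured digit groups
    select dbo.RegexReplace(@phone, '(\d{3})-(\d{3})-(\d{4})', '$3-$2-$1')
    -- returns: 4567-123-555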

    Sorry - dream-state coding can have deleterious effects....but usually on your sense of humor...:P

    ----------------------------------------------------------------------------------
    Your lack of planning does not constitute an emergency on my part...unless you're my manager...or a director and above...or a really loud-spoken end-user..All right - what was my emergency again?

  • Oh Sergiy - don't be afraid of little ol' CLR... It ain't going to bite you.... Show me one feature MS introduced without a memory leak in its "first attempt" :)

    "Just another arrow in the quiver"...:)

    ----------------------------------------------------------------------------------
    Your lack of planning does not constitute an emergency on my part...unless you're my manager...or a director and above...or a really loud-spoken end-user..All right - what was my emergency again?

  • FYI - for those interested... The "flagged" version looks like this.

    Imports System
    Imports System.Data
    Imports System.Data.SqlClient
    Imports System.Data.SqlTypes
    Imports Microsoft.SqlServer.Server
    Imports System.Text.RegularExpressions

    Partial Public Class UserDefinedFunctions

        <Microsoft.SqlServer.Server.SqlFunction(IsDeterministic:=True, IsPrecise:=True)> _
        Public Shared Function RegexReplace(ByVal input As SqlChars, ByVal pattern As SqlString, ByVal rep As SqlString) As SqlString
            ' Compile the pattern and apply the replacement across the whole input string
            Dim rex As Regex = New Regex(pattern.Value, RegexOptions.None)
            Return New SqlString(rex.Replace(New String(input.Value), rep.Value))
        End Function

    End Class

    And then the whole "cleaning process" becomes:

    declare @dirtystring varchar(500)
    declare @cleanstring varchar(500)
    declare @rex varchar(100)

    select @dirtystring = '123k4biodf-k123l4h123098d-09dfknqwerklhsdfo0er98tqwer/;/;leriuqwer-t0'
    select @rex = '[^0-9]'    -- note: this pattern strips everything EXCEPT the digits
    select @cleanstring = dbo.RegexReplace(@dirtystring, @rex, '')

    select @cleanstring

    ----------------------------------------------------------------------------------
    Your lack of planning does not constitute an emergency on my part...unless you're my manager...or a director and above...or a really loud-spoken end-user..All right - what was my emergency again?

    Matt, do you think that with Jeff's function fnDeleteNumbers (@String VARCHAR(8000)) from the post above the code would be any longer?

    And Jeff has already run a performance test.

    The result was not in favour of the "new feature".

    _____________
    Code for TallyGenerator

    Sergiy - you missed my response to Jeff. Short answer - on the same platform (hardware+table+indexes+data spread) - they tie, or at least are close enough to consider either in my mind.

    Again... limited window of functionality being tested - not necessarily scalable to the T-SQL vs CLR war that is about to break out...

    ----------------------------------------------------------------------------------
    Your lack of planning does not constitute an emergency on my part...unless you're my manager...or a director and above...or a really loud-spoken end-user..All right - what was my emergency again?

  • Matt Miller (10/19/2007)


    Sergiy - you missed my response to Jeff. Short answer - on the same platform (hardware+table+indexes+data spread) - they tie, or at least are close enough to consider either in my mind.

    Again... limited window of functionality being tested - not necessarily scalable to the T-SQL vs CLR war that is about to break out...

    Matt, thanks for taking this so well and continuing to help measure the performance between the methods.

    Just to clarify, my point was meant to say the following rather than do a "cage match" over a couple of seconds... 'cause a couple of seconds just don't matter in the face of 5.5 million rows... 41 minutes? That's a different story 😛

    No, I've been trying to drive a couple of points home about CLRs, upgrades, and a couple of other things...

    1. Most people would accept the code that Derek found as good code... in fact, someone did because they were either too lazy or too much in a hurry to even consider making the Tally table. And, to make the code run faster, they would buy faster hardware and take all that time to set it up, do the installs, transfer the data, etc, etc, etc... when they could've spent an hour or two researching the correct solution and spent a very small bit of time changing the code to use the new solution.

    2. CLRs, cursors, DTS, and a whole herd of other "features" were put there for just one reason... a lot of folks just don't understand the true power of RELATIONAL databases. They think it's just a place to "store data"... that T-SQL has no power... that the GUI is the real source of power and performance. Heck, look at cursors... if memory serves me correctly, SQL Server didn't even have cursors until version 6.5... they put cursors in because most people couldn't get a grip on the fact that it's a relational database and that good set-based code was required to get it to work. So, they put cursors in to allow "typical" ISAM (RBAR) Procedural programmers to use SQL both in GUI's and Batch code.

    That brings us into the next point...

    3. The reason why people think like that is because the GUI has become the focal point of all business and that's patently wrong! Look at all the IT ads... "Wanted: Java/C#/ASP developer with at least 5 years experience and some knowledge of SQL". They know the cost of everything and the value of nothing... I've seen many DBA's (even on this very forum) make the switch from being a DBA to being a GUI programmer because the "glory of the GUI" was too much to resist. Those same folks also claim that DBA's will become extinct... Maybe that will be partially true for pure "system" DBA's, but application DBA's that know how to serve up the data in a very high performance manner, protect the performance of the server, and protect the security, viability, and purity of the data will always be needed and have a job forever.

    4. Finally, even though the performance of the CLR and Tally Table solution are neck'n'neck when it comes to run duration (we haven't measured other performance factors, but high-speed code doesn't hold heavy resources as long as low-performance code does), there is something that everyone forgets about... documentation. All my T-SQL code has the documentation built in, including likely "search words"... if you want to find the code, just do a word search on SysComments and, just like a forum, it'll find the correct code... if it's documented. Need to troubleshoot the code? Don't need to find the documentation or analyze raw code... the documentation in the code will walk you through it. The code is complete and it's all in SQL Server... no need for a different type of "expert" to get involved, either. Don't need a C# guy or a Java guy to write any code for the database. And, like I said, the documentation is built in to the T-SQL. When you use CLR's, can you do a search on its comments... or is that documentation somewhere else that the developer would need to know about?
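
    For example, that kind of word search is as simple as this (a minimal sketch - the keyword "Tally" is just an illustration):

    -- SQL 2000: syscomments chops long definitions into 4000-character chunks, hence the DISTINCT
    select distinct object_name(id) as object_name
      from syscomments
     where text like '%Tally%'

    -- SQL 2005 equivalent
    select object_name(object_id) as object_name
      from sys.sql_modules
     where definition like '%Tally%'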

    No... I wholeheartedly agree with Sergiy... most of the new features that have come out are just to make it easier for the GUI folks and self-proclaimed "SQL Experts" that don't really know T-SQL. And, some of those features are actually slower than "real" T-SQL... again, I don't have SQL 2k5 to test with, but (I believe it was Sergiy on the other CLR thread that said it) things like Cross Apply are actually a fair bit slower than the real T-SQL to do the same thing... (some things like CTE's can be useful, but they're not needed)... people just don't wanna think about how to do it anymore, and when performance suffers and that shiny new water-cooled server still doesn't improve the performance, then they finally post a thread on this forum saying "I've got this performance problem... can someone help me?"

    "Before you think outside the box, consider how good the box you're in is."

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Wow - a lot to handle here

    Jeff Moden (10/19/2007)


    Just to clarify, my point was meant to say the following rather than do a "cage match" over a couple of seconds... 'cause a couple of seconds just don't matter in the face of 5.5 million rows... 41 minutes? That's a different story 😛

    Don't think anything of it. I don't often get a good opportunity to get into a tussle over doing the RIGHT thing. I really do like a chance once in a while to test my own assumptions (based, incidentally, on BOL's own descriptions of the interaction between T-SQL and CLR).

    Besides - those 2 seconds got me to learn something, so I came out ahead.

    Jeff Moden (10/19/2007)


    1. Most people would accept the code that Derek found as good code... in fact, someone did because they were either too lazy or too much in a hurry to even consider making the Tally table. And, to make the code run faster, they would buy faster hardware and take all that time to set it up, do the installs, transfer the data, etc, etc, etc... when they could've spent an hour or two researching the correct solution and spent a very small bit of time changing the code to use the new solution.

    Well, I don't know if lazy is a universal qualifier that always works. I wouldn't consider it good enough, but I might consider it good enough "for now" if I didn't know something about CLR and the circumstances required it. It might even get published while I work on something else to "improve" it.

    Jeff Moden (10/19/2007)


    2. CLRs, cursors, DTS, and a whole herd of other "features" were put there for just one reason... a lot of folks just don't understand the true power of RELATIONAL databases. They think it's just a place to "store data"... that T-SQL has no power... that the GUI is the real source of power and performance. Heck, look at cursors... if memory serves me correctly, SQL Server didn't even have cursors until version 6.5... they put cursors in because most people couldn't get a grip on the fact that it's a relational database and that good set-based code was required to get it to work. So, they put cursors in to allow "typical" ISAM (RBAR) Procedural programmers to use SQL both in GUI's and Batch code.

    I'll both agree and disagree with you there. All of these were brought in to deal with different aspects and for different purposes. The new features tend to get abused and used when not appropriate, and that's truly a shame, but that doesn't negate their correct use IN THE RIGHT PLACE.

    Data processing isn't just limited to the GUI level or the batch level anymore. The reality is that there are lots of circumstances in between, often some ugly conglomerate of the two. Like having a RAD option for getting some reasonably low-volume imports/exports/messy workflow processing done (= DTS), or taking a task which is procedural by nature - incredibly ugly string processing, interacting with things OUTSIDE of SQL, etc. - and programming it in a procedural language (= CLR).

    Data in my mind is like a project - it never survives contact with the real world. We as the DBA's and data modelers and data architects can come up with a "perfect model", but there are ALWAYS exceptions we somehow have to account for. Now our job and our duty as DBA's/modeler/etc... is to stand in the way of chaos and prevent the users from running roughshod all over our data, but the reality is - that data never quite stays on the pedestal it was on when we planned the d*** thing.

    The trick is having the ability to see which tasks might work under which tools you have and possibly test one or two options to see which works better.

    To paraphrase the BOL section about best practices when dealing with CLR: Data access happens SOLELY through T-SQL, but things that are extensive procedural computations MIGHT be better done using a compiled procedural solution like CLR. I believe that, I see it, and I sometimes have to test two options to know what is better.

    You, Sergiy, and a host of others here have a phenomenal gift at knowing somewhat undocumented/obscure features/optimized kludges in SQL Server that allow you to find T-SQL solutions to damn near anything, such as the running total concepts (which I've seen qualified by various MS bloggers as the "inner loop process", which I hope won't send you into convulsions). It might still be worth checking the other side of the fence on occasion.

    Jeff Moden (10/19/2007)


    That brings us into the next point...

    3. The reason why people think like that is because the GUI has become the focal point of all business and that's patently wrong! Look at all the IT ads... "Wanted: Java/C#/ASP developer with at least 5 years experience and some knowledge of SQL". They know the cost of everything and the value of nothing... I've seen many DBA's (even on this very forum) make the switch from being a DBA to being a GUI programmer because the "glory of the GUI" was too much to resist. Those same folks also claim that DBA's will become extinct... Maybe that will be partially true for pure "system" DBA's, but application DBA's that know how to serve up the data in a very high performance manner, protect the performance of the server, and protect the security, viability, and purity of the data will always be needed and have a job forever.

    Agreed. Enough said there.

    Jeff Moden (10/19/2007)


    4. Finally, even though the performance of the CLR and Tally Table solution are neck'n'neck when it comes to run duration (we haven't measured other performance factors, but high-speed code doesn't hold heavy resources as long as low-performance code does), there is something that everyone forgets about... documentation. All my T-SQL code has the documentation built in, including likely "search words"... if you want to find the code, just do a word search on SysComments and, just like a forum, it'll find the correct code... if it's documented. Need to troubleshoot the code? Don't need to find the documentation or analyze raw code... the documentation in the code will walk you through it. The code is complete and it's all in SQL Server... no need for a different type of "expert" to get involved, either. Don't need a C# guy or a Java guy to write any code for the database. And, like I said, the documentation is built in to the T-SQL. When you use CLR's, can you do a search on its comments... or is that documentation somewhere else that the developer would need to know about?

    you're right - I'm not sure we gave it a good enough test to know. I think we picked something where the data access (T-SQL wins) and the procedural aspects (CLR wins) were fairly evenly matched. We also didn't exhaustively look for what other resources might be getting taxed/leveling the results, etc... I'd venture to say that we could find circumstances on both sides where one becomes the "clear winner".

    As to documentation - the CLR function is compiled, so its source (comments included) is available only in the pre-compiled view of it (Visual Studio). Compiled code never carries its comments along. Then again, AFAIK the same is true of XP's, which are also compiled, so that's not a new limitation. It is a shame that there doesn't seem to be any way to "publish" comments for CLR code to something like syscomments for the DBA to see. Of course - it's compiled, so it doesn't get re-interpreted each time.

    As to legibility/ease of understanding - a lot of that is in the eye of the beholder. If you have a thorough understanding of something like Regular Expressions, a one line regexreplace(string,pattern,replace) statement instead of a 4-25 line set of complicated statements (again - for something more complex than what we're doing right here) helps me understand the REST of the process a lot faster. Again - exposure is the key: I'm sure you can instinctively spot a tally table process in 5 pages of code, because that's something you live and breathe; it might take me somewhat longer.

    Jeff Moden (10/19/2007)


    No... I wholeheartedly agree with Sergiy... most of the new features that have come out are just to make it easier for the GUI folks and self-proclaimed "SQL Experts" that don't really know T-SQL. And, some of those features are actually slower than "real" T-SQL... again, I don't have SQL 2k5 to test with, but (I believe it was Sergiy on the other CLR thread that said it) things like Cross Apply are actually a fair bit slower than the real T-SQL to do the same thing... (some things like CTE's can be useful, but they're not needed)... people just don't wanna think about how to do it anymore, and when performance suffers and that shiny new water-cooled server still doesn't improve the performance, then they finally post a thread on this forum saying "I've got this performance problem... can someone help me?"

    Dealing with ideal circumstances, where you have full control over the data, the code, etc..., I'd agree. But some of those items (like CROSS APPLY) are built to deal with less than perfect scenarios, and yes - you pay a price for having them. Set-based processing covers a LOT of possible tasks, and when it does - use it since that's what SQL Server is optimized for.
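
    For instance (purely hypothetical tables, just to illustrate the shape of it) - one of those "less than perfect" scenarios is grabbing the latest order per customer with a correlated TOP (1):

    select c.CustomerID, o.OrderDate, o.TotalDue
      from dbo.Customers c
     cross apply (select top (1) OrderDate, TotalDue
                    from dbo.Orders o
                   where o.CustomerID = c.CustomerID
                   order by o.OrderDate desc) o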

    You and I are advocating the same thing ultimately - think (and I mean THINK) about what would be the best way to approach something. Once you THINK you know the right way, TEST that it's the right way, and challenge yourself by researching it to see if anyone else has done better. When you THINK it's a procedural process only, step back and take a second and/or third look - be SURE it's not something a set-based approach could handle.

    Jeff Moden (10/19/2007)


    "Before you think outside the box, consider how good the box you're in is."

    Corollary: No matter how good the box, remember - you're IN a box. There are things outside of the box.

    ----------------------------------------------------------------------------------
    Your lack of planning does not constitute an emergency on my part...unless you're my manager...or a director and above...or a really loud-spoken end-user..All right - what was my emergency again?

  • Just a couple of comments.

    I really enjoyed the thread. A lot of good information and no one got personal as I have seen happen when people have different ideas in a forum. Of course any thread with posts from Jeff and Matt will be interesting:hehe:.

    I have to admit that I would have probably gone with Derek's code, not because I am lazy, but because I would not have thought of a better solution. Since I regularly visit SSC, I would have eventually seen this thread and learned a better way (actually 2). Not to say I would have immediately gone back to my code and changed it, but I would and will add the alternate solutions to my toolbox.

    I am with Jeff in that I always look for a set-based solution to a data problem avoiding cursors and loops as much as possible, even if I'm not smart enough to come up with the tally solution;). Sql Server and relational database systems are designed to work with sets so that's what I do.
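
    For my own toolbox notes, here's roughly the shape the Tally-table version takes as I understand it (just a sketch, not Jeff's actual fnDeleteNumbers - it assumes a dbo.Tally table with a column N of sequential integers, and uses SQL 2005's FOR XML PATH to glue the surviving characters back together):

    declare @dirtystring varchar(500)
    declare @cleanstring varchar(500)

    select @dirtystring = '123k4biodf-k123l4h123098d-09dfknqwer'

    -- keep only the characters that are NOT digits, in their original order
    -- (caveat: FOR XML will entitize characters such as < and &)
    select @cleanstring = isnull(
           (select substring(@dirtystring, t.N, 1)
              from dbo.Tally t
             where t.N <= len(@dirtystring)
               and substring(@dirtystring, t.N, 1) not like '[0-9]'
             order by t.N
               for xml path('')), '')

    select @cleanstring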

  • Jack Corbett (10/19/2007)


    I am with Jeff in that I always look for a set-based solution to a data problem avoiding cursors and loops as much as possible, even if I'm not smart enough to come up with the tally solution;). Sql Server and relational database systems are designed to work with sets so that's what I do.

    Agreed - and for the record, so do I (look for the set-based approach). There are (few and far between) times where I might consider using another tactic, and I'd test both options to see which one flies.

    ----------------------------------------------------------------------------------
    Your lack of planning does not constitute an emergency on my part...unless you're my manager...or a director and above...or a really loud-spoken end-user..All right - what was my emergency again?

  • Sergiy


    Probably 2k5 is more comfortable for procedural language programmers, but for those who understand the relational model and are used to operating with datasets, it's useless.

    While I'm on board with the "set based over CLR whenever possible" crowd, I have to take exception to this. If you want to claim that anything set-based that can be done in 2005 can be accomplished with 2000, you'll get only a minor argument from me, but that doesn't mean it's useless. There is nothing I can do in C++ that I can't do in Assembler, but that doesn't make C++ useless.

    For instance, you'll take Over() back out of SQL Server over my cold dead body. As Jeff mentioned, CTEs aren't necessary (although their recursive form is really a pain to duplicate in 2000), but they sure are nice. There are a ton of other things that come in very handy, even though most can be accomplished, without too much difficulty, via alternate means in 2000.
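
    (For the curious, a minimal sketch against a hypothetical dbo.Orders table of what OVER() buys you - each detail row gets its group aggregate in one pass, where 2000 needs a derived table joined back on:)

    -- SQL 2005: windowed aggregate alongside the detail rows
    select OrderID, CustomerID, TotalDue,
           sum(TotalDue) over (partition by CustomerID) as CustomerTotal
      from dbo.Orders

    -- SQL 2000: same result via a derived table and an extra join
    select o.OrderID, o.CustomerID, o.TotalDue, t.CustomerTotal
      from dbo.Orders o
      join (select CustomerID, sum(TotalDue) as CustomerTotal
              from dbo.Orders
             group by CustomerID) t
        on t.CustomerID = o.CustomerID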

    Are there things that I liked more in 2000? Sure. Is 2005 useless? Not even close.

    I wouldn't go back to 2000 for a $5000 raise.

  • David, for some reason I did not have any doubts about you and some other folks.

    😉

    SQL 2k5 is made for programming-tool users having no math background, just like you.

    As I said - if you have trouble with learning the relational model and proper SQL - then SQL 2k5 is right for you. It makes data access more procedural, as you like and understand.

    But it does not make SQL 2k5 useful for those who can build a proper model and do proper querying.

    Out of curiosity: what have you found in OVER() that did not exist in a derived table?

    _____________
    Code for TallyGenerator

  • Matt Miller (10/19/2007)


    Corollary: No matter how good the box, remember - you're IN a box. There are things outside of the box.

    The trick is that outside the box there is another box.

    And for those who are in THAT box, you're already outside the box.

    :hehe:

    So, choose carefully.:P

    _____________
    Code for TallyGenerator

  • Matt Miller (10/19/2007)


    Data in my mind is like a project - it never survives contact with the real world. We as the DBA's and data modelers and data architects can come up with a "perfect model", but there are ALWAYS exceptions we somehow have to account for.

    It means they were bad data modellers. Nothing else.

    When I do my model I know where I put the limits for the project.

    So, when something from outside those limitations comes, I don't treat it as an exception; I understand that I just need to expand my model.

    Dealing with ideal circumstances, where you have full control over the data, the code, etc..., I'd agree. But some of those items (like CROSS APPLY) are built to deal with less than perfect scenarios, and yes - you pay a price for having them. Set-based processing covers a LOT of possible tasks, and when it does - use it since that's what SQL Server is optimized for.

    Cannot wait to hear of a single scenario where CROSS APPLY is the best way to go.

    Yes, I know one - bad T-SQL skills.

    But it should be resolved not by using CROSS APPLY but by learning things.

    Any other one you can think of?

    _____________
    Code for TallyGenerator

  • Jack Corbett (10/19/2007)


    Just a couple of comments.

    I really enjoyed the thread. A lot of good information and no one got personal as I have seen happen when people have different ideas in a forum. Of course any thread with posts from Jeff and Matt will be interesting:hehe:.

    I have to admit that I would have probably gone with Derek's code, not because I am lazy, but because I would not have thought of a better solution. Since I regularly visit SSC, I would have eventually seen this thread and learned a better way (actually 2). Not to say I would have immediately gone back to my code and changed it, but I would and will add the alternate solutions to my toolbox.

    I am with Jeff in that I always look for a set-based solution to a data problem avoiding cursors and loops as much as possible, even if I'm not smart enough to come up with the tally solution;). Sql Server and relational database systems are designed to work with sets so that's what I do.

    Mr. Corbett... thank you and other folks like you, Matt, Sergiy, Grant, Greg, and a whole host of others... you guys make this all worthwhile. Some of these experiments take a long time to set up, format, and test, and some of the posts take a long time to write... It's nice to get these types of compliments, but it's really great to see folks participate without getting all personal and all. Better than that, I'm tickled to see folks go for the set-based solution even when it might take a bit of extra time... it means the "lights" are coming on 🙂

    ... I learn something new here every day 🙂

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)
