Regular expression in T-sql

  • Eugene Elutin (10/1/2012)


    A couple of us just went through all of this on another thread a couple of months ago and the LIKE expressions blew the doors off the RegEx CLR.

    Could anyone post a link, please.

    Took me a bit to remember that the discussion was on an article by David Poole because it was Solomon Rutzky's code that we were discussing. Here's the link...

    http://www.sqlservercentral.com/Forums/Topic1296195-60-1.aspx

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Jeff Moden (10/1/2012)


    Eugene Elutin (10/1/2012)


    A couple of us just went through all of this on another thread a couple of months ago and the LIKE expressions blew the doors off the RegEx CLR.

    Could anyone post a link, please.

    Took me a bit to remember that the discussion was on an article by David Poole because it was Solomon Rutzky's code that we were discussing. Here's the link...

    http://www.sqlservercentral.com/Forums/Topic1296195-60-1.aspx

    Thank you Jeff! I've read through thread and original article, and found what I've expected.

    This sort of using RegEx (or any other .NET lib) in CLR I call "CLR wrapper". And you are right: this implementation most likely will loose to T-SQL one in terms of performance. And there is good reson for this: It is because that kind of implementation is too generic and not optimised for performance.

    The main reason for it to be quite slow and to require higher memory consumption is that every time such CLR function is called, underlying object needs to be instantiated in .NET framework and regular expression pattern re-compiled!

    I don't usually do this (and not only with RegEx's) for the above particular reason.

    Now, if you would like to design process/function optimised for performance you may need to trade-off its "generic" nature.

    If you design your CLR function to work for the specific pattern (and it was couple discussions recently about how to remove every occurrence of GUID string in any text or strip some "patterns" from beginning and end of strings), you can create RegEx based CLR, which will be quite hard (if possible) to outperform with T-SQL.

    That is all to do with the way you code CLR with RegEx, there are two simple tips:

    1. Create static RegEx object with compiled pattern!

    2. Don't error handle in CLR

    There are another aspects to mention here:

    It will also depend on the string sizes: CLR will win over T-SQL even with bigger margin

    In case of replacement required, than more occurrences of string to replace, than again such CLR will win over T-SQL with bigger margin.

    But, as I've stated, you will need to have dedicated CLR for a pattern. Yes, it's not too flexible, but it gives you much, much better performance...

    _____________________________________________
    "The only true wisdom is in knowing you know nothing"
    "O skol'ko nam otkrytiy chudnyh prevnosit microsofta duh!":-D
    (So many miracle inventions provided by MS to us...)

    How to post your question to get the best and quick help[/url]

  • The Dictionary.com definition of kludge:

    a software or hardware configuration that, while inelegant, inefficient, clumsy, or patched together, succeeds in solving a specific problem or performing a particular task

    I would not personally say that CLR is inherently inelegant, inefficient, clumsy or patched together. That you can write poor code with CLR shouldn't be evidence of it being a kludge. People can write bad code in any language.

    Certainly the judgement is in the eyes of the beholder and I can understand why someone with a lack of familiarity with .net languages would find CLR to meet the definition of a kludge. I just disagree that is inherently such.

    └> bt



    Forum Etiquette: How to post data/code on a forum to get the best help[/url]

  • bteraberry (10/3/2012)


    The Dictionary.com definition of kludge:

    a software or hardware configuration that, while inelegant, inefficient, clumsy, or patched together, succeeds in solving a specific problem or performing a particular task

    I would not personally say that CLR is inherently inelegant, inefficient, clumsy or patched together. That you can write poor code with CLR shouldn't be evidence of it being a kludge. People can write bad code in any language.

    Certainly the judgement is in the eyes of the beholder and I can understand why someone with a lack of familiarity with .net languages would find CLR to meet the definition of a kludge. I just disagree that is inherently such.

    There is a great Itzek Ben-Gan article from 2006 or 07 titled (something like), "To CLR or Not to CLR: that is the question" In it he talks about how DBAs were apprehensive about introducing CLRs because they can be like giving knives and matches to kids."

    I have always been apprehensive of enabling CLRs for exactly that reason. In the same article he does recommend, for example, using a CLR for regexs (he includes a couple in the article). He tests the regex CLR vs Padindex and the CLR smokes the T-Sql implementation like 10 fold.

    CLRs are not risk free nor are they always a better choice. But, in some cases, to ignore them is to ignore a powerful Sql server tool.

    "I cant stress enough the importance of switching from a sequential files mindset to set-based thinking. After you make the switch, you can spend your time tuning and optimizing your queries instead of maintaining lengthy, poor-performing code."

    -- Itzik Ben-Gan 2001

  • p.s. thanks jeff for the link... I am still reading that (very long) thread. Great stuff.

    "I cant stress enough the importance of switching from a sequential files mindset to set-based thinking. After you make the switch, you can spend your time tuning and optimizing your queries instead of maintaining lengthy, poor-performing code."

    -- Itzik Ben-Gan 2001

  • bteraberry (10/3/2012)


    People can write bad code in any language.

    They certainly can. Why, I know this one guy that still uses a 1950's push stack and punch card technology to convert Adjacency Lists to Nested Sets! 😀

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • XMLSQLNinja (10/3/2012)


    In the same article he does recommend, for example, using a CLR for regexs (he includes a couple in the article). He tests the regex CLR vs Padindex and the CLR smokes the T-Sql implementation like 10 fold.

    I guess it's my turn and you've just gotta know the question coming up...

    ... got a link for that? 😉

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Eugene Elutin (10/2/2012)


    Jeff Moden (10/1/2012)


    Eugene Elutin (10/1/2012)


    A couple of us just went through all of this on another thread a couple of months ago and the LIKE expressions blew the doors off the RegEx CLR.

    Could anyone post a link, please.

    Took me a bit to remember that the discussion was on an article by David Poole because it was Solomon Rutzky's code that we were discussing. Here's the link...

    http://www.sqlservercentral.com/Forums/Topic1296195-60-1.aspx

    Thank you Jeff! I've read through thread and original article, and found what I've expected.

    This sort of using RegEx (or any other .NET lib) in CLR I call "CLR wrapper". And you are right: this implementation most likely will loose to T-SQL one in terms of performance. And there is good reson for this: It is because that kind of implementation is too generic and not optimised for performance.

    The main reason for it to be quite slow and to require higher memory consumption is that every time such CLR function is called, underlying object needs to be instantiated in .NET framework and regular expression pattern re-compiled!

    I don't usually do this (and not only with RegEx's) for the above particular reason.

    Now, if you would like to design process/function optimised for performance you may need to trade-off its "generic" nature.

    If you design your CLR function to work for the specific pattern (and it was couple discussions recently about how to remove every occurrence of GUID string in any text or strip some "patterns" from beginning and end of strings), you can create RegEx based CLR, which will be quite hard (if possible) to outperform with T-SQL.

    That is all to do with the way you code CLR with RegEx, there are two simple tips:

    1. Create static RegEx object with compiled pattern!

    2. Don't error handle in CLR

    There are another aspects to mention here:

    It will also depend on the string sizes: CLR will win over T-SQL even with bigger margin

    In case of replacement required, than more occurrences of string to replace, than again such CLR will win over T-SQL with bigger margin.

    But, as I've stated, you will need to have dedicated CLR for a pattern. Yes, it's not too flexible, but it gives you much, much better performance...

    Great post, Eugene. Even though I'm not a .Net programmer in any way, shape, or form, I can summarize all of that...

    "It Depends". 😀

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Jeff Moden (10/3/2012)


    XMLSQLNinja (10/3/2012)


    In the same article he does recommend, for example, using a CLR for regexs (he includes a couple in the article). He tests the regex CLR vs Padindex and the CLR smokes the T-Sql implementation like 10 fold.

    I guess it's my turn and you've just gotta know the question coming up...

    ... got a link for that? 😉

    here you http://www.sqlmag.com/article/tsql3/clr-or-not-clr-is-that-the-question-

    I am interested on your take.

    "I cant stress enough the importance of switching from a sequential files mindset to set-based thinking. After you make the switch, you can spend your time tuning and optimizing your queries instead of maintaining lengthy, poor-performing code."

    -- Itzik Ben-Gan 2001

  • Jeff Moden (10/3/2012)


    bteraberry (10/3/2012)


    People can write bad code in any language.

    They certainly can. Why, I know this one guy that still uses a 1950's push stack and punch card technology to convert Adjacency Lists to Nested Sets! 😀

    Ouh, yes! He is a famous writer too! :hehe:

    _____________________________________________
    "The only true wisdom is in knowing you know nothing"
    "O skol'ko nam otkrytiy chudnyh prevnosit microsofta duh!":-D
    (So many miracle inventions provided by MS to us...)

    How to post your question to get the best and quick help[/url]

Viewing 10 posts - 16 through 24 (of 24 total)

You must be logged in to reply to this topic. Login to reply