RegEx for DBAs

  • Cadavre (5/9/2012)


    Thanks for the reply, although I think I need more convincing. I don't see it as being a "win" for RegEx due to it being more complicated to do in T-SQL. I'd always rather have a fast method than a slow method, complications can be well documented and everything goes through rigorous testing to ensure bugs are found.

    Thank you for doing that test (really). I simply don't have the time these days and it is good to see solid numbers. For this particular example I do agree that the T-SQL equivalent is not unreadable and in fact I figured it would be an even simpler form along the lines of:

    WHERE (postCode LIKE '[A-Z][A-Z0-9] [0-9][A-Z][A-Z]'

    OR postCode LIKE '[A-Z][A-Z0-9][0-9] [0-9][A-Z][A-Z]'

    OR postCode LIKE '[A-Z][A-Z][0-9] [0-9][A-Z][A-Z]'

    OR postCode LIKE '[A-Z][A-Z][0-9][0-9] [0-9][A-Z][A-Z]'

    OR postCode LIKE '[A-Z][0-9][A-Z] [0-9][A-Z][A-Z]'

    OR postCode LIKE '[A-Z][A-Z][0-9][A-Z] [0-9][A-Z][A-Z]')

    But again, to be fair to the intent of David's article, this was an intro to Regular Expressions for people and not a shining example of the full power of super-complex patterns (Corey Lawson, in the post following your post that I am replying to, alluded to some of the more complex operations). The PostalCode example is just that: something that is easily digestible for people to comprehend what the syntax is doing. For the most part, I did kinda expect that the pure T-SQL condition would perform better, even if not as pretty.

    And again, I don't think David is trying to convince anyone that they simply must use Regular Expressions. It is a matter of exposing people to a powerful and flexible tool that does come in handy sometimes. There are definitely some situations at work, typically ad-hoc debugging (i.e. functionality outweighs performance), where I would have GREATLY benefited from some RegEx functions, especially where the patterns in the data are not uniform like they are here. I am working on a related article (albeit very slowly due to being unable to teach the kids how to change their own diapers ;-)) that goes over various situations that I, and some others, have encountered where RegEx proved indispensable. Maybe when I get farther along with that I can have you and Jeff try to come up with pure T-SQL equivalents. Might be kinda fun :-).

    Take care,

    Solomon...

    SQL#https://SQLsharp.com/ ( SQLCLR library ofover 340 Functions and Procedures)
    Sql Quantum Lifthttps://SqlQuantumLift.com/ ( company )
    Sql Quantum Leaphttps://SqlQuantumLeap.com/ ( blog )
    Info sitesCollations     •     Module Signing     •     SQLCLR

  • corey lawson (5/9/2012)


    We should all try not to be too dogmatic about things. Even CURSORs and cross-joins have their uses, after all (and not everything tastes better with tabasco sauce on it).

    Who's being dogmatic here? I saw an article that was praising some RegEx SQLLRs and thought that might be a good thing but also saw nothing of comparative performance tests.

    As many others have stated, there are many things to consider when considering which tools to use and I know that very well because I've been burned by the best of them! Instead of getting lectured (yet again, I may add!) on that fact that there are indeed other things to consider other than performance, can't you Regex folks just meet my simple request to do some broad spectrum million row performance tests so people don't get burned?

    It's what any self respecting vendor, author, or endorser would do.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • sherifffruitfly (5/8/2012)


    As for the pissing match - meh. There's times to use stuff and times not to. Never been a fan of the religious fanatics on either the ALWAYSUSEIT and NEVERUSEIT sides. Both sides seem to me to substitute easy-to-remember rules in place of actually thinking about the needs of the current task.

    -sff

    You're totally out of line. There's no pissing match when something gets its butt kicked 45 to 1. You don't need to use inflamatory terms like "religious fanatics" when good people have found an extreme performance problem. And if you're so proud of Expresso, you might want to provide a couple of your own performance tests.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Jeff Moden (5/9/2012)


    sherifffruitfly (5/8/2012)


    As for the pissing match - meh. There's times to use stuff and times not to. Never been a fan of the religious fanatics on either the ALWAYSUSEIT and NEVERUSEIT sides. Both sides seem to me to substitute easy-to-remember rules in place of actually thinking about the needs of the current task.

    -sff

    You're totally out of line. There's no pissing match when something gets its butt kicked 45 to 1. You don't need to use inflamatory terms like "religious fanatics" when good people have found an extreme performance problem. And if you're so proud of Expresso, you might want to provide a couple of your own performance tests.

    Thanks, Jeff. I was unsure what relevance the post had to a suggestion for a simple RegEx testing tool...

    The next bit isn't in any way argumentative, just illustrative...

    I'd add that the RegEx I use is on low row-count but high string-count-per-row data - data that's completely isolated to our custom app. Nasty stuff like pulling any and all of four possible custom operators and their parameters from a string pulled from XML built by the UI. Yuk but it works very well.

    e.g.

    using '\[\d+[\:\-*\d+]+\]' to shred:

    WHERE site_id = 81 AND ([3154:2]-[3154:1]=0 AND [3154:3]-[3154:2]=0 AND [3154:4]-[3154:3]=0 AND [3154:5]-[3154:4]=0 AND [3154:6]-[3154:5]=0 AND [3154:7]-[3154:6]=0 AND [3154:8]-[3154:7]=0 AND [3154:9]-[3154:8]=0)

    or 'DailySum\(\[\d+[\:\-*\d+]+\]\)' and '\[\d+[\:\-*\d+]+\]' to shred:

    WHERE site_id = 29 AND (DailySum([1116]) > 0 or DailySum([1048]) > 0 and [1117] > 0 or DailySum([1027]) > 0 or DailySum([1026]) > 0 or DailySum([1126]) > 0 or DailySum([1141]) > 0 or DailySum([1088]) > 0 or DailySum([1090]) > 0 and DailySum([1092]) > 0 or DailySum([1128]) > 0)

    Can't quite bring myself to make it go faster with custom T-SQL for the sake of a few updates per day.

    Another great "it depends" article and series of comments! (EXCEPT FOR THE MORONS POLLUTING THE SITE WITH THEIR CRASS ATTEMPTS AT ATTRACTING BUSINESS - THEIR ONLY SUCCESS: ATTRACTING VILIFICATION!)

  • Solomon Rutzky (5/9/2012)


    And again, I don't think David is trying to convince anyone that they simply must use Regular Expressions. It is a matter of exposing people to a powerful and flexible tool that does come in handy sometimes.

    I couldn't agree more! I just wanted to know what the performance curves on these "powerful and flexible" tools actually are because I've been burned by people that brought such tools into the shop.

    Although performance certainly isn't the only consideration in the proper evaluation of any tool, it becomes a major negative consideration when the sophisticated tool gets beat 45 to 1 by the simple and much less elegant native tool. Would YOU knowingly use a tool for a simple task if you knew it was going to be 45 times slower than another method? I would hope not.

    The other part of the problem is that if that tool is so slow on such a simple task, people really want to know what it's going to do on a more complex task where that "Powerful and Flexible tool" is supposed to really show it's stuff. Because there isn't much performance testing available on these tools (never mind any comparative testing), this thread looks like we've been introduced to a tool that's guaranteed to be 45 times slower according to the bit of testing done on this thread. That's a LOT slower in anyone's book and all other beneficial claims will not withstand that major shortcoming in most people's eyes.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • SAinCA (5/9/2012)


    Jeff Moden (5/9/2012)


    sherifffruitfly (5/8/2012)


    As for the pissing match - meh. There's times to use stuff and times not to. Never been a fan of the religious fanatics on either the ALWAYSUSEIT and NEVERUSEIT sides. Both sides seem to me to substitute easy-to-remember rules in place of actually thinking about the needs of the current task.

    -sff

    You're totally out of line. There's no pissing match when something gets its butt kicked 45 to 1. You don't need to use inflamatory terms like "religious fanatics" when good people have found an extreme performance problem. And if you're so proud of Expresso, you might want to provide a couple of your own performance tests.

    Thanks, Jeff. I was unsure what relevance the post had to a suggestion for a simple RegEx testing tool...

    The next bit isn't in any way argumentative, just illustrative...

    I'd add that the RegEx I use is on low row-count but high string-count-per-row data - data that's completely isolated to our custom app. Nasty stuff like pulling any and all of four possible custom operators and their parameters from a string pulled from XML built by the UI. Yuk but it works very well.

    e.g.

    using '\[\d+[\:\-*\d+]+\]' to shred:

    WHERE site_id = 81 AND ([3154:2]-[3154:1]=0 AND [3154:3]-[3154:2]=0 AND [3154:4]-[3154:3]=0 AND [3154:5]-[3154:4]=0 AND [3154:6]-[3154:5]=0 AND [3154:7]-[3154:6]=0 AND [3154:8]-[3154:7]=0 AND [3154:9]-[3154:8]=0)

    or 'DailySum\(\[\d+[\:\-*\d+]+\]\)' and '\[\d+[\:\-*\d+]+\]' to shred:

    WHERE site_id = 29 AND (DailySum([1116]) > 0 or DailySum([1048]) > 0 and [1117] > 0 or DailySum([1027]) > 0 or DailySum([1026]) > 0 or DailySum([1126]) > 0 or DailySum([1141]) > 0 or DailySum([1088]) > 0 or DailySum([1090]) > 0 and DailySum([1092]) > 0 or DailySum([1128]) > 0)

    Can't quite bring myself to make it go faster with custom T-SQL for the sake of a few updates per day.

    Another great "it depends" article and series of comments! (EXCEPT FOR THE MORONS POLLUTING THE SITE WITH THEIR CRASS ATTEMPTS AT ATTRACTING BUSINESS - THEIR ONLY SUCCESS: ATTRACTING VILIFICATION!)

    Gosh. Admittedly, I've spent no time learning Regex (just haven't needed it yet, I guess) and can only guess at what such parsing formulas do. That's why I was so happy that Craig posted some simple tests that I actually understand. I am definitely with you on not working on something for the sake of a few updates per day. There are much bigger fish to fry. But, as Solomon suggested, if I knew what these did, it might be fun to try to figure out a T-SQL way to do it.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Guys, the reason I wrote the article was not to say RegEx is king or you must use it. I've got away with not using it for a very long time.

    I wrote the article with two intents

    1. To show a basic primer to RegEx

    2. To open up the possibility of it use as a weapon in the DBA arsenal.

    The examples I have given are VERY simple. This is deliberate to illustrate the basics of RegEx syntax.

    The performance tests posted by Cadavre

    are interesting. If I am reading them correctly they are saying the RegEx CPU usage is more intensive but that reads and scans are significantly lower.

    RegEx CPU = 52625 Reads = 2964

    T-SQL CPU = 22672 Reads = 11856

    After indexing the T-SQL reads drop to 3252

    If you are CPU bound then be very careful using RegEx.

    If you are IO bound then perhaps its got legs.

    If you have a tight execution window then a longer running process isn't good news.

    It is all about understanding the constraints that are applicable to your environment.

    When I use RegEx it tends to be for RegEx.Replace().

    I use normal T-SQL in the WHERE clause to get to close to the recordset that should be affected.

    As a DBA the natural inclination is to tune everything to its limits. Great if you have the time and resource but sometimes you have to take the triage approach.

    The puzzles on the BeyondRealtional site illustrate what is possible from T-SQL but a lot of it falls into the realms of expert T-SQL. Judging by the T-SQL I see day in day out most developers and DBAs would struggle to produce some of that code. Elegant and efficient it may be but that is because it is produced by master craftsman. I've done a lot of interviewing and there ain't many of those about.

  • Solomon Rutzky (5/9/2012)


    Thank you for doing that test (really). I simply don't have the time these days and it is good to see solid numbers. For this particular example I do agree that the T-SQL equivalent is not unreadable and in fact I figured it would be an even simpler form along the lines of:

    WHERE (postCode LIKE '[A-Z][A-Z0-9] [0-9][A-Z][A-Z]'

    OR postCode LIKE '[A-Z][A-Z0-9][0-9] [0-9][A-Z][A-Z]'

    OR postCode LIKE '[A-Z][A-Z][0-9] [0-9][A-Z][A-Z]'

    OR postCode LIKE '[A-Z][A-Z][0-9][0-9] [0-9][A-Z][A-Z]'

    OR postCode LIKE '[A-Z][0-9][A-Z] [0-9][A-Z][A-Z]'

    OR postCode LIKE '[A-Z][A-Z][0-9][A-Z] [0-9][A-Z][A-Z]')

    But again, to be fair to the intent of David's article, this was an intro to Regular Expressions for people and not a shining example of the full power of super-complex patterns (Corey Lawson, in the post following your post that I am replying to, alluded to some of the more complex operations). The PostalCode example is just that: something that is easily digestible for people to comprehend what the syntax is doing. For the most part, I did kinda expect that the pure T-SQL condition would perform better, even if not as pretty.

    And again, I don't think David is trying to convince anyone that they simply must use Regular Expressions. It is a matter of exposing people to a powerful and flexible tool that does come in handy sometimes. There are definitely some situations at work, typically ad-hoc debugging (i.e. functionality outweighs performance), where I would have GREATLY benefited from some RegEx functions, especially where the patterns in the data are not uniform like they are here. I am working on a related article (albeit very slowly due to being unable to teach the kids how to change their own diapers ;-)) that goes over various situations that I, and some others, have encountered where RegEx proved indispensable. Maybe when I get farther along with that I can have you and Jeff try to come up with pure T-SQL equivalents. Might be kinda fun :-).

    Take care,

    Solomon...

    What I like about this site, what keeps me coming back, is that I can have a discussion with complete strangers that often have far more experience than me (and greater knowledge!), where I disagree with what they've said and instead of getting angry they will discuss the topic. Kudos.

    Anyway, back to the topic at hand. You're right of course, the WHERE OR combination is simpler (I overlooked it to be honest).

    Here's some performance results: -

    SET STATISTICS IO ON;

    SET STATISTICS TIME ON;

    SELECT COUNT(*)

    FROM dbo.testEnvironment

    WHERE (postCode LIKE '[A-Z][A-Z0-9] [0-9][A-Z][A-Z]'

    OR postCode LIKE '[A-Z][A-Z0-9][0-9] [0-9][A-Z][A-Z]'

    OR postCode LIKE '[A-Z][A-Z][0-9] [0-9][A-Z][A-Z]'

    OR postCode LIKE '[A-Z][A-Z][0-9][0-9] [0-9][A-Z][A-Z]'

    OR postCode LIKE '[A-Z][0-9][A-Z] [0-9][A-Z][A-Z]'

    OR postCode LIKE '[A-Z][A-Z][0-9][A-Z] [0-9][A-Z][A-Z]')

    SET STATISTICS TIME OFF;

    SET STATISTICS IO OFF;

    No index

    -----------

    530604

    Table 'testEnvironment'. Scan count 1, logical reads 3212, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

    SQL Server Execution Times:

    CPU time = 1422 ms, elapsed time = 1413 ms.

    With index

    -----------

    530604

    Table 'testEnvironment'. Scan count 1, logical reads 2606, physical reads 0, read-ahead reads 16, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

    SQL Server Execution Times:

    CPU time = 1453 ms, elapsed time = 1447 ms.

    As for the more complex patterns, well if you have anything to hand then I'd love to see some performance results. The fact is that I'd love to use regular expressions instead of building up more complicated T-SQL, it'd make my job much easier. But I'm paid for doing the best I can in the time frame available, which means I can't justify using something that I have never seen outperform T-SQL.


    --edit--


    David.Poole (5/10/2012)


    Guys, the reason I wrote the article was not to say RegEx is king or you must use it. I've got away with not using it for a very long time.

    I wrote the article with two intents

    1. To show a basic primer to RegEx

    2. To open up the possibility of it use as a weapon in the DBA arsenal.

    The examples I have given are VERY simple. This is deliberate to illustrate the basics of RegEx syntax.

    Sorry, you must have posted while I was typing.

    I'm not arguing against using RegEx per se, I just wanted to see some performance results to give me a reason to use it. Don't get me wrong here, I have a RegEx CLR that I wrote on my server. I do tests to see if it's worth using, it's part of my process to explore a few options to every problem. I've just not come across a place where it beats the T-SQL alternative (except for string splitting, and it was so close that the T-SQL alternative was implemented so that we wouldn't need to convince our clients to enable CLR on their servers).

    David.Poole (5/10/2012)


    If I am reading them correctly they are saying the RegEx CPU usage is more intensive but that reads and scans are significantly lower.

    Yes, I read the results the same. So in an environment where CPU is king and memory is limited, it's probably better to use a CLR. Perhaps I'll set up a VM to simulate such an environment at a later date to test that theory.

    David.Poole (5/10/2012)


    When I use RegEx it tends to be for RegEx.Replace().

    I use normal T-SQL in the WHERE clause to get to close to the recordset that should be affected.

    I'll give this sort of task a go at some point, but I suspect that the T-SQL will have a very good chance of keeping up with the RegEx.

    Shall we take an example of an address string where we want to extract the post-code? Or do you have something else in mind? I'll set up a test for this scenario later on and we'll take a look.

    David.Poole (5/10/2012)


    The puzzles on the BeyondRealtional site illustrate what is possible from T-SQL but a lot of it falls into the realms of expert T-SQL. Judging by the T-SQL I see day in day out most developers and DBAs would struggle to produce some of that code. Elegant and efficient it may be but that is because it is produced by master craftsman. I've done a lot of interviewing and there ain't many of those about.

    I've been a developer for about 3 years now. I'm competent, but certainly not a master (or even a journeyman). I've had days where "right now" was more important that elegance and efficiency (not in my current job, where performance is king), but even in those cases it'd just mean that I'd go back to it later when time was more on my side.


    Forever trying to learn
    My blog - http://www.cadavre.co.uk/
    For better, quicker answers on T-SQL questions, click on the following...http://www.sqlservercentral.com/articles/Best+Practices/61537/
    For better, quicker answers on SQL Server performance related questions, click on the following...http://www.sqlservercentral.com/articles/SQLServerCentral/66909/

  • Jeff Moden (5/12/2012)


    SPAM reported.

    Time for a spot of RegEx pattern matching:hehe:

  • David.Poole (5/13/2012)


    Jeff Moden (5/12/2012)


    SPAM reported.

    Time for a spot of RegEx pattern matching:hehe:

    Only if it doesn't take too long. 😛

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Cadavre (5/10/2012)


    What I like about this site, what keeps me coming back, is that I can have a discussion with complete strangers that often have far more experience than me (and greater knowledge!), where I disagree with what they've said and instead of getting angry they will discuss the topic. Kudos.

    Thank you. I see this as a good discussion and there is rarely, if ever, a reason to get angry about this stuff. It is either a difference of opinion or a miscommunication. And showing frustration and/or anger certainly isn't going to help mitigate those. I see the current issue as being one of communication so I will again try to clarify 😉 (and I also plan on using the explanation/examples below as part of my forth-coming article).

    Cadavre (5/10/2012)


    Anyway, back to the topic at hand. You're right of course, the WHERE OR combination is simpler (I overlooked it to be honest).

    I think the OR structure was slightly faster as it allows for short-circuiting and does not need to process all 6 patterns if any prior ones in the list produce a TRUE.

    Cadavre (5/10/2012)


    As for the more complex patterns, well if you have anything to hand then I'd love to see some performance results. The fact is that I'd love to use regular expressions instead of building up more complicated T-SQL, it'd make my job much easier. But I'm paid for doing the best I can in the time frame available, which means I can't justify using something that I have never seen outperform T-SQL.

    I'm not arguing against using RegEx per se, I just wanted to see some performance results to give me a reason to use it. Don't get me wrong here, I have a RegEx CLR that I wrote on my server. I do tests to see if it's worth using, it's part of my process to explore a few options to every problem. I've just not come across a place where it beats the T-SQL alternative (except for string splitting, and it was so close that the T-SQL alternative was implemented so that we wouldn't need to convince our clients to enable CLR on their servers).

    Ok. So the heart of the matter and confusion here (and this goes for Jeff and others here as well) is not in how Regular Expressions work, but what they can REALLY do. As has been said a few times already, the rather simple patterns discussed so far are more for educational purposes than real-world usage. Please (please!) do not get caught up on performance differences between these examples and what can be done in not-so-complicated T-SQL. Nobody is recommending that even moderately complex T-SQL be replaced by Regular Expressions. Regular Expressions are used to go beyond what T-SQL can do and looking at uniform patterns masks much of the power of RegEx.

    Regular Expressions really shine when looking for non-uniform patterns. SSNs, Phone Numbers, Postal Codes, etc. all have a somewhat fixed pattern. But when the width, or even existence of segments of patterns is variable, the number of LIKE or PATINDEX can become impossible to do. To put it another way, if you have a pattern that can be tested side-by-side with a T-SQL equivalent then maybe you should just use the T-SQL. Hence, what RegEx should be used for is situations that are not going to be able to be performance tested because there will be no T-SQL equivalent to compare against.

    In T-SQL we have the ability to do a single-character range using square brackets: [ and ]. But specifying that range does not allow for situations where that positions doesn't exist OR that single-character spec repeats to a variable degree. It also does not handle very well (or at all) specifying control characters, tab, newlines, unicode, etc. in that range whereas Regular Expressions allow for everything. But again, the variable nature of the pattern is really the power here. To simply say you have a pattern of "1 or more alpha characters (case-sensitive) followed by 1 or more white-spaces characters followed by 5 to 5000 digits followed by a single, optional ^ or % followed by 1 or more word characters (non-case-sensitive)" is represented by the following Regular Expression:

    [a-z]+\s+\d{5,5000}[^%]?\w+

    Now please represent that using straight T-SQL. Keep in mind that this is a subset of whatever string is being tested. Meaning it can either be the entire string or a fragment anywhere within a larger string. And there might be multiple instance of that pattern within the string.

    But we also are not just testing for the existence of the pattern (i.e. IsMatch). We typically want to extract that pattern from the larger string, similar to using SUBSTRING (i.e. Match, Matches, and CaptureGroup).

    I have some code below that shows a working example to hopefully give a more visual idea of what Regular Expressions really do for us. The example is very contrived so it might seem a bit silly but it does show a few different things that can be done not only with patterns, but also how Match and CaptureGroup work to go beyond mere existence testing.

    The pattern is (emoticons are ": (" without the space between them):

    ([a-y]+)\d+(?:(?:(test|demo)-\d+)|(real|[abc]+\s{3,}[xyz]+)):\1:SKU{(?:\2|\3)}\1\b

    and it means:

    • One or more alpha a-y characters (saved as Group1)
    • followed by one or more digits
    • followed by EITHER the word "test" xor "demo" followed by a dash followed by one or more digits (saved as Group2)

      XOR the word "real" xor one or more characters a,b, or c followed by three or more white-space characters

      followed by one or more characters x, y, or z (saved as Group3)

    • followed by the string ":"
    • followed by whatever was captured as Group1
    • followed by the string ":SKU{"
    • followed by whatever was captured as Group2 or Group3
    • followed by the string "}"
    • followed by whatever was captured as Group1
    • followed by a word-boundary (i.e. white-space, end of line, end of string, most punctuation)

    If you have SQL# installed you should be able to just copy and paste the code below (emoticons are ": (" without the space between them) to get a sense of what we are really talking about with regards to Regular Expressions.

    DECLARE @RegularExpression NVARCHAR(4000)

    DECLARE @TestCases TABLE

    (

    TestID INT NOT NULL IDENTITY(1, 1) PRIMARY KEY,

    Test NVARCHAR(4000) NOT NULL,

    IsMatchResult BIT

    )

    SET @RegularExpression =

    N'([a-y]+)\d+(?:(?:(test|demo)-\d+)|(real|[abc]+\s{3,}[xyz]+)):\1:SKU{(?:\2|\3)}\1\b'

    INSERT INTO @TestCases (Test, IsMatchResult)

    SELECT'zzz bob12312test-65:bob:SKU{test}bob' AS [Test], 1 AS [IsMatchResult]

    UNION ALL

    SELECT'zzz bob12312real:bob:SKU{real}bob', 1

    UNION ALL

    SELECT'zzz bob12312real:bob:SKU{real}bob-', 1

    UNION ALL

    SELECT'zzz bob12312real:bob:SKU{real}boby', 0 -- last character negates the word-boundary

    UNION ALL

    SELECT'intro struffs..bob12312real:bob:SKU{real}bob.extra stuffs at the end', 1 -- intro and extra stuffs excluded from match

    UNION ALL

    SELECT'zzz bob12312aba z:bob:SKU{aba z}bob', 1

    UNION ALL

    SELECT'zzz bob12312aba

    zyzyzyzyzyzyzyzyzyzyzxxzyzyzyzyzyzyzyzyzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzxx:bob:SKU{aba

    zyzyzyzyzyzyzyzyzyzyzxxzyzyzyzyzyzyzyzyzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzxx}bob', 1

    UNION ALL

    SELECT'zzz bob12312aba z:bob:SKU{aba z}bob', 0 -- only 2 spaces between "aba" and "z" for group 3

    UNION ALL

    SELECT'zzzbob12312aba z:bob:SKU{aba z}bob', 1 -- still works as "z" is not part of group 1 pattern

    UNION ALL

    SELECT'yyybob12312aba z:bob:SKU{aba z}bob', 1 -- "y" could be part of group 1 but ignored as "yyybob" doesn't repeat

    UNION ALL

    SELECT'yyyyybob12312aba z:yyybob:SKU{aba z}yyybob', 1 -- first 3 "y" now included in group 1 as that pattern does repeat

    SELECTtc.*,

    '---' AS [---],

    match.*,

    '---' AS [---],

    SQL#.RegEx_IsMatch(tc.Test, @RegularExpression, 1, '') AS [IsMatch],

    SQL#.RegEx_CaptureGroup(tc.Test, @RegularExpression, 1, '', 1, -1, '') AS [CaptureGroup1],

    SQL#.RegEx_CaptureGroup(tc.Test, @RegularExpression, 3, '', 1, -1, '') AS [CaptureGroup3]

    FROM @TestCases tc

    CROSS APPLY SQL#.RegEx_Match(tc.Test, @RegularExpression, 1, '') match

    Assuming you have run the code above and can see what IsMatch, Match, and CaptureGroup are doing as well as what does and does not match that pattern, please represent that in straight T-SQL so we can do a performance comparison. And be sure that your code allows for the pattern to exist multiple times within the larger string and can extract each instance. This is done via the Matches function which I have not shown here but will in my future article.

    Hopefully it is a bit clearer now as to why Regular Expressions are such a powerful tool. It is doubtful that you will need to use them frequently, but just knowing about them and what they can do will hopefully help when encountering such situations :-).

    Take care,

    Solomon...

    SQL#https://SQLsharp.com/ ( SQLCLR library ofover 340 Functions and Procedures)
    Sql Quantum Lifthttps://SqlQuantumLift.com/ ( company )
    Sql Quantum Leaphttps://SqlQuantumLeap.com/ ( blog )
    Info sitesCollations     •     Module Signing     •     SQLCLR

  • I very much appreciate the time you've spent on that great explanation, Solomon. And I absolutely believe that you've hit the nail on the head when you said...

    ...if you have a pattern that can be tested side-by-side with a T-SQL equivalent then maybe you should just use the T-SQL. Hence, what RegEx should be used for is situations that are not going to be able to be performance tested because there will be no T-SQL equivalent to compare against.

    That's the whole crux of the matter and the very much needed point that people tend to leave out when they write articles about SQL Server and Regex. Without mentioning that point, there's going to be a lot of inexperienced folks that use Regex for things that they really should be using T-SQL for. I might go so far as to say that if youre trying to find items like the example you gave, then the data probably shouldn't be allowed anywhere near SQL Server until it's been cleaned up and sorted out a bit. It's kind of like people trying to use Full Text Search to do the things that you really need a more dedicated product for.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Jeff Moden (5/13/2012)


    I might go so far as to say that if youre trying to find items like the example you gave, then the data probably shouldn't be allowed anywhere near SQL Server.

    ...and then there is "Big Data".

  • David.Poole (5/14/2012)


    Jeff Moden (5/13/2012)


    I might go so far as to say that if youre trying to find items like the example you gave, then the data probably shouldn't be allowed anywhere near SQL Server.

    ...and then there is "Big Data".

    Not sure what you mean by that, David but "Big Data" is nothing more than little data on steroids.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Thanks for the explanation Solomon, I haven't forgotten but I'm quite busy at the moment. I intend to attempt to convert your more complicated RegEx into T-SQL, but I suspect that the amount of manipulation required may be too much for a particularly performant solution.


    Forever trying to learn
    My blog - http://www.cadavre.co.uk/
    For better, quicker answers on T-SQL questions, click on the following...http://www.sqlservercentral.com/articles/Best+Practices/61537/
    For better, quicker answers on SQL Server performance related questions, click on the following...http://www.sqlservercentral.com/articles/SQLServerCentral/66909/

Viewing 15 posts - 31 through 45 (of 57 total)

You must be logged in to reply to this topic. Login to reply