I was introduced to Regular Expressions or RegEx when I was wearing my non-DBA hat in a content management consultancy. I had two specific projects that involved parsing and extracting content from HTML, XML and MS Word documents then converting them into content fragments within a content management system.
Of course I could have achieved the same ends using the ECMA script equivalent of LIKE, PATINDEX, CHARINDEX, SUBSTRING and REPLACE but RegEx offered a number of advantages:
- A rich and varied means of pattern matching strings albeit with a steep learning curve
- Wild card replace functionality
- The ability to split strings and handle the individual elements separately
- A simpler, less error prone means of parsing those strings.
For much of my time as a DBA the traditional string handling functions have been more than adequate for what I needed to do so my RegEx skills have been rusting quietly for 6 years.
With the advent of a big data warehousing project RegEx suddenly leapt back to the fore as a useful skill to have. If I can liken the SQL functions mentioned earlier to a Ford Focus diesel estate then RegEx is an all-options checked Audi A7 3.0TDI. It is possible to drive the Audi and get 50+ mpg but if you make full use of its capabilities you will see it dip below 30mpg. The same applies to RegEx. You can produce RegEx that is fairly efficient, but it is also possible to write expressions that are extremely expensive to run.
Nevertheless I've often wished SQL Server had RegEx functions so when SQL2005 introduced CLR functions that was on top of my list of additions to my SQL toolbelt.
SQL# and AdventureWorks
Solomon Rutzky put together an incredibly useful toolkit of CLR functions known as SQL# and several RegEx functions are included in that toolkit.
The toolkit has a community edition that is free and a commercial offering that has additional functionality. It is distributed as a SQL file that you simply run in the database in which you wish to implement the toolkit.
There is also a well written manual for the SQL# toolkit with examples that act as a good primer to help you get started with RegEx.
In this article I am going to provide some basic RegEx examples which, for simplicity's sake, will use the SQL# toolkit and AdventureWorks to demonstrate some of the uses and the power of RegEx.I am going to focus on RegEx.Replace as wild card replacement functionality is one of those wish list items that really demonstrate the power of regular expressions.
Data profiling with RegEx.Replace
If you are going to build a data warehouse it is essential to understand both the structure and the content of the data that is going to go into that warehouse. There are a number of techniques that can be used to inspect what the quality of your data actually is and amongst those is data profiling.
Arkady Maydanchick gives a good general description of the issues and techniques in his book "Data Quality Assessment".
For each character field we would list the following: -
- COUNT(<string field>) - Count the non-null occurrences
- COUNT(DISTINCT <string field> - Count the unique occurrences
- MIN(LEN(<string field>)) - Shortest length
- MAX(LEN(<string field>)) - Longest length
- MIN(<string field>) - First alphabetic occurrence
- MAX(<string field>) - Last alphabetic occurrence
What we could also do is count the separate profiles of the strings. For example, UK postcodes have the following profiles
- A9 9AA
- A99 9AA
- AA9 9AA
- AA99 9AA
- AA9A 9AA
- A9A 9AA
The latter two are London specific.
If we saw profiles other than the ones listed in a list of addresses that were supposed to be UK only then this would flag that we could have a data quality issue.
So if we wanted to replace all letters with capital A using the SQL#.RegEx_Replace, we would do something similar to the following:-
USE
AdventureWorksgo SELECT
PostalCode, PostalCodeProfile = SQL#.RegEx_Replace( PostalCode,-- The field we want to profile
'[A-Z]'
,-- The RegEx comparison for the range of letters we want to replace
'A'
,-- What we want to replace them with
-1,-- The number of occurrences that we will replace, in this case all of them.
1,-- The start position within the string
'IgnoreCase'
-- RegEx defaults to case sensitive so this overrides that behaviour
)FROM
Person.Address
There are a couple of things to notice here
- Although we want to change all letters to capital A we still specify the range [A-Z] and not [B-Z]. That is because if we had a postcode entered as Sa99 9BF the a would not change to an A.
- The IgnoreCase option turns the entire RegEx comparison into case insensitive.
There are occassions when we may want selective case insensitivity. We could have written our RegEx comparison as follows
SELECT
PostalCode, PostalCodeProfile = SQL#.RegEx_Replace( PostalCode,-- The field we want to profile
'(?i)[A-Z]'
,-- The RegEx comparison for the range of letters we want to replace
'A'
,-- What we want to replace them with
-1,-- The number of occurrences that we will replace, in this case all of them.
1,-- The start position within the string
''
-- Accept RegEx as case sensitive
)FROM
Person.Address
Of course we also want to replace any digits with a 9 so our full code to run a profile becomes
USE
AdventureWorks goWITH
PostCodeSet(PostalCode,PostalCodeProfile)AS
(SELECT
PostalCode, PostalCodeProfile = SQL#.RegEx_Replace( SQL#.RegEx_Replace(PostalCode,'[A-Z]'
,'A'
,-1,1,'IgnoreCase'
),'\d'
,'9'
,-1,1,''
)FROM
Person.Address )SELECT
PostalCodeProfile,Occurrences=COUNT(*)FROM
PostCodeSetGROUP BY
PostalCodeProfileORDER BY
2DESC
In the 2nd replace notice the \d comparison. This instructs RegEx to match against any digit. We could equally have written [0-9].
If you run the query above you will see some strange pattern occurrences which is precisely why we carry out such profiling activity in data quality assessments.
Stripping multiple spaces with RegEx_Replace
One of the most common problems with parsing strings is the occurrences of double spaces. Jeff Moden wrote a neat article on how to replace multiple spaces with one.
This is something that is also possible with RegEx_Replace.
DECLARE
@MyStringVARCHAR
(1000)SET
@MyString='This is a test of my strange spaces'
' +'
,-- The RegEx comparison for the range of letters we want to replace
' '
,-- What we want to replace them with
-1,-- The number of occurrences that we will replace, in this case all of them.
1,-- The start position within the string
''
-- Accept RegEx as case sensitive
)
Stripping double characters with RegEx_Replace
When we do phonetic matching it is often useful to get rid of double characters, or for that matter and number of multiple characters. Again RegEx can do this for us.
DECLARE
@MyStringVARCHAR
(1000)SET
@MyString='Cool Cattery Commission'
'(\w)\1+'
,-- Match against any word character [A-Za-z0-9_]
'$1'
,-- The variable to which the group match was assigned
-1,-- The number of occurrences that we will replace, in this case all of them.
1,-- The start position within the string
'IgnoreCase'
-- Make RegEx as case insensitive
)
The important thing here is that the condition wrapped up in parentheses (\w) identifies a "group" which will be assigned to the $1 variabl which can then be used in the replacement.
What we are really saying is:
- Match against any word character individually that immediately repeats one or more times.
- Capture the first character of this pattern to be referenced in the pattern itself (via \1) and in the replacement string (via $1)
- Replace the complete matched string, no matter how many consecutive characters with the single character held in the $1 variable.
Word boundary matches
One feature of RegEx I have found particulary useful is its ability to look at words within a string. One particular experiment I am currently leading is to find ways of mining text strings within unstructured data.
During an early iteration of the experiment I wanted to produce a SOUNDEX encoding of every word in a phrase.
One of the rules of SOUNDEX is that all vowels are dropped except when the vowel is the first letter of the word. RegEx can do this very simply as is illustrated by the example below
DECLARE
@StringVARCHAR
(50)='Annastasia''s anathema befuddled us all'
, @RegExVARCHAR
(50)='\B[AEIOUWHY]'
SELECT
correction=SQL#.RegEx_Replace(@String,@RegEx,'',-1,1,'IgnoreCase'),@String AS OriginalString
In this case \B tells RegEx that any vowel (including the Welsh ones) not at the beginning of a word should be replaced.
- \B means not at a word boundary
- \b means at a word boundary
Putting the \B or \b at the beginning of an expression tells RegEx that you are interested in the beginning of the word.
Complex patterns with RegEx
If we look at the possible patterns for a UK postal code we can see that there are some matching rules that can be applied
- A9 9AA
- A99 9AA
- AA9 9AA
- AA99 9AA
- AA9A 9AA
- A9A 9AA
Rule | RegEx | Example |
---|---|---|
One or two letter | [A-Z]{1,2} | A AA |
A digit | [0-9] | 9 |
Optionally either a letter or a digit | ([A-Z]|\d){0,1} | <nothing> A 9 |
A space followed by a digit and two letter | \d[A-Z]{2} | 9AA |
The following query demonstrates the RegEx pattern in operation
SELECT
*FROM
Person.AddressWHERE
SQL#.RegEx_IsMatch(PostalCode,'[A-Z]{1,2}\d([A-Z]|\d){0,1} \d[A-Z]{2}'
,1,'IgnoreCase'
)=1
This example introduces us to anew RegEx pattern matching rule.
Curly brackets { }, which allow us to specify how many of the preceding group of characters:-
- {2} means we must have two of the preceding "group" so [A-Z]{2} means we must have 2 letters.
- {1,2} means we must have a minimum of one and a maximum of 2 so [A-Z]{1,2} means we must have 1 or two letters.
- {0,1} means we can either have 0 or up to 1 so ineffect ([A-Z]|[0-9]){0,1} means that we don't to have either but we can have 1 of either a letter or a number.
At this point I have to point out that different implementations of RegEx across languages and operating systems have different dialects. In some cases {0,1} could be written as {,1} or just a single question mark. Also 3 or more could be written as {3,}.
The pipe symbol acts as an OR statement
Groups and RegEx
RegEx allows matches to be split into groups and assigned to variables.
For example, let us suppose that we have a system that writes UK Postal codes to the database as a concatenated string with no space between the first and second half of the postal code.
- CH5 1AA is written as CH51AA
- SK11 9AS is written as SK119AS
- ...etc
It is possible to use RegEx_Replace to insert a space at the appropriate place.
The following example shows how to do this using "groups" of letters as identified by () parentheses.
DECLARE
@PostalCodesTABLE
(PostalCodeVARCHAR
(8) NOT NULL)INSERT INTO
@PostalCodes(PostalCode)VALUES
('CH51AA'
),('SK119AS'
),('M11AA'
),('WC1N4DJ'
),('S119PF'
)/* SQL 2005 Equivalent INSERT INTO @PostalCodes(PostalCode) SELECT 'CH51AA' UNION ALL SELECT 'SK119AS' UNION ALL SELECT 'M11AA' UNION ALL SELECT 'WC1N4DJ' UNION ALL SELECT 'S119PF' */
DECLARE
@RegExVARCHAR
(2000)='([A-Z]{1,2}\d([A-Z]|\d){0,1})(\d[A-Z]{2})'
SELECT
CorrectedPostalCode=SQL#.RegEx_Replace(PostalCode,@RegEx,'$1 $3'
,-1,1,'IgnoreCase'
), PostalCodeFROM
@PostalcodesWHERE
SQL#.RegEx_IsMatch(PostalCode,@RegEx,1,'IgnoreCase'
)=1
I have emphasised the '$1 $3'
for a reason.
Variable | Description |
---|---|
$1 | The first group encapsulates the entire 1st half of the postal code |
$2 | The 2nd group is nested in the 1st half of the postal code to look after the cases with the optional 4th alphanumeric. |
$3 | The 3rd group is the last 3 characters of the postal code |
By now you are beginning to see the power of RegEx but also the fact that the expressions can become complex and hard to debug.
I could have achieved the same split of postal codes from a slightly different RegEx
DECLARE
@RegExVARCHAR
(2000)='([A-Z]{1,2}\d)([A-Z]|\d){0,1}(\d[A-Z]{2})'
SELECT
CorrectedPostalCode=SQL#.RegEx_Replace(PostalCode,@RegEx,'$1$2 $3'
,-1,1,'IgnoreCase'
),
Variable | Description |
---|---|
$1 | The first group handles the mandatory part of the first half of the postal code |
$2 | The 2nd group looks after the cases with the optional 4th alphanumeric. |
$3 | The 3rd group is the last 3 characters of the postal code |
If the two RegEx expressions are put side by side you will see the subtle differences
DECLARE
@RegExVARCHAR
(2000)='([A-Z]{1,2}\d([A-Z]|\d){0,1}
)
(\d[A-Z]{2})'DECLARE
@RegExVARCHAR
(2000)='([A-Z]{1,2}\d
)
([A-Z]|\d){0,1}(\d[A-Z]{2})'
This illustrates how easy it is to miss potential logic holes in RegEx expressions.
Non-capturing matches
Another useful feature is the ability to specify a pattern match where the entire string must be matched but only certain key segments are assigned to variables. We would use this for reasons of efficiency as the non-capturing matches are only for evaluation and require no special handling.
Let us suppose that we want to capture the first half of a UK postal code.
- Match against the full UK postal code to ensure that it actually is a postal code
- Only capture the first half
If the bit of the RegEx pattern that identifies the 2nd half of the postal code was changed as follows then this can be achieved.
(\d[A-Z]{2})
-- Match but do not capture
(?:\d[A-Z]{2})
Just to emphasise the point consider the following text
"In the UK contractors are allowed to own their own companies and have their employers pay that company. The contractor can then draw a salary from their company and be paid dividends taxable at a much lower rate than income tax. This is known as IR35. By far the worst offending postal code is EC4M 7LS."
If we wanted to extract the postal sector then if we simply ignored the 2nd half of the postal code we would extract IR35 which is completely erroneous.
Named group matches
It is also possible to capture match groups and refer to them by name. The code below illustrates a 3rd way of splitting out our postal codes.
DECLARE
@RegEx VARCHAR(2000)='(?''OutboundPostCode''([A-Z]{1,2}\d([A-Z0-9]){0,1}))(?''InboundPostCode''(\d[A-Z]{2}))'
SELECT
CorrectedPostalCode=SQL#.RegEx_Replace(PostalCode,@RegEx,'${OutboundPostCode} ${InboundPostCode}'
,-1,1,'IgnoreCase'
),
As RegEx patterns become more complex the ability to assign names that have a business meaning to groups can help in aiding debugging.
RegEx and SEO keyword extraction
If you work for an e-commerce site then the cost of SEM (Search Engine Marketing) can rapidly consume a large part of a marketing budget. The more traffic you can gain from SEO (Search Engine Optimisation) the better.
SQL#.RegEx_Replace can be used to extract the search terms from the HTTP_REFERER for known search engines.
The example below shows a quick mock up of some search engine URLs and a method for extracting the search terms.
@URLList
-- Build a table containing search engine urls
DECLARETABLE
(URLVARCHAR
(4096) NOT NULL)INSERT INTO
@URLList(URL)SELECT
'http://www.google.co.uk/search?q=willier+macclesfield&ie=utf-8&oe=utf-8&aq=t&d=test+search&rls=org.mozilla:en-GB:official&client=firefox-a'
UNION ALL SELECT
'http://uk.search.yahoo.com/search;_ylt=AkV3R0MiO42H88ok8XhXbng4hJp4;_ylc=X1MDMjAyMzM5MjMxMgRfcgMyBGZyA3lmcC10LTcwMgRuX2dwcwMwBG9yaWdpbgN1ay55YWhvby5jb20EcXVlcnkDZGF3ZXMgbWFjY2xlc2ZpZWxkBHNhbwMx?p=dawes+macclesfield&toggle=1&cop=mss&ei=UTF-8&fr=yfp-t-702'
UNION ALL
-- Search engine that does not really exist to show that the url parameter "q" can be used as a default
SELECT
'http://www.phoneysearch.yahoo.com/search?q=trek+holmes+chapel&toggle=1&cop=mss&ei=UTF-8&fr=yfp-t-702'
-- Build a table containing the root url and the url parameter that contains the search terms.
DECLARE
@SearchEngineParamsTABLE
( SearchDomainVARCHAR
(255) NOT NULL , SearchIdentifierVARCHAR
(20) NOT NULL)INSERT INTO
@SearchEngineParams(SearchDomain,SearchIdentifier)SELECT
'http://www.google.co.uk/search'
,'q='
UNION ALL SELECT
'http://uk.search.yahoo.com/search'
,'p='
SELECT
U.URL, SearchTerms=SQL#.RegEx_Replace(URL,'.+(?:[\?&]'
+ISNULL
(SP.SearchIdentifier,'q='
) +')([^&]+)|(.+$)'
,'$1'
,-1,1,''
)FROM
@URLListAS
ULEFT JOIN
@SearchEngineParamsAS
SPON
U.URLLIKE
SP.SearchDomain+'%'
For those search engines for which we have no knowledge the following parts of the SQL code are of relevance.
- LEFT JOIN ensures that all urls are investigated regardless of whether our @SearchEngineParams has the particular search engine we are interested in
- The ISNULL statement allows us to specify a default search term identifier of "q=" for any missing search engine
It should be noted that search engines represent search terms in different ways
- Most use the url parameter q, hence our chosen default value
- Some use terms=, qry = , p= and a variety of others
- Some use + as a word separator, other use %20 meaning space (character 32)
Conclusion and useful references
Firstly thanks to Solomon Rutzky for his time peer reviewing this article and feedback. Constructive criticism is always welcome.
The examples given here are fairly basic to wet your appetite. Regular Expressions are as much art as science; it is truly amazing what some people have achieved with RegEx patterns.
Regular Expressions have been around for a considerable length of time and over the years RegEx engines have been refined and optimised so despite the complexity of what they are asked to do they are surprisingly efficient. That said, as with any tool, particularly a powerful one, it is easy to get carried away and use it inappropriately.
As mentioned earlier different languages and operating systems have slightly different dialects for RegEx.
For further reading consider some of the following: -
It is also worth finding a good RegEx testing tool. I generally use http://weitz.de/regex-coach/#install however not all RegEx syntax is supported by this particular tool.
One final point to make is that for efficiency sake it may be worth combining the use of T-SQL LIKE statements in the WHERE clause of a query to narrow down the number of records evaluated by RegEx.
This is similar to the concept of storing a CHECKSUM value for a string in an indexed field and then using this in a query such as in the example below.
CREATE PROC
dbo.SearchBigString @BigStringParamVARCHAR
(900)AS SET NOCOUNT ON SELECT
*FROM
dbo.BigStringTableWHERE
CheckSumField = CHECKSUM(@BigStringParam)-- Selective enough to get an INDEX SEEK
AND ContentField = @BigStringParam-- drills down to the specific item
RETURN
@@ROWCOUNTGO