September 28, 2015 at 10:44 pm
Comments posted to this topic are about the item Wildcard Searches
September 29, 2015 at 1:56 am
Hi,
Thanks for the article. I decided to try your email wildcard like search on our email list with a small tweak.
Not all email address end with .com or .something. As in 90% of the emails in my db they end with .co.za, .co.gh, .co.zm - you get the idea.
Herewith the tweaked code:
SELECT *
FROM dbo.Emails
WHERE EmailAddress LIKE N'%[A-Za-z0-9.+_]@[A-Za-z0-9.+_]%.[A-Za-z][A-Za-z]%.[A-Za-z][A-Za-z]%'
OR EmailAddress LIKE N'%[A-Za-z0-9.+_]@[A-Za-z0-9.+_]%.[A-Za-z][A-Za-z]%'
I'm sure this should pick up most email adress combinations, but has anyone else come across other nuances in emails?
Ciao!
Michael
Michael Gilchrist
Database Specialist
There are 10 types of people in the world, those who understand binary and those that don't. 😀
September 29, 2015 at 2:04 am
It would be interesting to see some performance information with some large data to see how this performs.
I'd always thought that a lot of these kinds of operations would be fairly slow on anything other than trivially small amount of data (i.e. < 100K)
September 29, 2015 at 4:13 am
Thanks for this nice piece Luis!
😎
September 29, 2015 at 4:17 am
@Peter.row:
Of course it would use high CPU if you use a complicated pattern against a big data set.
And if you have no covering index (where the column is part of the index or included) it will result in a very I/O intensive table scan.
Depending on your scenario you could deliver all possible matching data to your application and do the filtering / searches at the client side (the most programming languages supports RegEx). It's your goal to decide if the bottleneck of your server is the CPU or the network.
September 29, 2015 at 6:20 am
Hey Luis
In the section "Starting by ‘some string’"
The comments of this
--C) Return all rows when the name starts by any character between A and L
SELECT *
FROM dbo.LIKETest
WHERE Name LIKE '[A-D]%';
It must say
--C) Return all rows when the name starts by any character between A and D
Just replace the last letter.
September 29, 2015 at 7:11 am
3 comments:
SELECT *
FROM dbo.LIKETest
WHERE ' ' + Name + ' ' LIKE '%[^A-Za-z]Richard[^A-Za-z]%';
That is insane.
'[1-9][0-9][0-9][0-9]-[0-1][0-9]-[0-3][0-9]'
TIL 9000-19-39 is a valid date.
Some people use the PATINDEX or CHARINDEX (which doesn’t allow wildcards) functions to validate if a string contains certain characters. That’s a bad idea and it’ll be discussed further in this article.
Did I miss the part where the author discusses why it's a bad idea? Did I skim over it?
September 29, 2015 at 7:12 am
Great article Luis! Informative, to the point. I picked up a couple things.
Interesting technique using REPLICATE, I would add that I like to use CHAR() more often in my searches. E.g CHAR(32) for spaces, CHAR(9) for tabs when I'm using LIKE, PATINDEX or CHARINDEX. Much easier to distinguish than the literal version of each.
There are lots of things that can be done with patterns, like a pattern based splitter.
That would make for a great article 😉
-- Itzik Ben-Gan 2001
September 29, 2015 at 7:27 am
Michael G (9/29/2015)
Hi,Thanks for the article. I decided to try your email wildcard like search on our email list with a small tweak.
Not all email address end with .com or .something. As in 90% of the emails in my db they end with .co.za, .co.gh, .co.zm - you get the idea.
Herewith the tweaked code:
SELECT *
FROM dbo.Emails
WHERE EmailAddress LIKE N'%[A-Za-z0-9.+_]@[A-Za-z0-9.+_]%.[A-Za-z][A-Za-z]%.[A-Za-z][A-Za-z]%'
OR EmailAddress LIKE N'%[A-Za-z0-9.+_]@[A-Za-z0-9.+_]%.[A-Za-z][A-Za-z]%'
I'm sure this should pick up most email adress combinations, but has anyone else come across other nuances in emails?
Ciao!
Michael
Actually, the code in the article should include your examples as well as the ones that have no domain type, only country code. The problem is the false positives that might show up, such as the nonsense from Bruce Banner.
September 29, 2015 at 7:30 am
peter.row (9/29/2015)
It would be interesting to see some performance information with some large data to see how this performs.I'd always thought that a lot of these kinds of operations would be fairly slow on anything other than trivially small amount of data (i.e. < 100K)
I didn't include any performance testing because there aren't alternatives to this. If there are, please share them with us. I can measure the queries, but without equivalent queries, there's no point to it.
This is about getting the correct/desired information regardless of performance, which is a real world problem.
September 29, 2015 at 7:46 am
standardbluecaboose (9/29/2015)
3 comments:
SELECT *
FROM dbo.LIKETest
WHERE ' ' + Name + ' ' LIKE '%[^A-Za-z]Richard[^A-Za-z]%';
That is insane.
I'm not sure if this is good or bad.
'[1-9][0-9][0-9][0-9]-[0-1][0-9]-[0-3][0-9]'
TIL 9000-19-39 is a valid date.
That's why it's referred as a basic validation. A single pattern can't validate for every single date, especially when dealing with Feb-29 on leap years. I used one validation in the past that would even check if the date was YYYY/MM/DD, DD/MM/YYYY or MM/DD/YYYY with some priority over the last two formats. That's only for data cleansing and starting on SQL 2012 we can forget about all this by using TRY_CONVERT()
Some people use the PATINDEX or CHARINDEX (which doesn’t allow wildcards) functions to validate if a string contains certain characters. That’s a bad idea and it’ll be discussed further in this article.
Did I miss the part where the author discusses why it's a bad idea? Did I skim over it?
I'm sorry, I did forget about this part.
It's actually a simple set of good habits to allow an index seek whenever possible. A few examples in here:
SELECT *
FROM dbo.LIKETest
WHERE Name LIKE 'B%'; -- SARGable, Uses Index Seek
SELECT *
FROM dbo.LIKETest
WHERE CHARINDEX( 'B', Name) = 1; -- non SARGable, Uses Index Scan
SELECT *
FROM dbo.LIKETest
WHERE LEFT( Name, 1) = 'B'; -- non SARGable, Uses Index Scan
SELECT *
FROM dbo.LIKETest
WHERE Name LIKE '[A-D]%'; -- SARGable, Uses Index Seek
SELECT *
FROM dbo.LIKETest
WHERE PATINDEX( '[A-D]%', Name) = 1; -- non SARGable, Uses Index Scan
September 29, 2015 at 7:49 am
Hi Luis,
If possible a sequel of this article could be a performance comparison between LIKE and PATINDEX, and maybe also include CONTAINS.
Also if Master Services is installed, you could use the REGEX match function from that database to look for patterns.
Thanks for the explanation about LIKE wildcards.
I have had the issue that if I use too complex wildcards, then PATINDEX doesn't work, and in those cases I have to stay with LIKE.
Regards,
José
September 29, 2015 at 7:54 am
Alan.B (9/29/2015)
Great article Luis! Informative, to the point. I picked up a couple things.Interesting technique using REPLICATE, I would add that I like to use CHAR() more often in my searches. E.g CHAR(32) for spaces, CHAR(9) for tabs when I'm using LIKE, PATINDEX or CHARINDEX. Much easier to distinguish than the literal version of each.
There are lots of things that can be done with patterns, like a pattern based splitter.
That would make for a great article 😉
Interesting, I prefer the single space over the CHAR(32) but the CHAR(9) over the tab literal. Maybe I'd just have to get used to it.
About the article, you mean writing about the things that can be done? or the splitter? because the splitter already has an article written by Dwain.
September 29, 2015 at 7:58 am
jose.pla (9/29/2015)
Hi Luis,If possible a sequel of this article could be a performance comparison between LIKE and PATINDEX, and maybe also include CONTAINS.
Also if Master Services is installed, you could use the REGEX match function from that database to look for patterns.
Thanks for the explanation about LIKE wildcards.
I have had the issue that if I use too complex wildcards, then PATINDEX doesn't work, and in those cases I have to stay with LIKE.
Regards,
José
I haven't had the opportunity to work with CONTAINS or REGEX. If you are familiar with them, write an article and submit it. It would be very interesting to see how it compares with the basic T-SQL. I know the flexibility would be different and there are additional requirements to implement those options, but some people might find them worth it for their applications.
September 29, 2015 at 7:58 am
Luis Cazares (9/29/2015)
standardbluecaboose (9/29/2015)
3 comments:
SELECT *
FROM dbo.LIKETest
WHERE ' ' + Name + ' ' LIKE '%[^A-Za-z]Richard[^A-Za-z]%';
That is insane.
I'm not sure if this is good or bad.
'[1-9][0-9][0-9][0-9]-[0-1][0-9]-[0-3][0-9]'
TIL 9000-19-39 is a valid date.
That's why it's referred as a basic validation. A single pattern can't validate for every single date, especially when dealing with Feb-29 on leap years. I used one validation in the past that would even check if the date was YYYY/MM/DD, DD/MM/YYYY or MM/DD/YYYY with some priority over the last two formats. That's only for data cleansing and starting on SQL 2012 we can forget about all this by using TRY_CONVERT()
Some people use the PATINDEX or CHARINDEX (which doesn’t allow wildcards) functions to validate if a string contains certain characters. That’s a bad idea and it’ll be discussed further in this article.
Did I miss the part where the author discusses why it's a bad idea? Did I skim over it?
I'm sorry, I did forget about this part.
It's actually a simple set of good habits to allow an index seek whenever possible. A few examples in here:
SELECT *
FROM dbo.LIKETest
WHERE Name LIKE 'B%'; -- SARGable, Uses Index Seek
SELECT *
FROM dbo.LIKETest
WHERE CHARINDEX( 'B', Name) = 1; -- non SARGable, Uses Index Scan
SELECT *
FROM dbo.LIKETest
WHERE LEFT( Name, 1) = 'B'; -- non SARGable, Uses Index Scan
SELECT *
FROM dbo.LIKETest
WHERE Name LIKE '[A-D]%'; -- SARGable, Uses Index Seek
SELECT *
FROM dbo.LIKETest
WHERE PATINDEX( '[A-D]%', Name) = 1; -- non SARGable, Uses Index Scan
Thanks for the quick reply.
The WHERE ' ' + Name + ' ' LIKE '%[^A-Za-z]Richard[^A-Za-z]%';
is insane in a bad way. I have no suggestions for how to make it better, but it feels like such a hack. However, a lot about SQL feels like a hack to me.
Thank you for the update about the PATINDEX.
I will say that Column LIKE '%X%'`
and `PATINDEX('X', Column)
both do a full scan.
Viewing 15 posts - 1 through 15 (of 38 total)
You must be logged in to reply to this topic. Login to reply