April 3, 2012 at 7:03 am
Hi
I need a query(or function) that filter the names(column name) which have consecutive 3 alphabets. basically I want to filter out those garbage names. e.g. sabcmm has abc consecutive.So it should be filtered.
Online Trainer For SQL DBA and Developer @RedBushTechnologies with 18 yrs exp.
April 3, 2012 at 7:20 am
S_Kumar_S (4/3/2012)
HiI need a query(or function) that filter the names(column name) which have consecutive 3 alphabets. basically I want to filter out those garbage names. e.g. sabcmm has abc consecutive.So it should be filtered.
Hello and welcome to SSC!
I'd like to be able to help, but it seems you've forgot to post readily consumable sample data and ddl scripts.
If you could read this article[/url] about the best way to post DDL and sample data then edit your post to include it, it would be extremely useful for the unpaid volunteers of this site.
For now, I've made a best guess.
Assuming your sample data looks like this: -
--Your sample data!!
CREATE TABLE yourTable (name VARCHAR(50))
INSERT INTO yourTable
SELECT name
FROM (VALUES('Good'), --Keep
('Goabc'), --Filter Out
('Goghi'), --Filter Out
('Goghp') --Keep
) a(name);
name
--------------------------------------------------
Good
Goabc
Goghi
Goghp
You could get the results you want like this: -
SELECT name
FROM yourTable a
CROSS APPLY (SELECT TOP 1 CASE WHEN a.name LIKE strings THEN 1 ELSE 0 END
FROM (SELECT TOP 24
'%' + SUBSTRING('ABCDEFGHIJKLMNOPQRSTUVWXYZ',tbl.N,3) + '%' AS strings
FROM (VALUES(1),(2),(3),(4),(5),(6),(7),(8),
(9),(10),(11),(12),(13),(14),(15),
(16),(17),(18),(19),(20),(21),(22),
(23),(24),(25),(26)) tbl(N)) innerQuery
ORDER BY CASE WHEN a.name LIKE strings THEN 1 ELSE 0 END DESC) b(filter)
WHERE filter = 0;
On the above sample data, that produces: -
name
--------------------------------------------------
Good
Goghp
If that doesn't either solve your problem or give you some ideas on how to solve it, then read this article[/url] and post back with your DDL, sample data and expected results.
Thanks.
April 3, 2012 at 8:35 am
Thanks. it helped a lot. Sorry for not putting sample data.
Online Trainer For SQL DBA and Developer @RedBushTechnologies with 18 yrs exp.
April 3, 2012 at 8:55 am
Cadavre (4/3/2012)
S_Kumar_S (4/3/2012)
HiI need a query(or function) that filter the names(column name) which have consecutive 3 alphabets. basically I want to filter out those garbage names. e.g. sabcmm has abc consecutive.So it should be filtered.
Hello and welcome to SSC!
I'd like to be able to help, but it seems you've forgot to post readily consumable sample data and ddl scripts.
If you could read this article[/url] about the best way to post DDL and sample data then edit your post to include it, it would be extremely useful for the unpaid volunteers of this site.
For now, I've made a best guess.
Assuming your sample data looks like this: -
--Your sample data!!
CREATE TABLE yourTable (name VARCHAR(50))
INSERT INTO yourTable
SELECT name
FROM (VALUES('Good'), --Keep
('Goabc'), --Filter Out
('Goghi'), --Filter Out
('Goghp') --Keep
) a(name);
name
--------------------------------------------------
Good
Goabc
Goghi
Goghp
You could get the results you want like this: -
SELECT name
FROM yourTable a
CROSS APPLY (SELECT TOP 1 CASE WHEN a.name LIKE strings THEN 1 ELSE 0 END
FROM (SELECT TOP 24
'%' + SUBSTRING('ABCDEFGHIJKLMNOPQRSTUVWXYZ',tbl.N,3) + '%' AS strings
FROM (VALUES(1),(2),(3),(4),(5),(6),(7),(8),
(9),(10),(11),(12),(13),(14),(15),
(16),(17),(18),(19),(20),(21),(22),
(23),(24),(25),(26)) tbl(N)) innerQuery
ORDER BY CASE WHEN a.name LIKE strings THEN 1 ELSE 0 END DESC) b(filter)
WHERE filter = 0;
On the above sample data, that produces: -
name
--------------------------------------------------
Good
Goghp
If that doesn't either solve your problem or give you some ideas on how to solve it, then read this article[/url] and post back with your DDL, sample data and expected results.
Thanks.
Very cool solution! Saving this in my repertoire!
Jared
CE - Microsoft
April 3, 2012 at 11:47 am
SQLKnowItAll (4/3/2012)
Cadavre (4/3/2012)
S_Kumar_S (4/3/2012)
HiI need a query(or function) that filter the names(column name) which have consecutive 3 alphabets. basically I want to filter out those garbage names. e.g. sabcmm has abc consecutive.So it should be filtered.
Hello and welcome to SSC!
I'd like to be able to help, but it seems you've forgot to post readily consumable sample data and ddl scripts.
If you could read this article[/url] about the best way to post DDL and sample data then edit your post to include it, it would be extremely useful for the unpaid volunteers of this site.
For now, I've made a best guess.
Assuming your sample data looks like this: -
--Your sample data!!
CREATE TABLE yourTable (name VARCHAR(50))
INSERT INTO yourTable
SELECT name
FROM (VALUES('Good'), --Keep
('Goabc'), --Filter Out
('Goghi'), --Filter Out
('Goghp') --Keep
) a(name);
name
--------------------------------------------------
Good
Goabc
Goghi
Goghp
You could get the results you want like this: -
SELECT name
FROM yourTable a
CROSS APPLY (SELECT TOP 1 CASE WHEN a.name LIKE strings THEN 1 ELSE 0 END
FROM (SELECT TOP 24
'%' + SUBSTRING('ABCDEFGHIJKLMNOPQRSTUVWXYZ',tbl.N,3) + '%' AS strings
FROM (VALUES(1),(2),(3),(4),(5),(6),(7),(8),
(9),(10),(11),(12),(13),(14),(15),
(16),(17),(18),(19),(20),(21),(22),
(23),(24),(25),(26)) tbl(N)) innerQuery
ORDER BY CASE WHEN a.name LIKE strings THEN 1 ELSE 0 END DESC) b(filter)
WHERE filter = 0;
On the above sample data, that produces: -
name
--------------------------------------------------
Good
Goghp
If that doesn't either solve your problem or give you some ideas on how to solve it, then read this article[/url] and post back with your DDL, sample data and expected results.
Thanks.
Very cool solution! Saving this in my repertoire!
+1, i like the way you generate the strings to match on. one of the reasons im on here is to find new solutions for future problems.
For performance Issues see how we like them posted here: How to Post Performance Problems - Gail Shaw[/url]
Need to Split some strings? Jeff Moden's DelimitedSplit8K[/url]
Jeff Moden's Cross tab and Pivots Part 1[/url]
Jeff Moden's Cross tab and Pivots Part 2[/url]
April 3, 2012 at 12:21 pm
Is there an easier way to determine "good" vs "garbage"?
would it be a solution to find values not in the dictionary, or values that do not pass a spell check operation, for example?
I've got a dictionary that I converted to a SQL table from the Gutenburg project's Webster Unabridged Dictionary, and could use that to find exceptions as an example.
What are some typical good vs bad values?
Lowell
April 3, 2012 at 11:17 pm
Cadavre (4/3/2012)
You could get the results you want like this: -
SELECT name
FROM yourTable a
CROSS APPLY (SELECT TOP 1 CASE WHEN a.name LIKE strings THEN 1 ELSE 0 END
FROM (SELECT TOP 24
'%' + SUBSTRING('ABCDEFGHIJKLMNOPQRSTUVWXYZ',tbl.N,3) + '%' AS strings
FROM (VALUES(1),(2),(3),(4),(5),(6),(7),(8),
(9),(10),(11),(12),(13),(14),(15),
(16),(17),(18),(19),(20),(21),(22),
(23),(24),(25),(26)) tbl(N)) innerQuery
ORDER BY CASE WHEN a.name LIKE strings THEN 1 ELSE 0 END DESC) b(filter)
WHERE filter = 0;
Ohhhhh, be careful, Craig. You have absolutely the right idea but you've applied it "backwards". If you look at the actual execution plan, you're splitting the alphabet 24 times for each word for a total of 96 substrings as witnessed by the 96 times the constant scan is used in the execution plan.
Instead of busting up the alphabet in 24 - 3 letter segments for each word, find the 3 letter segments in each word and use CHARINDEX to do the heavy lifting. AND, if you look for the things you don't want to find, no need for a CASE to get away with a nice quick TOP 1 like you tried (which didn't help the alphabet splits at all).
If you really want to see some "Fat" arrows, add a million rows to the table and see that you end up with more than 24 million constant scans. Even at "memory speeds", that takes a long time.
And, I swear, sometimes life is just easier with a Tally table even if it does make a few more logical reads. Compare the actual execution plan of the following with yours. Look for "fat" lines when you do. Here's the test I ran with your code and mine.
--===== Create a variable to dump to to take the display time out of the picture.
DECLARE @Bitbucket VARCHAR(50);
--===== Now, run the tests
PRINT '========== Craig''s Method ======================================================'
SET STATISTICS TIME ON;
SELECT @Bitbucket = name
FROM yourTable a
CROSS APPLY (SELECT TOP 1 CASE WHEN a.name LIKE strings THEN 1 ELSE 0 END
FROM (SELECT TOP 24
'%' + SUBSTRING('ABCDEFGHIJKLMNOPQRSTUVWXYZ',tbl.N,3) + '%' AS strings
FROM (VALUES(1),(2),(3),(4),(5),(6),(7),(8),
(9),(10),(11),(12),(13),(14),(15),
(16),(17),(18),(19),(20),(21),(22),
(23),(24),(25),(26)) tbl(N)) innerQuery
ORDER BY CASE WHEN a.name LIKE strings THEN 1 ELSE 0 END DESC) b(filter)
WHERE filter = 0
;
SET STATISTICS TIME OFF;
PRINT '========== Jeff''s Method ======================================================='
SET STATISTICS TIME ON;
SELECT @Bitbucket = yt.Name
FROM dbo.YourTable yt
WHERE NOT EXISTS
(
SELECT TOP 1 1
FROM dbo.Tally t
WHERE t.N BETWEEN 1 AND LEN(yt.Name)-2
AND CHARINDEX(SUBSTRING(yt.Name,t.N,3),'ABCDEFGHIJKLMNOPQRSTUVWXYZ') > 0
)
;
SET STATISTICS TIME OFF;
Here are the statistics results (parallelism kicked in on both)...
========== Craig's Method ======================================================
SQL Server Execution Times:
CPU time = 56862 ms, elapsed time = 34221 ms.
========== Jeff's Method =======================================================
SQL Server Execution Times:
CPU time = 32292 ms, elapsed time = 16701 ms.
--Jeff Moden
Change is inevitable... Change for the better is not.
April 4, 2012 at 12:51 am
Jeff Moden (4/3/2012)
Ohhhhh, be careful, Craig
Jeff, it was posted by Cadavre (If cadavre's original name is Craig, i apologize for the nit-pick) 🙂
April 4, 2012 at 1:25 am
Jeff Moden (4/3/2012)
Ohhhhh, be careful, Craig. You have absolutely the right idea but you've applied it "backwards". If you look at the actual execution plan, you're splitting the alphabet 24 times for each word for a total of 96 substrings as witnessed by the 96 times the constant scan is used in the execution plan.Instead of busting up the alphabet in 24 - 3 letter segments for each word, find the 3 letter segments in each word and use CHARINDEX to do the heavy lifting. AND, if you look for the things you don't want to find, no need for a CASE to get away with a nice quick TOP 1 like you tried (which didn't help the alphabet splits at all).
If you really want to see some "Fat" arrows, add a million rows to the table and see that you end up with more than 24 million constant scans. Even at "memory speeds", that takes a long time.
And, I swear, sometimes life is just easier with a Tally table even if it does make a few more logical reads. Compare the actual execution plan of the following with yours. Look for "fat" lines when you do. Here's the test I ran with your code and mine.
========== Craig's Method ======================================================
SQL Server Execution Times:
CPU time = 56862 ms, elapsed time = 34221 ms.
========== Jeff's Method =======================================================
SQL Server Execution Times:
CPU time = 32292 ms, elapsed time = 16701 ms.
Excellent. Thanks Jeff, will have a dig into it later.
ColdCoffee (4/4/2012)
Jeff, it was posted by Cadavre (If cadavre's original name is Craig, i apologize for the nit-pick) 🙂
I go by "Craig" far more often than "Cadavre" (or Cad) 😀
But I've been using Cadavre online since I was under 10 years old, more than half of my life! So it sort of feels like I have to keep using it now.
SELECT TOP 1000000
CONVERT(VARCHAR(36),NEWID()) AS name
INTO yourTable
FROM master.dbo.syscolumns sc1, master.dbo.syscolumns sc2, master.dbo.syscolumns sc3;
========== Craig's Method ======================================================
SQL Server Execution Times:
CPU time = 156330 ms, elapsed time = 42148 ms.
========== Jeff's Method =======================================================
SQL Server Execution Times:
CPU time = 133723 ms, elapsed time = 35192 ms.
Note: the test box I ran that on has less than 2GB of memory.
I wonder if it'd be faster to have a table with all of the strings that are disallowed in? Will have a look later on, but got a bit of a busy couple of weeks (release candidate has to be ready for Friday 13th, and I'm off from Thursday 12th to go to the Gadget Show in Birmingham) so unlikely to be today.
April 4, 2012 at 3:49 am
this is a great script! i've already found a number of examples of 'bad' data.
i had to be careful though as i found that names like 'Stuart' and 'Kirsty' are filtered out. (stu,rst)
April 4, 2012 at 6:14 am
Craig,
Does your Tally table have a clustered index on it? If not, that may explain the results you got.
--Jeff Moden
Change is inevitable... Change for the better is not.
April 4, 2012 at 6:17 am
p.s.
I used a million GUIDs, seemed the easiest way of generating a million random strings:
Great minds think alike. That's what I did, as well, and I apologize for not posting the code (it was late and I forgot).
--Jeff Moden
Change is inevitable... Change for the better is not.
April 4, 2012 at 6:44 am
Jeff Moden (4/4/2012)
Craig,Does your Tally table have a clustered index on it? If not, that may explain the results you got.
Yep, clustered index with fill factor 100.
It's only a little tally table (0-1,000,000). Would it being a zero based table make a difference? I have a 1 based tally table on this box as well, so will have a look at some point.
April 4, 2012 at 7:11 am
Zero based wouldn't make a difference here. What may be making the difference is that you're working on a machine with less memory than I am and my Tally table is only 11k. Dunno if your Tally table is slipping out of cache or not.
--Jeff Moden
Change is inevitable... Change for the better is not.
April 4, 2012 at 7:21 am
On the same box, just tried a 1-11000 tally table and got the same results.
I guess it's a memory issue. When I've got more time I'll free up some space on one of my beefy VMs and have another go.
Viewing 15 posts - 1 through 15 (of 37 total)
You must be logged in to reply to this topic. Login to reply