query for consecutive alphabets

Question

query for consecutive alphabets

S_Kumar_S

SSCrazy Eights

Points: 9037
More actions
April 3, 2012 at 7:03 am

#258701

Hi
I need a query(or function) that filter the names(column name) which have consecutive 3 alphabets. basically I want to filter out those garbage names. e.g. sabcmm has abc consecutive.So it should be filtered.
Online Trainer For SQL DBA and Developer @RedBushTechnologies with 18 yrs exp.

Viewing 15 posts - 1 through 15 (of 37 total)

You must be logged in to reply to this topic. Login to reply

Cadavre SSC-Forever Points: 41690 More actions · Answer 1

S_Kumar_S (4/3/2012)
Hi
I need a query(or function) that filter the names(column name) which have consecutive 3 alphabets. basically I want to filter out those garbage names. e.g. sabcmm has abc consecutive.So it should be filtered.

Hello and welcome to SSC!

I'd like to be able to help, but it seems you've forgot to post readily consumable sample data and ddl scripts.

If you could read this article[/url] about the best way to post DDL and sample data then edit your post to include it, it would be extremely useful for the unpaid volunteers of this site.

For now, I've made a best guess.

Assuming your sample data looks like this: -

--Your sample data!!

CREATE TABLE yourTable (name VARCHAR(50))

INSERT INTO yourTable

SELECT name

FROM (VALUES('Good'), --Keep

('Goabc'), --Filter Out

('Goghi'), --Filter Out

('Goghp') --Keep

) a(name);

name

--------------------------------------------------

Good

Goabc

Goghi

Goghp

You could get the results you want like this: -

SELECT name

FROM yourTable a

CROSS APPLY (SELECT TOP 1 CASE WHEN a.name LIKE strings THEN 1 ELSE 0 END

FROM (SELECT TOP 24

'%' + SUBSTRING('ABCDEFGHIJKLMNOPQRSTUVWXYZ',tbl.N,3) + '%' AS strings

FROM (VALUES(1),(2),(3),(4),(5),(6),(7),(8),

(9),(10),(11),(12),(13),(14),(15),

(16),(17),(18),(19),(20),(21),(22),

(23),(24),(25),(26)) tbl(N)) innerQuery

ORDER BY CASE WHEN a.name LIKE strings THEN 1 ELSE 0 END DESC) b(filter)

WHERE filter = 0;

On the above sample data, that produces: -

name

--------------------------------------------------

Good

Goghp

If that doesn't either solve your problem or give you some ideas on how to solve it, then read this article[/url] and post back with your DDL, sample data and expected results.

Thanks.

Forever trying to learn
My blog - http://www.cadavre.co.uk/
For better, quicker answers on T-SQL questions, click on the following...http://www.sqlservercentral.com/articles/Best+Practices/61537/
For better, quicker answers on SQL Server performance related questions, click on the following...http://www.sqlservercentral.com/articles/SQLServerCentral/66909/

S_Kumar_S SSCrazy Eights Points: 9037 More actions · Answer 2

Thanks. it helped a lot. Sorry for not putting sample data.

Online Trainer For SQL DBA and Developer @RedBushTechnologies with 18 yrs exp.

Jared SSC Guru Points: 61793 More actions · Answer 3

Cadavre (4/3/2012)
S_Kumar_S (4/3/2012)
Hi
I need a query(or function) that filter the names(column name) which have consecutive 3 alphabets. basically I want to filter out those garbage names. e.g. sabcmm has abc consecutive.So it should be filtered.
Hello and welcome to SSC!
I'd like to be able to help, but it seems you've forgot to post readily consumable sample data and ddl scripts.
If you could read this article[/url] about the best way to post DDL and sample data then edit your post to include it, it would be extremely useful for the unpaid volunteers of this site.
For now, I've made a best guess.
Assuming your sample data looks like this: -
--Your sample data!!
CREATE TABLE yourTable (name VARCHAR(50))
INSERT INTO yourTable
SELECT name
FROM (VALUES('Good'), --Keep
('Goabc'), --Filter Out
('Goghi'), --Filter Out
('Goghp') --Keep
) a(name);
name
--------------------------------------------------
Good
Goabc
Goghi
Goghp
You could get the results you want like this: -
SELECT name
FROM yourTable a
CROSS APPLY (SELECT TOP 1 CASE WHEN a.name LIKE strings THEN 1 ELSE 0 END
FROM (SELECT TOP 24
'%' + SUBSTRING('ABCDEFGHIJKLMNOPQRSTUVWXYZ',tbl.N,3) + '%' AS strings
FROM (VALUES(1),(2),(3),(4),(5),(6),(7),(8),
(9),(10),(11),(12),(13),(14),(15),
(16),(17),(18),(19),(20),(21),(22),
(23),(24),(25),(26)) tbl(N)) innerQuery
ORDER BY CASE WHEN a.name LIKE strings THEN 1 ELSE 0 END DESC) b(filter)
WHERE filter = 0;
On the above sample data, that produces: -
name
--------------------------------------------------
Good
Goghp
If that doesn't either solve your problem or give you some ideas on how to solve it, then read this article[/url] and post back with your DDL, sample data and expected results.
Thanks.

Very cool solution! Saving this in my repertoire!

Jared
CE - Microsoft

CapnHector SSCoach Points: 16143 More actions · Answer 4

SQLKnowItAll (4/3/2012)
Cadavre (4/3/2012)
S_Kumar_S (4/3/2012)
Hi
I need a query(or function) that filter the names(column name) which have consecutive 3 alphabets. basically I want to filter out those garbage names. e.g. sabcmm has abc consecutive.So it should be filtered.
Hello and welcome to SSC!
I'd like to be able to help, but it seems you've forgot to post readily consumable sample data and ddl scripts.
If you could read this article[/url] about the best way to post DDL and sample data then edit your post to include it, it would be extremely useful for the unpaid volunteers of this site.
For now, I've made a best guess.
Assuming your sample data looks like this: -
--Your sample data!!
CREATE TABLE yourTable (name VARCHAR(50))
INSERT INTO yourTable
SELECT name
FROM (VALUES('Good'), --Keep
('Goabc'), --Filter Out
('Goghi'), --Filter Out
('Goghp') --Keep
) a(name);
name
--------------------------------------------------
Good
Goabc
Goghi
Goghp
You could get the results you want like this: -
SELECT name
FROM yourTable a
CROSS APPLY (SELECT TOP 1 CASE WHEN a.name LIKE strings THEN 1 ELSE 0 END
FROM (SELECT TOP 24
'%' + SUBSTRING('ABCDEFGHIJKLMNOPQRSTUVWXYZ',tbl.N,3) + '%' AS strings
FROM (VALUES(1),(2),(3),(4),(5),(6),(7),(8),
(9),(10),(11),(12),(13),(14),(15),
(16),(17),(18),(19),(20),(21),(22),
(23),(24),(25),(26)) tbl(N)) innerQuery
ORDER BY CASE WHEN a.name LIKE strings THEN 1 ELSE 0 END DESC) b(filter)
WHERE filter = 0;
On the above sample data, that produces: -
name
--------------------------------------------------
Good
Goghp
If that doesn't either solve your problem or give you some ideas on how to solve it, then read this article[/url] and post back with your DDL, sample data and expected results.
Thanks.
Very cool solution! Saving this in my repertoire!

+1, i like the way you generate the strings to match on. one of the reasons im on here is to find new solutions for future problems.

For faster help in answering any problems Please read How to post data/code on a forum to get the best help - Jeff Moden[/url] for the best way to ask your question.

For performance Issues see how we like them posted here: How to Post Performance Problems - Gail Shaw[/url]

Need to Split some strings? Jeff Moden's DelimitedSplit8K[/url]
Jeff Moden's Cross tab and Pivots Part 1[/url]
Jeff Moden's Cross tab and Pivots Part 2[/url]

Lowell SSC Guru Points: 323518 More actions · Answer 5

Is there an easier way to determine "good" vs "garbage"?

would it be a solution to find values not in the dictionary, or values that do not pass a spell check operation, for example?

I've got a dictionary that I converted to a SQL table from the Gutenburg project's Webster Unabridged Dictionary, and could use that to find exceptions as an example.

What are some typical good vs bad values?

Lowell

--help us help you! If you post a question, make sure you include a CREATE TABLE... statement and INSERT INTO... statement into that table to give the volunteers here representative data. with your description of the problem, we can provide a tested, verifiable solution to your question! asking the question the right way gets you a tested answer the fastest way possible!

Jeff Moden SSC Guru Points: 1004406 More actions · Answer 6

Cadavre (4/3/2012)
You could get the results you want like this: -
SELECT name
FROM yourTable a
CROSS APPLY (SELECT TOP 1 CASE WHEN a.name LIKE strings THEN 1 ELSE 0 END
FROM (SELECT TOP 24
'%' + SUBSTRING('ABCDEFGHIJKLMNOPQRSTUVWXYZ',tbl.N,3) + '%' AS strings
FROM (VALUES(1),(2),(3),(4),(5),(6),(7),(8),
(9),(10),(11),(12),(13),(14),(15),
(16),(17),(18),(19),(20),(21),(22),
(23),(24),(25),(26)) tbl(N)) innerQuery
ORDER BY CASE WHEN a.name LIKE strings THEN 1 ELSE 0 END DESC) b(filter)
WHERE filter = 0;

Ohhhhh, be careful, Craig. You have absolutely the right idea but you've applied it "backwards". If you look at the actual execution plan, you're splitting the alphabet 24 times for each word for a total of 96 substrings as witnessed by the 96 times the constant scan is used in the execution plan.

Instead of busting up the alphabet in 24 - 3 letter segments for each word, find the 3 letter segments in each word and use CHARINDEX to do the heavy lifting. AND, if you look for the things you don't want to find, no need for a CASE to get away with a nice quick TOP 1 like you tried (which didn't help the alphabet splits at all).

If you really want to see some "Fat" arrows, add a million rows to the table and see that you end up with more than 24 million constant scans. Even at "memory speeds", that takes a long time.

And, I swear, sometimes life is just easier with a Tally table even if it does make a few more logical reads. Compare the actual execution plan of the following with yours. Look for "fat" lines when you do. Here's the test I ran with your code and mine.

--===== Create a variable to dump to to take the display time out of the picture.

DECLARE @Bitbucket VARCHAR(50);

--===== Now, run the tests

PRINT '========== Craig''s Method ======================================================'

SET STATISTICS TIME ON;

SELECT @Bitbucket = name

FROM yourTable a

CROSS APPLY (SELECT TOP 1 CASE WHEN a.name LIKE strings THEN 1 ELSE 0 END

FROM (SELECT TOP 24

'%' + SUBSTRING('ABCDEFGHIJKLMNOPQRSTUVWXYZ',tbl.N,3) + '%' AS strings

FROM (VALUES(1),(2),(3),(4),(5),(6),(7),(8),

(9),(10),(11),(12),(13),(14),(15),

(16),(17),(18),(19),(20),(21),(22),

(23),(24),(25),(26)) tbl(N)) innerQuery

ORDER BY CASE WHEN a.name LIKE strings THEN 1 ELSE 0 END DESC) b(filter)

WHERE filter = 0

;

SET STATISTICS TIME OFF;

PRINT '========== Jeff''s Method ======================================================='

SET STATISTICS TIME ON;

SELECT @Bitbucket = yt.Name

FROM dbo.YourTable yt

WHERE NOT EXISTS

(

SELECT TOP 1 1

FROM dbo.Tally t

WHERE t.N BETWEEN 1 AND LEN(yt.Name)-2

AND CHARINDEX(SUBSTRING(yt.Name,t.N,3),'ABCDEFGHIJKLMNOPQRSTUVWXYZ') > 0

)

;

SET STATISTICS TIME OFF;

Here are the statistics results (parallelism kicked in on both)...

========== Craig's Method ======================================================

SQL Server Execution Times:

CPU time = 56862 ms, elapsed time = 34221 ms.

========== Jeff's Method =======================================================

SQL Server Execution Times:

CPU time = 32292 ms, elapsed time = 16701 ms.

--Jeff Moden

RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

Change is inevitable... Change for the better is not.

Helpful Links:
How to post code problems
How to Post Performance Problems
Create a Tally Function (fnTally)

ColdCoffee SSC-Dedicated Points: 39972 More actions · Answer 7

Jeff Moden (4/3/2012)
Ohhhhh, be careful, Craig

Jeff, it was posted by Cadavre (If cadavre's original name is Craig, i apologize for the nit-pick) 🙂

Cadavre SSC-Forever Points: 41690 More actions · Answer 8

Jeff Moden (4/3/2012)
Ohhhhh, be careful, Craig. You have absolutely the right idea but you've applied it "backwards". If you look at the actual execution plan, you're splitting the alphabet 24 times for each word for a total of 96 substrings as witnessed by the 96 times the constant scan is used in the execution plan.
Instead of busting up the alphabet in 24 - 3 letter segments for each word, find the 3 letter segments in each word and use CHARINDEX to do the heavy lifting. AND, if you look for the things you don't want to find, no need for a CASE to get away with a nice quick TOP 1 like you tried (which didn't help the alphabet splits at all).
If you really want to see some "Fat" arrows, add a million rows to the table and see that you end up with more than 24 million constant scans. Even at "memory speeds", that takes a long time.
And, I swear, sometimes life is just easier with a Tally table even if it does make a few more logical reads. Compare the actual execution plan of the following with yours. Look for "fat" lines when you do. Here's the test I ran with your code and mine.
========== Craig's Method ======================================================
SQL Server Execution Times:
CPU time = 56862 ms, elapsed time = 34221 ms.
========== Jeff's Method =======================================================
SQL Server Execution Times:
CPU time = 32292 ms, elapsed time = 16701 ms.

Excellent. Thanks Jeff, will have a dig into it later.

ColdCoffee (4/4/2012)
Jeff, it was posted by Cadavre (If cadavre's original name is Craig, i apologize for the nit-pick) 🙂

I go by "Craig" far more often than "Cadavre" (or Cad) 😀

But I've been using Cadavre online since I was under 10 years old, more than half of my life! So it sort of feels like I have to keep using it now.

--edit

Not quite as pronounced a difference when I tried on my system, but still your method outperformed mine. I used a million GUIDs, seemed the easiest way of generating a million random strings: -

SELECT TOP 1000000

CONVERT(VARCHAR(36),NEWID()) AS name

INTO yourTable

FROM master.dbo.syscolumns sc1, master.dbo.syscolumns sc2, master.dbo.syscolumns sc3;

========== Craig's Method ======================================================

SQL Server Execution Times:

CPU time = 156330 ms, elapsed time = 42148 ms.

========== Jeff's Method =======================================================

SQL Server Execution Times:

CPU time = 133723 ms, elapsed time = 35192 ms.

Note: the test box I ran that on has less than 2GB of memory.

I wonder if it'd be faster to have a table with all of the strings that are disallowed in? Will have a look later on, but got a bit of a busy couple of weeks (release candidate has to be ready for Friday 13th, and I'm off from Thursday 12th to go to the Gadget Show in Birmingham) so unlikely to be today.

Forever trying to learn
My blog - http://www.cadavre.co.uk/
For better, quicker answers on T-SQL questions, click on the following...http://www.sqlservercentral.com/articles/Best+Practices/61537/
For better, quicker answers on SQL Server performance related questions, click on the following...http://www.sqlservercentral.com/articles/SQLServerCentral/66909/

davidandrews13 SSCertifiable Points: 6820 More actions · Answer 9

this is a great script! i've already found a number of examples of 'bad' data.

i had to be careful though as i found that names like 'Stuart' and 'Kirsty' are filtered out. (stu,rst)

Jeff Moden SSC Guru Points: 1004406 More actions · Answer 10

Craig,

Does your Tally table have a clustered index on it? If not, that may explain the results you got.

--Jeff Moden

RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

Change is inevitable... Change for the better is not.

Helpful Links:
How to post code problems
How to Post Performance Problems
Create a Tally Function (fnTally)

Jeff Moden SSC Guru Points: 1004406 More actions · Answer 11

p.s.

I used a million GUIDs, seemed the easiest way of generating a million random strings:

Great minds think alike. That's what I did, as well, and I apologize for not posting the code (it was late and I forgot).

--Jeff Moden

RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

Change is inevitable... Change for the better is not.

Helpful Links:
How to post code problems
How to Post Performance Problems
Create a Tally Function (fnTally)

Cadavre SSC-Forever Points: 41690 More actions · Answer 12

Jeff Moden (4/4/2012)
Craig,
Does your Tally table have a clustered index on it? If not, that may explain the results you got.

Yep, clustered index with fill factor 100.

It's only a little tally table (0-1,000,000). Would it being a zero based table make a difference? I have a 1 based tally table on this box as well, so will have a look at some point.

Forever trying to learn
My blog - http://www.cadavre.co.uk/
For better, quicker answers on T-SQL questions, click on the following...http://www.sqlservercentral.com/articles/Best+Practices/61537/
For better, quicker answers on SQL Server performance related questions, click on the following...http://www.sqlservercentral.com/articles/SQLServerCentral/66909/

Jeff Moden SSC Guru Points: 1004406 More actions · Answer 13

Zero based wouldn't make a difference here. What may be making the difference is that you're working on a machine with less memory than I am and my Tally table is only 11k. Dunno if your Tally table is slipping out of cache or not.

--Jeff Moden

RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

Change is inevitable... Change for the better is not.

Helpful Links:
How to post code problems
How to Post Performance Problems
Create a Tally Function (fnTally)

Cadavre SSC-Forever Points: 41690 More actions · Answer 14

On the same box, just tried a 1-11000 tally table and got the same results.

I guess it's a memory issue. When I've got more time I'll free up some space on one of my beefy VMs and have another go.

Forever trying to learn
My blog - http://www.cadavre.co.uk/
For better, quicker answers on T-SQL questions, click on the following...http://www.sqlservercentral.com/articles/Best+Practices/61537/
For better, quicker answers on SQL Server performance related questions, click on the following...http://www.sqlservercentral.com/articles/SQLServerCentral/66909/