query for consecutive alphabets

  • Hi

    I need a query(or function) that filter the names(column name) which have consecutive 3 alphabets. basically I want to filter out those garbage names. e.g. sabcmm has abc consecutive.So it should be filtered.

    Online Trainer For SQL DBA and Developer @RedBushTechnologies with 18 yrs exp.

  • S_Kumar_S (4/3/2012)


    Hi

    I need a query(or function) that filter the names(column name) which have consecutive 3 alphabets. basically I want to filter out those garbage names. e.g. sabcmm has abc consecutive.So it should be filtered.

    Hello and welcome to SSC!

    I'd like to be able to help, but it seems you've forgot to post readily consumable sample data and ddl scripts.

    If you could read this article[/url] about the best way to post DDL and sample data then edit your post to include it, it would be extremely useful for the unpaid volunteers of this site.

    For now, I've made a best guess.

    Assuming your sample data looks like this: -

    --Your sample data!!

    CREATE TABLE yourTable (name VARCHAR(50))

    INSERT INTO yourTable

    SELECT name

    FROM (VALUES('Good'), --Keep

    ('Goabc'), --Filter Out

    ('Goghi'), --Filter Out

    ('Goghp') --Keep

    ) a(name);

    name

    --------------------------------------------------

    Good

    Goabc

    Goghi

    Goghp

    You could get the results you want like this: -

    SELECT name

    FROM yourTable a

    CROSS APPLY (SELECT TOP 1 CASE WHEN a.name LIKE strings THEN 1 ELSE 0 END

    FROM (SELECT TOP 24

    '%' + SUBSTRING('ABCDEFGHIJKLMNOPQRSTUVWXYZ',tbl.N,3) + '%' AS strings

    FROM (VALUES(1),(2),(3),(4),(5),(6),(7),(8),

    (9),(10),(11),(12),(13),(14),(15),

    (16),(17),(18),(19),(20),(21),(22),

    (23),(24),(25),(26)) tbl(N)) innerQuery

    ORDER BY CASE WHEN a.name LIKE strings THEN 1 ELSE 0 END DESC) b(filter)

    WHERE filter = 0;

    On the above sample data, that produces: -

    name

    --------------------------------------------------

    Good

    Goghp

    If that doesn't either solve your problem or give you some ideas on how to solve it, then read this article[/url] and post back with your DDL, sample data and expected results.

    Thanks.


    Forever trying to learn
    My blog - http://www.cadavre.co.uk/
    For better, quicker answers on T-SQL questions, click on the following...http://www.sqlservercentral.com/articles/Best+Practices/61537/
    For better, quicker answers on SQL Server performance related questions, click on the following...http://www.sqlservercentral.com/articles/SQLServerCentral/66909/

  • Thanks. it helped a lot. Sorry for not putting sample data.

    Online Trainer For SQL DBA and Developer @RedBushTechnologies with 18 yrs exp.

  • Cadavre (4/3/2012)


    S_Kumar_S (4/3/2012)


    Hi

    I need a query(or function) that filter the names(column name) which have consecutive 3 alphabets. basically I want to filter out those garbage names. e.g. sabcmm has abc consecutive.So it should be filtered.

    Hello and welcome to SSC!

    I'd like to be able to help, but it seems you've forgot to post readily consumable sample data and ddl scripts.

    If you could read this article[/url] about the best way to post DDL and sample data then edit your post to include it, it would be extremely useful for the unpaid volunteers of this site.

    For now, I've made a best guess.

    Assuming your sample data looks like this: -

    --Your sample data!!

    CREATE TABLE yourTable (name VARCHAR(50))

    INSERT INTO yourTable

    SELECT name

    FROM (VALUES('Good'), --Keep

    ('Goabc'), --Filter Out

    ('Goghi'), --Filter Out

    ('Goghp') --Keep

    ) a(name);

    name

    --------------------------------------------------

    Good

    Goabc

    Goghi

    Goghp

    You could get the results you want like this: -

    SELECT name

    FROM yourTable a

    CROSS APPLY (SELECT TOP 1 CASE WHEN a.name LIKE strings THEN 1 ELSE 0 END

    FROM (SELECT TOP 24

    '%' + SUBSTRING('ABCDEFGHIJKLMNOPQRSTUVWXYZ',tbl.N,3) + '%' AS strings

    FROM (VALUES(1),(2),(3),(4),(5),(6),(7),(8),

    (9),(10),(11),(12),(13),(14),(15),

    (16),(17),(18),(19),(20),(21),(22),

    (23),(24),(25),(26)) tbl(N)) innerQuery

    ORDER BY CASE WHEN a.name LIKE strings THEN 1 ELSE 0 END DESC) b(filter)

    WHERE filter = 0;

    On the above sample data, that produces: -

    name

    --------------------------------------------------

    Good

    Goghp

    If that doesn't either solve your problem or give you some ideas on how to solve it, then read this article[/url] and post back with your DDL, sample data and expected results.

    Thanks.

    Very cool solution! Saving this in my repertoire!

    Jared
    CE - Microsoft

  • SQLKnowItAll (4/3/2012)


    Cadavre (4/3/2012)


    S_Kumar_S (4/3/2012)


    Hi

    I need a query(or function) that filter the names(column name) which have consecutive 3 alphabets. basically I want to filter out those garbage names. e.g. sabcmm has abc consecutive.So it should be filtered.

    Hello and welcome to SSC!

    I'd like to be able to help, but it seems you've forgot to post readily consumable sample data and ddl scripts.

    If you could read this article[/url] about the best way to post DDL and sample data then edit your post to include it, it would be extremely useful for the unpaid volunteers of this site.

    For now, I've made a best guess.

    Assuming your sample data looks like this: -

    --Your sample data!!

    CREATE TABLE yourTable (name VARCHAR(50))

    INSERT INTO yourTable

    SELECT name

    FROM (VALUES('Good'), --Keep

    ('Goabc'), --Filter Out

    ('Goghi'), --Filter Out

    ('Goghp') --Keep

    ) a(name);

    name

    --------------------------------------------------

    Good

    Goabc

    Goghi

    Goghp

    You could get the results you want like this: -

    SELECT name

    FROM yourTable a

    CROSS APPLY (SELECT TOP 1 CASE WHEN a.name LIKE strings THEN 1 ELSE 0 END

    FROM (SELECT TOP 24

    '%' + SUBSTRING('ABCDEFGHIJKLMNOPQRSTUVWXYZ',tbl.N,3) + '%' AS strings

    FROM (VALUES(1),(2),(3),(4),(5),(6),(7),(8),

    (9),(10),(11),(12),(13),(14),(15),

    (16),(17),(18),(19),(20),(21),(22),

    (23),(24),(25),(26)) tbl(N)) innerQuery

    ORDER BY CASE WHEN a.name LIKE strings THEN 1 ELSE 0 END DESC) b(filter)

    WHERE filter = 0;

    On the above sample data, that produces: -

    name

    --------------------------------------------------

    Good

    Goghp

    If that doesn't either solve your problem or give you some ideas on how to solve it, then read this article[/url] and post back with your DDL, sample data and expected results.

    Thanks.

    Very cool solution! Saving this in my repertoire!

    +1, i like the way you generate the strings to match on. one of the reasons im on here is to find new solutions for future problems.


    For faster help in answering any problems Please read How to post data/code on a forum to get the best help - Jeff Moden[/url] for the best way to ask your question.

    For performance Issues see how we like them posted here: How to Post Performance Problems - Gail Shaw[/url]

    Need to Split some strings? Jeff Moden's DelimitedSplit8K[/url]
    Jeff Moden's Cross tab and Pivots Part 1[/url]
    Jeff Moden's Cross tab and Pivots Part 2[/url]

  • Is there an easier way to determine "good" vs "garbage"?

    would it be a solution to find values not in the dictionary, or values that do not pass a spell check operation, for example?

    I've got a dictionary that I converted to a SQL table from the Gutenburg project's Webster Unabridged Dictionary, and could use that to find exceptions as an example.

    What are some typical good vs bad values?

    Lowell


    --help us help you! If you post a question, make sure you include a CREATE TABLE... statement and INSERT INTO... statement into that table to give the volunteers here representative data. with your description of the problem, we can provide a tested, verifiable solution to your question! asking the question the right way gets you a tested answer the fastest way possible!

  • Cadavre (4/3/2012)


    You could get the results you want like this: -

    SELECT name

    FROM yourTable a

    CROSS APPLY (SELECT TOP 1 CASE WHEN a.name LIKE strings THEN 1 ELSE 0 END

    FROM (SELECT TOP 24

    '%' + SUBSTRING('ABCDEFGHIJKLMNOPQRSTUVWXYZ',tbl.N,3) + '%' AS strings

    FROM (VALUES(1),(2),(3),(4),(5),(6),(7),(8),

    (9),(10),(11),(12),(13),(14),(15),

    (16),(17),(18),(19),(20),(21),(22),

    (23),(24),(25),(26)) tbl(N)) innerQuery

    ORDER BY CASE WHEN a.name LIKE strings THEN 1 ELSE 0 END DESC) b(filter)

    WHERE filter = 0;

    Ohhhhh, be careful, Craig. You have absolutely the right idea but you've applied it "backwards". If you look at the actual execution plan, you're splitting the alphabet 24 times for each word for a total of 96 substrings as witnessed by the 96 times the constant scan is used in the execution plan.

    Instead of busting up the alphabet in 24 - 3 letter segments for each word, find the 3 letter segments in each word and use CHARINDEX to do the heavy lifting. AND, if you look for the things you don't want to find, no need for a CASE to get away with a nice quick TOP 1 like you tried (which didn't help the alphabet splits at all).

    If you really want to see some "Fat" arrows, add a million rows to the table and see that you end up with more than 24 million constant scans. Even at "memory speeds", that takes a long time.

    And, I swear, sometimes life is just easier with a Tally table even if it does make a few more logical reads. Compare the actual execution plan of the following with yours. Look for "fat" lines when you do. Here's the test I ran with your code and mine.

    --===== Create a variable to dump to to take the display time out of the picture.

    DECLARE @Bitbucket VARCHAR(50);

    --===== Now, run the tests

    PRINT '========== Craig''s Method ======================================================'

    SET STATISTICS TIME ON;

    SELECT @Bitbucket = name

    FROM yourTable a

    CROSS APPLY (SELECT TOP 1 CASE WHEN a.name LIKE strings THEN 1 ELSE 0 END

    FROM (SELECT TOP 24

    '%' + SUBSTRING('ABCDEFGHIJKLMNOPQRSTUVWXYZ',tbl.N,3) + '%' AS strings

    FROM (VALUES(1),(2),(3),(4),(5),(6),(7),(8),

    (9),(10),(11),(12),(13),(14),(15),

    (16),(17),(18),(19),(20),(21),(22),

    (23),(24),(25),(26)) tbl(N)) innerQuery

    ORDER BY CASE WHEN a.name LIKE strings THEN 1 ELSE 0 END DESC) b(filter)

    WHERE filter = 0

    ;

    SET STATISTICS TIME OFF;

    PRINT '========== Jeff''s Method ======================================================='

    SET STATISTICS TIME ON;

    SELECT @Bitbucket = yt.Name

    FROM dbo.YourTable yt

    WHERE NOT EXISTS

    (

    SELECT TOP 1 1

    FROM dbo.Tally t

    WHERE t.N BETWEEN 1 AND LEN(yt.Name)-2

    AND CHARINDEX(SUBSTRING(yt.Name,t.N,3),'ABCDEFGHIJKLMNOPQRSTUVWXYZ') > 0

    )

    ;

    SET STATISTICS TIME OFF;

    Here are the statistics results (parallelism kicked in on both)...

    ========== Craig's Method ======================================================

    SQL Server Execution Times:

    CPU time = 56862 ms, elapsed time = 34221 ms.

    ========== Jeff's Method =======================================================

    SQL Server Execution Times:

    CPU time = 32292 ms, elapsed time = 16701 ms.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Jeff Moden (4/3/2012)


    Ohhhhh, be careful, Craig

    Jeff, it was posted by Cadavre (If cadavre's original name is Craig, i apologize for the nit-pick) 🙂

  • Jeff Moden (4/3/2012)


    Ohhhhh, be careful, Craig. You have absolutely the right idea but you've applied it "backwards". If you look at the actual execution plan, you're splitting the alphabet 24 times for each word for a total of 96 substrings as witnessed by the 96 times the constant scan is used in the execution plan.

    Instead of busting up the alphabet in 24 - 3 letter segments for each word, find the 3 letter segments in each word and use CHARINDEX to do the heavy lifting. AND, if you look for the things you don't want to find, no need for a CASE to get away with a nice quick TOP 1 like you tried (which didn't help the alphabet splits at all).

    If you really want to see some "Fat" arrows, add a million rows to the table and see that you end up with more than 24 million constant scans. Even at "memory speeds", that takes a long time.

    And, I swear, sometimes life is just easier with a Tally table even if it does make a few more logical reads. Compare the actual execution plan of the following with yours. Look for "fat" lines when you do. Here's the test I ran with your code and mine.

    ========== Craig's Method ======================================================

    SQL Server Execution Times:

    CPU time = 56862 ms, elapsed time = 34221 ms.

    ========== Jeff's Method =======================================================

    SQL Server Execution Times:

    CPU time = 32292 ms, elapsed time = 16701 ms.

    Excellent. Thanks Jeff, will have a dig into it later.

    ColdCoffee (4/4/2012)


    Jeff, it was posted by Cadavre (If cadavre's original name is Craig, i apologize for the nit-pick) 🙂

    I go by "Craig" far more often than "Cadavre" (or Cad) 😀

    But I've been using Cadavre online since I was under 10 years old, more than half of my life! So it sort of feels like I have to keep using it now.


    --edit


    Not quite as pronounced a difference when I tried on my system, but still your method outperformed mine. I used a million GUIDs, seemed the easiest way of generating a million random strings: -

    SELECT TOP 1000000

    CONVERT(VARCHAR(36),NEWID()) AS name

    INTO yourTable

    FROM master.dbo.syscolumns sc1, master.dbo.syscolumns sc2, master.dbo.syscolumns sc3;

    ========== Craig's Method ======================================================

    SQL Server Execution Times:

    CPU time = 156330 ms, elapsed time = 42148 ms.

    ========== Jeff's Method =======================================================

    SQL Server Execution Times:

    CPU time = 133723 ms, elapsed time = 35192 ms.

    Note: the test box I ran that on has less than 2GB of memory.

    I wonder if it'd be faster to have a table with all of the strings that are disallowed in? Will have a look later on, but got a bit of a busy couple of weeks (release candidate has to be ready for Friday 13th, and I'm off from Thursday 12th to go to the Gadget Show in Birmingham) so unlikely to be today.


    Forever trying to learn
    My blog - http://www.cadavre.co.uk/
    For better, quicker answers on T-SQL questions, click on the following...http://www.sqlservercentral.com/articles/Best+Practices/61537/
    For better, quicker answers on SQL Server performance related questions, click on the following...http://www.sqlservercentral.com/articles/SQLServerCentral/66909/

  • this is a great script! i've already found a number of examples of 'bad' data.

    i had to be careful though as i found that names like 'Stuart' and 'Kirsty' are filtered out. (stu,rst)

  • Craig,

    Does your Tally table have a clustered index on it? If not, that may explain the results you got.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • p.s.

    I used a million GUIDs, seemed the easiest way of generating a million random strings:

    Great minds think alike. That's what I did, as well, and I apologize for not posting the code (it was late and I forgot).

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Jeff Moden (4/4/2012)


    Craig,

    Does your Tally table have a clustered index on it? If not, that may explain the results you got.

    Yep, clustered index with fill factor 100.

    It's only a little tally table (0-1,000,000). Would it being a zero based table make a difference? I have a 1 based tally table on this box as well, so will have a look at some point.


    Forever trying to learn
    My blog - http://www.cadavre.co.uk/
    For better, quicker answers on T-SQL questions, click on the following...http://www.sqlservercentral.com/articles/Best+Practices/61537/
    For better, quicker answers on SQL Server performance related questions, click on the following...http://www.sqlservercentral.com/articles/SQLServerCentral/66909/

  • Zero based wouldn't make a difference here. What may be making the difference is that you're working on a machine with less memory than I am and my Tally table is only 11k. Dunno if your Tally table is slipping out of cache or not.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • On the same box, just tried a 1-11000 tally table and got the same results.

    I guess it's a memory issue. When I've got more time I'll free up some space on one of my beefy VMs and have another go.


    Forever trying to learn
    My blog - http://www.cadavre.co.uk/
    For better, quicker answers on T-SQL questions, click on the following...http://www.sqlservercentral.com/articles/Best+Practices/61537/
    For better, quicker answers on SQL Server performance related questions, click on the following...http://www.sqlservercentral.com/articles/SQLServerCentral/66909/

Viewing 15 posts - 1 through 15 (of 37 total)

You must be logged in to reply to this topic. Login to reply