Help with finding multiple Character Occurence in String

  • Hello all,

    I am trying to figure out how I can find the names of people who have a double occurrence or more of characters either in their first name or last name.

    This is for a data cleanup.

    Thanks!! for any suggestions

    CREATE TABLE #Names ( FIRSTNAME VARCHAR (20), LASTNAME VARCHAR (20))

    INSERT INTO #Names VALUES

    ('Thomas', 'Hohner')

    ,('Mike', 'SSmith')

    ,('TtTony', 'Hawk')

    ,('Jeff', 'Smith')

    ,('ZZSara', 'Jones')

    ,('Luke', 'HendersonZZ')

    ,('Lily', 'PPadZZZ')

    SELECT * FROM #Names

    DROP TABLE #Names

    ***SQL born on date Spring 2013:-)

  • This is a quick idea.

    WITH Letters AS(

    SELECT TOP 26 CHAR(ROW_NUMBER() OVER(ORDER BY (SELECT NULL)) + 64) letter

    FROM sys.all_columns

    )

    SELECT DISTINCT n.*

    FROM #Names n

    JOIN Letters l ON n.FIRSTNAME LIKE '%' + letter + letter+ '%'

    OR n.LASTNAME LIKE '%' + letter + letter+ '%'

    I'm thinking on something different but I need some more time.

    Luis C.
    General Disclaimer:
    Are you seriously taking the advice and code from someone from the internet without testing it? Do you at least understand it? Or can it easily kill your server?

    How to post data/code on a forum to get the best help: Option 1 / Option 2
  • Here are my 2 options. Just wanted to note that I added an id to force order and that Jeff could be considered a false positive for a cleanup process.

    CREATE TABLE #Names ( id int, FIRSTNAME VARCHAR (20), LASTNAME VARCHAR (20))

    INSERT INTO #Names VALUES

    (1,'Thomas', 'Hohner')

    ,(2,'Mike', 'SSmith')

    ,(3,'TtTony', 'Hawk')

    ,(4,'Jeff', 'Smith')

    ,(5,'ZZSara', 'Jones')

    ,(6,'Luke', 'HendersonZZ')

    ,(7,'Lily', 'PPadZZZ');

    WITH E(n) AS(SELECT 0 UNION ALL SELECT 0 UNION ALL SELECT 0 UNION ALL

    SELECT 0 UNION ALL SELECT 0 UNION ALL SELECT 0)

    ,E36(n) AS (SELECT E.n FROM E, E x)

    ,Letters AS(

    SELECT TOP 26 CHAR(ROW_NUMBER() OVER(ORDER BY (SELECT NULL)) + 64) letter

    FROM sys.all_columns

    )

    SELECT DISTINCT n.*

    FROM #Names n

    JOIN Letters l ON n.FIRSTNAME LIKE '%' + letter + letter+ '%'

    OR n.LASTNAME LIKE '%' + letter + letter+ '%'

    ORDER BY id;

    WITH E(n) AS(SELECT 0 UNION ALL SELECT 0 UNION ALL SELECT 0 UNION ALL

    SELECT 0 UNION ALL SELECT 0)

    ,E25(n) AS (SELECT E.n FROM E, E x)

    ,cteTally AS(

    SELECT TOP 20 ROW_NUMBER() OVER(ORDER BY (SELECT NULL)) n

    FROM E25

    )

    SELECT DISTINCT n.*

    FROM #Names n

    CROSS

    JOIN cteTally t

    WHERE (LEN( n.FIRSTNAME) >= t.n AND SUBSTRING(n.FIRSTNAME, n, 1) = SUBSTRING(n.FIRSTNAME, n + 1, 1))

    OR (LEN( n.LASTNAME) >= t.n AND SUBSTRING(n.LASTNAME, n, 1) = SUBSTRING(n.LASTNAME, n + 1, 1))

    ORDER BY n.id

    DROP TABLE #Names

    Feel free to ask any questions about the code.

    Luis C.
    General Disclaimer:
    Are you seriously taking the advice and code from someone from the internet without testing it? Do you at least understand it? Or can it easily kill your server?

    How to post data/code on a forum to get the best help: Option 1 / Option 2
  • Luis Cazares (3/24/2015)


    This is a quick idea.

    WITH Letters AS(

    SELECT TOP 26 CHAR(ROW_NUMBER() OVER(ORDER BY (SELECT NULL)) + 64) letter

    FROM sys.all_columns

    )

    SELECT DISTINCT n.*

    FROM #Names n

    JOIN Letters l ON n.FIRSTNAME LIKE '%' + letter + letter+ '%'

    OR n.LASTNAME LIKE '%' + letter + letter+ '%'

    ORDER BY id

    I'm thinking on something different but I need some more time.

    FYI... the "ORDER BY id" is breaking it:pinch: (copy/paste error?)

    "I cant stress enough the importance of switching from a sequential files mindset to set-based thinking. After you make the switch, you can spend your time tuning and optimizing your queries instead of maintaining lengthy, poor-performing code."

    -- Itzik Ben-Gan 2001

  • Alan.B (3/24/2015)


    Luis Cazares (3/24/2015)


    This is a quick idea.

    WITH Letters AS(

    SELECT TOP 26 CHAR(ROW_NUMBER() OVER(ORDER BY (SELECT NULL)) + 64) letter

    FROM sys.all_columns

    )

    SELECT DISTINCT n.*

    FROM #Names n

    JOIN Letters l ON n.FIRSTNAME LIKE '%' + letter + letter+ '%'

    OR n.LASTNAME LIKE '%' + letter + letter+ '%'

    ORDER BY id

    I'm thinking on something different but I need some more time.

    FYI... the "ORDER BY id" is breaking it:pinch: (copy/paste error?)

    You're right, I changed the sample data and posted full code in my second post.

    Luis C.
    General Disclaimer:
    Are you seriously taking the advice and code from someone from the internet without testing it? Do you at least understand it? Or can it easily kill your server?

    How to post data/code on a forum to get the best help: Option 1 / Option 2
  • Luis Cazares (3/24/2015)


    Alan.B (3/24/2015)


    Luis Cazares (3/24/2015)


    This is a quick idea.

    WITH Letters AS(

    SELECT TOP 26 CHAR(ROW_NUMBER() OVER(ORDER BY (SELECT NULL)) + 64) letter

    FROM sys.all_columns

    )

    SELECT DISTINCT n.*

    FROM #Names n

    JOIN Letters l ON n.FIRSTNAME LIKE '%' + letter + letter+ '%'

    OR n.LASTNAME LIKE '%' + letter + letter+ '%'

    ORDER BY id

    I'm thinking on something different but I need some more time.

    FYI... the "ORDER BY id" is breaking it:pinch: (copy/paste error?)

    You're right, I changed the sample data and posted full code in my second post.

    For the record, I spent about 20 minutes trying to come up with something better than what you posted and failed.:hehe: At least I could do some successful debugging :hehe:

    "I cant stress enough the importance of switching from a sequential files mindset to set-based thinking. After you make the switch, you can spend your time tuning and optimizing your queries instead of maintaining lengthy, poor-performing code."

    -- Itzik Ben-Gan 2001

  • Just for fun, we'll get a little fancy here. 😛

    My lizard-brain recalls a really neat trick I once saw in an article by Jeff Moden.

    WITH Letters (letter) AS

    (

    SELECT TOP 26 CHAR(ROW_NUMBER() OVER(ORDER BY (SELECT NULL)) + 64)

    FROM sys.all_columns

    UNION ALL

    SELECT TOP 26 CHAR(ROW_NUMBER() OVER(ORDER BY (SELECT NULL)) + 96)

    FROM sys.all_columns

    )

    SELECT FIRSTNAME, LASTNAME, letters

    FROM #Names a

    CROSS APPLY

    (

    SELECT STUFF(

    (

    SELECT ',' + letter

    FROM #Names b

    CROSS JOIN Letters

    WHERE a.FIRSTNAME = b.FIRSTNAME and a.LASTNAME = b.LASTNAME AND (

    LTRIM(RTRIM(

    REPLACE(REPLACE(REPLACE(FIRSTNAME COLLATE LATIN1_GENERAL_BIN,REPLICATE(letter,2),letter+CHAR(1)),CHAR(1)+letter,''),CHAR(1),'')

    )) <> FIRSTNAME OR

    LTRIM(RTRIM(

    REPLACE(REPLACE(REPLACE(LASTNAME COLLATE LATIN1_GENERAL_BIN,REPLICATE(letter,2),letter+CHAR(1)),CHAR(1)+letter,''),CHAR(1),'')

    )) <> LASTNAME)

    FOR XML PATH('')

    ), 1, 1, '')

    ) b (letters)

    WHERE letters IS NOT NULL;

    The triple REPLACE converts duplicated characters into a single character using Jeff's method:

    REPLACE Multiple Spaces with One[/url]

    I speeded up the REPLACE, since we know it is much slower without a binary collation sequence (bug identified by Paul White: REPLACE function extremely slow with (non-binary) Windows collation), but then needed to use both upper and lower case characters in the Letters CTE.

    Of course, we'd also like to see all duplicated characters, so I did the FOR XML PATH concatenation thingy ala Wayne Sheffield: Creating a comma-separated list (SQL Spackle)[/url]

    Of course, this may not be faster than Luis's solution but it was kinda fun combining the works of those three great SQL-Jedi!


    My mantra: No loops! No CURSORs! No RBAR! Hoo-uh![/I]

    My thought question: Have you ever been told that your query runs too fast?

    My advice:
    INDEXing a poor-performing query is like putting sugar on cat food. Yeah, it probably tastes better but are you sure you want to eat it?
    The path of least resistance can be a slippery slope. Take care that fixing your fixes of fixes doesn't snowball and end up costing you more than fixing the root cause would have in the first place.

    Need to UNPIVOT? Why not CROSS APPLY VALUES instead?[/url]
    Since random numbers are too important to be left to chance, let's generate some![/url]
    Learn to understand recursive CTEs by example.[/url]
    [url url=http://www.sqlservercentral.com/articles/St

  • Great solutions! Thank you gentlemen. I am still playing with it and testing them out.

    I have seen similar things with a Windows function but I am trying to dissect it to see what your doing here Luis. I'm so use to doing plain Jane windows functions.

    But I know windows functions can be pretty awesome. I really like creative solutions it helps me learn.

    So if I'm reading this right

    You are selection the TOP 26 Characters/Letters from the names of all System columns. That should find all 26 letters.

    You are also Converting Row_Number to Char to display letters rather than BigInt (WOW this is so cool right here never thought of doing that)

    I kind of get lost on OVER() is that windowing over the whole set?

    + 64 is converting to letters right?

    This is the part (Order by (SELECT Null)) I am completely lost on this but really want to understand this.

    Is this just tricking the window on the ORDER Clause??

    SELECT TOP 26 CHAR(ROW_NUMBER() OVER(ORDER BY (SELECT NULL)) + 64) letter

    FROM sys.all_columns

    ***SQL born on date Spring 2013:-)

  • You're really close Thomas.

    Ranking functions (ROW_NUMBER, RANK, DENSE_RANK, NTILE) need the OVER clause with ORDER BY. As I don't care about the order and I can't use a constant, the (SELECT NULL) is "tricking" the ORDER BY. Although, it realizes that it doesn't need to order.

    CHAR() function returns a character based on the ascii code introduced. ASCII codes from 65 to 90 represent A to Z (for Spanish, I would need an additional row for Ñ). That's why I need to add 64, so I can start from 65.

    A little help to explain myself:

    WITH E(n) AS(SELECT 0 UNION ALL SELECT 0 UNION ALL SELECT 0 UNION ALL

    SELECT 0 UNION ALL SELECT 0 UNION ALL SELECT 0)

    ,E36(n) AS (SELECT E.n FROM E, E x)

    ,Letters AS(

    SELECT TOP 26 CHAR(ROW_NUMBER() OVER(ORDER BY (SELECT NULL)) + 64) letter,

    ROW_NUMBER() OVER(ORDER BY (SELECT NULL)) + 64 number

    FROM sys.all_columns

    )

    SELECT *

    FROM Letters

    Luis C.
    General Disclaimer:
    Are you seriously taking the advice and code from someone from the internet without testing it? Do you at least understand it? Or can it easily kill your server?

    How to post data/code on a forum to get the best help: Option 1 / Option 2
  • thomashohner (3/25/2015)


    SELECT TOP 26 CHAR(ROW_NUMBER() OVER(ORDER BY (SELECT NULL)) + 64) letter

    FROM sys.all_columns

    If you remove the CHAR function and the +64, you'll see it produces a set of numbers between 1 and 26: -

    SELECT TOP 26

    ROW_NUMBER() OVER ( ORDER BY ( SELECT NULL ) )

    FROM sys.all_columns;

    The reason for the order by select null is that we don't care what order it grabs the top 26 rows from sys.all_columns, we're not using the data we just want a table with numbers. He could've just as easily done: -

    SELECT TOP 26

    ROW_NUMBER() OVER ( ORDER BY @@VERSION )

    FROM sys.all_columns;

    Next, Dwain is adding 64 to each of those numbers, so now it's a set between 65 and 90.

    If you run the CHAR function on that set, it just happens to produce the letters A-Z 🙂

    SELECT TOP 26

    CHAR( ROW_NUMBER() OVER ( ORDER BY ( SELECT NULL ) ) + 64 )

    FROM sys.all_columns;

    The UNION ALL is to create the letters a-z, so that you have both the upper and lower case variants, since he is using a binary collation on the replace. Another option would've been to change the common table expression like this: -

    WITH Letters (letter) AS

    (

    SELECT CHAR(N + CASE WHEN a.N <= 26 THEN 64

    ELSE 70

    END)

    FROM ( SELECT TOP 52

    ROW_NUMBER() OVER ( ORDER BY ( SELECT NULL ) )

    FROM sys.all_columns

    ) a ( N )

    )

    Bear in mind that I'm not saying it'd be better, just different.


    Forever trying to learn
    My blog - http://www.cadavre.co.uk/
    For better, quicker answers on T-SQL questions, click on the following...http://www.sqlservercentral.com/articles/Best+Practices/61537/
    For better, quicker answers on SQL Server performance related questions, click on the following...http://www.sqlservercentral.com/articles/SQLServerCentral/66909/

  • So awesome guys. I love this stuff!

    Very, Very cool.

    Thanks for the explanation. Great learning experience.:-D

    ***SQL born on date Spring 2013:-)

Viewing 11 posts - 1 through 10 (of 10 total)

You must be logged in to reply to this topic. Login to reply