Query to find distinct multiple instances of a pattern in a string

  • One of my varchar columns in a table has multiple key words enclosed in a pattern of special characters.

    Eg: William Shakespeare was an English [##poet##], [##playwright##], and [##actor##], widely regarded as the greatest [##writer##] in the English language and the world's pre-eminent [##dramatist##]. He is often called England's national [##poet##] and the "Bard of Avon". His extant works, including some collaborations, consist of about 38 plays, 154 [##sonnets##], two long narrative [##poems##], and a few other [##verses##], of which the authorship of some is uncertain. His plays have been translated into every major living language and are performed more often than those of any other [##playwright##].

    I need to write to query to find all distinct key words that are enclosed within [## and ##]. My query should yield the following results from the string in the example above

    [##actor##]

    [##dramatist##]

    [##playwright##] -- 2 occurrances, but I need it only once in my result set

    [##poems##]

    [##poet##] -- 2 occurrances, but I need it only once in my result set

    [##sonnets##]

    [##verses##]

    [##writer##]

    I need to run this on a large table, so I am looking for the best possible way to minimize any performance issues.

    Just give you sample code, I have provided below 2 separate snippets, one with table variable and another with temp table.

    DECLARE @MyTable TABLE (MyString VARCHAR (8000))

    INSERT @MyTable VALUES ('William Shakespeare was an English [##poet##], [##playwright##], and [##actor##], widely regarded as the greatest [##writer##] in the English language and the world''s pre-eminent [##dramatist##]. He is often called England''s national [##poet##] and the "Bard of Avon". His extant works, including some collaborations, consist of about 38 plays, 154 [##sonnets##], two long narrative [##poems##], and a few other [##verses##], of which the authorship of some is uncertain. His plays have been translated into every major living language and are performed more often than those of any other [##playwright##].')

    SELECT * from @MyTable

    IF OBJECT_ID('tempdb..#MyTable') IS NOT NULL

    DROP TABLE #MyTable

    CREATE TABLE #MyTable (MyString VARCHAR (8000))

    INSERT #MyTable VALUES ('William Shakespeare was an English [##poet##], [##playwright##], and [##actor##], widely regarded as the greatest [##writer##] in the English language and the world''s pre-eminent [##dramatist##]. He is often called England''s national [##poet##] and the "Bard of Avon". His extant works, including some collaborations, consist of about 38 plays, 154 [##sonnets##], two long narrative [##poems##], and a few other [##verses##], of which the authorship of some is uncertain. His plays have been translated into every major living language and are performed more often than those of any other [##playwright##].')

    SELECT * from #MyTable

    Thanks in advance.

  • I would probably start with Dwain Camps' article here:

    http://www.sqlservercentral.com/articles/String+Manipulation/94365/

    I might go about it by writing all the locations (use CharIndex) of a string to a temp table and then doing a select distinct on it. Not pretty for sure, but performance on something like this without a temp table would probably be hideous.

  • Thanks for the link, @pietlinden. I will go through the article.

    Just to re-phrase my original request, I am not looking for a single query. This is a one-time effort, so a T-SQL block with temp tables is just fine.

    Thanks!

  • Maybe the DelimitedSplit8k will be faster as there's no really a need for wildcards.

    http://www.sqlservercentral.com/articles/Tally+Table/72993/

    SELECT SUBSTRING( Item, 3, CHARINDEX( '#', Item, 3) - 3) AS keyword,

    '[' + LEFT( Item, CHARINDEX( '#', Item, 3) + 2) AS keywordWrapped,

    COUNT(*) AS Occurences

    from #MyTable

    CROSS APPLY dbo.DelimitedSplit8K( MyString, '[')

    WHERE ItemNumber > 1

    GROUP BY MyString,

    SUBSTRING( Item, 3, CHARINDEX( '#', Item, 3) - 3),

    LEFT( Item, CHARINDEX( '#', Item, 3) + 2)

    Luis C.
    General Disclaimer:
    Are you seriously taking the advice and code from someone from the internet without testing it? Do you at least understand it? Or can it easily kill your server?

    How to post data/code on a forum to get the best help: Option 1 / Option 2
  • SELECT SUBSTRING(string, t.tally + 3, CHARINDEX('##]', string, t.tally + 3) - t.tally - 3)

    FROM ( --#MyTable

    SELECT CAST('William Shakespeare was an English [##poet##], [##playwright##], and [##actor##], widely regarded as the greatest [##writer##] in the English language and the world''s pre-eminent [##dramatist##]. He is often called England''s national [##poet##] and the "Bard of Avon". His extant works, including some collaborations, consist of about 38 plays, 154 [##sonnets##], two long narrative [##poems##], and a few other [##verses##], of which the authorship of some is uncertain. His plays have been translated into every major living language and are performed more often than those of any other [##playwright##].' AS varchar(8000)) AS string

    ) AS test_data

    INNER JOIN dbo.tally t ON

    SUBSTRING(string, t.tally, 3) = '[##' AND

    CHARINDEX('##]', string, t.tally + 3) > 0

    Edit: Where "tally" is a standard tally table; I also named the column "tally". Naturally change that to match your own CTE or physical tally table.

    SQL DBA,SQL Server MVP(07, 08, 09) "It's a dog-eat-dog world, and I'm wearing Milk-Bone underwear." "Norm", on "Cheers". Also from "Cheers", from "Carla": "You need to know 3 things about Tortelli men: Tortelli men draw women like flies; Tortelli men treat women like flies; Tortelli men's brains are in their flies".

  • For fun, yet another solution

    😎

    USE tempdb;

    GO

    SET NOCOUNT ON;

    IF OBJECT_ID('dbo.TBL_SAMPLE_PARSE') IS NOT NULL

    DROP TABLE dbo.TBL_SAMPLE_PARSE

    CREATE TABLE dbo.TBL_SAMPLE_PARSE

    (

    SP_ID INT IDENTITY(1,1) NOT NULL CONSTRAINT PK_DBO_TBL_SAMPLE_PARSE_SP_ID PRIMARY KEY CLUSTERED

    ,MyString VARCHAR (8000) NOT NULL

    )

    INSERT dbo.TBL_SAMPLE_PARSE (MyString)

    VALUES ('William Shakespeare was an English [##poet##], [##playwright##], and [##actor##], widely regarded as the greatest [##writer##] in the English language and the world''s pre-eminent [##dramatist##]. He is often called England''s national [##poet##] and the "Bard of Avon". His extant works, including some collaborations, consist of about 38 plays, 154 [##sonnets##], two long narrative [##poems##], and a few other [##verses##], of which the authorship of some is uncertain. His plays have been translated into every major living language and are performed more often than those of any other [##playwright##].')

    ;WITH T(N) AS (SELECT N FROM (VALUES (NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL)) AS X(N))

    ,BASE_DATA AS

    (

    SELECT

    SP.SP_ID

    ,CHARINDEX('[##',SP.MyString,NM.N) AS POS_FROM

    ,CHARINDEX('##]',SP.MyString,NM.N) AS POS_TO

    ,SP.MyString

    FROM dbo.TBL_SAMPLE_PARSE SP

    CROSS APPLY

    (

    SELECT TOP (LEN(SP.MyString))

    ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) AS N

    FROM T T1,T T2,T T3,T T4

    ) AS NM

    WHERE CHARINDEX('[##',SP.MyString,NM.N) > 0

    AND

    (

    NM.N = CHARINDEX('[##',SP.MyString,NM.N)

    OR

    NM.N = CHARINDEX('##]',SP.MyString,NM.N)

    )

    )

    SELECT DISTINCT

    BD.SP_ID

    ,SUBSTRING(BD.MyString,BD.POS_FROM,(BD.POS_TO-BD.POS_FROM) + 3) AS TOKEN_STR

    FROM BASE_DATA BD

    WHERE BD.POS_TO > BD.POS_FROM;

    Results

    SP_ID TOKEN_STR

    ----------- -----------------

    1 [##actor##]

    1 [##dramatist##]

    1 [##playwright##]

    1 [##poems##]

    1 [##poet##]

    1 [##sonnets##]

    1 [##verses##]

    1 [##writer##]

  • Thanks for your valuable inputs, @pietlinden/@Luis Cazares/@ScottPletcher/@Eirikur Eiriksson

    SQLCurious

  • Using Eirkur's sample data and the aforementioned splitter, you could do this:

    UPDATE dbo.TBL_SAMPLE_PARSE

    SET MyString = REPLACE(REPLACE(MyString,',',' '),'.',' ')

    SELECT DISTINCT Item

    FROM TBL_SAMPLE_PARSE

    CROSS APPLY dbo.delimitedSplit8K(MyString, ' ')

    WHERE item LIKE '%[##%##]%' ESCAPE '[';

    Edit: fixed one minor typo in my code.

    "I cant stress enough the importance of switching from a sequential files mindset to set-based thinking. After you make the switch, you can spend your time tuning and optimizing your queries instead of maintaining lengthy, poor-performing code."

    -- Itzik Ben-Gan 2001

Viewing 8 posts - 1 through 7 (of 7 total)

You must be logged in to reply to this topic. Login to reply