April 30, 2015 at 11:39 am
One of my varchar columns in a table has multiple key words enclosed in a pattern of special characters.
Eg: William Shakespeare was an English [##poet##], [##playwright##], and [##actor##], widely regarded as the greatest [##writer##] in the English language and the world's pre-eminent [##dramatist##]. He is often called England's national [##poet##] and the "Bard of Avon". His extant works, including some collaborations, consist of about 38 plays, 154 [##sonnets##], two long narrative [##poems##], and a few other [##verses##], of which the authorship of some is uncertain. His plays have been translated into every major living language and are performed more often than those of any other [##playwright##].
I need to write to query to find all distinct key words that are enclosed within [## and ##]. My query should yield the following results from the string in the example above
[##actor##]
[##dramatist##]
[##playwright##] -- 2 occurrances, but I need it only once in my result set
[##poems##]
[##poet##] -- 2 occurrances, but I need it only once in my result set
[##sonnets##]
[##verses##]
[##writer##]
I need to run this on a large table, so I am looking for the best possible way to minimize any performance issues.
Just give you sample code, I have provided below 2 separate snippets, one with table variable and another with temp table.
DECLARE @MyTable TABLE (MyString VARCHAR (8000))
INSERT @MyTable VALUES ('William Shakespeare was an English [##poet##], [##playwright##], and [##actor##], widely regarded as the greatest [##writer##] in the English language and the world''s pre-eminent [##dramatist##]. He is often called England''s national [##poet##] and the "Bard of Avon". His extant works, including some collaborations, consist of about 38 plays, 154 [##sonnets##], two long narrative [##poems##], and a few other [##verses##], of which the authorship of some is uncertain. His plays have been translated into every major living language and are performed more often than those of any other [##playwright##].')
SELECT * from @MyTable
IF OBJECT_ID('tempdb..#MyTable') IS NOT NULL
DROP TABLE #MyTable
CREATE TABLE #MyTable (MyString VARCHAR (8000))
INSERT #MyTable VALUES ('William Shakespeare was an English [##poet##], [##playwright##], and [##actor##], widely regarded as the greatest [##writer##] in the English language and the world''s pre-eminent [##dramatist##]. He is often called England''s national [##poet##] and the "Bard of Avon". His extant works, including some collaborations, consist of about 38 plays, 154 [##sonnets##], two long narrative [##poems##], and a few other [##verses##], of which the authorship of some is uncertain. His plays have been translated into every major living language and are performed more often than those of any other [##playwright##].')
SELECT * from #MyTable
Thanks in advance.
April 30, 2015 at 11:51 am
I would probably start with Dwain Camps' article here:
http://www.sqlservercentral.com/articles/String+Manipulation/94365/
I might go about it by writing all the locations (use CharIndex) of a string to a temp table and then doing a select distinct on it. Not pretty for sure, but performance on something like this without a temp table would probably be hideous.
April 30, 2015 at 12:00 pm
Thanks for the link, @pietlinden. I will go through the article.
Just to re-phrase my original request, I am not looking for a single query. This is a one-time effort, so a T-SQL block with temp tables is just fine.
Thanks!
April 30, 2015 at 12:57 pm
Maybe the DelimitedSplit8k will be faster as there's no really a need for wildcards.
http://www.sqlservercentral.com/articles/Tally+Table/72993/
SELECT SUBSTRING( Item, 3, CHARINDEX( '#', Item, 3) - 3) AS keyword,
'[' + LEFT( Item, CHARINDEX( '#', Item, 3) + 2) AS keywordWrapped,
COUNT(*) AS Occurences
from #MyTable
CROSS APPLY dbo.DelimitedSplit8K( MyString, '[')
WHERE ItemNumber > 1
GROUP BY MyString,
SUBSTRING( Item, 3, CHARINDEX( '#', Item, 3) - 3),
LEFT( Item, CHARINDEX( '#', Item, 3) + 2)
April 30, 2015 at 1:35 pm
SELECT SUBSTRING(string, t.tally + 3, CHARINDEX('##]', string, t.tally + 3) - t.tally - 3)
FROM ( --#MyTable
SELECT CAST('William Shakespeare was an English [##poet##], [##playwright##], and [##actor##], widely regarded as the greatest [##writer##] in the English language and the world''s pre-eminent [##dramatist##]. He is often called England''s national [##poet##] and the "Bard of Avon". His extant works, including some collaborations, consist of about 38 plays, 154 [##sonnets##], two long narrative [##poems##], and a few other [##verses##], of which the authorship of some is uncertain. His plays have been translated into every major living language and are performed more often than those of any other [##playwright##].' AS varchar(8000)) AS string
) AS test_data
INNER JOIN dbo.tally t ON
SUBSTRING(string, t.tally, 3) = '[##' AND
CHARINDEX('##]', string, t.tally + 3) > 0
Edit: Where "tally" is a standard tally table; I also named the column "tally". Naturally change that to match your own CTE or physical tally table.
SQL DBA,SQL Server MVP(07, 08, 09) "It's a dog-eat-dog world, and I'm wearing Milk-Bone underwear." "Norm", on "Cheers". Also from "Cheers", from "Carla": "You need to know 3 things about Tortelli men: Tortelli men draw women like flies; Tortelli men treat women like flies; Tortelli men's brains are in their flies".
May 2, 2015 at 3:10 am
For fun, yet another solution
😎
USE tempdb;
GO
SET NOCOUNT ON;
IF OBJECT_ID('dbo.TBL_SAMPLE_PARSE') IS NOT NULL
DROP TABLE dbo.TBL_SAMPLE_PARSE
CREATE TABLE dbo.TBL_SAMPLE_PARSE
(
SP_ID INT IDENTITY(1,1) NOT NULL CONSTRAINT PK_DBO_TBL_SAMPLE_PARSE_SP_ID PRIMARY KEY CLUSTERED
,MyString VARCHAR (8000) NOT NULL
)
INSERT dbo.TBL_SAMPLE_PARSE (MyString)
VALUES ('William Shakespeare was an English [##poet##], [##playwright##], and [##actor##], widely regarded as the greatest [##writer##] in the English language and the world''s pre-eminent [##dramatist##]. He is often called England''s national [##poet##] and the "Bard of Avon". His extant works, including some collaborations, consist of about 38 plays, 154 [##sonnets##], two long narrative [##poems##], and a few other [##verses##], of which the authorship of some is uncertain. His plays have been translated into every major living language and are performed more often than those of any other [##playwright##].')
;WITH T(N) AS (SELECT N FROM (VALUES (NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL)) AS X(N))
,BASE_DATA AS
(
SELECT
SP.SP_ID
,CHARINDEX('[##',SP.MyString,NM.N) AS POS_FROM
,CHARINDEX('##]',SP.MyString,NM.N) AS POS_TO
,SP.MyString
FROM dbo.TBL_SAMPLE_PARSE SP
CROSS APPLY
(
SELECT TOP (LEN(SP.MyString))
ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) AS N
FROM T T1,T T2,T T3,T T4
) AS NM
WHERE CHARINDEX('[##',SP.MyString,NM.N) > 0
AND
(
NM.N = CHARINDEX('[##',SP.MyString,NM.N)
OR
NM.N = CHARINDEX('##]',SP.MyString,NM.N)
)
)
SELECT DISTINCT
BD.SP_ID
,SUBSTRING(BD.MyString,BD.POS_FROM,(BD.POS_TO-BD.POS_FROM) + 3) AS TOKEN_STR
FROM BASE_DATA BD
WHERE BD.POS_TO > BD.POS_FROM;
Results
SP_ID TOKEN_STR
----------- -----------------
1 [##actor##]
1 [##dramatist##]
1 [##playwright##]
1 [##poems##]
1 [##poet##]
1 [##sonnets##]
1 [##verses##]
1 [##writer##]
October 20, 2015 at 12:40 pm
Thanks for your valuable inputs, @pietlinden/@Luis Cazares/@ScottPletcher/@Eirikur Eiriksson
SQLCurious
October 20, 2015 at 1:48 pm
Using Eirkur's sample data and the aforementioned splitter, you could do this:
UPDATE dbo.TBL_SAMPLE_PARSE
SET MyString = REPLACE(REPLACE(MyString,',',' '),'.',' ')
SELECT DISTINCT Item
FROM TBL_SAMPLE_PARSE
CROSS APPLY dbo.delimitedSplit8K(MyString, ' ')
WHERE item LIKE '%[##%##]%' ESCAPE '[';
Edit: fixed one minor typo in my code.
-- Itzik Ben-Gan 2001
Viewing 8 posts - 1 through 7 (of 7 total)
You must be logged in to reply to this topic. Login to reply