Looking for a word occurence count in a rowset scenario.

  • Does anyone know of a quick and dirty way to do this any faster? I'm trying to find the Number of occurences in a string of a specific word. It's a long story but I'm doing script evaluations for a third product that stores the scripts in my SQL db.

    IF OBJECT_ID ( 'tempdb..#tmp') IS NOT NULL

    DROP TABLE #tmp

    IF OBJECT_ID( 'tempdb..#Tally') IS NOT NULL

    DROP TABLE #Tally

    -- Slightly modified from tally article

    SELECT TOP 8000 --Largest Script Field

    IDENTITY(INT,1,1) AS N

    INTO #Tally

    FROM Master.dbo.SysColumns sc1,

    Master.dbo.SysColumns sc2

    ALTER TABLE #Tally

    ADD CONSTRAINT PK_Tally_N

    PRIMARY KEY CLUSTERED (N) WITH FILLFACTOR = 100

    CREATE TABLE #tmp (RowID INT IDENTITY( 1,1), SearchForString VARCHAR(20), SearchInString VARCHAR(2000))

    INSERT INTO #tmp ( SearchForString, SearchInString) VALUES ( 'quick', 'The quick brown dirty fox jumped quickly over the quick dog.')

    INSERT INTO #tmp ( SearchForString, SearchInString) VALUES ( 'duck', 'Duck Soup is better if you duck the ducks.')

    INSERT INTO #tmp ( SearchForString, SearchInString) VALUES ( 'department', 'I am tired of the department of redundancy department.')

    --SELECT * FROM #tmp

    SELECT

    RowID,

    COUNT(*) AS NumOccurence

    FROM

    (SELECT

    RowID

    FROM

    #tmp AS t

    CROSS JOIN

    #Tally AS tn

    WHERE

    CHARINDEX( SearchForString, SearchInString, tn.N) = tn.N

    AND tn.N <= LEN( SearchInString)

    ) AS drv

    GROUP BY

    RowID


    - Craig Farrell

    Never stop learning, even if it hurts. Ego bruises are practically mandatory as you learn unless you've never risked enough to make a mistake.

    For better assistance in answering your questions[/url] | Forum Netiquette
    For index/tuning help, follow these directions.[/url] |Tally Tables[/url]

    Twitter: @AnyWayDBA

  • Try this as well:

    IF OBJECT_ID(N'tempdb..#tmp') IS NULL

    BEGIN

    CREATE TABLE #tmp (

    RowID INT IDENTITY(1, 1),

    SearchForString VARCHAR(20),

    SearchInString VARCHAR(2000))

    INSERT INTO #tmp

    (SearchForString,

    SearchInString)

    VALUES ('quick',

    'The quick brown dirty fox jumped quickly over the quick dog.')

    INSERT INTO #tmp

    (SearchForString,

    SearchInString)

    VALUES ('duck',

    'Duck Soup is better if you duck the ducks.')

    INSERT INTO #tmp

    (SearchForString,

    SearchInString)

    VALUES ('department',

    'I am tired of the department of redundancy department.')

    END ;

    SELECT *

    FROM #tmp ;

    SELECT RowID,

    SearchForString,

    COUNT(*) AS QtyOccurances

    FROM #tmp

    CROSS APPLY (SELECT Number

    FROM dbo.Numbers

    WHERE Number <= LEN(SearchInString)) AS Numbers

    WHERE SearchForString = SUBSTRING(SearchInString, Number,

    LEN(SearchForString))

    GROUP BY RowID,

    SearchForString

    ORDER BY RowID ;

    I have a Numbers table, but you can plug in your #Tally there just as easily.

    - Gus "GSquared", RSVP, OODA, MAP, NMVP, FAQ, SAT, SQL, DNA, RNA, UOI, IOU, AM, PM, AD, BC, BCE, USA, UN, CF, ROFL, LOL, ETC
    Property of The Thread

    "Nobody knows the age of the human race, but everyone agrees it's old enough to know better." - Anon

  • Hm, nice alternative Gus, thanks. Never thought about using Cross Apply like that against Tally.

    I used the #Tally to make the code directly consumable. 🙂

    I'd originally avoided the substring figuring that the additional len calculation per Tally row would cause more heartache. Certainly can't hurt to take a swing with that though, after a second (and third) thought.

    I'll get back to you on that, appreciate the time.


    - Craig Farrell

    Never stop learning, even if it hurts. Ego bruises are practically mandatory as you learn unless you've never risked enough to make a mistake.

    For better assistance in answering your questions[/url] | Forum Netiquette
    For index/tuning help, follow these directions.[/url] |Tally Tables[/url]

    Twitter: @AnyWayDBA

  • (450 row(s) affected)

    Table '#Tally______________________________________________________________________________________________________________000000000193'. Scan count 450, logical reads 1350, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

    Table '#tmp________________________________________________________________________________________________________________000000000194'. Scan count 1, logical reads 5, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

    SQL Server Execution Times:

    CPU time = 32 ms, elapsed time = 31 ms.

    (450 row(s) affected)

    Table '#Tally______________________________________________________________________________________________________________000000000193'. Scan count 450, logical reads 1350, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

    Table '#tmp________________________________________________________________________________________________________________000000000194'. Scan count 1, logical reads 5, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

    SQL Server Execution Times:

    CPU time = 16 ms, elapsed time = 19 ms.

    Thanks Gus, your way was much faster, almost 2x faster in this case when I expanded the data set.

    Anyone who'd like to do their own benchmarking, find the code below.

    IF OBJECT_ID ( 'tempdb..#tmp') IS NOT NULL

    DROP TABLE #tmp

    IF OBJECT_ID( 'tempdb..#Tally') IS NOT NULL

    DROP TABLE #Tally

    -- Slightly modified from tally article

    SELECT TOP 8000 -- Largest string in question, usually build out to 100,000.

    IDENTITY(INT,1,1) AS N

    INTO #Tally

    FROM Master.dbo.SysColumns sc1,

    Master.dbo.SysColumns sc2

    ALTER TABLE #Tally

    ADD CONSTRAINT PK_Tally_N

    PRIMARY KEY CLUSTERED (N) WITH FILLFACTOR = 100

    CREATE TABLE #tmp (RowID INT IDENTITY( 1,1), SearchForString VARCHAR(20), SearchInString VARCHAR(2000))

    INSERT INTO #tmp ( SearchForString, SearchInString)

    SELECt

    drv.*

    FROM

    (SELECT'quick' AS sfs, 'The quick brown dirty fox jumped quickly over the quick dog.' AS sis UNION ALL SELECT

    'duck', 'Duck Soup is better if you duck the ducks.' UNION ALL SELECT

    'department', 'I am tired of the department of redundancy department.'

    )AS drv CROSS JOIN

    #tally AS tn

    WHERe

    tn.n <=150

    select * from #tmp

    --SELECT * FROM #tmp

    SET STATISTICS IO, TIME ON

    SELECT

    RowID,

    COUNT(*) AS NumOccurence

    FROM

    (SELECT

    RowID

    FROM

    #tmp AS t

    CROSS JOIN

    #Tally AS tn

    WHERE

    CHARINDEX( SearchForString, SearchInString, tn.N) = tn.N

    AND tn.N <= LEN( SearchInString)

    ) AS drv

    GROUP BY

    RowID

    SELECT RowID,

    SearchForString,

    COUNT(*) AS QtyOccurances

    FROM #tmp

    CROSS APPLY (SELECT N

    FROM #Tally

    WHERE N <= LEN(SearchInString)) AS Numbers

    WHERE SearchForString = SUBSTRING(SearchInString, N,

    LEN(SearchForString))

    GROUP BY RowID,

    SearchForString

    --ORDER BY RowID

    ;

    SET STATISTICS IO, TIME OFF


    - Craig Farrell

    Never stop learning, even if it hurts. Ego bruises are practically mandatory as you learn unless you've never risked enough to make a mistake.

    For better assistance in answering your questions[/url] | Forum Netiquette
    For index/tuning help, follow these directions.[/url] |Tally Tables[/url]

    Twitter: @AnyWayDBA

  • Here's a performance test I ran:

    SET NOCOUNT ON;

    IF OBJECT_ID(N'tempdb..#tmp') IS NOT NULL

    DROP TABLE #tmp ;

    IF OBJECT_ID(N'tempdb..#T1') IS NOT NULL

    DROP TABLE #T1 ;

    IF OBJECT_ID(N'tempdb..#T2') IS NOT NULL

    DROP TABLE #T2 ;

    CREATE TABLE #tmp (

    RowID INT IDENTITY(1, 1),

    SearchForString VARCHAR(20),

    SearchInString VARCHAR(2000)) ;

    GO

    INSERT INTO #tmp

    (SearchForString, SearchInString)

    VALUES ('quick',

    REPLICATE('The quick brown dirty fox jumped quickly over the quick dog.',

    10))

    , ('duck', 'Duck Soup is better if you duck the ducks.')

    , ('duck', 'Duck Soup is better if you duck the ducks.')

    , ('duck', 'Duck Soup is better if you duck the ducks.')

    , ('duck', 'Duck Soup is better if you duck the ducks.')

    , ('duck', 'Duck Soup is better if you duck the ducks.')

    , ('duck', 'Duck Soup is better if you duck the ducks.')

    , ('duck', 'Duck Soup is better if you duck the ducks.')

    , ('duck', 'Duck Soup is better if you duck the ducks.')

    , ('duck', 'Duck Soup is better if you duck the ducks.')

    , ('duck', 'Duck Soup is better if you duck the ducks.')

    , ('duck', 'Duck Soup is better if you duck the ducks.')

    , ('duck', 'Duck Soup is better if you duck the ducks.')

    , ('duck', 'Duck Soup is better if you duck the ducks.')

    , ('duck', 'Duck Soup is better if you duck the ducks.')

    , ('duck', 'Duck Soup is better if you duck the ducks.')

    , ('duck', 'Duck Soup is better if you duck the ducks.')

    , ('duck', 'Duck Soup is better if you duck the ducks.')

    , ('duck', 'Duck Soup is better if you duck the ducks.')

    , ('duck', 'Duck Soup is better if you duck the ducks.')

    , ('duck', 'Duck Soup is better if you duck the ducks.')

    , ('duck', 'Duck Soup is better if you duck the ducks.')

    , ('duck', 'Duck Soup is better if you duck the ducks.')

    , ('duck', 'Duck Soup is better if you duck the ducks.')

    , ('duck', 'Duck Soup is better if you duck the ducks.')

    , ('duck', 'Duck Soup is better if you duck the ducks.')

    , ('duck', 'Duck Soup is better if you duck the ducks.')

    , ('duck', 'Duck Soup is better if you duck the ducks.')

    , ('duck', 'Duck Soup is better if you duck the ducks.')

    , ('duck', 'Duck Soup is better if you duck the ducks.')

    , ('duck', 'Duck Soup is better if you duck the ducks.')

    , ('duck', 'Duck Soup is better if you duck the ducks.')

    , ('duck', 'Duck Soup is better if you duck the ducks.')

    , ('duck', 'Duck Soup is better if you duck the ducks.')

    , ('duck', 'Duck Soup is better if you duck the ducks.')

    , ('duck', 'Duck Soup is better if you duck the ducks.')

    , ('duck', 'Duck Soup is better if you duck the ducks.')

    , ('duck', 'Duck Soup is better if you duck the ducks.')

    , ('duck', 'Duck Soup is better if you duck the ducks.')

    , ('duck', 'Duck Soup is better if you duck the ducks.')

    , ('duck', 'Duck Soup is better if you duck the ducks.')

    , ('duck', 'Duck Soup is better if you duck the ducks.')

    , ('duck', 'Duck Soup is better if you duck the ducks.')

    , ('duck', 'Duck Soup is better if you duck the ducks.')

    , ('duck', 'Duck Soup is better if you duck the ducks.')

    , ('duck', 'Duck Soup is better if you duck the ducks.')

    , ('duck', 'Duck Soup is better if you duck the ducks.')

    , ('duck', 'Duck Soup is better if you duck the ducks.')

    , ('duck', 'Duck Soup is better if you duck the ducks.')

    , ('duck', 'Duck Soup is better if you duck the ducks.')

    , ('duck', 'Duck Soup is better if you duck the ducks.')

    , ('duck', 'Duck Soup is better if you duck the ducks.')

    , ('duck', 'Duck Soup is better if you duck the ducks.')

    , ('duck', 'Duck Soup is better if you duck the ducks.')

    , ('department',

    'I am tired of the department of redundancy department.') ;

    go 10000

    SET STATISTICS IO ON ;

    SET STATISTICS TIME ON ;

    SELECT RowID,

    COUNT(*) AS NumOccurence

    INTO #T1

    FROM (SELECT RowID

    FROM #tmp AS t

    CROSS JOIN dbo.Numbers AS tn

    WHERE CHARINDEX(SearchForString, SearchInString, tn.Number) = tn.Number

    AND tn.Number <= LEN(SearchInString)) AS drv

    GROUP BY RowID ;

    SELECT RowID,

    COUNT(*) AS NumOccurence

    INTO #T2

    FROM #tmp

    CROSS APPLY (SELECT Number

    FROM dbo.Numbers

    WHERE Number <= LEN(SearchInString)) AS Numbers

    WHERE SearchForString = SUBSTRING(SearchInString, Number,

    LEN(SearchForString))

    GROUP BY RowID ;

    SET STATISTICS TIME OFF ;

    SET STATISTICS IO OFF ;

    Results:

    Beginning execution loop

    Batch execution completed 10000 times.

    Table 'Numbers'. Scan count 550000, logical reads 1650000, physical reads 2, read-ahead reads 1, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

    Table '#tmp'. Scan count 1, logical reads 5162, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

    SQL Server Execution Times:

    CPU time = 22828 ms, elapsed time = 23968 ms.

    Table 'Numbers'. Scan count 550000, logical reads 1650000, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

    Table '#tmp'. Scan count 1, logical reads 5162, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

    SQL Server Execution Times:

    CPU time = 15063 ms, elapsed time = 16046 ms.

    The I/O stats are the same for both, but the Cross Apply version was consistently faster in the CPU. I tried a variety of sizes on the temp table, from 3 rows on up to 550,000.

    (The insert statement is one of the SQL 2008 new features. Won't matter for the tests.)

    I had both dump their data into temp tables, to avoid adding the overhead to the queries of returning a dataset to the client machine.

    - Gus "GSquared", RSVP, OODA, MAP, NMVP, FAQ, SAT, SQL, DNA, RNA, UOI, IOU, AM, PM, AD, BC, BCE, USA, UN, CF, ROFL, LOL, ETC
    Property of The Thread

    "Nobody knows the age of the human race, but everyone agrees it's old enough to know better." - Anon

  • Out of sheer curiousity I mucked with the build a bit to see if the substring or the crossapply was the determinator.

    The substring is the faster function. I didn't see any significant difference between the crossapply and the cross join activities.


    - Craig Farrell

    Never stop learning, even if it hurts. Ego bruises are practically mandatory as you learn unless you've never risked enough to make a mistake.

    For better assistance in answering your questions[/url] | Forum Netiquette
    For index/tuning help, follow these directions.[/url] |Tally Tables[/url]

    Twitter: @AnyWayDBA

  • ok how about this: it's simply doing a REPLACE on the original phrase, and using datalength to get the differences:

    No Tally, so i think it'll be faster than anything so far.

    SELECT RowID,

    --REPLACE(SearchInString,SearchForString,''),

    --SearchForString,

    --SearchInString,

    --datalength(SearchInString),

    --datalength(REPLACE(SearchInString,SearchForString,'')),

    (datalength(SearchInString) - datalength(REPLACE(SearchInString,SearchForString,''))) / datalength(SearchForString)

    FROM #tmp

    Lowell


    --help us help you! If you post a question, make sure you include a CREATE TABLE... statement and INSERT INTO... statement into that table to give the volunteers here representative data. with your description of the problem, we can provide a tested, verifiable solution to your question! asking the question the right way gets you a tested answer the fastest way possible!

  • :Wow::blink:

    Nice idea Lowell... faster by a magnitude of nearly 10. (Edit: Heh, whoops, sorry. Faster 10x at the CPU level, 50% of time at the final timing. I didn't run this in discard results settings so my guess is a lot of that is the return to screen of the data.)

    CHARINDEX Method

    (360000 row(s) affected)

    Table '#Tally______________________________________________________________________________________________________________0000000001AA'. Scan count 360000, logical reads 1080000, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

    Table '#tmp________________________________________________________________________________________________________________0000000001AB'. Scan count 5, logical reads 3455, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

    Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

    SQL Server Execution Times:

    CPU time = 15625 ms, elapsed time = 11448 ms.

    SUBSTRING Method

    (360000 row(s) affected)

    Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

    Table '#Tally______________________________________________________________________________________________________________0000000001AA'. Scan count 360000, logical reads 1080000, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

    Table '#tmp________________________________________________________________________________________________________________0000000001AB'. Scan count 5, logical reads 3455, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

    SQL Server Execution Times:

    CPU time = 12203 ms, elapsed time = 11158 ms.

    DATALENGTH method

    (360000 row(s) affected)

    Table '#tmp________________________________________________________________________________________________________________0000000001AB'. Scan count 1, logical reads 3455, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

    SQL Server Execution Times:

    CPU time = 1735 ms, elapsed time = 7074 ms.


    - Craig Farrell

    Never stop learning, even if it hurts. Ego bruises are practically mandatory as you learn unless you've never risked enough to make a mistake.

    For better assistance in answering your questions[/url] | Forum Netiquette
    For index/tuning help, follow these directions.[/url] |Tally Tables[/url]

    Twitter: @AnyWayDBA

  • Awesome idea Lowell. One quick note, (and this is probably completely unnecessary in this particular discussion, but to anyone that wants to implement this method elsewhere), is to watch the datatypes used with datalength.

    DECLARE @a varchar(20)

    DECLARE @b-2 nvarchar(20)

    SET @a = 'aaaaa'

    SET @b-2 = 'aaaaa'

    SELECT DATALENGTH(@a), DATALENGTH(@b)

    Seth Phelabaum


    Consistency is only a virtue if you're not a screwup. 😉

    Links: How to Post Sample Data[/url] :: Running Totals[/url] :: Tally Table[/url] :: Cross Tabs/Pivots[/url] :: String Concatenation[/url]

Viewing 9 posts - 1 through 8 (of 8 total)

You must be logged in to reply to this topic. Login to reply