Trying to iteratively use PATINDEX commands to comb through a large amount of text

  • Here's the scenario.

    I have a 10000 record table where each row contains a large amount of text. I want to do a PATINDEX search to determine where a specific word occurs (let's say "dog" as an example). Since the patindex command only finds the first instance, I want to iterate using substrings to keep finding additional hits.

    Right now this is the method I know how to do it (there may be a few little details missing, its mainly to show my thought process):

    1) Create a field that contains my current patindex position.

    2) Run the patindex command, update the table with the current patindex position for each row.

    3) Store the patindex position in another table (or wherever, not really important for this exercise).

    4) Create a dynamic sql query to do a substring on the text, starting at the current patindex position +1.

    5) Run another patindex command on this new substring. Update the current patindex position with the new position.

    6) Rinse and Repeat.

    This method will work, but I was wondering if anyone had a cleaner and better way to do this (perhaps that didn't involve dynamic sql).

  • First things that pop in to my head:

    1. CTE to identify occurrences of the desired string part

    2. Table valued function using the CTE and then using CROSS APPLY

    Perhaps others have better ideas?

  • I'm dreaming up a solution using tally tables... What is your desired output? Number of occurrences, or indexes of occurrences?

  • Ok here we go...

    You'll need to build a tally table. Jeff Moden has an excellent article about tally tables here[/url].

    After you have a tally table, create a test table with test data: (or just plug in your own table I suppose)

    select top 10000 identity(int,1,1) as id, Dog

    into DogTable

    from

    (

    select 'dogcatdogdogbirdape' as Dog

    union all

    select 'fishbirdnotadogoopsthatwasadog'

    union all

    select 'caninecanislupisdomesticus'

    )x

    cross join

    (select sc1.name from master.dbo.syscolumns sc1

    cross join

    master.dbo.syscolumns sc2

    ) y

    Now for the solution:

    select id, count(id)

    from

    (

    select d.id, d.dog, substring(d.dog, case when t.N > len(d.dog) then len(d.dog) else t.n end, 3) as test, n as foundindex

    from dogtable d

    cross join

    tally t

    where t.n < (select max(len(dog)) from dogtable)

    and substring(d.dog, case when t.N > len(d.dog) then len(d.dog) else t.n end, 3) = 'dog'

    )x

    group by id

    Run the inner view if you just want a list of indexes of the tested string.

    This could be easily turned into a table valued function, just read the length of your input string into the last parameter of substring

  • Sorry for necro'ing an old thread, but I'm trying to do something similar. The difference is that I'm looking for multiple strings (matching a pattern) of varying lengths. I've tried to modify the "dog" query, but am falling short of accomplishing what I want to accomplish. Can anyone help shed light?

    FYI, I'm trying to search through stored procs and other objects for instances of linked servers, which follow a standard of "LS-%".

    select object_name(s.id) Object, substring(s.text, n, charindex(' ', text, n) - n) as Target

    from syscomments s

    cross join

    tally t

    where t.n < (select max(len(text)) from syscomments)

    and lower(substring(s.text, case when t.N > len(s.text) then len(s.text) else t.n end, case when t.N > len(s.text) then 5 else len(s.text) - n end)) like 'ls-__'

    order by object, target

  • That code's written for SQL 2000; I suggest completely ignoring it.

    Instead, the code below should give you a list of object names and their full text reasonably quickly.

    If you want to add in the part about showing the specific text that matches, let me know. Just be aware that'll be slower -- perhaps much slower?!

    IF OBJECT_ID('tempdb..#search_strings') IS NOT NULL

    DROP TABLE #search_strings

    CREATE TABLE #search_strings (

    search_id int IDENTITY(1, 1) NOT NULL,

    search_string nvarchar(40) NOT NULL

    )

    INSERT INTO #search_strings

    SELECT 'linked_server_name1'

    --UNION ALL SELECT 'linked_server_name2'

    --UNION ALL ...

    SELECT

    OBJECT_NAME(object_id) AS Object_Name,

    search_string,

    (SELECT definition FROM sys.sql_modules WHERE object_id = derived.object_id) AS Object_Text

    FROM (

    SELECT DISTINCT

    sm.object_id,

    ss.search_string

    FROM sys.sql_modules sm

    CROSS JOIN #search_strings AS ss

    WHERE

    CAST(OBJECTPROPERTYEX(sm.object_id, 'IsProcedure') AS int) = 1 AND

    sm.definition LIKE '%' + ss.search_string + '%'

    ) AS derived

    ORDER BY

    Object_Name

    SQL DBA,SQL Server MVP(07, 08, 09) "It's a dog-eat-dog world, and I'm wearing Milk-Bone underwear." "Norm", on "Cheers". Also from "Cheers", from "Carla": "You need to know 3 things about Tortelli men: Tortelli men draw women like flies; Tortelli men treat women like flies; Tortelli men's brains are in their flies".

  • Thanks for the code! It works great if I know what I'm searching for, but unfortunately to be useful in my current situation it would have to return the actual match. So I could search for "ls-" and it would return "ls-app1" and "ls-app2", etc.

  • The code uses LIKE to compare the string, so just construct your string accordingly.

    For example, you could use 'ls-' as the search string. Or, if you want a closer match, then:

    ls-app[0-9]

    or something similar.

    I'll add code to see the actual part(s) of the text that match if you want.

    SQL DBA,SQL Server MVP(07, 08, 09) "It's a dog-eat-dog world, and I'm wearing Milk-Bone underwear." "Norm", on "Cheers". Also from "Cheers", from "Carla": "You need to know 3 things about Tortelli men: Tortelli men draw women like flies; Tortelli men treat women like flies; Tortelli men's brains are in their flies".

  • ScottPletcher (8/13/2013)


    That code's written for SQL 2000; I suggest completely ignoring it.

    I wouldn't ignore it because it was written for SQL Server 2000... I'd ignore it because it solves the wrong problem for the latest request.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • ScottPletcher (8/14/2013)


    The code uses LIKE to compare the string, so just construct your string accordingly.

    <snip>

    I'll add code to see the actual part(s) of the text that match if you want.</snip>

    Yes, please!

    I don't even know where to begin for that...

  • CleverSQLUserID (8/21/2013)


    ScottPletcher (8/14/2013)


    The code uses LIKE to compare the string, so just construct your string accordingly.

    <snip>

    I'll add code to see the actual part(s) of the text that match if you want.</snip>

    Yes, please!

    I don't even know where to begin for that...

    Here's an example; naturally change the "backup" bytes and the total bytes listed to whatever you prefer:

    IF OBJECT_ID('tempdb..#search_strings') IS NOT NULL

    DROP TABLE #search_strings

    CREATE TABLE #search_strings (

    search_id int IDENTITY(1, 1) NOT NULL,

    search_string nvarchar(40) NOT NULL

    )

    INSERT INTO #search_strings

    SELECT 'linked_server_name1'

    --UNION ALL SELECT 'linked_server_name2'

    --UNION ALL ...

    SELECT

    OBJECT_NAME(object_id) AS Object_Name,

    search_string,

    --list 100 bytes of the definition, starting 40 bytes before the string itself was found

    (SELECT SUBSTRING(definition, derived.string_location - 40, LEN(search_string) + 100) FROM sys.sql_modules WHERE object_id = derived.object_id) AS Object_Text

    FROM (

    SELECT

    sm.object_id,

    ss.search_string,

    PATINDEX('%' + ss.search_string + '%', sm.definition) AS string_location

    FROM sys.sql_modules sm

    CROSS JOIN #search_strings AS ss

    WHERE

    CAST(OBJECTPROPERTYEX(sm.object_id, 'IsProcedure') AS int) = 1 AND

    sm.definition LIKE '%' + ss.search_string + '%'

    ) AS derived

    ORDER BY

    Object_Name

    SQL DBA,SQL Server MVP(07, 08, 09) "It's a dog-eat-dog world, and I'm wearing Milk-Bone underwear." "Norm", on "Cheers". Also from "Cheers", from "Carla": "You need to know 3 things about Tortelli men: Tortelli men draw women like flies; Tortelli men treat women like flies; Tortelli men's brains are in their flies".

  • CleverSQLUserID (8/5/2013)


    Sorry for necro'ing an old thread, but I'm trying to do something similar. The difference is that I'm looking for multiple strings (matching a pattern) of varying lengths. I've tried to modify the "dog" query, but am falling short of accomplishing what I want to accomplish. Can anyone help shed light?

    FYI, I'm trying to search through stored procs and other objects for instances of linked servers, which follow a standard of "LS-%".

    select object_name(s.id) Object, substring(s.text, n, charindex(' ', text, n) - n) as Target

    from syscomments s

    cross join

    tally t

    where t.n < (select max(len(text)) from syscomments)

    and lower(substring(s.text, case when t.N > len(s.text) then len(s.text) else t.n end, case when t.N > len(s.text) then 5 else len(s.text) - n end)) like 'ls-__'

    order by object, target

    Heh... I guess I'm necro'ing a necro'd thread but let me ask... are you all set on this?

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

Viewing 12 posts - 1 through 11 (of 11 total)

You must be logged in to reply to this topic. Login to reply