Query Performance

  • What is the purpose of this stored procedure? Is it to find companies with similar names to the company passed in as a parameter?

    Did you try the code I posted? Note that the index is important.

    “Write the query the simplest way. If through testing it becomes clear that the performance is inadequate, consider alternative query forms.” - Gail Shaw

    For fast, accurate and documented assistance in answering your questions, please read this article.
    Understanding and using APPLY, (I) and (II) Paul White
    Hidden RBAR: Triangular Joins / The "Numbers" or "Tally" Table: What it is and how it replaces a loop Jeff Moden

  • ChrisM@Work - Thursday, January 3, 2019 7:07 AM


    Yes sir. It is looking for similar names; based on the output, it picks the best-matching ones and performs some calculations.
    I tried running your code (with the index already created as you suggested), but I get an error:

    SELECT TOP(1500)
        x.nameCacheId,
        x.count
    FROM #qgram q
    CROSS APPLY (
        SELECT
            g.gram,
            g.nameCacheId,
            SUM(CASE WHEN g.count > q.count THEN q.count ELSE g.count END) AS count
        FROM SearchSuggest_QGram_tbl g
        WHERE g.gram = q.gram
        GROUP BY g.gram, g.nameCacheId
    ) x
    WHERE x.count >= @minLen
    ORDER BY 2 DESC

    Msg 8124, Level 16, State 1, Line 80
    Multiple columns are specified in an aggregated expression containing an outer reference. If an expression being aggregated contains an outer reference, then that outer reference must be the only column referenced in the expression.
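
    One possible workaround for Msg 8124 (a sketch only, untested against this schema) is to pre-compute the per-row minimum through a nested CROSS APPLY (VALUES ...), so that the aggregated expression no longer mixes the outer reference q.count with the inner column g.count:

    SELECT TOP(1500)
        x.nameCacheId,
        x.[count]
    FROM #qgram q
    CROSS APPLY (
        SELECT
            g.nameCacheId,
            SUM(m.v) AS [count]            -- m.v is a plain column here, so the
                                           -- aggregate holds no outer reference
        FROM SearchSuggest_QGram_tbl g
        CROSS APPLY (VALUES (CASE WHEN g.count > q.count
                                  THEN q.count ELSE g.count END)) m (v)
        WHERE g.gram = q.gram
        GROUP BY g.nameCacheId
    ) x
    WHERE x.[count] >= @minLen
    ORDER BY x.[count] DESC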

  • chandan_jha18 - Thursday, January 3, 2019 7:02 AM

    Jeff Moden - Wednesday, January 2, 2019 9:52 AM

    chandan_jha18 - Wednesday, January 2, 2019 6:49 AM

    Hello All,

    Wish you a very happy new year 2019!!

    I have been asked to review the performance of a procedure, and I am reproducing part of it here. There are two similar queries in this procedure, and combined they take up to 12-15 seconds on a big server (the DR of production, with hardly any resource consumption).
    Since both queries are similar, I am posting one of them here.

    Even though it is a nested query, when I tried to evaluate it independently using temp tables for better visualization, there was hardly any performance improvement. I looked into the query to see if I could modify it, but did not have much success.

    Could one of the gurus see if there is a way to rewrite this so that the performance improves? I am including the table schema along with the indexes and the execution plan. Please let me know if I can frame my question in a better way or supply more information.

    Thanks in advance!!

             
    DECLARE @maxResults INT = 25,
            @companyName VARCHAR(100) = N'Northwestern Corp Discount Commercial Paper 4/2 Yrs 3&4',
            @issuerCode NVARCHAR(6) = ''

    DECLARE @searchText NVARCHAR(300)
    DECLARE @issuerCodeType INT
    DECLARE @searchTextType INT

    SELECT @searchText = dbautil.dbo.stripAutoSuggestCompanyName_fn(@companyName)

    --SELECT @searchText = @companyName

    -- Get company by QGrams
    DECLARE @minLen INT
    DECLARE @Q INT
    SET @Q = 3
    SET @minLen = 4

    IF LEN(@searchText) < 6
    BEGIN
        SET @minLen = 2
    END

    CREATE TABLE #qgram
    (
        gram NCHAR(3) NOT NULL PRIMARY KEY CLUSTERED
    ,   [count] INT NOT NULL
    )

    DECLARE @SearchTextLength INT
    SET @SearchTextLength = LEN(@searchText)

    -- Split the padded search text into overlapping @Q-character grams
    INSERT INTO #qgram
    SELECT
        ISNULL(gram, '') AS gram
    ,   ISNULL(COUNT(*), 1) AS [count]
    FROM
    (
        SELECT CONVERT(NCHAR(3), SUBSTRING(SUBSTRING('###', 1, @Q-1) + @searchText + SUBSTRING('%%%', 1, @Q-1), n.n, @Q)) AS gram
        FROM dbautil.dbo.numbers_tbl n (NOLOCK)
        WHERE n.n <= @SearchTextLength + @Q - 2
    ) sub
    GROUP BY gram
    --option(recompile)

        
    DECLARE @maxMatch INT
    SELECT @maxMatch = SUM(count) FROM #qgram

    -- Score candidates: per-name sum of matched gram counts, then a similarity ratio
    SELECT sub2.fundOrComp, sub2.nameCacheId, sub2.ratio, ROW_NUMBER() OVER (PARTITION BY sub2.companyId ORDER BY ratio DESC) AS rn
    FROM
    (
        SELECT TOP(@maxResults)
            1 AS fundOrComp,
            sub.nameCacheId,
            ((sub.count * 1.0 / (qn.strippedNameLength + @Q - 2)) * (sub.count * 1.0 / @maxMatch)) AS ratio,
            qn.companyId
        FROM
        (
            SELECT TOP 1500
                g.nameCacheId,
                SUM(CASE WHEN g.count > q.count THEN q.count ELSE g.count END) AS count
            FROM #qgram q
            INNER JOIN SearchSuggest_QGram_tbl g (NOLOCK) ON g.gram = q.gram
            GROUP BY g.nameCacheId
            HAVING SUM(CASE WHEN g.count > q.count THEN q.count ELSE g.count END) >= @minLen
            ORDER BY 2 DESC, 1 ASC
        ) sub
        INNER JOIN SearchSuggest_QGramNameCache_tbl qn (NOLOCK) ON qn.nameCacheId = sub.nameCacheId
        ORDER BY ratio DESC
    ) sub2

     
    DROP TABLE #qgram

    It would really have helped if you had identified what the actual purpose of the query is.  So, for better understanding by all, let me see if I get it right.

    In this particular bit of code, it would appear that you're retrieving 51 "key items" that uniquely identify a company and then storing that information in a temp table.

    After that, you're trying to figure out which rows in a much larger table provide a match for those 51 "grams" so you can return only those nameCacheIds that are actually a match for the given company.

    Is that correct?  If not, please expound.

    Sir, here is what I got from the developer as per his document:

    1)     When the search text is entered, the length of the search text is calculated and then the search string is broken into multiple small strings of 3 characters and stored in a table called qgram
    Example: If I pass the search text as 'Legg Mason Partners', we will get the following results:

    Results:
    gram    count
     Ma     1
     Pa     1
    ##L     1
    #Le     1

    etc.

    Count indicates the number of times the split string is present in the search string.
    2)     The qgram table and SearchSuggest_QGram_tbl are combined based on the gram column.
    3)     After qgram and SearchSuggest_QGram_tbl are combined, the rows are grouped by nameCacheId, the count values are summed up, and the top 1500 rows are selected.
    Since we have a count column in both tables, the column with the lesser value is used when adding up the count values while grouping by nameCacheId.
    Example:
    nameCacheId    count
    5359401        9
    512010         6
    2545081        8
    2875285        11
    1934139        10
    4)     The result fetched in step 3 is combined with SearchSuggest_QGramNameCache_tbl based on nameCacheId.

    5)     From the result set obtained in step 4, the top 500 records are fetched, which contain the company id, ratio, namecache id, and strippedNameLength (company or fund name).
    6)     The ratio for each company id is calculated as below (a worked sketch follows this list):
    (count (obtained in step 3) * 1.0 / (strippedNameLength + @Q - 2)) * (count (obtained in step 3) * 1.0 / sum of the count values obtained earlier)
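
    To make step 6 concrete, here is a worked sketch with invented values (@Q = 3; all values below are placeholders, not data from the real tables):

    DECLARE @Q INT = 3
    DECLARE @maxMatch INT = 19            -- total gram count of the search text
    DECLARE @matched INT = 15             -- summed minimum gram matches for one name
    DECLARE @strippedNameLength INT = 18  -- length of the stored, stripped name

    SELECT ratio = (@matched * 1.0 / (@strippedNameLength + @Q - 2))
                 * (@matched * 1.0 / @maxMatch)
    -- = (15 / 19) * (15 / 19) ≈ 0.6233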

    Can you see if this matches your observation?
    Regards
    Chandan Jha

    I can see why they may have a short entry like "L" or "Le" in the NGram patterns (although I don't necessarily agree; "It Depends"), but I don't understand what the purpose of the "##" and the "#" is.  If they're what I think they're for (NGrams for the leading word, or NGrams for every word), that could provide one of the "early filters" in the method I use for such "matching".

    As for whether or not this is in line with what I thought: yes.  NGrams aren't any different than a KeyWord "thumbprint" search if the NGrams have been stored in a similar fashion, and they would be for a KeyWord "thumbprint" search.

    I am a little concerned, though.  You say in 5) above that you return the "top 500 rows".  It would seem that someone would need to do further processing to determine the actual "best fit" and isolate which company name(s) would make good matches.  I believe we might be able to produce the "best fit" without that extra processing, but I don't know how many out of (for example) 51 NGrams would have to match before something qualified as a "Best Fit", and knowing that is important to being able to quickly reject poor candidates for the "Best Fit".

    Like I said, I also need to know what the purpose/use of the ## and # characters is.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Jeff Moden - Thursday, January 3, 2019 8:45 AM


    Sir, I will try to find out about the ### and # characters. As for your concern about the top 500 rows: from the query and the execution plan I posted earlier, and looking at it again, it appears not to matter whether I select the top 1,500 or the top 15,000, because the data retrieved by the index-seek joins is evaluated for the sum and then ordered (the TOP operator is costly in the plan). Most of the cost is spent by that point; after that, how many rows we show with the TOP clause from an already ordered set does not matter much.
    Sorry if my understanding is poor here. I am learning that an index or statistics change cannot work magic every time. The query looks simple, and if I keep the business logic aside (at times we cannot change the table schema for a redesign), then just by looking at the query and plan, what problem statement can be framed here, like the ones in examples posted by you gurus: "Kid, this is not a set-based operation, you are using RBAR"?

    Best Regards
    Chandan

  • chandan_jha18 - Wednesday, January 2, 2019 8:48 AM

    Eirikur Eiriksson - Wednesday, January 2, 2019 8:21 AM


    Start by getting rid of the nolock hints, the implicit conversions and the parallel execution, run the query again and post the execution plan.
    😎

    The implicit conversions are happening in very small tables. I can try removing the nolock hints and adding MAXDOP 1 to remove parallelism, if that's what you mean by parallel execution.

    The thing here is that although the table is small, the rate of the conversion is X * N, where X is the cardinality of the large table and N is the cardinality of the small table. To solve this kind of problem, one has to isolate it, as any background noise can skew the perceived results.
    😎
    Parallel execution can easily cost more than the execution of the query and the allocation scan implied with the nolock hint can further skew things!

  • Eirikur Eiriksson - Thursday, January 3, 2019 9:34 AM


    Sir, I tried removing the nolock hints, but there was no improvement. Please let me know how to avoid parallel execution here. Do you mean a MAXDOP hint at the end?

  • chandan_jha18 - Thursday, January 3, 2019 11:02 AM


    No... none of that is actually going to work.  I mean you can try it (add OPTION (MAXDOP 1) to the last query), but removing parallel execution here isn't the key.  The thing that's causing the need for the parallel execution needs to be fixed.  That's the icon in the execution plan you posted that has 90 million rows coming out of it.
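
    For reference, the hint Jeff mentions attaches to a single statement and forces a serial plan for that statement only. A minimal, self-contained illustration (sys.objects merely stands in for the real query):

    SELECT TOP (10) o.object_id, o.name
    FROM sys.objects o
    ORDER BY o.object_id
    OPTION (MAXDOP 1)   -- serial execution for this statement only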

    --Jeff Moden

  • chandan_jha18 - Thursday, January 3, 2019 9:23 AM


    You're not misunderstanding a thing.  I agree that the "500 rows" being returned is trivial compared to the initial code.  What I'm trying to determine is how to make the stuff I know compatible with the stuff you need.  It also helps me define the overall goal of the code.  And, no... we won't need a table change here or any massive functional change.  I'm thinking I can shoehorn all that into place.  The only big thing left is to figure out what the ## and # are meant to imply.  It may be that they could be used as a further shortcut, but it could also mean they should be totally ignored or replaced.

    --Jeff Moden

  • Jeff Moden - Thursday, January 3, 2019 3:57 PM


    Sir, this is the response I got from the developer regarding ##:
    "They are filling empty spaces. Like, they want to make strings like ##M, #MI, MIC, ICR for Microsoft."

  • chandan_jha18 - Friday, January 4, 2019 6:38 AM


    Understood!  Thanks, Chandan.  I think this will all fit.  I don't have your data, but I believe I can come up with an example.

    And while I very much appreciate your fine manners and sense of humility, you don't need to call me "sir".  "Jeff" will be fine.

    --Jeff Moden

  • Jeff Moden - Friday, January 4, 2019 7:37 AM


    The '#' and '%' characters add weight to the leading and trailing letters of the searched string, Sir.
    I've set up a sample data generator which will make use of any table containing company names, human names or whatever, to be subsequently used for code testing:

    IF OBJECT_ID('tempdb..#SearchSuggest_QGramNameCache_tbl') IS NOT NULL DROP TABLE #SearchSuggest_QGramNameCache_tbl

    SELECT TOP(9312500) ID, CompanyName = RTRIM(FirstName) + ' ' + RTRIM(LastName)
    INTO #SearchSuggest_QGramNameCache_tbl
    FROM [Yourtable]
    WHERE FirstName > ' ' OR LastName > ' '
    -- (5,456,289 rows affected)

    CREATE UNIQUE CLUSTERED INDEX ucx_Stuff ON #SearchSuggest_QGramNameCache_tbl (ID)

    IF OBJECT_ID('tempdb..#SearchSuggest_QGram_tbl') IS NOT NULL DROP TABLE #SearchSuggest_QGram_tbl

    DECLARE @Q INT = 3;
    WITH Tally AS (
        -- 100-row inline tally (10 x 10), enough for names up to 99 characters
        SELECT n = ROW_NUMBER() OVER (ORDER BY (SELECT NULL))
        FROM (SELECT n FROM (VALUES (1),(1),(1),(1),(1),(1),(1),(1),(1),(1)) d (n)) a,
             (SELECT n FROM (VALUES (1),(1),(1),(1),(1),(1),(1),(1),(1),(1)) d (n)) b
    )
    SELECT f.ID, x.n, x.gram
    INTO #SearchSuggest_QGram_tbl
    FROM #SearchSuggest_QGramNameCache_tbl f
    CROSS APPLY (
        SELECT n, CONVERT(NCHAR(3), SUBSTRING(SUBSTRING('###', 1, @Q-1) + f.CompanyName + SUBSTRING('%%%', 1, @Q-1), n.n, @Q)) AS gram
        FROM Tally n
        WHERE n.n <= LEN(f.CompanyName) + @Q - 2
    ) x -- (74,045,046 rows affected)

    CREATE CLUSTERED INDEX cx_Stuff ON #SearchSuggest_QGram_tbl (gram, ID)

  • Lastly, here's a chunk of code which models the problem domain, albeit slightly simplified from the original. I suspect that the simplifications could be usefully employed:


    DECLARE @searchText VARCHAR(100) = 'Michaela Broughton'
    DECLARE @SearchTextLength INT = LEN(@searchText)
    DECLARE @Q INT = 3

    -- Preview the padded search string
    SELECT SUBSTRING('###', 1, @Q-1) + @searchText + SUBSTRING('%%%', 1, @Q-1)

    IF OBJECT_ID('tempdb..#qgram') IS NOT NULL DROP TABLE #qgram

    SELECT
        gram    = ISNULL(gram, ''),
        [count] = ISNULL(COUNT(*), 1),
        n       = MIN(n)
    INTO #qgram
    FROM ( -- sub
        SELECT n, CONVERT(NCHAR(3), SUBSTRING(SUBSTRING('###', 1, @Q-1) + @searchText + SUBSTRING('%%%', 1, @Q-1), Tally.n, @Q)) AS gram
        FROM ( -- 100-row inline tally
            SELECT n = ROW_NUMBER() OVER (ORDER BY (SELECT NULL))
            FROM (SELECT n FROM (VALUES (1),(1),(1),(1),(1),(1),(1),(1),(1),(1)) d (n)) a,
                 (SELECT n FROM (VALUES (1),(1),(1),(1),(1),(1),(1),(1),(1),(1)) d (n)) b
        ) Tally
        WHERE Tally.n <= @SearchTextLength + @Q - 2
    ) sub
    GROUP BY gram

    CREATE UNIQUE CLUSTERED INDEX ucx_Stuff ON #qgram (gram)

    SELECT f.*, pm.cnt
    FROM (
        SELECT TOP(1500)
            g.ID, cnt = COUNT(*)
        FROM #qgram q
        INNER JOIN #SearchSuggest_QGram_tbl g
            ON g.gram = q.gram
        GROUP BY g.ID
        ORDER BY COUNT(*) DESC
    ) pm
    INNER JOIN #SearchSuggest_QGramNameCache_tbl f
        ON f.ID = pm.ID
    ORDER BY cnt DESC
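
    If the original procedure's ratio is wanted on top of this simplified shape, it could be layered on like this (a sketch only: LEN(f.CompanyName) stands in for strippedNameLength, @maxMatch is rebuilt from #qgram as in the original, and pm.cnt counts matching gram rows rather than summing minimum counts):

    DECLARE @Q INT = 3
    DECLARE @maxMatch INT = (SELECT SUM([count]) FROM #qgram)

    SELECT f.ID, f.CompanyName,
           ratio = (pm.cnt * 1.0 / (LEN(f.CompanyName) + @Q - 2))
                 * (pm.cnt * 1.0 / @maxMatch)
    FROM (
        SELECT TOP(1500) g.ID, cnt = COUNT(*)
        FROM #qgram q
        INNER JOIN #SearchSuggest_QGram_tbl g ON g.gram = q.gram
        GROUP BY g.ID
        ORDER BY COUNT(*) DESC
    ) pm
    INNER JOIN #SearchSuggest_QGramNameCache_tbl f ON f.ID = pm.ID
    ORDER BY ratio DESC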


  • ChrisM@Work - Friday, January 4, 2019 7:48 AM

    Lastly, here's a chunk of code which models the problem domain, albeit slightly simplified from the original. I suspect that the simplifications could be usefully employed:

    Thank you, Chris. I have been trying to digest this and will hopefully find a clue in your code above, although my brain won't be as fast as yours at interpreting it smoothly at first.

    Best Regards
    Chandan Jha

  • chandan_jha18 - Monday, January 7, 2019 8:11 AM


    You'll be just fine. 
    The first code chunk splits the variable into 3-character pieces or "grams" (after adding top & tail elements), which with the sample I provided yields this:

    gram count n
    ##M 1 1
    #Mi 1 2
    Mic 1 3
    ich 1 4
    cha 1 5
    hae 1 6
    ael 1 7
    ela 1 8
    la  1 9
    a B 1 10
     Br 1 11
    Bro 1 12
    rou 1 13
    oug 1 14
    ugh 1 15
    ght 1 16
    hto 1 17
    ton 1 18
    on% 1 19

    The second part of the code joins this result set to #SearchSuggest_QGram_tbl (230M rows) on [gram], then counts the matching rows per ID. A high number of matching grams (rows) indicates that the company name (held in #SearchSuggest_QGramNameCache_tbl) is similar or identical to the variable.
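
    A quick sanity check of the split (a sketch against the #qgram table built above): every gram is @Q characters wide, and the counts sum to LEN(@searchText) + @Q - 2, which is 19 for 'Michaela Broughton' with @Q = 3.

    SELECT grams = COUNT(*),        -- 19 distinct grams for this sample
           total = SUM([count])     -- 19 = LEN(@searchText) + @Q - 2
    FROM #qgram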

