January 3, 2019 at 7:07 am
What is the purpose of this stored procedure? Is it to find companies with similar names to the company passed in as a parameter?
Did you try the code I posted? Note that the index is important.
For fast, accurate and documented assistance in answering your questions, please read this article.
Understanding and using APPLY, (I) and (II) Paul White
Hidden RBAR: Triangular Joins / The "Numbers" or "Tally" Table: What it is and how it replaces a loop Jeff Moden
January 3, 2019 at 8:04 am
ChrisM@Work - Thursday, January 3, 2019 7:07 AM
What is the purpose of this stored procedure? Is it to find companies with similar names to the company passed in as a parameter?
Did you try the code I posted? Note that the index is important.
Yes sir. It is looking for similar names; based on the output it picks the best-matching ones and performs some calculations.
I tried running your code (the index was already created as you suggested), and I get an error:
SELECT TOP(1500)
x.nameCacheId,
x.count
FROM #qgram q
CROSS APPLY (
SELECT
g.gram,
g.nameCacheId,
SUM(CASE WHEN g.count > q.count THEN q.count ELSE g.count END) AS count
FROM SearchSuggest_QGram_tbl g
WHERE g.gram = q.gram
GROUP BY g.gram, g.nameCacheId
) x
WHERE x.count >= @minLen
ORDER BY 2 DESC
Msg 8124, Level 16, State 1, Line 80
Multiple columns are specified in an aggregated expression containing an outer reference. If an expression being aggregated contains an outer reference, then that outer reference must be the only column referenced in the expression.
January 3, 2019 at 8:45 am
chandan_jha18 - Thursday, January 3, 2019 7:02 AM
Jeff Moden - Wednesday, January 2, 2019 9:52 AM
chandan_jha18 - Wednesday, January 2, 2019 6:49 AM
Hello All,
Wish you a very happy new year 2019!!
I have been asked to review the performance of a procedure and I am reproducing a part of it. There are two similar queries in this procedure and, combined, they take up to 12-15 seconds on a big server (a DR of production with hardly any resource consumption).
Since both queries are similar, I am putting one of them here. Even though it is a nested query, when I tried to evaluate it independently using temp tables for better visualization, there was hardly any performance improvement. I tried to look into the query to see if I could modify it a bit, but did not find much success.
Can one of the gurus see if there is a way to rewrite this so that performance improves? I am including the table schema along with indexes and the execution plan. Please let me know if I can frame my question in a better way or supply more information.
Thanks in advance!!
declare @maxResults INT = 25,
@companyName varchar(100)=N'Northwestern Corp Discount Commercial Paper 4/2 Yrs 3&4',
@issuerCode nvarchar(6)=''
DECLARE @searchText NVARCHAR(300)
DECLARE @issuerCodeType int
DECLARE @searchTextType int
SELECT @searchText = dbautil.dbo.stripAutoSuggestCompanyName_fn(@companyName)
--SELECT @searchText = @companyName
-- Get company by QGrams
DECLARE @minLen int
DECLARE @Q int
SET @Q = 3
SET @minLen = 4
IF LEN(@searchText) < 6
BEGIN
SET @minLen = 2
END
Create table #qgram
(
gram NCHAR(3) NOT NULL PRIMARY KEY CLUSTERED
, [count] INT NOT NULL
)
DECLARE @SearchTextLenght INT
SET @SearchTextLenght = LEN(@searchText)
INSERT INTO #qgram
SELECT
ISNULL(gram, '') as gram
, ISNULL(COUNT(*), 1) AS [count]
FROM
(
SELECT CONVERT(NCHAR(3), SUBSTRING(SUBSTRING('###', 1, @Q-1) + @searchText + SUBSTRING('%%%', 1, @Q-1), n.n, @Q)) AS gram
FROM dbautil.dbo.numbers_tbl n (NOLOCK)
WHERE n.n <= @SearchTextLenght + @Q - 2
) sub
GROUP BY gram
--option(recompile)
DECLARE @maxMatch INT
SELECT @maxMatch = SUM(count) FROM #qgram
SELECT sub2.fundOrComp, sub2.nameCacheId, sub2.ratio ,row_number() over(partition by sub2.companyid order by ratio desc) as rn
FROM
(
SELECT TOP(@maxResults) 1 as fundOrComp,sub.nameCacheId,
((sub.count * 1.0 /(qn.strippedNameLength + @Q - 2)) * (sub.count * 1.0/ @maxMatch)) AS ratio,qn.companyId
FROM
(
SELECT TOP 1500 g.nameCacheId,SUM (CASE WHEN g.count > q.count THEN q.count ELSE g.count END ) AS count
FROM #qgram q INNER JOIN SearchSuggest_QGram_tbl g (NOLOCK) ON g.gram = q.gram GROUP BY g.nameCacheId HAVING
SUM (CASE WHEN g.count > q.count THEN q.count ELSE g.count END ) >= @minLen ORDER BY 2 DESC, 1 ASC
) sub
INNER JOIN SearchSuggest_QGramNameCache_tbl qn (NOLOCK) ON qn.nameCacheId = sub.nameCacheId
ORDER BY ratio DESC
) sub2
drop table #qgram
It would really have helped if you had identified what the actual purpose of the query is. So, for better understanding by all, let me see if I get it right.
In this particular bit of code, it would appear that you're retrieving 51 "key items" that uniquely identify a company and then storing that information in a temp table.
After that, you're trying to figure out which rows in a much larger table provide a match of those 51 "grams" so you can return only those nameCacheId that are actually a match for the given company.
Is that correct? If not, please expound.
Sir, here is what I got from the developer as per his document:
1) When the search text is entered, the length of the search text is calculated and then the search string is broken into multiple small strings of 3 characters and stored in a table called qgram
Example: If I pass the search text as 'Legg Mason Partners', we will get the following results:
gram count
 Ma 1
 Pa 1
##L 1
#Le 1
etc.
- Count indicates the number of times the split string is present in the search string.
2) The #qgram table and SearchSuggest_QGram_tbl are combined based on the gram column.
3) After #qgram and SearchSuggest_QGram_tbl are combined, the rows are grouped by nameCacheId and the count values are summed up, and the top 1500 rows are selected.
Since we have a count column in both tables, the smaller of the two values is used when summing the counts while grouping by nameCacheId.
Example
nameCacheId count
5359401 9
512010 6
2545081 8
2875285 11
1934139 10
4) The result fetched in step 3 is combined with SearchSuggest_QGramNameCache_tbl based on 'nameCacheId'.
5) From the result set obtained in step 4, the top 500 records are fetched, containing company id, ratio, namecache id, and strippednamelength (company or fund name).
6) The ratio for each company id is calculated as below:
(count (obtained in step 3) * 1.0 / (length(companyname fetched in step 4) + 1)) * count (obtained in step 3) * 1.0 / sum of the count values obtained earlier.
Can you see if this matches your observation?
Regards
Chandan Jha
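To make the developer's steps concrete, here is a small end-to-end sketch in Python (illustration only; the names here are made up and the real work happens in T-SQL against the large gram table). It follows the steps above: split the search text into padded 3-grams, sum the per-gram minimum counts for a candidate name, then compute the ratio exactly as the posted query does (the `* 1.0` in the SQL just forces decimal arithmetic):

```python
from collections import Counter

Q = 3

def qgrams(text: str, q: int = Q) -> Counter:
    """Step 1: pad with q-1 leading '#' and q-1 trailing '%', then count
    every q-character window (n <= LEN + q - 2 in the posted SQL)."""
    padded = "#" * (q - 1) + text + "%" * (q - 1)
    return Counter(padded[i:i + q] for i in range(len(text) + q - 2))

def overlap_score(query_grams: Counter, name_grams: Counter) -> int:
    """Steps 2-3: for each gram the two sides share, take the smaller of
    the two counts and sum them; this mirrors the query's
    SUM(CASE WHEN g.count > q.count THEN q.count ELSE g.count END)."""
    return sum(min(c, name_grams[g]) for g, c in query_grams.items()
               if g in name_grams)

def ratio(score: int, stripped_name_len: int, max_match: int, q: int = Q) -> float:
    """Step 6: (score / (nameLen + Q - 2)) * (score / maxMatch), as in the
    query; both factors reach 1.0 only on a perfect match."""
    return (score / (stripped_name_len + q - 2)) * (score / max_match)

search = "Legg Mason Partners"
q_grams = qgrams(search)            # 20 grams, e.g. '##L', ' Ma', ' Pa'
max_match = sum(q_grams.values())   # @maxMatch = SUM(count) FROM #qgram

# a stored name identical to the search text scores a perfect 1.0
n_grams = qgrams(search)
score = overlap_score(q_grams, n_grams)   # 20
r = ratio(score, len(search), max_match)  # (20/20) * (20/20) = 1.0
```

A partial match, say 12 of the 20 grams against a longer stored name, drops both factors below 1, which is how dissimilar names fall down the ranking.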
I can see why they may have a short entry like "L" or "Le" in the NGram patterns (although I don't necessarily agree; "It Depends"), but I don't understand what the purpose of the "##" and the "#" are. If they're what I think they're for (NGrams for the leading word, or NGrams for every word), that could provide one of the "early filters" in the method I use for such "matching".
As for whether or not this is in line with what I thought, yes. NGrams aren't any different from a KeyWord "thumbprint" search if the NGrams have been stored in a similar fashion, and they would be for a KeyWord "thumbprint" search.
I am a little concerned, though. You say in 5) above that you return the "top 500 rows". It would seem that someone would need to do further processing to determine the actual "best fit" and isolate which company name(s) would make good matches. I believe we might be able to produce the "best fit" without that extra processing, but I don't know how many out of (for example) 51 NGrams would have to match before something qualified as a "Best Fit", and knowing that is important to being able to quickly reject poor candidates for the "Best Fit".
Like I said, I also need to know what the purpose/use of the ## and # characters is.
--Jeff Moden
Change is inevitable... Change for the better is not.
January 3, 2019 at 9:23 am
Jeff Moden - Thursday, January 3, 2019 8:45 AM
I can see why (although I don't necessarily agree but "It Depends") they may have a short entry like "L" or "Le" in the NGram patterns but I don't understand what the purpose of the "##" and the "#" are. If they're what I think they're for (NGrams for leading word or NGrams for every word). that could provide one of the "early filters" in the method I use for such "matching".
As for whether or not this is inline with what I thought, yes. NGrams aren't any different than a KeyWord "thumbprint" search if the NGrams have been stored in a similar fashion and they would be for a Keyword "thumbprint" search.
I am a little concerned though. You say in 5) above that you return the "top 500 rows". It would seem that someone would need to do further processing to determine what the actual "best fit" would be to isolate the which company name(s) would make good matches. I believe we might be able to produce the "best fit" without that extra processing but don't know how many out of (for example) 51 Ngrams would have to match before something qualified as a "Best Fit" and doing such a thing is important to being able to quickly reject poor candidates for the "Best Fit".
Like I said, I also need to know what the purpose/use of using the ## and # characters are for.
Sir, I will try to find out about the ### and # characters. As to your concern about the top 500 rows: from the query and the execution plan I posted earlier, and looking at it again, it appears it does not matter whether I select the top 1500 or the top 15000, because the rows retrieved by the index-seek joins are evaluated for the sum and then ordered (the Top operator is costly in the plan). Most of the cost is spent by that point; after that, how many rows we show using the TOP clause from an already ordered set does not matter much.
Sorry if my understanding is poor here. I am learning that an index or statistics change cannot work magic every time. The query looks simple, and if I keep the business logic aside (at times we cannot change the table schema for a redesign), I wonder what problem statement can be framed here just by looking at the query and plan, like in the examples you gurus used to post: 'Kid, this is not a set-based operation, you are using RBAR'.
Best Regards
Chandan
January 3, 2019 at 9:34 am
chandan_jha18 - Wednesday, January 2, 2019 8:48 AM
Eirikur Eiriksson - Wednesday, January 2, 2019 8:21 AM
chandan_jha18 - Wednesday, January 2, 2019 6:49 AM
Hello All,
Wish you a very happy new year 2019!!
I had been asked to review performance of a procedure and I am producing a part of it. There are two similar queries in this procedure and both when combined take upto 12-15 seconds on a big server(DR of production with hardly any resource consumption).
Since both queries are similar, I am putting one of them here.Even though it is a nested query but even when I tried to evaluate it independently using temp tables for better visualization, there was hardly any performance improvement. I tried to look into the query to see if I can modify it a bit but did not find much success here.
Can one of the gurus see if there is a way to re-write this better so that the performance can be improved. I am including the table schema along with indexes and execution plan. Please let me know if I can frame my question in a better way or supply more information.
Thanks in advance!!
Start by getting rid of the nolock hints, the implicit conversions and the parallel execution, run the query again and post the execution plan.
😎
Implicit conversions are happening in very small tables. I can try removing the nolock hint and adding maxdop 1 to remove parallelism, if that's what you mean by parallel execution.
The thing here is that although the table is small, the rate of the conversion is X * N, where X is the cardinality of the large table and N is the cardinality of the small table. To solve this kind of problem, one has to isolate it, as any background noise can skew the perceived results.
😎
Parallel execution can easily cost more than the execution of the query and the allocation scan implied with the nolock hint can further skew things!
January 3, 2019 at 11:02 am
Eirikur Eiriksson - Thursday, January 3, 2019 9:34 AM
The thing here is that although the table is small, the rate of the conversion is X * N, where X is the cardinality of the large table and the N is the cardinality of the small table. To solve this kind of problems, one has to isolate the problem as any background noise can skew the perceived results.
😎
Parallel execution can easily cost more than the execution of the query and the allocation scan implied with the nolock hint can further skew things!
Sir, I tried removing the nolock hint but saw no improvement. Please let me know how to avoid parallel execution here; do you mean a maxdop hint at the end?
January 3, 2019 at 3:52 pm
chandan_jha18 - Thursday, January 3, 2019 11:02 AM
Eirikur Eiriksson - Thursday, January 3, 2019 9:34 AM
The thing here is that although the table is small, the rate of the conversion is X * N, where X is the cardinality of the large table and the N is the cardinality of the small table. To solve this kind of problems, one has to isolate the problem as any background noise can skew the perceived results.
😎
Parallel execution can easily cost more than the execution of the query and the allocation scan implied with the nolock hint can further skew things!
Sir, I tried removing nolock hint but no improvement, please let me know how to avoid parallel execution here. You mean maxdop hint at the end?
No... none of that is actually going to work. I mean, you can try it (add OPTION (MAXDOP 1) to the last query), but removing parallel execution here isn't the key. The thing that's causing the need for the parallel execution needs to be fixed: that's the icon in the execution plan you posted that has 90 million rows coming out of it.
--Jeff Moden
Change is inevitable... Change for the better is not.
January 3, 2019 at 3:57 pm
chandan_jha18 - Thursday, January 3, 2019 9:23 AM
Jeff Moden - Thursday, January 3, 2019 8:45 AM
I can see why (although I don't necessarily agree but "It Depends") they may have a short entry like "L" or "Le" in the NGram patterns but I don't understand what the purpose of the "##" and the "#" are. If they're what I think they're for (NGrams for leading word or NGrams for every word). that could provide one of the "early filters" in the method I use for such "matching".
As for whether or not this is inline with what I thought, yes. NGrams aren't any different than a KeyWord "thumbprint" search if the NGrams have been stored in a similar fashion and they would be for a Keyword "thumbprint" search.
I am a little concerned though. You say in 5) above that you return the "top 500 rows". It would seem that someone would need to do further processing to determine what the actual "best fit" would be to isolate the which company name(s) would make good matches. I believe we might be able to produce the "best fit" without that extra processing but don't know how many out of (for example) 51 Ngrams would have to match before something qualified as a "Best Fit" and doing such a thing is important to being able to quickly reject poor candidates for the "Best Fit".
Like I said, I also need to know what the purpose/use of using the ## and # characters are for.
Sir, I will try to find about ### and # kind of characters used but to your concern about top 500 rows, from the query and the execution plan I posted earlier and by looking at it again, it appears as if it does not matter if i did a select 1500 or select 15000 because the data retrieved after joins produced by index seeks, they are evaluated for the sum and then ordered in a particular fashion(top operator costly in plan) . So most of the cost is spent till here , after that how much we show using top clause from an already ordered set did not matter much.
Sorry, if my understanding is poor here. I am learning how not everytime an index or statistics change can work magic. The query looks simple and if I keep the business logic aside(at times we cannot change table schema for re-designing) , just by looking at the query and plan what problem statement can be framed here just like we used to write in some examples posted by you Gurus- ' Kid this is not a set based operation, you are using RBAR'
Best Regards
Chandan
You're not misunderstanding a thing. I agree that the "500 rows" being returned is trivial compared to the initial code. What I'm trying to determine is how to make the stuff I know compatible with the stuff you need. It also helps me define the overall goal of the code. And, no... we won't need a table change here or any massive functional change. I'm thinking I can shoehorn all that into place. The only big thing left is to figure out what the ## and # are meant to indicate. They might be usable as a further shortcut, or they might need to be totally ignored or replaced.
--Jeff Moden
Change is inevitable... Change for the better is not.
January 4, 2019 at 6:38 am
Jeff Moden - Thursday, January 3, 2019 3:57 PM
You're not misunderstanding a thing. I agree that the "500 rows" being returned is trivial compared to the initial code. What I'm trying to determine is how to make the stuff I know compatible with the stuff you need. It also helps me define the overall goal of the code. And, no... we won't need a table change here or any massive functional change. I'm thinking I can shoehorn all that into place. The only big thing left is to figure out what the ## and # is meant to infer. It may be that it could be used as a further shortcut but could also mean they should be totally ignored or replaced.
Sir, this is the response I got from the developer regarding ##:
"They are filling empty spaces
Like they want to make strings like ##M
#MI
MIC
ICR
for Microsoft"
January 4, 2019 at 7:37 am
chandan_jha18 - Friday, January 4, 2019 6:38 AM
Jeff Moden - Thursday, January 3, 2019 3:57 PM
You're not misunderstanding a thing. I agree that the "500 rows" being returned is trivial compared to the initial code. What I'm trying to determine is how to make the stuff I know compatible with the stuff you need. It also helps me define the overall goal of the code. And, no... we won't need a table change here or any massive functional change. I'm thinking I can shoehorn all that into place. The only big thing left is to figure out what the ## and # is meant to infer. It may be that it could be used as a further shortcut but could also mean they should be totally ignored or replaced.
Sir, This is the response I got from the developer regarding ##;
"They are filling empty spaces
Like they want to make strings like ##M
#MI
MIC
ICR
for Microsoft"
Understood! Thanks, Chandan. I think this will all fit. I don't have your data but I believe I can come up with an example.
And while I very much appreciate your fine manners and sense of humility, you don't need to call me "sir". "Jeff" will be fine.
--Jeff Moden
Change is inevitable... Change for the better is not.
January 4, 2019 at 7:45 am
Jeff Moden - Friday, January 4, 2019 7:37 AM
chandan_jha18 - Friday, January 4, 2019 6:38 AM
Jeff Moden - Thursday, January 3, 2019 3:57 PM
You're not misunderstanding a thing. I agree that the "500 rows" being returned is trivial compared to the initial code. What I'm trying to determine is how to make the stuff I know compatible with the stuff you need. It also helps me define the overall goal of the code. And, no... we won't need a table change here or any massive functional change. I'm thinking I can shoehorn all that into place. The only big thing left is to figure out what the ## and # is meant to infer. It may be that it could be used as a further shortcut but could also mean they should be totally ignored or replaced.
Sir, This is the response I got from the developer regarding ##;
"They are filling empty spaces
Like they want to make strings like ##M
#MI
MIC
ICR
for Microsoft"
Understood! Thanks, Chandan. I think this will all fit. I don't have your data but I believe I can come up with an example.
And while I very much appreciate your fine manners and sense of humility, you don't need to call me "sir". "Jeff" will be fine.
IF OBJECT_ID('tempdb..#SearchSuggest_QGramNameCache_tbl') IS NOT NULL DROP TABLE #SearchSuggest_QGramNameCache_tbl
SELECT TOP(9312500)
ID,
CompanyName = RTRIM(FirstName) + ' ' + RTRIM(LastName)
INTO #SearchSuggest_QGramNameCache_tbl
FROM [Yourtable]
WHERE FirstName > ' ' OR LastName > ' '
-- (5,456,289 rows affected)
January 4, 2019 at 7:48 am
DECLARE @searchText VARCHAR(100) = 'Michaela Broughton'
DECLARE @SearchTextLenght INT = LEN(@searchText)
DECLARE @Q INT = 3
January 7, 2019 at 8:11 am
Thank you Chris. I have been trying to digest this and will hopefully be able to find a clue from your code above, although my brain won't be as fast as yours to interpret this smoothly at the beginning.
Best Regards
Chandan Jha
January 7, 2019 at 8:24 am
ChrisM@Work - Friday, January 4, 2019 7:48 AM
Lastly, here's a chunk of code which models the problem domain, albeit slightly simplified from the original. I suspect that the simplifications could be usefully employed:
Thank you Chris. I have been trying to digest this and will hopefully be able to find a clue from your code above, although my brain won't be as fast as yours to interpret this smoothly at the beginning.
Best Regards
Chandan Jha
You'll be just fine.
The first code chunk splits the variable into 3-character pieces or "grams" (after adding top & tail elements), which with the sample I provided yields this:
gram count n
##M 1 1
#Mi 1 2
Mic 1 3
ich 1 4
cha 1 5
hae 1 6
ael 1 7
ela 1 8
la  1 9
a B 1 10
 Br 1 11
Bro 1 12
rou 1 13
oug 1 14
ugh 1 15
ght 1 16
hto 1 17
ton 1 18
on% 1 19
The second part of the code joins this result set to #SearchSuggest_QGram_tbl (230M rows) on [gram] then does a count of companyid for the matching rows. A high number of matching grams (rows) indicates that the company name (held in #SearchSuggest_QGramNameCache_tbl) is similar or identical to the variable.
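For anyone checking the gram table above, this quick Python snippet (purely illustrative; the real splitting is done in T-SQL with a numbers table) reproduces it, including the easy-to-miss space characters inside grams like 'la ' and ' Br':

```python
name = "Michaela Broughton"
padded = "##" + name + "%%"   # Q-1 leading '#', Q-1 trailing '%', Q = 3
# n runs 1 .. LEN(name) + Q - 2, i.e. 19 grams for an 18-character name
grams = [(padded[i:i + 3], i + 1) for i in range(len(name) + 1)]
for gram, n in grams:
    print(repr(gram), n)      # repr() makes the embedded spaces visible
```

Printing with repr() shows that the 9th gram is 'la ' (with a trailing space) and the 11th is ' Br' (with a leading space), which the plain-text table above cannot show.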