January 3, 2019 at 7:07 am
What is the purpose of this stored procedure? Is it to find companies with similar names to the company passed in as a parameter?
Did you try the code I posted? Note that the index is important.
For fast, accurate and documented assistance in answering your questions, please read this article.
Understanding and using APPLY, (I) and (II) Paul White
Hidden RBAR: Triangular Joins / The "Numbers" or "Tally" Table: What it is and how it replaces a loop Jeff Moden
January 3, 2019 at 8:04 am
ChrisM@Work - Thursday, January 3, 2019 7:07 AM
What is the purpose of this stored procedure? Is it to find companies with similar names to the company passed in as a parameter?
Did you try the code I posted? Note that the index is important.
Yes sir. It is looking for similar names; based on the output it picks the best-matching ones and performs some calculations.
I tried running your code (the index was already created as you suggested), and I get an error:
SELECT TOP(1500)
x.nameCacheId,
x.count
FROM #qgram q
CROSS APPLY (
SELECT
g.gram,
g.nameCacheId,
SUM(CASE WHEN g.count > q.count THEN q.count ELSE g.count END) AS count
FROM SearchSuggest_QGram_tbl g
WHERE g.gram = q.gram
GROUP BY g.gram, g.nameCacheId
) x
WHERE x.count >= @minLen
ORDER BY 2 DESC
Msg 8124, Level 16, State 1, Line 80
Multiple columns are specified in an aggregated expression containing an outer reference. If an expression being aggregated contains an outer reference, then that outer reference must be the only column referenced in the expression.
January 3, 2019 at 8:45 am
chandan_jha18 - Thursday, January 3, 2019 7:02 AM
Jeff Moden - Wednesday, January 2, 2019 9:52 AM
chandan_jha18 - Wednesday, January 2, 2019 6:49 AM
Hello All,
Wish you a very happy new year 2019!!
I have been asked to review the performance of a procedure and I am reproducing a part of it. There are two similar queries in this procedure and, combined, they take up to 12-15 seconds on a big server (a DR of production with hardly any resource consumption).
Since both queries are similar, I am putting one of them here. Even though it is a nested query, when I tried to evaluate it independently using temp tables for better visualization, there was hardly any performance improvement. I tried to look into the query to see if I could modify it a bit, but did not find much success.
Can one of the gurus see if there is a way to rewrite this so that performance improves? I am including the table schema along with indexes and the execution plan. Please let me know if I can frame my question in a better way or supply more information.
Thanks in advance!!
declare @maxResults INT = 25,
@companyName varchar(100)=N'Northwestern Corp Discount Commercial Paper 4/2 Yrs 3&4',
@issuerCode nvarchar(6)=''
DECLARE @searchText NVARCHAR(300)
DECLARE @issuerCodeType int
DECLARE @searchTextType int
SELECT @searchText = dbautil.dbo.stripAutoSuggestCompanyName_fn(@companyName)
--SELECT @searchText = @companyName
-- Get company by QGrams
DECLARE @minLen int
DECLARE @Q int
SET @Q = 3
SET @minLen = 4
IF LEN(@searchText) < 6
BEGIN
SET @minLen = 2
END
Create table #qgram
(
gram NCHAR(3) NOT NULL PRIMARY KEY CLUSTERED
, [count] INT NOT NULL
)
DECLARE @SearchTextLenght INT
SET @SearchTextLenght = LEN(@searchText)
INSERT INTO #qgram
SELECT
ISNULL(gram, '') as gram
, ISNULL(COUNT(*), 1) AS [count]
FROM
(
SELECT CONVERT(NCHAR(3), SUBSTRING(SUBSTRING('###', 1, @Q-1) + @searchText + SUBSTRING('%%%', 1, @Q-1), n.n, @Q)) AS gram
FROM dbautil.dbo.numbers_tbl n (NOLOCK)
WHERE n.n <= @SearchTextLenght + @Q - 2
) sub
GROUP BY gram
--option(recompile)
DECLARE @maxMatch INT
SELECT @maxMatch = SUM(count) FROM #qgram
SELECT sub2.fundOrComp, sub2.nameCacheId, sub2.ratio ,row_number() over(partition by sub2.companyid order by ratio desc) as rn
FROM
(
SELECT TOP(@maxResults) 1 as fundOrComp,sub.nameCacheId,
((sub.count * 1.0 /(qn.strippedNameLength + @Q - 2)) * (sub.count * 1.0/ @maxMatch)) AS ratio,qn.companyId
FROM
(
SELECT TOP 1500 g.nameCacheId,SUM (CASE WHEN g.count > q.count THEN q.count ELSE g.count END ) AS count
FROM #qgram q INNER JOIN SearchSuggest_QGram_tbl g (NOLOCK) ON g.gram = q.gram GROUP BY g.nameCacheId HAVING
SUM (CASE WHEN g.count > q.count THEN q.count ELSE g.count END ) >= @minLen ORDER BY 2 DESC, 1 ASC
) sub
INNER JOIN SearchSuggest_QGramNameCache_tbl qn (NOLOCK) ON qn.nameCacheId = sub.nameCacheId
ORDER BY ratio DESC
) sub2
drop table #qgram
It would really have helped if you had identified what the actual purpose of the query is. So, for better understanding by all, let me see if I get it right.
In this particular bit of code, it would appear that you're retrieving 51 "key items" that uniquely identify a company and then storing that information in a temp table.
After that, you're trying to figure out which rows in a much larger table provide a match of those 51 "grams" so you can return only those nameCacheId that are actually a match for the given company.
Is that correct? If not, please expound.
Sir, here is what I got from the developer as per his document:
1) When the search text is entered, the length of the search text is calculated and then the search string is broken into multiple small strings of 3 characters and stored in a table called qgram
Example: If I pass the search text as 'Legg Mason Partners', we will get the following results:
gram count
 Ma 1
 Pa 1
##L 1
#Le 1
etc.
- Count indicates the number of times the split string is present in the search string.
2) The #qgram table and SearchSuggest_QGram_tbl are combined based on the gram column.
3) After #qgram and SearchSuggest_QGram_tbl are combined, the rows are grouped by nameCacheId and the count values are summed up, and the top 1500 rows are selected.
Since we have a count column in both tables, the smaller of the two values is used when summing the counts while grouping by nameCacheId.
Example
nameCacheId count
5359401 9
512010 6
2545081 8
2875285 11
1934139 10
4) The result fetched in step 3 is combined with SearchSuggest_QGramNameCache_tbl based on 'nameCacheId'.
5) From the result set obtained in step 4, the top 500 records are fetched, containing company id, ratio, namecache id, and strippednamelength (company or fund name).
6) The ratio for each company id is calculated as below:
(count (obtained in step 3) * 1.0 / (length(companyname fetched in step 4) + 1)) * count (obtained in step 3) * 1.0 / sum of the count values obtained earlier.
Can you see if this matches your observation?
Regards
Chandan Jha
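To make the developer's steps concrete, here is a small end-to-end sketch in Python (illustration only; the names here are made up and the real work happens in T-SQL against the large gram table). It follows the steps above: split the search text into padded 3-grams, sum the per-gram minimum counts for a candidate name, then compute the ratio exactly as the posted query does (the `* 1.0` in the SQL just forces decimal arithmetic):

```python
from collections import Counter

Q = 3

def qgrams(text: str, q: int = Q) -> Counter:
    """Step 1: pad with q-1 leading '#' and q-1 trailing '%', then count
    every q-character window (n <= LEN + q - 2 in the posted SQL)."""
    padded = "#" * (q - 1) + text + "%" * (q - 1)
    return Counter(padded[i:i + q] for i in range(len(text) + q - 2))

def overlap_score(query_grams: Counter, name_grams: Counter) -> int:
    """Steps 2-3: for each gram the two sides share, take the smaller of
    the two counts and sum them; this mirrors the query's
    SUM(CASE WHEN g.count > q.count THEN q.count ELSE g.count END)."""
    return sum(min(c, name_grams[g]) for g, c in query_grams.items()
               if g in name_grams)

def ratio(score: int, stripped_name_len: int, max_match: int, q: int = Q) -> float:
    """Step 6: (score / (nameLen + Q - 2)) * (score / maxMatch), as in the
    query; both factors reach 1.0 only on a perfect match."""
    return (score / (stripped_name_len + q - 2)) * (score / max_match)

search = "Legg Mason Partners"
q_grams = qgrams(search)            # 20 grams, e.g. '##L', ' Ma', ' Pa'
max_match = sum(q_grams.values())   # @maxMatch = SUM(count) FROM #qgram

# a stored name identical to the search text scores a perfect 1.0
n_grams = qgrams(search)
score = overlap_score(q_grams, n_grams)   # 20
r = ratio(score, len(search), max_match)  # (20/20) * (20/20) = 1.0
```

A partial match, say 12 of the 20 grams against a longer stored name, drops both factors below 1, which is how dissimilar names fall down the ranking.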
I can see why they may have a short entry like "L" or "Le" in the NGram patterns (although I don't necessarily agree; "It Depends"), but I don't understand what the purpose of the "##" and the "#" are. If they're what I think they're for (NGrams for the leading word, or NGrams for every word), that could provide one of the "early filters" in the method I use for such "matching".
As for whether or not this is in line with what I thought, yes. NGrams aren't any different from a KeyWord "thumbprint" search if the NGrams have been stored in a similar fashion, and they would be for a KeyWord "thumbprint" search.
I am a little concerned, though. You say in 5) above that you return the "top 500 rows". It would seem that someone would need to do further processing to determine the actual "best fit" and isolate which company name(s) would make good matches. I believe we might be able to produce the "best fit" without that extra processing, but I don't know how many out of (for example) 51 NGrams would have to match before something qualified as a "Best Fit", and knowing that is important to being able to quickly reject poor candidates for the "Best Fit".
Like I said, I also need to know what the purpose/use of the ## and # characters is.
--Jeff Moden
Change is inevitable... Change for the better is not.
January 3, 2019 at 9:23 am
Jeff Moden - Thursday, January 3, 2019 8:45 AM
I can see why (although I don't necessarily agree but "It Depends") they may have a short entry like "L" or "Le" in the NGram patterns but I don't understand what the purpose of the "##" and the "#" are. If they're what I think they're for (NGrams for leading word or NGrams for every word). that could provide one of the "early filters" in the method I use for such "matching".
As for whether or not this is inline with what I thought, yes. NGrams aren't any different than a KeyWord "thumbprint" search if the NGrams have been stored in a similar fashion and they would be for a Keyword "thumbprint" search.
I am a little concerned though. You say in 5) above that you return the "top 500 rows". It would seem that someone would need to do further processing to determine what the actual "best fit" would be to isolate the which company name(s) would make good matches. I believe we might be able to produce the "best fit" without that extra processing but don't know how many out of (for example) 51 Ngrams would have to match before something qualified as a "Best Fit" and doing such a thing is important to being able to quickly reject poor candidates for the "Best Fit".
Like I said, I also need to know what the purpose/use of using the ## and # characters are for.
Sir, I will try to find out about the ### and # characters. As to your concern about the top 500 rows: from the query and the execution plan I posted earlier, and looking at it again, it appears it does not matter whether I select the top 1500 or the top 15000, because the rows retrieved by the index-seek joins are evaluated for the sum and then ordered (the Top operator is costly in the plan). Most of the cost is spent by that point; after that, how many rows we show using the TOP clause from an already ordered set does not matter much.
Sorry if my understanding is poor here. I am learning that an index or statistics change cannot work magic every time. The query looks simple, and if I keep the business logic aside (at times we cannot change the table schema for a redesign), I wonder what problem statement can be framed here just by looking at the query and plan, like in the examples you gurus used to post: 'Kid, this is not a set-based operation, you are using RBAR'.
Best Regards
Chandan
January 3, 2019 at 9:34 am
chandan_jha18 - Wednesday, January 2, 2019 8:48 AM
Eirikur Eiriksson - Wednesday, January 2, 2019 8:21 AM
chandan_jha18 - Wednesday, January 2, 2019 6:49 AM
Hello All,
Wish you a very happy new year 2019!!
I had been asked to review performance of a procedure and I am producing a part of it. There are two similar queries in this procedure and both when combined take upto 12-15 seconds on a big server(DR of production with hardly any resource consumption).
Since both queries are similar, I am putting one of them here.Even though it is a nested query but even when I tried to evaluate it independently using temp tables for better visualization, there was hardly any performance improvement. I tried to look into the query to see if I can modify it a bit but did not find much success here.
Can one of the gurus see if there is a way to re-write this better so that the performance can be improved. I am including the table schema along with indexes and execution plan. Please let me know if I can frame my question in a better way or supply more information.
Thanks in advance!!
Start by getting rid of the nolock hints, the implicit conversions and the parallel execution, run the query again and post the execution plan.
😎
Implicit conversions are happening in very small tables. I can try removing the nolock hint and adding maxdop 1 to remove parallelism, if that's what you mean by parallel execution.
The thing here is that although the table is small, the rate of the conversion is X * N, where X is the cardinality of the large table and N is the cardinality of the small table. To solve this kind of problem, one has to isolate it, as any background noise can skew the perceived results.
😎
Parallel execution can easily cost more than the execution of the query and the allocation scan implied with the nolock hint can further skew things!
January 3, 2019 at 11:02 am
Eirikur Eiriksson - Thursday, January 3, 2019 9:34 AM
The thing here is that although the table is small, the rate of the conversion is X * N, where X is the cardinality of the large table and the N is the cardinality of the small table. To solve this kind of problems, one has to isolate the problem as any background noise can skew the perceived results.
😎
Parallel execution can easily cost more than the execution of the query and the allocation scan implied with the nolock hint can further skew things!
Sir, I tried removing the nolock hint but saw no improvement. Please let me know how to avoid parallel execution here; do you mean a maxdop hint at the end?
January 3, 2019 at 3:52 pm
chandan_jha18 - Thursday, January 3, 2019 11:02 AM
Eirikur Eiriksson - Thursday, January 3, 2019 9:34 AM
The thing here is that although the table is small, the rate of the conversion is X * N, where X is the cardinality of the large table and the N is the cardinality of the small table. To solve this kind of problems, one has to isolate the problem as any background noise can skew the perceived results.
😎
Parallel execution can easily cost more than the execution of the query and the allocation scan implied with the nolock hint can further skew things!
Sir, I tried removing nolock hint but no improvement, please let me know how to avoid parallel execution here. You mean maxdop hint at the end?
No... none of that is actually going to work. I mean, you can try it (add OPTION (MAXDOP 1) to the last query), but removing parallel execution here isn't the key. The thing that's causing the need for the parallel execution needs to be fixed: that's the icon in the execution plan you posted that has 90 million rows coming out of it.
--Jeff Moden
Change is inevitable... Change for the better is not.
January 3, 2019 at 3:57 pm
chandan_jha18 - Thursday, January 3, 2019 9:23 AM
Jeff Moden - Thursday, January 3, 2019 8:45 AM
I can see why (although I don't necessarily agree but "It Depends") they may have a short entry like "L" or "Le" in the NGram patterns but I don't understand what the purpose of the "##" and the "#" are. If they're what I think they're for (NGrams for leading word or NGrams for every word). that could provide one of the "early filters" in the method I use for such "matching".
As for whether or not this is inline with what I thought, yes. NGrams aren't any different than a KeyWord "thumbprint" search if the NGrams have been stored in a similar fashion and they would be for a Keyword "thumbprint" search.
I am a little concerned though. You say in 5) above that you return the "top 500 rows". It would seem that someone would need to do further processing to determine what the actual "best fit" would be to isolate the which company name(s) would make good matches. I believe we might be able to produce the "best fit" without that extra processing but don't know how many out of (for example) 51 Ngrams would have to match before something qualified as a "Best Fit" and doing such a thing is important to being able to quickly reject poor candidates for the "Best Fit".
Like I said, I also need to know what the purpose/use of using the ## and # characters are for.
Sir, I will try to find about ### and # kind of characters used but to your concern about top 500 rows, from the query and the execution plan I posted earlier and by looking at it again, it appears as if it does not matter if i did a select 1500 or select 15000 because the data retrieved after joins produced by index seeks, they are evaluated for the sum and then ordered in a particular fashion(top operator costly in plan) . So most of the cost is spent till here , after that how much we show using top clause from an already ordered set did not matter much.
Sorry, if my understanding is poor here. I am learning how not everytime an index or statistics change can work magic. The query looks simple and if I keep the business logic aside(at times we cannot change table schema for re-designing) , just by looking at the query and plan what problem statement can be framed here just like we used to write in some examples posted by you Gurus- ' Kid this is not a set based operation, you are using RBAR'
Best Regards
Chandan
You're not misunderstanding a thing. I agree that the "500 rows" being returned is trivial compared to the initial code. What I'm trying to determine is how to make the stuff I know compatible with the stuff you need. It also helps me define the overall goal of the code. And, no... we won't need a table change here or any massive functional change. I'm thinking I can shoehorn all that into place. The only big thing left is to figure out what the ## and # are meant to indicate. They might be usable as a further shortcut, or they might need to be totally ignored or replaced.
--Jeff Moden
Change is inevitable... Change for the better is not.
January 4, 2019 at 6:38 am
Jeff Moden - Thursday, January 3, 2019 3:57 PM
You're not misunderstanding a thing. I agree that the "500 rows" being returned is trivial compared to the initial code. What I'm trying to determine is how to make the stuff I know compatible with the stuff you need. It also helps me define the overall goal of the code. And, no... we won't need a table change here or any massive functional change. I'm thinking I can shoehorn all that into place. The only big thing left is to figure out what the ## and # is meant to infer. It may be that it could be used as a further shortcut but could also mean they should be totally ignored or replaced.
Sir, this is the response I got from the developer regarding ##:
"They are filling empty spaces
Like they want to make strings like ##M
#MI
MIC
ICR
for Microsoft"
January 4, 2019 at 7:37 am
chandan_jha18 - Friday, January 4, 2019 6:38 AM
Jeff Moden - Thursday, January 3, 2019 3:57 PM
You're not misunderstanding a thing. I agree that the "500 rows" being returned is trivial compared to the initial code. What I'm trying to determine is how to make the stuff I know compatible with the stuff you need. It also helps me define the overall goal of the code. And, no... we won't need a table change here or any massive functional change. I'm thinking I can shoehorn all that into place. The only big thing left is to figure out what the ## and # is meant to infer. It may be that it could be used as a further shortcut but could also mean they should be totally ignored or replaced.
Sir, This is the response I got from the developer regarding ##;
"They are filling empty spaces
Like they want to make strings like ##M
#MI
MIC
ICR
for Microsoft"
Understood! Thanks, Chandan. I think this will all fit. I don't have your data but I believe I can come up with an example.
And while I very much appreciate your fine manners and sense of humility, you don't need to call me "sir". "Jeff" will be fine.
--Jeff Moden
Change is inevitable... Change for the better is not.
January 4, 2019 at 7:45 am
Jeff Moden - Friday, January 4, 2019 7:37 AM
chandan_jha18 - Friday, January 4, 2019 6:38 AM
Jeff Moden - Thursday, January 3, 2019 3:57 PM
You're not misunderstanding a thing. I agree that the "500 rows" being returned is trivial compared to the initial code. What I'm trying to determine is how to make the stuff I know compatible with the stuff you need. It also helps me define the overall goal of the code. And, no... we won't need a table change here or any massive functional change. I'm thinking I can shoehorn all that into place. The only big thing left is to figure out what the ## and # is meant to infer. It may be that it could be used as a further shortcut but could also mean they should be totally ignored or replaced.
Sir, This is the response I got from the developer regarding ##;
"They are filling empty spaces
Like they want to make strings like ##M
#MI
MIC
ICR
for Microsoft"
Understood! Thanks, Chandan. I think this will all fit. I don't have your data but I believe I can come up with an example.
And while I very much appreciate your fine manners and sense of humility, you don't need to call me "sir". "Jeff" will be fine.
IF OBJECT_ID('tempdb..#SearchSuggest_QGramNameCache_tbl') IS NOT NULL DROP TABLE #SearchSuggest_QGramNameCache_tbl
SELECT TOP(9312500)
ID,
CompanyName = RTRIM(FirstName) + ' ' + RTRIM(LastName)
INTO #SearchSuggest_QGramNameCache_tbl
FROM [Yourtable]
WHERE FirstName > ' ' OR LastName > ' '
-- (5,456,289 rows affected)
January 4, 2019 at 7:48 am
DECLARE @searchText VARCHAR(100) = 'Michaela Broughton'
DECLARE @SearchTextLenght INT = LEN(@searchText)
DECLARE @Q INT = 3
January 7, 2019 at 8:11 am
Thank you Chris. I have been trying to digest this and will hopefully be able to find a clue from your code above, although my brain won't be as fast as yours to interpret this smoothly at the beginning.
Best Regards
Chandan Jha
January 7, 2019 at 8:24 am
ChrisM@Work - Friday, January 4, 2019 7:48 AM
Lastly, here's a chunk of code which models the problem domain, albeit slightly simplified from the original. I suspect that the simplifications could be usefully employed:
Thank you Chris. I have been trying to digest this and will hopefully be able to find a clue from your code above, although my brain won't be as fast as yours to interpret this smoothly at the beginning.
Best Regards
Chandan Jha
You'll be just fine.
The first code chunk splits the variable into 3-character pieces or "grams" (after adding top & tail elements), which with the sample I provided yields this:
gram count n
##M 1 1
#Mi 1 2
Mic 1 3
ich 1 4
cha 1 5
hae 1 6
ael 1 7
ela 1 8
la  1 9
a B 1 10
 Br 1 11
Bro 1 12
rou 1 13
oug 1 14
ugh 1 15
ght 1 16
hto 1 17
ton 1 18
on% 1 19
The second part of the code joins this result set to #SearchSuggest_QGram_tbl (230M rows) on [gram] then does a count of companyid for the matching rows. A high number of matching grams (rows) indicates that the company name (held in #SearchSuggest_QGramNameCache_tbl) is similar or identical to the variable.
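For anyone checking the gram table above, this quick Python snippet (purely illustrative; the real splitting is done in T-SQL with a numbers table) reproduces it, including the easy-to-miss space characters inside grams like 'la ' and ' Br':

```python
name = "Michaela Broughton"
padded = "##" + name + "%%"   # Q-1 leading '#', Q-1 trailing '%', Q = 3
# n runs 1 .. LEN(name) + Q - 2, i.e. 19 grams for an 18-character name
grams = [(padded[i:i + 3], i + 1) for i in range(len(name) + 1)]
for gram, n in grams:
    print(repr(gram), n)      # repr() makes the embedded spaces visible
```

Printing with repr() shows that the 9th gram is 'la ' (with a trailing space) and the 11th is ' Br' (with a leading space), which the plain-text table above cannot show.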