June 22, 2016 at 11:12 pm
Comments posted to this topic are about the item Nasty Fast N-Grams (Part 1): Character-Level Unigrams
-- Itzik Ben-Gan 2001
June 23, 2016 at 4:38 am
Good read. I'm looking forward to the rest of the series.
Is the CTE tally table to encapsulate the tally table in a single function for when there isn't already one within the database?
Given that a tally table has so many uses and even with a million records they are tiny I tend to have a physical tally table in my databases.
In fact they are so useful I put them in the model database so any new databases automatically gain a pre-populated tally table.
I've not used Azure but I know there are limitations on what you can do with system tables. I do a cross apply on sys.columns to generate the ROW_NUMBER() information to populate the tally table. Saves typing and copy/pasting all those NULLs.
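For readers following along, David's technique can be sketched roughly like this (the table name dbo.Tally, the row count, and the index are illustrative assumptions, not from his post):

```sql
-- Build a physical one-million-row tally table by cross joining
-- sys.columns against itself to drive ROW_NUMBER().
SELECT TOP (1000000)
       N = ROW_NUMBER() OVER (ORDER BY (SELECT NULL))
INTO   dbo.Tally
FROM   sys.columns a
CROSS JOIN sys.columns b;

-- A unique clustered index keeps range scans against the table efficient.
ALTER TABLE dbo.Tally
  ADD CONSTRAINT PK_Tally PRIMARY KEY CLUSTERED (N);
```

This avoids typing out the long VALUES list of NULLs that a hand-rolled row source would need.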
June 23, 2016 at 6:30 am
virtually nothing about the subject in SQL Server
Except for Integration Services that is. Fuzzy Lookup and Fuzzy Grouping use q-grams: n-grams with a distance component.
see Fuzzy Lookup and Fuzzy Grouping in SQL Server Integration Services 2005
It's been around for a while. These components also use Jaccard index and Levenshtein distance.
MDS and DQS also leverage these algorithms.
see mdq.Similarity (Transact-SQL)
Then there's the SSIS Data Profiling task, which has been using these techniques for a while.
Gerald Britton, Pluralsight courses
June 23, 2016 at 7:16 am
Nice piece and very good work Alan!
🙂
June 23, 2016 at 8:18 am
Excellent article! While I've had no need for this yet, I can easily see how this could come in handy one day. Added to my briefcase...looking forward to the rest of the series.
June 23, 2016 at 10:38 am
David.Poole (6/23/2016)
Good read. I'm looking forward to the rest of the series.
Is the CTE tally table to encapsulate the tally table in a single function for when there isn't already one within the database?
Thanks for the feedback David. There are a couple of reasons for the CTE Tally Table. First, putting the logic into a CTE allows people to just copy/paste the code. The second is that the CTE Tally tends to be a little faster and certainly generates fewer reads. That said, a memory-optimized tally table seems to outperform the CTE Tally Table (something I did not mention in the article).
Given that a tally table has so many uses and even with a million records they are tiny I tend to have a physical tally table in my databases.
In fact they are so useful I put them in the model database so any new databases automatically gain a pre-populated tally table.
Me too. Even though CTE Tally tables are faster and don't produce any reads, there are some situations where a physical tally table is better. Part of the motivation of this article was also to promote Tally Tables.
I've not used Azure but I know there are limitations on what you can do with system tables. I do a cross apply on sys.columns to generate the ROW_NUMBER() information to populate the tally table. Saves typing and copy/pasting all those NULLs.
I use the sys.columns method often. The downside, however, is that it creates a busier execution plan and can't be used in views and functions with schemabinding. A good numbers iTVF is always nice to have around. 🙂
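For anyone who hasn't seen it, the zero-read CTE Tally pattern being discussed generally looks something like this cascading-CTE sketch (row counts and CTE names are illustrative; this is the widely used Itzik Ben-Gan style, not necessarily the article's exact code):

```sql
-- Cascading CTEs: 10 rows -> 1,000 rows -> up to 1,000,000 rows,
-- generated entirely in memory with no I/O against a physical table.
WITH E1(N) AS (SELECT 1 FROM (VALUES (1),(1),(1),(1),(1),(1),(1),(1),(1),(1)) t(N)),
     E3(N) AS (SELECT 1 FROM E1 a CROSS JOIN E1 b CROSS JOIN E1 c),
     Tally(N) AS
     (
       SELECT TOP (1000000)
              ROW_NUMBER() OVER (ORDER BY (SELECT NULL))
       FROM   E3 a CROSS JOIN E3 b
     )
SELECT N FROM Tally;
```

Because the rows are manufactured by the optimizer rather than read from storage, the plan shows zero logical reads, which is the behavior described above.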
June 23, 2016 at 11:00 am
g.britton (6/23/2016)
virtually nothing about the subject in SQL Server
Except for Integration Services that is. Fuzzy Lookup and Fuzzy Grouping use q-grams: n-grams with a distance component.
see Fuzzy Lookup and Fuzzy Grouping in SQL Server Integration Services 2005
It's been around for a while. These components also use Jaccard index and Levenshtein distance.
MDS and DQS also leverage these algorithms.
see mdq.Similarity (Transact-SQL)
Then there's the SSIS Data Profiling task, which has been using these techniques for a while.
Thanks for the comment.
Yep - I got the idea to write a purely set-based T-SQL N-Grams function while playing around with mdq.NGrams (which I referenced in the article). Not everyone uses or can use SSIS/DQS/MDS, and the only N-Grams iTVF returned by a Google search for "t-sql n-grams" or "ngrams in sql" is an older version of the function I write about in this article. This was part of the motivation for writing this article: to expose the concept of N-Grams to a wider audience.
June 23, 2016 at 11:00 am
Eirikur Eiriksson (6/23/2016)
Nice piece and very good work Alan! 🙂
Thank you sir! 🙂
June 23, 2016 at 11:01 am
Y.B. (6/23/2016)
Excellent article! While I've had no need for this yet, I can easily see how this could come in handy one day. Added to my briefcase...looking forward to the rest of the series.
Thanks a lot YB!
June 23, 2016 at 11:50 am
Alan.B (6/23/2016)
g.britton (6/23/2016)
virtually nothing about the subject in SQL Server
Except for Integration Services that is. Fuzzy Lookup and Fuzzy Grouping use q-grams: n-grams with a distance component.
see Fuzzy Lookup and Fuzzy Grouping in SQL Server Integration Services 2005
It's been around for a while. These components also use Jaccard index and Levenshtein distance.
MDS and DQS also leverage these algorithms.
see mdq.Similarity (Transact-SQL)
Then there's the SSIS Data Profiling task, which has been using these techniques for a while.
Thanks for the comment.
Yep - I got the idea to write a purely set-based T-SQL N-Grams function while playing around with mdq.NGrams (which I referenced in the article). Not everyone uses or can use SSIS/DQS/MDS, and the only N-Grams iTVF returned by a Google search for "t-sql n-grams" or "ngrams in sql" is an older version of the function I write about in this article. This was part of the motivation for writing this article: to expose the concept of N-Grams to a wider audience.
Really? I couldn't spot that reference (even after cleaning my glasses!). Anyhow, I've gotta believe that the mdq version (a CLR, is it not?) is gonna beat the pants off a T-SQL function. Then there's the Similarity function; even better, I think.
June 23, 2016 at 11:58 am
A little off-topic, but Master Data Services comes with CLR-functions that do N-Grams and other nice things. Here's a reference on that:
Regular Expressions, advanced string matching and new Split function in SQL Server 2008 R2
To whet your appetite, the functions include:
mdq.NGrams
mdq.RegexReplace
mdq.RegexExtract
mdq.RegexSplit
mdq.RegexIsMatch
mdq.Similarity
mdq.RegexIsValid
mdq.SimilarityDate
mdq.RegexMask
mdq.Split
They pretty much do what their names imply.
June 23, 2016 at 1:16 pm
g.britton (6/23/2016)
Alan.B (6/23/2016)
g.britton (6/23/2016)
virtually nothing about the subject in SQL Server
Except for Integration Services that is. Fuzzy Lookup and Fuzzy Grouping use q-grams: n-grams with a distance component.
see Fuzzy Lookup and Fuzzy Grouping in SQL Server Integration Services 2005
It's been around for a while. These components also use Jaccard index and Levenshtein distance.
MDS and DQS also leverage these algorithms.
see mdq.Similarity (Transact-SQL)
Then there's the SSIS Data Profiling task, which has been using these techniques for a while.
Thanks for the comment.
Yep - I got the idea to write a purely set-based T-SQL N-Grams function while playing around with mdq.NGrams (which I referenced in the article). Not everyone uses or can use SSIS/DQS/MDS, and the only N-Grams iTVF returned by a Google search for "t-sql n-grams" or "ngrams in sql" is an older version of the function I write about in this article. This was part of the motivation for writing this article: to expose the concept of N-Grams to a wider audience.
Really? I couldn't spot that reference (even after cleaning my glasses!)
From the article:
Even Microsoft Data Quality Services includes an N-Grams CLR function.
Anyhow, I've gotta believe that the mdq version (a CLR, is it not?) is gonna beat the pants off a T-SQL function.
Nope. Part of the problem (and I'll get into this more in Part 3) with mdq.NGrams is the added cost of the padding functionality. The formula to calculate the number of tokens returned by mdq.NGrams is L+(N-1), versus L-(N-1) for NGrams8K (or the variations of it), where L is the length of the input string. I originally included test times with parallel execution plans (using make_parallel) and also included mdq.NGrams when I first submitted the article, but it was too long so I edited those results out. Here are the original test results (the SQLCLR version is mdq.NGrams):
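To make the two formulas concrete with a worked example: for an input of length L = 4 and N = 2, NGrams8K returns L-(N-1) = 3 tokens, while the padded mdq.NGrams returns L+(N-1) = 5. Roughly (the padding character shown is an assumption about mdq.NGrams' behavior, not output from the article's tests):

```sql
-- NGrams8K: no padding, so a 4-character string yields 4-(2-1) = 3 bigrams:
-- 'AB', 'BC', 'CD'
SELECT position, token
FROM   dbo.NGrams8K('ABCD', 2);

-- mdq.NGrams pads both ends of the string, yielding 4+(2-1) = 5 bigrams,
-- conceptually: '_A', 'AB', 'BC', 'CD', 'D_'
```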
I also excluded a second test where I compared NGrams8K to the loop version, rCTE version and mdq.NGrams for a specialized "Remove Duplicate Characters" function:
CREATE FUNCTION dbo.RemoveDupes8K(@string varchar(8000), @preserved varchar(100))
RETURNS TABLE WITH SCHEMABINDING AS RETURN
SELECT CleanedString =
( SELECT token+''
FROM dbo.NGrams8K(@string,1)
WHERE token <> SUBSTRING(@string,position+1,1) -- exclude chars equal to the next char
OR token LIKE @preserved -- preserve characters that match the @preserved pattern
FOR XML PATH(''),TYPE
).value('.','varchar(8000)'); -- using Wayne Sheffield's concatenation logic
Here's the results of that test:
I suspect that this could be done slightly faster with a better CLR than mdq.NGrams, which was the motivation behind this post. The LINQ version Con Alexis posted, however, was notably slower.
June 23, 2016 at 1:42 pm
Alan.B (6/23/2016)[snip]
Thanks for all the detail, Alan. It really helps. I suspect the mdq.NGrams function could easily be improved (e.g. it could be multi-threaded), though that's probably not a high priority for the team. I'm also thinking towards the goal of matching and lookup using n-grams (and other, better algorithms). For that goal, I think I'd recommend the mdq.Similarity function over anything in pure T-SQL, since it offers more algorithm choices and gets straight to that goal.
I'll probably stick with doing most of the work in SSIS with the fuzzy matching transformations or in MDS/DQS for other stuff.
YMMV!
June 23, 2016 at 2:26 pm
Thanks for this, highly informative post.
I had no idea this was possible in SQL Server.
I've done some text mining work using n-grams in R, but I imagine it will be much more scalable if SQL Server is doing the heavy lifting.
Very excited to see the next posts in this series.
June 28, 2016 at 12:46 pm
johnmackintosh (6/23/2016)
Thanks for this, highly informative post.
I had no idea this was possible in SQL Server.
I've done some text mining work using n-grams using R, but I imagine it will be much more scalable if SQL Server is doing the heavy lifting.
Very excited to see the next posts in this series
Thanks John! (I'm sorry I missed this post).
I had no idea this was possible in SQL Server.
That was the point of the article. 🙂