June 22, 2016 at 11:12 pm
Comments posted to this topic are about the item Nasty Fast N-Grams (Part 1): Character-Level Unigrams
-- Itzik Ben-Gan 2001
June 23, 2016 at 4:38 am
Good read. I'm looking forward to the rest of the series.
Is the CTE tally table to encapsulate the tally table in a single function for when there isn't already one within the database?
Given that a tally table has so many uses and even with a million records they are tiny I tend to have a physical tally table in my databases.
In fact they are so useful I put them in the model database so any new databases automatically gain a pre-populated tally table.
I've not used Azure but I know there are limitations on what you can do with system tables. I do a cross apply on sys.columns to generate the ROW_NUMBER() information to populate the tally table. Saves typing and copy/pasting all those NULLs.
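For readers following along, David's technique can be sketched roughly like this (the table name dbo.Tally, the row count, and the index are illustrative assumptions, not from his post):

```sql
-- Build a physical one-million-row tally table by cross joining
-- sys.columns against itself to drive ROW_NUMBER().
SELECT TOP (1000000)
       N = ROW_NUMBER() OVER (ORDER BY (SELECT NULL))
INTO   dbo.Tally
FROM   sys.columns a
CROSS JOIN sys.columns b;

-- A unique clustered index keeps range scans against the table efficient.
ALTER TABLE dbo.Tally
  ADD CONSTRAINT PK_Tally PRIMARY KEY CLUSTERED (N);
```

This avoids typing out the long VALUES list of NULLs that a hand-rolled row source would need.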
June 23, 2016 at 6:30 am
virtually nothing about the subject in SQL Server
Except for Integration Services that is. Fuzzy Lookup and Fuzzy Grouping use q-grams: n-grams with a distance component.
see Fuzzy Lookup and Fuzzy Grouping in SQL Server Integration Services 2005
It's been around for a while. These components also use Jaccard index and Levenshtein distance.
MDS and DQS also leverage these algorithms.
see mdq.Similarity (Transact-SQL)
Then there's the SSIS Data Profiling task, which has been using these techniques for a while.
Gerald Britton, Pluralsight courses
June 23, 2016 at 7:16 am
Nice piece and very good work Alan!
🙂
June 23, 2016 at 8:18 am
Excellent article! While I've had no need for this yet, I can easily see how this could come in handy one day. Added to my briefcase...looking forward to the rest of the series.
June 23, 2016 at 10:38 am
David.Poole (6/23/2016)
Good read. I'm looking forward to the rest of the series.
Is the CTE tally table to encapsulate the tally table in a single function for when there isn't already one within the database?
Thanks for the feedback David. There are a couple of reasons for the CTE Tally Table. First, putting the logic into a CTE allows people to just copy/paste the code. The second is that the CTE Tally tends to be a little faster and certainly generates fewer reads. That said, a memory-optimized tally table seems to outperform the CTE Tally Table (something I did not mention in the article).
Given that a tally table has so many uses and even with a million records they are tiny I tend to have a physical tally table in my databases.
In fact they are so useful I put them in the model database so any new databases automatically gain a pre-populated tally table.
Me too. Even though CTE Tally tables are faster and don't produce any reads, there are some situations where a physical tally table is better. Part of the motivation of this article was also to promote Tally Tables.
I've not used Azure but I know there are limitations on what you can do with system tables. I do a cross apply on sys.columns to generate the ROW_NUMBER() information to populate the tally table. Saves typing and copy/pasting all those NULLs.
I use the sys.columns method often. The downside, however, is that it creates a busier execution plan and can't be used in views and functions with schemabinding. A good numbers iTVF is always nice to have around. 🙂
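For anyone who hasn't seen it, the zero-read CTE Tally pattern being discussed generally looks something like this cascading-CTE sketch (row counts and CTE names are illustrative; this is the widely used Itzik Ben-Gan style, not necessarily the article's exact code):

```sql
-- Cascading CTEs: 10 rows -> 1,000 rows -> up to 1,000,000 rows,
-- generated entirely in memory with no I/O against a physical table.
WITH E1(N) AS (SELECT 1 FROM (VALUES (1),(1),(1),(1),(1),(1),(1),(1),(1),(1)) t(N)),
     E3(N) AS (SELECT 1 FROM E1 a CROSS JOIN E1 b CROSS JOIN E1 c),
     Tally(N) AS
     (
       SELECT TOP (1000000)
              ROW_NUMBER() OVER (ORDER BY (SELECT NULL))
       FROM   E3 a CROSS JOIN E3 b
     )
SELECT N FROM Tally;
```

Because the rows are manufactured by the optimizer rather than read from storage, the plan shows zero logical reads, which is the behavior described above.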
June 23, 2016 at 11:00 am
g.britton (6/23/2016)
virtually nothing about the subject in SQL Server
Except for Integration Services that is. Fuzzy Lookup and Fuzzy Grouping use q-grams: n-grams with a distance component.
see Fuzzy Lookup and Fuzzy Grouping in SQL Server Integration Services 2005
It's been around for a while. These components also use Jaccard index and Levenshtein distance.
MDS and DQS also leverage these algorithms.
see mdq.Similarity (Transact-SQL)
Then there's the SSIS Data Profiling task, which has been using these techniques for a while.
Thanks for the comment.
Yep - I got the idea to write a purely set-based T-SQL N-Grams function while playing around with mdq.NGrams (which I referenced in the article). Not everyone uses or can use SSIS/DQS/MDS, and the only N-Grams iTVF returned by a Google search for "t-sql n-grams" or "ngrams in sql" is an older version of the function I write about in this article. This was part of the motivation for writing this article: to expose the concept of N-Grams to a wider audience.
June 23, 2016 at 11:00 am
Eirikur Eiriksson (6/23/2016)
Nice piece and very good work Alan! 🙂
Thank you sir! 🙂
June 23, 2016 at 11:01 am
Y.B. (6/23/2016)
Excellent article! While I've had no need for this yet, I can easily see how this could come in handy one day. Added to my briefcase...looking forward to the rest of the series.
Thanks a lot YB!
June 23, 2016 at 11:50 am
Alan.B (6/23/2016)
g.britton (6/23/2016)
virtually nothing about the subject in SQL Server
Except for Integration Services that is. Fuzzy Lookup and Fuzzy Grouping use q-grams: n-grams with a distance component.
see Fuzzy Lookup and Fuzzy Grouping in SQL Server Integration Services 2005
It's been around for a while. These components also use Jaccard index and Levenshtein distance.
MDS and DQS also leverage these algorithms.
see mdq.Similarity (Transact-SQL)
Then there's the SSIS Data Profiling task, which has been using these techniques for a while.
Thanks for the comment.
Yep - I got the idea to write a purely set-based T-SQL N-Grams function while playing around with mdq.NGrams (which I referenced in the article). Not everyone uses or can use SSIS/DQS/MDS, and the only N-Grams iTVF returned by a Google search for "t-sql n-grams" or "ngrams in sql" is an older version of the function I write about in this article. This was part of the motivation for writing this article: to expose the concept of N-Grams to a wider audience.
Really? I couldn't spot that reference (even after cleaning my glasses!). Anyhow, I've gotta believe that the mdq version (a CLR, is it not?) is gonna beat the pants off a T-SQL function. Then there's the Similarity function; even better, I think.
June 23, 2016 at 11:58 am
A little off-topic, but Master Data Services comes with CLR-functions that do N-Grams and other nice things. Here's a reference on that:
Regular Expressions, advanced string matching and new Split function in SQL Server 2008 R2
To whet your appetite, the functions include:
mdq.NGrams
mdq.RegexReplace
mdq.RegexExtract
mdq.RegexSplit
mdq.RegexIsMatch
mdq.Similarity
mdq.RegexIsValid
mdq.SimilarityDate
mdq.RegexMask
mdq.Split
They pretty much do what their names imply.
June 23, 2016 at 1:16 pm
g.britton (6/23/2016)
Alan.B (6/23/2016)
g.britton (6/23/2016)
virtually nothing about the subject in SQL Server
Except for Integration Services that is. Fuzzy Lookup and Fuzzy Grouping use q-grams: n-grams with a distance component.
see Fuzzy Lookup and Fuzzy Grouping in SQL Server Integration Services 2005
It's been around for a while. These components also use Jaccard index and Levenshtein distance.
MDS and DQS also leverage these algorithms.
see mdq.Similarity (Transact-SQL)
Then there's the SSIS Data Profiling task, which has been using these techniques for a while.
Thanks for the comment.
Yep - I got the idea to write a purely set-based T-SQL N-Grams function while playing around with mdq.NGrams (which I referenced in the article). Not everyone uses or can use SSIS/DQS/MDS, and the only N-Grams iTVF returned by a Google search for "t-sql n-grams" or "ngrams in sql" is an older version of the function I write about in this article. This was part of the motivation for writing this article: to expose the concept of N-Grams to a wider audience.
Really? I couldn't spot that reference (even after cleaning my glasses!)
From the article:
Even Microsoft Data Quality Services includes an N-Grams CLR function.
Anyhow, I've gotta believe that the mdq version (a CLR, is it not?) is gonna beat the pants off a T-SQL function.
Nope. Part of the problem (and I'll get into this more in Part 3) with mdq.NGrams is the added cost of the padding functionality. The formula to calculate the number of tokens returned by mdq.NGrams is L+(N-1), versus L-(N-1) for NGrams8K (or the variations of it), where L is the length of the input string. I originally included test times with parallel execution plans (using make_parallel) and also included mdq.NGrams when I first submitted the article, but it was too long so I edited those results out. Here are the original test results (the SQLCLR version is mdq.NGrams):
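To make the two formulas concrete with a worked example: for an input of length L = 4 and N = 2, NGrams8K returns L-(N-1) = 3 tokens, while the padded mdq.NGrams returns L+(N-1) = 5. Roughly (the padding character shown is an assumption about mdq.NGrams' behavior, not output from the article's tests):

```sql
-- NGrams8K: no padding, so a 4-character string yields 4-(2-1) = 3 bigrams:
-- 'AB', 'BC', 'CD'
SELECT position, token
FROM   dbo.NGrams8K('ABCD', 2);

-- mdq.NGrams pads both ends of the string, yielding 4+(2-1) = 5 bigrams,
-- conceptually: '_A', 'AB', 'BC', 'CD', 'D_'
```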
I also excluded a second test where I compared NGrams8K to the loop version, rCTE version and mdq.NGrams for a specialized "Remove Duplicate Characters" function:
CREATE FUNCTION dbo.RemoveDupes8K(@string varchar(8000), @preserved varchar(100))
RETURNS TABLE WITH SCHEMABINDING AS RETURN
SELECT CleanedString =
( SELECT token+''
FROM dbo.NGrams8K(@string,1)
WHERE token <> SUBSTRING(@string,position+1,1) -- exclude chars equal to the next char
OR token LIKE @preserved -- preserve characters that match the @preserved pattern
FOR XML PATH(''),TYPE
).value('.','varchar(8000)'); -- using Wayne Sheffield's concatenation logic
Here's the results of that test:
I suspect that this could be done slightly faster with a better CLR than mdq.NGrams, which was the motivation behind this post. The LINQ version Con Alexis posted, however, was notably slower.
June 23, 2016 at 1:42 pm
Alan.B (6/23/2016)[snip]
Thanks for all the detail, Alan. It really helps. I suspect the mdq.NGrams function could easily be improved (e.g. it could be multi-threaded), though that's probably not a high priority for the team. I'm also thinking towards the goal of matching and lookup using n-grams (and other, better algorithms). For that goal, I think I'd recommend the mdq.Similarity function over anything in pure T-SQL, since it offers more algorithm choices and gets straight to that goal.
I'll probably stick with doing most of the work in SSIS with the fuzzy matching transformations or in MDS/DQS for other stuff.
YMMV!
June 23, 2016 at 2:26 pm
Thanks for this, highly informative post.
I had no idea this was possible in SQL Server.
I've done some text mining work using n-grams in R, but I imagine it will be much more scalable if SQL Server is doing the heavy lifting.
Very excited to see the next posts in this series.
June 28, 2016 at 12:46 pm
johnmackintosh (6/23/2016)
Thanks for this, highly informative post.
I had no idea this was possible in SQL Server.
I've done some text mining work using n-grams using R, but I imagine it will be much more scalable if SQL Server is doing the heavy lifting.
Very excited to see the next posts in this series
Thanks John! (I'm sorry I missed this post).
I had no idea this was possible in SQL Server.
That was the point of the article. 🙂