Nasty Fast N-Grams (Part 1): Character-Level Unigrams

  • Comments posted to this topic are about the item Nasty Fast N-Grams (Part 1): Character-Level Unigrams

    "I cant stress enough the importance of switching from a sequential files mindset to set-based thinking. After you make the switch, you can spend your time tuning and optimizing your queries instead of maintaining lengthy, poor-performing code."

    -- Itzik Ben-Gan 2001

  • Good read. I'm looking forward to the rest of the series.

    Is the CTE tally table to encapsulate the tally table in a single function for when there isn't already one within the database?

    Given that a tally table has so many uses and even with a million records they are tiny I tend to have a physical tally table in my databases.

    In fact they are so useful I put them in the model database so any new databases automatically gain a pre-populated tally table.

    I've not used Azure but know there are limitations on what you can do with system tables but I do a cross-apply on sys.columns to generate the ROW_NUMBER() information to populate the tally table. Save's typing and copy/pasting all those NULLS.

  • virtually nothing about the subject in SQL Server

    Except for Integration Services that is. Fuzzy Lookup and Fuzzy Grouping use q-grams: n-grams with a distance component.

    see Fuzzy Lookup and Fuzzy Grouping in SQL Server Integration Services 2005

    It's been around for a while. These components also use Jaccard index and Levenshtein distance.

    MDS and DQS also leverage these algorithms.

    see mdq.Similarity (Transact-SQL)

    Then there's the SSIS Data Profiiing task which has been using these techniques for a while

    Gerald Britton, Pluralsight courses

  • Nice piece and very good work Alan!

    😎

  • Excellent article! While I've had no need for this yet, I can easily see how this could come in handy one day. Added to my briefcase...looking forward to the rest of the series.


    SELECT quote FROM brain WHERE original = 1
    0 rows returned

  • David.Poole (6/23/2016)


    Good read. I'm looking forward to the rest of the series.

    Is the CTE tally table to encapsulate the tally table in a single function for when there isn't already one within the database?

    Thanks for the feedback David. There's a couple reasons for the CTE Tally Table. First, putting the logic into a CTE allows poeple to just copy/paste the code. The second is that the CTE Tally tends to be a little faster and certainly generates less reads. That said, a memory-optimized tally table seems to outperform the CTE Tally Table (something I did not mention in the article).

    Given that a tally table has so many uses and even with a million records they are tiny I tend to have a physical tally table in my databases.

    In fact they are so useful I put them in the model database so any new databases automatically gain a pre-populated tally table.

    Me too. Even though CTE Tally tables are faster and don't produce any reads, there are a some situations where a physical tally table is better. Part of the motivation of this article was also to promote Tally Tables.

    I've not used Azure but know there are limitations on what you can do with system tables but I do a cross-apply on sys.columns to generate the ROW_NUMBER() information to populate the tally table. Save's typing and copy/pasting all those NULLS.

    I use the sys.columns method often. The downside, however, is that it creates a busier execution plan and cant be used views and functions with schemabinding. A good numbers ITV [/url]is always nice to have around. 😉

    "I cant stress enough the importance of switching from a sequential files mindset to set-based thinking. After you make the switch, you can spend your time tuning and optimizing your queries instead of maintaining lengthy, poor-performing code."

    -- Itzik Ben-Gan 2001

  • g.britton (6/23/2016)


    virtually nothing about the subject in SQL Server

    Except for Integration Services that is. Fuzzy Lookup and Fuzzy Grouping use q-grams: n-grams with a distance component.

    see Fuzzy Lookup and Fuzzy Grouping in SQL Server Integration Services 2005

    It's been around for a while. These components also use Jaccard index and Levenshtein distance.

    MDS and DQS also leverage these algorithms.

    see mdq.Similarity (Transact-SQL)

    Then there's the SSIS Data Profiiing task which has been using these techniques for a while

    Thanks for the comment.

    Yep - I got the idea to write a purely set-based T-SQL N-Grams function while playing around with mdq.ngrams (which I referenced in the article). Not everyone uses or can use SSIS/DQS/MDS and the only N-Grams iTVF returned by a google search for "t-sql n-grams" or "ngrams in sql" is an older version of the function I write about in this article. This was part of the motivation for writing this article: to expose the concept of N-Grams to a wider audience.

    "I cant stress enough the importance of switching from a sequential files mindset to set-based thinking. After you make the switch, you can spend your time tuning and optimizing your queries instead of maintaining lengthy, poor-performing code."

    -- Itzik Ben-Gan 2001

  • Eirikur Eiriksson (6/23/2016)


    Nice piece and very good work Alan!

    😎

    Thanks you sir! 😎

    "I cant stress enough the importance of switching from a sequential files mindset to set-based thinking. After you make the switch, you can spend your time tuning and optimizing your queries instead of maintaining lengthy, poor-performing code."

    -- Itzik Ben-Gan 2001

  • Y.B. (6/23/2016)


    Excellent article! While I've had no need for this yet, I can easily see how this could come in handy one day. Added to my briefcase...looking forward to the rest of the series.

    Thanks a lot YB!

    "I cant stress enough the importance of switching from a sequential files mindset to set-based thinking. After you make the switch, you can spend your time tuning and optimizing your queries instead of maintaining lengthy, poor-performing code."

    -- Itzik Ben-Gan 2001

  • Alan.B (6/23/2016)


    g.britton (6/23/2016)


    virtually nothing about the subject in SQL Server

    Except for Integration Services that is. Fuzzy Lookup and Fuzzy Grouping use q-grams: n-grams with a distance component.

    see Fuzzy Lookup and Fuzzy Grouping in SQL Server Integration Services 2005

    It's been around for a while. These components also use Jaccard index and Levenshtein distance.

    MDS and DQS also leverage these algorithms.

    see mdq.Similarity (Transact-SQL)

    Then there's the SSIS Data Profiiing task which has been using these techniques for a while

    Thanks for the comment.

    Yep - I got the idea to write a purely set-based T-SQL N-Grams function while playing around with mdq.ngrams (which I referenced in the article). Not everyone uses or can use SSIS/DQS/MDS and the only N-Grams iTVF returned by a google search for "t-sql n-grams" or "ngrams in sql" is an older version of the function I write about in this article. This was part of the motivation for writing this article: to expose the concept of N-Grams to a wider audience.

    Really? I couldn't spot that reference (even after cleaning my glasses!) Anyhow I've gotta believe that the mdq version (a CLR is it not?) is gonna beat the pants off a T_SQL function. Then there's the Similarity function. even better I think.

    Gerald Britton, Pluralsight courses

  • A little off-topic, but Master Data Services comes with CLR-functions that do N-Grams and other nice things. Here's a reference on that:

    Regular Expressions, advanced string matching and new Split function SQL Server 2008 R2[/url]

    to whet your appetite, the functions include:

    mdq.NGrams

    mdq.RegexReplace

    mdq.RegexExtract

    mdq.RegexSplit

    mdq.RegexIsMatch

    mdq.Similarity

    mdq.RegexIsValid

    mdq.SimilarityDate

    mdq.RegexMask

    mdq.Split

    They pretty much do what their names imply

    Gerald Britton, Pluralsight courses

  • g.britton (6/23/2016)


    Alan.B (6/23/2016)


    g.britton (6/23/2016)


    virtually nothing about the subject in SQL Server

    Except for Integration Services that is. Fuzzy Lookup and Fuzzy Grouping use q-grams: n-grams with a distance component.

    see Fuzzy Lookup and Fuzzy Grouping in SQL Server Integration Services 2005

    It's been around for a while. These components also use Jaccard index and Levenshtein distance.

    MDS and DQS also leverage these algorithms.

    see mdq.Similarity (Transact-SQL)

    Then there's the SSIS Data Profiiing task which has been using these techniques for a while

    Thanks for the comment.

    Yep - I got the idea to write a purely set-based T-SQL N-Grams function while playing around with mdq.ngrams (which I referenced in the article). Not everyone uses or can use SSIS/DQS/MDS and the only N-Grams iTVF returned by a google search for "t-sql n-grams" or "ngrams in sql" is an older version of the function I write about in this article. This was part of the motivation for writing this article: to expose the concept of N-Grams to a wider audience.

    Really? I couldn't spot that reference (even after cleaning my glasses!)

    From the article:

    Even Microsoft Data Quality Services includes an N-Grams CLR function.

    Anyhow I've gotta believe that the mdq version (a CLR is it not?) is gonna beat the pants off a T_SQL function.

    Nope. Part of the problem (and I'll get into this more in Part 3) with mdq.ngrams is the added cost of the padding functionality. The formula to calculate the number of tokens returned by mdq.ngrams is N+(N-1) vs N-(N-1) for NGrams8K (or the variations of it). I originally included test times with parallel execution plans (using make_parallel) and also included mdq.NGrams when I first submitted the article but it was to long so I edited those results out. Here's the original test results (the SQLCLR version is mdq.ngrams):

    I also excluded a second test where I compared mdq.NGrams8k to the loop version, rCTE version and mdq.NGrams for a specialized "Remove Duplicate Characters" function:

    CREATE FUNCTION dbo.RemoveDupes8K(@string varchar(8000), @preserved varchar(100))

    RETURNS TABLE WITH SCHEMABINDING AS RETURN

    SELECT CleanedString =

    ( SELECT token+''

    FROM dbo.NGrams8K(@string,1)

    WHERE token <> SUBSTRING(@string,position+1,1) -- exclude chars equal to the next char

    OR token LIKE @preserved -- preserve characters that match the @preserved pattern

    FOR XML PATH(''),TYPE

    ).value('.','varchar(8000)'); -- using Wayne Sheffields concatenation logic

    Here's the results of that test:

    I suspect that this could be done slightly faster with a better CLR than mdq.ngrams which was the motivation behind this post. The Linq version Con Alexis posted, however, was notably slower.

    "I cant stress enough the importance of switching from a sequential files mindset to set-based thinking. After you make the switch, you can spend your time tuning and optimizing your queries instead of maintaining lengthy, poor-performing code."

    -- Itzik Ben-Gan 2001

  • Alan.B (6/23/2016)[snip]

    Thanks for all the detail, Alan. It really helps. I suspect the mdq.ngrams function could easily be improved (could be multi-threaded, e.g.) though it's probably not a high-priority for the team. I'm also thinking towards the goal of matching and lookup using ngrams (and other, better algorithms). For that goal, I think I'd recommend the mdq.Similarity function over anything in pure T-SQL since it has more algorithm choices and gets to that goal.

    I'll probably stick with doing most of the work in SSIS with the fuzzy matching transformations or in MDS/DQS for other stuff.

    YMMV!

    Gerald Britton, Pluralsight courses

  • Thanks for this, highly informative post.

    I had no idea this was possible in SQL Server.

    I've done some text mining work using n-grams using R, but I imagine it will be much more scalable if SQL Server is doing the heavy lifting.

    Very excited to see the next posts in this series

  • johnmackintosh (6/23/2016)


    Thanks for this, highly informative post.

    I had no idea this was possible in SQL Server.

    I've done some text mining work using n-grams using R, but I imagine it will be much more scalable if SQL Server is doing the heavy lifting.

    Very excited to see the next posts in this series

    Thanks John! (I'm sorry I missed this post).

    I had no idea this was possible in SQL Server.

    That was the point of the article. 😛

    "I cant stress enough the importance of switching from a sequential files mindset to set-based thinking. After you make the switch, you can spend your time tuning and optimizing your queries instead of maintaining lengthy, poor-performing code."

    -- Itzik Ben-Gan 2001

Viewing 15 posts - 1 through 15 (of 23 total)

You must be logged in to reply to this topic. Login to reply