the use of CTEs

  • Hugo Kornelis (1/6/2016)


    In theory it is also possible that the optimizer decides to first get the results of the CTE, store them in a worktable (called a Spool in the execution plan), and then reuse the data from there.

    In my years of SQL programming I have never seen a plan where there would be a split of the data flow after a Spool. And I never submit a piece of code to Prod without checking its execution plan.

    So, in my world - it never happens.

    _____________
    Code for TallyGenerator

  • erics44 (1/6/2016)


    and even if you have a senior developer, everyone writes code that they go back to later and think, what was I thinking when I wrote that pile of poo

    A-ha!

    That's what they call "a senior moment"!

    :hehe:

    _____________
    Code for TallyGenerator

  • erics44 (1/6/2016)


    Alan.B (1/6/2016)

    This is just silly. Every MVP and Certified Master whose article and/or book I have read uses CTEs. And I use them and I know how to design DBs and program in SQL. Quit while you are behind.

    Honest mate, I've worked with qualified and non-qualified developers, and in my experience you can hedge your bets on quality either way

    I use CTEs regularly, but the original question was about overuse of CTEs

    And I'll be honest, I don't think using the CTE for an update is more elegant; I would say it's less so, and a bit unnecessary

    But it is a matter of preference, as mentioned before 🙂

    Read a Ben-Gan book or a Jeff Moden article. Ask these guys why they use CTEs. Reviewing your original question, you appeared to be soliciting other people's views on CTEs; that's what I thought you meant by "I'd like people's views on CTEs". It appears that you were looking for a specific answer and not really trying to learn anything. If you're not trying to learn or teach anything then what are you trying to accomplish? Just to argue with people and waste people's time? :sick:

    "I can't stress enough the importance of switching from a sequential files mindset to set-based thinking. After you make the switch, you can spend your time tuning and optimizing your queries instead of maintaining lengthy, poor-performing code."

    -- Itzik Ben-Gan 2001

  • GilaMonster (1/6/2016)


    No, and that's not what Lynn said.

    Subqueries and CTEs are both part of a larger statement. Compilation is at the batch level, each statement in a batch gets compiled. If a subquery appears multiple times in a query, it's executed multiple times. If a CTE is referenced multiple times in a query, it's executed multiple times.

    Not just executed.

    It's "opened" and parsed in every place where it's mentioned in the bigger query.

    As many times as it's mentioned.

    Just like in your code example.

    And the execution plan for different references to the same CTE may be different within the same query.

    These, for example, are identical, will have identical execution plans and will perform identically.

    WITH ACTE AS

    (SELECT Col1, Col2 FROM SomeTable WHERE SomeOtherColumn = 1)

    SELECT * FROM tbl1 t

    INNER JOIN ACTE a1 on t.ColA = a1.Col1

    INNER JOIN ACTE a2 on t.ColB = a2.Col1

    INNER JOIN ACTE a3 on t.ColC = a3.Col1

    and

    SELECT * FROM tbl1 t

    INNER JOIN (SELECT Col1, Col2 FROM SomeTable WHERE SomeOtherColumn = 1) a1 on t.ColA = a1.Col1

    INNER JOIN (SELECT Col1, Col2 FROM SomeTable WHERE SomeOtherColumn = 1) a2 on t.ColB = a2.Col1

    INNER JOIN (SELECT Col1, Col2 FROM SomeTable WHERE SomeOtherColumn = 1) a3 on t.ColC = a3.Col1

    That's one of the reasons why I prefer to minimise the use of CTEs.

    When you see or write such horrible copy-paste code, you know something is not good about it.

    A CTE makes it look neat and pretty to a developer's eye, without addressing any of the overheads associated with this coding approach.
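    If the repeated evaluation matters, one workaround (a sketch, using the hypothetical table and column names from the example above) is to materialise the shared result set once in a temp table and join to that instead:

```sql
-- Materialise the shared result set once; SomeTable is read a single time.
SELECT Col1, Col2
INTO #Shared
FROM SomeTable
WHERE SomeOtherColumn = 1;

-- The three joins now read the (optionally indexed) temp table.
SELECT *
FROM tbl1 t
INNER JOIN #Shared a1 ON t.ColA = a1.Col1
INNER JOIN #Shared a2 ON t.ColB = a2.Col1
INNER JOIN #Shared a3 ON t.ColC = a3.Col1;

DROP TABLE #Shared;
```

    Whether this is actually faster depends on the cost of the underlying query and the size of the intermediate set; for a cheap query, three seeks against the base table may well win.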

    _____________
    Code for TallyGenerator

  • erics44 (1/6/2016)


    Sergiy (1/6/2016)


    jaime.simancas 27970 (1/6/2016)


    but CTEs have their place in making these "Sets" easier to read..

    Very questionable. Very.

    CTEs break the structure of Structured Query Language, effectively moving part of the FROM clause out of the query.

    To me, it makes it much more difficult to read and debug.

    I'm with you on that

    Using CTEs in place of a subquery is like reading a book and having to keep skipping to the front to read the story in the correct order 🙂

    Same could be said if you populate temporary tables with the results of a complex query and then use that table in another query. You have to look back at the query used to populate the temporary table to "read the story in the correct order."

  • Been following this discussion, but I didn't feel like getting into it.

    I believe that overuse of CTEs is as bad as the lack of CTEs. I'd rather hire a developer who knows how to use CTEs than one who doesn't, just because he seems more up to date and willing to learn (they've been available for over 10 years now).

    I just want some help from those who prefer subqueries over CTEs to transform one of our favorite functions.

    CREATE FUNCTION [dbo].[DelimitedSplit8K]

    --===== Define I/O parameters

    (@pString VARCHAR(8000), @pDelimiter CHAR(1))

    --WARNING!!! DO NOT USE MAX DATA-TYPES HERE! IT WILL KILL PERFORMANCE!

    RETURNS TABLE WITH SCHEMABINDING AS

    RETURN

    --===== "Inline" CTE Driven "Tally Table" produces values from 1 up to 10,000...

    -- enough to cover VARCHAR(8000)

    WITH E1(N) AS (

    SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL

    SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL

    SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1

    ), --10E+1 or 10 rows

    E2(N) AS (SELECT 1 FROM E1 a, E1 b), --10E+2 or 100 rows

    E4(N) AS (SELECT 1 FROM E2 a, E2 b), --10E+4 or 10,000 rows max

    cteTally(N) AS ( --==== This provides the "base" CTE and limits the number of rows right up front

    -- for both a performance gain and prevention of accidental "overruns"

    SELECT TOP (ISNULL(DATALENGTH(@pString),0)) ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) FROM E4

    ),

    cteStart(N1) AS ( --==== This returns N+1 (starting position of each "element" just once for each delimiter)

    SELECT 1 UNION ALL

    SELECT t.N+1 FROM cteTally t WHERE SUBSTRING(@pString,t.N,1) = @pDelimiter

    ),

    cteLen(N1,L1) AS( --==== Return start and length (for use in substring)

    SELECT s.N1,

    ISNULL(NULLIF(CHARINDEX(@pDelimiter,@pString,s.N1),0)-s.N1,8000)

    FROM cteStart s

    )

    --===== Do the actual split. The ISNULL/NULLIF combo handles the length for the final element when no delimiter is found.

    SELECT ItemNumber = ROW_NUMBER() OVER(ORDER BY l.N1),

    Item = SUBSTRING(@pString, l.N1, l.L1)

    FROM cteLen l

    ;
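    For anyone who hasn't used the splitter before, a quick usage sketch (assuming the function has been created as above):

```sql
SELECT ItemNumber, Item
FROM dbo.DelimitedSplit8K('abc,def,ghi', ',');
-- ItemNumber  Item
-- ----------  ----
-- 1           abc
-- 2           def
-- 3           ghi
```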

    And last, I'm sure that the ones that overuse CTEs would certainly have coded like this in previous versions:

    SELECT *

    FROM (

    SELECT column1, column2, column3

    FROM MyTable

    ) t

    And yes, I've seen code like that.
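    For what it's worth, the CTE rendition of that same pointless wrapper (same hypothetical names) is no better:

```sql
WITH t AS
(
    SELECT column1, column2, column3
    FROM MyTable
)
SELECT *
FROM t;
```

    Either way the wrapper adds nothing; the optimizer flattens both forms into a plain select against MyTable.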

    Luis C.
    General Disclaimer:
    Are you seriously taking the advice and code from someone from the internet without testing it? Do you at least understand it? Or can it easily kill your server?

    How to post data/code on a forum to get the best help: Option 1 / Option 2
  • Alan.B (1/6/2016)


    erics44 (1/6/2016)


    Alan.B (1/6/2016)

    This is just silly. Every MVP and Certified Master whose article and/or book I have read uses CTEs. And I use them and I know how to design DBs and program in SQL. Quit while you are behind.

    Honest mate, I've worked with qualified and non-qualified developers, and in my experience you can hedge your bets on quality either way

    I use CTEs regularly, but the original question was about overuse of CTEs

    And I'll be honest, I don't think using the CTE for an update is more elegant; I would say it's less so, and a bit unnecessary

    But it is a matter of preference, as mentioned before 🙂

    Read a Ben-Gan book or a Jeff Moden article. Ask these guys why they use CTEs. Reviewing your original question, you appeared to be soliciting other people's views on CTEs; that's what I thought you meant by "I'd like people's views on CTEs". It appears that you were looking for a specific answer and not really trying to learn anything. If you're not trying to learn or teach anything then what are you trying to accomplish? Just to argue with people and waste people's time? :sick:

    I'm not arguing with anybody

    And I have already mentioned in this thread that I have since used CTEs in places I wouldn't have previously, so I am definitely learning 🙂

    If my text is coming across confrontational then that isn't how I intended it

    My opinion differs from yours that's all

  • Lynn Pettis (1/6/2016)


    erics44 (1/6/2016)


    Sergiy (1/6/2016)


    jaime.simancas 27970 (1/6/2016)


    but CTEs have their place in making these "Sets" easier to read..

    Very questionable. Very.

    CTEs break the structure of Structured Query Language, effectively moving part of the FROM clause out of the query.

    To me, it makes it much more difficult to read and debug.

    I'm with you on that

    Using CTEs in place of a subquery is like reading a book and having to keep skipping to the front to read the story in the correct order 🙂

    Same could be said if you populate temporary tables with the results of a complex query and then use that table in another query. You have to look back at the query used to populate the temporary table to "read the story in the correct order."

    Very true

    I tend to only use temporary tables for performance reasons

    And (like CTEs) I also think that temporary tables are overused at times

  • erics44 (1/6/2016)


    Alan.B (1/6/2016)


    erics44 (1/6/2016)


    Alan.B (1/6/2016)

    This is just silly. Every MVP and Certified Master whose article and/or book I have read uses CTEs. And I use them and I know how to design DBs and program in SQL. Quit while you are behind.

    Honest mate, I've worked with qualified and non-qualified developers, and in my experience you can hedge your bets on quality either way

    I use CTEs regularly, but the original question was about overuse of CTEs

    And I'll be honest, I don't think using the CTE for an update is more elegant; I would say it's less so, and a bit unnecessary

    But it is a matter of preference, as mentioned before 🙂

    Read a Ben-Gan book or a Jeff Moden article. Ask these guys why they use CTEs. Reviewing your original question, you appeared to be soliciting other people's views on CTEs; that's what I thought you meant by "I'd like people's views on CTEs". It appears that you were looking for a specific answer and not really trying to learn anything. If you're not trying to learn or teach anything then what are you trying to accomplish? Just to argue with people and waste people's time? :sick:

    I'm not arguing with anybody

    And I have already mentioned in this thread that I have since used CTEs in places I wouldn't have previously, so I am definitely learning 🙂

    If my text is coming across confrontational then that isn't how I intended it

    My opinion differs from yours that's all

    Fair enough.

    "I can't stress enough the importance of switching from a sequential files mindset to set-based thinking. After you make the switch, you can spend your time tuning and optimizing your queries instead of maintaining lengthy, poor-performing code."

    -- Itzik Ben-Gan 2001

  • Luis Cazares (1/6/2016)


    Been following this discussion, but I didn't feel like getting into it.

    I believe that overuse of CTEs is as bad as the lack of CTEs. I'd rather hire a developer who knows how to use CTEs than one who doesn't, just because he seems more up to date and willing to learn (they've been available for over 10 years now).

    I just want some help from those who prefer subqueries over CTEs to transform one of our favorite functions.

    <DelimitedSplit8K code here...>

    And last, I'm sure that the ones that overuse CTEs would certainly have coded like this in previous versions:

    SELECT *

    FROM (

    SELECT column1, column2, column3

    FROM MyTable

    ) t

    And yes, I've seen code like that.

    I'd guess everyone contributing to this thread is willing to learn, to have got to the point where they can validly contribute to the discussion 🙂

    But that's a hefty bit of code, so without really looking at it, it seems your CTEs are used multiple times, which suggests CTEs would be more appropriate than subqueries

  • Lynn Pettis (1/6/2016)


    Same could be said if you populate temporary tables with the results of a complex query and then use that table in another query. You have to look back at the query used to populate the temporary table to "read the story in the correct order."

    Not quite the same.

    You can simply select from a temp table - it's already created and populated. No need to go back to the place where it was created, unless you're not happy with the data set in it.

    In terms of readability, CTEs are closer to table variables.

    But even this comparison is not totally correct.

    _____________
    Code for TallyGenerator

  • Luis Cazares (1/6/2016)


    I just want some help from those who prefer subqueries over CTEs to transform one of our favorite functions.

    You know, people had been splitting delimited lists for many years in SQL 2000 and earlier versions.

    You may find it hard to believe - but I can assure you - it's true. 🙂

    I did it myself.

    Some of those split functions I created back then must still be running in Prod systems now.

    :w00t:

    Edit: Yes, Tally was a static table back then.

    I'm not sure it's a bad thing.

    _____________
    Code for TallyGenerator

  • Sergiy (1/6/2016)


    Lynn Pettis (1/6/2016)


    Same could be said if you populate temporary tables with the results of a complex query and then use that table in another query. You have to look back at the query used to populate the temporary table to "read the story in the correct order."

    Not quite the same.

    You can simply select from a temp table - it's already created and populated. No need to go back to the place where it was created, unless you're not happy with the data set in it.

    In terms of readability, CTEs are closer to table variables.

    But even this comparison is not totally correct.

    Really? If you treat the CTE in the FROM clause as a table, I see no difference between it and a temporary table. Of course we all know that you are always correct and will never admit when you are wrong. So don't worry about replying to this post.

  • Hmmm, not sure if this is smart, but okay. I'll bite.

    Sergiy (1/6/2016)


    Hugo Kornelis (1/6/2016)


    A CTE allows you to define a complex subquery once and refer to it multiple times in the same query without repeating the code. Technically possible before CTEs, but cumbersome.

    Really?

    Ever heard of views?

    You might want to read CREATE VIEW article from BOL. Very educating.

    :hehe: Just kidding.

    Isn't that what I said? Technically possible but cumbersome. I do not want to create a view, use it, then drop it again if I can avoid it.

    Seriously though, referring to the same CTE multiple times, especially when it is a complex subquery, is not a very smart decision.

    A CTE is re-executed every time it's mentioned in the query. Re-executing a complicated subquery creates unnecessary overhead and does not affect performance in a good way.

    If you want to reuse a recordset multiple times use good old temporary objects.

    Both techniques are viable in some situations. And both techniques are a terrible idea in other situations.

    A CTE can be recursive. Impossible before CTE was introduced.

    Nice try Hugo!

    But you know, I still do not believe your knowledge of SQL is so limited.

    🙂

    Yeah? Take a look at the sample code at https://technet.microsoft.com/en-us/library/ms186243%28v=sql.105%29.aspx (for an employee hierarchy). How do you write that query without using a CTE?

    (And if you're going to come with a cursor solution, I will slap you with a fish (https://www.youtube.com/watch?v=IhJQp-q1Y1s) and tell you that iterative code should be written in your C# front-end. Especially if you care about not allowing constructions that are foreign to SQL.)
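    For readers who don't want to follow the link, the query in that sample has roughly this shape (a sketch; dbo.Employees and its columns are stand-ins for the tables in the linked article):

```sql
-- Recursive CTE: the anchor member seeds the top of the hierarchy,
-- the recursive member joins each employee to the level above it.
WITH DirectReports (ManagerID, EmployeeID, EmployeeLevel) AS
(
    SELECT ManagerID, EmployeeID, 0
    FROM dbo.Employees
    WHERE ManagerID IS NULL            -- the top of the tree
    UNION ALL
    SELECT e.ManagerID, e.EmployeeID, d.EmployeeLevel + 1
    FROM dbo.Employees e
    INNER JOIN DirectReports d
        ON e.ManagerID = d.EmployeeID
)
SELECT ManagerID, EmployeeID, EmployeeLevel
FROM DirectReports;
```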

    Strictly speaking, the previous statement must be terminated with ";".

    ONLY if the next statement starts with CTE.

    No. The T-SQL language has always had a semicolon terminator for all statements, but it was optional.

    Yes, it must be terminated (emphasis on *must*) only before a CTE, THROW and other constructions copied from other languages.

    Which indicates that they are - yes, foreign to SQL.

    I think I already mentioned that omitting semicolons is deprecated. It still works (for now!!) for constructions that existed before this was deprecated.

    Most newer constructions don't understand unterminated logic. That includes CTEs (which are not copied from another language but from the ISO/ANSI standard for this very language), MERGE (also straight from the ISO/ANSI standard), and THROW (probably not from the ANSI standard, though I didn't check).

    With at least two out of three examples coming from the ISO/ANSI standard, I cannot understand how you still think that these constructions are foreign to SQL. Go read the standard yourself if you don't believe me.

    The concept of CTEs and the WITH keyword to introduce it are both defined in the ANSI standard for SQL. (Unfortunately, Microsoft has decided to only partially implement this standard; ANSI defined it as WITH [RECURSIVE], making recursive CTEs easier to recognise).

    It does not change the fact that the keyword WITH was already occupied in SQL by query options. There would not be any need to use the same keyword for a totally different purpose if not for the sake of exactly repeating the construction used in other languages.

    No, WITH is and was the only keyword in ANSI SQL for CTE expressions. It was not occupied in SQL for other options. It was, however, used for other options in Transact-SQL.

    Could they have chosen to stray from the standard and use something else? Sure - but they would have taken a lot of negative feedback and probably lost customers. There is a good reason that standards exist, and that manufacturers conform to standards. Would you want to drive a car with the brake pedal on the right?
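    The collision Hugo describes is easy to demonstrate: in T-SQL, WITH already introduces table hints, so the parser needs the previous statement terminated before it can treat WITH as the start of a CTE (a minimal sketch with a hypothetical table):

```sql
-- WITH as a T-SQL table hint:
SELECT * FROM dbo.SomeTable WITH (NOLOCK);

-- WITH as the ANSI CTE introducer; the semicolon terminating the
-- previous statement is what removes the ambiguity:
;WITH Numbered AS
(
    SELECT Col1,
           ROW_NUMBER() OVER (ORDER BY Col1) AS rn
    FROM dbo.SomeTable
)
SELECT Col1
FROM Numbered
WHERE rn <= 10;
```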


    Hugo Kornelis, SQL Server/Data Platform MVP (2006-2016)
    Visit my SQL Server blog: https://sqlserverfast.com/blog/
    SQL Server Execution Plan Reference: https://sqlserverfast.com/epr/

  • Luis Cazares (1/6/2016)


    Been following this discussion, but I didn't feel like getting into it.

    I believe that overuse of CTEs is as bad as the lack of CTEs. I'd rather hire a developer who knows how to use CTEs than one who doesn't, just because he seems more up to date and willing to learn (they've been available for over 10 years now).

    I just want some help from those who prefer subqueries over CTEs to transform one of our favorite functions.

    <JM Splitter code here...>

    Hahahaha... Funny you should ask. I was screwing around some time ago and came up with a PATINDEX function that, among other things, accepts a delimited list of patterns instead of just one (not very well tested, just screwing around)...

    CREATE FUNCTION dbo.MPatindex8K

    (

    @patterns varchar(200),

    @string varchar(8000),

    @delimiter varchar(50),

    @start int

    )

    /****************************************************************************************

    Purpose:

    Works like PATINDEX except that MPatindex8K accepts a delimited list of patterns

    (@patterns) with a user-defined delimiter up to 50 varchar characters long (@delimiter).

    Returns the index value of the pattern that is closest to the beginning of the

    string. If there is no match for the pattern(s) defined by @patterns then a 0 is

    returned. The last parameter, @start, defines where the search should start.

    Compatibility:

    SQL Server 2008+ and Azure SQL Database

    Parameters:

    @patterns = A delimited list of patterns to search for within the input string (@string);

    the delimiter is defined by @delimiter.

    @string = A string or expression to be searched.

    @delimiter = The delimiter that is used to define where each pattern in @patterns starts

    and ends. If this is NULL or blank then the function will assume that

    @patterns represents a single pattern.

    @start = same behavior as T-SQL CHARINDEX's start_location parameter.

    According to BOL (https://msdn.microsoft.com/en-us/library/ms186323.aspx):

    "Is [an integer] expression at which the search starts. If start_location is

    not specified, is a negative number, or is 0, the search starts at the

    beginning of expressionToSearch."

    Returns:

    PIndex = int; the matching position closest to the beginning of the string. If there is no

    matching pattern then a 0 is returned.

    Developer Notes:

    1. This is an iTVF (Inline Table Valued Function) that performs the same task as a

    scalar user defined function (UDF) except that it requires the APPLY table operator.

    Note the usage examples below and See this article for more details:

    http://www.sqlservercentral.com/articles/T-SQL/91724/

    The function will be slightly more complicated to use than a scalar UDF but will yield

    much better performance. For example - unlike a scalar UDF, this function does not

    restrict the query optimizer's ability to generate a parallel query plan. Initial testing

    showed that the function generally gets a parallel plan.

    2. The splitter logic is a tweaked version of Jeff Moden's DelimitedSplit8K. What makes

    this splitter logic different is that it accepts a delimiter longer than CHAR(1). The

    special splitter logic is built into the code.

    3. This function generally gets a parallel query plan and my testing has shown that it

    actually performs better than PATINDEX on systems with four or more logical CPUs. When

    used with APPLY against a table MPatindex generally gets a parallel plan and can be 2

    to 2.5 times faster than just using Patindex.

    Usage:

    --===== Example #1: using MPatindex8K like Patindex

    SELECT PIndex FROM dbo.MPatindex8K('%[^0-9]%', '999ABC123', NULL, NULL);

    -- Returns 4...

    -- note that when @patterns or @delimiter are blank or NULL they are just ignored

    --===== Example #2: No delimited pattern, just using the start functionality:

    DECLARE @string1 varchar(100) = 'abc123abc123';

    SELECT PIndex

    FROM dbo.MPatindex8K('%[^0-9]%', @string1, '', 4);

    -- Returns 7...

    --===== Example #3:

    DECLARE @string2 varchar(100) = 'abc1223!!!';

    SELECT PIndex FROM dbo.MPatindex8K('%[cde]%|%[5-9]%|%[^0-9a-z]%', @string2, '|', 1);

    -- Returns: 3; the letter "c" matches the first pattern '%[cde]%'

    --===== Example #4:

    DECLARE @string3 varchar(100) = 'a+bc12,23!!!++',

    @pattern varchar(100) = '%[cde]%+++%[0-9]%+++%[^0-9a-z]%';

    SELECT PIndex FROM dbo.MPatindex8K(@pattern, @string3, '+++', 1);

    -- Returns: 2; the + matched the pattern '%[^0-9a-z]%'

    ---------------------------------------------------------------------------------------

    Performance Test:

    SET NOCOUNT ON;

    SET ANSI_WARNINGS OFF;

    -- Create the sample data

    IF OBJECT_ID('tempdb..#strings') IS NOT NULL DROP TABLE #strings;

    SELECT TOP (100000) string = REPLICATE(newid(),5) INTO #strings

    FROM sys.all_columns a, sys.all_columns b;

    GO

    -- Simple Patindex test against 100K rows

    PRINT '==== Using traditional Patindex ====';

    DECLARE @st datetime = getdate(), @x varchar(100);

    SELECT @x = PATINDEX('%AB[^0-9]%',string) FROM #strings;

    PRINT DATEDIFF(MS,@st,getdate());

    GO 5

    -- Simple MPatindex8K test against 100K rows

    PRINT '==== Using MPatindex8K ====';

    DECLARE @st datetime = getdate(), @x varchar(100);

    SELECT @x=pindex FROM #strings CROSS APPLY dbo.MPatindex8K('%AB[^0-9]%', string, '', 4);

    PRINT DATEDIFF(MS,@st,getdate());

    GO 5

    Results:

    Avg of ~250 MS for Patindex (serial plan)

    Avg of ~98 MS for MPatindex8K (Parallel plan)

    ---------------------------------------------------------------------------------------

    Revision History:

    Rev 00 - 20151030 Initial Development - Alan Burstein

    ****************************************************************************************/

    RETURNS TABLE WITH SCHEMABINDING AS RETURN

    WITH

    L1(N) AS

    (

    SELECT 1

    FROM (VALUES

    (NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),

    (NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),

    (NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),

    (NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),

    (NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),

    (NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),

    (NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),

    (NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),

    (NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL)

    ) t(N) -- 90 values

    ),

    iTally(N) AS

    (

    SELECT TOP(DATALENGTH(ISNULL(@patterns,''))) ROW_NUMBER() OVER (ORDER BY (SELECT NULL))

    FROM L1 a CROSS JOIN L1 b -- cartesian product produces 8100 rows (90^2)

    )

    SELECT pindex =

    ISNULL

    (

    MIN

    (

    NULLIF(PATINDEX(item,string),0)+

    ISNULL(NULLIF(@start+ABS(@start-1)-1,0)-(@start-1),0)

    ), 0

    )

    FROM (VALUES (SUBSTRING(@string,ISNULL(@start,1),8000))) t(string)

    -- Used to determine from where to begin searching in the string (@start)

    CROSS APPLY

    (

    SELECT Item = SUBSTRING(@patterns, l.N1, l.L1)

    FROM

    (

    SELECT s.N1, L1 = ISNULL(NULLIF(CHARINDEX(@delimiter,@patterns,s.N1),0)-s.N1,8000)

    FROM

    (

    SELECT N1 = 1 UNION ALL

    SELECT t.N+LEN(@delimiter)

    FROM iTally t

    WHERE SUBSTRING(@patterns,t.N,LEN(NULLIF(@delimiter,''))) = @delimiter

    ) s

    ) l

    ) splitter; -- Logic borrowed from Jeff Moden's DelimitedSplit8K but allows for

    -- variable-length delimiter

    GO

    I also did the same thing with Eirikur's DelimitedSplit8K_LEAD for a word-level N-Grams function.

    CREATE FUNCTION dbo.WNGrams8K(@pString varchar(8000), @k int)

    RETURNS TABLE WITH SCHEMABINDING AS RETURN

    WITH t0(N) AS

    ( --

    SELECT 0 UNION ALL -- we need to start with 0

    SELECT position

    FROM dbo.nGrams8K(REPLICATE(0,@K-1),1) -- this is my tally table

    )

    SELECT TOP (SELECT COUNT(*)-(@K-2) FROM dbo.nGrams8K(@pString,1) WHERE token = CHAR(32))

    ItemNumber = ROW_NUMBER() OVER (ORDER BY N1),

    Token = Item

    FROM t0

    CROSS APPLY

    (

    SELECT

    s.N1,

    Item = SUBSTRING(@pString,s.N1,ISNULL(NULLIF((LEAD(s.N1,1,1)

    OVER (ORDER BY s.N1) - 1),0)-s.N1,8000))

    FROM

    (

    SELECT N1

    FROM

    (

    SELECT ROW_NUMBER() OVER (ORDER BY t.N), t.N+1

    FROM

    (

    SELECT 0 UNION ALL

    SELECT TOP (DATALENGTH(ISNULL(@pString,1))) position

    FROM dbo.nGrams8K(@pString,1)

    ) t(N)

    WHERE (SUBSTRING(@pString,t.N,1) = CHAR(32) OR t.N = 0)

    ) cteStart(KF, N1)

    WHERE t0.N = KF % @k -- get me the @Kth instance of the delimiter

    ) s --cteNext

    ) AS KSplit

    GO

    In each case BTW I'm using a character-level N-Grams function as my tally table... this is the code:

    CREATE FUNCTION dbo.NGrams8k

    (

    @string varchar(8000), -- Input string

    @N int -- requested token size

    )

    /****************************************************************************************

    Purpose:

    A character-level @N-Grams function that outputs a contiguous stream of @N-sized tokens

    based on an input string (@string). Accepts strings up to 8000 varchar characters long.

    For more information about N-Grams see: http://en.wikipedia.org/wiki/N-gram.

    Compatibility:

    SQL Server 2008+ and Azure SQL Database

    Syntax:

    --===== Autonomous

    SELECT position, token FROM dbo.NGrams8k(@string,@N);

    --===== Against a table using APPLY

    SELECT s.SomeID, ng.position, ng.string

    FROM dbo.SomeTable s

    CROSS APPLY dbo.NGrams8K(s.SomeValue,@N) ng;

    Parameters:

    @string = The input string to split into tokens.

    @N = The size of each token returned.

    Returns:

    Position = bigint; the position of the token in the input string

    token = varchar(8000); a @N-sized character-level N-Gram token

    Examples:

    --===== Turn the string, 'abcd' into unigrams, bigrams and trigrams

    SELECT position, token FROM dbo.NGrams8k('abcd',1); -- unigrams (@N=1)

    SELECT position, token FROM dbo.NGrams8k('abcd',2); -- bigrams (@N=2)

    SELECT position, token FROM dbo.NGrams8k('abcd',3); -- trigrams (@N=3)

    --===== How many times the substring "AB" appears in each record

    DECLARE @table TABLE(stringID int identity primary key, string varchar(100));

    INSERT @table(string) VALUES ('AB123AB'),('123ABABAB'),('!AB!AB!'),('AB-AB-AB-AB-AB');

    SELECT string, occurrences = COUNT(*)

    FROM @table t

    CROSS APPLY dbo.NGrams8k(t.string,2) ng

    WHERE ng.token = 'AB'

    GROUP BY string;

    Developer Notes:

    1. This function is not case sensitive

    2. Many functions that use NGrams8k will see a huge performance gain when the optimizer

    creates a parallel query plan. One way to get a parallel query plan (if the optimizer

    does not choose one) is to use make_parallel by Adam Machanic which can be found here:

    sqlblog.com/blogs/adam_machanic/archive/2013/07/11/next-level-parallel-plan-porcing.aspx

    3. When @N is less than 1 or greater than the datalength of the input string then no

    tokens (rows) are returned.

    4. This function can also be used as a tally table with the position column being your

    "N" row. To do so use REPLICATE to create an imaginary string, then use NGrams8k to

    split it into unigrams then only return the position column. NGrams8k will get you up

    to 8000 numbers. There will be no performance penalty for sorting by position in

    ascending order but there is for sorting in descending order. To get the numbers in

    descending order without forcing a sort in the query plan use the following formula:

    N = <highest number>-position+1.

    Pseudo Tally Table Examples:

    --===== (1) Get the numbers 1 to 100 in ascending order:

    SELECT N = position FROM dbo.NGrams8k(REPLICATE(0,100),1);

    --===== (2) Get the numbers 1 to 100 in descending order:

    DECLARE @maxN int = 100;

    SELECT N = @maxN-position+1

    FROM dbo.NGrams8k(REPLICATE(0,@maxN),1)

    ORDER BY position;

    -- note that you don't need a variable, I used one to make this easier to understand.

    ----------------------------------------------------------------------------------------

    Revision History:

    Rev 00 - 20140310 - Initial Development - Alan Burstein

    Rev 01 - 20150522 - Removed DQS N-Grams functionality, improved iTally logic. Also Added

    conversion to bigint in the TOP logic to remove implicit conversion

    to bigint - Alan Burstein

    Rev 03 - 20150909 - Added logic to only return values if @N is greater than 0 and less

    than the length of @string. Updated comment section. - Alan Burstein

    Rev 04 - 20151029 - Added ISNULL logic to the TOP clause for the @string and @N

    parameters to prevent a NULL string or NULL @N from causing "an

    improper value" being passed to the TOP clause. - Alan Burstein

    ****************************************************************************************/

    RETURNS TABLE WITH SCHEMABINDING AS RETURN

    WITH

    L1(N) AS

    (

    SELECT 1

    FROM (VALUES -- 90 NULL values used to create the CTE Tally table

    (NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),

    (NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),

    (NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),

    (NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),

    (NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),

    (NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),

    (NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),

    (NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),

    (NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL)

    ) t(N)

    ),

    iTally(N) AS -- my cte Tally table

    (

    SELECT TOP(ABS(CONVERT(BIGINT,(DATALENGTH(ISNULL(@string,''))-(ISNULL(@N,1)-1)),0)))

    ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) -- Order by a constant to avoid a sort

    FROM L1 a CROSS JOIN L1 b -- cartesian product for 8100 rows (90^2)

    )

    SELECT

    position = N, -- position of the token in the string(s)

    token = SUBSTRING(@string,N,@N) -- the @N-Sized token

    FROM iTally

    WHERE @N > 0 AND @N <= DATALENGTH(@string); -- force the function to ignore a "bad @N"

    GO

    "I can't stress enough the importance of switching from a sequential files mindset to set-based thinking. After you make the switch, you can spend your time tuning and optimizing your queries instead of maintaining lengthy, poor-performing code."

    -- Itzik Ben-Gan 2001

Viewing 15 posts - 76 through 90 (of 161 total)
