Tally OH! An Improved SQL 8K “CSV Splitter” Function

  • The addition of try_DoSomething functions to SQL is like the worst route to take in combating the problem of unwanted exceptions.

    Instead of implementing these simple two steps:

    1. Force filter operations before select expressions.

    2. Handle exceptions as 'NaR' (not a result) and only trigger exceptions on them when they are still in the result set after filtering (the select part of a statement).

    This can be implemented in days if the code is written well and correct like all of the tricky edge cases that sporadically generate errors when you least expect them. This has the added benefit of simplifying the optimizer somewhat as less plans need to be examined and it presents a very simple model programmers will understand in a natural way.

    Instead they pushed for more complicated code, knowing it will not suffice and won't fix any existing lingering problems. NaR should be as natural to DBAs as NaN (not a value) is to mathematicians and NULL is to DBA's.

  • Jeff Moden (9/10/2014)


    I wonder when they'll deprecate and discontinue SQLCLR.

    Bwhahahahahahaha! Just imagine how much you'd make renting out your frozen pork chop launcher if they did!


    My mantra: No loops! No CURSORs! No RBAR! Hoo-uh![/I]

    My thought question: Have you ever been told that your query runs too fast?

    My advice:
    INDEXing a poor-performing query is like putting sugar on cat food. Yeah, it probably tastes better but are you sure you want to eat it?
    The path of least resistance can be a slippery slope. Take care that fixing your fixes of fixes doesn't snowball and end up costing you more than fixing the root cause would have in the first place.

    Need to UNPIVOT? Why not CROSS APPLY VALUES instead?[/url]
    Since random numbers are too important to be left to chance, let's generate some![/url]
    Learn to understand recursive CTEs by example.[/url]
    [url url=http://www.sqlservercentral.com/articles/St

  • Solomon Rutzky (9/9/2014)


    Regarding the two items you mentioned, Tally function and splitter, both are available for free in the SQL#[/url] library, along with quite a few others. There are several versions of the "tally" functionality: Util_GenerateDateTimeRange, Util_GenerateDateTimes, Util_GenerateFloatRange, Util_GenerateFloats, Util_GenerateIntRange, and Util_GenerateInts. There is also: String_Split and RegEx_Split. All of these (and all String, RegEx, and Util functions) work with the Assembly set to SAFE so there is no "good" reason to not use them, or write your own, outside of working with Azure SQL Database which does not support CLR Integration at this time.

    Well the string splitter I can't speak about, but there is a great reason not to use util.GenerateIntRange/DateRange... it's slower than just using a cross join in memory tally table. You are better off writing a schemabound TVF to generate a tally table. The reason is that the query analyzer has no clue about sTVFs and just flatly assumes 1000 rows in sql 2014 (I think it was lower in 2008). Unfortunately, stvfs are probably the best feature of the CLR integration, but there are generally implemented poorly. I've issued a connect items asking for hints to help with this but all have been marked as won't fix.

    https://connect.microsoft.com/SQLServer/feedbackdetail/view/874114

    I tried a similar experiment when trying to come up with a faster splitter, and I found that as a row source it was worse than the usual t0->e2->e4 approach. It was still slower going into the millions of rows as well.

    I've already written my own hybrid string splitter which suits my needs well (you can get its implementation from the thread).

  • Jeff Moden (9/10/2014)


    Solomon Rutzky (9/9/2014)


    In the end, isn't this why Microsoft gave us CLR Integration?

    Heh... yeah... just like they gave us the ability to write extended stored procedures. 😉 I wonder when they'll deprecate and discontinue SQLCLR.

    Not likely ever. This is internally used to implement things like HierarchyID, Spatial types, etc. The CLR is an exposure of an internal system for us to use.

  • Steve Jones - SSC Editor (9/10/2014)


    Jeff Moden (9/10/2014)


    Solomon Rutzky (9/9/2014)


    In the end, isn't this why Microsoft gave us CLR Integration?

    Heh... yeah... just like they gave us the ability to write extended stored procedures. 😉 I wonder when they'll deprecate and discontinue SQLCLR.

    Not likely ever. This is internally used to implement things like HierarchyID, Spatial types, etc. The CLR is an exposure of an internal system for us to use.

    MDS and DQS also rely heavily on CLR - those are not going away anytime soon.

    "I cant stress enough the importance of switching from a sequential files mindset to set-based thinking. After you make the switch, you can spend your time tuning and optimizing your queries instead of maintaining lengthy, poor-performing code."

    -- Itzik Ben-Gan 2001

  • Steve Jones - SSC Editor (9/10/2014)


    Jeff Moden (9/10/2014)


    Solomon Rutzky (9/9/2014)


    In the end, isn't this why Microsoft gave us CLR Integration?

    Heh... yeah... just like they gave us the ability to write extended stored procedures. 😉 I wonder when they'll deprecate and discontinue SQLCLR.

    Not likely ever. This is internally used to implement things like HierarchyID, Spatial types, etc. The CLR is an exposure of an internal system for us to use.

    Way back when, extended procs were used internally; so they are something that was used internally, exposed for us to use, and deprecated (but not yet discontinued) so why shouldn't SQLCLR eventually suffer the same fate?

    Tom

  • Alan.B (9/10/2014)


    MDS and DQS also rely heavily on CLR - those are not going away anytime soon.

    Some time in the 90s someone probably said "ODS relies heavily on extended stored procedures, and that's not going away soon". 😀

    Strangely enough, extended stored procedures have already outlasted ODS by 6 SQL Server releases. :w00t:

    It's possible that SQLCLR will outlast MDS and DQS by a few releases too. But I suspect that MDS and DQS will not be disappearing anything like as quickly as ODS did.

    Tom

  • Extended stored procedures weren't used internally to implement functionality, but rather to extend the platform to do things that the didn't want to code.

    The CLR is embedded in the internal workings of SQL Server, at least I'd argue it's different. Could they remove it? Sure. Not likely though since they (MS) continue to use it to build new things.

  • TomThomson (9/10/2014)


    Steve Jones - SSC Editor (9/10/2014)


    Jeff Moden (9/10/2014)


    Solomon Rutzky (9/9/2014)


    In the end, isn't this why Microsoft gave us CLR Integration?

    Heh... yeah... just like they gave us the ability to write extended stored procedures. 😉 I wonder when they'll deprecate and discontinue SQLCLR.

    Not likely ever. This is internally used to implement things like HierarchyID, Spatial types, etc. The CLR is an exposure of an internal system for us to use.

    Way back when, extended procs were used internally; so they are something that was used internally, exposed for us to use, and deprecated (but not yet discontinued) so why shouldn't SQLCLR eventually suffer the same fate?

    Tom, thanks for bringing this up. I was just about to mention XPs since while I agree with the point that Steve and Alan.B (in another post regarding CLR being used in MDS and DQS) are making, we simply don't know what the future holds. But Extended Stored Procedures are a good reference point.

    So to answer the related questions from Tom (i.e. "so why shouldn't SQLCLR eventually suffer the same fate?") and Jeff (i.e. "I wonder when they'll deprecate and discontinue SQLCLR"): SQLCLR might get deprecated someday and so what if it does? As it has been pointed out already, XPs were deprecated when SQL Server 2005 came out (9 years ago!) and are still functioning. Any technology that we use can become obsolete at any time, including SQL Server itself. The idea that maybe someday in the future this particular feature might be unavailable is no reason at all (or actually, a bad reason) to not use it today, especially seeing how long it would likely continue to function after the initial deprecation announcement.

    CLR Integration was introduced NINE years ago and while not perfect, does allow for some very useful and interesting things to be done without waiting for Microsoft to decide to implement something. In terms of longevity of these solutions, for those of us who have been using it for even close to the past nine years, even if it is deprecated in the next 4 years and continues to function for at least 10 years after that, how can a 23 year time frame be seen as a negative? And that is only 23 years IF they deprecate it, which is not looking likely. Nothing we build is going to last forever, especially within the context of computers, yet we all have a job to get done so why not get it done and move onto the next project? I certainly don't think that those who wrote XPs (prior to adopting SQL Server 2005 or newer) wasted their time or acted contrary to the best interests of their employers given that those solutions should still be functioning. And again, so what if something changes and that project needs to get refactored? What is to say that the project won't need to get refactored anyway due to some other circumstance, prior to XPs (or even SQL CLR) ceasing to function?

    Take care,

    Solomon..

    SQL#https://SQLsharp.com/ ( SQLCLR library ofover 340 Functions and Procedures)
    Sql Quantum Lifthttps://SqlQuantumLift.com/ ( company )
    Sql Quantum Leaphttps://SqlQuantumLeap.com/ ( blog )
    Info sitesCollations     •     Module Signing     •     SQLCLR

  • Great stuff!

    but as to Jeff's notes on collation: we do need a function that can handle compound delimiters, e.g. [|;] or [","] as in David Data's case of ["John Smith","23, The High Street","Sometown"].

    This is not so difficult. Just change the type of pDelimiter to varchar(3) (I don't think you'll never need a delimiter longer than that) and change the definition of cteStart(N1) to:

    cteStart(N1) AS (--==== This returns N+DATALENGTH(@pDelimiter)

    SELECT 1 UNION ALL

    SELECT t.N+DATALENGTH(@pDelimiter)

    FROM cteTally t

    WHERE SUBSTRING(@pString,t.N,DATALENGTH(@pDelimiter)) = @pDelimiter

    ),

    Now we can solve David Data's case, by using [","] as delimiter and removing the initial and final double quotes from the item's returned, since such quotes cannot appear within the quoted strings:

    select ItemNumber, REPLACE(Item, '"', '')

    from dbo.DelimitedSplit8k('"John Smith","23, The High Street","Sometown"','","')

  • stefan.mulder (9/24/2014)


    Great stuff!

    but as to Jeff's notes on collation: we do need a function that can handle compound delimiters, e.g. [|;] or [","] as in David Data's case of ["John Smith","23, The High Street","Sometown"].

    This is not so difficult. Just change the type of pDelimiter to varchar(3) (I don't think you'll never need a delimiter longer than that) and change the definition of cteStart(N1) to:

    cteStart(N1) AS (--==== This returns N+DATALENGTH(@pDelimiter)

    SELECT 1 UNION ALL

    SELECT t.N+DATALENGTH(@pDelimiter)

    FROM cteTally t

    WHERE SUBSTRING(@pString,t.N,DATALENGTH(@pDelimiter)) = @pDelimiter

    ),

    Now we can solve David Data's case, by using [","] as delimiter and removing the initial and final double quotes from the item's returned, since such quotes cannot appear within the quoted strings:

    select ItemNumber, REPLACE(Item, '"', '')

    from dbo.DelimitedSplit8k('"John Smith","23, The High Street","Sometown"','","')

    There is a skeleton of the functionality here [/url]if you are on 2012 or later

    😎

  • The "," is not a compound delimiter. According to CSV Standards RFC 4180 This indicates an escaped field. A field containing an escape sequence of *(TEXTDATA / COMMA / CR / LF / 2DQUOTE). A CSV parser should handle the escape sequences. Try Microsoft Excel CSV format and put a CRLF, Double quote or comma in a cell and save it. Open the CSV in notepad.

    If the parser does not handle the escape sequence; it is just an implementation of delimited field parser.

  • Can anyone help me understand why the second and third selects in the following generate an error?

    -- create a table with 3 UUIDs

    DECLARE @tbl TABLE (id UNIQUEIDENTIFIER);

    INSERT INTO @tbl VALUES

    ('339FCE92-B595-4DFE-9F44-73BAB867DB53'),

    ('790CD7F2-63E6-4D2A-BDE8-22344DDD1AD4'),

    ('CDFC3B98-E00C-4CB6-BBC2-935A9AD9E1AE');

    -- a comma-separated list

    DECLARE @pString VARCHAR(8000)=

    '339FCE92-B595-4DFE-9F44-73BAB867DB53,790CD7F2-63E6-4D2A-BDE8-22344DDD1AD4';

    -- works

    SELECT * FROM @tbl t1

    INNER JOIN dbo.DelimitedSplit8K(@pString,',') d ON d.Item = t1.id;

    -- error - "Conversion failed when converting from a character string to uniqueidentifier."

    SELECT * FROM @tbl t1

    WHERE id IN (SELECT Item FROM dbo.DelimitedSplit8K(@pString,','));

    -- error - "Conversion failed when converting from a character string to uniqueidentifier."

    SELECT * FROM @tbl t1

    WHERE id IN (SELECT CAST(Item AS UNIQUEIDENTIFIER) FROM dbo.DelimitedSplit8K(@pString,','));

  • sfrostx (9/25/2014)


    Can anyone help me understand why the second and third selects in the following generate an error?

    The optimizer is free to reorder expression evaluation. In the second and third cases, it happens to push the implicit conversion to uniqueidentifier down to a place where candidate substrings are not fully-formed results yet. If you are using SQL Server 2012 or later, you could use TRY_CONVERT (or TRY_CAST):

    SELECT * FROM @tbl t1

    WHERE id IN

    (

    SELECT TRY_CONVERT(uniqueidentifier, Item)

    FROM dbo.DelimitedSplit8K(@pString,',')

    );

    Otherwise, break the results of the split into a table before performing the join.

  • sfrostx (9/25/2014)


    Can anyone help me understand why the second and third selects in the following generate an error?

    I don't understand it at all. An interesting twist is that whether the failing select produces a row before producing its error message depends on the order in which the rows were inserted into @tbl. The conversion fails for the string beginning 7 and also for the string beginning C when the value it is to be produced is part of an IN list, but the same conversion works in that context for the string beginning 3. However, I can't see any way of making the conversion fail for either value except when operating inside an IN list. There's nothing special about the GUIDs, all three are version 4 GUIDs (assuming that MS still uses standard quartet order for the text representation of its GUIDs, which it certainly used to do). The two that don't work are MS GUID version 4 standard. The one that does work is MS GUID version 4 for NCS backward compatability, and I can't see MS treating that as a particularly privileged sort of GUID. So to me it looks like a bug, but maybe someone who knows more about how MS in general and SQL Server in particular treat GUIDs will come up with a non-bug explanation. If you don't get a better response from someone, maybe you should raise a connect item for it.

    I haven't played with this on any SQL Server version other than 2014 - is that the version you hit the problem on?

    edit: I see Paul has a different, probably more useful, answer. As I read his answer, this definitely is a bug. But MS would deny that because the optimiser is a sacred cow. It's certainly interesting that TRY_CONVERT and TRY_CAST work when neither CONVERT nor CAST does, so at least there's a workaround.

    Tom

Viewing 15 posts - 676 through 690 (of 990 total)

You must be logged in to reply to this topic. Login to reply