Processing strings.

  • Jeff Moden (4/1/2016)


    Sachin Nandanwar (3/31/2016)


    Why would you want to do string processing using T-SQL? Please see the attached image. I wrote this in C# in less than 20 minutes. It can then easily be integrated as a CLR routine.

    I can't see the "ELSE" on that and correct me if I'm wrong, but that looks like it might not handle trailing delimiters correctly... which is why I do it in T-SQL. That way I don't have to rely on someone doing it possibly wrong in C#. Heh... and no... not going to take the time to learn C# for the edge cases where I might need it. 😛 I'm too busy fixing things in T-SQL that were written by experts in C#. :hehe:

    Jeff,

    Thanks for this well-formulated answer. Exactly my sentiment.

    Sachin,

    Of course you are right, if you are familiar with C# AND familiar with the CLR interface to the database. Performance-wise I think separating the elements in CLR will be much faster, but I doubt that inserting 10 million elements into a database with CLR will be fast. Even if the 10 million elements are inserted into the database as half a million rows with 20 fields each, I do not know what the actual speed with CLR would be.

    As for the complexity, the simple parsing only takes three lines of SQL code (thanks to Jeff Moden's DelimitedSplit8K). See the code below my signature:

    Reality.

    Most .CSV files have a 'simple' structure and there is no problem parsing them with a 'simple' solution and then transforming them into a table. (File size might become a problem. :crazy:)

    A small number of .CSV files contain free text in the fields, with no restrictions on that text: delimiter symbols, quotes, double quotes and newlines can all be used inside an element.

    The delimiter is always the semicolon (in my example files).

    Only a single text qualifier is used (in my example files), which is the double quote.

    (So the added complexity of supporting both text qualifiers, single quote and double quote, is not there.)

    But of course the 'free text' can contain both text qualifiers.

    Thanks for your time and support,

    Ben

    ;WITH lines AS
     (  SELECT LineNumber = ItemNumber, lineitem = Item
        FROM  (SELECT @s S) S
        CROSS APPLY master.dbo.DelimitedSplit_Long(S, @CR) D )
    ,elements AS
     (  SELECT LineNumber, elementnr = ItemNumber, element = Item
        FROM  lines
        CROSS APPLY master.dbo.DelimitedSplit8K(lineitem, @Delim) D2 )
    SELECT * INTO #elements FROM elements;
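
    For readers who want to run the snippet above: here is a minimal sketch of the declarations it assumes. The variable names come from the snippet; it is assumed here that DelimitedSplit_Long takes the same (string, single-character delimiter) parameters as DelimitedSplit8K, so the line delimiter is a single CHAR(10). Note that this simple two-level split does not honor text qualifiers, which is exactly the limitation described above.

    -- Sample declarations for the two-level split above (a sketch only; adjust to taste).
    DECLARE @CR    CHAR(1)      = CHAR(10),   -- line delimiter (LF)
            @Delim CHAR(1)      = ';',        -- element delimiter (semicolon, per the example files)
            @s     VARCHAR(MAX) = 'Id;Name;City' + CHAR(10)
                                + '1;abc;def'    + CHAR(10)
                                + '2;ghi;jkl';

    -- ...then run the ;WITH lines / elements query above to fill #elements...

    SELECT * FROM #elements ORDER BY LineNumber, elementnr;   -- one row per element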

  • Jeff Moden (4/1/2016)


    I can't see the "ELSE" on that and correct me if I'm wrong, but that looks like it might not handle trailing delimiters correctly... which is why I do it in T-SQL. That way I don't have to rely on someone doing it possibly wrong in C#. Heh... and no... not going to take the time to learn C# for the edge cases where I might need it. 😛 I'm too busy fixing things in T-SQL that were written by experts in C#. :hehe:

    I had a very busy Friday so I could not reply. Below is the entire code, including the ELSE part.

    using System;
    using System.Collections.Generic;
    using System.Linq;
    using System.Text;

    namespace ConsoleApplication1
    {
        class Program
        {
            static void Main(string[] args)
            {
                string s = @"a,b,""'help, me',my name is 'ben'"",c,d,e,""another string, with 'those' quotes"",g";
                List<string> stringsplit = new List<string>();

                // Split on the text qualifier first; quoted sections land in their own entries.
                var splitvalue = s.Split('"').ToList<string>();
                splitvalue.RemoveAll(a => a == ",");

                foreach (var val in splitvalue)
                {
                    if (val.StartsWith(",") || val.EndsWith(","))
                    {
                        // Unquoted section: trim the boundary commas and split the rest on commas.
                        var cnt = val.TrimEnd(',').TrimStart(',');
                        stringsplit.AddRange(cnt.Split(','));
                    }
                    else
                    {
                        // Quoted section: keep it as a single element.
                        if (val.ToString() != null) stringsplit.Add(val);
                    }
                }
            }
        }
    }

    By the way, I am not a C# expert, not even close by the slightest of margins :hehe:. But when I consulted one of my friends, whom I do consider a C# expert, he directed me to this sweet six-line piece of C# code that references the Visual Basic TextFieldParser object. Throw anything at it and it will split it for you. :unsure:

    using System;
    using System.Collections.Generic;
    using System.Linq;
    using System.Text;

    namespace ConsoleApplication1
    {
        class Program
        {
            static void Main(string[] args)
            {
                string s = @"a,b,""'help, me',my name is 'ben'"",c,d,e,""another string, with 'those' quotes,"",g,ab,c,,,MSDN,";
                List<string> stringsplit = new List<string>();

                // Hand the string to the VB TextFieldParser, which understands text qualifiers.
                using (System.IO.MemoryStream mStream = new System.IO.MemoryStream(Encoding.ASCII.GetBytes(s.TrimEnd(','))))
                {
                    using (Microsoft.VisualBasic.FileIO.TextFieldParser TxtParser = new Microsoft.VisualBasic.FileIO.TextFieldParser(mStream))
                    {
                        TxtParser.Delimiters = new string[] { "," };
                        TxtParser.HasFieldsEnclosedInQuotes = true;

                        string[] splitvalue = TxtParser.ReadFields();
                        for (int i = 0; i < splitvalue.Length; i++)
                        {
                            stringsplit.Add(splitvalue[i]);   // add each parsed field to the result list
                        }
                    }
                }
            }
        }
    }

    EDIT: Deleted the code screenshots and replaced it with the actual C# code text.

    --------------------------------------------------------------------------------------------------
    I am just another naive wannabe DBA trying to learn SQL Server

    By the way, I just stumbled across a fantastic discussion of the different split methods for SQL Server. Paul White, Jeff Moden, Aaron Bertrand, Barry Young, Phil Factor (always wondered what his real name is): all these great SQL gurus shared their views. Really worth reading.

    http://www.sqlservercentral.com/Forums/Topic695508-338-1.aspx

    There are some test results posted here: http://www.sqlservercentral.com/Forums/Topic695508-338-10.aspx

    EDIT: Added the second URL.

    --------------------------------------------------------------------------------------------------
    I am just another naive wannabe DBA trying to learn SQL Server

  • Sachin Nandanwar (4/2/2016)


    By the way, I just stumbled across a fantastic discussion of the different split methods for SQL Server. Paul White, Jeff Moden, Aaron Bertrand, Barry Young, Phil Factor (always wondered what his real name is): all these great SQL gurus shared their views. Really worth reading.

    http://www.sqlservercentral.com/Forums/Topic695508-338-1.aspx

    There are some test results posted here: http://www.sqlservercentral.com/Forums/Topic695508-338-10.aspx

    EDIT: Added the second URL.

    Ah... be real careful about using those test results. I've proven many times that using the same row of data a million times as test data (I call it "Grooved Data") will cause things like XML splitters to seem to run in a nearly instantaneous fashion. Then, when you play them against real data that has a high cardinality, it becomes absolutely slothful. Even if you use 5 different rows that have been multiplied to occupy a million rows, you end up with a similar situation.

    If you want to see splitter testing that has been done correctly, please see the following article on the subject.

    http://www.sqlservercentral.com/articles/Tally+Table/72993/

    I also tried to explain a bit of the problem throughout the later parts of those threads but it fell on deaf ears. I've also tried to warn people on other articles that do performance testing but that also met with deaf ears. That's why some people think that an XML-based string splitter that has to convert the CSV to XML is faster than any Tally Table-based method.

    In fact, I'm in the final edit of an article that explains that the "Devil's in the Data".

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • P.S. Please stop posting pictures of code. You can't copy the code from those for test.

    Also, instead of using C# to call Visual Basic, stop the tower of Babel and just make the CLR using Visual Basic.

    I'll also warn that the built in splitter for C# is relatively slow. It could be that the similar splitter in VB is just as slow. If it were me, I'd test the hell out of it for performance using the code from the Tally OH article I just referenced.
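
    As a quick illustration of that kind of test, here is a bare-bones timing sketch (an assumption on my part, not the Tally OH test harness): build a known test string, shunt the splitter output into a variable so display time is not measured, and record the elapsed time. It uses the thread's master.dbo.DelimitedSplit8K and assumes an fnTally-style numbers function from the signature links; the same pattern would apply to a deployed CLR splitter.

    -- Sketch of a minimal timing harness for a splitter under test.
    DECLARE @s VARCHAR(8000), @Bitbucket VARCHAR(8000), @Start DATETIME2(7);

    SELECT @s = STUFF((SELECT ',' + CONVERT(VARCHAR(12), t.N)
                       FROM dbo.fnTally(1, 1000) t
                       FOR XML PATH('')), 1, 1, '');     -- test string: '1,2,3,...,1000'

    SELECT @Start = SYSDATETIME();

    SELECT @Bitbucket = split.Item                        -- shunt the output into a variable so
    FROM master.dbo.DelimitedSplit8K(@s, ',') split;      -- display time isn't part of the measurement

    SELECT ElapsedMs = DATEDIFF(MILLISECOND, @Start, SYSDATETIME());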

    --Jeff Moden



  • Jeff Moden (4/2/2016)


    Ah... be real careful about using those test results. I've proven many times that using the same row of data a million times as test data (I call it "Grooved Data") will cause things like XML splitters to seem to run in a nearly instantaneous fashion. Then, when you play them against real data that has a high cardinality, it becomes absolutely slothful. Even if you use 5 different rows that have been multiplied to occupy a million rows, you end up with a similar situation.

    Thanks for sharing that. It would be really great if you could share some of your "Grooved Data".

    If you want to see splitter testing that has been done correctly, please see the following article on the subject.

    http://www.sqlservercentral.com/articles/Tally+Table/72993/

    I went through the article. At the end there is a performance graph (http://www.sqlservercentral.com/Images/9319.gif) and the following conclusion:

    Make no doubt about it. The current best way to split delimited strings in SQL Server is to use a CLR splitter. However, if, for whatever reason, you cannot use a CLR splitter, the new DelimitedSplit8K function provides a close second with both linear and stable performance across a wide range of string and individual element size.

    Does the above statement still stand?

    I also tried to explain a bit of the problem throughout the later parts of those threads but it fell on deaf ears. I've also tried to warn people on other articles that do performance testing but that also met with deaf ears. That's why some people think that an XML-based string splitter that has to convert the CSV to XML is faster than any Tally Table-based method.

    In fact, I'm in the final edit of an article that explains that the "Devil's in the Data".

    Would really love to read the article.

    --------------------------------------------------------------------------------------------------
    I am just another naive wannabe DBA trying to learn SQL Server

  • Jeff Moden (4/2/2016)


    P.S. Please stop posting pictures of code. You can't copy the code from those for test.

    Apologies. I will fix it.

    Also, instead of using C# to call Visual Basic, stop the tower of Babel and just make the CLR using Visual Basic.

    I don't think it should matter, because that is what .NET interoperability is all about. All managed and unmanaged code can coexist within the same assembly.

    I'll also warn that the built in splitter for C# is relatively slow. It could be that the similar splitter in VB is just as slow. If it were me, I'd test the hell out of it for performance using the code from the Tally OH article I just referenced.

    On any given day I would rather believe your words more than my tests...:-)

    --------------------------------------------------------------------------------------------------
    I am just another naive wannabe DBA trying to learn SQL Server

  • Sachin Nandanwar (4/2/2016)


    EDIT: Deleted the code screenshots and replaced it with the actual C# code text.

    You, Sir, are a very good man. Well done and thank you very much.

    --Jeff Moden



  • Sachin Nandanwar (4/3/2016)


    Jeff Moden (4/2/2016)


    Ah... be real careful about using those test results. I've proven many times that using the same row of data a million times as test data (I call it "Grooved Data") will cause things like XML splitters to seem to run in a nearly instantaneous fashion. Then, when you play them against real data that has a high cardinality, it becomes absolutely slothful. Even if you use 5 different rows that have been multiplied to occupy a million rows, you end up with a similar situation.

    Thanks for sharing that. It would be really great if you could share some of your "Grooved Data".

    The entire thread that you've made reference to is filled with such examples. But, for convenience, here's the test data generation code from a site that made the mistake of using "grooved" data.

    SET NOCOUNT ON;

    GO

    CREATE TABLE #x(s NVARCHAR(MAX));

    INSERT #x SELECT N'a,id,xyz,abcd,abcde,sa,foo,bar,mort,splunge,bacon,';

    GO

    INSERT dbo.strings SELECT 1, s FROM #x;

    GO 10000

    INSERT dbo.strings SELECT 2, REPLICATE(s,10) FROM #x;

    GO 1000

    INSERT dbo.strings SELECT 3, REPLICATE(s,100) FROM #x;

    GO 100

    INSERT dbo.strings SELECT 4, REPLICATE(s,1000) FROM #x;

    GO 10

    INSERT dbo.strings SELECT 5, REPLICATE(s,10000) FROM #x;

    GO

    DROP TABLE #x;

    GO

    -- then to clean up the trailing comma, since some approaches treat a trailing empty string as a valid element:

    UPDATE dbo.strings SET string_value = SUBSTRING(string_value, 1, LEN(string_value)-1) + 'x';
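
    For contrast, here is a rough sketch (not from the article or either thread) of one way to build higher-cardinality, non-grooved test data, where every row has a different number of elements and different element content. It assumes a numbers function such as fnTally from the signature links; any tally/numbers source would do.

    -- Sketch: 10,000 CSV rows, each with 1 to 20 random GUID elements, so no two rows are alike.
    IF OBJECT_ID('tempdb..#strings') IS NOT NULL DROP TABLE #strings;

    SELECT TOP (10000)
           string_value = ca.csv
    INTO   #strings
    FROM        sys.all_columns ac1
    CROSS JOIN  sys.all_columns ac2
    CROSS APPLY
    (   SELECT csv = STUFF(
            ( SELECT ',' + CONVERT(VARCHAR(36), NEWID())                             -- random element text
              FROM dbo.fnTally(1, ABS(CHECKSUM(NEWID(), ac1.object_id)) % 20 + 1) t  -- random element count
              FOR XML PATH('') ), 1, 1, '')
    ) ca;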

    If you want to see splitter testing that has been done correctly, please see the following article on the subject.

    http://www.sqlservercentral.com/articles/Tally+Table/72993/

    I went through the article. At the end there is a performance graph (http://www.sqlservercentral.com/Images/9319.gif) and the following conclusion:

    Make no doubt about it. The current best way to split delimited strings in SQL Server is to use a CLR splitter. However, if, for whatever reason, you cannot use a CLR splitter, the new DelimitedSplit8K function provides a close second with both linear and stable performance across a wide range of string and individual element size.

    Does the above statement still stand?

    If it didn't, I'd correct it in the article. 😉 It's my nature to be a bit pedantic about such things. 😛

    There is a modification of the DelimitedSplit8K splitter that Eirikur Eiriksson published that uses some additional functionality that came out in 2012 to nearly double the performance, yet again. I've not personally tested it but most agree on the performance findings in his article. You can find his good code at the following URL.

    http://www.sqlservercentral.com/articles/SQL+Server+2012/106903/
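
    For orientation only, here is a minimal sketch of the general LEAD-based idea (an illustration, not Eirikur's published function): find the start of every element with a tally, then use LEAD() to look ahead to the start of the next element. It assumes SQL Server 2012+ and a numbers function such as fnTally.

    -- Sketch of the LEAD-based split idea (SQL Server 2012+).
    DECLARE @s VARCHAR(8000) = 'abc;de;;f', @Delim CHAR(1) = ';';

    WITH Starts AS
    (   -- start position of every element: 1, plus the character after each delimiter
        SELECT pos = 1
        UNION ALL
        SELECT t.N + 1
        FROM dbo.fnTally(1, LEN(@s)) t
        WHERE SUBSTRING(@s, t.N, 1) = @Delim
    )
    SELECT ItemNumber = ROW_NUMBER() OVER (ORDER BY pos),
           Item       = SUBSTRING(@s, pos,
                          ISNULL(LEAD(pos) OVER (ORDER BY pos), LEN(@s) + 2) - pos - 1)
    FROM Starts;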

    I also tried to explain a bit of the problem throughout the later parts of those threads but it fell on deaf ears. I've also tried to warn people on other articles that do performance testing but that also met with deaf ears. That's why some people think that an XML-based string splitter that has to convert the CSV to XML is faster than any Tally Table-based method.

    In fact, I'm in the final edit of an article that explains that the "Devil's in the Data".

    Would really love to read the article.

    Thank you, kind Sir. I've been tweaking it and should have it in Steve Jones' hands before the end of the week.

    --Jeff Moden



  • Sachin Nandanwar (4/3/2016)


    Jeff Moden (4/2/2016)


    P.S. Please stop posting pictures of code. You can't copy the code from those for test.

    Apologies. I will fix it.

    Again, thank you. That will make it a whole lot easier for anyone who wants to test your code examples. Who knows what folks will come up with when given such an excellent opportunity.

    Also, instead of using C# to call Visual Basic, stop the tower of Babel and just make the CLR using Visual Basic.

    I don't think it should matter, because that is what .NET interoperability is all about. All managed and unmanaged code can coexist within the same assembly.

    It's not the .NET interoperability that I'm concerned with in such cases. That usually works just fine. Although the code is not likely to need a change, it does rely on someone not only knowing T-SQL and SQLCLR with C#, but also knowing VB, to be able to maintain it properly. This is what Ben and I have been taking some good exception to in this thread.

    I'll also warn that the built in splitter for C# is relatively slow. It could be that the similar splitter in VB is just as slow. If it were me, I'd test the hell out of it for performance using the code from the Tally OH article I just referenced.

    On any given day I would rather believe your words more than my tests...:-)

    You're way too kind in that area. I gave up working with most any kind of code other than DOS, WMI, and T-SQL way back in 2002 so that I could concentrate on what I love and to get away from the arm-chair GUI designers that had me pushing objects by pixel or required me to do ridiculous things like putting White lettering on a Pale Yellow background for a warning field on screen.

    I tell you that because I've not personally done the tests. I don't even have Visual Studio installed on my machines. I can only advise that people that I trust have demonstrated the comparative slowness but, as with all else, people should do their own testing for confirmation. As a very wise man once said, "One good test is worth a thousand 'expert' opinions". 🙂

    --Jeff Moden



    Thank you very much, Sir Jeff. You are way too humble given your high stature in the world of SQL Server.

    I am glad that I chimed in on this post. It made me understand a lot of things that I was quite ignorant about. But I still stand by my previous statement:

    On any given day I would rather believe Sir Jeff's words more than my tests... :-)

    And thank you, Eirikur Eiriksson, for this wonderful article: http://www.sqlservercentral.com/articles/SQL+Server+2012/106903/

    --------------------------------------------------------------------------------------------------
    I am just another naive wannabe DBA trying to learn SQL Server

  • First of all thanks to all the contributors of this thread.

    Although this thread and all the links (and indirect links) are a bit overwhelming, I have already learned a lot, even though I haven't yet 'processed' all the info 'generated' by this thread.

    So what is next for me?

    A good starting point is:

    Reaping the benefits of the Window functions in T-SQL

    As said, I have not completely processed this info, but I encountered some issues:

    1. It is still based on 8K string lengths.

    2. It uses the new LEAD and LAG functions.

    3. It does not handle multi-line values in a multi-row CSV. (Or did I miss this?)

    1.

    Asked about this in:

    DelimitedSplit8K versus DelimitedSplit_Long

    2.

    I have to study these constructions and/or maybe use equivalent 2008 constructions; a sketch of one possible equivalent follows below.

    (But this will take me some time.)

    Any advice on this is appreciated.
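
    As a rough sketch only (not code from the article), LEAD(pos) can be emulated on SQL Server 2008 by numbering the delimiter positions with ROW_NUMBER() and joining each row to the next one. A numbers function such as fnTally (or any numbers table) is assumed.

    -- Sketch: 2008-compatible split, emulating LEAD by joining row rn to row rn + 1.
    DECLARE @s VARCHAR(8000) = 'abc;de;;f', @Delim CHAR(1) = ';';

    WITH DelimPos AS
    (   -- position 0 plus the position of every delimiter in the string
        SELECT rn  = ROW_NUMBER() OVER (ORDER BY d.N),
               pos = d.N
        FROM ( SELECT 0
               UNION ALL
               SELECT t.N
               FROM dbo.fnTally(1, LEN(@s)) t
               WHERE SUBSTRING(@s, t.N, 1) = @Delim ) d(N)
    )
    SELECT ItemNumber = cur.rn,
           Item       = SUBSTRING(@s, cur.pos + 1,
                          ISNULL(nxt.pos, LEN(@s) + 1) - cur.pos - 1)
    FROM DelimPos cur
    LEFT JOIN DelimPos nxt          -- the 2008 stand-in for LEAD(pos)
           ON nxt.rn = cur.rn + 1;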

    3.

    A possibility would be to use the code twice: first split the total CSV into lines (ignoring the CRLFs within values), and then split the lines into values.

    Or create a 'split' routine 'intelligent' enough to do both in one go.

    (Maybe I missed something in the article which solves this problem.)

    At the moment my best option (I think) is to split the CSV into lines using the described technique, with an extended split function which can handle long strings, and then split the lines into values. Maybe the second 'round' can be done with the 8K version, provided that there are no lines longer than 8K. (Some multi-line values can be pretty large, I have seen; they can contain complete 'letters' and/or result overviews.)

    Am I on the right track? Any advice?

    The code I built (earlier in this thread) did help me to understand (part of) the problem, but it probably performs worse than the 'Reaping' version. So I would build my solution on the 'Reaping' version, but with the three points above 'solved'.

    All thanks for your contributions so far,

    and for teaching me,

    Ben

  • Jeff Moden (4/2/2016)


    Also, instead of using C# to call Visual Basic, stop the tower of Babel and just make the CLR using Visual Basic.

    Heck, I've been able to use multiple languages for ages. Microsoft has always been good about standardizing their object files for separate compilation and linking; even back in old stone-aged VMS, you could pretty much link object files if you knew the calling conventions.

    This doesn't even touch on client/server arrangements, browser-agnostic web pages, any number of scripting strategies, Unix piping philosophies, etc.

    But whatever floats your boat, if you can't see two inches beyond your TSQL experience, the least you could do is admit that separate language linking is at least a valid approach.

    (edited truncated submission)

    further edit:

    What happens when you REALLY have a decent pile of work in one language, another pile of work in another language, and an easy way to link them? Are you REALLY going to rewrite a significant amount of work just to satisfy this "tower of babel" argument? C# often gets features first, but some work is still available in VB that someone might want to take advantage of.

  • patrickmcginnis59 10839 (4/4/2016)


    But whatever floats your boat, if you can't see two inches beyond your TSQL experience, the least you could do is admit that separate language linking is at least a valid approach.

    You don't need to get snotty about it, Patrick. My point is that I've worked in shops where Active X, Perl, VBS, VB, C#, DOS, and a couple of other "tools" were all used in a single system that not only worked at a snail's pace (took 45 minutes just to get 1 file ready for import, never mind actually importing it) but no one that was currently employed knew about many of the tools and so no one did anything to improve the situation. What "floats my boat" is people being able to fix other people's code and, if you have "Tower of Babel" code, that task is made much more complex even for people that have more than one oar in the water.

    To coin a phrase... "Just because you can do something in a dozen different languages, doesn't mean you should and if you do... especially all at once." 😉

    --Jeff Moden



  • Jeff Moden (4/4/2016)


    patrickmcginnis59 10839 (4/4/2016)


    But whatever floats your boat, if you can't see two inches beyond your TSQL experience, the least you could do is admit that separate language linking is at least a valid approach.

    You don't need to get snotty about it, Patrick. My point is that I've worked in shops where Active X, Perl, VBS, VB, C#, DOS, and a couple of other "tools" were all used in a single system that not only worked at a snail's pace (took 45 minutes just to get 1 file ready for import, never mind actually importing it) but no one that was currently employed knew about many of the tools and so no one did anything to improve the situation. What "floats my boat" is people being able to fix other people's code and, if you have "Tower of Babel" code, that task is made much more complex even for people that have more than one oar in the water.

    Right, just like there is no .NET programmer who has ever studied SQL Server. Forgive me for even bothering to comment.

    edit: My point still stands, and I believe it; even if your point has merit, two languages are hardly a Tower of Babel.

    Snotty indeed.
