String manipulation options

  • Hello,

    I am performing extensive string manipulation in a T-SQL script. Basically I'm parsing a single text string with an entire email chain in it. I need to parse out To:, From:, etc. and add all From:'s to a single column, all To:'s to a single column, etc.

    I have this working in T-SQL at a rate of approx. 15 seconds per 1000 rows. At the data volumes I am working with, this won't work.

    So, I'm considering a string parsing CLR function in C#. I believe this would improve speed but don't know how significantly. Any ideas?

    Are there any other alternatives?

    I'm looping through the dataset and for each row, I'm parsing the string for "To:", "From:", "CC:", and "BCC". I've got to consider:

    1) Multiple email addresses on a single line (i.e. person1@co.com, person2@co2.com, person3@abc.com)

    2) Single email per line.

    (i.e. person1@co.com,

    person2@co2.com,

    person3@abc.com)

    Here is a snippet of my code showing the string manipulation section:

    IF PATINDEX('%To:%', @ExtractedTextLine) > 0

    AND PATINDEX('%mailto%', @ExtractedTextLine) = 0

    BEGIN

    IF SUBSTRING(@ExtractedTextLine, LEN(@ExtractedTextLine) - 1, 1) = ','

    BEGIN

    SET @ToMultiLineFlag = 1;

    END;

    SET @To = @To + SUBSTRING(@ExtractedTextLine, PATINDEX('%To:%', @ExtractedTextLine) + 3, 10000) + ';';

    IF SUBSTRING(@ExtractedTextLine, LEN(@ExtractedTextLine) - 1, 1) <> ','

    BEGIN

    SET @To = @To + ';';

    END;

    END;

    IF @ToMultiLineFlag = 1

    AND @Location = 'To'

    AND PATINDEX('%To:%', @ExtractedTextLine) = 0

    BEGIN

    SET @To = @To + @ExtractedTextLine + ';';

    END;

    IF @ToMultiLineFlag = 1

    BEGIN

    SET @ToMultiLineFlag = 0;

    IF @Location = 'To'

    AND SUBSTRING(@ExtractedTextLine, LEN(@ExtractedTextLine) - 1, 1) = ','

    BEGIN

    SET @ToMultiLineFlag = 1;

    END;

    END;

    SET @Location = '';

    I'm looking for ideas on 1) how to speed up the T-SQL query or 2) how to use CLR for this.

    Thanks,

    Brett

  • CLR is by far the best way to do this, at least until SQL 2016 is released. You can use Delimited8KSplit found here on SSC.com to do this, although the 8K characters could be a limitation.

    There is a massive article/forum thread on SSC.com, recently updated with SQL 2016 split command, that has an amazing collection of options (and benchmarking) for string splitting, including some very spiffy CLR code. I think Adam Machanic has some posted online too.

    Best,
    Kevin G. Boles
    SQL Server Consultant
    SQL MVP 2007-2012
    TheSQLGuru on googles mail service

  • Can you share some sample data and expected output?

    When you mention single email per line, do you mean using a carriage return or setting a new row?

    Which of your variables are parameters? What types are they?

    This might not require CLR or splitting.

    Luis C.
    General Disclaimer:
    Are you seriously taking the advice and code from someone from the internet without testing it? Do you at least understand it? Or can it easily kill your server?

    How to post data/code on a forum to get the best help: Option 1 / Option 2
  • How are you with regular expressions? If you are going the clr route, look into .net's regex classes, i think you could find pre-defined patterns on the web for what you are trying to do.

  • Hi Luis,

    I've took some data and changed the information but the format, etc. is there. This data is from a single column (nvarchar(max)) with an entire email chain. So there are CR's and LF's, etc. Since there is no guaranteed format of the data (email from Outlook, Thunderbird, iPhone, etc.), it's a little tricky.

    ******* Start sample data ******************

    Subject: RE: Important info

    From: "Johnson, Al" <al.johnson@biz.com>

    Date: Thu, 1 Oct 2015 21:25:01

    To: "Smith, Hector" <hector.smith@company.com>,

    "Stiles, Harry" <harry.stiles@company.com>,

    "Monk, Thelonious" <thelonious.monk@goodjazz.com>,

    Email text. Bla bla bla.

    Thoughts on important info...

     

    Al Johnson

    Office: 111-222-3333

    Mobile: 777-888-9999

    al.johnson@biz.com

    From: Bill Smith

    Sent: Thursday, October 10, 2015 5:24 PM

    To: Bach, Johann; Mozart, Amadeaus; Smith, Bob; Walters, Sally; Stewart, Marge

    Subject: Important info

     

    Dear valued person,

     

    Email text goes here.

    Very important info...

    Thank you,

    Bill

    ******* End sample data ******************

    From the above example, the output should be 4 variables:

    FROM = "Johnson, Al" <al.johnson@biz.com>; Bill Smith

    TO = "Smith, Hector" <hector.smith@company.com>; "Stiles, Harry" <harry.stiles@company.com>; "Monk, Thelonious" <thelonious.monk@goodjazz.com>; Bach, Johann; Mozart, Amadeaus; Smith, Bob; Walters, Sally; Stewart, Marge

    CC = ''

    BCC = ''

    Hope this helps.

    Thanks,

    Brett

  • If you are just pulling out the email addresses this is certainly something that can be resolved with a simple splitter, CHARINDEX, PATINDEX and a couple other simple tricks. I'll talk a look at what you posted a little later today.

    "I cant stress enough the importance of switching from a sequential files mindset to set-based thinking. After you make the switch, you can spend your time tuning and optimizing your queries instead of maintaining lengthy, poor-performing code."

    -- Itzik Ben-Gan 2001

Viewing 6 posts - 1 through 5 (of 5 total)

You must be logged in to reply to this topic. Login to reply