September 30, 2015 at 9:28 am
Hi Everyone,
I am extracting data from a SharePoint list object and one of the columns that I am pulling from SharePoint has free form text inside various <div> and/or <p> tags. I have the data extracted from SharePoint and stored in a SQL Server 2012 table and I've gotten the basic string parsing working but I can see that my design is a bit problematic and I am trying to see if there is a way to re-write this so that the SQL statement can handle changing amounts of data in the future. I will not know ahead of time if the data will have 1 or 100 <p> tags so I figured I'd reach out for some guidance.
Right now, in the data that I have to test against I have the following 4 patterns:
Text data between 2 div tags:
<div class="ExternalClassFB2223956DB545BC820BD55A20925FD7">Estimated measurements of all braid & sleeving. Cut to length. Applied 2 layers of braid as required per drawing and sleeving. </div>
Text data inside a single paragraph opening/closing tag between the div tags:
<div class="ExternalClassF7C7EB4C8ED147B9AC7EDEEE90B9E6EE"><p>sn 150286-01-07, -08, -09, -13, -14, -15</p></div>
Text data inside 2 different paragraph opening/closing tags between the div tags:
<div class="ExternalClass6C6D6B8FA7CB43218E63A65B555A1A91"><p>sn changed to150078-02-1</p><p>was325-14-299-001</p></div>
Text data inside 3 different paragraph opening/closing tags between the div tags:
<div class="ExternalClass4A79E8DF5C67405CA0FA7385F7A9A618"><p>wrap 4 harnesses with nylon tape 1.75 hrs</p><p>wrap 4 harnesses with foil 4.25 hrs</p><p>apply 2 layers of braid 2 hrs.</p></div>
The case statement that I wrote to parse this out is here (I'm also using Cade Bryant's dbo.charindex2 function http://www.sqlservercentral.com/scripts/Miscellaneous/30497/):
select CASE
WHEN len(notes) - len(replace(notes, '<p>', '')) = 0 --No <p> tags in the Notes column
THEN substring(notes, (dbo.charindex2('">', notes, 1) + 2), (dbo.charindex2('</div>', notes, 1) - (dbo.charindex2('">', Notes, 1) + 2)))
WHEN len(notes) - len(replace(notes, '<p>', '')) = 3 --1 <p> tag in the Notes column
THEN substring(notes, (dbo.charindex2('<p>', notes, 1) + 3), (dbo.charindex2('</p>', notes, 1) - (dbo.charindex2('<p>', notes, 1) + 3)))
WHEN len(notes) - len(replace(notes, '<p>', '')) = 6 --2 <p> tags in the Notes column
THEN substring(notes, (dbo.charindex2('<p>', notes, 1) + 3), (dbo.charindex2('</p>', notes, 1) - (dbo.charindex2('<p>', notes, 1) + 3))) + ' ' + substring(notes, (dbo.charindex2('<p>', notes, 2) + 3), (dbo.charindex2('</p>', notes, 2) - (dbo.charindex2('<p>', notes, 2) + 3)))
WHEN len(notes) - len(replace(notes, '<p>', '')) = 9 --3 <p> tags in the Notes column
THEN substring(notes, (dbo.charindex2('<p>', notes, 1) + 3), (dbo.charindex2('</p>', notes, 1) - (dbo.charindex2('<p>', notes, 1) + 3))) + ' ' + substring(notes, (dbo.charindex2('<p>', notes, 2) + 3), (dbo.charindex2('</p>', notes, 2) - (dbo.charindex2('<p>', notes, 2) + 3))) + ' ' + substring(notes, (dbo.charindex2('<p>', notes, 3) + 3), (dbo.charindex2('</p>', notes, 3) - (dbo.charindex2('<p>', notes, 3) + 3)))
ELSE NULL --unhandled
END parsed_notes
from stg.SP_ProdControl_Time
where notes is not null
The above SQL works and successfully parses out the free form entry text that is placed into the SharePoint text box but the maintenance of this script is going to be a pain in the neck. I'm certain there is a better/more efficient way of accomplishing this task and I wanted to see if anyone had any suggestions.
Thanks for taking the time to read this.
Josh
September 30, 2015 at 10:45 am
HTML is a subset of XML, it's probably more efficient to treat it as XML and use xquery.
Drew
J. Drew Allen
Business Intelligence Analyst
Philadelphia, PA
September 30, 2015 at 11:27 am
Yikes. Here's an alternative that should perform better for you:
select STUFF(
cast((select ' ' + SUBSTRING(spt.notes, t.tally + 3, CHARINDEX('</p', spt.notes, t.tally + 3) - (t.tally + 3))
from (
select CHARINDEX('<p>', spt.notes) as first_p_tag
) as ca1
inner join dbo.tally t on first_p_tag > 0 and t.tally >= first_p_tag and SUBSTRING(spt.notes, t.tally, 3) = '<p>'
for xml path('')
) as varchar(max))
, 1, 1, '')
from ( --stg.SP_ProdControl_Time
select '<div class="ExternalClass4A79E8DF5C67405CA0FA7385F7A9A618"><p>?wrap 4 harnesses with nylon tape 1.75 hrs</p><p>wrap 4 harnesses with foil 4.25 hrs</p><p>apply 2 layers of braid 2 hrs.</p></div>' as notes union all
select '<div class="ExternalClass4A79E8DF5C67405CA0FA7385F7A9A618"><p>?wrap 4 harnesses with nylon tape 1.75 hrs</p><p>wrap 4 harnesses with foil 4.25 hrs</p><p>apply 2 layers of braid 2 hrs.</p><p>tag4blah</p><p>tag5blahblah</p><p>tag6blahblahblah</p></div>'
) spt
SQL DBA,SQL Server MVP(07, 08, 09) "It's a dog-eat-dog world, and I'm wearing Milk-Bone underwear." "Norm", on "Cheers". Also from "Cheers", from "Carla": "You need to know 3 things about Tortelli men: Tortelli men draw women like flies; Tortelli men treat women like flies; Tortelli men's brains are in their flies".
September 30, 2015 at 12:17 pm
Thank you both for the replies.
Scott, can you explain dbo.tally and t.tally. I'm not sure I understand what they are or what I need to do to create that table. Sorry if this is a stupid question.
September 30, 2015 at 12:46 pm
jguffey (9/30/2015)
Thank you both for the replies.Scott, can you explain dbo.tally and t.tally. I'm not sure I understand what they are or what I need to do to create that table. Sorry if this is a stupid question.
No, sorry, my bad. I meant to add something about it but didn't.
A "standard tally" table is just a table of sequential numbers, from 0 to however high you want to go; 1 million rows is typical. It's really remarkable how useful such a simple table can be for many SQL tasks.
I was going to post code to create one but a filter at work thinks it matches a "sql injection" pattern.
At any rate, do a Google search for creating a permanent tally table using CROSS JOINs. I believe Itzik Ben-Gan first demo'd the technique, but anyone can help you create one.
SQL DBA,SQL Server MVP(07, 08, 09) "It's a dog-eat-dog world, and I'm wearing Milk-Bone underwear." "Norm", on "Cheers". Also from "Cheers", from "Carla": "You need to know 3 things about Tortelli men: Tortelli men draw women like flies; Tortelli men treat women like flies; Tortelli men's brains are in their flies".
September 30, 2015 at 1:17 pm
ScottPletcher (9/30/2015)
Yikes. Here's an alternative that should perform better for you:
Part of the reason that XML string splitters are inefficient is converting character separated strings to a format that can be interpreted as XML. Since this string is already in XML format, that overhead is not present.
I haven't had a chance to do a comparison of the two methods.
Drew
J. Drew Allen
Business Intelligence Analyst
Philadelphia, PA
September 30, 2015 at 1:22 pm
drew.allen (9/30/2015)
ScottPletcher (9/30/2015)
Yikes. Here's an alternative that should perform better for you:Part of the reason that XML string splitters are inefficient is converting character separated strings to a format that can be interpreted as XML. Since this string is already in XML format, that overhead is not present.
I haven't had a chance to do a comparison of the two methods.
Drew
Sorry, I meant better than the original code, not necessarily better than the XML method. I don't really know how the XML method might perform.
SQL DBA,SQL Server MVP(07, 08, 09) "It's a dog-eat-dog world, and I'm wearing Milk-Bone underwear." "Norm", on "Cheers". Also from "Cheers", from "Carla": "You need to know 3 things about Tortelli men: Tortelli men draw women like flies; Tortelli men treat women like flies; Tortelli men's brains are in their flies".
September 30, 2015 at 2:29 pm
So I created the tally table and executed your code (thanks for sharing/educating).
I'm a little confused as to why the leading ? character is present in the results but I'll keep digging around with the script and see if I can figure it out. Just wanted to post some feedback.
Query:
SELECT STUFF(cast((
SELECT ' ' + SUBSTRING(spt.notes, t.N + 3, CHARINDEX('</p', spt.notes, t.n + 3) - (t.n + 3))
FROM (
SELECT CHARINDEX('<p>', spt.notes) AS first_p_tag
) AS ca1
INNER JOIN dbo.Numbers t ON first_p_tag > 0
AND t.n >= first_p_tag
AND SUBSTRING(spt.notes, t.n, 3) = '<p>'
FOR XML path('')
) AS VARCHAR(max)), 1, 1, '')
FROM (
--stg.SP_ProdControl_Time
SELECT '<div class="ExternalClass4A79E8DF5C67405CA0FA7385F7A9A618"><p>?wrap 4 harnesses with nylon tape 1.75 hrs</p><p>wrap 4 harnesses with foil 4.25 hrs</p><p>apply 2 layers of braid 2 hrs.</p></div>' AS notes
UNION ALL
SELECT '<div class="ExternalClass4A79E8DF5C67405CA0FA7385F7A9A618"><p>?wrap 4 harnesses with nylon tape 1.75 hrs</p><p>wrap 4 harnesses with foil 4.25 hrs</p><p>apply 2 layers of braid 2 hrs.</p><p>tag4blah</p><p>tag5blahblah</p><p>tag6blahblahblah</p></div>'
) spt
Query Results:
(No column name)
?wrap 4 harnesses with nylon tape 1.75 hrs wrap 4 harnesses with foil 4.25 hrs apply 2 layers of braid 2 hrs.
?wrap 4 harnesses with nylon tape 1.75 hrs wrap 4 harnesses with foil 4.25 hrs apply 2 layers of braid 2 hrs. tag4blah tag5blahblah tag6blahblahblah
September 30, 2015 at 3:05 pm
I wondered about that too, but I noticed there's an odd character before "wrap" that looks like a blank, but really isn't. When you copy it in the SSMS editor, it looks like a (red) dash, but it returns to a blank if you try to put quotes around it :crazy:
SQL DBA,SQL Server MVP(07, 08, 09) "It's a dog-eat-dog world, and I'm wearing Milk-Bone underwear." "Norm", on "Cheers". Also from "Cheers", from "Carla": "You need to know 3 things about Tortelli men: Tortelli men draw women like flies; Tortelli men treat women like flies; Tortelli men's brains are in their flies".
September 30, 2015 at 3:22 pm
jguffey (9/30/2015)
So I created the tally table and executed your code (thanks for sharing/educating).[/code]
Make sure that the only column in the Tally Table is NOT NULL and is the PK of the table so that a unique clustered index is formed on it. Otherwise, it'll be quite slow with table scans all the time.
Here's an introductory article on what Tally Tables are and how they can be used to replace certain types of loops.
http://www.sqlservercentral.com/articles/T-SQL/62867/
--Jeff Moden
Change is inevitable... Change for the better is not.
Viewing 10 posts - 1 through 9 (of 9 total)
You must be logged in to reply to this topic. Login to reply