Extract Data and XML nodes from XML datatype

Question

Extract Data and XML nodes from XML datatype

adirudik

Mr or Mrs. 500

Points: 581
More actions
December 7, 2011 at 2:40 am

#253038

hey.
i've been dabbling with xml for the past few days and from the vast searches i have made i almost have a solution to my issue, but not quite. I hope someone can help me go that extra mile.
As a matter of course, I need to import MS Word documents to the SQL server on a daily basis. For reasons i cant specify we need to do this on an XML basis which Word can provide. I save the document data to an XML field in SQL. The data looks like this:
<?mso-application progid="Word.Document"?>
<w:wordDocument xmlns:aml="http://schemas.microsoft.com/aml/2001/core" xmlns:wpc="http://schemas.microsoft.com/office/word/2010/wordprocessingCanvas" xmlns:dt="uuid:C2F41010-65B3-11d1-A29F-00AA00C14882" xmlns:mc="http://schemas.openxmlformats.org/markup-compatibility/2006" xmlns:o="urn:schemas-microsoft-com:office:office" xmlns:v="urn:schemas-microsoft-com:vml" xmlns:w10="urn:schemas-microsoft-com:office:word" xmlns:w="http://schemas.microsoft.com/office/word/2003/wordml" xmlns:wx="http://schemas.microsoft.com/office/word/2003/auxHint" xmlns:wne="http://schemas.microsoft.com/office/word/2006/wordml" xmlns:wsp="http://schemas.microsoft.com/office/word/2003/wordml/sp2" xmlns:sl="http://schemas.microsoft.com/schemaLibrary/2003/core" w:macrosPresent="no" w:embeddedObjPresent="no" w:ocxPresent="no" xml:space="preserve">
<w:body>
<wx:sect>
<w:p wsp:rsidR="00DB06F9" wsp:rsidRPr="00E27331" wsp:rsidRDefault="00DB06F9" wsp:rsidP="00DB06F9">
<w:pPr>
<w:pStyle w:val="a3" />
<w:jc w:val="center" />
<w:rPr>
<w:rFonts w:ascii="Arial" w:h-ansi="Arial" />
<wx:font wx:val="Arial" />
<w:b />
<w:b-cs />
<w:sz w:val="28" />
<w:sz-cs w:val="28" />
<w:u w:val="single" />
<w:rtl />
</w:rPr>
</w:pPr>
<w:r wsp:rsidRPr="00E27331">
<w:rPr>
<w:rFonts w:ascii="Arial" w:h-ansi="Arial" />
<wx:font wx:val="Arial" />
<w:b />
<w:b-cs />
<w:sz w:val="28" />
<w:sz-cs w:val="28" />
<w:u w:val="single" />
<w:rtl />
</w:rPr>
<w:t>MY DATA</w:t>
</w:r>
</w:p>
as you can see, the Word XML is quite complex. I have identified that the information i need is in sections that end in w:p/w:r/w:t
Here is where it gets tricky. Since a Word document can change and is not a constant, there is a difference between documents as to how the data is sectioned. sometimes it's with tables:
w:wordDocument/w:body/wx:sect/w:tbl/w:tr/w:tc/w:p/w:r/w:t
and sometimes it has sections and sub sections
w:wordDocument/w:body/wx:sect/w:p/w:pPr/w:sectPr/w:ftr/w:p/w:r/w:t
w:wordDocument/w:body/wx:sect/wx:pBdrGroup/w:p/w:r/w:t
So I have 2 solutions that get me close, but not quite there.
1: A code snippet i got in my searcehs for a routine to list all elements in the document. The problem with this one is that it doens't give me the namespaces and if i dont have those, i cant proceed.
DECLARE @idoc INT,
@xml XML SET @xml = (SELECT mytest from TestTable Where ID = 2)
EXEC sp_xml_preparedocument @idoc OUTPUT, @xml;
WITH E AS (SELECT * FROM OPENXML(@idoc,'/',3)), P AS ( -- anchor member
SELECT id, parentid, localname AS [Path] FROM E WHERE parentid IS NULL
UNION ALL -- recursive member
SELECT E.id, E.parentid, P.[Path] + '/' + localname AS [Path]
FROM P INNER JOIN E ON E.parentid = P.id ) SELECT [Path] FROM P
where Path like '%p/r/t'
group by path
EXEC sp_xml_removedocument @idoc
Results:
wordDocument/body/sect/p/pPr/sectPr/ftr/p/r/t
wordDocument/body/sect/p/r/t
wordDocument/body/sect/pBdrGroup/p/r/t
wordDocument/body/sect/sectPr/ftr/p/r/t
2. another code snippet i found somewhere:
declare@xml xml
select@xml =
'
<person>
<first_name>Sam</first_name>
<last_name>Watson</last_name>
<age>43</age>
<gender>Male</gender>
</person>
'
-------------------------------------------
----------------------------------------------------------------
SELECT x.query('
for $node in /descendant::node()[local-name() != ""]
return <node>
<namespace>{ namespace-uri($node) }</namespace>
<localname>{ local-name($node) }</localname>
<value>{ $node }</value>
<parent>{ local-name($node/..) }</parent>
</node>') AS nodes
FROM @xml.nodes('/person') x(x)
SELECT DISTINCT
T.n.value('namespace[1]', 'varchar(100)') AS Namespace,
T.n.value('localname[1]', 'varchar(100)') AS Localname,
T.n.value('value[1]', 'VARCHAR(100)') AS val,
T.n.value('parent[1]', 'VARCHAR(100)') AS parent
FROM ( SELECT
x.query('
for $node in /descendant::node()[local-name() != ""]
return <node>
<namespace>{ namespace-uri($node) }</namespace>
<localname>{ local-name($node) }</localname>
<value>{ $node }</value>
<parent>{ local-name($node/..) }</parent>
</node>') AS nodes
FROM @xml.nodes('/person') x(x)
) q1
CROSS APPLY q1.nodes.nodes('/node') AS T ( n )
this works unless i apply
SET @xml = (SELECT mytest from TestTable Where ID = 2)
in which case it returns nothing. Again, i believe it has a lot to do with the namespaces. Yes, i also see the reoccuring theme here.
I need that last stretch in order to get my entire software to work. Any help here would be great.
Thanx a lot.
Adi
p.s. I'm very new to XML and what i had managed to gather is through a lot of searching. I dont really know how most of this code works so i would appreciate it if in the solution you post there will be some kind of code element. Much obliged.

Viewing 4 posts - 1 through 3 (of 3 total)

You must be logged in to reply to this topic. Login to reply

sturner SSC-Insane Points: 22411 More actions · Answer 1

It may not be entirely doable with TSQL. I would probably opt for manipulating a word document via CLR in C# via the system.xml namespace. It gives you much finer granularity in dealing with XML documents, especially when it comes to namespaces.

The probability of survival is inversely proportional to the angle of arrival.

adirudik Mr or Mrs. 500 Points: 581 More actions · Answer 2

unfortunatly that's not an option for me. I'm still looking for a solution in SQL...

thx

adirudik Mr or Mrs. 500 Points: 581 More actions · Answer 3

Ok, found a kind of solution that helps me.

as part of solution 1 per the above, I have found a brilliant site with all the XML knowledge i needed

http://beyondrelational.com/blogs/jacob/archive/2008/06/26/xquery-labs-a-collection-of-xquery-sample-scripts.aspx

and the solution was to add the TEXT column herewith to the output i already had of the sections i needed and presto i have all the information required.

SET @xml = (SELECT mytest from TestTable Where ID = 2)

EXEC sp_xml_preparedocument @idoc OUTPUT, @xml

;WITH cte AS (

-- shreds XML to rowset

SELECT * FROM OPENXML(@idoc,'/',3)

), rcte AS (

-- anchor part of recursive query

SELECT 0 AS Level, id, parentid, nodetype, localname, prev,

CAST(text AS NVARCHAR(max)) AS Text,

CAST(localname AS NVARCHAR(100)) AS path,

CAST(id AS VARBINARY(128)) AS Sort

FROM cte WHERE id = 0

-- recursive part

UNION ALL

SELECT p.level + 1, c.id, c.parentid, c.nodetype, c.localname, c.prev,

CAST(c.text AS NVARCHAR(max)),

CAST(p.path + CASE WHEN c.nodetype = 3 THEN '' ELSE '\' END +

CASE WHEN c.nodetype = 2 THEN '@' ELSE '' END +

CASE WHEN c.nodetype = 3 THEN '' ELSE c.localname END AS NVARCHAR(100)),

CAST(p.Sort + CAST(c.id AS VARBINARY(4)) AS VARBINARY(128))

FROM cte c

INNER JOIN rcte p ON p.id = c.parentid)

-- the final query

insert into #XmlData (Path,text)

SELECT path, text FROM rcte

WHERE text IS NOT NULL

and path like 'worddocument\body\%\t'

ORDER BY sort